thoth.ingestion.parsers

Document parsers for multi-format ingestion.

This module provides a unified interface for parsing different document formats (Markdown, PDF, plain text, Word documents).

Example

>>> from thoth.ingestion.parsers import ParserFactory
>>> from pathlib import Path
>>>
>>> doc = ParserFactory.parse(Path("document.pdf"))
>>> print(doc.content)

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

`DocumentParser`()	Abstract base class for document parsers.
`DocxParser`()	Parser for Word documents using python-docx.
`MarkdownParser`()	Parser for Markdown files.
`PDFParser`()	Parser for PDF files using PyMuPDF.
`ParsedDocument`(content, metadata, ...)	Result of parsing a document.
`ParserFactory`()	Factory for creating and using document parsers.
`Path`(*args, **kwargs)	PurePath subclass that can make system calls.
`TextParser`()	Parser for plain text files.

class thoth.ingestion.parsers.DocumentParser

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

Example

>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)

abstract property supported_extensions: list[str]

Return list of supported file extensions.

Returns: List of extensions including the leading dot (e.g., ['.md', '.markdown'])

abstractmethod parse(file_path: Path) → ParsedDocument

Parse a document file and return structured content.

Parameters:

file_path – Path to the document file

Returns:

ParsedDocument with extracted text and metadata

Raises:

ValueError – If file format is not supported
FileNotFoundError – If file doesn’t exist
IOError – If file cannot be read

abstractmethod parse_content(content: bytes, source_path: str) → ParsedDocument

Parse document content from bytes.

This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.

Parameters:

content – Raw file content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata
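
A minimal sketch of how the bytes-based entry point can serve content fetched from cloud storage. The classes below are hypothetical stand-ins, not the library's code, and the idea that `parse` reads the file and delegates to `parse_content` is an assumption; the storage URI is also made up for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchDocument:
    """Hypothetical, reduced stand-in for ParsedDocument."""
    content: str
    source_path: str = ""
    format: str = ""


class SketchTextParser:
    """Hypothetical parser; not the library's implementation."""

    def parse_content(self, content: bytes, source_path: str) -> SketchDocument:
        # Works on bytes already in memory, e.g. downloaded from object storage.
        text = content.decode("utf-8")
        return SketchDocument(content=text, source_path=source_path, format="text")

    def parse(self, file_path: Path) -> SketchDocument:
        # Assumption: the path-based entry point reads bytes and delegates.
        return self.parse_content(file_path.read_bytes(), str(file_path))


blob = b"hello from a bucket"  # pretend this came from a storage client
doc = SketchTextParser().parse_content(blob, "gs://example-bucket/hello.txt")
```

Here `source_path` is only recorded on the result; nothing is read from it.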

can_parse(file_path: Path) → bool

Check if this parser can handle the given file.

Parameters: file_path – Path to check
Returns: True if this parser supports the file’s extension
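
A minimal sketch of how an extension-based check like `can_parse` might work. `SketchParser` and `SketchMarkdownParser` are hypothetical stand-ins for the real classes, and the case-insensitive comparison is an assumption; the documentation above does not say how case is handled.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchParser(ABC):
    """Hypothetical stand-in for DocumentParser, for illustration only."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]:
        """Extensions including the leading dot, e.g. ['.md']."""

    def can_parse(self, file_path: Path) -> bool:
        # Assumption: suffixes are compared case-insensitively.
        return file_path.suffix.lower() in self.supported_extensions


class SketchMarkdownParser(SketchParser):
    @property
    def supported_extensions(self) -> list[str]:
        return [".md", ".markdown", ".mdown"]


parser = SketchMarkdownParser()
md_ok = parser.can_parse(Path("notes.MD"))
pdf_ok = parser.can_parse(Path("report.pdf"))
```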

property name: str

Return the parser name.

Returns: Human-readable parser name

class thoth.ingestion.parsers.DocxParser

Bases: DocumentParser

Parser for Word documents using python-docx.

Supports:

- Word documents (.docx)
- Paragraph text extraction
- Basic metadata extraction (title, author)

Note

Only supports .docx format (Office Open XML). Legacy .doc files are not supported.

property supported_extensions: list[str]

Return supported Word document extensions.

parse(file_path: Path) → ParsedDocument

Parse a Word document.

Parameters:

file_path – Path to the Word document

Returns:

ParsedDocument with extracted text and metadata

Raises:

FileNotFoundError – If file doesn’t exist
ImportError – If python-docx is not installed

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse Word document content from bytes.

Parameters:

content – Raw document content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata

class thoth.ingestion.parsers.MarkdownParser

Bases: DocumentParser

Parser for Markdown files.

Supports:

- Standard Markdown (.md, .markdown, .mdown)
- YAML frontmatter extraction
- UTF-8 encoding

property supported_extensions: list[str]

Return supported Markdown extensions.

parse(file_path: Path) → ParsedDocument

Parse a Markdown file.

Parameters:

file_path – Path to the Markdown file

Returns:

ParsedDocument with content and metadata

Raises:

FileNotFoundError – If file doesn’t exist
UnicodeDecodeError – If file isn’t valid UTF-8

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse Markdown content from bytes.

Parameters:

content – Raw file content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with content and extracted metadata
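
A rough sketch of what frontmatter extraction from bytes could look like. The helper `split_frontmatter` is hypothetical and deliberately simplified: it handles only flat `key: value` lines using the standard library, whereas real YAML frontmatter would normally go through a YAML parser.

```python
def split_frontmatter(raw: bytes) -> tuple[dict[str, str], str]:
    """Split a leading '---' block from Markdown text (simplified)."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing delimiter after the opening line.
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat everything as body
    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'key: value' pairs
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return metadata, body


sample = b"---\ntitle: Notes\nauthor: Ada\n---\n# Heading\nBody text.\n"
meta, body = split_frontmatter(sample)
```

In this sketch the extracted pairs would feed the `metadata` field and the remaining text the `content` field of the result.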

class thoth.ingestion.parsers.PDFParser

Bases: DocumentParser

Parser for PDF files using PyMuPDF.

Supports:

- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)

property supported_extensions: list[str]

Return supported PDF extensions.

parse(file_path: Path) → ParsedDocument

Parse a PDF file.

Parameters:

file_path – Path to the PDF file

Returns:

ParsedDocument with extracted text and metadata

Raises:

FileNotFoundError – If file doesn’t exist
ImportError – If PyMuPDF is not installed

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse PDF content from bytes.

Parameters:

content – Raw PDF content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata

class thoth.ingestion.parsers.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')

Bases: object

Result of parsing a document.

content

Extracted text content from the document

Type: str

metadata

Dictionary of metadata extracted from the document

Type: dict[str, Any]

source_path

Original file path or identifier

Type: str

format

Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

Type: str

content: str

metadata: dict[str, Any]

source_path: str = ''

format: str = ''

__post_init__() → None

Validate parsed document after initialization.

__init__(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '') → None
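
The `<factory>` default in the signature suggests a dataclass field built with `default_factory`, so each instance gets its own `metadata` dictionary. The stand-in below mirrors the documented fields to show that behavior. The rule checked in `__post_init__` is an assumption made for illustration; the actual validation is not described here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchParsedDocument:
    """Hypothetical stand-in mirroring the documented fields."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Assumed validation rule; the real checks are not documented here.
        if not isinstance(self.content, str):
            raise TypeError("content must be a str")


first = SketchParsedDocument(content="a")
second = SketchParsedDocument(content="b")
first.metadata["pages"] = 3  # does not leak into `second`
```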

class thoth.ingestion.parsers.ParserFactory

Bases: object

Factory for creating and using document parsers.

This factory maintains a registry of available parsers and provides methods to parse files using the appropriate parser based on file extension.

Example

>>> # Parse a single file
>>> doc = ParserFactory.parse(Path("notes.md"))
>>>
>>> # Get parser for a specific file
>>> parser = ParserFactory.get_parser(Path("document.pdf"))
>>> if parser:
...     doc = parser.parse(Path("document.pdf"))
>>>
>>> # Check supported extensions
>>> extensions = ParserFactory.supported_extensions()
>>> print(extensions)  # ['.md', '.markdown', '.mdown', '.pdf', '.txt', ...]

classmethod get_parser(file_path: Path) → DocumentParser | None

Get appropriate parser for a file.

Parameters: file_path – Path to the file to parse
Returns: DocumentParser instance if a suitable parser exists, None otherwise

classmethod parse(file_path: Path) → ParsedDocument

Parse a file using the appropriate parser.

Parameters:

file_path – Path to the file to parse

Returns:

ParsedDocument with extracted content and metadata

Raises:

ValueError – If no parser is available for the file type
FileNotFoundError – If file doesn’t exist

classmethod parse_content(content: bytes, source_path: str, extension: str) → ParsedDocument

Parse content bytes using a parser for the given extension.

Parameters:

content – Raw file content as bytes
source_path – Original source path for metadata
extension – File extension including the dot (e.g., '.pdf')

Returns:

ParsedDocument with extracted content and metadata

Raises:

ValueError – If no parser is available for the extension
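
Because this method takes the extension as a separate argument, callers holding only an object key or URI need to derive it first. One possible way, sketched with the standard library only; `extension_for` is a hypothetical helper, and lowercasing the suffix is an assumption about what the factory expects.

```python
from pathlib import PurePosixPath


def extension_for(source_path: str) -> str:
    """Return the lowercased suffix of a path-like key, including the dot."""
    return PurePosixPath(source_path).suffix.lower()


ext = extension_for("reports/2024/Summary.PDF")
no_ext = extension_for("README")
```

A key without a suffix yields an empty string, which a caller would need to handle before asking for a parser.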

classmethod supported_extensions() → list[str]

Get all supported file extensions.

Returns: List of supported extensions including the dot (e.g., ['.md', '.pdf'])

classmethod can_parse(file_path: Path) → bool

Check if any parser can handle the given file.

Parameters: file_path – Path to check
Returns: True if a parser is available for the file

classmethod register_parser(parser_class: type[DocumentParser]) → None

Register a parser class with the factory.

Parameters: parser_class – Parser class to register
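
A minimal sketch of how a registry keyed by extension could back `register_parser`, `get_parser`, and `supported_extensions`. `SketchFactory` and `SketchRstParser` are hypothetical names, and the internal dictionary is an assumed design, not the factory's documented internals.

```python
from pathlib import Path


class SketchFactory:
    """Hypothetical registry keyed by file extension."""

    _parsers: dict[str, type] = {}

    @classmethod
    def register_parser(cls, parser_class: type) -> None:
        # Instantiate once to read the extensions the parser declares.
        for ext in parser_class().supported_extensions:
            cls._parsers[ext.lower()] = parser_class

    @classmethod
    def get_parser(cls, file_path: Path):
        # Return a fresh parser instance, or None for unknown extensions.
        parser_class = cls._parsers.get(file_path.suffix.lower())
        return parser_class() if parser_class else None

    @classmethod
    def supported_extensions(cls) -> list[str]:
        return sorted(cls._parsers)


class SketchRstParser:
    """Hypothetical custom parser used only for this example."""
    supported_extensions = [".rst"]


SketchFactory.register_parser(SketchRstParser)
found = SketchFactory.get_parser(Path("guide.rst"))
missing = SketchFactory.get_parser(Path("image.png"))
```

Registering by class rather than by instance keeps the registry stateless; each lookup in this sketch builds a new parser.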

class thoth.ingestion.parsers.TextParser

Bases: DocumentParser

Parser for plain text files.

Supports:

- Plain text files (.txt, .text)
- UTF-8 encoding with fallback to latin-1

property supported_extensions: list[str]

Return supported text extensions.

parse(file_path: Path) → ParsedDocument

Parse a plain text file.

Parameters: file_path – Path to the text file
Returns: ParsedDocument with content
Raises: FileNotFoundError – If file doesn’t exist

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse text content from bytes.

Parameters:

content – Raw file content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with content
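
The UTF-8 with latin-1 fallback mentioned for this parser can be sketched as follows. `decode_text` is a hypothetical helper, not part of the module; returning the encoding alongside the text is an illustrative choice.

```python
def decode_text(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 before latin-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


utf8_text, utf8_enc = decode_text("caf\u00e9".encode("utf-8"))
latin_text, latin_enc = decode_text("caf\u00e9".encode("latin-1"))
```

Since latin-1 never raises, the fallback always produces some text, though bytes from other encodings may decode to the wrong characters.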

Modules

`base`	Base classes for document parsers.
`docx`	Word document parser.
`markdown`	Markdown document parser.
`pdf`	PDF document parser.
`text`	Plain text document parser.