thoth.ingestion.parsers
Document parsers for multi-format ingestion.
This module provides a unified interface for parsing different document formats (Markdown, PDF, plain text, Word documents).
Example
>>> from thoth.ingestion.parsers import ParserFactory
>>> from pathlib import Path
>>>
>>> doc = ParserFactory.parse(Path("document.pdf"))
>>> print(doc.content)
Functions
Create and configure a logger with structured JSON output.
Classes
DocumentParser | Abstract base class for document parsers.
DocxParser | Parser for Word documents using python-docx.
MarkdownParser | Parser for Markdown files.
PDFParser | Parser for PDF files using PyMuPDF.
ParsedDocument | Result of parsing a document.
ParserFactory | Factory for creating and using document parsers.
Path | PurePath subclass that can make system calls.
TextParser | Parser for plain text files.
- class thoth.ingestion.parsers.DocumentParser
Bases: ABC
Abstract base class for document parsers.
All document parsers must implement this interface to ensure consistent behavior across different file formats.
Example
>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)
- abstract property supported_extensions: list[str]
Return list of supported file extensions.
- Returns:
List of extensions including the dot (e.g., ['.md', '.markdown'])
- abstractmethod parse(file_path: Path) → ParsedDocument
Parse a document file and return structured content.
- Parameters:
file_path – Path to the document file
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
ValueError – If file format is not supported
FileNotFoundError – If file doesn’t exist
IOError – If file cannot be read
- abstractmethod parse_content(content: bytes, source_path: str) → ParsedDocument
Parse document content from bytes.
This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.
- Parameters:
content – Raw file content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata
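As an illustration of the interface described above, the following sketch implements a hypothetical subclass. So that it runs without thoth installed, `DocumentParser` and `ParsedDocument` are local stand-ins that only mirror the documented signatures; `LogParser`, the `.log` extension, the `line_count` metadata key, and the example path are all assumptions made for the example and are not part of the library.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in mirroring the documented ParsedDocument fields."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):
    """Stand-in mirroring the documented abstract interface."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...


class LogParser(DocumentParser):
    """Hypothetical parser for .log files (not part of thoth)."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".log"]

    def parse(self, file_path: Path) -> ParsedDocument:
        # read_bytes() raises FileNotFoundError if the file is missing.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        text = content.decode("utf-8")
        return ParsedDocument(
            content=text,
            metadata={"line_count": len(text.splitlines())},
            source_path=source_path,
            format="log",
        )


# Bytes already in memory, e.g. fetched from cloud storage (path is made up).
doc = LogParser().parse_content(b"start\nstop\n", "gs://bucket/app.log")
print(doc.format, doc.metadata["line_count"])
```

Routing `parse` through `parse_content` keeps the file-based and bytes-based paths consistent, which is one plausible way to satisfy both abstract methods.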
- class thoth.ingestion.parsers.DocxParser
Bases: DocumentParser
Parser for Word documents using python-docx.
Supports:
- Word documents (.docx)
- Paragraph text extraction
- Basic metadata extraction (title, author)
Note
Only supports .docx format (Office Open XML). Legacy .doc files are not supported.
- parse(file_path: Path) → ParsedDocument
Parse a Word document.
- Parameters:
file_path – Path to the Word document
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ImportError – If python-docx is not installed
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse Word document content from bytes.
- Parameters:
content – Raw document content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata
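Because DocxParser accepts only .docx (Office Open XML) and not legacy .doc, it can help to check raw bytes before handing them to `parse_content`. The sketch below is not part of thoth; `sniff_container` is a hypothetical helper that relies on two well-known file signatures: .docx files are ZIP archives, while legacy .doc files are OLE2 compound files.

```python
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (container used by .docx)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy .doc)


def sniff_container(content: bytes) -> str:
    """Return 'zip', 'ole', or 'unknown' based on the leading bytes.

    'zip' only means the bytes could be a .docx; other ZIP-based
    formats (.xlsx, .pptx, plain archives) share the same signature.
    """
    if content.startswith(ZIP_MAGIC):
        return "zip"
    if content.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"


print(sniff_container(b"PK\x03\x04rest-of-archive"))
print(sniff_container(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1legacy"))
```

A caller could skip or convert inputs reported as `'ole'` rather than passing them to a parser that does not support them.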
- class thoth.ingestion.parsers.MarkdownParser
Bases: DocumentParser
Parser for Markdown files.
Supports:
- Standard Markdown (.md, .markdown, .mdown)
- YAML frontmatter extraction
- UTF-8 encoding
- parse(file_path: Path) → ParsedDocument
Parse a Markdown file.
- Parameters:
file_path – Path to the Markdown file
- Returns:
ParsedDocument with content and metadata
- Raises:
FileNotFoundError – If file doesn’t exist
UnicodeDecodeError – If file isn’t valid UTF-8
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse Markdown content from bytes.
- Parameters:
content – Raw file content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with content and extracted metadata
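To make the idea of frontmatter extraction concrete, here is a minimal sketch of how a leading `---` block might be separated from the Markdown body. It is not MarkdownParser's actual implementation: `split_frontmatter` is a hypothetical helper, and it handles only flat `key: value` lines instead of using a real YAML parser.

```python
from __future__ import annotations


def split_frontmatter(raw: bytes) -> tuple[dict[str, str], str]:
    """Split UTF-8 Markdown bytes into (metadata, body).

    Assumption: frontmatter is a block delimited by '---' lines at the
    very top of the file, containing only flat 'key: value' pairs.
    """
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter present

    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # opening delimiter without a closing one

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that are not key: value pairs
            metadata[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:])
    return metadata, body


meta, body = split_frontmatter(b"---\ntitle: Notes\nauthor: Ada\n---\n# Heading\nBody\n")
print(meta)
print(body)
```

Nested mappings, lists, and quoted values would need a proper YAML library; this sketch deliberately leaves them out.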
- class thoth.ingestion.parsers.PDFParser
Bases: DocumentParser
Parser for PDF files using PyMuPDF.
Supports:
- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)
- parse(file_path: Path) → ParsedDocument
Parse a PDF file.
- Parameters:
file_path – Path to the PDF file
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ImportError – If PyMuPDF is not installed
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse PDF content from bytes.
- Parameters:
content – Raw PDF content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata
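For orientation, the sketch below shows one way text and basic metadata could be pulled from in-memory PDF bytes with PyMuPDF. It is an assumption-laden illustration, not PDFParser's actual code: `extract_pdf` is a hypothetical helper, the `[Page N]` marker format is invented for the example, and the import is deferred so that a missing dependency surfaces as the documented ImportError.

```python
from __future__ import annotations

from typing import Any


def extract_pdf(content: bytes) -> tuple[str, dict[str, Any]]:
    """Return (text, metadata) for PDF bytes using PyMuPDF."""
    try:
        import fitz  # PyMuPDF's classic import name
    except ImportError as exc:
        raise ImportError("PyMuPDF is required to parse PDF files") from exc

    # Open from memory rather than from a file path.
    with fitz.open(stream=content, filetype="pdf") as pdf:
        pages = [
            f"[Page {number}]\n{page.get_text()}"  # marker format is an assumption
            for number, page in enumerate(pdf, start=1)
        ]
        info = pdf.metadata or {}
        metadata = {
            "title": info.get("title", ""),
            "author": info.get("author", ""),
            "page_count": pdf.page_count,
        }
    return "\n\n".join(pages), metadata
```

Calling `extract_pdf(Path("document.pdf").read_bytes())` would exercise it on a local file, assuming PyMuPDF is installed and the file exists.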
- class thoth.ingestion.parsers.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')
Bases: object
Result of parsing a document.
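The `<factory>` default in the signature suggests a dataclass field built with `default_factory`. The stand-in below mirrors the documented fields to show what that implies in practice; it is a local sketch and not the class shipped by thoth.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Local stand-in with the same fields as the documented signature."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


first = ParsedDocument("alpha")
second = ParsedDocument("beta", source_path="notes.md", format="markdown")

# Each instance gets its own metadata dict, so mutating one
# does not leak into the other.
first.metadata["title"] = "Alpha"
print(first.metadata, second.metadata)
```

Only `content` is required; the remaining fields fall back to an empty dict or empty strings.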
- class thoth.ingestion.parsers.ParserFactory
Bases: object
Factory for creating and using document parsers.
This factory maintains a registry of available parsers and provides methods to parse files using the appropriate parser based on file extension.
Example
>>> # Parse a single file
>>> doc = ParserFactory.parse(Path("notes.md"))
>>>
>>> # Get parser for a specific file
>>> parser = ParserFactory.get_parser(Path("document.pdf"))
>>> if parser:
...     doc = parser.parse(Path("document.pdf"))
>>>
>>> # Check supported extensions
>>> extensions = ParserFactory.supported_extensions()
>>> print(extensions)  # ['.md', '.markdown', '.mdown', '.pdf', '.txt', ...]
- classmethod get_parser(file_path: Path) → DocumentParser | None
Get appropriate parser for a file.
- Parameters:
file_path – Path to the file to parse
- Returns:
DocumentParser instance if a suitable parser exists, None otherwise
- classmethod parse(file_path: Path) → ParsedDocument
Parse a file using the appropriate parser.
- Parameters:
file_path – Path to the file to parse
- Returns:
ParsedDocument with extracted content and metadata
- Raises:
ValueError – If no parser is available for the file type
FileNotFoundError – If file doesn’t exist
- classmethod parse_content(content: bytes, source_path: str, extension: str) → ParsedDocument
Parse content bytes using a parser for the given extension.
- Parameters:
content – Raw file content as bytes
source_path – Original source path for metadata
extension – File extension (e.g., '.pdf')
- Returns:
ParsedDocument with extracted content and metadata
- Raises:
ValueError – If no parser is available for the extension
- classmethod supported_extensions() → list[str]
Get all supported file extensions.
- Returns:
List of supported extensions including the dot (e.g., ['.md', '.pdf'])
- classmethod can_parse(file_path: Path) → bool
Check if any parser can handle the given file.
- Parameters:
file_path – Path to check
- Returns:
True if a parser is available for the file
- classmethod register_parser(parser_class: type[DocumentParser]) → None
Register a new parser class.
- Parameters:
parser_class – Parser class to register
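The registry behaviour described for ParserFactory can be sketched as an extension-keyed lookup table. The code below is a self-contained approximation, not the factory's real implementation: `Registry` and `UpperParser` are hypothetical names, the `.up` extension is made up, and case-insensitive matching of extensions is an assumption.

```python
from __future__ import annotations

from pathlib import Path


class UpperParser:
    """Hypothetical parser used only to exercise the registry."""

    supported_extensions = [".up"]

    def parse_content(self, content: bytes, source_path: str) -> str:
        return content.decode("utf-8").upper()


class Registry:
    """Stand-in for a factory that dispatches on file extension."""

    _parsers: dict[str, type] = {}

    @classmethod
    def register_parser(cls, parser_class: type) -> None:
        # Map every extension the parser declares to its class.
        for ext in parser_class().supported_extensions:
            cls._parsers[ext.lower()] = parser_class

    @classmethod
    def get_parser(cls, file_path: Path):
        parser_class = cls._parsers.get(file_path.suffix.lower())
        return parser_class() if parser_class else None

    @classmethod
    def can_parse(cls, file_path: Path) -> bool:
        return file_path.suffix.lower() in cls._parsers

    @classmethod
    def supported_extensions(cls) -> list[str]:
        return sorted(cls._parsers)

    @classmethod
    def parse_content(cls, content: bytes, source_path: str, extension: str):
        parser_class = cls._parsers.get(extension.lower())
        if parser_class is None:
            raise ValueError(f"No parser available for extension {extension!r}")
        return parser_class().parse_content(content, source_path)


Registry.register_parser(UpperParser)
print(Registry.supported_extensions())
print(Registry.parse_content(b"hello", "memory://greeting.up", ".up"))
```

Registering a class for an extension that is already present simply replaces the earlier entry in this sketch; whether the real factory does the same is not stated in this reference.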
- class thoth.ingestion.parsers.TextParser
Bases: DocumentParser
Parser for plain text files.
Supports:
- Plain text files (.txt, .text)
- UTF-8 encoding with fallback to latin-1
- parse(file_path: Path) → ParsedDocument
Parse a plain text file.
- Parameters:
file_path – Path to the text file
- Returns:
ParsedDocument with content
- Raises:
FileNotFoundError – If file doesn’t exist
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse text content from bytes.
- Parameters:
content – Raw file content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with content
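The "UTF-8 with fallback to latin-1" behaviour listed for TextParser can be illustrated with a small decoding helper. This is a sketch of the general technique under that assumption; `decode_text` is a hypothetical name and the real parser may differ in detail.

```python
from __future__ import annotations


def decode_text(content: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to latin-1.

    Returns (text, encoding_used).
    """
    try:
        return content.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this
        # decode cannot fail, though the result may not be what the
        # author intended if the file used some other encoding.
        return content.decode("latin-1"), "latin-1"


print(decode_text("café".encode("utf-8")))
print(decode_text(b"caf\xe9"))
```

The fallback guarantees that plain text ingestion never fails on encoding grounds, at the cost of possible mojibake for files in other encodings.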