thoth.ingestion.parsers.base
Base classes for document parsers.
This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.
Functions
abstractmethod | A decorator indicating abstract methods.
dataclass | Add dunder methods based on the fields defined in the class.
field | Return an object to identify dataclass fields.
Classes
ABC | Helper class that provides a standard way to create an ABC using inheritance.
Any | Special type indicating an unconstrained type.
DocumentParser | Abstract base class for document parsers.
ParsedDocument | Result of parsing a document.
Path | PurePath subclass that can make system calls.
- class thoth.ingestion.parsers.base.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')
Bases: object
Result of parsing a document.
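The signature above suggests a dataclass with one required field and three defaulted ones. The following is a minimal stand-in reconstructed from that signature, not the library's source; the use of `dict` as the factory behind `<factory>` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in rebuilt from the documented signature (not the real class)."""

    content: str
    # Assumption: "<factory>" means default_factory=dict, so every
    # instance gets its own metadata dictionary.
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


doc = ParsedDocument(
    content="# Title\n\nBody text.",
    source_path="docs/intro.md",
    format="markdown",
)
doc.metadata["title"] = "Title"

# A second instance does not share the first instance's metadata.
other = ParsedDocument(content="")
```

Only `content` is required; a factory default for `metadata` avoids the shared-mutable-default problem that a literal `{}` would cause.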
- class thoth.ingestion.parsers.base.DocumentParser
Bases: ABC
Abstract base class for document parsers.
All document parsers must implement this interface to ensure consistent behavior across different file formats.
Example
>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)
- abstract property supported_extensions: list[str]
Return list of supported file extensions.
- Returns:
List of extensions including the dot (e.g., ['.md', '.markdown'])
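The class example calls `can_parse`, which is not documented on this page. One plausible way such a check could use `supported_extensions` is a case-insensitive suffix comparison. The helper name `matches_extension` below is hypothetical and is not part of the library.

```python
from pathlib import Path


def matches_extension(file_path: Path, supported_extensions: list[str]) -> bool:
    """Hypothetical helper: True if the path's final suffix is supported.

    Extensions are expected to include the leading dot, as documented.
    """
    supported = {ext.lower() for ext in supported_extensions}
    return file_path.suffix.lower() in supported


MARKDOWN_EXTENSIONS = [".md", ".markdown"]

is_markdown = matches_extension(Path("README.MD"), MARKDOWN_EXTENSIONS)
is_text = matches_extension(Path("notes.txt"), MARKDOWN_EXTENSIONS)
```

`Path.suffix` returns only the last extension, so a name such as `archive.tar.md` is matched on `.md`, and a file with no extension yields an empty string that never matches.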
- abstractmethod parse(file_path: Path) → ParsedDocument
Parse a document file and return structured content.
- Parameters:
file_path – Path to the document file
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
ValueError – If file format is not supported
FileNotFoundError – If file doesn’t exist
IOError – If file cannot be read
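The three documented exceptions could be raised by a pre-parse step along the following lines. This is a sketch only: the function name `check_parseable` and the order of the checks are assumptions, and a real parser may validate differently.

```python
import tempfile
from pathlib import Path


def check_parseable(file_path: Path, supported_extensions: list[str]) -> bytes:
    """Hypothetical pre-parse step that raises the documented exceptions.

    Assumes supported_extensions are lower-case and include the dot.
    """
    if file_path.suffix.lower() not in supported_extensions:
        raise ValueError(f"Unsupported file format: {file_path.suffix!r}")
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    # read_bytes raises OSError on read failures; IOError is an alias
    # of OSError in Python 3.
    return file_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "doc.md"
    sample.write_bytes(b"# Hello")
    data = check_parseable(sample, [".md", ".markdown"])
```

Because `IOError` is an alias of `OSError` and `FileNotFoundError` is a subclass of it, callers that need to tell the two apart should catch `FileNotFoundError` first.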
- abstractmethod parse_content(content: bytes, source_path: str) → ParsedDocument
Parse document content from bytes.
Use this method to parse content that is already loaded in memory, for example files fetched from cloud storage.
- Parameters:
content – Raw file content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata
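To show how the interface fits together, here is a sketch of a concrete parser in which `parse` delegates to `parse_content`. `ParsedDocument` and `DocumentParser` are local stand-ins rebuilt from the signatures on this page so the example runs without the library. `PlainTextParser`, the `size_bytes` metadata key, the `"text"` format label, and the bucket path are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:  # stand-in for the documented class
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):  # stand-in for the documented interface
    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...


class PlainTextParser(DocumentParser):
    """Hypothetical parser for UTF-8 text files."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, file_path: Path) -> ParsedDocument:
        # Read from disk, then reuse the in-memory code path.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        # Undecodable bytes are replaced rather than raising an error.
        text = content.decode("utf-8", errors="replace")
        return ParsedDocument(
            content=text,
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


# Bytes as they might arrive from an object store; no file on disk is needed.
blob = "héllo wörld\n".encode("utf-8")
doc = PlainTextParser().parse_content(blob, "gs://example-bucket/notes.txt")
```

Routing `parse` through `parse_content` keeps a single parsing path, so local files and downloaded bytes produce the same `ParsedDocument` shape.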