Base
thoth.ingestion.parsers.base
Base classes for document parsers.
This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.
ParsedDocument
dataclass
Result of parsing a document.
Attributes:
| Name | Type | Description |
|---|---|---|
| content | str | Extracted text content from the document |
| metadata | dict[str, Any] | Dictionary of metadata extracted from the document |
| source_path | str | Original file path or identifier |
| format | str | Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx') |
content: str
instance-attribute
metadata: dict[str, Any] = field(default_factory=dict)
class-attribute
instance-attribute
source_path: str = ''
class-attribute
instance-attribute
format: str = ''
class-attribute
instance-attribute
__init__(content: str, metadata: dict[str, Any] = dict(), source_path: str = '', format: str = '') -> None
__post_init__() -> None
Validate parsed document after initialization.
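The field list and `__post_init__` hook above can be illustrated with a small stand-in dataclass. This is a sketch rather than the thoth source: the fields and defaults mirror the signature shown, but the validation rule (rejecting non-string content) is an assumption, since the actual checks are not documented here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in mirroring the documented fields; not the thoth source."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Hypothetical check: the real validation rules are not documented.
        if not isinstance(self.content, str):
            raise TypeError("content must be a string")


doc = ParsedDocument(
    content="# Title\n\nBody text.",
    source_path="doc.md",
    format="markdown",
)
other = ParsedDocument(content="plain text")

# default_factory gives every instance its own metadata dict,
# so mutating one document does not affect another.
doc.metadata["title"] = "Title"
```

The `dict()` default shown in the rendered `__init__` signature corresponds to `field(default_factory=dict)`, so instances do not share one mutable dictionary.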
DocumentParser
Abstract base class for document parsers.
All document parsers must implement this interface to ensure consistent behavior across different file formats.
Example
>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)
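To show what implementing the interface might involve, here is a minimal sketch of a concrete parser. The `DocumentParser` and `ParsedDocument` classes below are stand-ins that copy the documented signatures; the method bodies, the default `name`, and the `PlainTextParser` subclass are all assumptions made for illustration.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):
    """Stand-in with the documented members; bodies are assumptions."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @property
    def name(self) -> str:
        # Assumed default: the class name.
        return type(self).__name__

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...

    def can_parse(self, file_path: Path) -> bool:
        # Assumed behavior: case-insensitive suffix match.
        return file_path.suffix.lower() in self.supported_extensions


class PlainTextParser(DocumentParser):
    """Hypothetical subclass used only for illustration."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, file_path: Path) -> ParsedDocument:
        if not self.can_parse(file_path):
            raise ValueError(f"Unsupported file format: {file_path.suffix}")
        # read_bytes raises FileNotFoundError or OSError on failure.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        return ParsedDocument(
            content=content.decode("utf-8"),
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_text("hello", encoding="utf-8")
    doc = PlainTextParser().parse(path)
```

Routing `parse` through `parse_content` keeps the decoding logic in one place, so file-based and in-memory inputs produce the same result.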
supported_extensions: list[str]
abstractmethod
property
Return list of supported file extensions.
Returns:
| Type | Description |
|---|---|
| list[str] | List of extensions including the dot (e.g., ['.md', '.markdown']) |
name: str
property
Return the parser name.
Returns:
| Type | Description |
|---|---|
| str | Human-readable parser name |
parse(file_path: Path) -> ParsedDocument
abstractmethod
Parse a document file and return structured content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the document file | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with extracted text and metadata |
Raises:
| Type | Description |
|---|---|
| ValueError | If file format is not supported |
| FileNotFoundError | If file doesn't exist |
| IOError | If file cannot be read |
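A caller might handle the three documented exceptions as in the sketch below. `_DemoParser` and `try_parse` are hypothetical names, and the stand-in returns a plain string rather than a `ParsedDocument` to keep the example short.

```python
from pathlib import Path
from typing import Optional


class _DemoParser:
    """Hypothetical stand-in that raises the documented exceptions."""

    def parse(self, file_path: Path) -> str:
        if file_path.suffix.lower() != ".txt":
            raise ValueError(f"Unsupported file format: {file_path.suffix!r}")
        # Path.read_text raises FileNotFoundError for a missing file and
        # OSError (IOError is an alias in Python 3) for other read failures.
        return file_path.read_text(encoding="utf-8")


def try_parse(parser, file_path: Path) -> tuple[Optional[str], str]:
    """Return (result, status) instead of letting parse errors propagate."""
    try:
        return parser.parse(file_path), "ok"
    except ValueError:
        return None, "unsupported"
    except FileNotFoundError:
        # Must precede OSError: FileNotFoundError is a subclass of it.
        return None, "missing"
    except OSError:
        return None, "unreadable"


unsupported = try_parse(_DemoParser(), Path("report.pdf"))
missing = try_parse(_DemoParser(), Path("no_such_file_for_this_demo.txt"))
```

The order of the `except` clauses matters: because `FileNotFoundError` derives from `OSError`, catching `OSError` first would hide the more specific case.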
parse_content(content: bytes, source_path: str) -> ParsedDocument
abstractmethod
Parse document content from bytes.
This method parses content that is already loaded in memory, which is useful when processing files fetched from cloud storage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | bytes | Raw file content as bytes | required |
| source_path | str | Original source path for metadata | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with extracted text and metadata |
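The in-memory path can be sketched as follows. `InMemoryTextParser`, the UTF-8 decoding policy, the `size_bytes` metadata key, and the `gs://example-bucket/...` identifier are all assumptions made for illustration; only the `parse_content(content, source_path)` signature comes from the reference above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class InMemoryTextParser:
    """Hypothetical parser showing only the parse_content half of the interface."""

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        # Decoding as UTF-8 with replacement is an assumption of this sketch.
        text = content.decode("utf-8", errors="replace")
        return ParsedDocument(
            content=text,
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


# Bytes as they might arrive from an object-store download;
# no file on local disk is involved.
payload = "Quarterly report\n".encode("utf-8")
doc = InMemoryTextParser().parse_content(
    payload, "gs://example-bucket/reports/q1.txt"
)
```

Because `source_path` is a plain string rather than a `Path`, it can carry a remote identifier that does not exist on the local filesystem.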
can_parse(file_path: Path) -> bool
Check if this parser can handle the given file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to check | required |
Returns:
| Type | Description |
|---|---|
| bool | True if this parser supports the file's extension |
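One plausible use of `can_parse` is picking a parser for a given path. In this sketch the `ExtensionParser` class and `find_parser` helper are hypothetical, and the case-insensitive suffix comparison is an assumption about how the check might be implemented.

```python
from pathlib import Path
from typing import Optional


class ExtensionParser:
    """Hypothetical stand-in exposing supported_extensions and can_parse."""

    def __init__(self, name: str, extensions: list[str]) -> None:
        self.name = name
        self.supported_extensions = extensions

    def can_parse(self, file_path: Path) -> bool:
        # Assumed behavior: case-insensitive match on the final suffix.
        return file_path.suffix.lower() in self.supported_extensions


def find_parser(
    parsers: list[ExtensionParser], file_path: Path
) -> Optional[ExtensionParser]:
    """Return the first parser that accepts the path, or None."""
    return next((p for p in parsers if p.can_parse(file_path)), None)


parsers = [
    ExtensionParser("markdown", [".md", ".markdown"]),
    ExtensionParser("text", [".txt"]),
]

chosen = find_parser(parsers, Path("notes/README.MD"))
unknown = find_parser(parsers, Path("scan.pdf"))
```

Note that `Path.suffix` returns only the last extension, so a name such as `archive.tar.gz` is matched on `.gz`.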