Base

`thoth.ingestion.parsers.base` ¶

Base classes for document parsers.

This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.

`ParsedDocument` `dataclass` ¶

Result of parsing a document.

Attributes:

Name	Type	Description
`content`	`str`	Extracted text content from the document
`metadata`	`dict[str, Any]`	Dictionary of metadata extracted from the document
`source_path`	`str`	Original file path or identifier
`format`	`str`	Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

`content: str` `instance-attribute` ¶

`metadata: dict[str, Any] = field(default_factory=dict)` `class-attribute` `instance-attribute` ¶

`source_path: str = ''` `class-attribute` `instance-attribute` ¶

`format: str = ''` `class-attribute` `instance-attribute` ¶

`__init__(content: str, metadata: dict[str, Any] = dict(), source_path: str = '', format: str = '') -> None` ¶

`__post_init__() -> None` ¶

Validate parsed document after initialization.
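As a rough illustration of the structure above, the following sketch mirrors the documented fields in a local stand-in dataclass. It is not the thoth source: the check inside `__post_init__` is an assumption made for illustration, since this page does not say what the real validation covers, and the sample path and metadata key are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Local stand-in that mirrors the documented fields (not the thoth source)."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Assumed check, for illustration only; the real validation rules
        # are not described on this page.
        if not isinstance(self.content, str):
            raise TypeError("content must be a str")


# Only `content` is required; the other fields fall back to their defaults.
doc = ParsedDocument(content="# Title", source_path="notes/doc.md", format="markdown")
other = ParsedDocument(content="plain text")

# default_factory gives every instance its own metadata dict, so this
# assignment does not leak into `other`.
doc.metadata["title"] = "Title"
```

Using `field(default_factory=dict)` rather than a shared mutable default is what keeps `doc.metadata` and `other.metadata` independent.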

`DocumentParser` ¶

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

Example

>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)

`supported_extensions: list[str]` `abstractmethod` `property` ¶

Return list of supported file extensions.

Returns:

Type	Description
`list[str]`	List of extensions including the dot (e.g., ['.md', '.markdown'])

`name: str` `property` ¶

Return the parser name.

Returns:

Type	Description
`str`	Human-readable parser name

`parse(file_path: Path) -> ParsedDocument` `abstractmethod` ¶

Parse a document file and return structured content.

Parameters:

Name	Type	Description	Default
`file_path`	`Path`	Path to the document file	required

Returns:

Type	Description
`ParsedDocument`	ParsedDocument with extracted text and metadata

Raises:

Type	Description
`ValueError`	If file format is not supported
`FileNotFoundError`	If file doesn't exist
`IOError`	If file cannot be read
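One way a plain-text implementation could honor the documented contract is sketched below. The function name `parse_text_file`, the `SUPPORTED_EXTENSIONS` list, the `char_count` metadata key, and the order of the checks are all assumptions; `ParsedDocument` is a local stand-in so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    """Local stand-in for the documented dataclass."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


# Hypothetical extension list for this sketch.
SUPPORTED_EXTENSIONS = [".txt"]


def parse_text_file(file_path: Path) -> ParsedDocument:
    """Parse a UTF-8 text file, raising the exceptions listed in the table above."""
    if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file format: {file_path.suffix!r}")
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
    # read_text raises OSError if the file cannot be read; in Python 3,
    # IOError is an alias of OSError.
    text = file_path.read_text(encoding="utf-8")
    return ParsedDocument(
        content=text,
        metadata={"char_count": len(text)},  # illustrative metadata key
        source_path=str(file_path),
        format="text",
    )
```

Checking the extension before touching the filesystem means an unsupported path fails with `ValueError` even when the file is also missing; a real parser may order these checks differently.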

`parse_content(content: bytes, source_path: str) -> ParsedDocument` `abstractmethod` ¶

Parse document content from bytes.

This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.

Parameters:

Name	Type	Description	Default
`content`	`bytes`	Raw file content as bytes	required
`source_path`	`str`	Original source path for metadata	required

Returns:

Type	Description
`ParsedDocument`	ParsedDocument with extracted text and metadata
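For content that is already in memory, a minimal sketch might look like the following. The helper name `parse_text_content`, the UTF-8 decoding with replacement, the `size_bytes` metadata key, and the bucket URI are hypothetical; `ParsedDocument` is again a local stand-in.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Local stand-in for the documented dataclass."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


def parse_text_content(content: bytes, source_path: str) -> ParsedDocument:
    """Build a ParsedDocument from raw bytes without touching the filesystem."""
    # Assumption: treat the bytes as UTF-8 and substitute U+FFFD for
    # undecodable sequences instead of raising.
    text = content.decode("utf-8", errors="replace")
    return ParsedDocument(
        content=text,
        metadata={"size_bytes": len(content)},  # illustrative metadata key
        source_path=source_path,  # recorded for provenance only, never opened
        format="text",
    )


# Bytes as they might arrive from an object-store download (hypothetical URI).
doc = parse_text_content(b"hello from a bucket", "gs://example-bucket/notes.txt")
```

Because `source_path` is only stored and never opened, it can be any identifier, such as an object-store URI, rather than a local path.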

`can_parse(file_path: Path) -> bool` ¶

Check if this parser can handle the given file.

Parameters:

Name	Type	Description	Default
`file_path`	`Path`	Path to check	required

Returns:

Type	Description
`bool`	True if this parser supports the file's extension
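The relationship between `supported_extensions`, `name`, and `can_parse` can be sketched with a trimmed-down abstract class. `ExtensionMatcher` and `MarkdownMatcher` are hypothetical names; the case-insensitive suffix comparison and the class-name default for `name` are assumptions rather than confirmed behavior of `DocumentParser`.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ExtensionMatcher(ABC):
    """Hypothetical stand-in for DocumentParser, reduced to the extension logic."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]:
        """Extensions including the leading dot, e.g. ['.md', '.markdown']."""

    @property
    def name(self) -> str:
        # Assumed default: fall back to the class name.
        return self.__class__.__name__

    def can_parse(self, file_path: Path) -> bool:
        # Assumption: compare suffixes case-insensitively.
        return file_path.suffix.lower() in self.supported_extensions


class MarkdownMatcher(ExtensionMatcher):
    @property
    def supported_extensions(self) -> list[str]:
        return [".md", ".markdown"]


matcher = MarkdownMatcher()
print(matcher.name, matcher.can_parse(Path("doc.md")))
```

Keeping `can_parse` concrete in the base class means a subclass only has to declare its extensions; instantiating the base class directly fails with `TypeError` because `supported_extensions` is abstract.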
