Base

thoth.ingestion.parsers.base

Base classes for document parsers.

This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.

ParsedDocument dataclass

Result of parsing a document.

Attributes:

content (str): Extracted text content from the document

metadata (dict[str, Any]): Dictionary of metadata extracted from the document

source_path (str): Original file path or identifier

format (str): Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

content: str instance-attribute

metadata: dict[str, Any] = field(default_factory=dict) class-attribute instance-attribute

source_path: str = '' class-attribute instance-attribute

format: str = '' class-attribute instance-attribute

__init__(content: str, metadata: dict[str, Any] = dict(), source_path: str = '', format: str = '') -> None

__post_init__() -> None

Validate parsed document after initialization.
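The field defaults listed above can be illustrated with a small stand-in dataclass. This is a sketch that mirrors the documented attributes, not the library's source. The check inside `__post_init__` is an assumption, since the page does not say what the real validation covers.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in mirroring the documented fields; not the library's source."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Assumed check: the real validation rules are not documented here.
        if not isinstance(self.content, str):
            raise TypeError("content must be a str")


first = ParsedDocument(content="# Title", source_path="doc.md", format="markdown")
second = ParsedDocument(content="plain text")

# default_factory builds a fresh dict per instance, so this write
# does not show up in second.metadata.
first.metadata["title"] = "Title"
print(second.metadata)  # prints {}
```

Using `field(default_factory=dict)` rather than a shared default is what keeps metadata from leaking between documents.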

DocumentParser

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

Example

>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)
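To show what implementing the interface might involve, here is a minimal sketch of a concrete subclass. The `DocumentParser` and `ParsedDocument` classes below are stand-ins rebuilt from the signatures on this page. `PlainTextParser`, the default value of `name`, and the case-insensitive suffix match in `can_parse` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):
    """Stand-in for the documented interface (signatures only)."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @property
    def name(self) -> str:
        # Assumption: the real default name is not documented.
        return type(self).__name__

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...

    def can_parse(self, file_path: Path) -> bool:
        # Assumption: case-insensitive match on the suffix, dot included.
        return file_path.suffix.lower() in self.supported_extensions


class PlainTextParser(DocumentParser):
    """Hypothetical parser for .txt files."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, file_path: Path) -> ParsedDocument:
        if not self.can_parse(file_path):
            raise ValueError(f"Unsupported format: {file_path.suffix}")
        # read_bytes raises FileNotFoundError or another OSError by itself.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        return ParsedDocument(
            content=content.decode("utf-8"),
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


parser = PlainTextParser()
doc = parser.parse_content(b"hello", "notes.txt")
```

Routing `parse` through `parse_content` keeps the decoding logic in one place, so file-based and in-memory inputs produce the same result.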

supported_extensions: list[str] abstractmethod property

Return list of supported file extensions.

Returns:

list[str]: List of extensions including the dot (e.g., ['.md', '.markdown'])

name: str property

Return the parser name.

Returns:

str: Human-readable parser name

parse(file_path: Path) -> ParsedDocument abstractmethod

Parse a document file and return structured content.

Parameters:

file_path (Path, required): Path to the document file

Returns:

ParsedDocument: ParsedDocument with extracted text and metadata

Raises:

ValueError: If file format is not supported

FileNotFoundError: If file doesn't exist

IOError: If file cannot be read
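A caller may want to treat the three documented exceptions differently. The sketch below is one possible wrapper. `try_parse` and `DemoParser` are hypothetical names, and `DemoParser` only reproduces the documented error behavior rather than any real parser.

```python
from pathlib import Path
from typing import Any, Optional


def try_parse(parser: Any, file_path: Path) -> Optional[Any]:
    """Return the parsed document, or None if a documented error occurs."""
    try:
        return parser.parse(file_path)
    except FileNotFoundError:
        # Must come before OSError: FileNotFoundError is a subclass of it.
        print(f"missing: {file_path}")
    except OSError as exc:
        # IOError is an alias of OSError in Python 3.
        print(f"unreadable: {file_path} ({exc})")
    except ValueError as exc:
        print(f"unsupported: {file_path} ({exc})")
    return None


class DemoParser:
    """Hypothetical stand-in that only reproduces the documented errors."""

    def parse(self, file_path: Path) -> str:
        if file_path.suffix != ".md":
            raise ValueError(f"unsupported extension {file_path.suffix!r}")
        return file_path.read_text(encoding="utf-8")


result = try_parse(DemoParser(), Path("report.pdf"))
```

The order of the `except` clauses matters here: catching `OSError` first would also swallow the missing-file case.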

parse_content(content: bytes, source_path: str) -> ParsedDocument abstractmethod

Parse document content from bytes.

This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.

Parameters:

content (bytes, required): Raw file content as bytes

source_path (str, required): Original source path for metadata

Returns:

ParsedDocument: ParsedDocument with extracted text and metadata
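As a rough illustration of the in-memory path, the sketch below parses bytes that never touch the local filesystem. `parse_text_bytes` is a hypothetical body for a plain-text parser's `parse_content`, `fake_bucket` is a dictionary standing in for an object store, and replacing undecodable bytes is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


def parse_text_bytes(content: bytes, source_path: str) -> ParsedDocument:
    """Hypothetical parse_content body for a plain-text parser."""
    # Assumed policy: substitute undecodable bytes instead of failing.
    text = content.decode("utf-8", errors="replace")
    return ParsedDocument(
        content=text,
        metadata={"size_bytes": len(content)},
        source_path=source_path,
        format="text",
    )


# Stand-in for an object store: keys are identifiers, values are raw bytes.
fake_bucket = {"reports/q1.txt": b"Revenue grew.\n"}

docs = [parse_text_bytes(data, key) for key, data in fake_bucket.items()]
```

Here `source_path` is only carried along as an identifier for the resulting document; nothing is opened or read from disk.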

can_parse(file_path: Path) -> bool

Check if this parser can handle the given file.

Parameters:

file_path (Path, required): Path to check

Returns:

bool: True if this parser supports the file's extension
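One common use of an extension check like this is picking a parser from a list. The sketch below assumes the check compares the lowercased suffix against `supported_extensions`. `ExtensionParser` and `pick_parser` are hypothetical helpers, not part of the documented module.

```python
from pathlib import Path
from typing import Optional


class ExtensionParser:
    """Hypothetical stand-in exposing only what can_parse needs."""

    def __init__(self, name: str, extensions: list[str]) -> None:
        self.name = name
        self.supported_extensions = extensions

    def can_parse(self, file_path: Path) -> bool:
        # Assumed rule: compare the lowercased suffix, dot included.
        return file_path.suffix.lower() in self.supported_extensions


def pick_parser(
    parsers: list[ExtensionParser], file_path: Path
) -> Optional[ExtensionParser]:
    """Return the first parser that accepts the file, or None."""
    return next((p for p in parsers if p.can_parse(file_path)), None)


parsers = [
    ExtensionParser("markdown", [".md", ".markdown"]),
    ExtensionParser("text", [".txt"]),
]

chosen = pick_parser(parsers, Path("notes/README.MD"))
print(chosen.name if chosen else "no parser")  # prints markdown
```

Note that `Path.suffix` returns only the last extension, so a file such as `archive.tar.gz` would be matched against `.gz`.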