thoth.ingestion.parsers.base

Base classes for document parsers.

This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.

Functions

`abstractmethod`(funcobj)	A decorator indicating abstract methods.
`dataclass`([cls, init, repr, eq, order, ...])	Add dunder methods based on the fields defined in the class.
`field`(*[, default, default_factory, init, ...])	Return an object to identify dataclass fields.

Classes

`ABC`()	Helper class that provides a standard way to create an ABC using inheritance.
`Any`(*args, **kwargs)	Special type indicating an unconstrained type.
`DocumentParser`()	Abstract base class for document parsers.
`ParsedDocument`(content, metadata, ...)	Result of parsing a document.
`Path`(*args, **kwargs)	PurePath subclass that can make system calls.

class thoth.ingestion.parsers.base.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')

Bases: object

Result of parsing a document.

content

Extracted text content from the document

Type: str

metadata

Dictionary of metadata extracted from the document

Type: dict[str, Any]

source_path

Original file path or identifier

Type: str

format

Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

Type: str

content: str

metadata: dict[str, Any]

source_path: str = ''

format: str = ''

__post_init__() → None

Validate the parsed document after initialization.

__init__(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '') → None
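As a rough illustration of how the fields and defaults above fit together, the following is a minimal stand-in dataclass that mirrors the documented signature; it is not the library's source. The check inside `__post_init__` is an assumption, since this page does not say what the real validation covers.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in mirroring the documented fields (not the library's code)."""

    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Assumption: the real validation rules are not documented here.
        # This sketch only rejects non-string content.
        if not isinstance(self.content, str):
            raise TypeError("content must be a str")


first = ParsedDocument(content="# Title", source_path="doc.md", format="markdown")
second = ParsedDocument(content="plain text")

# default_factory builds a fresh dict per instance, so this write
# does not leak into `second.metadata`.
first.metadata["title"] = "Title"
```

The `<factory>` shown in the signature corresponds to `field(default_factory=dict)`: a plain `{}` default would be rejected by `dataclass`, and a shared dict would let one document's metadata bleed into another's.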

class thoth.ingestion.parsers.base.DocumentParser

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

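Example

>>> from pathlib import Path
>>> # MarkdownParser is assumed to be a concrete DocumentParser subclass.
>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)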

abstract property supported_extensions: list[str]

Return list of supported file extensions.

Returns: List of extensions including the dot (e.g., ['.md', '.markdown'])

abstractmethod parse(file_path: Path) → ParsedDocument

Parse a document file and return structured content.

Parameters:

file_path – Path to the document file

Returns:

ParsedDocument with extracted text and metadata

Raises:

ValueError – If the file format is not supported
FileNotFoundError – If the file does not exist
IOError – If the file cannot be read

abstractmethod parse_content(content: bytes, source_path: str) → ParsedDocument

Parse document content from bytes.

This method parses content that has already been loaded into memory, which is useful when files are fetched from cloud storage rather than read from a local path.

Parameters:

content – Raw file content as bytes
source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata
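To make the relationship between `parse` and `parse_content` concrete, here is a hedged sketch of a concrete parser. `DocumentParser` and `ParsedDocument` are re-declared as stand-ins so the block runs on its own, and `PlainTextParser`, the `size_bytes` metadata key, and the bucket path are all hypothetical. Having `parse` delegate to `parse_content` is one reasonable design, not necessarily the one the library uses.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):
    """Stand-in for the documented interface (abstract members only)."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...


class PlainTextParser(DocumentParser):
    """Hypothetical parser for .txt files."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, file_path: Path) -> ParsedDocument:
        if file_path.suffix.lower() not in self.supported_extensions:
            raise ValueError(f"Unsupported file format: {file_path.suffix!r}")
        # read_bytes() raises FileNotFoundError for a missing file and
        # other OSError (IOError) subclasses for unreadable ones.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        # Undecodable bytes are replaced rather than raising.
        text = content.decode("utf-8", errors="replace")
        return ParsedDocument(
            content=text,
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


# Bytes that were already downloaded, e.g. from an object store.
doc = PlainTextParser().parse_content(
    b"hello from a bucket", "gs://example-bucket/notes.txt"
)
```

With this layout the file-based path only adds the extension check and the read, so both entry points share one decoding and metadata routine.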

can_parse(file_path: Path) → bool

Check if this parser can handle the given file.

Parameters: file_path – Path to check
Returns: True if this parser supports the file's extension
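The extension check described above could plausibly be written as a suffix lookup against `supported_extensions`. The sketch below uses a hypothetical `ExtensionMatcher` class rather than the real base class, and the case-insensitive comparison is an assumption; the page does not say how case is handled.

```python
from pathlib import Path


class ExtensionMatcher:
    """Hypothetical stand-in showing only the extension check."""

    supported_extensions = [".md", ".markdown"]

    def can_parse(self, file_path: Path) -> bool:
        # Path.suffix includes the leading dot, matching the documented
        # format of supported_extensions.
        # Assumption: lower-casing makes "NOTES.MD" match ".md".
        return file_path.suffix.lower() in self.supported_extensions


matcher = ExtensionMatcher()
print(matcher.can_parse(Path("README.md")))   # True
print(matcher.can_parse(Path("report.pdf")))  # False
```

Only the final suffix is considered, so a path with no extension yields an empty suffix and is rejected.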

property name: str

Return the parser name.

Returns: Human-readable parser name