thoth.ingestion.parsers.base

Base classes for document parsers.

This module defines the abstract interface for document parsers and the ParsedDocument data structure used across all parser implementations.

Classes

DocumentParser()

Abstract base class for document parsers.

ParsedDocument(content, metadata, ...)

Result of parsing a document.

class thoth.ingestion.parsers.base.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')

Bases: object

Result of parsing a document.

content

Extracted text content from the document

Type:

str

metadata

Dictionary of metadata extracted from the document

Type:

dict[str, Any]

source_path

Original file path or identifier

Type:

str

format

Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

Type:

str

content: str
metadata: dict[str, Any]
source_path: str = ''
format: str = ''
__post_init__() -> None

Validate parsed document after initialization.
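
The field list and the `<factory>` default in the signature suggest a standard dataclass. The following is a minimal stand-in sketch, not the library's source: it assumes `metadata` is built with `default_factory=dict`, and the check inside `__post_init__` is hypothetical, since the actual validation rules are not documented here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in mirroring the documented fields; not the library's source."""

    content: str
    # Assumed to be what the "<factory>" default in the signature stands for.
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Hypothetical validation; the real checks are not documented.
        if not isinstance(self.content, str):
            raise TypeError("content must be a str")


first = ParsedDocument(content="# Title", source_path="doc.md", format="markdown")
second = ParsedDocument(content="plain text")

# Each instance gets its own metadata dict, so mutating one does not leak.
first.metadata["title"] = "Title"
print(second.metadata)  # {}
```

Using a factory rather than a shared literal default is what keeps `second.metadata` empty after `first.metadata` is modified.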

__init__(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '') -> None

class thoth.ingestion.parsers.base.DocumentParser

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

Example

>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)

abstract property supported_extensions: list[str]

Return list of supported file extensions.

Returns:

List of extensions including the dot (e.g., ['.md', '.markdown'])

abstractmethod parse(file_path: Path) -> ParsedDocument

Parse a document file and return structured content.

Parameters:

file_path – Path to the document file

Returns:

ParsedDocument with extracted text and metadata

Raises:

abstractmethod parse_content(content: bytes, source_path: str) -> ParsedDocument

Parse document content from bytes.

This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata
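
One way to satisfy both abstract methods is to have `parse` read the file and delegate to `parse_content`, so disk and in-memory inputs share one code path. The sketch below uses stand-in versions of `ParsedDocument` and `DocumentParser` so it runs on its own. `PlainTextParser`, the `size_bytes` metadata key, and the `gs://` path are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:  # stand-in with the documented fields
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):  # stand-in with only the documented abstract members
    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...


class PlainTextParser(DocumentParser):  # hypothetical parser, illustration only
    @property
    def supported_extensions(self) -> list[str]:
        return [".txt"]

    def parse(self, file_path: Path) -> ParsedDocument:
        # Read from disk, then reuse the in-memory code path.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        text = content.decode("utf-8", errors="replace")
        return ParsedDocument(
            content=text,
            metadata={"size_bytes": len(content)},
            source_path=source_path,
            format="text",
        )


# Bytes as they might arrive from an object-store download (hypothetical URI).
doc = PlainTextParser().parse_content(b"hello world\n", "gs://bucket/notes.txt")
print(doc.format, doc.metadata["size_bytes"])  # text 12
```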

can_parse(file_path: Path) -> bool

Check if this parser can handle the given file.

Parameters:

file_path – Path to check

Returns:

True if this parser supports the file’s extension
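
The extension check described above could be implemented roughly as follows. This is a sketch under assumptions, not the library's code: `ExtensionMatcher` is a hypothetical stand-in, and lowercasing the suffix before comparison is a design choice made here, not documented behavior.

```python
from pathlib import Path


class ExtensionMatcher:
    """Hypothetical stand-in showing only the extension check."""

    supported_extensions = [".md", ".markdown"]

    def can_parse(self, file_path: Path) -> bool:
        # Lowercase the suffix so "README.MD" also matches (an assumption).
        return file_path.suffix.lower() in self.supported_extensions


matcher = ExtensionMatcher()
print(matcher.can_parse(Path("doc.md")))          # True
print(matcher.can_parse(Path("README.MD")))       # True
print(matcher.can_parse(Path("archive.tar.gz")))  # False
```

Note that `Path.suffix` returns only the final extension, so `archive.tar.gz` is matched against `.gz`.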

property name: str

Return the parser name.

Returns:

Human-readable parser name
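
Taken together, `can_parse` and `name` make it straightforward to choose a parser for a file and report which one was used. The sketch below is self-contained and hypothetical throughout: `_Parser`, `PdfParser`, and `pick_parser` are illustrative names, the `MarkdownParser` here is a stand-in for the one in the class example, and returning the class name from `name` is an assumption.

```python
from pathlib import Path
from typing import Optional


class _Parser:
    """Hypothetical minimal parser exposing only can_parse and name."""

    extensions: list[str] = []

    @property
    def name(self) -> str:
        # Assumption: the class name serves as the human-readable name.
        return type(self).__name__

    def can_parse(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.extensions


class MarkdownParser(_Parser):
    extensions = [".md", ".markdown"]


class PdfParser(_Parser):
    extensions = [".pdf"]


def pick_parser(parsers: list[_Parser], file_path: Path) -> Optional[_Parser]:
    """Return the first parser that accepts the file, or None."""
    return next((p for p in parsers if p.can_parse(file_path)), None)


registry = [MarkdownParser(), PdfParser()]
for candidate in ["notes.md", "report.pdf", "image.png"]:
    parser = pick_parser(registry, Path(candidate))
    print(candidate, "->", parser.name if parser else "no parser")
```

With this registry, `notes.md` and `report.pdf` resolve to their respective parsers and `image.png` falls through to the "no parser" branch.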