thoth.ingestion.parsers

Document parsers for multi-format ingestion.

This module provides a unified interface for parsing different document formats (Markdown, PDF, plain text, Word documents).

Example

>>> from thoth.ingestion.parsers import ParserFactory
>>> from pathlib import Path
>>>
>>> doc = ParserFactory.parse(Path("document.pdf"))
>>> print(doc.content)

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

DocumentParser()

Abstract base class for document parsers.

DocxParser()

Parser for Word documents using python-docx.

MarkdownParser()

Parser for Markdown files.

PDFParser()

Parser for PDF files using PyMuPDF.

ParsedDocument(content, metadata, ...)

Result of parsing a document.

ParserFactory()

Factory for creating and using document parsers.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

TextParser()

Parser for plain text files.

class thoth.ingestion.parsers.DocumentParser[source]

Bases: ABC

Abstract base class for document parsers.

All document parsers must implement this interface to ensure consistent behavior across different file formats.

Example

>>> from pathlib import Path
>>> from thoth.ingestion.parsers import MarkdownParser
>>>
>>> parser = MarkdownParser()
>>> if parser.can_parse(Path("doc.md")):
...     doc = parser.parse(Path("doc.md"))
...     print(doc.content)

abstract property supported_extensions: list[str]

Return list of supported file extensions.

Returns:

List of extensions including the dot (e.g., ['.md', '.markdown'])

abstractmethod parse(file_path: Path) → ParsedDocument[source]

Parse a document file and return structured content.

Parameters:

file_path – Path to the document file

Returns:

ParsedDocument with extracted text and metadata

Raises:

abstractmethod parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse document content from bytes.

This method allows parsing content that has already been loaded into memory, useful for processing files from cloud storage.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata

can_parse(file_path: Path) → bool[source]

Check if this parser can handle the given file.

Parameters:

file_path – Path to check

Returns:

True if this parser supports the file’s extension

property name: str

Return the parser name.

Returns:

Human-readable parser name
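
The interface above can be illustrated with a small self-contained sketch. The ParsedDocument and DocumentParser classes below are stand-ins written for this example, not the real thoth classes, and CsvParser is a hypothetical format. The case-insensitive suffix match in can_parse and the class-name default for name are assumptions about behavior that the reference does not spell out.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ParsedDocument:
    """Stand-in with the four documented fields."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""


class DocumentParser(ABC):
    """Stand-in mirroring the documented interface; not the real class."""

    @property
    @abstractmethod
    def supported_extensions(self) -> list[str]: ...

    @abstractmethod
    def parse(self, file_path: Path) -> ParsedDocument: ...

    @abstractmethod
    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument: ...

    def can_parse(self, file_path: Path) -> bool:
        # Assumption: matching is case-insensitive on the file suffix.
        return file_path.suffix.lower() in self.supported_extensions

    @property
    def name(self) -> str:
        # Assumption: the human-readable name defaults to the class name.
        return type(self).__name__


class CsvParser(DocumentParser):
    """Hypothetical parser, used only to show the required overrides."""

    @property
    def supported_extensions(self) -> list[str]:
        return [".csv"]

    def parse(self, file_path: Path) -> ParsedDocument:
        # Read the file once and reuse the bytes-based code path.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> ParsedDocument:
        return ParsedDocument(content.decode("utf-8"), {}, source_path, "csv")


parser = CsvParser()
doc = parser.parse_content(b"a,b\n1,2\n", "memory://table.csv")
```

Routing parse through parse_content keeps a single decoding path for both local files and in-memory bytes, which is one reasonable way to satisfy both abstract methods.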

class thoth.ingestion.parsers.DocxParser[source]

Bases: DocumentParser

Parser for Word documents using python-docx.

Supports:
  • Word documents (.docx)
  • Paragraph text extraction
  • Basic metadata extraction (title, author)

Note

Only supports .docx format (Office Open XML). Legacy .doc files are not supported.

property supported_extensions: list[str]

Return supported Word document extensions.

parse(file_path: Path) → ParsedDocument[source]

Parse a Word document.

Parameters:

file_path – Path to the Word document

Returns:

ParsedDocument with extracted text and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse Word document content from bytes.

Parameters:
  • content – Raw document content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata

class thoth.ingestion.parsers.MarkdownParser[source]

Bases: DocumentParser

Parser for Markdown files.

Supports:
  • Standard Markdown (.md, .markdown, .mdown)
  • YAML frontmatter extraction
  • UTF-8 encoding

property supported_extensions: list[str]

Return supported Markdown extensions.

parse(file_path: Path) → ParsedDocument[source]

Parse a Markdown file.

Parameters:

file_path – Path to the Markdown file

Returns:

ParsedDocument with content and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse Markdown content from bytes.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with content and extracted metadata
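
As a rough illustration of frontmatter extraction, the sketch below splits a leading `---` block from the Markdown body using only the standard library. The function name split_frontmatter is hypothetical, and the simple `key: value` handling is a deliberate simplification: it is not a YAML parser, and the reference does not say which YAML library MarkdownParser uses.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' delimited block from the Markdown body.

    Only flat 'key: value' lines are understood; nested YAML is not.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all

    # Look for the closing delimiter after the opening one.
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # opening delimiter without a closing one

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:])
    return metadata, body


raw = b"---\ntitle: Notes\nauthor: Ada\n---\n# Heading\n\nBody text.\n"
metadata, body = split_frontmatter(raw.decode("utf-8"))
```

In this sketch the extracted pairs would populate the metadata field of the result and the remaining text would become its content.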

class thoth.ingestion.parsers.PDFParser[source]

Bases: DocumentParser

Parser for PDF files using PyMuPDF.

Supports:
  • PDF files (.pdf)
  • Text extraction with page numbers
  • Basic metadata extraction (title, author, page count)

property supported_extensions: list[str]

Return supported PDF extensions.

parse(file_path: Path) → ParsedDocument[source]

Parse a PDF file.

Parameters:

file_path – Path to the PDF file

Returns:

ParsedDocument with extracted text and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse PDF content from bytes.

Parameters:
  • content – Raw PDF content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata

class thoth.ingestion.parsers.ParsedDocument(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '')[source]

Bases: object

Result of parsing a document.

content

Extracted text content from the document

Type:

str

metadata

Dictionary of metadata extracted from the document

Type:

dict[str, Any]

source_path

Original file path or identifier

Type:

str

format

Document format identifier (e.g., 'markdown', 'pdf', 'text', 'docx')

Type:

str

content: str
metadata: dict[str, Any]
source_path: str = ''
format: str = ''

__post_init__() → None[source]

Validate parsed document after initialization.
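
The `<factory>` default in the signature and the `__post_init__` hook are both standard dataclass features. The sketch below shows how they typically fit together. ParsedDocumentSketch is a stand-in written for this example, and the validation rule (content must be a string) is an assumption; the reference does not state what the real check is.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocumentSketch:
    """Stand-in with the same four fields as the documented class."""
    content: str
    # default_factory gives each instance its own dict; this is what
    # Sphinx renders as "<factory>" in the signature.
    metadata: dict[str, Any] = field(default_factory=dict)
    source_path: str = ""
    format: str = ""

    def __post_init__(self) -> None:
        # Assumed rule, for illustration only: content must be a string.
        if not isinstance(self.content, str):
            raise TypeError("content must be a string")


first = ParsedDocumentSketch("first")
second = ParsedDocumentSketch("second")
first.metadata["title"] = "A"  # does not leak into `second`
```

Because the generated `__init__` calls `__post_init__` automatically, an invalid document fails at construction time rather than later in the ingestion pipeline.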

__init__(content: str, metadata: dict[str, Any] = <factory>, source_path: str = '', format: str = '') → None

class thoth.ingestion.parsers.ParserFactory[source]

Bases: object

Factory for creating and using document parsers.

This factory maintains a registry of available parsers and provides methods to parse files using the appropriate parser based on file extension.

Example

>>> from pathlib import Path
>>> from thoth.ingestion.parsers import ParserFactory
>>>
>>> # Parse a single file
>>> doc = ParserFactory.parse(Path("notes.md"))
>>>
>>> # Get parser for a specific file
>>> parser = ParserFactory.get_parser(Path("document.pdf"))
>>> if parser:
...     doc = parser.parse(Path("document.pdf"))
>>>
>>> # Check supported extensions
>>> extensions = ParserFactory.supported_extensions()
>>> print(extensions)  # ['.md', '.markdown', '.mdown', '.pdf', '.txt', ...]

classmethod get_parser(file_path: Path) → DocumentParser | None[source]

Get appropriate parser for a file.

Parameters:

file_path – Path to the file to parse

Returns:

DocumentParser instance if a suitable parser exists, None otherwise

classmethod parse(file_path: Path) → ParsedDocument[source]

Parse a file using the appropriate parser.

Parameters:

file_path – Path to the file to parse

Returns:

ParsedDocument with extracted content and metadata

Raises:

classmethod parse_content(content: bytes, source_path: str, extension: str) → ParsedDocument[source]

Parse content bytes using a parser for the given extension.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

  • extension – File extension (e.g., '.pdf')

Returns:

ParsedDocument with extracted content and metadata

Raises:

ValueError – If no parser is available for the extension
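
When the bytes come from object storage rather than a local file, the caller has to supply the extension argument itself. One way to derive it from a storage key is sketched below. The helper name extension_for_key and the example key are hypothetical, and lowercasing the suffix is a precaution taken here, since the reference does not say whether ParserFactory normalizes case.

```python
from pathlib import PurePosixPath


def extension_for_key(object_key: str) -> str:
    """Return the lowercased suffix of a storage key, including the dot.

    Returns an empty string when the key has no suffix.
    """
    return PurePosixPath(object_key).suffix.lower()


ext = extension_for_key("reports/2024/Q1-summary.PDF")
# `ext` can then be passed as the `extension` argument, alongside the
# downloaded bytes and the key itself as `source_path`.
```

PurePosixPath is used because storage keys use forward slashes regardless of the host operating system.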

classmethod supported_extensions() → list[str][source]

Get all supported file extensions.

Returns:

List of supported extensions including the dot (e.g., ['.md', '.pdf'])

classmethod can_parse(file_path: Path) → bool[source]

Check if any parser can handle the given file.

Parameters:

file_path – Path to check

Returns:

True if a parser is available for the file

classmethod register_parser(parser_class: type[DocumentParser]) → None[source]

Register a new parser class.

Parameters:

parser_class – Parser class to register
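
To make the registry idea concrete, here is a minimal sketch of an extension-keyed factory. RegistrySketch and UpperTextParser are hypothetical names invented for this example; the real ParserFactory may store and look up parsers differently. Only the documented behaviors are mirrored: registration by class, lookup by extension, None for unknown files, and ValueError from parse_content when no parser matches.

```python
from pathlib import Path


class UpperTextParser:
    """Hypothetical parser; a real one would subclass DocumentParser."""
    supported_extensions = [".up"]

    def parse_content(self, content: bytes, source_path: str) -> str:
        return content.decode("utf-8").upper()


class RegistrySketch:
    """Illustrative registry keyed by extension; not thoth's ParserFactory."""
    _parsers: dict[str, type] = {}

    @classmethod
    def register_parser(cls, parser_class: type) -> None:
        # Map every extension the parser declares to its class.
        for ext in parser_class().supported_extensions:
            cls._parsers[ext.lower()] = parser_class

    @classmethod
    def get_parser(cls, file_path: Path):
        parser_class = cls._parsers.get(file_path.suffix.lower())
        return parser_class() if parser_class else None

    @classmethod
    def parse_content(cls, content: bytes, source_path: str, extension: str):
        parser_class = cls._parsers.get(extension.lower())
        if parser_class is None:
            raise ValueError(f"No parser available for extension {extension!r}")
        return parser_class().parse_content(content, source_path)

    @classmethod
    def supported_extensions(cls) -> list[str]:
        return sorted(cls._parsers)


RegistrySketch.register_parser(UpperTextParser)
result = RegistrySketch.parse_content(b"hello", "memory://greeting.up", ".up")
```

Registering classes rather than instances lets the registry create a fresh parser per call, which avoids shared state between documents.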

class thoth.ingestion.parsers.TextParser[source]

Bases: DocumentParser

Parser for plain text files.

Supports:
  • Plain text files (.txt, .text)
  • UTF-8 encoding with fallback to latin-1

property supported_extensions: list[str]

Return supported text extensions.

parse(file_path: Path) → ParsedDocument[source]

Parse a plain text file.

Parameters:

file_path – Path to the text file

Returns:

ParsedDocument with content

Raises:

FileNotFoundError – If file doesn’t exist

parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse text content from bytes.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with content
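
The UTF-8 with latin-1 fallback mentioned for this parser can be sketched as follows. The helper name decode_text is hypothetical, and returning the encoding alongside the text is a choice made for this example rather than documented behavior.

```python
def decode_text(content: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to latin-1 on failure.

    Returns the decoded text and the name of the encoding that worked.
    """
    try:
        return content.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail,
        # although the result may be wrong for other legacy encodings.
        return content.decode("latin-1"), "latin-1"


text, encoding = decode_text("café".encode("latin-1"))
```

Trying UTF-8 first matters: latin-1 accepts any byte sequence, so reversing the order would silently mis-decode valid UTF-8 input.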

Modules

base

Base classes for document parsers.

docx

Word document parser.

markdown

Markdown document parser.

pdf

PDF document parser.

text

Plain text document parser.