thoth.ingestion.parsers.pdf

PDF document parser.

This module provides parsing for PDF files using PyMuPDF (fitz).

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

DocumentParser()

Abstract base class for document parsers.

PDFParser()

Parser for PDF files using PyMuPDF.

ParsedDocument(content, metadata, ...)

Result of parsing a document.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

class thoth.ingestion.parsers.pdf.PDFParser[source]

Bases: DocumentParser

Parser for PDF files using PyMuPDF.

Supports:

  • PDF files (.pdf)

  • Text extraction with page numbers

  • Basic metadata extraction (title, author, page count)

property supported_extensions: list[str]

Return supported PDF extensions.
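
As a rough illustration of how a caller might use this property to decide whether a file belongs to this parser, the sketch below checks a file's suffix against a list of extensions. The helper name `is_supported` is hypothetical, and the assumption that the list holds dotted, lowercase suffixes such as ".pdf" is not confirmed by the source.

```python
from pathlib import Path


def is_supported(file_path: Path, extensions: list[str]) -> bool:
    """Return True if the file's suffix is in `extensions` (hypothetical helper)."""
    # Assumption: extensions are dotted and lowercase, e.g. [".pdf"].
    # Lowercasing the suffix makes the check case-insensitive.
    return file_path.suffix.lower() in extensions


print(is_supported(Path("report.PDF"), [".pdf"]))
print(is_supported(Path("notes.txt"), [".pdf"]))
```

With the real class, the second argument would presumably be `PDFParser().supported_extensions`.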

parse(file_path: Path) → ParsedDocument[source]

Parse a PDF file.

Parameters:

file_path – Path to the PDF file

Returns:

ParsedDocument with extracted text and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument[source]

Parse PDF content from bytes.

Parameters:
  • content – Raw PDF content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata
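
To make the bytes-based path more concrete, here is a minimal sketch of how text and basic metadata could be pulled from in-memory PDF content with PyMuPDF. It is not the module's actual implementation: `ParsedDocumentSketch`, `join_pages`, `parse_pdf_bytes`, the `[Page N]` marker format, and the metadata key names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedDocumentSketch:
    """Hypothetical stand-in for ParsedDocument; the real fields may differ."""
    content: str
    metadata: dict = field(default_factory=dict)


def join_pages(page_texts: list[str]) -> str:
    """Prefix each page's text with a 1-based page marker (format is an assumption)."""
    return "\n\n".join(
        f"[Page {number}]\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def parse_pdf_bytes(content: bytes, source_path: str) -> ParsedDocumentSketch:
    """Extract text and basic metadata from in-memory PDF bytes."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    # Open the document from bytes rather than from a file on disk.
    with fitz.open(stream=content, filetype="pdf") as doc:
        page_texts = [page.get_text() for page in doc]
        info = doc.metadata or {}
        metadata = {
            "source_path": source_path,
            # Empty strings from the PDF info dictionary are normalized to None.
            "title": info.get("title") or None,
            "author": info.get("author") or None,
            "page_count": doc.page_count,
        }
    return ParsedDocumentSketch(content=join_pages(page_texts), metadata=metadata)
```

With the documented class, the equivalent call would be along the lines of `PDFParser().parse_content(data, "report.pdf")`, where `data` holds the raw bytes of the file.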