Pdf

thoth.ingestion.parsers.pdf

PDF document parser.

This module provides parsing for PDF files using PyMuPDF (fitz).

logger = setup_logger(__name__) module-attribute

PDFParser

Parser for PDF files using PyMuPDF.

Supports:

- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)
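
The capabilities above can be approximated directly with PyMuPDF. The following is a minimal sketch, not the actual `PDFParser` source: the helper names `join_pages` and `extract_pdf`, the `[Page N]` marker format, and the shape of the returned metadata dictionary are all assumptions made for illustration.

```python
from pathlib import Path


def join_pages(page_texts: list[str]) -> str:
    """Join per-page text, prefixing each page with a 1-based marker.

    The "[Page N]" marker format is hypothetical, not taken from PDFParser.
    """
    return "\n\n".join(
        f"[Page {number}]\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def extract_pdf(file_path: Path) -> tuple[str, dict]:
    """Extract text and basic metadata from a PDF (hypothetical helper)."""
    # Imported lazily so that a missing PyMuPDF install raises ImportError here.
    import fitz  # PyMuPDF

    with fitz.open(str(file_path)) as doc:
        page_texts = [page.get_text() for page in doc]
        info = doc.metadata or {}
        metadata = {
            "title": info.get("title"),
            "author": info.get("author"),
            "page_count": doc.page_count,
        }
    return join_pages(page_texts), metadata
```

Keeping the page-joining step separate from the PyMuPDF calls makes the formatting logic testable without a PDF on disk.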

supported_extensions: list[str] property

Return supported PDF extensions.
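
One plausible use of this property is deciding whether a parser should handle a given file. The sketch below assumes the list contains lowercase, dot-prefixed suffixes such as `.pdf`; `can_handle` and `StubParser` are hypothetical names, and the stub stands in for `PDFParser` so the example runs without the package installed.

```python
from pathlib import Path


def can_handle(parser, file_path: Path) -> bool:
    """Return True if the file's suffix is among the parser's supported extensions."""
    return file_path.suffix.lower() in parser.supported_extensions


class StubParser:
    """Stand-in for PDFParser so the sketch runs without thoth installed."""

    @property
    def supported_extensions(self) -> list[str]:
        # Assumed value, inferred from the ".pdf" extension documented for PDFParser.
        return [".pdf"]


print(can_handle(StubParser(), Path("report.PDF")))  # True
print(can_handle(StubParser(), Path("notes.txt")))   # False
```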

parse(file_path: Path) -> ParsedDocument

Parse a PDF file.

Parameters:

Name | Type | Description | Default
file_path | Path | Path to the PDF file | required

Returns:

Type | Description
ParsedDocument | ParsedDocument with extracted text and metadata

Raises:

Type | Description
FileNotFoundError | If file doesn't exist
ImportError | If PyMuPDF is not installed
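
Callers may want to handle the two documented exceptions explicitly. The wrapper below is a hedged sketch: `safe_parse` is a hypothetical helper that accepts any object exposing `parse(file_path)`, and returning `None` on failure is an illustrative choice rather than documented behavior.

```python
import logging
from pathlib import Path
from typing import Any, Optional

logger = logging.getLogger(__name__)


def safe_parse(parser: Any, file_path: Path) -> Optional[Any]:
    """Call parser.parse and return None on either documented error."""
    try:
        return parser.parse(file_path)
    except FileNotFoundError:
        logger.warning("PDF not found: %s", file_path)
    except ImportError:
        logger.error("PyMuPDF is not installed; cannot parse %s", file_path)
    return None
```

With the real class this would be called as `safe_parse(PDFParser(), Path("paper.pdf"))`, assuming `PDFParser` can be constructed without arguments, which this page does not confirm.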

parse_content(content: bytes, source_path: str) -> ParsedDocument

Parse PDF content from bytes.

Parameters:

Name | Type | Description | Default
content | bytes | Raw PDF content as bytes | required
source_path | str | Original source path for metadata | required

Returns:

Type | Description
ParsedDocument | ParsedDocument with extracted text and metadata
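
`parse_content` is useful when the PDF arrives as bytes, for example from an upload or an object store, rather than as a file on disk. The sketch below adds a cheap header check before delegating; `parse_pdf_bytes` and `PDF_MAGIC` are hypothetical names, and the `ValueError` is raised by this helper, not by `parse_content`, whose error behavior this page does not document.

```python
from typing import Any

PDF_MAGIC = b"%PDF-"  # PDF files conventionally begin with this header


def parse_pdf_bytes(parser: Any, content: bytes, source_path: str) -> Any:
    """Validate the header, then delegate to parser.parse_content."""
    if not content.startswith(PDF_MAGIC):
        raise ValueError(f"{source_path} does not look like a PDF")
    return parser.parse_content(content, source_path)
```

The `source_path` argument is passed through unchanged, so the resulting document metadata can still point back to where the bytes came from.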