thoth.ingestion.parsers.pdf

PDF document parser.

This module provides parsing for PDF files using PyMuPDF (fitz).

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

`DocumentParser`()	Abstract base class for document parsers.
`PDFParser`()	Parser for PDF files using PyMuPDF.
`ParsedDocument`(content, metadata, ...)	Result of parsing a document.
`Path`(*args, **kwargs)	PurePath subclass that can make system calls.

class thoth.ingestion.parsers.pdf.PDFParser

Parser for PDF files using PyMuPDF.

Supports:

- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)
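As a rough illustration of the features listed above, the sketch below shows one way per-page text and basic metadata could be combined into a single result. It does not use PyMuPDF or thoth itself: `ParsedDocumentSketch`, `assemble_pages`, and the `[Page N]` marker format are hypothetical names and conventions chosen for this example, not the library's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedDocumentSketch:
    """Hypothetical stand-in for ParsedDocument (content plus metadata)."""

    content: str
    metadata: dict = field(default_factory=dict)


def assemble_pages(page_texts, title="", author=""):
    """Join per-page text with 1-based page markers and collect basic metadata."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # The "[Page N]" marker format is an assumption made for illustration.
        parts.append(f"[Page {number}]\n{text.strip()}")
    metadata = {"title": title, "author": author, "page_count": len(page_texts)}
    return ParsedDocumentSketch(content="\n\n".join(parts), metadata=metadata)


doc = assemble_pages(["Intro text\n", "Second page"], title="Demo", author="A. Writer")
print(doc.content)
print(doc.metadata["page_count"])
```

Keeping the page number next to each page's text means a downstream consumer can still trace a passage back to its page after the text has been merged into one string.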

Parse a PDF file.

Parameters:

file_path – Path to the PDF file

Returns:

ParsedDocument with extracted text and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse PDF content from bytes.

Parameters:

content – PDF file content as bytes

source_path – Path identifying where the content came from

Returns:

ParsedDocument with extracted text and metadata
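A caller that already holds PDF bytes (for example from an upload or an object store) might wrap `parse_content` as sketched below. This is a minimal sketch under stated assumptions: `parse_pdf_bytes` and `EchoParser` are hypothetical helpers, and `EchoParser` only stands in for `PDFParser` so the snippet runs without thoth or PyMuPDF installed. The `%PDF-` header check is a cheap heuristic, not something the documented API is known to perform.

```python
from typing import Protocol


class BytesParser(Protocol):
    """Anything exposing parse_content(content, source_path), such as PDFParser."""

    def parse_content(self, content: bytes, source_path: str): ...


def parse_pdf_bytes(parser: BytesParser, content: bytes, source_path: str):
    """Reject obviously non-PDF input, then delegate to the parser."""
    # Heuristic: PDF files carry a "%PDF-" header near the start of the file.
    if b"%PDF-" not in content[:1024]:
        raise ValueError(f"{source_path} does not look like a PDF")
    return parser.parse_content(content, source_path)


class EchoParser:
    """Hypothetical stub used only to make this example self-contained."""

    def parse_content(self, content: bytes, source_path: str):
        return {"source": source_path, "size": len(content)}


result = parse_pdf_bytes(EchoParser(), b"%PDF-1.7\n...", "reports/q1.pdf")
print(result)
```

In real use, an instance of `PDFParser` would be passed in place of `EchoParser`, and `source_path` would serve as the label for where the bytes originated.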