thoth.ingestion.parsers.pdf
PDF document parser.
This module provides parsing for PDF files using PyMuPDF (fitz).
Functions
Create and configure a logger with structured JSON output.
Classes
DocumentParser | Abstract base class for document parsers.
PDFParser | Parser for PDF files using PyMuPDF.
ParsedDocument | Result of parsing a document.
Path | PurePath subclass that can make system calls.
- class thoth.ingestion.parsers.pdf.PDFParser
Bases: DocumentParser
Parser for PDF files using PyMuPDF.
Supports:
- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)
- parse(file_path: Path) → ParsedDocument
Parse a PDF file.
- Parameters:
file_path – Path to the PDF file
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
FileNotFoundError – If the file does not exist
ImportError – If PyMuPDF is not installed
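The contract described for parse() can be illustrated with a minimal sketch. This is not the thoth implementation: `ParsedSketch` and `parse_pdf_sketch` are hypothetical names, the fields of the real ParsedDocument are not shown on this page, and both the order of the checks and the `[Page N]` marker format are assumptions. Only the PyMuPDF calls (`fitz.open`, `page.get_text`, `doc.metadata`, `doc.page_count`) are real library API.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedSketch:
    """Hypothetical stand-in for ParsedDocument (real fields are not documented here)."""

    text: str
    metadata: dict = field(default_factory=dict)


def parse_pdf_sketch(file_path: Path) -> ParsedSketch:
    """Extract text and basic metadata from a PDF file on disk."""
    # Assumption: the existence check runs before the PyMuPDF import.
    if not file_path.exists():
        raise FileNotFoundError(f"PDF file not found: {file_path}")

    try:
        import fitz  # PyMuPDF, imported lazily so a missing install fails here
    except ImportError as exc:
        raise ImportError("PyMuPDF is required to parse PDF files") from exc

    with fitz.open(str(file_path)) as doc:
        # Prefix each page's text with its 1-based page number.
        # The marker format is an assumption made for this sketch.
        pages = [
            f"[Page {number}]\n{page.get_text()}"
            for number, page in enumerate(doc, start=1)
        ]
        info = doc.metadata or {}
        metadata = {
            "title": info.get("title") or "",
            "author": info.get("author") or "",
            "page_count": doc.page_count,
        }

    return ParsedSketch(text="\n\n".join(pages), metadata=metadata)
```

A caller would typically wrap the call in `try`/`except` for the two documented exceptions, for example to skip missing files during a batch ingestion run.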
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse PDF content from bytes.
- Parameters:
content – Raw PDF content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata
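Parsing from bytes could look roughly like the sketch below, which opens the PDF from memory instead of from a path. The function names `normalize_metadata` and `parse_pdf_bytes_sketch`, the metadata keys, and the tuple return value are all assumptions made for illustration; the real method returns a ParsedDocument whose fields are not listed on this page. `fitz.open(stream=..., filetype="pdf")` is PyMuPDF's documented way to open an in-memory document.

```python
from __future__ import annotations


def normalize_metadata(raw: dict | None, page_count: int, source_path: str) -> dict:
    """Reduce PyMuPDF's metadata dict to a few basic fields (hypothetical key names)."""
    raw = raw or {}
    return {
        "title": raw.get("title") or "",
        "author": raw.get("author") or "",
        "page_count": page_count,
        # The original path is carried along purely as metadata.
        "source_path": source_path,
    }


def parse_pdf_bytes_sketch(content: bytes, source_path: str) -> tuple[str, dict]:
    """Extract text and metadata from raw PDF bytes."""
    try:
        import fitz  # PyMuPDF
    except ImportError as exc:
        raise ImportError("PyMuPDF is required to parse PDF files") from exc

    # Open the document from memory; no file on disk is needed.
    with fitz.open(stream=content, filetype="pdf") as doc:
        text = "\n\n".join(page.get_text() for page in doc)
        metadata = normalize_metadata(doc.metadata, doc.page_count, source_path)

    return text, metadata
```

Keeping the metadata clean-up in a separate pure function makes that part easy to unit test without a PDF fixture or a PyMuPDF install.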