thoth.ingestion.parsers.pdf
PDF document parser.
This module provides parsing for PDF files using PyMuPDF (fitz).
logger = setup_logger(__name__)
module-attribute
PDFParser
Parser for PDF files using PyMuPDF.
Supports:
- PDF files (.pdf)
- Text extraction with page numbers
- Basic metadata extraction (title, author, page count)
supported_extensions: list[str]
property
Return supported PDF extensions.
parse(file_path: Path) -> ParsedDocument
Parse a PDF file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `Path` | Path to the PDF file | required |
Returns:
| Type | Description |
|---|---|
| `ParsedDocument` | ParsedDocument with extracted text and metadata |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the file doesn't exist |
| `ImportError` | If PyMuPDF is not installed |
parse_content(content: bytes, source_path: str) -> ParsedDocument
Parse PDF content from bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `bytes` | Raw PDF content as bytes | required |
| `source_path` | `str` | Original source path for metadata | required |
Returns:
| Type | Description |
|---|---|
| `ParsedDocument` | ParsedDocument with extracted text and metadata |