thoth.ingestion.parsers.docx
Word document parser.
This module provides parsing for Word documents using python-docx.
Functions
- Create and configure a logger with structured JSON output.
Classes
- Any: Special type indicating an unconstrained type.
- DocumentParser: Abstract base class for document parsers.
- DocxParser: Parser for Word documents using python-docx.
- ParsedDocument: Result of parsing a document.
- Path: PurePath subclass that can make system calls.
- class thoth.ingestion.parsers.docx.DocxParser
Bases: DocumentParser
Parser for Word documents using python-docx.
Supports:
- Word documents (.docx)
- Paragraph text extraction
- Basic metadata extraction (title, author)
Note
Only supports .docx format (Office Open XML). Legacy .doc files are not supported.
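To make the supported features concrete, here is a minimal sketch of what paragraph text and title/author extraction from an Office Open XML package involve, using only the standard library. It is not how DocxParser is implemented (the parser uses python-docx); the helper names `build_sample_docx` and `extract` are hypothetical, and the sample package contains only the two parts read here, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the core-properties part.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def build_sample_docx() -> bytes:
    """Build a tiny in-memory package with just the two parts read below."""
    document_xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP}" xmlns:dc="{DC}">'
        "<dc:title>Sample</dc:title><dc:creator>Ada</dc:creator>"
        "</cp:coreProperties>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("docProps/core.xml", core_xml)
    return buf.getvalue()


def extract(content: bytes) -> tuple[str, dict[str, str]]:
    """Return (text, metadata) for a .docx package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        body = ET.fromstring(zf.read("word/document.xml"))
        core = ET.fromstring(zf.read("docProps/core.xml"))
    # A paragraph (w:p) holds runs whose text lives in w:t elements.
    paragraphs = [
        "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        for p in body.iter(f"{{{W}}}p")
    ]
    metadata = {
        "title": core.findtext(f"{{{DC}}}title", default=""),
        "author": core.findtext(f"{{{DC}}}creator", default=""),
    }
    return "\n".join(paragraphs), metadata
```

Legacy .doc files are a binary format rather than a ZIP package, which is why an approach like this (and python-docx) cannot read them.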
- parse(file_path: Path) → ParsedDocument
Parse a Word document.
- Parameters:
file_path – Path to the Word document
- Returns:
ParsedDocument with extracted text and metadata
- Raises:
FileNotFoundError – If the file does not exist
ImportError – If python-docx is not installed
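The documented contract can be sketched as a standalone function. This is an illustration under assumptions, not the thoth implementation: the function name `parse_docx_sketch` is hypothetical, it returns a plain dict because the fields of ParsedDocument are not described here, and the order of the two checks (file existence first, then the import) is a guess. The python-docx calls used (`docx.Document`, `.paragraphs`, `.core_properties`) are part of that library's public API.

```python
from pathlib import Path


def parse_docx_sketch(file_path: Path) -> dict:
    """Extract paragraph text and basic metadata from a .docx file."""
    # Assumption: the missing-file check runs before the import check.
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    try:
        import docx  # provided by the python-docx package
    except ImportError as exc:
        raise ImportError("python-docx is required to parse .docx files") from exc

    document = docx.Document(str(file_path))
    props = document.core_properties
    return {
        "text": "\n".join(p.text for p in document.paragraphs),
        "metadata": {
            "title": props.title,
            "author": props.author,
            "source": str(file_path),
        },
    }
```

Callers that accept user-supplied paths would typically catch FileNotFoundError and report the bad path, while treating ImportError as a deployment problem.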
- parse_content(content: bytes, source_path: str) → ParsedDocument
Parse Word document content from bytes.
- Parameters:
content – Raw document content as bytes
source_path – Original source path for metadata
- Returns:
ParsedDocument with extracted text and metadata