Docx
thoth.ingestion.parsers.docx
Word document parser.
This module provides parsing for Word documents using python-docx.
logger = setup_logger(__name__)
module-attribute
DocxParser
Parser for Word documents using python-docx.
Supports:
- Word documents (.docx)
- Paragraph text extraction
- Basic metadata extraction (title, author)
Note
Only supports .docx format (Office Open XML). Legacy .doc files are not supported.
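To illustrate the distinction in the note above, here is a minimal stdlib-only sketch that tells a .docx container apart from a legacy .doc file. It assumes that a .docx is a ZIP archive conventionally holding `word/document.xml`, and that a legacy .doc starts with the OLE compound-file signature. The helper name `looks_like_docx` is hypothetical and is not part of `DocxParser`.

```python
import io
import zipfile

# Signature at the start of OLE compound files, the container used by legacy .doc.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_docx(content: bytes) -> bool:
    """Heuristic check (hypothetical helper): is this an Office Open XML Word file?"""
    if content.startswith(OLE_MAGIC):
        # Legacy binary format, which the parser does not support.
        return False
    if not zipfile.is_zipfile(io.BytesIO(content)):
        return False
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        # Conventional location of the main document part in a .docx package.
        return "word/document.xml" in archive.namelist()
```

A check like this could run before parsing to reject unsupported input early; whether `DocxParser` does anything similar is not stated here.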
supported_extensions: list[str]
property
Return supported Word document extensions.
parse(file_path: Path) -> ParsedDocument
Parse a Word document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the Word document | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with extracted text and metadata |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If file doesn't exist |
| ImportError | If python-docx is not installed |
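The following is a minimal sketch of how a path-based parse with the documented error behaviour might look when written directly against python-docx. It is not the `DocxParser` implementation: `ParsedDocumentSketch` and `parse_docx_sketch` are hypothetical names, the real `ParsedDocument` fields may differ, and the order of the two checks is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedDocumentSketch:
    """Hypothetical stand-in for ParsedDocument; the real fields may differ."""

    text: str
    metadata: dict[str, str] = field(default_factory=dict)


def parse_docx_sketch(file_path: Path) -> ParsedDocumentSketch:
    # A missing file surfaces as FileNotFoundError before anything else (assumed order).
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    # Import lazily so a missing dependency is reported only when parsing is attempted.
    try:
        from docx import Document  # provided by the python-docx package
    except ImportError as exc:
        raise ImportError("python-docx is required to parse .docx files") from exc

    document = Document(str(file_path))

    # Paragraph text extraction: join the non-empty paragraphs with newlines.
    text = "\n".join(p.text for p in document.paragraphs if p.text.strip())

    # Basic metadata extraction from the document's core properties.
    props = document.core_properties
    metadata = {"title": props.title or "", "author": props.author or ""}
    return ParsedDocumentSketch(text=text, metadata=metadata)
```

Callers would typically wrap the call in `try`/`except` for `FileNotFoundError` and `ImportError`, matching the exceptions listed above.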
parse_content(content: bytes, source_path: str) -> ParsedDocument
Parse Word document content from bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | bytes | Raw document content as bytes | required |
| source_path | str | Original source path for metadata | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with extracted text and metadata |
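Parsing from bytes can be sketched by wrapping the content in an in-memory stream, since python-docx's `Document` accepts a file-like object as well as a path. This is an illustration under assumptions, not the `DocxParser` implementation: `parse_docx_bytes_sketch` is a hypothetical name, and the returned dictionary (including its `source` key) only approximates what a `ParsedDocument` might carry.

```python
import io


def parse_docx_bytes_sketch(content: bytes, source_path: str) -> dict:
    """Hypothetical helper: extract text and basic metadata from .docx bytes."""
    from docx import Document  # lazy import; requires the python-docx package

    # Document() accepts a file-like object, so no temporary file is needed.
    document = Document(io.BytesIO(content))

    # Join the non-empty paragraphs with newlines.
    text = "\n".join(p.text for p in document.paragraphs if p.text.strip())

    props = document.core_properties
    return {
        "text": text,
        "metadata": {
            # The bytes carry no filename, so the caller supplies the source path.
            "source": source_path,
            "title": props.title or "",
            "author": props.author or "",
        },
    }
```

This shape suits content fetched from object storage or an HTTP response, where the original path is known to the caller but no local file exists.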