Docx

thoth.ingestion.parsers.docx

Word document parser.

This module provides parsing for Word documents using python-docx.

logger = setup_logger(__name__) (module attribute)

DocxParser

Parser for Word documents using python-docx.

Supports:
- Word documents (.docx)
- Paragraph text extraction
- Basic metadata extraction (title, author)
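
The paragraph and metadata extraction listed above could look roughly like the following with python-docx. This is a minimal sketch, not the DocxParser implementation: the function names `extract_text_and_metadata` and `demo`, the metadata keys, and the choice to skip empty paragraphs are all assumptions made for illustration.

```python
def extract_text_and_metadata(document):
    """Collect paragraph text and basic core properties from a python-docx Document.

    Hypothetical helper: the real parser may join or filter paragraphs differently.
    """
    # Keep only paragraphs that contain visible text (an assumption of this sketch).
    paragraphs = [p.text for p in document.paragraphs if p.text.strip()]
    props = document.core_properties
    metadata = {
        "title": props.title or None,
        "author": props.author or None,
    }
    return "\n".join(paragraphs), metadata


def demo():
    """Build a small document in memory and run the helper on it."""
    from docx import Document  # provided by the python-docx package

    document = Document()
    document.core_properties.title = "Quarterly report"
    document.core_properties.author = "A. Writer"
    document.add_paragraph("First paragraph.")
    document.add_paragraph("Second paragraph.")
    return extract_text_and_metadata(document)
```

Calling `demo()` returns a `(text, metadata)` tuple. Tables, headers, and footers are not covered by `document.paragraphs`, which is consistent with the "paragraph text extraction" scope described here.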

Note

Only supports .docx format (Office Open XML). Legacy .doc files are not supported.
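
To illustrate the difference between the two formats: a .docx file is a ZIP archive that conventionally contains `word/document.xml`, while a legacy .doc file is an OLE compound file that starts with a fixed 8-byte signature. The standard-library sketch below uses those two facts as a heuristic. The helper name `sniff_word_format` is hypothetical, and nothing here is taken from DocxParser itself.

```python
import io
import zipfile

# Signature of OLE compound files, the container used by legacy .doc documents.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(content: bytes) -> str:
    """Return "docx", "doc", or "unknown" based on the raw bytes (heuristic only)."""
    if content.startswith(OLE_MAGIC):
        return "doc"
    stream = io.BytesIO(content)
    if zipfile.is_zipfile(stream):
        with zipfile.ZipFile(stream) as archive:
            # The main document part is conventionally stored under this name.
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"
```

A check like this could be run before handing a file to a .docx parser, so that a renamed .doc file produces a clear message instead of an archive error.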

supported_extensions: list[str] (property)

Return supported Word document extensions.

parse(file_path: Path) -> ParsedDocument

Parse a Word document.

Parameters:

Name | Type | Description | Default
file_path | Path | Path to the Word document | required

Returns:

Type | Description
ParsedDocument | ParsedDocument with extracted text and metadata

Raises:

Type | Description
FileNotFoundError | If file doesn't exist
ImportError | If python-docx is not installed
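
The error contract documented above can be sketched as follows. This is an illustration under stated assumptions, not the actual `parse` method: `ParsedDocumentSketch` is a simplified stand-in for `ParsedDocument` (whose real fields are not shown on this page), `parse_docx_sketch` is a hypothetical name, and the order of the two checks is a guess.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParsedDocumentSketch:
    """Simplified stand-in for ParsedDocument; field names are assumptions."""

    text: str
    metadata: dict = field(default_factory=dict)


def parse_docx_sketch(file_path: Path) -> ParsedDocumentSketch:
    # Fail early when the path does not exist (assumed to be checked first).
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    # Import lazily so a missing dependency surfaces as a clear ImportError.
    try:
        from docx import Document
    except ImportError as exc:
        raise ImportError("python-docx is required to parse .docx files") from exc

    document = Document(str(file_path))
    props = document.core_properties
    return ParsedDocumentSketch(
        text="\n".join(p.text for p in document.paragraphs),
        metadata={"title": props.title, "author": props.author},
    )
```

A caller would typically catch `FileNotFoundError` to report a bad path and let `ImportError` propagate, since a missing dependency is an environment problem that retrying will not fix.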

parse_content(content: bytes, source_path: str) -> ParsedDocument

Parse Word document content from bytes.

Parameters:

Name | Type | Description | Default
content | bytes | Raw document content as bytes | required
source_path | str | Original source path for metadata | required

Returns:

Type | Description
ParsedDocument | ParsedDocument with extracted text and metadata
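
Parsing from bytes is possible because python-docx accepts a file-like object as well as a path, so the raw content can be wrapped in `io.BytesIO`. The sketch below shows that approach. Whether `parse_content` works this way internally is an assumption, and both helper names are hypothetical.

```python
import io


def paragraphs_from_bytes(content: bytes) -> list[str]:
    """Open .docx bytes in memory and return the text of each paragraph."""
    from docx import Document  # provided by the python-docx package

    document = Document(io.BytesIO(content))
    return [p.text for p in document.paragraphs]


def sample_docx_bytes() -> bytes:
    """Create a tiny .docx entirely in memory, for demonstration only."""
    from docx import Document

    buffer = io.BytesIO()
    document = Document()
    document.add_paragraph("Hello from memory.")
    document.save(buffer)  # save() accepts a path or a writable stream
    return buffer.getvalue()
```

Because no file is touched on disk, a value such as `source_path` can only serve as descriptive metadata in this setting, which matches how the parameter is documented above.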