thoth.ingestion.parsers.docx

Word document parser.

This module provides parsing for Word documents using python-docx.

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

DocumentParser()

Abstract base class for document parsers.

DocxParser()

Parser for Word documents using python-docx.

ParsedDocument(content, metadata, ...)

Result of parsing a document.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

class thoth.ingestion.parsers.docx.DocxParser

Bases: DocumentParser

Parser for Word documents using python-docx.

Supports:
  • Word documents (.docx)

  • Paragraph text extraction

  • Basic metadata extraction (title, author)

Note

Only supports .docx format (Office Open XML). Legacy .doc files are not supported.
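To illustrate what "basic metadata extraction (title, author)" can involve, here is a minimal sketch that reads those two fields straight from the Office Open XML package using only the standard library. It is not the DocxParser implementation, which relies on python-docx. The function name `sketch_read_core_metadata` and the returned dictionary keys are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for title/creator in docProps/core.xml.
DC_NS = "http://purl.org/dc/elements/1.1/"


def sketch_read_core_metadata(content: bytes) -> dict[str, str]:
    """Return title/author from a .docx package, or {} if none are recorded.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        # The core properties part is optional in a .docx package.
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))

    metadata: dict[str, str] = {}
    title = root.findtext(f"{{{DC_NS}}}title")
    author = root.findtext(f"{{{DC_NS}}}creator")  # the author is stored as dc:creator
    if title:
        metadata["title"] = title
    if author:
        metadata["author"] = author
    return metadata
```

A legacy .doc file is not a ZIP archive, so `zipfile.ZipFile` would raise `zipfile.BadZipFile` on it, which is consistent with the note above that only .docx is supported.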

property supported_extensions: list[str]

Return supported Word document extensions.
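A caller would typically use this property to decide whether a file should be handed to the parser. The sketch below shows one way to do that check. The value `[".docx"]` (leading dot, lowercase) is an assumption about what the property returns, and `is_supported` is a hypothetical helper, not part of the module.

```python
from pathlib import Path

# Assumed value: the documentation only states that .docx is supported;
# the exact form of the strings in the list is a guess.
ASSUMED_EXTENSIONS = [".docx"]


def is_supported(file_path: Path, extensions: list[str]) -> bool:
    """Return True if the file suffix matches one of the extensions, ignoring case."""
    return file_path.suffix.lower() in {ext.lower() for ext in extensions}


print(is_supported(Path("report.DOCX"), ASSUMED_EXTENSIONS))
print(is_supported(Path("legacy.doc"), ASSUMED_EXTENSIONS))
```

With a real parser instance, the second argument would be `parser.supported_extensions` instead of the assumed constant.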

parse(file_path: Path) → ParsedDocument

Parse a Word document.

Parameters:

file_path – Path to the Word document

Returns:

ParsedDocument with extracted text and metadata

Raises:

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse Word document content from bytes.

Parameters:
  • content – Raw document content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument with extracted text and metadata
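The following sketch illustrates the general shape of parsing a Word document from bytes: open the in-memory package, collect paragraph text, and return it with the source path as metadata. It deliberately uses only the standard library (`zipfile` and `xml.etree.ElementTree`) rather than python-docx, so it is not the module's actual implementation. `SketchParsedDocument`, `sketch_parse_content`, `build_minimal_docx`, the paragraph separator, and the `source_path` metadata key are all hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml in .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass
class SketchParsedDocument:
    """Hypothetical stand-in for ParsedDocument (only content and metadata)."""
    content: str
    metadata: dict = field(default_factory=dict)


def sketch_parse_content(content: bytes, source_path: str) -> SketchParsedDocument:
    """Extract paragraph text from .docx bytes without touching the filesystem."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():  # skip empty paragraphs
            paragraphs.append(text)

    return SketchParsedDocument(
        content="\n\n".join(paragraphs),  # separator is an assumption
        metadata={"source_path": source_path},
    )


def build_minimal_docx(paragraphs: list[str]) -> bytes:
    """Build a tiny archive containing only word/document.xml, for the demo."""
    body = "".join(f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


doc = sketch_parse_content(build_minimal_docx(["Hello", "World"]), "memory/demo.docx")
print(doc.content)
print(doc.metadata)
```

Accepting bytes plus a separate source_path is useful when the document comes from object storage or an upload rather than a local file: the parser never needs a filesystem path, yet the result still records where the content originated.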