Text

thoth.ingestion.parsers.text

Plain text document parser.

This module provides parsing for plain text files.

logger = setup_logger(__name__) module-attribute

TextParser

Parser for plain text files.

Supports:

- Plain text files (.txt, .text)
- UTF-8 encoding with fallback to latin-1
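The encoding fallback described above can be sketched as follows. This is a minimal illustration, not the TextParser source: `decode_text` is a hypothetical helper name, and the exact error handling inside the parser is an assumption.

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("latin-1")


utf8_bytes = "café".encode("utf-8")      # valid UTF-8
latin1_bytes = "café".encode("latin-1")  # b"caf\xe9", not valid UTF-8

from_utf8 = decode_text(utf8_bytes)
from_latin1 = decode_text(latin1_bytes)
```

Because latin-1 accepts any byte sequence, a fallback of this kind always returns a string, although bytes written in some other encoding may come out as the wrong characters.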

supported_extensions: list[str] property

Return supported text extensions.
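A caller might use such a list to decide whether a file should be routed to this parser. The sketch below assumes the extensions are stored with a leading dot and compared case-insensitively; neither detail is stated in this page, and `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names.

```python
from pathlib import Path

# Assumed values, taken from the extensions listed in the class description.
SUPPORTED_EXTENSIONS = [".txt", ".text"]


def is_supported(file_path: Path) -> bool:
    """Return True if the file's suffix is in the supported list (hypothetical helper)."""
    return file_path.suffix.lower() in SUPPORTED_EXTENSIONS
```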

parse(file_path: Path) -> ParsedDocument

Parse a plain text file.

Parameters:

Name       Type  Description            Default
file_path  Path  Path to the text file  required

Returns:

Type            Description
ParsedDocument  ParsedDocument with content

Raises:

Type               Description
FileNotFoundError  If the file does not exist
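Since a missing file raises FileNotFoundError, batch callers may want to catch it rather than abort. The following is a sketch under stated assumptions: `read_bytes_checked` is a hypothetical stand-in for the existence check, not the TextParser implementation, and `try_read` shows one possible caller-side policy.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_bytes_checked(file_path: Path) -> bytes:
    """Hypothetical stand-in for the existence check described under Raises."""
    if not file_path.is_file():
        raise FileNotFoundError(f"Text file not found: {file_path}")
    return file_path.read_bytes()


def try_read(file_path: Path) -> Optional[bytes]:
    """Caller-side handling: return None for a missing file instead of failing."""
    try:
        return read_bytes_checked(file_path)
    except FileNotFoundError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "notes.txt"
    present.write_bytes(b"hello")
    missing = Path(tmp) / "absent.txt"

    found = try_read(present)      # bytes of the existing file
    not_found = try_read(missing)  # None, because the file is absent
```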

parse_content(content: bytes, source_path: str) -> ParsedDocument

Parse text content from bytes.

Parameters:

Name         Type   Description                        Default
content      bytes  Raw file content as bytes          required
source_path  str    Original source path for metadata  required

Returns:

Type            Description
ParsedDocument  ParsedDocument with content
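Parsing from bytes is useful when the content is already in memory, for example after a download. The sketch below shows the general shape of such a function. `ParsedDocumentSketch` and `parse_content_sketch` are hypothetical stand-ins: the fields of the real ParsedDocument are not documented on this page, so the `content` and `metadata` attributes and the `source_path` metadata key are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ParsedDocumentSketch:
    """Hypothetical stand-in; the real ParsedDocument may have different fields."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_content_sketch(content: bytes, source_path: str) -> ParsedDocumentSketch:
    """Decode raw bytes and record where they came from (assumed behavior)."""
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback named in the class description.
        text = content.decode("latin-1")
    return ParsedDocumentSketch(content=text, metadata={"source_path": source_path})


doc = parse_content_sketch(b"line one\nline two\n", "uploads/notes.txt")
```

Here `source_path` is only recorded as metadata; the function never opens it, so it can refer to a location that no longer exists on disk.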