thoth.ingestion.parsers.text

Plain text document parser.

This module provides parsing for plain text files.

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

DocumentParser()

Abstract base class for document parsers.

ParsedDocument(content, metadata, ...)

Result of parsing a document.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

TextParser()

Parser for plain text files.

class thoth.ingestion.parsers.text.TextParser

Bases: DocumentParser

Parser for plain text files.

Supports:
  • Plain text files (.txt, .text)

  • UTF-8 encoding with fallback to latin-1
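The encoding fallback described above can be illustrated with a minimal sketch. The helper name decode_text is hypothetical and is not part of thoth; the sketch only assumes that the parser tries UTF-8 first and falls back to latin-1 when decoding fails.

```python
def decode_text(content: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to latin-1 on failure.

    Hypothetical helper, not part of thoth. Returns the decoded text and
    the name of the encoding that succeeded.
    """
    try:
        return content.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return content.decode("latin-1"), "latin-1"


print(decode_text("café".encode("utf-8")))    # valid UTF-8 input
print(decode_text("café".encode("latin-1")))  # invalid UTF-8, uses the fallback
```

Because latin-1 accepts any byte sequence, a fallback of this kind never raises, but it may produce incorrect characters for files that use some other encoding.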

property supported_extensions: list[str]

Return supported text extensions.
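One plausible use of this property is to decide whether a file should be handed to the parser at all. The sketch below is an assumption-laden illustration: is_supported is a hypothetical helper, the extension values are taken from the class description and assumed to include the leading dot, and the case-insensitive comparison is a choice made here rather than documented behaviour.

```python
from pathlib import Path

# Assumed return value of supported_extensions, based on the class description.
SUPPORTED_EXTENSIONS = [".txt", ".text"]


def is_supported(file_path: Path, extensions: list[str]) -> bool:
    """Return True if the path's suffix is one of the given extensions.

    Hypothetical helper, not part of thoth.
    """
    # Lower-casing the suffix is an assumption; the real parser may be
    # case-sensitive.
    return file_path.suffix.lower() in extensions


print(is_supported(Path("notes.txt"), SUPPORTED_EXTENSIONS))
print(is_supported(Path("report.pdf"), SUPPORTED_EXTENSIONS))
```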

parse(file_path: Path) → ParsedDocument

Parse a plain text file.

Parameters:

file_path – Path to the text file

Returns:

ParsedDocument containing the file's decoded text

Raises:

FileNotFoundError – If the file does not exist

parse_content(content: bytes, source_path: str) → ParsedDocument

Parse text content from bytes.

Parameters:
  • content – Raw file content as bytes

  • source_path – Original source path for metadata

Returns:

ParsedDocument containing the decoded text
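To show how parse and parse_content might relate, here is a self-contained sketch that mirrors the documented signatures without importing thoth. SketchDocument and SketchTextParser are hypothetical stand-ins for ParsedDocument and TextParser; the "source_path" metadata key and the idea that parse delegates to parse_content are assumptions, not confirmed details of the library.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchDocument:
    """Hypothetical stand-in for ParsedDocument (content plus metadata)."""

    content: str
    metadata: dict = field(default_factory=dict)


class SketchTextParser:
    """Hypothetical stand-in that mirrors the documented method signatures."""

    def parse(self, file_path: Path) -> SketchDocument:
        if not file_path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        # Delegating to parse_content is an assumption about the design.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> SketchDocument:
        try:
            text = content.decode("utf-8")
        except UnicodeDecodeError:
            text = content.decode("latin-1")
        # The "source_path" metadata key is an assumed name.
        return SketchDocument(content=text, metadata={"source_path": source_path})


parser = SketchTextParser()

# Parse bytes that are already in memory.
doc = parser.parse_content(b"hello from memory", "remote/notes.txt")
print(doc.content, doc.metadata)

# Parse a file on disk; a missing file raises FileNotFoundError.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("hello from disk", encoding="utf-8")
    print(parser.parse(path).content)
    try:
        parser.parse(Path(tmp) / "missing.txt")
    except FileNotFoundError as exc:
        print("missing:", exc)
```

With the real classes, the equivalent calls would presumably be TextParser().parse(path) for files on disk and TextParser().parse_content(data, source_path) for bytes obtained elsewhere.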