Text
thoth.ingestion.parsers.text
Plain text document parser.
This module provides parsing for plain text files.
logger = setup_logger(__name__)
module-attribute
TextParser
Parser for plain text files.
Supports:
- Plain text files (.txt, .text)
- UTF-8 encoding with fallback to latin-1
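The encoding fallback mentioned above can be illustrated with a minimal sketch. The helper name `decode_text` is hypothetical and not part of the thoth API; it only shows the general "try UTF-8, then latin-1" pattern, and the actual TextParser may handle this differently.

```python
def decode_text(raw: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to latin-1 (hypothetical helper).

    Returns the decoded text together with the name of the encoding used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a code point, so this
        # second attempt cannot raise; it may still produce mojibake
        # if the file was actually written in some other encoding.
        return raw.decode("latin-1"), "latin-1"
```

Because latin-1 never fails to decode, it works as a last-resort fallback, at the cost of possibly misreading bytes from other single-byte encodings.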
supported_extensions: list[str]
property
Return supported text extensions.
parse(file_path: Path) -> ParsedDocument
Parse a plain text file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the text file | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with content |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the file doesn't exist |
parse_content(content: bytes, source_path: str) -> ParsedDocument
Parse text content from bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | bytes | Raw file content as bytes | required |
| source_path | str | Original source path for metadata | required |
Returns:
| Type | Description |
|---|---|
| ParsedDocument | ParsedDocument with content |
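To show how the two methods documented above might relate, here is a self-contained sketch built on stand-in classes. `SketchDocument` and `SketchTextParser` are hypothetical names, the `content` and `metadata` fields are assumptions about what a parsed document could carry, and having `parse` delegate to `parse_content` is one plausible design rather than the confirmed thoth implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchDocument:
    """Hypothetical stand-in for ParsedDocument; real fields may differ."""
    content: str
    metadata: dict = field(default_factory=dict)


class SketchTextParser:
    """Illustrative stand-in, not the thoth TextParser itself."""

    @property
    def supported_extensions(self) -> list[str]:
        # Extensions listed in the class description above.
        return [".txt", ".text"]

    def parse(self, file_path: Path) -> SketchDocument:
        # Mirror the documented behaviour: a missing file raises.
        if not file_path.exists():
            raise FileNotFoundError(f"File not found: {file_path}")
        # Assumption: reading bytes and delegating keeps decoding in one place.
        return self.parse_content(file_path.read_bytes(), str(file_path))

    def parse_content(self, content: bytes, source_path: str) -> SketchDocument:
        # UTF-8 first, latin-1 as the fallback named in the class description.
        try:
            text = content.decode("utf-8")
        except UnicodeDecodeError:
            text = content.decode("latin-1")
        return SketchDocument(content=text, metadata={"source_path": source_path})
```

With this shape, content that is already in memory (for example, an upload) can go straight through `parse_content`, while `parse` only adds the file-system check and read.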