thoth.ingestion.chunker¶
Document chunking for multi-format ingestion.
This module provides intelligent chunking of documents that:
- Respects document structure (headers, paragraphs, sections)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
- Supports multiple formats via DocumentChunker
Research findings and strategy:
- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: split at header/paragraph boundaries when possible
- Metadata: file path, header hierarchy, timestamps, chunk IDs, source, format
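The sizing strategy above can be sketched as a simple sliding token window with overlap. This is a minimal illustration of the idea, not the module's implementation; real token counting differs from the naive whitespace split used here:

```python
def sketch_chunks(text: str, max_tokens: int = 1000, overlap: int = 150) -> list[str]:
    """Naive fixed-window chunking with overlap (illustration only)."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = max_tokens - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # the final window reached the end of the text
    return chunks
```

With the defaults, the last 150 tokens of each chunk reappear at the start of the next, which is what "overlap ensures context continuity" refers to.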
Classes
- ChunkMetadata – Metadata for a document chunk.
- Chunk – Represents a chunk of markdown content with metadata.
- MarkdownChunker – Intelligent markdown-aware chunking.
- DocumentChunker – Generalized document chunker for multi-format support.
- class thoth.ingestion.chunker.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '')[source]¶
Bases: object
Metadata for a document chunk.
- to_dict() → dict[str, Any][source]¶
Convert metadata to a dict suitable for vector store metadata columns.
Ensures all values are store-compatible types (str, int, float, bool). Lists (e.g., headers) are converted to comma-separated strings.
- Returns:
Dict with chunk_id, file_path, chunk_index, total_chunks, headers (str), start_line, end_line, token_count, char_count, timestamp, overlap flags, source, format.
- __init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '') → None¶
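The flattening that to_dict performs can be illustrated with a small stand-alone helper. This is a sketch of the documented behavior (scalar passthrough, lists joined into comma-separated strings), not the actual implementation; the function name `flatten_for_store` is invented for this example:

```python
from typing import Any

def flatten_for_store(metadata: dict[str, Any]) -> dict[str, Any]:
    """Coerce values to vector-store-compatible scalars; join lists with commas."""
    out: dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, list):
            out[key] = ",".join(str(v) for v in value)  # e.g. the headers field
        elif isinstance(value, (str, int, float, bool)):
            out[key] = value  # already store-compatible
        else:
            out[key] = str(value)  # fall back to string for anything else
    return out
```

Flattening matters because most vector stores restrict metadata columns to scalar types, so nested values must be serialized before upsert.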
- class thoth.ingestion.chunker.Chunk(content: str, metadata: ChunkMetadata)[source]¶
Bases: object
Represents a chunk of markdown content with metadata.
- metadata: ChunkMetadata¶
- __init__(content: str, metadata: ChunkMetadata) → None¶
- class thoth.ingestion.chunker.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)[source]¶
Bases: object
Intelligent markdown-aware chunking.
This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.
- __init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)[source]¶
Initialize the markdown chunker.
- Parameters:
min_chunk_size – Minimum chunk size in tokens
max_chunk_size – Maximum chunk size in tokens
overlap_size – Number of tokens to overlap between chunks
logger – Logger instance
- chunk_file(file_path: Path) → list[Chunk][source]¶
Chunk a markdown file.
- Parameters:
file_path – Path to the markdown file
- Returns:
List of chunks with metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file is empty or invalid
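The structure-aware splitting this chunker performs (preferring header boundaries over arbitrary cut points) can be sketched as follows. This is an illustration of the idea under the assumption of ATX-style headers, not the chunker's actual code:

```python
import re

def split_at_headers(markdown: str) -> list[str]:
    """Split markdown into sections at ATX header lines (#, ##, ...)."""
    sections: list[list[str]] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new section whenever a header line appears (unless we
        # are at the very beginning and have collected nothing yet).
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return ["\n".join(s) for s in sections]
```

In practice, sections produced this way would still be merged or re-split to land inside the min/max token budget, with overlap applied between neighbors.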
- class thoth.ingestion.chunker.DocumentChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)[source]¶
Bases: object
Generalized document chunker for multi-format support.
This chunker uses MarkdownChunker for markdown files and provides generic paragraph-based chunking for other formats (PDF, text, docx).
Example
>>> from thoth.ingestion.parsers import ParserFactory
>>> chunker = DocumentChunker()
>>> parsed_doc = ParserFactory.parse(Path("document.pdf"))
>>> chunks = chunker.chunk_document(parsed_doc, source="dnd")
- __init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)[source]¶
Initialize the document chunker.
- Parameters:
min_chunk_size – Minimum chunk size in tokens
max_chunk_size – Maximum chunk size in tokens
overlap_size – Number of tokens to overlap between chunks
logger – Logger instance
- chunk_document(content: str, source_path: str, source: str = '', doc_format: str = '') → list[Chunk][source]¶
Chunk a document based on its format.
- Parameters:
content – Document text content
source_path – Source file path for metadata
source – Source identifier (e.g., ‘handbook’, ‘dnd’)
doc_format – Document format (e.g., ‘markdown’, ‘pdf’, ‘text’, ‘docx’)
- Returns:
List of chunks with metadata including source and format
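The format dispatch described above can be sketched as a single routing function. This is a simplified stand-in: the markdown branch is a placeholder for what the real module delegates to MarkdownChunker, and the fallback packs blank-line-separated paragraphs up to a character budget (the real chunker budgets in tokens):

```python
def chunk_by_format(content: str, doc_format: str, max_chars: int = 4000) -> list[str]:
    """Route markdown to structure-aware handling; otherwise pack paragraphs."""
    if doc_format == "markdown":
        # Real implementation: delegate to a markdown-aware chunker.
        return [content]  # placeholder
    # Generic fallback for pdf/text/docx: greedily pack paragraphs
    # (separated by blank lines) until the size budget is reached.
    chunks: list[str] = []
    current = ""
    for para in content.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Greedy paragraph packing keeps natural boundaries intact for formats where no richer structure (like markdown headers) is available.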