thoth.ingestion.chunker
Markdown-aware chunking for handbook content.
This module provides intelligent chunking of markdown files that:
- Respects markdown structure (headers, lists, code blocks)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
Research findings and strategy:
- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: Split at header boundaries when possible
- Metadata: File path, header hierarchy, timestamps, chunk IDs
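The size and overlap figures above can be illustrated with a small sliding-window sketch. It is not the module's implementation: the function name `sliding_windows` is hypothetical, and it works on an already tokenized list because the tokenizer used by the chunker is not documented here.

```python
def sliding_windows(tokens, max_size=1000, overlap=150):
    """Split `tokens` into windows of at most `max_size` tokens.

    Each window after the first starts with the last `overlap` tokens
    of the previous window, which preserves context across boundaries.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be non-negative and smaller than max_size")

    step = max_size - overlap  # how far the window start advances each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Hypothetical input: 2500 placeholder tokens.
tokens = [f"t{i}" for i in range(2500)]
windows = sliding_windows(tokens, max_size=1000, overlap=150)
print([len(w) for w in windows])
```

With these settings each window start advances by 850 tokens, so consecutive windows share exactly 150 tokens. A structure-aware chunker would additionally prefer to cut at header boundaries rather than at fixed offsets.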
Functions
| dataclass | Add dunder methods based on the fields defined in the class. |
| field | Return an object to identify dataclass fields. |
Classes
| Any | Special type indicating an unconstrained type. |
| Chunk | Represents a chunk of markdown content with metadata. |
| ChunkMetadata | Metadata for a document chunk. |
| MarkdownChunker | Intelligent markdown-aware chunking. |
| Path | PurePath subclass that can make system calls. |
| datetime | The year, month and day arguments are required. |
- class thoth.ingestion.chunker.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)
Bases: object
Metadata for a document chunk.
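The `<factory>` defaults in the signature indicate fields built with `dataclasses.field(default_factory=...)`. The sketch below mirrors the documented fields in a stand-in class named `ChunkMetadataSketch` (a hypothetical name, not part of thoth). The ISO-8601 UTC timestamp factory is an assumption, since the real factory is not shown in this reference.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkMetadataSketch:
    """Stand-in mirroring the documented ChunkMetadata fields."""

    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    # A factory gives every instance its own list instead of a shared one.
    headers: list[str] = field(default_factory=list)
    start_line: int = 0
    end_line: int = 0
    token_count: int = 0
    char_count: int = 0
    # Assumption: the timestamp is an ISO-8601 string taken at creation time.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    overlap_with_previous: bool = False
    overlap_with_next: bool = False


# Hypothetical values for illustration only.
meta = ChunkMetadataSketch(
    chunk_id="intro-0",
    file_path="handbook/intro.md",
    chunk_index=0,
    total_chunks=3,
    headers=["Handbook", "Introduction"],
)
print(meta.headers, meta.overlap_with_next)
```

Only the first four fields are required; the rest fall back to their defaults when omitted.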
- class thoth.ingestion.chunker.Chunk(content: str, metadata: ChunkMetadata)
Bases: object
Represents a chunk of markdown content with metadata.
- metadata: ChunkMetadata
- __init__(content: str, metadata: ChunkMetadata) → None
- class thoth.ingestion.chunker.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Bases: object
Intelligent markdown-aware chunking.
This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.
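As a rough illustration of what respecting markdown structure can mean, the following sketch splits text at ATX headers, ignores `#` lines inside fenced code blocks, and records the header hierarchy for each section. The function `split_at_headers` is hypothetical and makes no claim about how `MarkdownChunker` works internally.

```python
import re

FENCE = "`" * 3  # fenced code block delimiter
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_at_headers(text):
    """Return a list of (header_path, body) sections.

    Splits only at ATX headers that are outside fenced code blocks, and
    tracks the header hierarchy, e.g. ["Guide", "Setup"].
    """
    sections = []
    path, body, in_fence = [], [], False

    def flush():
        content = "\n".join(body).strip()
        if path or content:
            sections.append((list(path), content))

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Keep ancestors above this level, then append the new header.
            path = path[:level - 1] + [match.group(2)]
            body = []
        else:
            body.append(line)
    flush()
    return sections


doc = "\n".join([
    "# Guide",
    "Intro text.",
    "## Setup",
    FENCE + "bash",
    "# not a header",
    "echo hello",
    FENCE,
    "## Usage",
    "Run it.",
])
sections = split_at_headers(doc)
for headers, content in sections:
    print(headers, len(content))
```

Sections produced this way could then be merged or further divided to fit the configured minimum and maximum sizes.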
- __init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Initialize the markdown chunker.
- Parameters:
min_chunk_size – Minimum chunk size in tokens
max_chunk_size – Maximum chunk size in tokens
overlap_size – Number of tokens to overlap between chunks
logger – Logger instance
- chunk_file(file_path: Path) → list[Chunk]
Chunk a markdown file.
- Parameters:
file_path – Path to the markdown file
- Returns:
List of chunks with metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file is empty or invalid