Chunker
thoth.ingestion.chunker
¶
Document chunking for multi-format ingestion.
This module provides intelligent chunking of documents that:

- Respects document structure (headers, paragraphs, sections)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
- Supports multiple formats via DocumentChunker
Research findings and strategy:

- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: Split at header/paragraph boundaries when possible
- Metadata: File path, header hierarchy, timestamps, chunk IDs, source, format
DEFAULT_MIN_CHUNK_SIZE = 500
module-attribute
¶
DEFAULT_MAX_CHUNK_SIZE = 1000
module-attribute
¶
DEFAULT_OVERLAP_SIZE = 150
module-attribute
¶
APPROX_TOKENS_PER_CHAR = 0.25
module-attribute
¶
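The `APPROX_TOKENS_PER_CHAR` constant suggests a simple character-based token estimate (roughly 4 characters per token). A minimal sketch of how such an approximation might be used; `estimate_tokens` is a hypothetical helper, not necessarily the module's actual function:

```python
# Illustrative sketch: estimate token counts from character counts
# using the module's APPROX_TOKENS_PER_CHAR ratio (0.25 ≈ 4 chars/token).
APPROX_TOKENS_PER_CHAR = 0.25

def estimate_tokens(text: str) -> int:
    """Rough token estimate; e.g. 400 characters ≈ 100 tokens."""
    return int(len(text) * APPROX_TOKENS_PER_CHAR)
```

An estimate like this avoids pulling in a tokenizer dependency at the cost of accuracy, which is acceptable when chunk sizes only need to fall within a target range.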
MSG_INVALID_FILE = 'Invalid file path: {path}'
module-attribute
¶
MSG_CHUNK_FAILED = 'Failed to chunk file: {path}'
module-attribute
¶
MSG_EMPTY_CONTENT = 'Empty content provided for chunking'
module-attribute
¶
MSG_INVALID_OVERLAP = 'Overlap size must be less than minimum chunk size'
module-attribute
¶
ChunkMetadata
dataclass
¶
Metadata for a document chunk.
chunk_id: str
instance-attribute
¶
file_path: str
instance-attribute
¶
chunk_index: int
instance-attribute
¶
total_chunks: int
instance-attribute
¶
headers: list[str] = field(default_factory=list)
class-attribute
instance-attribute
¶
start_line: int = 0
class-attribute
instance-attribute
¶
end_line: int = 0
class-attribute
instance-attribute
¶
token_count: int = 0
class-attribute
instance-attribute
¶
char_count: int = 0
class-attribute
instance-attribute
¶
timestamp: str = field(default_factory=(lambda: datetime.now().astimezone().isoformat()))
class-attribute
instance-attribute
¶
overlap_with_previous: bool = False
class-attribute
instance-attribute
¶
overlap_with_next: bool = False
class-attribute
instance-attribute
¶
source: str = ''
class-attribute
instance-attribute
¶
format: str = ''
class-attribute
instance-attribute
¶
__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = list(), start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = (lambda: datetime.now().astimezone().isoformat())(), overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '') -> None
¶
to_dict() -> dict[str, Any]
¶
Convert metadata to a dict suitable for vector store metadata columns.
Ensures all values are store-compatible types (str, int, float, bool). Lists (e.g., headers) are converted to comma-separated strings.
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with chunk_id, file_path, chunk_index, total_chunks, headers (str), start_line, end_line, token_count, char_count, timestamp, overlap flags, source, format. |
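A standalone sketch of the conversion `to_dict` describes, using an illustrative subset of the documented attributes; the exact separator used to flatten `headers` is an assumption, and the real method covers every field of ChunkMetadata:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ChunkMetadataSketch:
    """Illustrative subset of ChunkMetadata; names match the documented attributes."""
    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    headers: list[str] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # Lists are not store-compatible, so headers become a comma-separated
        # string (separator is an assumption for illustration).
        return {
            "chunk_id": self.chunk_id,
            "file_path": self.file_path,
            "chunk_index": self.chunk_index,
            "total_chunks": self.total_chunks,
            "headers": ",".join(self.headers),
        }
```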
Chunk
dataclass
¶
A chunk of document content with its associated metadata.
MarkdownChunker
¶
Intelligent markdown-aware chunking.
This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.
min_chunk_size = min_chunk_size
instance-attribute
¶
max_chunk_size = max_chunk_size
instance-attribute
¶
overlap_size = overlap_size
instance-attribute
¶
logger = logger or setup_logger(__name__)
instance-attribute
¶
__init__(min_chunk_size: int = DEFAULT_MIN_CHUNK_SIZE, max_chunk_size: int = DEFAULT_MAX_CHUNK_SIZE, overlap_size: int = DEFAULT_OVERLAP_SIZE, logger: logging.Logger | logging.LoggerAdapter | None = None)
¶
Initialize the markdown chunker.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| min_chunk_size | int | Minimum chunk size in tokens | DEFAULT_MIN_CHUNK_SIZE |
| max_chunk_size | int | Maximum chunk size in tokens | DEFAULT_MAX_CHUNK_SIZE |
| overlap_size | int | Number of tokens to overlap between chunks | DEFAULT_OVERLAP_SIZE |
| logger | Logger \| LoggerAdapter \| None | Logger instance | None |
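The `MSG_INVALID_OVERLAP` constant implies the constructor enforces a relationship between `overlap_size` and `min_chunk_size`. A hedged sketch of what such a check might look like; `validate_sizes` is a hypothetical helper, and the library's actual validation is not shown in this reference:

```python
DEFAULT_MIN_CHUNK_SIZE = 500
DEFAULT_OVERLAP_SIZE = 150
MSG_INVALID_OVERLAP = "Overlap size must be less than minimum chunk size"

def validate_sizes(min_chunk_size: int, overlap_size: int) -> None:
    # The overlap must leave room for unique content in every chunk,
    # otherwise consecutive chunks could be entirely redundant.
    if overlap_size >= min_chunk_size:
        raise ValueError(MSG_INVALID_OVERLAP)
```

With the documented defaults (overlap 150, minimum 500) this check passes.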
chunk_file(file_path: Path) -> list[Chunk]
¶
Chunk a markdown file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the markdown file | required |

Returns:

| Type | Description |
|---|---|
| list[Chunk] | List of chunks with metadata |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If file doesn't exist |
| ValueError | If file is empty or invalid |
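The chunker's "structure preservation" strategy means splits prefer header boundaries. A simplified standalone sketch of header-aware splitting, not the library's actual implementation:

```python
import re

def split_at_headers(markdown: str) -> list[str]:
    """Split markdown text into sections at ATX header boundaries."""
    sections: list[list[str]] = []
    for line in markdown.splitlines():
        # Start a new section at each header; otherwise extend the current one.
        if re.match(r"^#{1,6}\s", line) or not sections:
            sections.append([])
        sections[-1].append(line)
    return ["\n".join(s) for s in sections]
```

In the real chunker, sections produced this way would then be merged or re-split so each chunk lands in the 500-1000 token target range.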
DocumentChunker
¶
Generalized document chunker for multi-format support.
This chunker uses MarkdownChunker for markdown files and provides generic paragraph-based chunking for other formats (PDF, text, docx).
Example

from thoth.ingestion.parsers import ParserFactory

chunker = DocumentChunker()
parsed_doc = ParserFactory.parse(Path("document.pdf"))
chunks = chunker.chunk_document(parsed_doc, source="dnd")
min_chunk_size = min_chunk_size
instance-attribute
¶
max_chunk_size = max_chunk_size
instance-attribute
¶
overlap_size = overlap_size
instance-attribute
¶
logger = logger or setup_logger(__name__)
instance-attribute
¶
__init__(min_chunk_size: int = DEFAULT_MIN_CHUNK_SIZE, max_chunk_size: int = DEFAULT_MAX_CHUNK_SIZE, overlap_size: int = DEFAULT_OVERLAP_SIZE, logger: logging.Logger | logging.LoggerAdapter | None = None)
¶
Initialize the document chunker.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| min_chunk_size | int | Minimum chunk size in tokens | DEFAULT_MIN_CHUNK_SIZE |
| max_chunk_size | int | Maximum chunk size in tokens | DEFAULT_MAX_CHUNK_SIZE |
| overlap_size | int | Number of tokens to overlap between chunks | DEFAULT_OVERLAP_SIZE |
| logger | Logger \| LoggerAdapter \| None | Logger instance | None |
chunk_document(content: str, source_path: str, source: str = '', doc_format: str = '') -> list[Chunk]
¶
Chunk a document based on its format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| content | str | Document text content | required |
| source_path | str | Source file path for metadata | required |
| source | str | Source identifier (e.g., 'handbook', 'dnd') | '' |
| doc_format | str | Document format (e.g., 'markdown', 'pdf', 'text', 'docx') | '' |

Returns:

| Type | Description |
|---|---|
| list[Chunk] | List of chunks with metadata including source and format |
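For non-markdown formats, the class description mentions generic paragraph-based chunking. A minimal standalone sketch of such a fallback, assuming paragraphs are separated by blank lines and packing greedily by character count; the real algorithm's details may differ:

```python
def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs at blank-line boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if p]

def greedy_chunks(paragraphs: list[str], max_chars: int = 4000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk when adding this paragraph would overflow it.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A character budget of ~4000 corresponds to the 1000-token maximum under the module's 0.25 tokens-per-character approximation.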
chunk_file(file_path: Path, source: str = '', doc_format: str = 'markdown') -> list[Chunk]
¶
Chunk a file directly (for backward compatibility).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to the file | required |
| source | str | Source identifier | '' |
| doc_format | str | Document format | 'markdown' |

Returns:

| Type | Description |
|---|---|
| list[Chunk] | List of chunks with metadata |