thoth.ingestion.chunker

Document chunking for multi-format ingestion.

This module provides intelligent chunking of documents that:

- Respects document structure (headers, paragraphs, sections)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
- Supports multiple formats via DocumentChunker

Research findings and strategy:

- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: split at header/paragraph boundaries when possible
- Metadata: file path, header hierarchy, timestamps, chunk IDs, source, format
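The overlap strategy above can be illustrated with a minimal sliding-window sketch. This is not the module's actual implementation (the real chunkers also respect header and paragraph boundaries); `sliding_chunks` is a hypothetical standalone function showing only how consecutive chunks share tokens:

```python
# Illustrative sketch of overlapping token-window chunking; the real
# MarkdownChunker additionally splits at structural boundaries.

def sliding_chunks(tokens, max_size=1000, overlap=150):
    """Return lists of at most max_size tokens, overlapping by `overlap`."""
    step = max_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break
    return chunks

chunks = sliding_chunks([f"tok{i}" for i in range(2000)])
# Each pair of consecutive chunks shares 150 tokens of context.
```

Because the window advances by `max_size - overlap`, the tail of each chunk reappears at the head of the next, which is what preserves context across chunk boundaries.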

Functions

dataclass([cls, init, repr, eq, order, ...])

Add dunder methods based on the fields defined in the class.

field(*[, default, default_factory, init, ...])

Return an object to identify dataclass fields.

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

Chunk(content, metadata)

Represents a chunk of markdown content with metadata.

ChunkMetadata(chunk_id, file_path, ...)

Metadata for a document chunk.

DocumentChunker([min_chunk_size, ...])

Generalized document chunker for multi-format support.

MarkdownChunker([min_chunk_size, ...])

Intelligent markdown-aware chunking.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

datetime(year, month, day[, hour[, minute[, ...]]])

The year, month and day arguments are required.

class thoth.ingestion.chunker.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '')

Bases: object

Metadata for a document chunk.

chunk_id: str
file_path: str
chunk_index: int
total_chunks: int
headers: list[str]
start_line: int = 0
end_line: int = 0
token_count: int = 0
char_count: int = 0
timestamp: str
overlap_with_previous: bool = False
overlap_with_next: bool = False
source: str = ''
format: str = ''
to_dict() → dict[str, Any]

Convert metadata to a dict suitable for vector store metadata columns.

Ensures all values are store-compatible types (str, int, float, bool). Lists (e.g., headers) are converted to comma-separated strings.

Returns:

Dict with chunk_id, file_path, chunk_index, total_chunks, headers (str), start_line, end_line, token_count, char_count, timestamp, overlap flags, source, format.
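The list-flattening behaviour described above can be sketched as follows. `flatten_metadata` is an illustrative standalone function, not the actual `ChunkMetadata.to_dict` implementation:

```python
# Sketch of the flattening described above: list values (e.g. the
# header hierarchy) become comma-separated strings so that every
# value is a store-compatible scalar (str, int, float, bool).

def flatten_metadata(meta: dict) -> dict:
    out = {}
    for key, value in meta.items():
        if isinstance(value, list):
            out[key] = ",".join(str(v) for v in value)
        else:
            out[key] = value
    return out

flatten_metadata({"chunk_id": "c1", "headers": ["Intro", "Setup"], "token_count": 512})
# → {"chunk_id": "c1", "headers": "Intro,Setup", "token_count": 512}
```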

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '') → None
class thoth.ingestion.chunker.Chunk(content: str, metadata: ChunkMetadata)

Bases: object

Represents a chunk of markdown content with metadata.

content: str
metadata: ChunkMetadata
to_dict() → dict[str, Any]

Convert chunk to dictionary.

__init__(content: str, metadata: ChunkMetadata) → None
class thoth.ingestion.chunker.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Bases: object

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Initialize the markdown chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance
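The structure-aware splitting that this chunker performs can be sketched with a simplified header split. This is illustrative only (`split_at_headers` is a hypothetical helper, and the real chunker also enforces the token-size and overlap parameters above):

```python
import re

# Illustrative sketch of header-aware splitting: break markdown text
# immediately before each ATX header so chunks begin on section
# boundaries whenever possible.

def split_at_headers(text: str) -> list[str]:
    """Split markdown into sections, each starting at a `#`-style header."""
    # Zero-width lookahead split keeps the header line with its section.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p for p in parts if p.strip()]

doc = "# Title\nintro\n\n## Setup\nsteps\n\n## Usage\nexample\n"
sections = split_at_headers(doc)
# → three sections, starting "# Title", "## Setup", "## Usage"
```

Splitting on a lookahead rather than on the header pattern itself means the header text is retained at the start of each section, which is what lets chunk metadata record the header hierarchy.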

chunk_file(file_path: Path) → list[Chunk]

Chunk a markdown file.

Parameters:

file_path – Path to the markdown file

Returns:

List of chunks with metadata

chunk_text(text: str, source_path: str = '') → list[Chunk]

Chunk markdown text content.

Parameters:
  • text – Markdown text to chunk

  • source_path – Source file path for metadata

Returns:

List of chunks with metadata

class thoth.ingestion.chunker.DocumentChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Bases: object

Generalized document chunker for multi-format support.

This chunker uses MarkdownChunker for markdown files and provides generic paragraph-based chunking for other formats (PDF, text, docx).

Example

>>> from thoth.ingestion.parsers import ParserFactory
>>> chunker = DocumentChunker()
>>> parsed_doc = ParserFactory.parse(Path("document.pdf"))
>>> chunks = chunker.chunk_document(
...     parsed_doc.content,  # assuming the parser result exposes .content
...     source_path="document.pdf",
...     source="dnd",
...     doc_format="pdf",
... )
__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Initialize the document chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance

chunk_document(content: str, source_path: str, source: str = '', doc_format: str = '') → list[Chunk]

Chunk a document based on its format.

Parameters:
  • content – Document text content

  • source_path – Source file path for metadata

  • source – Source identifier (e.g., ‘handbook’, ‘dnd’)

  • doc_format – Document format (e.g., ‘markdown’, ‘pdf’, ‘text’, ‘docx’)

Returns:

List of chunks with metadata including source and format
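The format dispatch described above (markdown to structure-aware chunking; other formats to generic paragraph-based chunking) can be sketched as follows. Both helpers are illustrative, heavily simplified stand-ins, not the actual DocumentChunker logic:

```python
import re

def paragraph_chunks(text: str) -> list[str]:
    """Generic fallback for pdf/text/docx: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_format(content: str, doc_format: str) -> list[str]:
    if doc_format == "markdown":
        # Structure-aware path (simplified here to a header split);
        # the real implementation delegates to MarkdownChunker.
        return [s for s in re.split(r"(?m)^(?=# )", content) if s.strip()]
    # All other formats fall back to paragraph-based chunking.
    return paragraph_chunks(content)

chunk_by_format("First para.\n\nSecond para.", "pdf")
# → ["First para.", "Second para."]
```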

chunk_file(file_path: Path, source: str = '', doc_format: str = 'markdown') → list[Chunk]

Chunk a file directly (for backward compatibility).

Parameters:
  • file_path – Path to the file

  • source – Source identifier

  • doc_format – Document format

Returns:

List of chunks with metadata