
Chunker

thoth.ingestion.chunker

Document chunking for multi-format ingestion.

This module provides intelligent chunking of documents that:

- Respects document structure (headers, paragraphs, sections)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
- Supports multiple formats via DocumentChunker

Research findings and strategy:

- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: split at header/paragraph boundaries when possible
- Metadata: file path, header hierarchy, timestamps, chunk IDs, source, format
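As a rough illustration, the overlap strategy above can be sketched as a greedy paragraph accumulator. This is a simplified toy, not the module's actual implementation; `approx_tokens` mirrors the module's ~0.25 tokens-per-character estimate, and `chunk_paragraphs` is a hypothetical helper name.

```python
def approx_tokens(text: str) -> int:
    # The module estimates roughly 0.25 tokens per character.
    return int(len(text) * 0.25)


def chunk_paragraphs(
    text: str, max_tokens: int = 1000, overlap_tokens: int = 150
) -> list[str]:
    """Greedily pack paragraphs into chunks, carrying trailing paragraphs
    forward as overlap so each chunk retains some preceding context."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in text.split("\n\n"):
        para_tokens = approx_tokens(para)
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs until ~overlap_tokens of context.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                carried.insert(0, prev)
                carried_tokens += approx_tokens(prev)
                if carried_tokens >= overlap_tokens:
                    break
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note how the overlap is expressed in whole paragraphs rather than raw tokens, which keeps chunk boundaries aligned with the document's structure.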

DEFAULT_MIN_CHUNK_SIZE = 500 module-attribute

DEFAULT_MAX_CHUNK_SIZE = 1000 module-attribute

DEFAULT_OVERLAP_SIZE = 150 module-attribute

APPROX_TOKENS_PER_CHAR = 0.25 module-attribute

MSG_INVALID_FILE = 'Invalid file path: {path}' module-attribute

MSG_CHUNK_FAILED = 'Failed to chunk file: {path}' module-attribute

MSG_EMPTY_CONTENT = 'Empty content provided for chunking' module-attribute

MSG_INVALID_OVERLAP = 'Overlap size must be less than minimum chunk size' module-attribute
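The message constants above are format templates. A sketch of how the overlap constraint might be enforced follows; `validate_sizes` is a hypothetical helper, not part of the module's public API.

```python
# Message templates copied from the module-level constants above.
MSG_INVALID_OVERLAP = 'Overlap size must be less than minimum chunk size'


def validate_sizes(min_chunk_size: int, overlap_size: int) -> None:
    # Overlap must stay below the minimum chunk size so that consecutive
    # chunks always contain some non-overlapping content.
    if overlap_size >= min_chunk_size:
        raise ValueError(MSG_INVALID_OVERLAP)
```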

ChunkMetadata dataclass

Metadata for a document chunk.

chunk_id: str instance-attribute

file_path: str instance-attribute

chunk_index: int instance-attribute

total_chunks: int instance-attribute

headers: list[str] = field(default_factory=list) class-attribute instance-attribute

start_line: int = 0 class-attribute instance-attribute

end_line: int = 0 class-attribute instance-attribute

token_count: int = 0 class-attribute instance-attribute

char_count: int = 0 class-attribute instance-attribute

timestamp: str = field(default_factory=(lambda: datetime.now().astimezone().isoformat())) class-attribute instance-attribute

overlap_with_previous: bool = False class-attribute instance-attribute

overlap_with_next: bool = False class-attribute instance-attribute

source: str = '' class-attribute instance-attribute

format: str = '' class-attribute instance-attribute

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = list(), start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = (lambda: datetime.now().astimezone().isoformat())(), overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '') -> None

to_dict() -> dict[str, Any]

Convert metadata to a dict suitable for vector store metadata columns.

Ensures all values are store-compatible types (str, int, float, bool). Lists (e.g., headers) are converted to comma-separated strings.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Dict with chunk_id, file_path, chunk_index, total_chunks, headers (str), start_line, end_line, token_count, char_count, timestamp, overlap flags, source, format. |
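The list-to-string conversion described above can be sketched with a trimmed-down version of the dataclass (field set reduced for brevity; the exact separator used by the module is an assumption here).

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ChunkMetadata:
    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    headers: list[str] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # Vector stores typically index only scalar values, so the headers
        # list is flattened to a comma-separated string.
        data = asdict(self)
        data["headers"] = ", ".join(self.headers)
        return data
```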

Chunk dataclass

Represents a chunk of document content with metadata.

content: str instance-attribute

metadata: ChunkMetadata instance-attribute

__init__(content: str, metadata: ChunkMetadata) -> None

to_dict() -> dict[str, Any]

Convert chunk to dictionary.

MarkdownChunker

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

min_chunk_size = min_chunk_size instance-attribute

max_chunk_size = max_chunk_size instance-attribute

overlap_size = overlap_size instance-attribute

logger = logger or setup_logger(__name__) instance-attribute

__init__(min_chunk_size: int = DEFAULT_MIN_CHUNK_SIZE, max_chunk_size: int = DEFAULT_MAX_CHUNK_SIZE, overlap_size: int = DEFAULT_OVERLAP_SIZE, logger: logging.Logger | logging.LoggerAdapter | None = None)

Initialize the markdown chunker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `min_chunk_size` | `int` | Minimum chunk size in tokens | `DEFAULT_MIN_CHUNK_SIZE` |
| `max_chunk_size` | `int` | Maximum chunk size in tokens | `DEFAULT_MAX_CHUNK_SIZE` |
| `overlap_size` | `int` | Number of tokens to overlap between chunks | `DEFAULT_OVERLAP_SIZE` |
| `logger` | `Logger \| LoggerAdapter \| None` | Logger instance | `None` |

chunk_file(file_path: Path) -> list[Chunk]

Chunk a markdown file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_path` | `Path` | Path to the markdown file | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | List of chunks with metadata |

Raises:

| Type | Description |
| --- | --- |
| `FileNotFoundError` | If file doesn't exist |
| `ValueError` | If file is empty or invalid |
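The error contract above suggests a validation step along these lines. This is an assumed sketch reusing the module's message templates; `read_for_chunking` is a hypothetical helper, not the method's actual internals.

```python
from pathlib import Path

MSG_EMPTY_CONTENT = 'Empty content provided for chunking'
MSG_INVALID_FILE = 'Invalid file path: {path}'


def read_for_chunking(file_path: Path) -> str:
    # Missing or non-file paths raise FileNotFoundError.
    if not file_path.is_file():
        raise FileNotFoundError(MSG_INVALID_FILE.format(path=file_path))
    text = file_path.read_text(encoding="utf-8")
    # Empty or whitespace-only files raise ValueError.
    if not text.strip():
        raise ValueError(MSG_EMPTY_CONTENT)
    return text
```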

chunk_text(text: str, source_path: str = '') -> list[Chunk]

Chunk markdown text content.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Markdown text to chunk | *required* |
| `source_path` | `str` | Source file path for metadata | `''` |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | List of chunks with metadata |
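The structure-preservation idea behind markdown-aware chunking can be illustrated with a small header-tracking sketch. This is my own simplified illustration; the real chunker also enforces token bounds and overlap, and `sections_with_headers` is a hypothetical name.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def sections_with_headers(text: str) -> list[tuple[list[str], str]]:
    """Split markdown at header boundaries, pairing each body of text with
    its header breadcrumb (e.g. ['Guide', 'Setup'])."""
    stack: list[tuple[int, str]] = []  # (level, title) hierarchy
    sections: list[tuple[list[str], str]] = []
    body: list[str] = []
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            if any(s.strip() for s in body):
                sections.append(([t for _, t in stack], "\n".join(body)))
            body = []
            level = len(m.group(1))
            # Pop headers at the same or deeper level before pushing.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2)))
        else:
            body.append(line)
    if any(s.strip() for s in body):
        sections.append(([t for _, t in stack], "\n".join(body)))
    return sections
```

Carrying the header breadcrumb into each chunk's metadata is what enables the context-aware retrieval mentioned in the class description.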

DocumentChunker

Generalized document chunker for multi-format support.

This chunker uses MarkdownChunker for markdown files and provides generic paragraph-based chunking for other formats (PDF, text, docx).

Example

```python
from thoth.ingestion.parsers import ParserFactory

chunker = DocumentChunker()
parsed_doc = ParserFactory.parse(Path("document.pdf"))
chunks = chunker.chunk_document(parsed_doc, source="dnd")
```

min_chunk_size = min_chunk_size instance-attribute

max_chunk_size = max_chunk_size instance-attribute

overlap_size = overlap_size instance-attribute

logger = logger or setup_logger(__name__) instance-attribute

__init__(min_chunk_size: int = DEFAULT_MIN_CHUNK_SIZE, max_chunk_size: int = DEFAULT_MAX_CHUNK_SIZE, overlap_size: int = DEFAULT_OVERLAP_SIZE, logger: logging.Logger | logging.LoggerAdapter | None = None)

Initialize the document chunker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `min_chunk_size` | `int` | Minimum chunk size in tokens | `DEFAULT_MIN_CHUNK_SIZE` |
| `max_chunk_size` | `int` | Maximum chunk size in tokens | `DEFAULT_MAX_CHUNK_SIZE` |
| `overlap_size` | `int` | Number of tokens to overlap between chunks | `DEFAULT_OVERLAP_SIZE` |
| `logger` | `Logger \| LoggerAdapter \| None` | Logger instance | `None` |

chunk_document(content: str, source_path: str, source: str = '', doc_format: str = '') -> list[Chunk]

Chunk a document based on its format.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `content` | `str` | Document text content | *required* |
| `source_path` | `str` | Source file path for metadata | *required* |
| `source` | `str` | Source identifier (e.g., 'handbook', 'dnd') | `''` |
| `doc_format` | `str` | Document format (e.g., 'markdown', 'pdf', 'text', 'docx') | `''` |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | List of chunks with metadata including source and format |
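The dispatch described above, markdown through the structure-aware chunker and everything else through generic paragraph chunking, might look roughly like this. The helpers here are toy stand-ins, not the module's methods, and the set of format strings routed to markdown is an assumption.

```python
def _chunk_markdown(content: str) -> list[str]:
    # Toy stand-in: split at top-level headers.
    return [p for p in content.split("\n# ") if p.strip()]


def _chunk_paragraphs(content: str) -> list[str]:
    # Generic fallback for PDF / text / docx content: blank-line paragraphs.
    return [p for p in content.split("\n\n") if p.strip()]


def chunk_by_format(content: str, doc_format: str = "") -> list[str]:
    if doc_format in ("markdown", "md"):
        return _chunk_markdown(content)
    return _chunk_paragraphs(content)
```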

chunk_file(file_path: Path, source: str = '', doc_format: str = 'markdown') -> list[Chunk]

Chunk a file directly (for backward compatibility).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_path` | `Path` | Path to the file | *required* |
| `source` | `str` | Source identifier | `''` |
| `doc_format` | `str` | Document format | `'markdown'` |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | List of chunks with metadata |