thoth.ingestion.chunker

Markdown-aware chunking for handbook content.

This module provides intelligent chunking of markdown files that:
  • Respects markdown structure (headers, lists, code blocks)
  • Maintains context through overlapping chunks
  • Extracts metadata for each chunk
  • Produces appropriately sized chunks (500-1000 tokens)

Research findings and strategy:
  • Chunk size: 500-1000 tokens (balances context and granularity)
  • Overlap: 100-200 tokens (ensures context continuity)
  • Structure preservation: split at header boundaries when possible
  • Metadata: file path, header hierarchy, timestamps, chunk IDs
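A minimal end-to-end sketch of the intended workflow (the handbook path is hypothetical, and the resulting chunk count depends on the input document):

    from pathlib import Path

    from thoth.ingestion.chunker import MarkdownChunker

    chunker = MarkdownChunker()  # documented defaults: 500-1000 tokens per chunk, 150-token overlap
    chunks = chunker.chunk_file(Path("handbook/onboarding.md"))  # hypothetical file

    for chunk in chunks:
        print(chunk.metadata.chunk_id, chunk.metadata.headers, chunk.metadata.token_count)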

Classes

Chunk(content, metadata)

Represents a chunk of markdown content with metadata.

ChunkMetadata(chunk_id, file_path, ...)

Metadata for a document chunk.

MarkdownChunker([min_chunk_size, ...])

Intelligent markdown-aware chunking.

class thoth.ingestion.chunker.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)

Bases: object

Metadata for a document chunk.

chunk_id: str
file_path: str
chunk_index: int
total_chunks: int
headers: list[str]
start_line: int = 0
end_line: int = 0
token_count: int = 0
char_count: int = 0
timestamp: str
overlap_with_previous: bool = False
overlap_with_next: bool = False
to_dict() → dict[str, Any]

Convert metadata to dictionary.

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False) → None
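A minimal sketch of constructing metadata by hand and serializing it; the field values (and the chunk_id format) are illustrative, not what the chunker itself produces:

    from thoth.ingestion.chunker import ChunkMetadata

    meta = ChunkMetadata(
        chunk_id="handbook/onboarding.md#0",  # illustrative ID format
        file_path="handbook/onboarding.md",
        chunk_index=0,
        total_chunks=3,
        headers=["Onboarding", "First Week"],
        token_count=712,
        char_count=3104,
    )
    print(meta.to_dict())  # plain dict representation of the metadata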
class thoth.ingestion.chunker.Chunk(content: str, metadata: ChunkMetadata)

Bases: object

Represents a chunk of markdown content with metadata.

content: str
metadata: ChunkMetadata
to_dict() → dict[str, Any]

Convert chunk to dictionary.

__init__(content: str, metadata: ChunkMetadata) → None
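A small sketch pairing content with its metadata (values illustrative):

    from thoth.ingestion.chunker import Chunk, ChunkMetadata

    meta = ChunkMetadata(chunk_id="note.md#0", file_path="note.md", chunk_index=0, total_chunks=1)
    chunk = Chunk(content="# Setup\n\nInstall the required tools.", metadata=meta)
    print(chunk.to_dict())  # dict form of the chunk; exact keys are implementation-defined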
class thoth.ingestion.chunker.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Bases: object

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Initialize the markdown chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Optional Logger instance; defaults to None
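A sketch of a non-default configuration, e.g. smaller chunks for short pages; the specific sizes are arbitrary:

    import logging

    from thoth.ingestion.chunker import MarkdownChunker

    chunker = MarkdownChunker(
        min_chunk_size=300,   # tokens
        max_chunk_size=600,   # tokens
        overlap_size=100,     # tokens shared between neighbouring chunks
        logger=logging.getLogger("thoth.ingestion"),
    )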

chunk_file(file_path: Path) → list[Chunk]

Chunk a markdown file.

Parameters:

file_path – Path to the markdown file

Returns:

List of chunks with metadata

Raises:
chunk_text(text: str, source_path: str = '') → list[Chunk]

Chunk markdown text content.

Parameters:
  • text – Markdown text to chunk

  • source_path – Source file path for metadata

Returns:

List of chunks with metadata
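A minimal sketch of chunking in-memory markdown rather than a file; source_path is only used to populate the chunk metadata, and the path shown is hypothetical:

    from thoth.ingestion.chunker import MarkdownChunker

    text = "# Setup\n\nInstall the tools.\n\n## Editors\n\nPick one you like."
    chunks = MarkdownChunker().chunk_text(text, source_path="handbook/setup.md")

    for chunk in chunks:
        print(chunk.metadata.chunk_index, "of", chunk.metadata.total_chunks, chunk.metadata.headers)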