thoth.ingestion.chunker

Document chunking for multi-format ingestion.

This module provides intelligent chunking of documents that:

- Respects document structure (headers, paragraphs, sections)
- Maintains context through overlapping chunks
- Extracts metadata for each chunk
- Produces appropriately sized chunks (500-1000 tokens)
- Supports multiple formats via DocumentChunker

Research findings and strategy:

- Chunk size: 500-1000 tokens (balances context and granularity)
- Overlap: 100-200 tokens (ensures context continuity)
- Structure preservation: split at header/paragraph boundaries when possible
- Metadata: file path, header hierarchy, timestamps, chunk IDs, source, format
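The overlap strategy above can be illustrated with a minimal sliding-window sketch. This is not the module's actual implementation (the real chunkers also respect header and paragraph boundaries); `sliding_chunks` is a hypothetical standalone function showing only how consecutive chunks share tokens:

```python
# Illustrative sketch of overlapping token-window chunking; the real
# MarkdownChunker additionally splits at structural boundaries.

def sliding_chunks(tokens, max_size=1000, overlap=150):
    """Return lists of at most max_size tokens, overlapping by `overlap`."""
    step = max_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break
    return chunks

chunks = sliding_chunks([f"tok{i}" for i in range(2000)])
# Each pair of consecutive chunks shares 150 tokens of context.
```

Because the window advances by `max_size - overlap`, the tail of each chunk reappears at the head of the next, which is what preserves context across chunk boundaries.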

Functions

dataclass([cls, init, repr, eq, order, ...])

Add dunder methods based on the fields defined in the class.

field(*[, default, default_factory, init, ...])

Return an object to identify dataclass fields.

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

Chunk(content, metadata)

Represents a chunk of markdown content with metadata.

ChunkMetadata(chunk_id, file_path, ...)

Metadata for a document chunk.

DocumentChunker([min_chunk_size, ...])

Generalized document chunker for multi-format support.

MarkdownChunker([min_chunk_size, ...])

Intelligent markdown-aware chunking.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

datetime(year, month, day[, hour[, minute[, ...]]])

The year, month and day arguments are required.

class thoth.ingestion.chunker.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '')

Bases: object

Metadata for a document chunk.

chunk_id: str
file_path: str
chunk_index: int
total_chunks: int
headers: list[str]
start_line: int = 0
end_line: int = 0
token_count: int = 0
char_count: int = 0
timestamp: str
overlap_with_previous: bool = False
overlap_with_next: bool = False
source: str = ''
format: str = ''
to_dict() → dict[str, Any]

Convert metadata to a dict suitable for vector store metadata columns.

Ensures all values are store-compatible types (str, int, float, bool). Lists (e.g., headers) are converted to comma-separated strings.

Returns:

Dict with chunk_id, file_path, chunk_index, total_chunks, headers (str), start_line, end_line, token_count, char_count, timestamp, overlap flags, source, format.
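The list-flattening behaviour described above can be sketched as follows. `flatten_metadata` is an illustrative standalone function, not the actual `ChunkMetadata.to_dict` implementation:

```python
# Sketch of the flattening described above: list values (e.g. the
# header hierarchy) become comma-separated strings so that every
# value is a store-compatible scalar (str, int, float, bool).

def flatten_metadata(meta: dict) -> dict:
    out = {}
    for key, value in meta.items():
        if isinstance(value, list):
            out[key] = ",".join(str(v) for v in value)
        else:
            out[key] = value
    return out

flatten_metadata({"chunk_id": "c1", "headers": ["Intro", "Setup"], "token_count": 512})
# → {"chunk_id": "c1", "headers": "Intro,Setup", "token_count": 512}
```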

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False, source: str = '', format: str = '') → None
class thoth.ingestion.chunker.Chunk(content: str, metadata: ChunkMetadata)

Bases: object

Represents a chunk of markdown content with metadata.

content: str
metadata: ChunkMetadata
to_dict() → dict[str, Any]

Convert chunk to dictionary.

__init__(content: str, metadata: ChunkMetadata) → None
class thoth.ingestion.chunker.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Bases: object

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Initialize the markdown chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance
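The structure-aware splitting that this chunker performs can be sketched with a simplified header split. This is illustrative only (`split_at_headers` is a hypothetical helper, and the real chunker also enforces the token-size and overlap parameters above):

```python
import re

# Illustrative sketch of header-aware splitting: break markdown text
# immediately before each ATX header so chunks begin on section
# boundaries whenever possible.

def split_at_headers(text: str) -> list[str]:
    """Split markdown into sections, each starting at a `#`-style header."""
    # Zero-width lookahead split keeps the header line with its section.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p for p in parts if p.strip()]

doc = "# Title\nintro\n\n## Setup\nsteps\n\n## Usage\nexample\n"
sections = split_at_headers(doc)
# → three sections, starting "# Title", "## Setup", "## Usage"
```

Splitting on a lookahead rather than on the header pattern itself means the header text is retained at the start of each section, which is what lets chunk metadata record the header hierarchy.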

chunk_file(file_path: Path) → list[Chunk]

Chunk a markdown file.

Parameters:

file_path – Path to the markdown file

Returns:

List of chunks with metadata

chunk_text(text: str, source_path: str = '') → list[Chunk]

Chunk markdown text content.

Parameters:
  • text – Markdown text to chunk

  • source_path – Source file path for metadata

Returns:

List of chunks with metadata

class thoth.ingestion.chunker.DocumentChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Bases: object

Generalized document chunker for multi-format support.

This chunker uses MarkdownChunker for markdown files and provides generic paragraph-based chunking for other formats (PDF, text, docx).

Example

>>> from thoth.ingestion.parsers import ParserFactory
>>> chunker = DocumentChunker()
>>> parsed_doc = ParserFactory.parse(Path("document.pdf"))
>>> chunks = chunker.chunk_document(
...     parsed_doc.content,  # assuming the parser result exposes .content
...     source_path="document.pdf",
...     source="dnd",
...     doc_format="pdf",
... )
__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | LoggerAdapter | None = None)

Initialize the document chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance

chunk_document(content: str, source_path: str, source: str = '', doc_format: str = '') → list[Chunk]

Chunk a document based on its format.

Parameters:
  • content – Document text content

  • source_path – Source file path for metadata

  • source – Source identifier (e.g., ‘handbook’, ‘dnd’)

  • doc_format – Document format (e.g., ‘markdown’, ‘pdf’, ‘text’, ‘docx’)

Returns:

List of chunks with metadata including source and format
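The format dispatch described above (markdown to structure-aware chunking; other formats to generic paragraph-based chunking) can be sketched as follows. Both helpers are illustrative, heavily simplified stand-ins, not the actual DocumentChunker logic:

```python
import re

def paragraph_chunks(text: str) -> list[str]:
    """Generic fallback for pdf/text/docx: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_format(content: str, doc_format: str) -> list[str]:
    if doc_format == "markdown":
        # Structure-aware path (simplified here to a header split);
        # the real implementation delegates to MarkdownChunker.
        return [s for s in re.split(r"(?m)^(?=# )", content) if s.strip()]
    # All other formats fall back to paragraph-based chunking.
    return paragraph_chunks(content)

chunk_by_format("First para.\n\nSecond para.", "pdf")
# → ["First para.", "Second para."]
```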

chunk_file(file_path: Path, source: str = '', doc_format: str = 'markdown') → list[Chunk]

Chunk a file directly (for backward compatibility).

Parameters:
  • file_path – Path to the file

  • source – Source identifier

  • doc_format – Document format

Returns:

List of chunks with metadata