thoth.ingestion

Ingestion module for managing the GitLab handbook repository.

Classes

Chunk(content, metadata)

Represents a chunk of markdown content with metadata.

ChunkMetadata(chunk_id, file_path, ...)

Metadata for a document chunk.

HandbookRepoManager([repo_url, clone_path, ...])

Manages the GitLab handbook repository.

MarkdownChunker([min_chunk_size, ...])

Intelligent markdown-aware chunking.

class thoth.ingestion.Chunk(content: str, metadata: ChunkMetadata)

Bases: object

Represents a chunk of markdown content with metadata.

__init__(content: str, metadata: ChunkMetadata) → None
to_dict() → dict[str, Any]

Convert chunk to dictionary.

content: str
metadata: ChunkMetadata
class thoth.ingestion.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)

Bases: object

Metadata for a document chunk.

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False) → None
char_count: int = 0
end_line: int = 0
overlap_with_next: bool = False
overlap_with_previous: bool = False
start_line: int = 0
to_dict() → dict[str, Any]

Convert metadata to dictionary.

token_count: int = 0
chunk_id: str
file_path: str
chunk_index: int
total_chunks: int
headers: list[str]
timestamp: str
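
The `<factory>` defaults in the signature suggest that ChunkMetadata is a dataclass whose `headers` and `timestamp` fields use default factories. The sketch below mirrors the documented fields in stand-in classes so it runs without the package installed. The names `ChunkMetadataSketch` and `ChunkSketch`, the ISO 8601 timestamp format, and the nested layout returned by `to_dict()` are assumptions for illustration, not the library's source.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class ChunkMetadataSketch:
    """Stand-in mirroring the documented ChunkMetadata fields (not the real class)."""

    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    # `<factory>` in the signature: every instance gets its own fresh list.
    headers: list[str] = field(default_factory=list)
    start_line: int = 0
    end_line: int = 0
    token_count: int = 0
    char_count: int = 0
    # Assumption: the timestamp factory produces an ISO 8601 string.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    overlap_with_previous: bool = False
    overlap_with_next: bool = False

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)


@dataclass
class ChunkSketch:
    """Stand-in mirroring the documented Chunk(content, metadata) shape."""

    content: str
    metadata: ChunkMetadataSketch

    def to_dict(self) -> dict[str, Any]:
        # Assumption: the metadata is nested under its own key.
        return {"content": self.content, "metadata": self.metadata.to_dict()}


meta = ChunkMetadataSketch(
    chunk_id="handbook-0",  # hypothetical identifier
    file_path="content/handbook/_index.md",  # hypothetical path
    chunk_index=0,
    total_chunks=1,
    headers=["Handbook"],
)
chunk = ChunkSketch(content="# Handbook\n\nWelcome.", metadata=meta)
record = chunk.to_dict()
```

A plain dictionary like `record` can be passed to `json.dumps` directly, which is the usual reason for exposing a `to_dict()` method on both classes.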
class thoth.ingestion.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Bases: object

Manages the GitLab handbook repository.

__init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Initialize the repository manager.

Parameters:
  • repo_url – URL of the GitLab handbook repository

  • clone_path – Local path to clone/store the repository

  • logger – Logger instance for logging messages

clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) → Path

Clone the GitLab handbook repository.

Parameters:
  • force – If True, remove the existing repository and re-clone

  • max_retries – Maximum number of clone attempts

  • retry_delay – Delay in seconds between retries

Returns:

Path to the cloned repository

Raises:
  • RuntimeError – If the repository already exists and force=False

  • GitCommandError – If cloning fails after all retries
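
The interplay of max_retries and retry_delay can be illustrated with a generic retry loop. This is a minimal sketch of the pattern, not the method's actual implementation: `with_retries` and `flaky_clone` are hypothetical names, and the simulated failure stands in for a failed clone attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], max_retries: int = 3, retry_delay: float = 5) -> T:
    """Call `action` up to `max_retries` times, sleeping between failed attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(retry_delay)
    raise AssertionError("unreachable")


attempts: list[int] = []


def flaky_clone() -> str:
    """Hypothetical action that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("simulated network failure")
    return "cloned"


result = with_retries(flaky_clone, max_retries=3, retry_delay=0)
```

Re-raising the last error once the attempts are exhausted matches the documented behaviour of raising only after all retries fail.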

get_changed_files(since_commit: str) → list[str] | None

Get list of files changed since a specific commit.

Parameters:

since_commit – Commit SHA to compare against

Returns:

List of changed file paths, or None if an error occurs

Raises:

RuntimeError – If the repository doesn’t exist

get_current_commit() → str | None

Get the current commit SHA of the repository.

Returns:

Commit SHA as a string, or None if an error occurs

Raises:

RuntimeError – If the repository doesn’t exist

load_metadata() → dict[str, Any] | None

Load repository metadata from JSON file.

Returns:

Metadata dictionary with the keys commit_sha, clone_path, and repo_url, or None if an error occurs

save_metadata(commit_sha: str) → bool

Save repository metadata to a JSON file.

Parameters:

commit_sha – Current commit SHA to save

Returns:

True if the save succeeded, False otherwise
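
Both metadata methods report failure through their return value (None or False) instead of raising. The sketch below shows one way such a JSON round trip could look. The function names, the file name `repo_metadata.json`, and the choice of which exceptions are swallowed are assumptions; only the three dictionary keys come from the documentation above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_metadata_sketch(path: Path, commit_sha: str, clone_path: Path, repo_url: str) -> bool:
    """Write the three documented keys as JSON; report failure as False."""
    payload = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),  # Path objects are not JSON serializable
        "repo_url": repo_url,
    }
    try:
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def load_metadata_sketch(path: Path) -> Optional[dict[str, Any]]:
    """Read the metadata back; a missing or corrupt file yields None."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "repo_metadata.json"  # hypothetical file name
    missing = load_metadata_sketch(meta_file)  # nothing saved yet
    saved = save_metadata_sketch(
        meta_file,
        "abc123",  # placeholder commit SHA
        Path(tmp) / "handbook",
        "https://gitlab.com/gitlab-com/content-sites/handbook.git",
    )
    loaded = load_metadata_sketch(meta_file)
```

Callers therefore need an explicit check on the result, because a failed load or save does not interrupt control flow.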

update_repository() → bool

Update the repository by pulling latest changes.

Returns:

True if the update succeeded, False otherwise

Raises:

RuntimeError – If the repository doesn’t exist
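
Taken together, the methods above support an incremental sync: read the last saved commit, pull, list the files changed since that commit, then record the new commit. The sketch below composes only the documented method names and return types. The ordering, the helper `sync_and_list_changes`, and the `FakeManager` stand-in (used so the example runs without a real clone) are assumptions.

```python
from typing import Any, Optional, Protocol


class RepoManagerLike(Protocol):
    """The subset of the documented HandbookRepoManager interface used here."""

    def load_metadata(self) -> Optional[dict[str, Any]]: ...
    def update_repository(self) -> bool: ...
    def get_current_commit(self) -> Optional[str]: ...
    def get_changed_files(self, since_commit: str) -> Optional[list[str]]: ...
    def save_metadata(self, commit_sha: str) -> bool: ...


def sync_and_list_changes(manager: RepoManagerLike) -> Optional[list[str]]:
    """Pull, then return the files changed since the last saved commit.

    Returns None when a step reports failure or no earlier commit is recorded.
    A RuntimeError from the manager (repository missing) propagates unchanged.
    """
    previous = manager.load_metadata()
    if not previous or not previous.get("commit_sha"):
        return None
    if not manager.update_repository():
        return None
    changed = manager.get_changed_files(previous["commit_sha"])
    current = manager.get_current_commit()
    if changed is None or current is None:
        return None
    if not manager.save_metadata(current):
        return None
    return changed


class FakeManager:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self) -> None:
        self.saved_sha: Optional[str] = "aaa111"

    def load_metadata(self) -> Optional[dict[str, Any]]:
        return {"commit_sha": self.saved_sha}

    def update_repository(self) -> bool:
        return True

    def get_current_commit(self) -> Optional[str]:
        return "bbb222"

    def get_changed_files(self, since_commit: str) -> Optional[list[str]]:
        return ["content/handbook/_index.md"] if since_commit == "aaa111" else []

    def save_metadata(self, commit_sha: str) -> bool:
        self.saved_sha = commit_sha
        return True


fake = FakeManager()
changed_files = sync_and_list_changes(fake)
```

Saving the new commit only after the changed-file list has been obtained means a failed run can be retried against the same starting commit.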

class thoth.ingestion.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Bases: object

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Initialize the markdown chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance
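
To make the overlap parameter concrete, the sketch below splits a token list into fixed-size sliding windows that share overlap_size tokens with their neighbour. It is a deliberately simplified illustration: it ignores markdown structure and min_chunk_size, the function name `sliding_windows` is hypothetical, and the real chunker's splitting rules are not documented here.

```python
def sliding_windows(
    tokens: list[str], max_chunk_size: int = 1000, overlap_size: int = 150
) -> list[list[str]]:
    """Split tokens into windows of at most max_chunk_size, overlapping by overlap_size."""
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than max_chunk_size")
    step = max_chunk_size - overlap_size  # how far each window advances
    windows: list[list[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, max_chunk_size=4, overlap_size=1)
# windows[0] ends with "t3" and windows[1] starts with "t3": one shared token.
```

In this scheme every window after the first repeats the tail of its predecessor, which is the kind of relationship the overlap_with_previous and overlap_with_next metadata flags are positioned to record.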

chunk_file(file_path: Path) → list[Chunk]

Chunk a markdown file.

Parameters:

file_path – Path to the markdown file

Returns:

List of chunks with metadata

Raises:
chunk_text(text: str, source_path: str = '') → list[Chunk]

Chunk markdown text content.

Parameters:
  • text – Markdown text to chunk

  • source_path – Source file path for metadata

Returns:

List of chunks with metadata

Modules

chunker

Markdown-aware chunking for handbook content.

repo_manager

Repository manager for cloning and tracking the GitLab handbook.