Ingestion Module

The ingestion module handles repository cloning, tracking, and management for the GitLab handbook.

Overview

The ingestion module provides tools for managing a local clone of the GitLab handbook Git repository: cloning with retry logic, reading the current commit, persisting metadata, and detecting files changed since a given commit. The package also exports a markdown-aware chunker (MarkdownChunker) together with its Chunk and ChunkMetadata types.

Key Features

  • Clone repositories with automatic retry logic for reliability

  • Track commit history to monitor repository changes

  • Save and load metadata for persistent repository state

  • Detect files changed since a given commit

  • Force a re-clone (force=True) to replace an existing local copy

Example Usage

from pathlib import Path
from thoth.ingestion.repo_manager import HandbookRepoManager

# Initialize the repository manager
manager = HandbookRepoManager(
    repo_url="https://gitlab.com/gitlab-com/content-sites/handbook.git",
    clone_path=Path.home() / ".thoth" / "handbook"
)

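# Clone the repository (raises RuntimeError if it already exists and force=False)
repo_path = manager.clone_handbook()

# Get the current commit (None if an error occurs)
commit_sha = manager.get_current_commit()

# Save metadata only when a commit SHA was returned
if commit_sha:
    manager.save_metadata(commit_sha)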

# Update repository
manager.update_repository()

# Get files changed since the saved commit
metadata = manager.load_metadata()
if metadata:
    changed_files = manager.get_changed_files(metadata["commit_sha"])

Module Contents

Repository Manager

Repository manager for cloning and tracking the GitLab handbook.

class thoth.ingestion.repo_manager.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Bases: object

Manages the GitLab handbook repository.

__init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Initialize the repository manager.

Parameters:
  • repo_url – URL of the GitLab handbook repository

  • clone_path – Local path to clone/store the repository

  • logger – Logger instance for logging messages

clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) → Path

Clone the GitLab handbook repository.

Parameters:
  • force – If True, remove existing repository and re-clone

  • max_retries – Maximum number of clone attempts

  • retry_delay – Delay in seconds between retries

Returns:

Path to the cloned repository

Raises:
  • RuntimeError – If repository exists and force=False

  • GitCommandError – If cloning fails after all retries
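
The retry behaviour controlled by max_retries and retry_delay can be pictured with a small, library-independent sketch. The helper name run_with_retries and its sleep parameter are hypothetical and not part of thoth; the real method may differ in detail (for instance, it is documented to raise GitCommandError specifically).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `max_retries` times, pausing between failures.

    Hypothetical helper: it illustrates the pattern only, not thoth's code.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(retry_delay)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the sketch easy to exercise without real delays.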

update_repository() → bool

Update the repository by pulling latest changes.

Returns:

True if update successful, False otherwise

Raises:

RuntimeError – If repository doesn’t exist

get_current_commit() → str | None

Get the current commit SHA of the repository.

Returns:

Commit SHA as string, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist

save_metadata(commit_sha: str) → bool

Save repository metadata to a JSON file.

Parameters:

commit_sha – Current commit SHA to save

Returns:

True if save successful, False otherwise

load_metadata() → dict[str, Any] | None

Load repository metadata from JSON file.

Returns:

Metadata dictionary with commit_sha, clone_path, repo_url, or None if error
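
As a rough illustration of the save/load round trip, the sketch below writes the three documented keys (commit_sha, clone_path, repo_url) to a JSON file and reads them back. The standalone function signatures and the explicit path argument are assumptions made for the example; the actual methods take no path and choose the file location themselves.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def save_metadata(path: Path, commit_sha: str, clone_path: Path, repo_url: str) -> bool:
    """Write the three documented fields to `path`; return False on I/O errors."""
    data = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),
        "repo_url": repo_url,
    }
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def load_metadata(path: Path) -> Optional[Dict[str, Any]]:
    """Read the metadata back; return None if the file is missing or invalid."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
```

Returning False or None instead of raising mirrors the documented return values, so callers are expected to check the result.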

get_changed_files(since_commit: str) → list[str] | None

Get list of files changed since a specific commit.

Parameters:

since_commit – Commit SHA to compare against

Returns:

List of changed file paths, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist
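
One way to obtain such a list is to compare the saved commit with HEAD using git diff --name-only. The sketch below shells out to the git command line; this is an assumption for illustration, and the function names are hypothetical (the library may well use a Git binding rather than a subprocess).

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def parse_name_only(diff_output: str) -> List[str]:
    """Turn `git diff --name-only` output into a list of paths."""
    return [line.strip() for line in diff_output.splitlines() if line.strip()]


def changed_files_since(repo_path: Path, since_commit: str) -> Optional[List[str]]:
    """List files that differ between `since_commit` and HEAD, or None on error."""
    try:
        result = subprocess.run(
            ["git", "diff", "--name-only", since_commit, "HEAD"],
            cwd=repo_path,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Missing directory, missing git binary, or an unknown commit.
        return None
    return parse_name_only(result.stdout)
```

An empty list means no files differ, while None signals that the comparison itself failed, which matches the documented return values.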

Package Contents

Ingestion module for managing handbook repository.

class thoth.ingestion.Chunk(content: str, metadata: ChunkMetadata)

Bases: object

Represents a chunk of markdown content with metadata.

__init__(content: str, metadata: ChunkMetadata) → None

to_dict() → dict[str, Any]

Convert chunk to dictionary.

content: str
metadata: ChunkMetadata

class thoth.ingestion.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)

Bases: object

Metadata for a document chunk.

__init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False) → None

to_dict() → dict[str, Any]

Convert metadata to dictionary.

chunk_id: str
file_path: str
chunk_index: int
total_chunks: int
headers: list[str]
start_line: int = 0
end_line: int = 0
token_count: int = 0
char_count: int = 0
timestamp: str
overlap_with_previous: bool = False
overlap_with_next: bool = False
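
The <factory> defaults in the signatures above suggest that Chunk and ChunkMetadata are dataclasses. The sketch below reconstructs them on that assumption to show how the fields and to_dict() fit together. The ISO 8601 UTC timestamp format and the exact shape of the dictionaries are guesses, not documented behaviour.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


def _now_iso() -> str:
    # Assumed timestamp format; the real default factory is not documented.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ChunkMetadata:
    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    headers: List[str] = field(default_factory=list)
    start_line: int = 0
    end_line: int = 0
    token_count: int = 0
    char_count: int = 0
    timestamp: str = field(default_factory=_now_iso)
    overlap_with_previous: bool = False
    overlap_with_next: bool = False

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


@dataclass
class Chunk:
    content: str
    metadata: ChunkMetadata

    def to_dict(self) -> Dict[str, Any]:
        # Nest the metadata so the result is JSON-serializable as a whole.
        return {"content": self.content, "metadata": self.metadata.to_dict()}
```

Using default_factory for headers gives every instance its own list, which avoids the shared-mutable-default pitfall.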

class thoth.ingestion.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Bases: object

Manages the GitLab handbook repository.

__init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)

Initialize the repository manager.

Parameters:
  • repo_url – URL of the GitLab handbook repository

  • clone_path – Local path to clone/store the repository

  • logger – Logger instance for logging messages

clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) → Path

Clone the GitLab handbook repository.

Parameters:
  • force – If True, remove existing repository and re-clone

  • max_retries – Maximum number of clone attempts

  • retry_delay – Delay in seconds between retries

Returns:

Path to the cloned repository

Raises:
  • RuntimeError – If repository exists and force=False

  • GitCommandError – If cloning fails after all retries

get_changed_files(since_commit: str) → list[str] | None

Get list of files changed since a specific commit.

Parameters:

since_commit – Commit SHA to compare against

Returns:

List of changed file paths, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist

get_current_commit() → str | None

Get the current commit SHA of the repository.

Returns:

Commit SHA as string, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist

load_metadata() → dict[str, Any] | None

Load repository metadata from JSON file.

Returns:

Metadata dictionary with commit_sha, clone_path, repo_url, or None if error

save_metadata(commit_sha: str) → bool

Save repository metadata to a JSON file.

Parameters:

commit_sha – Current commit SHA to save

Returns:

True if save successful, False otherwise

update_repository() → bool

Update the repository by pulling latest changes.

Returns:

True if update successful, False otherwise

Raises:

RuntimeError – If repository doesn’t exist

class thoth.ingestion.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Bases: object

Intelligent markdown-aware chunking.

This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.

__init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)

Initialize the markdown chunker.

Parameters:
  • min_chunk_size – Minimum chunk size in tokens

  • max_chunk_size – Maximum chunk size in tokens

  • overlap_size – Number of tokens to overlap between chunks

  • logger – Logger instance
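
To make the interaction between max_chunk_size and overlap_size concrete, here is a minimal sliding-window sketch over a token list. It assumes whitespace-separated tokens and ignores min_chunk_size and markdown structure entirely, so it only shows the overlap arithmetic; window_tokens is a hypothetical name, and the real chunker's tokenizer and splitting rules are not documented here.

```python
from typing import List


def window_tokens(
    tokens: List[str],
    max_chunk_size: int = 1000,
    overlap_size: int = 150,
) -> List[List[str]]:
    """Split `tokens` into windows of at most `max_chunk_size` tokens, each
    sharing `overlap_size` tokens with the window before it."""
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be in [0, max_chunk_size)")
    step = max_chunk_size - overlap_size
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
        start += step
    return windows
```

With the defaults, each window advances by 850 tokens, so consecutive windows share 150 tokens of context.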

chunk_file(file_path: Path) → list[Chunk]

Chunk a markdown file.

Parameters:

file_path – Path to the markdown file

Returns:

List of chunks with metadata

Raises:

chunk_text(text: str, source_path: str = '') → list[Chunk]

Chunk markdown text content.

Parameters:
  • text – Markdown text to chunk

  • source_path – Source file path for metadata

Returns:

List of chunks with metadata
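
The headers, start_line, and end_line fields of ChunkMetadata hint at how a markdown-aware chunker might track structure. The sketch below is one possible approach, not the library's implementation: it splits text at ATX headings (lines starting with #), skips headings inside fenced code blocks, and records the heading trail and 1-based line range for each section. The function name split_by_headings is hypothetical.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
Section = Tuple[List[str], int, int, str]


def split_by_headings(text: str) -> List[Section]:
    """Return (header_path, start_line, end_line, body) for each section."""
    lines = text.splitlines()
    sections: List[Section] = []
    path: List[str] = []    # current heading trail, e.g. ["Guide", "Setup"]
    levels: List[int] = []  # heading level for each entry in `path`
    body: List[str] = []
    start = 1
    in_fence = False

    def flush(end: int) -> None:
        # Emit the section collected so far, unless it is empty.
        if any(line.strip() for line in body):
            sections.append((list(path), start, end, "\n".join(body).strip()))

    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush(number - 1)
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while levels and levels[-1] >= level:
                levels.pop()
                path.pop()
            levels.append(level)
            path.append(match.group(2))
            body, start = [], number
        body.append(line)
    flush(len(lines))
    return sections
```

Each section could then be windowed by token count, with the heading trail copied into the headers field of every resulting chunk.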