Ingestion Module
The ingestion module handles repository cloning, tracking, and management for the GitLab handbook.
Overview
The ingestion module provides tools for managing Git repositories, with a focus on the GitLab handbook. It includes features for cloning repositories with retry logic, tracking commit history, managing metadata, and detecting file changes between commits.
Key Features
- Clone repositories with automatic retry logic for reliability
- Track commit history to monitor repository changes
- Save and load metadata for persistent repository state
- Detect changed files between any two commits
- Force re-cloning when repository updates are needed
Example Usage
from pathlib import Path

from thoth.ingestion.repo_manager import HandbookRepoManager

# Initialize the repository manager
manager = HandbookRepoManager(
    repo_url="https://gitlab.com/gitlab-com/content-sites/handbook.git",
    clone_path=Path.home() / ".thoth" / "handbook",
)

# Clone the repository
repo_path = manager.clone_handbook()

# Get the current commit
commit_sha = manager.get_current_commit()

# Save metadata
manager.save_metadata(commit_sha)

# Update the repository
manager.update_repository()

# Get the files changed since the saved commit
metadata = manager.load_metadata()
if metadata:
    changed_files = manager.get_changed_files(metadata["commit_sha"])
Module Contents
Repository Manager
Repository manager for cloning and tracking the GitLab handbook.
- class thoth.ingestion.repo_manager.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Bases: object
Manages the GitLab handbook repository.
- __init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Initialize the repository manager.
- Parameters:
repo_url – URL of the GitLab handbook repository
clone_path – Local path to clone/store the repository
logger – Logger instance for logging messages
- clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) -> Path
Clone the GitLab handbook repository.
- Parameters:
force – If True, remove existing repository and re-clone
max_retries – Maximum number of clone attempts
retry_delay – Delay in seconds between retries
- Returns:
Path to the cloned repository
- Raises:
RuntimeError – If repository exists and force=False
GitCommandError – If cloning fails after all retries
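The retry behaviour implied by max_retries and retry_delay can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not the module's actual implementation: `retry_call` is a hypothetical helper, the delay is assumed to be fixed between attempts, and the `sleep` parameter exists only so the pause can be replaced in tests.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_call(
    operation: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation` up to `max_retries` times, pausing between failed attempts.

    Hypothetical helper: re-raises the last error once every attempt has failed.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # a real implementation would catch a narrower error
            last_error = exc
            # Only wait if another attempt is still to come.
            if attempt < max_retries:
                sleep(retry_delay)

    assert last_error is not None
    raise last_error
```

A clone call would be passed in as `operation`; catching a narrower exception type than `Exception` is usually preferable so that programming errors are not retried.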
- update_repository() -> bool
Update the repository by pulling latest changes.
- Returns:
True if update successful, False otherwise
- Raises:
RuntimeError – If repository doesn’t exist
- get_current_commit() -> str | None
Get the current commit SHA of the repository.
- Returns:
Commit SHA as string, or None if error occurs
- Raises:
RuntimeError – If repository doesn’t exist
- save_metadata(commit_sha: str) -> bool
Save repository metadata to a JSON file.
- Parameters:
commit_sha – Current commit SHA to save
- Returns:
True if save successful, False otherwise
Package Contents
Ingestion module for managing the handbook repository.
- class thoth.ingestion.Chunk(content: str, metadata: ChunkMetadata)
Bases: object
Represents a chunk of markdown content with metadata.
- __init__(content: str, metadata: ChunkMetadata) -> None
- content: str
- metadata: ChunkMetadata
- class thoth.ingestion.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)
Bases: object
Metadata for a document chunk.
- __init__(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False) -> None
- char_count: int = 0
- end_line: int = 0
- overlap_with_next: bool = False
- overlap_with_previous: bool = False
- start_line: int = 0
- token_count: int = 0
- chunk_id: str
- file_path: str
- chunk_index: int
- total_chunks: int
- timestamp: str
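The `<factory>` defaults in the signature above are how Python dataclasses display fields declared with a default factory. The stand-in below mirrors the documented fields to show what that means in practice; `ChunkMetadataSketch`, `ChunkSketch`, and the ISO 8601 UTC timestamp factory are assumptions made for illustration, not the package's own definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def _now_iso() -> str:
    """Assumed timestamp format: ISO 8601 in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ChunkMetadataSketch:
    """Illustrative stand-in with the same field names as ChunkMetadata."""

    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    # default_factory gives every instance its own list instead of a shared one.
    headers: List[str] = field(default_factory=list)
    start_line: int = 0
    end_line: int = 0
    token_count: int = 0
    char_count: int = 0
    timestamp: str = field(default_factory=_now_iso)
    overlap_with_previous: bool = False
    overlap_with_next: bool = False


@dataclass
class ChunkSketch:
    """Illustrative stand-in pairing content with its metadata."""

    content: str
    metadata: ChunkMetadataSketch


first = ChunkMetadataSketch(
    chunk_id="example-0", file_path="docs/example.md", chunk_index=0, total_chunks=2
)
second = ChunkMetadataSketch(
    chunk_id="example-1", file_path="docs/example.md", chunk_index=1, total_chunks=2
)
first.headers.append("Introduction")  # does not affect `second.headers`
chunk = ChunkSketch(content="# Introduction\n\nSome text.", metadata=first)
```

Only the four positional fields are required; everything else falls back to its default.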
- class thoth.ingestion.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Bases: object
Manages the GitLab handbook repository.
- __init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Initialize the repository manager.
- Parameters:
repo_url – URL of the GitLab handbook repository
clone_path – Local path to clone/store the repository
logger – Logger instance for logging messages
- clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) -> Path
Clone the GitLab handbook repository.
- Parameters:
force – If True, remove existing repository and re-clone
max_retries – Maximum number of clone attempts
retry_delay – Delay in seconds between retries
- Returns:
Path to the cloned repository
- Raises:
RuntimeError – If repository exists and force=False
GitCommandError – If cloning fails after all retries
- get_changed_files(since_commit: str) -> list[str] | None
Get list of files changed since a specific commit.
- Parameters:
since_commit – Commit SHA to compare against
- Returns:
List of changed file paths, or None if error occurs
- Raises:
RuntimeError – If repository doesn’t exist
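One way to picture change detection between a saved commit and the current checkout is a `git diff --name-only` call. The sketch below shells out to the git command line, which may differ from how the module does it; `changed_files_sketch` and `parse_name_only` are hypothetical names, and the error handling simply follows the documented contract (None on failure, RuntimeError when the repository is missing).

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def parse_name_only(output: str) -> List[str]:
    """Split `git diff --name-only` output into a list of non-empty paths."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def changed_files_sketch(repo_path: Path, since_commit: str) -> Optional[List[str]]:
    """Return paths changed between `since_commit` and HEAD, or None on failure."""
    if not (repo_path / ".git").exists():
        raise RuntimeError(f"Repository not found at {repo_path}")
    try:
        result = subprocess.run(
            ["git", "diff", "--name-only", since_commit, "HEAD"],
            cwd=repo_path,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except (subprocess.CalledProcessError, OSError):
        # Unknown commit, git not installed, and similar failures.
        return None
    return parse_name_only(result.stdout)
```

Keeping the output parsing in its own function makes it easy to test without a real repository.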
- get_current_commit() -> str | None
Get the current commit SHA of the repository.
- Returns:
Commit SHA as string, or None if error occurs
- Raises:
RuntimeError – If repository doesn’t exist
- load_metadata() -> dict[str, Any] | None
Load repository metadata from JSON file.
- Returns:
Metadata dictionary with the keys commit_sha, clone_path, and repo_url, or None if an error occurs
- save_metadata(commit_sha: str) -> bool
Save repository metadata to a JSON file.
- Parameters:
commit_sha – Current commit SHA to save
- Returns:
True if save successful, False otherwise
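The save and load pair amounts to a JSON round trip. The sketch below shows one plausible shape for it, using the three keys named in the load_metadata description; the function names, the explicit `metadata_file` argument, and the file layout are assumptions for illustration rather than the module's actual storage format.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def save_metadata_sketch(
    metadata_file: Path, commit_sha: str, clone_path: Path, repo_url: str
) -> bool:
    """Write repository metadata as JSON; return False instead of raising on I/O errors."""
    payload = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),  # Path objects are not JSON serializable
        "repo_url": repo_url,
    }
    try:
        metadata_file.parent.mkdir(parents=True, exist_ok=True)
        metadata_file.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def load_metadata_sketch(metadata_file: Path) -> Optional[Dict[str, Any]]:
    """Read metadata back; return None if the file is missing or not valid JSON."""
    try:
        return json.loads(metadata_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
```

Returning None on a missing file lets a first run be told apart from later runs, which matches the `if metadata:` check in the usage example.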
- update_repository() -> bool
Update the repository by pulling latest changes.
- Returns:
True if update successful, False otherwise
- Raises:
RuntimeError – If repository doesn’t exist
- class thoth.ingestion.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Bases: object
Intelligent markdown-aware chunking.
This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.
- __init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Initialize the markdown chunker.
- Parameters:
min_chunk_size – Minimum chunk size in tokens
max_chunk_size – Maximum chunk size in tokens
overlap_size – Number of tokens to overlap between chunks
logger – Logger instance
- chunk_file(file_path: Path) -> list[Chunk]
Chunk a markdown file.
- Parameters:
file_path – Path to the markdown file
- Returns:
List of chunks with metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file is empty or invalid
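To illustrate how max_chunk_size and overlap_size interact, the sketch below splits a token list into overlapping windows. It is a deliberately simplified model: `split_with_overlap` is a hypothetical function, it ignores markdown structure and min_chunk_size entirely, and it assumes tokens are already available as a list of strings, whereas the real chunker is described as markdown-aware.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], max_chunk_size: int = 1000, overlap_size: int = 150
) -> List[List[str]]:
    """Split `tokens` into windows of at most `max_chunk_size` tokens.

    Each window after the first repeats the last `overlap_size` tokens
    of the previous window.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap_size < max_chunk_size:
        raise ValueError("overlap_size must be in [0, max_chunk_size)")

    step = max_chunk_size - overlap_size  # how far each window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
        start += step
    return chunks
```

With the documented defaults, each window would advance by 850 tokens, so consecutive chunks share 150 tokens of context.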