thoth.ingestion
Ingestion module for managing the GitLab handbook repository and chunking its markdown content.
Classes
Chunk | Represents a chunk of markdown content with metadata.
ChunkMetadata | Metadata for a document chunk.
HandbookRepoManager | Manages the GitLab handbook repository.
MarkdownChunker | Intelligent markdown-aware chunking.
- class thoth.ingestion.Chunk(content: str, metadata: ChunkMetadata)
Bases: object
Represents a chunk of markdown content with metadata.
- __init__(content: str, metadata: ChunkMetadata) → None
- metadata: ChunkMetadata
- class thoth.ingestion.ChunkMetadata(chunk_id: str, file_path: str, chunk_index: int, total_chunks: int, headers: list[str] = <factory>, start_line: int = 0, end_line: int = 0, token_count: int = 0, char_count: int = 0, timestamp: str = <factory>, overlap_with_previous: bool = False, overlap_with_next: bool = False)
Bases: object
Metadata for a document chunk.
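The `<factory>` defaults in the signature above suggest that both classes are dataclasses. The following is a minimal sketch under that assumption, not the library's source: the names `SketchChunkMetadata` and `SketchChunk` are hypothetical stand-ins, and the timestamp format is a guess, since the real factory is not documented here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SketchChunkMetadata:
    """Hypothetical mirror of the documented ChunkMetadata fields."""

    chunk_id: str
    file_path: str
    chunk_index: int
    total_chunks: int
    # Mutable default, so a factory is required (shown as <factory> in the docs).
    headers: list[str] = field(default_factory=list)
    start_line: int = 0
    end_line: int = 0
    token_count: int = 0
    char_count: int = 0
    # Assumption: an ISO 8601 UTC timestamp; the real format is not documented.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    overlap_with_previous: bool = False
    overlap_with_next: bool = False


@dataclass
class SketchChunk:
    """Hypothetical mirror of Chunk: content plus its metadata."""

    content: str
    metadata: SketchChunkMetadata


meta = SketchChunkMetadata(
    chunk_id="example-0",
    file_path="handbook/example.md",
    chunk_index=0,
    total_chunks=1,
)
chunk = SketchChunk(content="# Example\n\nBody text.", metadata=meta)
```

Only the first four metadata fields are required; the rest fall back to the documented defaults.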
- class thoth.ingestion.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Bases: object
Manages the GitLab handbook repository.
- __init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | None = None)
Initialize the repository manager.
- Parameters:
repo_url – URL of the GitLab handbook repository
clone_path – Local path to clone/store the repository
logger – Logger instance for logging messages
- clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5) → Path
Clone the GitLab handbook repository.
- Parameters:
force – If True, remove existing repository and re-clone
max_retries – Maximum number of clone attempts
retry_delay – Delay in seconds between retries
- Returns:
Path to the cloned repository
- Raises:
RuntimeError – If repository exists and force=False
GitCommandError – If cloning fails after all retries
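The `max_retries` and `retry_delay` parameters describe a fixed-delay retry loop. A minimal sketch of that pattern follows. It is an assumption about the general technique, not the library's implementation: `clone_with_retries`, `clone_once`, and the injectable `sleep` argument are hypothetical names introduced for illustration.

```python
import time
from collections.abc import Callable
from pathlib import Path


def clone_with_retries(
    clone_once: Callable[[], Path],
    max_retries: int = 3,
    retry_delay: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Call clone_once up to max_retries times, pausing between failures."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return clone_once()
        except Exception:
            if attempt == max_retries:
                # Out of attempts: propagate the last error to the caller.
                raise
            sleep(retry_delay)
    raise AssertionError("unreachable")


# Demo with a simulated clone that fails twice, then succeeds.
attempts = []
delays = []


def flaky_clone() -> Path:
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("simulated network failure")
    return Path("handbook")


result = clone_with_retries(
    flaky_clone, max_retries=3, retry_delay=5, sleep=delays.append
)
```

Passing `sleep` as an argument keeps the demo fast: the delays are recorded instead of waited out.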
- get_changed_files(since_commit: str) → list[str] | None
Get the list of files changed since a specific commit.
- Parameters:
since_commit – Commit SHA to compare against
- Returns:
List of changed file paths, or None if an error occurs
- Raises:
RuntimeError – If repository doesn’t exist
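One common way to list files changed since a commit is `git diff --name-only <sha> HEAD`. The sketch below uses that command through `subprocess`; whether the library does the same is not stated here, so treat `changed_files_since` and `parse_name_only` as hypothetical helpers that only imitate the documented return and error behavior.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def parse_name_only(output: str) -> list[str]:
    """Split `git diff --name-only` output into one path per entry."""
    return [line for line in output.splitlines() if line.strip()]


def changed_files_since(repo_path: Path, since_commit: str) -> list[str] | None:
    """Return paths changed between since_commit and HEAD, or None on error."""
    if not (repo_path / ".git").exists():
        # Mirrors the documented RuntimeError for a missing repository.
        raise RuntimeError(f"Repository not found at {repo_path}")
    try:
        completed = subprocess.run(
            ["git", "-C", str(repo_path), "diff", "--name-only", since_commit, "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Unknown commit, git failure, or git not installed.
        return None
    return parse_name_only(completed.stdout)


sample_paths = parse_name_only("content/a.md\ncontent/b/c.md\n")
```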
- get_current_commit() → str | None
Get the current commit SHA of the repository.
- Returns:
Commit SHA as a string, or None if an error occurs
- Raises:
RuntimeError – If repository doesn’t exist
- load_metadata() → dict[str, Any] | None
Load repository metadata from a JSON file.
- Returns:
Metadata dictionary with the keys commit_sha, clone_path, and repo_url, or None if an error occurs
- save_metadata(commit_sha: str) → bool
Save repository metadata to a JSON file.
- Parameters:
commit_sha – Current commit SHA to save
- Returns:
True if the save succeeds, False otherwise
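The metadata round trip described by `load_metadata` and `save_metadata` can be sketched with the standard `json` module. This is an illustration under assumptions: the helper names, the explicit `metadata_file` argument, and the file name `repo_metadata.json` are hypothetical, and only the three keys named above come from the documentation.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def save_repo_metadata(
    metadata_file: Path, commit_sha: str, clone_path: Path, repo_url: str
) -> bool:
    """Write the three documented keys as JSON; return False on I/O failure."""
    payload = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),
        "repo_url": repo_url,
    }
    try:
        metadata_file.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def load_repo_metadata(metadata_file: Path) -> dict[str, Any] | None:
    """Read the JSON file back; return None if it is missing or malformed."""
    try:
        return json.loads(metadata_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "repo_metadata.json"  # hypothetical file name
    saved = save_repo_metadata(
        path,
        "abc123",
        Path(tmp) / "handbook",
        "https://gitlab.com/gitlab-com/content-sites/handbook.git",
    )
    loaded = load_repo_metadata(path)
    missing = load_repo_metadata(Path(tmp) / "missing.json")
```

Returning `False` or `None` instead of raising matches the documented return values, which lets a caller treat a missing metadata file as a signal to do a full re-ingestion.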
- update_repository() → bool
Update the repository by pulling the latest changes.
- Returns:
True if the update succeeds, False otherwise
- Raises:
RuntimeError – If repository doesn’t exist
- class thoth.ingestion.MarkdownChunker(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Bases: object
Intelligent markdown-aware chunking.
This chunker respects markdown structure and maintains context through overlapping chunks. It extracts metadata for each chunk to enable efficient retrieval and context-aware processing.
- __init__(min_chunk_size: int = 500, max_chunk_size: int = 1000, overlap_size: int = 150, logger: Logger | None = None)
Initialize the markdown chunker.
- Parameters:
min_chunk_size – Minimum chunk size in tokens
max_chunk_size – Maximum chunk size in tokens
overlap_size – Number of tokens to overlap between chunks
logger – Logger instance
- chunk_file(file_path: Path) → list[Chunk]
Chunk a markdown file.
- Parameters:
file_path – Path to the markdown file
- Returns:
List of chunks with metadata
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If file is empty or invalid
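To illustrate how `max_chunk_size` and `overlap_size` interact, here is a deliberately simplified sliding-window sketch. It is not the chunker's algorithm: it ignores markdown structure and `min_chunk_size`, and it treats each list element as one token. The function name `overlapping_windows` is hypothetical.

```python
def overlapping_windows(
    tokens: list[str], max_chunk_size: int = 1000, overlap_size: int = 150
) -> list[list[str]]:
    """Split tokens into windows that share overlap_size tokens with their neighbor."""
    if overlap_size < 0 or max_chunk_size <= overlap_size:
        raise ValueError("max_chunk_size must be greater than overlap_size")
    step = max_chunk_size - overlap_size
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            # This window already reaches the end of the input.
            break
        start += step
    return windows


# Ten tokens, windows of four, one token shared between neighbors.
windows = overlapping_windows(
    [str(i) for i in range(10)], max_chunk_size=4, overlap_size=1
)
```

With the defaults, each window would advance by 850 tokens, so consecutive chunks share their last and first 150 tokens.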
Modules
Markdown-aware chunking for handbook content.
Repository manager for cloning and tracking the GitLab handbook.