thoth.ingestion.pipeline
Ingestion pipeline orchestrator for Thoth.
This module provides the main pipeline coordinator that integrates all ingestion components (repo manager, chunker, embedder, vector store) into a complete end-to-end ingestion workflow with progress tracking, error handling, and resume logic.
Functions
| dataclass | Add dunder methods based on the fields defined in the class. |
| field | Return an object to identify dataclass fields. |
Classes
| Any | Special type indicating an unconstrained type. |
| | Represents a chunk of markdown content with metadata. |
| Embedder | Generate embeddings from text using sentence-transformers. |
| HandbookRepoManager | Manages the GitLab handbook repository. |
| IngestionPipeline | Orchestrates the complete ingestion pipeline. |
| MarkdownChunker | Intelligent markdown-aware chunking. |
| Path | PurePath subclass that can make system calls. |
| PipelineState | Tracks the state of the ingestion pipeline. |
| PipelineStats | Statistics from pipeline execution. |
| VectorStore | Vector store for managing document embeddings using ChromaDB. |
| datetime | The year, month and day arguments are required. |
| timezone | Fixed offset from UTC implementation of tzinfo. |
- class thoth.ingestion.pipeline.PipelineState(last_commit: str | None = None, processed_files: list[str] = <factory>, failed_files: dict[str, str] = <factory>, total_chunks: int = 0, total_documents: int = 0, start_time: str | None = None, last_update_time: str | None = None, completed: bool = False)
Bases: object
Tracks the state of the ingestion pipeline.
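The page does not show how the state is persisted, so the following is only a minimal sketch of one way a state object with these fields could be saved to and restored from a JSON state file to support resuming. `PipelineStateSketch`, `save_state`, and `load_state` are hypothetical names introduced for illustration; they are not part of the documented API.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class PipelineStateSketch:
    """Hypothetical stand-in mirroring the documented PipelineState fields."""

    last_commit: Optional[str] = None
    processed_files: list[str] = field(default_factory=list)
    failed_files: dict[str, str] = field(default_factory=dict)
    total_chunks: int = 0
    total_documents: int = 0
    start_time: Optional[str] = None
    last_update_time: Optional[str] = None
    completed: bool = False


def save_state(state: PipelineStateSketch, path: Path) -> None:
    """Write the state as JSON (assumed format, not confirmed by the source)."""
    path.write_text(json.dumps(asdict(state), indent=2), encoding="utf-8")


def load_state(path: Path) -> PipelineStateSketch:
    """Return the saved state, or a fresh one if no state file exists yet."""
    if not path.exists():
        return PipelineStateSketch()
    return PipelineStateSketch(**json.loads(path.read_text(encoding="utf-8")))
```

Because every field is a plain string, number, boolean, list, or dict, the round trip through JSON is lossless, which is one reason timestamps are convenient to keep as strings here.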
- class thoth.ingestion.pipeline.PipelineStats(total_files: int, processed_files: int, failed_files: int, total_chunks: int, total_documents: int, duration_seconds: float, chunks_per_second: float, files_per_second: float)
Bases: object
Statistics from pipeline execution.
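How the two rate fields are derived is not stated on this page. A plausible reading, shown here purely as an assumption, is that each rate is a count divided by `duration_seconds`. `PipelineStatsSketch` and `build_stats` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class PipelineStatsSketch:
    """Hypothetical stand-in mirroring the documented PipelineStats fields."""

    total_files: int
    processed_files: int
    failed_files: int
    total_chunks: int
    total_documents: int
    duration_seconds: float
    chunks_per_second: float
    files_per_second: float


def build_stats(
    total_files: int,
    processed_files: int,
    failed_files: int,
    total_chunks: int,
    total_documents: int,
    duration_seconds: float,
) -> PipelineStatsSketch:
    """Derive throughput rates, assuming rate = count / elapsed seconds."""
    if duration_seconds > 0:
        chunks_per_second = total_chunks / duration_seconds
        files_per_second = processed_files / duration_seconds
    else:
        # Avoid dividing by zero for runs that finish instantly.
        chunks_per_second = files_per_second = 0.0
    return PipelineStatsSketch(
        total_files=total_files,
        processed_files=processed_files,
        failed_files=failed_files,
        total_chunks=total_chunks,
        total_documents=total_documents,
        duration_seconds=duration_seconds,
        chunks_per_second=chunks_per_second,
        files_per_second=files_per_second,
    )
```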
- class thoth.ingestion.pipeline.IngestionPipeline(repo_manager: HandbookRepoManager | None = None, chunker: MarkdownChunker | None = None, embedder: Embedder | None = None, vector_store: VectorStore | None = None, state_file: Path | None = None, batch_size: int = 50, logger_instance: Logger | None = None)
Bases: object
Orchestrates the complete ingestion pipeline.
This class coordinates:
1. Repository cloning/updating
2. Markdown file discovery
3. Document chunking
4. Embedding generation
5. Vector store insertion
With features:
- Progress tracking and reporting
- Resume capability from interruptions
- Error handling and logging
- Batch processing for efficiency
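The chunk, embed, and store stages, together with batching, resume, and per-file error handling, can be sketched roughly as follows. This is not the library's implementation: `batched`, `ingest`, and the `chunk`, `embed`, and `store` callables are hypothetical stand-ins for the real components, and the repository and discovery stages are replaced by an in-memory mapping of path to text.

```python
from typing import Callable, Iterable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest(
    files: dict[str, str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[list[float]]],
    store: Callable[[list[str], list[list[float]]], None],
    batch_size: int = 50,
    already_processed: Iterable[str] = (),
) -> tuple[list[str], dict[str, str]]:
    """Process files in batches; return (processed paths, failures by path)."""
    processed = list(already_processed)
    done = set(processed)
    failed: dict[str, str] = {}
    # Resume: skip anything recorded as processed by an earlier run.
    pending = [path for path in files if path not in done]
    for batch in batched(pending, batch_size):
        batch_chunks: list[str] = []
        batch_paths: list[str] = []
        for path in batch:
            try:
                chunks = chunk(files[path])
            except Exception as exc:
                # Record the failure and continue with the next file.
                failed[path] = str(exc)
                continue
            batch_chunks.extend(chunks)
            batch_paths.append(path)
        if batch_chunks:
            # One embed call and one store call per batch, not per file.
            store(batch_chunks, embed(batch_chunks))
        processed.extend(batch_paths)
    return processed, failed
```

Grouping the embedding and insertion calls per batch is the design choice that makes `batch_size` matter: larger batches mean fewer calls, at the cost of more work lost if a batch is interrupted.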
- __init__(repo_manager: HandbookRepoManager | None = None, chunker: MarkdownChunker | None = None, embedder: Embedder | None = None, vector_store: VectorStore | None = None, state_file: Path | None = None, batch_size: int = 50, logger_instance: Logger | None = None)
Initialize the ingestion pipeline.
- Parameters:
repo_manager – Repository manager instance (creates default if None)
chunker – Markdown chunker instance (creates default if None)
embedder – Embedder instance (creates default if None)
vector_store – Vector store instance (creates default if None)
state_file – Path to state file for resume capability
batch_size – Number of files to process in each batch
logger_instance – Logger instance for logging
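The "creates default if None" behaviour described for the component parameters is a common dependency-injection pattern. The sketch below illustrates it with hypothetical classes (`PipelineSketch`, `DefaultChunkerSketch`, `InMemoryStoreSketch`); the real constructor may resolve its defaults differently.

```python
class DefaultChunkerSketch:
    """Hypothetical default component created when none is supplied."""


class InMemoryStoreSketch:
    """Hypothetical store whose emptiness makes it falsy via __len__."""

    def __init__(self) -> None:
        self.items: list[str] = []

    def __len__(self) -> int:
        return len(self.items)


class PipelineSketch:
    """Hypothetical constructor showing the None-means-default pattern."""

    def __init__(self, chunker=None, vector_store=None, batch_size: int = 50):
        # Compare against None explicitly: `vector_store or Default()` would
        # silently discard a valid but empty (falsy) store passed by the caller.
        self.chunker = chunker if chunker is not None else DefaultChunkerSketch()
        self.vector_store = (
            vector_store if vector_store is not None else InMemoryStoreSketch()
        )
        self.batch_size = batch_size
```

Accepting each component as an optional argument also makes it straightforward to pass lightweight fakes in tests instead of a real embedder or vector store.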
- run(force_reclone: bool = False, incremental: bool = True, progress_callback: Callable[[int, int, str], None] | None = None) -> PipelineStats
Run the complete ingestion pipeline.
- Parameters:
force_reclone – If True, remove and re-clone the repository
incremental – If True, only process changed files (requires previous state)
progress_callback – Optional callback(current, total, status_msg) for progress
- Returns:
PipelineStats with execution statistics
- Raises:
RuntimeError – If pipeline fails
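A callback matching the documented `Callable[[int, int, str], None]` shape might look like the sketch below. `format_progress`, `record_progress`, and `run_stand_in` are hypothetical; `run_stand_in` only imitates how a pipeline could invoke the callback, so that the example runs without the real components.

```python
from typing import Callable, Optional

ProgressCallback = Callable[[int, int, str], None]


def format_progress(current: int, total: int, status_msg: str) -> str:
    """Render one progress line from the three callback arguments."""
    percent = 100.0 * current / total if total else 0.0
    return f"[{current}/{total}] {percent:.1f}% {status_msg}"


events: list[str] = []


def record_progress(current: int, total: int, status_msg: str) -> None:
    """Example callback: collect formatted lines (could print or log instead)."""
    events.append(format_progress(current, total, status_msg))


def run_stand_in(
    files: list[str], progress_callback: Optional[ProgressCallback] = None
) -> int:
    """Hypothetical stand-in for run(): reports progress once per file."""
    if not files:
        # Mirrors the documented contract that failures surface as RuntimeError.
        raise RuntimeError("no files to process")
    for index, path in enumerate(files, start=1):
        if progress_callback is not None:
            progress_callback(index, len(files), f"Processed {path}")
    return len(files)
```

With the real class, the same callback would be passed as `progress_callback=record_progress` to `run()`, and the call wrapped in `try`/`except RuntimeError` to handle a failed run.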