thoth.ingestion.gcs_repo_sync
GCS-based repository synchronization for Cloud Run.
This module handles syncing the GitLab handbook between GCS and local storage:
1. Clone the repository to GCS (once, or on updates).
2. Sync from GCS to local /tmp on Cloud Run startup.
Functions
|
Create and configure a logger with structured JSON output. |
Classes
Any | Special type indicating an unconstrained type.
GCSRepoSync | Manages repository synchronization between GCS and local storage.
Path | PurePath subclass that can make system calls.
Repo | Represents a git repository and allows you to query references, create commit information, generate diffs, create and clone repositories, and query the log.
- class thoth.ingestion.gcs_repo_sync.GCSRepoSync(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Bases: object
Manages repository synchronization between GCS and local storage.
- __init__(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Initialize GCS repository sync.
- Parameters:
bucket_name – GCS bucket name
repo_url – Git repository URL
gcs_prefix – Prefix/folder in GCS bucket for repository files
local_path – Local path to sync to (defaults to /tmp/handbook)
logger_instance – Optional logger instance to use.
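As an illustration of the constructor parameters above, here is a minimal sketch that collects them from environment variables before creating the object. The environment variable names (`HANDBOOK_BUCKET`, `HANDBOOK_REPO_URL`, `HANDBOOK_GCS_PREFIX`, `HANDBOOK_LOCAL_PATH`) and both helper functions are hypothetical and not part of this module; only the `GCSRepoSync` keyword arguments and their documented defaults come from the reference above.

```python
import os
from pathlib import Path
from typing import Any, Mapping, Optional


def sync_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict[str, Any]:
    """Build GCSRepoSync keyword arguments from (hypothetical) environment variables."""
    env = os.environ if env is None else env
    return {
        # Required settings: a missing key raises KeyError early.
        "bucket_name": env["HANDBOOK_BUCKET"],
        "repo_url": env["HANDBOOK_REPO_URL"],
        # Optional settings fall back to the documented defaults.
        "gcs_prefix": env.get("HANDBOOK_GCS_PREFIX", "handbook"),
        "local_path": Path(env.get("HANDBOOK_LOCAL_PATH", "/tmp/handbook")),
    }


def build_sync(env: Optional[Mapping[str, str]] = None):
    """Create a GCSRepoSync instance; the import is deferred so the helper above stays testable."""
    from thoth.ingestion.gcs_repo_sync import GCSRepoSync

    return GCSRepoSync(**sync_settings_from_env(env))
```

Keeping the settings lookup separate from object creation means the configuration can be checked without GCS credentials.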
- clone_to_gcs(force: bool = False) → dict[str, Any]
Clone repository and upload to GCS.
This should be run once initially, or when you want to refresh the repository in GCS.
- Parameters:
force – If True, re-clone even if files exist in GCS
- Returns:
Dictionary with stats about the clone operation
- sync_to_local(force: bool = False) → dict[str, Any]
Sync repository from GCS to local storage.
This is called on Cloud Run startup to get the latest repository files.
- Parameters:
force – If True, delete and re-download even if local files exist
- Returns:
Dictionary with stats about the sync operation
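The two phases described above (an occasional refresh of GCS, then a sync to local storage at startup) could be wired together roughly as follows. This is a sketch under stated assumptions: `prepare_handbook` and the `RepoSync` protocol are hypothetical names introduced for illustration, and the contents of the returned stats dictionaries are not documented here, so the code passes them through untouched.

```python
from typing import Any, Protocol


class RepoSync(Protocol):
    """Structural type matching the two documented GCSRepoSync methods."""

    def clone_to_gcs(self, force: bool = False) -> dict[str, Any]: ...

    def sync_to_local(self, force: bool = False) -> dict[str, Any]: ...


def prepare_handbook(
    sync: RepoSync,
    refresh_gcs: bool = False,
    force_local: bool = False,
) -> dict[str, dict[str, Any]]:
    """Optionally refresh the copy in GCS, then sync it to local storage."""
    results: dict[str, dict[str, Any]] = {}
    if refresh_gcs:
        # Re-clone even if files already exist in GCS.
        results["clone"] = sync.clone_to_gcs(force=True)
    # Always make sure local storage is populated before ingestion starts.
    results["sync"] = sync.sync_to_local(force=force_local)
    return results
```

On a typical Cloud Run start, `prepare_handbook(sync)` would only call `sync_to_local()`; passing `refresh_gcs=True` is meant for the rarer case where the copy in GCS should be rebuilt first.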
- list_files_in_gcs() → list[str]
List all files in GCS without downloading.
- Returns:
List of relative file paths (e.g., ['docs/setup.md', 'api/README.md'])
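To make "relative file paths" concrete, the sketch below shows one way object names in a bucket might map to such paths by stripping the `gcs_prefix`. It is not taken from the module's source: `to_relative_paths` is a hypothetical helper, and the assumption that objects are stored as `<gcs_prefix>/<relative path>` is an inference from the constructor parameters.

```python
from typing import Iterable


def to_relative_paths(blob_names: Iterable[str], gcs_prefix: str = "handbook") -> list[str]:
    """Strip the bucket prefix from object names, skipping anything outside it."""
    root = gcs_prefix.strip("/") + "/"
    return sorted(
        name[len(root):]
        for name in blob_names
        # Keep objects under the prefix; skip folder placeholders ending in "/".
        if name.startswith(root) and not name.endswith("/")
    )
```

For example, `to_relative_paths(["handbook/docs/setup.md", "handbook/api/README.md"])` returns `["api/README.md", "docs/setup.md"]`.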
- download_batch_files(file_list: list[str]) → Path
Download only specific files for a batch from GCS.
Uses parallel downloads (ThreadPoolExecutor) for faster performance.
- Parameters:
file_list – List of relative file paths to download
- Returns:
Path to local directory with downloaded files
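The parallel download pattern mentioned above can be sketched with `ThreadPoolExecutor` as follows. This is an illustration of the technique, not the module's implementation: `download_batch`, the injected `fetch` callable, and the `max_workers` default are assumptions. In a real setup `fetch` would wrap a GCS client call; here it is any function that returns the bytes for a relative path.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def download_batch(
    file_list: Iterable[str],
    fetch: Callable[[str], bytes],
    dest: Path,
    max_workers: int = 8,
) -> Path:
    """Fetch each relative path in parallel and write it under dest."""

    def _download_one(rel_path: str) -> None:
        target = dest / rel_path
        # Create intermediate directories; safe if another thread got there first.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(rel_path))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any worker exception is re-raised here.
        list(pool.map(_download_one, file_list))
    return dest
```

Threads suit this workload because each download spends most of its time waiting on the network rather than on the CPU.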