GCS repo sync
thoth.ingestion.gcs_repo_sync
GCS-based repository synchronization for Cloud Run.
This module syncs the GitLab handbook between GCS and local storage:
1. Clone the repository to GCS (once, or on updates).
2. Sync from GCS to local /tmp on Cloud Run startup.
logger = setup_logger(__name__)
module-attribute
GCSRepoSync
Manages repository synchronization between GCS and local storage.
bucket_name = bucket_name
instance-attribute
repo_url = repo_url
instance-attribute
gcs_prefix = gcs_prefix.strip('/')
instance-attribute
local_path = local_path or Path('/tmp/handbook')
instance-attribute
storage_client = storage.Client()
instance-attribute
bucket = self.storage_client.bucket(bucket_name)
instance-attribute
logger = logger_instance or logger
instance-attribute
__init__(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)
Initialize GCS repository sync.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bucket_name | str | GCS bucket name | required |
| repo_url | str | Git repository URL | required |
| gcs_prefix | str | Prefix/folder in GCS bucket for repository files | 'handbook' |
| local_path | Path \| None | Local path to sync to (defaults to /tmp/handbook) | None |
| logger_instance | Logger \| LoggerAdapter \| None | Optional logger instance to use. | None |
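The attribute listing shows two normalizations applied to the constructor arguments: `gcs_prefix.strip('/')` and `local_path or Path('/tmp/handbook')`. The following is a minimal sketch of just those two expressions in isolation. The helper name `normalize_sync_settings` is hypothetical and is not part of the module.

```python
from __future__ import annotations

from pathlib import Path


def normalize_sync_settings(
    gcs_prefix: str = "handbook",
    local_path: Path | None = None,
) -> tuple[str, Path]:
    """Hypothetical helper mirroring the two documented attribute assignments."""
    # Leading and trailing slashes are stripped from the prefix.
    prefix = gcs_prefix.strip("/")
    # A missing local path falls back to /tmp/handbook.
    path = local_path or Path("/tmp/handbook")
    return prefix, path


print(normalize_sync_settings("/handbook/"))
print(normalize_sync_settings("docs/handbook/", Path("/data/handbook")))
```

Only the outer slashes are removed, so a nested prefix such as `docs/handbook/` keeps its inner separator.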
clone_to_gcs(force: bool = False) -> dict[str, Any]
Clone repository and upload to GCS.
Run this once initially, and again whenever the repository in GCS needs refreshing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| force | bool | If True, re-clone even if files exist in GCS | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with stats about the clone operation |
sync_to_local(force: bool = False) -> dict[str, Any]
Sync repository from GCS to local storage.
This is called on Cloud Run startup to get the latest repository files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| force | bool | If True, delete and re-download even if local files exist | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with stats about the sync operation |
list_files_in_gcs() -> list[str]
List all files in GCS without downloading.
Returns:
| Type | Description |
|---|---|
| list[str] | List of relative file paths (e.g., ['docs/setup.md', 'api/README.md']) |
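To make "relative file paths" concrete, the sketch below turns full object names into paths relative to the prefix. It assumes objects are stored under names of the form `<gcs_prefix>/<relative path>`, which the documentation implies but does not state. The helper `to_relative_paths` is hypothetical and works on plain strings, so it runs without GCS access.

```python
from __future__ import annotations


def to_relative_paths(object_names: list[str], gcs_prefix: str) -> list[str]:
    """Strip the prefix from object names, dropping anything outside it."""
    stripped = gcs_prefix.strip("/")
    prefix = f"{stripped}/" if stripped else ""
    relative = []
    for name in object_names:
        if not name.startswith(prefix):
            continue  # object lives outside the repository prefix
        rel = name[len(prefix):]
        # Skip empty names and "directory placeholder" objects ending in "/".
        if rel and not rel.endswith("/"):
            relative.append(rel)
    return relative


names = ["handbook/docs/setup.md", "handbook/api/README.md", "handbook/", "other/x.md"]
print(to_relative_paths(names, "handbook"))
```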
download_batch_files(file_list: list[str]) -> Path
Download only specific files for a batch from GCS.
Uses parallel downloads (ThreadPoolExecutor) for faster performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_list | list[str] | List of relative file paths to download | required |
Returns:
| Type | Description |
|---|---|
| Path | Path to local directory with downloaded files |
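The parallel-download pattern mentioned above can be sketched with `ThreadPoolExecutor` from the standard library. This is an illustration of the technique, not the method's actual code: the `fetch` callable is a hypothetical stand-in for whatever reads an object's bytes from GCS, and `max_workers=8` is an arbitrary choice.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def download_batch(
    file_list: list[str],
    fetch: Callable[[str], bytes],
    dest_root: Path,
    max_workers: int = 8,
) -> Path:
    """Write each file in `file_list` under `dest_root` using a thread pool."""

    def download_one(rel_path: str) -> None:
        target = dest_root / rel_path
        # Create parent directories; exist_ok tolerates concurrent creation.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(rel_path))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the result iterator so a worker exception is re-raised here.
        list(pool.map(download_one, file_list))
    return dest_root
```

Threads suit this workload because each download spends most of its time waiting on I/O rather than on the CPU.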
get_local_path() -> Path
Get the local path where the repository is synced.
Returns:
| Type | Description |
|---|---|
| Path | Path to local repository |
is_synced() -> bool
Check whether the repository is synced locally.
Returns:
| Type | Description |
|---|---|
| bool | True if local repository exists and sync is complete |
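The documentation does not say how "sync is complete" is tracked. One common approach, shown here purely as an assumption, is a marker file written as the last step of a successful sync. The marker name `.sync_complete` and both function names are hypothetical.

```python
from pathlib import Path

MARKER_NAME = ".sync_complete"  # hypothetical marker file name


def mark_synced(local_path: Path) -> None:
    """Record a completed sync by creating the marker file."""
    local_path.mkdir(parents=True, exist_ok=True)
    (local_path / MARKER_NAME).touch()


def check_synced(local_path: Path) -> bool:
    """Return True only if the directory exists and the marker is present."""
    return local_path.is_dir() and (local_path / MARKER_NAME).is_file()
```

With this design, a sync interrupted partway through leaves the directory in place but without the marker, so the check reports False.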