GCS repo sync

thoth.ingestion.gcs_repo_sync

GCS-based repository synchronization for Cloud Run.

This module handles syncing the GitLab handbook between GCS and local storage:

1. Clone the repository to GCS (once, or on updates).
2. Sync from GCS to local /tmp on Cloud Run startup.

logger = setup_logger(__name__) module-attribute

GCSRepoSync

Manages repository synchronization between GCS and local storage.

bucket_name = bucket_name instance-attribute

repo_url = repo_url instance-attribute

gcs_prefix = gcs_prefix.strip('/') instance-attribute

local_path = local_path or Path('/tmp/handbook') instance-attribute

storage_client = storage.Client() instance-attribute

bucket = self.storage_client.bucket(bucket_name) instance-attribute

logger = logger_instance or logger instance-attribute

__init__(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)

Initialize GCS repository sync.

Parameters:

Name             Type                            Description                                        Default
bucket_name      str                             GCS bucket name                                    required
repo_url         str                             Git repository URL                                 required
gcs_prefix       str                             Prefix/folder in GCS bucket for repository files   'handbook'
local_path       Path | None                     Local path to sync to (defaults to /tmp/handbook)  None
logger_instance  Logger | LoggerAdapter | None   Optional logger instance to use.                   None
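
The attribute assignments listed above imply two small normalizations on the constructor arguments. The sketch below reproduces only those two expressions in a standalone function; `normalize_sync_args` is a hypothetical helper for illustration and is not part of the module.

```python
from __future__ import annotations

from pathlib import Path


def normalize_sync_args(
    gcs_prefix: str = "handbook",
    local_path: Path | None = None,
) -> tuple[str, Path]:
    """Hypothetical helper mirroring the documented attribute assignments."""
    # gcs_prefix.strip('/'): leading and trailing slashes are dropped,
    # so "/handbook/" and "handbook" refer to the same GCS folder.
    prefix = gcs_prefix.strip("/")
    # local_path or Path('/tmp/handbook'): a missing path falls back to the default.
    path = local_path or Path("/tmp/handbook")
    return prefix, path


print(normalize_sync_args("/handbook/"))
print(normalize_sync_args("docs/v2/", Path("/tmp/docs")))
```

Only the outer slashes are removed; a nested prefix such as `docs/v2/` keeps its inner separator.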

clone_to_gcs(force: bool = False) -> dict[str, Any]

Clone repository and upload to GCS.

This should be run once initially, or when you want to refresh the repository in GCS.

Parameters:

Name   Type  Description                                   Default
force  bool  If True, re-clone even if files exist in GCS  False

Returns:

Type            Description
dict[str, Any]  Dictionary with stats about the clone operation

sync_to_local(force: bool = False) -> dict[str, Any]

Sync repository from GCS to local storage.

This is called on Cloud Run startup to get the latest repository files.

Parameters:

Name   Type  Description                                                Default
force  bool  If True, delete and re-download even if local files exist  False

Returns:

Type            Description
dict[str, Any]  Dictionary with stats about the sync operation
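
One way a Cloud Run startup hook might combine `is_synced()` and `sync_to_local()` is sketched below. The `RepoSync` protocol, the `ensure_local_copy` helper, and the `FakeSync` stand-in are all assumptions made for illustration; the fake exists so the example runs without GCS access, and the `"forced"` key in its stats dictionary is invented, since the real keys are not documented here.

```python
from __future__ import annotations

from typing import Any, Protocol


class RepoSync(Protocol):
    """The two documented methods this sketch relies on."""

    def is_synced(self) -> bool: ...
    def sync_to_local(self, force: bool = False) -> dict[str, Any]: ...


def ensure_local_copy(sync: RepoSync, force: bool = False) -> dict[str, Any] | None:
    """Hypothetical startup helper: sync only when needed, or when forced."""
    if sync.is_synced() and not force:
        return None  # local copy already present; skip the download
    return sync.sync_to_local(force=force)


class FakeSync:
    """In-memory stand-in for GCSRepoSync (assumption, for demonstration only)."""

    def __init__(self, synced: bool) -> None:
        self.synced = synced
        self.calls: list[bool] = []

    def is_synced(self) -> bool:
        return self.synced

    def sync_to_local(self, force: bool = False) -> dict[str, Any]:
        self.calls.append(force)
        self.synced = True
        return {"forced": force}  # invented stats key


fake = FakeSync(synced=False)
first = ensure_local_copy(fake)               # cold start: performs a sync
second = ensure_local_copy(fake)              # already synced: skipped
third = ensure_local_copy(fake, force=True)   # forced refresh
print(first, second, third)
```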

list_files_in_gcs() -> list[str]

List all files in GCS without downloading.

Returns:

Type       Description
list[str]  List of relative file paths (e.g., ['docs/setup.md', 'api/README.md'])
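
The returned paths are relative, which suggests the GCS prefix is removed from each object name. A minimal sketch of that conversion follows, assuming the object names have already been listed (for example with `bucket.list_blobs(prefix=...)` from google-cloud-storage). `to_relative_paths` is a hypothetical helper, and whether the module filters or sorts the same way is an assumption.

```python
def to_relative_paths(blob_names: list[str], gcs_prefix: str) -> list[str]:
    """Hypothetical helper: turn full object names into repo-relative paths."""
    prefix = gcs_prefix.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    paths = []
    for name in blob_names:
        if not name.startswith(prefix):
            continue  # object lives outside the repository folder
        relative = name[len(prefix):]
        # Skip the empty remainder and "folder" placeholder objects ending in "/".
        if relative and not relative.endswith("/"):
            paths.append(relative)
    return sorted(paths)


names = [
    "handbook/docs/setup.md",
    "handbook/api/README.md",
    "handbook/",      # folder placeholder object
    "other/notes.md", # unrelated object
]
print(to_relative_paths(names, "/handbook/"))
```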

download_batch_files(file_list: list[str]) -> Path

Download only specific files for a batch from GCS.

Uses parallel downloads (ThreadPoolExecutor) for faster performance.

Parameters:

Name       Type       Description                              Default
file_list  list[str]  List of relative file paths to download  required

Returns:

Type  Description
Path  Path to local directory with downloaded files
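
The parallel download pattern mentioned above can be sketched with `ThreadPoolExecutor` as follows. This is not the module's implementation: `download_batch`, the `fetch` callable, and `max_workers` are assumptions, and an in-memory dictionary stands in for GCS so the example runs offline. In real code `fetch` might wrap something like `blob.download_as_bytes()`.

```python
from __future__ import annotations

import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def download_batch(
    file_list: list[str],
    dest_root: Path,
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> Path:
    """Hypothetical helper: fetch each relative path in a worker thread."""

    def download_one(rel_path: str) -> None:
        target = dest_root / rel_path
        # exist_ok=True keeps concurrent creation of shared parents safe.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(rel_path))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so a worker exception is re-raised here.
        list(pool.map(download_one, file_list))
    return dest_root


# In-memory stand-in for the bucket contents (assumption, for demonstration).
fake_store = {"docs/setup.md": b"setup", "api/README.md": b"readme"}

with tempfile.TemporaryDirectory() as tmp:
    root = download_batch(list(fake_store), Path(tmp), fake_store.__getitem__, max_workers=2)
    downloaded = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
print(downloaded)
```

Threads suit this workload because each download is I/O-bound, so the workers spend most of their time waiting on the network rather than competing for the interpreter.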

get_local_path() -> Path

Get the local path where repository is synced.

Returns:

Type  Description
Path  Path to local repository

is_synced() -> bool

Check if repository is synced locally.

Returns:

Type  Description
bool  True if local repository exists and sync is complete
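
The documentation does not say how "sync is complete" is determined. One common approach, shown here purely as an assumption, is to write a marker file after the last file has been downloaded and to check for both the directory and the marker. The marker name `.sync_complete` and both functions are hypothetical.

```python
import tempfile
from pathlib import Path

SYNC_MARKER = ".sync_complete"  # hypothetical marker file name


def mark_synced(local_path: Path) -> None:
    """Record that a sync finished (to be called after the last download)."""
    (local_path / SYNC_MARKER).touch()


def is_synced(local_path: Path) -> bool:
    """True only if the directory exists and the marker file was written."""
    return local_path.is_dir() and (local_path / SYNC_MARKER).is_file()


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "handbook"
    states = [is_synced(repo)]      # directory missing
    repo.mkdir()
    states.append(is_synced(repo))  # directory exists, no marker yet
    mark_synced(repo)
    states.append(is_synced(repo))  # marker written
print(states)
```

A marker distinguishes a finished sync from a directory left half-populated by an interrupted one, which a plain existence check cannot do.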