thoth.ingestion.gcs_repo_sync

GCS-based repository synchronization for Cloud Run.

This module handles syncing the GitLab handbook between GCS and local storage:

1. Clone repository to GCS (once, or on updates)
2. Sync from GCS to local /tmp on Cloud Run startup

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

GCSRepoSync(bucket_name, repo_url[, ...])

Manages repository synchronization between GCS and local storage.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

Repo(path, odbt, search_parent_directories, ...)

Represents a git repository and allows you to query references, create commit information, generate diffs, create and clone repositories, and query the log.

ThreadPoolExecutor([max_workers, ...])

class thoth.ingestion.gcs_repo_sync.GCSRepoSync(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Bases: object

Manages repository synchronization between GCS and local storage.

__init__(bucket_name: str, repo_url: str, gcs_prefix: str = 'handbook', local_path: Path | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Initialize GCS repository sync.

Parameters:
  • bucket_name – GCS bucket name

  • repo_url – Git repository URL

  • gcs_prefix – Prefix/folder in GCS bucket for repository files

  • local_path – Local path to sync to (defaults to /tmp/handbook)

  • logger_instance – Optional logger instance to use.
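A minimal construction sketch, assuming the package is importable as `thoth`. The helper name `make_sync` and the bucket name and repository URL in the trailing comment are placeholders, not values from the source. The import is deferred inside the function so the snippet can be loaded without the package installed.

```python
from pathlib import Path


def make_sync(bucket_name: str, repo_url: str):
    """Build a GCSRepoSync, spelling out the documented defaults explicitly."""
    # Deferred import: assumes the thoth package is installed where this runs.
    from thoth.ingestion.gcs_repo_sync import GCSRepoSync

    return GCSRepoSync(
        bucket_name=bucket_name,
        repo_url=repo_url,
        gcs_prefix="handbook",             # documented default prefix
        local_path=Path("/tmp/handbook"),  # documented default location
    )


# Hypothetical placeholder values:
# sync = make_sync("my-handbook-bucket", "https://example.com/handbook.git")
```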

clone_to_gcs(force: bool = False) -> dict[str, Any]

Clone repository and upload to GCS.

This should be run once initially, or when you want to refresh the repository in GCS.

Parameters:

force – If True, re-clone even if files exist in GCS

Returns:

Dictionary with stats about the clone operation
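As a hedged sketch of the one-time seeding step, the helper below (`seed_bucket`, a hypothetical name) accepts any object exposing `clone_to_gcs` and passes the stats dictionary through. The keys of that dictionary are not documented here, so the sketch prints it without assuming any.

```python
from typing import Any, Protocol


class SupportsClone(Protocol):
    """Structural type for anything with a clone_to_gcs method."""

    def clone_to_gcs(self, force: bool = False) -> dict[str, Any]: ...


def seed_bucket(sync: SupportsClone, refresh: bool = False) -> dict[str, Any]:
    """Run the clone-and-upload step, forcing a re-clone only when refresh is True."""
    stats = sync.clone_to_gcs(force=refresh)
    # The stats keys are implementation-defined, so log the dictionary as-is.
    print(f"clone_to_gcs finished: {stats}")
    return stats
```

With a real `GCSRepoSync` instance this would typically run from a one-off job rather than from the Cloud Run service itself, since the service only needs the GCS-to-local step.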

sync_to_local(force: bool = False) -> dict[str, Any]

Sync repository from GCS to local storage.

This is called on Cloud Run startup to get the latest repository files.

Parameters:

force – If True, delete and re-download even if local files exist

Returns:

Dictionary with stats about the sync operation
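A startup hook might combine this method with `is_synced()` and `get_local_path()`. The sketch below is one possible arrangement, not the module's own startup code; `ensure_local_copy` is a hypothetical helper that works with any object exposing those three methods.

```python
from pathlib import Path
from typing import Any, Protocol


class SupportsSync(Protocol):
    """Structural type covering the three methods the helper relies on."""

    def is_synced(self) -> bool: ...
    def sync_to_local(self, force: bool = False) -> dict[str, Any]: ...
    def get_local_path(self) -> Path: ...


def ensure_local_copy(sync: SupportsSync, force: bool = False) -> Path:
    """Return the local repository path, syncing from GCS first if needed."""
    # Skip the download when a complete local copy already exists,
    # unless the caller explicitly asks for a fresh one.
    if force or not sync.is_synced():
        sync.sync_to_local(force=force)
    return sync.get_local_path()
```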

list_files_in_gcs() -> list[str]

List all files in GCS without downloading.

Returns:

List of relative file paths (e.g., ['docs/setup.md', 'api/README.md'])
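The returned paths are relative to the repository root rather than full object names. How the module derives them is not shown here. A plausible sketch, assuming objects are stored under `<gcs_prefix>/<relative path>`, is to strip the prefix from each object name. `to_relative_paths` is a hypothetical helper used only for illustration.

```python
def to_relative_paths(blob_names: list[str], gcs_prefix: str = "handbook") -> list[str]:
    """Convert full object names into paths relative to gcs_prefix."""
    prefix = gcs_prefix.strip("/") + "/"
    relative = []
    for name in blob_names:
        # Skip objects outside the prefix and "directory" placeholder objects.
        if not name.startswith(prefix) or name.endswith("/"):
            continue
        relative.append(name[len(prefix):])
    return relative
```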

download_batch_files(file_list: list[str]) -> Path

Download only the files that belong to one batch from GCS.

Uses parallel downloads (ThreadPoolExecutor) for faster performance.

Parameters:

file_list – List of relative file paths to download

Returns:

Path to local directory with downloaded files
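The parallel-download pattern can be sketched as follows. This is an illustration of the general ThreadPoolExecutor technique, not the module's actual implementation: `download_in_parallel`, the `fetch` callable, and the `max_workers` default are all assumptions introduced here, with `fetch` standing in for whatever call retrieves an object's bytes from GCS.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def download_in_parallel(
    file_list: list[str],
    dest_dir: Path,
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> Path:
    """Fetch each relative path with `fetch` and write it under dest_dir."""

    def _download_one(rel_path: str) -> Path:
        target = dest_dir / rel_path
        # Recreate the repository's directory layout locally.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(rel_path))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_download_one, path) for path in file_list]
        for future in as_completed(futures):
            future.result()  # re-raise the first download error, if any
    return dest_dir
```

Threads suit this workload because each download spends most of its time waiting on the network rather than using the CPU.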

get_local_path() -> Path

Get the local path where repository is synced.

Returns:

Path to local repository

is_synced() -> bool

Check if repository is synced locally.

Returns:

True if local repository exists and sync is complete