thoth.shared.gcs_sync

Google Cloud Storage sync for vector database persistence.

This module uploads local vector database directories (e.g., LanceDB data) to a GCS bucket and downloads them again, for backup and restore. It is used by the ingestion worker and the CLI when LanceDB is not pointed directly at a gs:// URI.

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

GCSSync(bucket_name[, project_id, ...])

Manages sync of local vector DB directories to/from Google Cloud Storage.

Exceptions

GCSSyncError

Raised when GCS sync operations fail.

GoogleCloudError

exception thoth.shared.gcs_sync.GCSSyncError

Bases: Exception

Raised when GCS sync operations fail.

class thoth.shared.gcs_sync.GCSSync(bucket_name: str, project_id: str | None = None, credentials_path: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Bases: object

Manages sync of local vector DB directories to/from Google Cloud Storage.

Handles uploading a local directory (e.g., LanceDB persistence path) to a GCS prefix and downloading it back for restore. Verifies bucket existence on init and uses Application Default Credentials or a provided credentials path.

__init__(bucket_name: str, project_id: str | None = None, credentials_path: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Initialize GCS sync manager.

Parameters:
  • bucket_name – Name of the GCS bucket for storage

  • project_id – Optional GCP project ID (defaults to the project configured in the environment)

  • credentials_path – Optional path to a service account JSON key file. If not provided, Application Default Credentials are used.

  • logger_instance – Optional logger instance to use.

Raises:

GCSSyncError – If google-cloud-storage is not installed
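
A minimal construction sketch, assuming the package is importable as thoth.shared.gcs_sync and that the target bucket already exists. The helper name open_sync and the logger name are hypothetical; only the constructor parameters are taken from the signature above.

```python
import logging


def open_sync(bucket_name, credentials_path=None, project_id=None):
    """Build a GCSSync instance (open_sync is a hypothetical helper, not part of thoth)."""
    # Imported lazily so this sketch can be loaded without thoth installed.
    from thoth.shared.gcs_sync import GCSSync, GCSSyncError

    try:
        return GCSSync(
            bucket_name=bucket_name,
            project_id=project_id,              # None: resolved from the environment
            credentials_path=credentials_path,  # None: Application Default Credentials
            logger_instance=logging.getLogger("gcs_sync_example"),
        )
    except GCSSyncError as exc:
        # Documented to be raised when google-cloud-storage is not installed.
        raise RuntimeError(f"Could not initialize GCS sync: {exc}") from exc
```

Calling open_sync("my-vector-db-bucket") (a placeholder bucket name) would then return an object exposing the methods documented below.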

upload_directory(local_path: str | Path, gcs_prefix: str = 'lancedb', exclude_patterns: list[str] | None = None) → int

Upload a local directory to GCS.

Parameters:
  • local_path – Path to local directory to upload

  • gcs_prefix – Prefix (folder path) in GCS bucket

  • exclude_patterns – Optional list of filename patterns to exclude

Returns:

Number of files uploaded

Raises:

GCSSyncError – If upload fails
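
The mapping from local files to object names is not spelled out above. The sketch below shows one plausible reading, assuming that exclude_patterns are fnmatch-style patterns matched against the bare file name and that object names are gcs_prefix plus the file's relative POSIX path. The function plan_upload is hypothetical and performs no network calls.

```python
from fnmatch import fnmatch
from pathlib import Path


def plan_upload(local_path, gcs_prefix="lancedb", exclude_patterns=None):
    """Return (file, object_name) pairs for every file that would be uploaded."""
    root = Path(local_path)
    patterns = exclude_patterns or []
    plan = []
    for file in sorted(root.rglob("*")):
        if not file.is_file():
            continue  # directories have no object of their own
        # Assumption: patterns are fnmatch-style and apply to the file name only.
        if any(fnmatch(file.name, pattern) for pattern in patterns):
            continue
        relative = file.relative_to(root).as_posix()
        plan.append((file, f"{gcs_prefix.rstrip('/')}/{relative}"))
    return plan
```

Under these assumptions, the length of the returned list corresponds to the "number of files uploaded" that upload_directory reports.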

download_directory(gcs_prefix: str, local_path: str | Path, clean_local: bool = False) → int

Download a directory from GCS to local storage.

Parameters:
  • gcs_prefix – Prefix (folder path) in GCS bucket

  • local_path – Path to local directory for download

  • clean_local – If True, remove local directory before download

Returns:

Number of files downloaded

Raises:

GCSSyncError – If download fails
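
The effect of clean_local and the reverse mapping from object names to local paths can be sketched as follows. Both helpers are hypothetical and only illustrate the behaviour described above; the actual implementation may differ.

```python
import shutil
from pathlib import Path, PurePosixPath


def prepare_local_dir(local_path, clean_local=False):
    """Optionally wipe the target directory, then make sure it exists."""
    target = Path(local_path)
    if clean_local and target.exists():
        shutil.rmtree(target)  # drops stale files that are no longer in GCS
    target.mkdir(parents=True, exist_ok=True)
    return target


def local_target(object_name, gcs_prefix, local_path):
    """Map an object name under gcs_prefix to a path under local_path."""
    # relative_to raises ValueError if the object is not under the prefix.
    relative = PurePosixPath(object_name).relative_to(gcs_prefix.strip("/"))
    return Path(local_path).joinpath(*relative.parts)
```

With clean_local left at False, files already present locally but absent from the bucket would survive the download, which is why a restore usually wants it set to True.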

sync_to_gcs(local_path: str | Path, gcs_prefix: str = 'lancedb') → dict[str, int | str]

Sync local LanceDB directory to GCS (upload).

Parameters:
  • local_path – Path to local LanceDB directory

  • gcs_prefix – Prefix in GCS bucket

Returns:

Dictionary with sync statistics

Raises:

GCSSyncError – If sync fails

sync_from_gcs(gcs_prefix: str, local_path: str | Path, clean_local: bool = False) → dict[str, int | str]

Sync LanceDB directory from GCS to local (download).

Parameters:
  • gcs_prefix – Prefix in GCS bucket

  • local_path – Path to local LanceDB directory

  • clean_local – If True, remove local directory before sync

Returns:

Dictionary with sync statistics

Raises:

GCSSyncError – If sync fails
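
A round-trip sketch combining sync_to_gcs and sync_from_gcs. It accepts any object exposing those two methods, so it can be exercised with a stub. The helper name push_then_pull is hypothetical, and the keys of the statistics dictionaries are not documented here, so the sketch returns them untouched.

```python
def push_then_pull(sync, source_dir, restore_dir, gcs_prefix="lancedb"):
    """Upload source_dir, then download the same prefix into restore_dir."""
    uploaded = sync.sync_to_gcs(source_dir, gcs_prefix=gcs_prefix)
    # clean_local=True so restore_dir mirrors the bucket exactly.
    downloaded = sync.sync_from_gcs(gcs_prefix, restore_dir, clean_local=True)
    # Returned as-is: the dictionary keys are not specified in this reference.
    return uploaded, downloaded
```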

backup_to_gcs(local_path: str | Path, backup_name: str | None = None) → str

Create a timestamped backup in GCS.

Parameters:
  • local_path – Path to local LanceDB directory

  • backup_name – Optional backup name (defaults to timestamp)

Returns:

GCS prefix of the backup

Raises:

GCSSyncError – If backup fails
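
When backup_name is omitted, a timestamp is used. The exact format is not stated in this reference; the sketch below assumes a zero-padded UTC timestamp with a hypothetical backup_ prefix, chosen because such names sort chronologically as plain strings.

```python
from datetime import datetime, timezone


def default_backup_name(now=None):
    """Hypothetical timestamp-based backup name; the real format may differ."""
    now = now or datetime.now(timezone.utc)
    # Zero-padded fields make lexicographic order match chronological order.
    return now.strftime("backup_%Y%m%d_%H%M%S")
```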

restore_from_backup(backup_name: str, local_path: str | Path, clean_local: bool = True) → int

Restore LanceDB from a GCS backup.

Parameters:
  • backup_name – Name of the backup to restore

  • local_path – Path to local LanceDB directory

  • clean_local – If True, remove local directory before restore

Returns:

Number of files restored

Raises:

GCSSyncError – If restore fails

list_backups() → list[str]

List available backups in GCS.

Returns:

List of backup names

Raises:

GCSSyncError – If listing fails
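
list_backups and restore_from_backup can be combined to restore the most recent backup. The sketch below assumes that backup names sort chronologically, which holds for zero-padded timestamp names but not for arbitrary custom names. The helper restore_latest is hypothetical.

```python
def restore_latest(sync, local_path):
    """Restore the lexicographically greatest backup into local_path."""
    names = sync.list_backups()
    if not names:
        raise LookupError("no backups found in the bucket")
    # Assumption: names sort chronologically (true for zero-padded timestamps).
    latest = max(names)
    restored = sync.restore_from_backup(latest, local_path, clean_local=True)
    return latest, restored
```

The function returns the chosen backup name together with the number of files restored, so the caller can log which backup was applied.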