thoth.shared.gcs_sync
Google Cloud Storage sync for vector database persistence.
This module uploads local vector database directories (e.g., LanceDB data) to a GCS bucket and downloads them back, for backup and restore. The ingestion worker and the CLI use it when they are not configured with a direct gs:// LanceDB URI.
Functions
- Create and configure a logger with structured JSON output.
Classes
- GCSSync – Manages sync of local vector DB directories to/from Google Cloud Storage.
- Path – PurePath subclass that can make system calls.
- datetime – The year, month and day arguments are required.
Exceptions
- GCSSyncError – Raised when GCS sync operations fail.
- exception thoth.shared.gcs_sync.GCSSyncError
Bases: Exception
Raised when GCS sync operations fail.
- class thoth.shared.gcs_sync.GCSSync(bucket_name: str, project_id: str | None = None, credentials_path: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Bases: object
Manages sync of local vector DB directories to/from Google Cloud Storage.
Handles uploading a local directory (e.g., LanceDB persistence path) to a GCS prefix and downloading it back for restore. Verifies bucket existence on init and uses Application Default Credentials or a provided credentials path.
- __init__(bucket_name: str, project_id: str | None = None, credentials_path: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Initialize GCS sync manager.
- Parameters:
bucket_name – Name of the GCS bucket for storage
project_id – Optional GCP project ID (defaults to environment)
credentials_path – Optional path to a service account JSON key file. If not provided, Application Default Credentials are used.
logger_instance – Optional logger instance to use.
- Raises:
GCSSyncError – If google-cloud-storage is not installed
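The constructor can fail, so callers may want to build the manager defensively. The following is a minimal sketch, not the project's own code: `make_gcs_sync`, its `module_name` parameter, and the logger name are hypothetical. It relies only on the constructor signature and the `GCSSyncError` exception documented above, and it imports the module lazily so that a missing installation degrades to `None` instead of an import error.

```python
import importlib
import logging
from typing import Any, Optional

log = logging.getLogger("gcs_sync_example")  # hypothetical logger name


def make_gcs_sync(
    bucket_name: str,
    project_id: Optional[str] = None,
    credentials_path: Optional[str] = None,
    module_name: str = "thoth.shared.gcs_sync",  # overridable for testing (assumption)
) -> Optional[Any]:
    """Return a GCSSync instance, or None if it cannot be created."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        log.warning("%s is not importable; GCS sync disabled", module_name)
        return None
    try:
        return module.GCSSync(
            bucket_name=bucket_name,
            project_id=project_id,
            credentials_path=credentials_path,
            logger_instance=log,
        )
    except module.GCSSyncError as exc:
        # Documented failure mode of __init__.
        log.error("Could not initialise GCS sync: %s", exc)
        return None
```

Returning `None` keeps the caller's control flow simple. Code that must not run without a working sync can re-raise instead.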
- upload_directory(local_path: str | Path, gcs_prefix: str = 'lancedb', exclude_patterns: list[str] | None = None) → int
Upload a local directory to GCS.
- Parameters:
local_path – Path to local directory to upload
gcs_prefix – Prefix (folder path) in GCS bucket
exclude_patterns – Optional list of filename patterns to exclude
- Returns:
Number of files uploaded
- Raises:
GCSSyncError – If upload fails
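The reference does not say how `exclude_patterns` are matched. As an illustration only, the sketch below assumes shell-style globs applied to each file's base name, using the standard `fnmatch` module, and previews which files such a filter would keep. `preview_upload` is a hypothetical helper, not part of the module, and the real matching rules may differ.

```python
import fnmatch
import tempfile
from pathlib import Path
from typing import List, Optional


def preview_upload(local_path, exclude_patterns: Optional[List[str]] = None) -> List[str]:
    """List files under local_path that survive the exclude patterns.

    Assumption: patterns are shell-style globs matched against the file name only.
    """
    root = Path(local_path)
    patterns = exclude_patterns or []
    kept = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # directories are not uploaded as objects
        if any(fnmatch.fnmatch(path.name, pattern) for pattern in patterns):
            continue
        kept.append(path.relative_to(root).as_posix())
    return kept


# Build a small throwaway tree and preview the result.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "data.lance").write_text("a")
    (Path(tmp) / "tmp.lock").write_text("b")
    (Path(tmp) / "sub" / "part.lance").write_text("c")
    kept = preview_upload(tmp, ["*.lock"])
```

A dry-run preview like this is a cheap way to check an exclusion list before calling `upload_directory` on a large directory.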
- download_directory(gcs_prefix: str, local_path: str | Path, clean_local: bool = False) → int
Download a directory from GCS to local storage.
- Parameters:
gcs_prefix – Prefix (folder path) in GCS bucket
local_path – Path to local directory for download
clean_local – If True, remove local directory before download
- Returns:
Number of files downloaded
- Raises:
GCSSyncError – If download fails
- sync_to_gcs(local_path: str | Path, gcs_prefix: str = 'lancedb') → dict[str, int | str]
Sync local LanceDB directory to GCS (upload).
- Parameters:
local_path – Path to local LanceDB directory
gcs_prefix – Prefix in GCS bucket
- Returns:
Dictionary with sync statistics
- Raises:
GCSSyncError – If sync fails
- sync_from_gcs(gcs_prefix: str, local_path: str | Path, clean_local: bool = False) → dict[str, int | str]
Sync LanceDB directory from GCS to local (download).
- Parameters:
gcs_prefix – Prefix in GCS bucket
local_path – Path to local LanceDB directory
clean_local – If True, remove local directory before sync
- Returns:
Dictionary with sync statistics
- Raises:
GCSSyncError – If sync fails
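The two sync methods mirror each other, so a round trip can be sketched as follows. `push_then_pull` and `FakeSync` are hypothetical names. `FakeSync` only records calls so the example runs without GCS access, and the `example_key` entry in its return value is made up, since the keys of the real statistics dictionary are not documented here. In real use a `GCSSync` instance would be passed instead.

```python
from pathlib import Path
from typing import Dict, List, Tuple, Union

PathLike = Union[str, Path]
Stats = Dict[str, Union[int, str]]


class FakeSync:
    """Hypothetical stand-in mimicking the two documented method signatures."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def sync_to_gcs(self, local_path: PathLike, gcs_prefix: str = "lancedb") -> Stats:
        self.calls.append(("sync_to_gcs", str(local_path), gcs_prefix))
        return {"example_key": 0}  # made-up key, for illustration only

    def sync_from_gcs(
        self, gcs_prefix: str, local_path: PathLike, clean_local: bool = False
    ) -> Stats:
        self.calls.append(("sync_from_gcs", gcs_prefix, str(local_path), clean_local))
        return {"example_key": 0}  # made-up key, for illustration only


def push_then_pull(
    sync, source_dir: PathLike, restore_dir: PathLike, gcs_prefix: str = "lancedb"
) -> Tuple[Stats, Stats]:
    """Upload source_dir under gcs_prefix, then download it into restore_dir."""
    uploaded = sync.sync_to_gcs(source_dir, gcs_prefix=gcs_prefix)
    # clean_local=True removes restore_dir first, so no stale files survive.
    downloaded = sync.sync_from_gcs(gcs_prefix, restore_dir, clean_local=True)
    return uploaded, downloaded


fake = FakeSync()
push_then_pull(fake, "data/lancedb", "restore/lancedb")
```

Using a separate `restore_dir` keeps `clean_local=True` from deleting the directory that was just uploaded.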
- backup_to_gcs(local_path: str | Path, backup_name: str | None = None) → str
Create a timestamped backup in GCS.
- Parameters:
local_path – Path to local LanceDB directory
backup_name – Optional backup name (defaults to timestamp)
- Returns:
GCS prefix of the backup
- Raises:
GCSSyncError – If backup fails
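The format of the default timestamp is not specified here. Callers who want predictable, sortable names can pass an explicit `backup_name`. The helper below is a hypothetical sketch of one such naming scheme (a label plus a UTC timestamp); it is not the module's own default.

```python
from datetime import datetime, timezone
from typing import Optional


def make_backup_name(label: str, now: Optional[datetime] = None) -> str:
    """Build a name such as 'nightly-20240102T030405Z' (format is an assumption)."""
    moment = now or datetime.now(timezone.utc)
    # Fixed-width fields make lexicographic order match chronological order.
    return f"{label}-{moment.strftime('%Y%m%dT%H%M%SZ')}"


example_name = make_backup_name(
    "nightly", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
```

The result could then be passed as `backup_name` to `backup_to_gcs`.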
- restore_from_backup(backup_name: str, local_path: str | Path, clean_local: bool = True) → int
Restore LanceDB from a GCS backup.
- Parameters:
backup_name – Name of the backup to restore
local_path – Path to local LanceDB directory
clean_local – If True, remove local directory before restore
- Returns:
Number of files restored
- Raises:
GCSSyncError – If restore fails
- list_backups() → list[str]
List available backups in GCS.
- Returns:
List of backup names
- Raises:
GCSSyncError – If listing fails
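`list_backups` and `restore_from_backup` combine naturally into a "restore the most recent backup" routine. The sketch below assumes that backup names sort lexicographically in chronological order (true for fixed-width timestamp names, but not guaranteed by this reference). `restore_latest` and `FakeBackupStore` are hypothetical; the fake stands in for a `GCSSync` instance so the example runs offline.

```python
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]


class FakeBackupStore:
    """Hypothetical stand-in exposing the two documented backup methods."""

    def __init__(self, names: List[str]) -> None:
        self.names = list(names)
        self.restored: List[tuple] = []

    def list_backups(self) -> List[str]:
        return list(self.names)

    def restore_from_backup(
        self, backup_name: str, local_path: PathLike, clean_local: bool = True
    ) -> int:
        self.restored.append((backup_name, str(local_path), clean_local))
        return 0  # the real method returns the number of files restored


def restore_latest(sync, local_path: PathLike) -> Optional[str]:
    """Restore the lexicographically greatest backup; return its name or None."""
    backups = sync.list_backups()
    if not backups:
        return None  # nothing to restore
    latest = max(backups)  # assumption: name order equals time order
    sync.restore_from_backup(latest, local_path, clean_local=True)
    return latest


store = FakeBackupStore(["20240101T000000Z", "20240301T000000Z", "20240201T000000Z"])
chosen = restore_latest(store, "data/lancedb")
```

If backup names are free-form, the selection step would need a different ordering rule, such as parsing a date out of each name.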