thoth.shared.vector_store

Vector store module for managing document embeddings using LanceDB.

This module wraps LanceDB to store and query document embeddings. It provides CRUD operations and native Google Cloud Storage (GCS) support.

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

Embedder([model_name, device, batch_size, ...])

Generate embeddings from text using sentence-transformers.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

VectorStore([persist_directory, ...])

Vector store for document embeddings using LanceDB.

class thoth.shared.vector_store.VectorStore(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Bases: object

Vector store for document embeddings using LanceDB.

Provides add/search/delete/get operations for document chunks with metadata (file_path, section, chunk_index, source, format). Supports local paths or GCS via gs:// URIs. Uses an Embedder for query and document embeddings; defaults to sentence-transformers all-MiniLM-L6-v2.

__init__(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)

Initialize the LanceDB vector store.

Parameters:
  • persist_directory – Local path or base path for LanceDB. Ignored when gcs_bucket_name is set; the store URI is then gs://bucket/lancedb, or gs://bucket/gcs_prefix_override if an override is given.

  • collection_name – Name of the table (collection).

  • embedder – Optional Embedder instance. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.

  • gcs_bucket_name – Optional GCS bucket; when set, store uses gs://bucket/…

  • gcs_project_id – Optional GCP project ID (unused; kept for API compatibility).

  • gcs_prefix_override – Optional GCS path under bucket (e.g. lancedb_batch_xyz). When set with gcs_bucket_name, URI is gs://bucket/gcs_prefix_override.

  • logger_instance – Optional logger instance.
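The URI rules described in the parameters above can be sketched as a small function. This is a minimal illustration, not the module's actual code: the helper name resolve_store_uri is hypothetical, and the bucket name my-bucket is a placeholder.

```python
from typing import Optional


def resolve_store_uri(
    persist_directory: str = "./lancedb",
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix_override: Optional[str] = None,
) -> str:
    """Hypothetical helper: choose the store URI as the parameter docs describe."""
    if gcs_bucket_name:
        # With a bucket set, persist_directory is ignored and the prefix
        # defaults to "lancedb" unless an override is supplied.
        prefix = gcs_prefix_override or "lancedb"
        return f"gs://{gcs_bucket_name}/{prefix}"
    # No bucket: fall back to the local path.
    return persist_directory


local_uri = resolve_store_uri()
default_gcs_uri = resolve_store_uri(gcs_bucket_name="my-bucket")
batch_uri = resolve_store_uri(
    gcs_bucket_name="my-bucket",
    gcs_prefix_override="lancedb_batch_xyz",
)
```

Under these assumptions, local_uri stays ./lancedb, while the two GCS calls resolve to gs://my-bucket/lancedb and gs://my-bucket/lancedb_batch_xyz.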

add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) → None

Add or update documents in the table.

Parameters:
  • documents – List of document texts.

  • metadatas – Optional list of metadata dicts per document.

  • ids – Optional list of IDs; auto-generated if not provided.

  • embeddings – Optional pre-computed embeddings.

Raises:

ValueError – If list lengths do not match.
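The length check and ID generation described above might look roughly like the following. This is a sketch under assumptions: the helper prepare_rows, the row keys (id, text, metadata, vector), and the use of UUIDs for generated IDs are all illustrative and are not confirmed by the module.

```python
import uuid
from typing import Any, Optional


def prepare_rows(
    documents: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    ids: Optional[list[str]] = None,
    embeddings: Optional[list[list[float]]] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: validate parallel lists and build one row per document."""
    n = len(documents)
    for name, values in (("metadatas", metadatas), ("ids", ids), ("embeddings", embeddings)):
        # Every optional list that is supplied must line up with documents.
        if values is not None and len(values) != n:
            raise ValueError(f"{name} has {len(values)} items, expected {n}")

    if ids is None:
        # Assumption: generated IDs are random UUIDs.
        ids = [str(uuid.uuid4()) for _ in documents]
    if metadatas is None:
        metadatas = [{} for _ in documents]

    rows = []
    for i, text in enumerate(documents):
        row: dict[str, Any] = {"id": ids[i], "text": text, "metadata": metadatas[i]}
        if embeddings is not None:
            # Pre-computed vectors are attached as-is; otherwise an embedder would fill them in.
            row["vector"] = embeddings[i]
        rows.append(row)
    return rows


rows = prepare_rows(
    ["first chunk", "second chunk"],
    metadatas=[{"file_path": "a.md"}, {"file_path": "a.md"}],
)
```

Validating before generating IDs or embeddings means a mismatched call fails early, before any expensive work is done.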

search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) → dict[str, Any]

Search for similar documents by embedding.

Parameters:
  • query – Query text.

  • n_results – Maximum number of results.

  • where – Optional metadata filter (Chroma-style dict).

  • where_document – Unused; kept for API compatibility.

  • query_embedding – Optional pre-computed query embedding.

Returns:

Dict with ids, documents, metadatas, distances.
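LanceDB filters are written as SQL-style predicate strings, so a Chroma-style where dict has to be translated at some point. The sketch below shows one plausible translation for simple equality filters only. The helper where_to_sql is hypothetical, it assumes metadata fields are stored as top-level columns, and it does not cover Chroma operators such as $and or $in; the module's real translation may differ.

```python
from typing import Any


def where_to_sql(where: dict[str, Any]) -> str:
    """Hypothetical helper: turn {"field": value} pairs into an AND-joined predicate."""
    clauses = []
    for field, value in where.items():
        if isinstance(value, bool):
            # bool is checked first because it is a subclass of int.
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        elif isinstance(value, str):
            # Escape single quotes by doubling them, as in standard SQL.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            raise TypeError(f"unsupported filter value for {field!r}: {type(value).__name__}")
        clauses.append(f"{field} = {literal}")
    return " AND ".join(clauses)


predicate = where_to_sql({"source": "handbook", "chunk_index": 0})
```

Here predicate becomes the string source = 'handbook' AND chunk_index = 0, which is the kind of expression a LanceDB query filter accepts.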

delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) → None

Delete documents by ids or where filter.

Parameters:
  • ids – Optional list of document IDs.

  • where – Optional metadata filter.

Raises:

ValueError – If neither ids nor where is provided.

delete_by_file_path(file_path: str) → int

Delete all documents with the given file_path metadata.

Parameters:

file_path – File path to match.

Returns:

Number of documents deleted.

get_document_count() → int

Return the number of documents (rows) in the table.

Returns:

Non-negative integer count of rows.

get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) → dict[str, Any]

Retrieve documents by ids, where filter, or full scan with limit.

Parameters:
  • ids – Optional list of IDs.

  • where – Optional metadata filter.

  • limit – Optional maximum number of documents.

Returns:

Dict with ids, documents, metadatas.

reset() → None

Drop and recreate the table (all data removed).

backup_to_gcs(backup_name: str | None = None) → str | None

No-op when the store uses a GCS URI, since the data is already in GCS. Returns the URI or None.

restore_from_gcs(backup_name: str | None = None, gcs_prefix: str | None = None) → int

Reconnect to the store. When the store URI points to GCS, the data is already current. Returns the document count.

sync_to_gcs(gcs_prefix: str = 'lancedb') → dict | None

Sync is implicit when the store uses a GCS URI. Returns a status dict or None.

list_gcs_backups() → list[str]

LanceDB on GCS has no discrete backups, so this returns an empty list.