Vector store

thoth.shared.vector_store

Vector store module for managing document embeddings using LanceDB.

This module provides a wrapper around LanceDB for storing and querying document embeddings with CRUD operations and native GCS support.

logger = setup_logger(__name__) module-attribute

VectorStore

Vector store for document embeddings using LanceDB.

Provides add/search/delete/get operations for document chunks with metadata (file_path, section, chunk_index, source, format). Supports local paths or GCS via gs:// URIs. Uses an Embedder for query and document embeddings; defaults to sentence-transformers all-MiniLM-L6-v2.

collection_name = collection_name instance-attribute

logger = logger_instance or logger instance-attribute

embedder = embedder or Embedder(model_name='all-MiniLM-L6-v2', logger_instance=(self.logger)) instance-attribute

uri = f'gs://{gcs_bucket_name}/{path}' instance-attribute

db = lancedb.connect(self.uri) instance-attribute

table = self.db.create_table(self.collection_name, schema=schema, mode='create') instance-attribute

__init__(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)

Initialize the LanceDB vector store.

Parameters:

Name Type Description Default
persist_directory str

Local path for the LanceDB store. Ignored when gcs_bucket_name is set; the URI is then gs://bucket/lancedb, or gs://bucket/gcs_prefix_override when an override is given.

'./lancedb'
collection_name str

Name of the table (collection).

'thoth_documents'
embedder Embedder | None

Optional Embedder instance. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.

None
gcs_bucket_name str | None

Optional GCS bucket name. When set, the store uses a gs:// URI under that bucket instead of persist_directory.

None
gcs_project_id str | None

Optional GCP project ID (unused; kept for API compatibility).

None
gcs_prefix_override str | None

Optional GCS path under bucket (e.g. lancedb_batch_xyz). When set with gcs_bucket_name, URI is gs://bucket/gcs_prefix_override.

None
logger_instance Logger | LoggerAdapter | None

Optional logger instance.

None
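
The URI rules described in the parameter table can be sketched as a small function. This is only an illustration of the documented behaviour, not the module's code: resolve_store_uri is a hypothetical name, and the case where gcs_prefix_override is passed without a bucket is assumed to fall back to the local path.

```python
from __future__ import annotations


def resolve_store_uri(
    persist_directory: str = "./lancedb",
    gcs_bucket_name: str | None = None,
    gcs_prefix_override: str | None = None,
) -> str:
    """Hypothetical helper mirroring the documented URI selection."""
    if gcs_bucket_name:
        # Documented default path under the bucket is "lancedb";
        # gcs_prefix_override replaces it when provided.
        path = gcs_prefix_override or "lancedb"
        return f"gs://{gcs_bucket_name}/{path}"
    # Assumption: without a bucket, any override is ignored and the
    # local persist_directory is used as the LanceDB URI.
    return persist_directory


print(resolve_store_uri())
print(resolve_store_uri(gcs_bucket_name="my-bucket"))
print(resolve_store_uri(gcs_bucket_name="my-bucket",
                        gcs_prefix_override="lancedb_batch_xyz"))
```

The resulting string is the kind of value that would be passed to lancedb.connect, as shown in the db attribute above.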

add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) -> None

Add or update documents in the table.

Parameters:

Name Type Description Default
documents list[str]

List of document texts.

required
metadatas list[dict[str, Any]] | None

Optional list of metadata dicts per document.

None
ids list[str] | None

Optional list of IDs; auto-generated if not provided.

None
embeddings list[list[float]] | None

Optional pre-computed embeddings.

None

Raises:

Type Description
ValueError

If list lengths do not match.
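
A minimal sketch of the length check and ID auto-generation described above. The helper name prepare_batch is hypothetical, and the use of uuid4 for generated IDs is an assumption; the module may use a different scheme.

```python
from __future__ import annotations

import uuid
from typing import Any


def prepare_batch(
    documents: list[str],
    metadatas: list[dict[str, Any]] | None = None,
    ids: list[str] | None = None,
    embeddings: list[list[float]] | None = None,
) -> tuple[list[str], list[dict[str, Any]]]:
    """Hypothetical pre-flight check for an add_documents-style call."""
    expected = len(documents)
    optional_lists = (
        ("metadatas", metadatas),
        ("ids", ids),
        ("embeddings", embeddings),
    )
    for name, values in optional_lists:
        # Every list that is supplied must have one entry per document.
        if values is not None and len(values) != expected:
            raise ValueError(
                f"{name} has {len(values)} items, expected {expected}"
            )
    if ids is None:
        # Assumption: random hex IDs; the real scheme is not documented.
        ids = [uuid.uuid4().hex for _ in documents]
    if metadatas is None:
        metadatas = [{} for _ in documents]
    return ids, metadatas


ids, metadatas = prepare_batch(["first chunk", "second chunk"])
print(len(ids), metadatas)
```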

search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) -> dict[str, Any]

Search for similar documents by embedding.

Parameters:

Name Type Description Default
query str

Query text.

required
n_results int

Maximum number of results.

5
where dict[str, Any] | None

Optional metadata filter (Chroma-style dict).

None
where_document dict[str, Any] | None

Unused; kept for API compatibility.

None
query_embedding list[float] | None

Optional pre-computed query embedding.

None

Returns:

Type Description
dict[str, Any]

Dict with ids, documents, metadatas, distances.
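
LanceDB filters are written as SQL-like strings, so a Chroma-style where dict has to be translated before it can be applied to a search. The sketch below shows one way that translation might look for plain equality filters; where_to_sql is a hypothetical name, operators such as $and or $in are not handled, and the module's real translation may differ.

```python
from __future__ import annotations

from typing import Any


def where_to_sql(where: dict[str, Any] | None) -> str | None:
    """Hypothetical translation of {"field": value} into a SQL filter."""
    if not where:
        return None
    clauses = []
    for key, value in where.items():
        if isinstance(value, bool):
            # bool is checked before int because bool subclasses int.
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Escape single quotes by doubling them.
            literal = "'" + str(value).replace("'", "''") + "'"
        # Note: column names are interpolated as-is and not validated here.
        clauses.append(f"{key} = {literal}")
    return " AND ".join(clauses)


print(where_to_sql({"source": "handbook", "chunk_index": 2}))
```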

delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) -> None

Delete documents by ids or where filter.

Parameters:

Name Type Description Default
ids list[str] | None

Optional list of document IDs.

None
where dict[str, Any] | None

Optional metadata filter.

None

Raises:

Type Description
ValueError

If neither ids nor where is provided.
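
The guard above can be illustrated with a sketch that builds a SQL-style delete predicate. The function name build_delete_predicate and the id column name are assumptions, only string equality is handled for where, and joining both conditions with AND when ids and where are passed together is a choice made for this example rather than documented behaviour.

```python
from __future__ import annotations

from typing import Any


def _quote(value: Any) -> str:
    """Render a value as a single-quoted SQL string literal."""
    return "'" + str(value).replace("'", "''") + "'"


def build_delete_predicate(
    ids: list[str] | None = None,
    where: dict[str, Any] | None = None,
) -> str:
    """Hypothetical predicate builder for a delete_documents-style call."""
    if not ids and not where:
        raise ValueError("Either ids or where must be provided")
    parts = []
    if ids:
        # Assumption: the ID column is named "id".
        parts.append("id IN (" + ", ".join(_quote(i) for i in ids) + ")")
    if where:
        parts.extend(f"{key} = {_quote(value)}" for key, value in where.items())
    return " AND ".join(parts)


print(build_delete_predicate(ids=["doc-1", "doc-2"]))
print(build_delete_predicate(where={"file_path": "docs/intro.md"}))
```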

delete_by_file_path(file_path: str) -> int

Delete all documents with the given file_path metadata.

Parameters:

Name Type Description Default
file_path str

File path to match.

required

Returns:

Type Description
int

Number of documents deleted.

get_document_count() -> int

Return the number of documents (rows) in the table.

Returns:

Type Description
int

Non-negative integer count of rows.

get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) -> dict[str, Any]

Retrieve documents by ids, where filter, or full scan with limit.

Parameters:

Name Type Description Default
ids list[str] | None

Optional list of IDs.

None
where dict[str, Any] | None

Optional metadata filter.

None
limit int | None

Optional maximum number of documents.

None

Returns:

Type Description
dict[str, Any]

Dict with ids, documents, metadatas.
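
As an illustration of the returned shape, the sketch below pivots row dicts into a dict with ids, documents, and metadatas keys. The column names id and text are assumptions, as is the choice of flat (non-nested) lists; the metadata field names are the ones listed in the class description.

```python
from __future__ import annotations

from typing import Any

# Metadata fields named in the VectorStore description.
METADATA_KEYS = ("file_path", "section", "chunk_index", "source", "format")


def rows_to_result(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical conversion from table rows to a column-oriented dict."""
    result: dict[str, Any] = {"ids": [], "documents": [], "metadatas": []}
    for row in rows:
        result["ids"].append(row["id"])          # assumed column name
        result["documents"].append(row["text"])  # assumed column name
        # Collect only the known metadata fields present on the row.
        result["metadatas"].append(
            {key: row[key] for key in METADATA_KEYS if key in row}
        )
    return result


rows = [
    {"id": "doc-1", "text": "Hello", "file_path": "docs/a.md", "chunk_index": 0},
    {"id": "doc-2", "text": "World", "file_path": "docs/a.md", "chunk_index": 1},
]
print(rows_to_result(rows))
```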

reset() -> None

Drop and recreate the table (all data removed).

backup_to_gcs(backup_name: str | None = None) -> str | None

No-op when using a GCS URI, since the data is already in GCS. Returns the URI or None.

restore_from_gcs(backup_name: str | None = None, gcs_prefix: str | None = None) -> int

Reconnect to the store. When the URI is a GCS URI, the data is already current. Returns the document count.

sync_to_gcs(gcs_prefix: str = 'lancedb') -> dict | None

When using a GCS URI, sync is implicit. Returns a status dict or None.

list_gcs_backups() -> list[str]

LanceDB on GCS has no discrete backups, so this returns an empty list.