Vector store
thoth.shared.vector_store
Vector store module for managing document embeddings using LanceDB.
This module provides a wrapper around LanceDB for storing and querying document embeddings with CRUD operations and native GCS support.
logger = setup_logger(__name__)
module-attribute
VectorStore
Vector store for document embeddings using LanceDB.
Provides add/search/delete/get operations for document chunks with metadata (file_path, section, chunk_index, source, format). Supports local paths or GCS via gs:// URIs. Uses an Embedder for query and document embeddings; defaults to sentence-transformers all-MiniLM-L6-v2.
collection_name = collection_name
instance-attribute
logger = logger_instance or logger
instance-attribute
embedder = embedder or Embedder(model_name='all-MiniLM-L6-v2', logger_instance=(self.logger))
instance-attribute
uri = f'gs://{gcs_bucket_name}/{path}'
instance-attribute
db = lancedb.connect(self.uri)
instance-attribute
table = self.db.create_table(self.collection_name, schema=schema, mode='create')
instance-attribute
__init__(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)
Initialize the LanceDB vector store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| persist_directory | str | Local path or base path for LanceDB. Ignored when gcs_bucket_name is set (then URI is gs://bucket/lancedb or override). | './lancedb' |
| collection_name | str | Name of the table (collection). | 'thoth_documents' |
| embedder | Embedder \| None | Optional Embedder instance. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created. | None |
| gcs_bucket_name | str \| None | Optional GCS bucket; when set, store uses gs://bucket/... | None |
| gcs_project_id | str \| None | Optional GCP project ID (unused; kept for API compatibility). | None |
| gcs_prefix_override | str \| None | Optional GCS path under bucket (e.g. lancedb_batch_xyz). When set with gcs_bucket_name, URI is gs://bucket/gcs_prefix_override. | None |
| logger_instance | Logger \| LoggerAdapter \| None | Optional logger instance. | None |
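The URI rules in the table above can be sketched as a small standalone function. This is only an illustration of the documented behavior, not the library's code: `resolve_uri` is a hypothetical helper name, and the handling of an empty bucket name is an assumption.

```python
from typing import Optional


def resolve_uri(
    persist_directory: str = "./lancedb",
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix_override: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented URI selection rules."""
    if gcs_bucket_name:
        # With a bucket set, persist_directory is ignored. The path under the
        # bucket is the override when given, otherwise the default "lancedb".
        path = gcs_prefix_override or "lancedb"
        return f"gs://{gcs_bucket_name}/{path}"
    # Without a bucket, the store lives at the local path.
    return persist_directory


print(resolve_uri())
print(resolve_uri(gcs_bucket_name="my-bucket"))
print(resolve_uri(gcs_bucket_name="my-bucket", gcs_prefix_override="lancedb_batch_xyz"))
```

The resulting string corresponds to the `uri` attribute that is passed to `lancedb.connect`.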
add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) -> None
Add or update documents in the table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| documents | list[str] | List of document texts. | required |
| metadatas | list[dict[str, Any]] \| None | Optional list of metadata dicts per document. | None |
| ids | list[str] \| None | Optional list of IDs; auto-generated if not provided. | None |
| embeddings | list[list[float]] \| None | Optional pre-computed embeddings. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If list lengths do not match. |
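The length check and ID auto-generation described above might look roughly like the following sketch. `prepare_rows` is a hypothetical helper, the row keys (`id`, `text`, `metadata`, `vector`) are assumed names, and UUIDs are only one plausible ID scheme; the real implementation may differ on all three.

```python
import uuid
from typing import Any, Optional


def prepare_rows(
    documents: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    ids: Optional[list[str]] = None,
    embeddings: Optional[list[list[float]]] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: validate inputs and build one row per document."""
    n = len(documents)
    # Every optional list that is provided must match documents in length.
    for name, values in (("metadatas", metadatas), ("ids", ids), ("embeddings", embeddings)):
        if values is not None and len(values) != n:
            raise ValueError(f"{name} has {len(values)} items, expected {n}")
    if metadatas is None:
        metadatas = [{} for _ in documents]
    if ids is None:
        # Assumption: random UUIDs stand in for the auto-generated IDs.
        ids = [str(uuid.uuid4()) for _ in documents]
    rows = []
    for i, text in enumerate(documents):
        row: dict[str, Any] = {"id": ids[i], "text": text, "metadata": metadatas[i]}
        if embeddings is not None:
            # Pre-computed embeddings skip the embedding step for that row.
            row["vector"] = embeddings[i]
        rows.append(row)
    return rows
```

Validating before any write means a mismatched call fails without partially updating the table.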
search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) -> dict[str, Any]
Search for similar documents by embedding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Query text. | required |
| n_results | int | Maximum number of results. | 5 |
| where | dict[str, Any] \| None | Optional metadata filter (Chroma-style dict). | None |
| where_document | dict[str, Any] \| None | Unused; kept for API compatibility. | None |
| query_embedding | list[float] \| None | Optional pre-computed query embedding. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dict with ids, documents, metadatas, distances. |
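LanceDB filters are expressed as SQL strings, so a Chroma-style `where` dict has to be translated at some point. The sketch below shows one way that could work for flat equality filters only. `where_to_sql` is a hypothetical name, Chroma operators such as `$and` or `$in` are not handled, and nothing here is confirmed to match what this module does internally.

```python
from typing import Any, Optional


def where_to_sql(where: Optional[dict[str, Any]]) -> Optional[str]:
    """Hypothetical helper: turn {"field": value} pairs into a SQL predicate."""
    if not where:
        return None
    clauses = []
    for key, value in where.items():
        if isinstance(value, bool):
            # bool is checked before int because bool is a subclass of int.
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Strings are single-quoted, with embedded quotes doubled.
            literal = "'" + str(value).replace("'", "''") + "'"
        clauses.append(f"{key} = {literal}")
    # All conditions must hold, so the clauses are joined with AND.
    return " AND ".join(clauses)


print(where_to_sql({"source": "handbook", "chunk_index": 0}))
```

A filter such as `{"source": "handbook"}` restricts the search to chunks whose `source` metadata equals `handbook`.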
delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) -> None
Delete documents by ids or where filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | list[str] \| None | Optional list of document IDs. | None |
| where | dict[str, Any] \| None | Optional metadata filter. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If neither ids nor where is provided. |
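The guard against an unfiltered delete can be illustrated with a short sketch that builds a delete predicate. `delete_predicate` and `_quote` are hypothetical helpers, the `id` column name is assumed, and combining `ids` and `where` with AND when both are given is an assumption rather than documented behavior.

```python
from typing import Any, Optional


def _quote(value: Any) -> str:
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def delete_predicate(
    ids: Optional[list[str]] = None,
    where: Optional[dict[str, Any]] = None,
) -> str:
    """Hypothetical helper: build the predicate selecting rows to delete."""
    if not ids and not where:
        # Refusing an empty selection prevents wiping the table by accident.
        raise ValueError("Provide ids or where")
    clauses = []
    if ids:
        clauses.append("id IN (" + ", ".join(_quote(i) for i in ids) + ")")
    if where:
        clauses.extend(f"{key} = {_quote(value)}" for key, value in where.items())
    return " AND ".join(clauses)


print(delete_predicate(ids=["a", "b"]))
```

To remove everything deliberately, `reset()` is the documented route.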
delete_by_file_path(file_path: str) -> int
Delete all documents with the given file_path metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | File path to match. | required |
Returns:
| Type | Description |
|---|---|
| int | Number of documents deleted. |
get_document_count() -> int
Return the number of documents (rows) in the table.
Returns:
| Type | Description |
|---|---|
| int | Non-negative integer count of rows. |
get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) -> dict[str, Any]
Retrieve documents by ids, where filter, or full scan with limit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | list[str] \| None | Optional list of IDs. | None |
| where | dict[str, Any] \| None | Optional metadata filter. | None |
| limit | int \| None | Optional maximum number of documents. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dict with ids, documents, metadatas. |
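As an illustration of the returned shape, the sketch below converts row-oriented records into a dict keyed by `ids`, `documents`, and `metadatas`. It assumes flat, index-aligned lists and hypothetical row keys (`id`, `text`, `metadata`); whether the real method nests or orders its lists this way is not stated in this reference.

```python
from typing import Any, Optional


def rows_to_result(
    rows: list[dict[str, Any]],
    limit: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: reshape rows into the documented result keys."""
    if limit is not None:
        # Apply the cap before reshaping so all three lists stay aligned.
        rows = rows[:limit]
    return {
        "ids": [row["id"] for row in rows],
        "documents": [row["text"] for row in rows],
        "metadatas": [row["metadata"] for row in rows],
    }


rows = [
    {"id": "1", "text": "first chunk", "metadata": {"source": "x"}},
    {"id": "2", "text": "second chunk", "metadata": {"source": "y"}},
]
print(rows_to_result(rows, limit=1))
```

Under this assumption, position `i` in each list refers to the same document.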
reset() -> None
Drop and recreate the table (all data removed).
backup_to_gcs(backup_name: str | None = None) -> str | None
No-op when using GCS URI; data is already in GCS. Returns URI or None.
restore_from_gcs(backup_name: str | None = None, gcs_prefix: str | None = None) -> int
Reconnect to store; when URI is GCS, data is already current. Returns doc count.
sync_to_gcs(gcs_prefix: str = 'lancedb') -> dict | None
When using GCS URI, sync is implicit. Returns status dict or None.
list_gcs_backups() -> list[str]
No discrete backups when using LanceDB on GCS; return empty list.