thoth.shared.vector_store
Vector store module for managing document embeddings using LanceDB.
This module provides a wrapper around LanceDB for storing and querying document embeddings with CRUD operations and native GCS support.
Functions
- Create and configure a logger with structured JSON output.
Classes
- Special type indicating an unconstrained type.
- Generate embeddings from text using sentence-transformers.
- PurePath subclass that can make system calls.
- Vector store for document embeddings using LanceDB.
- class thoth.shared.vector_store.VectorStore(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Bases: object
Vector store for document embeddings using LanceDB.
Provides add/search/delete/get operations for document chunks with metadata (file_path, section, chunk_index, source, format). Supports local paths or GCS via gs:// URIs. Uses an Embedder for query and document embeddings; defaults to sentence-transformers all-MiniLM-L6-v2.
- __init__(persist_directory: str = './lancedb', collection_name: str = 'thoth_documents', embedder: Embedder | None = None, gcs_bucket_name: str | None = None, gcs_project_id: str | None = None, gcs_prefix_override: str | None = None, logger_instance: Logger | LoggerAdapter | None = None)
Initialize the LanceDB vector store.
- Parameters:
persist_directory – Local path or base path for LanceDB. Ignored when gcs_bucket_name is set; the URI is then gs://bucket/lancedb, or gs://bucket/gcs_prefix_override when that override is given.
collection_name – Name of the table (collection).
embedder – Optional Embedder instance. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.
gcs_bucket_name – Optional GCS bucket; when set, store uses gs://bucket/…
gcs_project_id – Optional GCP project ID (unused; kept for API compatibility).
gcs_prefix_override – Optional GCS path under bucket (e.g. lancedb_batch_xyz). When set with gcs_bucket_name, URI is gs://bucket/gcs_prefix_override.
logger_instance – Optional logger instance.
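The URI rules described in the parameters above can be sketched as a small pure function. This is one reading of the documented behavior, not the module's own code: `resolve_store_uri` is a hypothetical helper name, and the handling of stray slashes in the prefix is an assumption.

```python
from typing import Optional


def resolve_store_uri(
    persist_directory: str = "./lancedb",
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix_override: Optional[str] = None,
) -> str:
    """Return the location a store would connect to (hypothetical helper)."""
    if gcs_bucket_name:
        # persist_directory is ignored once a bucket is given.
        # "lancedb" is the documented default path under the bucket.
        prefix = gcs_prefix_override or "lancedb"
        # Assumption: leading/trailing slashes in the prefix are trimmed.
        return f"gs://{gcs_bucket_name}/{prefix.strip('/')}"
    # No bucket: fall back to the local path.
    return persist_directory
```

Note that gcs_project_id plays no part in the result, which matches its description as unused.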
- add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) → None
Add or update documents in the table.
- Parameters:
documents – List of document texts.
metadatas – Optional list of metadata dicts per document.
ids – Optional list of IDs; auto-generated if not provided.
embeddings – Optional pre-computed embeddings.
- Raises:
ValueError – If list lengths do not match.
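The length check and ID auto-generation described above might look like the following sketch. It is illustrative only: `prepare_rows` is a hypothetical helper, the column names `id`, `text`, and `vector` are assumptions, and the source does not say how IDs are generated (UUIDs are used here as a stand-in).

```python
import uuid
from typing import Any, Optional


def prepare_rows(
    documents: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    ids: Optional[list[str]] = None,
    embeddings: Optional[list[list[float]]] = None,
) -> list[dict[str, Any]]:
    """Validate parallel lists and merge them into one row per document."""
    n = len(documents)
    for name, value in (("metadatas", metadatas), ("ids", ids), ("embeddings", embeddings)):
        # Every optional list that is supplied must line up with documents.
        if value is not None and len(value) != n:
            raise ValueError(f"{name} has {len(value)} items, expected {n}")

    # Assumption: missing IDs are filled with random UUID strings.
    ids = ids if ids is not None else [str(uuid.uuid4()) for _ in range(n)]
    metadatas = metadatas if metadatas is not None else [{} for _ in range(n)]

    rows = []
    for i in range(n):
        # Metadata is spread first so it cannot overwrite id or text.
        row = {**metadatas[i], "id": ids[i], "text": documents[i]}
        if embeddings is not None:
            row["vector"] = embeddings[i]
        rows.append(row)
    return rows
```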
- search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) → dict[str, Any]
Search for similar documents by embedding.
- Parameters:
query – Query text.
n_results – Maximum number of results.
where – Optional metadata filter (Chroma-style dict).
where_document – Unused; kept for API compatibility.
query_embedding – Optional pre-computed query embedding.
- Returns:
Dict with ids, documents, metadatas, distances.
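LanceDB expresses filters as SQL predicate strings, so a Chroma-style where dict has to be translated at some point. The source does not show how VectorStore does this; the sketch below is one plausible translation, with `where_to_sql` as a hypothetical name and the supported operator set chosen for illustration.

```python
from typing import Any

# Chroma-style comparison operators mapped to SQL (illustrative subset).
_OPS = {"$eq": "=", "$ne": "!=", "$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<="}


def _literal(value: Any) -> str:
    """Render a Python value as a SQL literal, doubling single quotes."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def where_to_sql(where: dict[str, Any]) -> str:
    """Translate a Chroma-style filter dict into a SQL predicate string."""
    clauses = []
    for key, cond in where.items():
        if key in ("$and", "$or"):
            # Logical operators take a list of sub-filters.
            joiner = " AND " if key == "$and" else " OR "
            clauses.append("(" + joiner.join(where_to_sql(w) for w in cond) + ")")
        elif isinstance(cond, dict):
            for op, value in cond.items():
                if op == "$in":
                    items = ", ".join(_literal(v) for v in value)
                    clauses.append(f"{key} IN ({items})")
                else:
                    clauses.append(f"{key} {_OPS[op]} {_literal(value)}")
        else:
            # A bare value means equality.
            clauses.append(f"{key} = {_literal(cond)}")
    # Multiple top-level keys are combined with AND.
    return " AND ".join(clauses)
```

For instance, `where_to_sql({"file_path": "docs/a.md"})` yields `file_path = 'docs/a.md'`, a string of the kind LanceDB query builders accept in `.where(...)`.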
- delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) → None
Delete documents by ids or where filter.
- Parameters:
ids – Optional list of document IDs.
where – Optional metadata filter.
- Raises:
ValueError – If neither ids nor where is provided.
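The argument check above can be illustrated with a helper that builds a delete predicate. This is a sketch under assumptions, not the module's code: `build_delete_predicate` is hypothetical, the ID column is assumed to be named `id`, only flat equality filters are handled, an empty list is treated as "not provided", and combining ids and where with AND when both are given is a guess.

```python
from typing import Any, Optional


def _quote(value: Any) -> str:
    """Render a Python value as a SQL literal, doubling single quotes."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_delete_predicate(
    ids: Optional[list[str]] = None,
    where: Optional[dict[str, Any]] = None,
) -> str:
    """Build a SQL predicate selecting the rows to delete."""
    if not ids and not where:
        # Mirrors the documented ValueError when nothing selects rows.
        raise ValueError("Provide ids, where, or both")
    parts = []
    if ids:
        parts.append("id IN (" + ", ".join(_quote(i) for i in ids) + ")")
    if where:
        # Assumption: flat {column: value} equality filters only.
        parts.extend(f"{key} = {_quote(value)}" for key, value in where.items())
    return " AND ".join(parts)
```

Refusing to build a predicate from empty input guards against a call that would otherwise match every row.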
- delete_by_file_path(file_path: str) → int
Delete all documents with the given file_path metadata.
- Parameters:
file_path – File path to match.
- Returns:
Number of documents deleted.
- get_document_count() → int
Return the number of documents (rows) in the table.
- Returns:
Non-negative integer count of rows.
- get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) → dict[str, Any]
Retrieve documents by ids, where filter, or full scan with limit.
- Parameters:
ids – Optional list of IDs.
where – Optional metadata filter.
limit – Optional maximum number of documents.
- Returns:
Dict with ids, documents, metadatas.
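A result dict of parallel lists is often easier to consume as one record per document. The sketch below assumes the lists are flat and index-aligned, which the source does not state explicitly; if they turn out to be nested per query, as in Chroma's query results, take the first element of each list before passing the dict in. `iter_records` is a hypothetical helper.

```python
from typing import Any, Iterator


def iter_records(result: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one dict per document from a dict of parallel lists."""
    ids = result.get("ids", [])
    documents = result.get("documents", [])
    metadatas = result.get("metadatas", [])
    # distances is only expected in search results, not in plain retrieval.
    distances = result.get("distances")
    for i, doc_id in enumerate(ids):
        record = {"id": doc_id, "document": documents[i], "metadata": metadatas[i]}
        if distances is not None:
            record["distance"] = distances[i]
        yield record
```

The same helper works for search results, where each record additionally carries its distance.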
- backup_to_gcs(backup_name: str | None = None) → str | None
No-op when the store uses a GCS URI, because the data already lives in GCS. Returns the URI or None.
- restore_from_gcs(backup_name: str | None = None, gcs_prefix: str | None = None) → int
Reconnect to the store. When the URI is a GCS URI, the data is already current. Returns the document count.