thoth.ingestion.vector_store
Vector store module for managing document embeddings using ChromaDB.
This module provides a wrapper around ChromaDB for storing and querying document embeddings with CRUD operations.
Classes
Any | Special type indicating an unconstrained type.
Embedder | Generate embeddings from text using sentence-transformers.
Path | PurePath subclass that can make system calls.
VectorStore | Vector store for managing document embeddings using ChromaDB.
- class thoth.ingestion.vector_store.VectorStore(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)
Bases: object
Vector store for managing document embeddings using ChromaDB.
Provides CRUD operations for document storage and similarity search.
- __init__(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)
Initialize the ChromaDB vector store.
- Parameters:
persist_directory – Directory path for ChromaDB persistence
collection_name – Name of the ChromaDB collection
embedder – Optional Embedder instance for generating embeddings. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.
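The optional embedder parameter follows a common "build the default only when none is supplied" pattern. The sketch below illustrates that pattern with stand-ins; `StubEmbedder` and `StoreSettings` are hypothetical names, not part of thoth, and the real constructor may differ in detail.

```python
from __future__ import annotations


class StubEmbedder:
    """Hypothetical stand-in for the real Embedder; no model is loaded."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model_name = model_name


class StoreSettings:
    """Hypothetical holder mirroring the documented constructor defaults."""

    def __init__(
        self,
        persist_directory: str = "./chroma_db",
        collection_name: str = "thoth_documents",
        embedder: StubEmbedder | None = None,
    ) -> None:
        self.persist_directory = persist_directory
        self.collection_name = collection_name
        # Build the default only when the caller did not supply an embedder.
        self.embedder = embedder if embedder is not None else StubEmbedder()


# One embedder instance can be shared by several stores.
shared = StubEmbedder()
docs = StoreSettings(collection_name="docs", embedder=shared)
notes = StoreSettings(collection_name="notes", embedder=shared)
default = StoreSettings()
```

If the real Embedder holds a loaded model, passing one instance to several stores in this way would likely avoid loading the model more than once.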
- add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) → None
Add documents to the vector store.
- Parameters:
documents – List of document texts to add
metadatas – Optional list of metadata dicts for each document
ids – Optional list of unique IDs for each document. If not provided, IDs will be auto-generated.
embeddings – Optional pre-computed embeddings. If not provided, embeddings will be generated using the configured Embedder.
- Raises:
ValueError – If list lengths don’t match
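The length check and ID auto-generation described above can be sketched as follows. `prepare_batch` is a hypothetical helper, and the use of UUIDs for generated IDs is an assumption; the source does not say how thoth generates them.

```python
from __future__ import annotations

import uuid
from typing import Any


def prepare_batch(
    documents: list[str],
    metadatas: list[dict[str, Any]] | None = None,
    ids: list[str] | None = None,
    embeddings: list[list[float]] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate the parallel lists and fill in missing IDs."""
    expected = len(documents)
    optional_lists = (("metadatas", metadatas), ("ids", ids), ("embeddings", embeddings))
    for name, values in optional_lists:
        # Every optional list that is supplied must line up with documents.
        if values is not None and len(values) != expected:
            raise ValueError(f"{name} has {len(values)} items, expected {expected}")
    if ids is None:
        # Assumption: any unique string is acceptable as an ID.
        ids = [str(uuid.uuid4()) for _ in range(expected)]
    return {
        "documents": documents,
        "metadatas": metadatas,
        "ids": ids,
        "embeddings": embeddings,
    }
```

Supplying explicit ids is usually preferable when documents may be re-ingested, since generated IDs cannot be used to find an earlier copy of the same document.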
- search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) → dict[str, Any]
Search for similar documents using semantic similarity.
- Parameters:
query – Query text to search for
n_results – Number of results to return (default: 5)
where – Optional metadata filter conditions
where_document – Optional document content filter conditions
query_embedding – Optional pre-computed query embedding. If not provided, embedding will be generated from the query text.
- Returns:
Dict containing:
ids: List of document IDs
documents: List of document texts
metadatas: List of metadata dicts
distances: List of distance scores
- Return type:
dict[str, Any]
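The result is a dict of parallel lists, so a caller usually needs to pair them up per hit. The sketch below does that with a hypothetical helper, `list_hits`, run on made-up sample data. It assumes all four keys are populated. Whether thoth returns flat lists or keeps ChromaDB's raw nesting (one inner list per query) is not stated here, so the helper accepts both.

```python
from __future__ import annotations

from typing import Any


def list_hits(results: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: turn parallel result lists into one dict per hit."""
    ids = results.get("ids") or []
    # Raw ChromaDB query results nest each list one level deeper.
    nested = bool(ids) and isinstance(ids[0], list)

    def column(key: str) -> list[Any]:
        values = results.get(key) or []
        return values[0] if nested and values else values

    rows = zip(column("ids"), column("documents"), column("metadatas"), column("distances"))
    return [
        {"id": doc_id, "document": text, "metadata": meta, "distance": dist}
        for doc_id, text, meta, dist in rows
    ]


# Made-up sample shaped like the documented return value.
sample = {
    "ids": ["doc-1", "doc-2"],
    "documents": ["Install with pip.", "Configure the collection name."],
    "metadatas": [{"source": "readme"}, {"source": "guide"}],
    "distances": [0.12, 0.47],
}

for hit in list_hits(sample):
    print(f'{hit["distance"]:.2f}  {hit["id"]}  {hit["document"]}')
```

Distances measure dissimilarity, so a smaller value indicates a closer match.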
- delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) → None
Delete documents from the vector store.
- Parameters:
ids – Optional list of document IDs to delete
where – Optional metadata filter for documents to delete
- Raises:
ValueError – If neither ids nor where is provided
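The guard described under Raises can be sketched as below. `build_delete_request` is a hypothetical helper that shows only the argument check, not thoth's actual implementation. The sample filter uses ChromaDB's `$in` operator, on the assumption that where is forwarded to ChromaDB unchanged.

```python
from __future__ import annotations

from typing import Any


def build_delete_request(
    ids: list[str] | None = None,
    where: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: refuse a delete that names no target."""
    if ids is None and where is None:
        raise ValueError("Provide ids, where, or both")
    request: dict[str, Any] = {}
    if ids is not None:
        request["ids"] = ids
    if where is not None:
        request["where"] = where
    return request


by_id = build_delete_request(ids=["doc-1", "doc-2"])
by_filter = build_delete_request(where={"source": {"$in": ["readme", "guide"]}})
```

Requiring at least one selector presumably prevents a bare call from clearing the whole collection by accident.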
- get_document_count() → int
Get the total number of documents in the collection.
- Returns:
Number of documents in the collection
- get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) → dict[str, Any]
Retrieve documents from the vector store.
- Parameters:
ids – Optional list of document IDs to retrieve
where – Optional metadata filter
limit – Optional maximum number of documents to return
- Returns:
Dict containing:
ids: List of document IDs
documents: List of document texts
metadatas: List of metadata dicts
- Return type:
dict[str, Any]
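To illustrate how ids, where, and limit might combine, the sketch below applies them to an in-memory list of records. It is not thoth's implementation: `matches` and `select` are hypothetical, and the filter handles only a small subset of ChromaDB's where syntax (plain equality, `$eq`, `$in`, `$and`, `$or`), assuming thoth passes where through to ChromaDB unchanged.

```python
from __future__ import annotations

from typing import Any


def matches(metadata: dict[str, Any], where: dict[str, Any]) -> bool:
    """Hypothetical matcher for a subset of ChromaDB-style metadata filters."""
    for key, cond in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in cond)
        elif isinstance(cond, dict) and "$in" in cond:
            ok = metadata.get(key) in cond["$in"]
        elif isinstance(cond, dict) and "$eq" in cond:
            ok = metadata.get(key) == cond["$eq"]
        else:
            # Plain value: shorthand for equality.
            ok = metadata.get(key) == cond
        if not ok:
            return False
    return True


def select(
    records: list[dict[str, Any]],
    ids: list[str] | None = None,
    where: dict[str, Any] | None = None,
    limit: int | None = None,
) -> dict[str, Any]:
    """Hypothetical selection that returns the documented dict of parallel lists."""
    picked = [
        r
        for r in records
        if (ids is None or r["id"] in ids)
        and (where is None or matches(r["metadata"], where))
    ]
    if limit is not None:
        picked = picked[:limit]
    return {
        "ids": [r["id"] for r in picked],
        "documents": [r["document"] for r in picked],
        "metadatas": [r["metadata"] for r in picked],
    }


# Made-up records for illustration.
records = [
    {"id": "a", "document": "Intro", "metadata": {"source": "readme", "lang": "en"}},
    {"id": "b", "document": "Setup", "metadata": {"source": "guide", "lang": "en"}},
    {"id": "c", "document": "Einleitung", "metadata": {"source": "readme", "lang": "de"}},
]
english_readme = select(records, where={"$and": [{"source": "readme"}, {"lang": "en"}]})
first_two = select(records, limit=2)
```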