thoth.ingestion.vector_store

Vector store module for managing document embeddings using ChromaDB.

This module wraps ChromaDB to store and query document embeddings, and exposes CRUD operations on the stored documents.

Classes

`Any`(*args, **kwargs)	Special type indicating an unconstrained type.
`Embedder`([model_name, device, batch_size])	Generate embeddings from text using sentence-transformers.
`Path`(*args, **kwargs)	PurePath subclass that can make system calls.
`Settings`([_env_file, _env_file_encoding, ...])
`VectorStore`([persist_directory, ...])	Vector store for managing document embeddings using ChromaDB.

class thoth.ingestion.vector_store.VectorStore(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)

Bases: object

Vector store for managing document embeddings using ChromaDB.

Provides CRUD operations for document storage and similarity search.

__init__(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)

Initialize the ChromaDB vector store.

Parameters:

persist_directory – Directory path for ChromaDB persistence
collection_name – Name of the ChromaDB collection
embedder – Optional Embedder instance for generating embeddings. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.

add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) → None

Add documents to the vector store.

Parameters:

documents – List of document texts to add
metadatas – Optional list of metadata dicts for each document
ids – Optional list of unique IDs for each document. If not provided, IDs will be auto-generated.
embeddings – Optional pre-computed embeddings. If not provided, embeddings will be generated using the configured Embedder.

Raises:

ValueError – If list lengths don’t match
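
The length check and ID auto-generation described above can be pictured with a minimal sketch. The helper `prepare_batch` is hypothetical and is not part of `thoth`. Using random UUIDs for generated IDs is an assumption, since the documentation does not say how IDs are produced.

```python
import uuid
from typing import Any, Optional


def prepare_batch(
    documents: list[str],
    metadatas: Optional[list[dict[str, Any]]] = None,
    ids: Optional[list[str]] = None,
    embeddings: Optional[list[list[float]]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate parallel lists and fill in missing IDs."""
    n = len(documents)
    optional_lists = (("metadatas", metadatas), ("ids", ids), ("embeddings", embeddings))
    for name, values in optional_lists:
        # Every optional list that is supplied must line up one-to-one with documents.
        if values is not None and len(values) != n:
            raise ValueError(f"{name} has {len(values)} items, expected {n}")
    if ids is None:
        # Assumption: random UUIDs stand in for whatever scheme the library uses.
        ids = [str(uuid.uuid4()) for _ in documents]
    return {"documents": documents, "metadatas": metadatas, "ids": ids, "embeddings": embeddings}


batch = prepare_batch(
    ["first text", "second text"],
    metadatas=[{"source": "a.md"}, {"source": "b.md"}],
)
```

Passing explicit, stable IDs is usually preferable when documents may be re-ingested, because generated IDs differ on every run.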

search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) → dict[str, Any]

Search for similar documents using semantic similarity.

Parameters:

query – Query text to search for
n_results – Number of results to return (default: 5)
where – Optional metadata filter conditions
where_document – Optional document content filter conditions
query_embedding – Optional pre-computed query embedding. If not provided, embedding will be generated from the query text.

Returns:

Dict containing:

ids: List of document IDs
documents: List of document texts
metadatas: List of metadata dicts
distances: List of distance scores

Return type:

dict[str, Any]
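
To make the idea of ranking by distance concrete, here is a brute-force sketch over a small in-memory list. The names `cosine_distance` and `toy_search` are hypothetical and not part of `thoth` or ChromaDB. Cosine distance is an assumption: the actual metric depends on how the ChromaDB collection is configured, and the exact nesting of the real return value may differ from this flat layout.

```python
import math
from typing import Any


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar.

    Assumes both vectors are non-zero and have the same length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def toy_search(
    query_embedding: list[float],
    store: list[dict[str, Any]],
    n_results: int = 5,
) -> dict[str, Any]:
    """Hypothetical brute-force search: score every record, keep the closest."""
    scored = [(cosine_distance(query_embedding, rec["embedding"]), rec) for rec in store]
    scored.sort(key=lambda pair: pair[0])  # ascending distance: best match first
    top = scored[:n_results]
    return {
        "ids": [rec["id"] for _, rec in top],
        "documents": [rec["document"] for _, rec in top],
        "metadatas": [rec["metadata"] for _, rec in top],
        "distances": [dist for dist, _ in top],
    }


store = [
    {"id": "doc-1", "document": "install guide", "metadata": {"section": "setup"}, "embedding": [1.0, 0.0]},
    {"id": "doc-2", "document": "api reference", "metadata": {"section": "api"}, "embedding": [0.0, 1.0]},
    {"id": "doc-3", "document": "quick start", "metadata": {"section": "setup"}, "embedding": [0.8, 0.6]},
]
results = toy_search([1.0, 0.0], store, n_results=2)
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the shape of the answer (parallel lists ordered from closest to farthest) is the same idea.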

delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) → None

Delete documents from the vector store.

Parameters:

ids – Optional list of document IDs to delete
where – Optional metadata filter for documents to delete

Raises:

ValueError – If neither ids nor where is provided
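
The guard against an unfiltered delete can be sketched with an in-memory dictionary. `toy_delete` is a hypothetical stand-in, not the library's implementation. Two assumptions are made here: `where` is treated as simple equality on metadata keys, and when both `ids` and `where` are given a record must satisfy both to be removed.

```python
from typing import Any, Optional


def toy_delete(
    store: dict[str, dict[str, Any]],
    ids: Optional[list[str]] = None,
    where: Optional[dict[str, Any]] = None,
) -> None:
    """Hypothetical in-memory delete; keys are document IDs, values are metadata."""
    if ids is None and where is None:
        # Mirrors the documented guard: refuse to delete without any criterion.
        raise ValueError("Either ids or where must be provided")
    for doc_id in list(store):  # copy the keys so entries can be removed while looping
        matches_ids = ids is None or doc_id in ids
        matches_where = where is None or all(
            store[doc_id].get(key) == value for key, value in where.items()
        )
        if matches_ids and matches_where:
            del store[doc_id]


store = {
    "doc-1": {"source": "a.md"},
    "doc-2": {"source": "b.md"},
    "doc-3": {"source": "a.md"},
}
toy_delete(store, where={"source": "a.md"})
```

Requiring at least one criterion keeps an accidental call with no arguments from emptying the collection; `reset()` exists for that purpose.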

get_document_count() → int

Get the total number of documents in the collection.

Returns:

Number of documents in the collection

get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) → dict[str, Any]

Retrieve documents from the vector store.

Parameters:

ids – Optional list of document IDs to retrieve
where – Optional metadata filter
limit – Optional maximum number of documents to return

Returns:

Dict containing:

ids: List of document IDs
documents: List of document texts
metadatas: List of metadata dicts

Return type:

dict[str, Any]
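
A result shaped like the one described above is often easier to work with once it is regrouped into one record per document. The sketch below uses a hand-written sample dictionary rather than a real call, and assumes the three lists are parallel (index `i` in each list refers to the same document).

```python
from typing import Any

# Hypothetical sample shaped like the documented return value.
result: dict[str, Any] = {
    "ids": ["doc-1", "doc-2"],
    "documents": ["install guide", "api reference"],
    "metadatas": [{"section": "setup"}, {"section": "api"}],
}

# zip() pairs each ID with its text and metadata, relying on the lists being parallel.
records = [
    {"id": doc_id, "text": text, "metadata": meta}
    for doc_id, text, meta in zip(result["ids"], result["documents"], result["metadatas"])
]

# Example lookup built from the regrouped records: section name -> document ID.
by_section = {rec["metadata"]["section"]: rec["id"] for rec in records}
```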

reset() → None

Reset the collection by deleting all documents.

Warning: This operation cannot be undone.
