thoth.ingestion.vector_store

Vector store module for managing document embeddings using ChromaDB.

This module provides a wrapper around ChromaDB for storing and querying document embeddings with CRUD operations.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

Embedder([model_name, device, batch_size])

Generate embeddings from text using sentence-transformers.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

Settings([_env_file, _env_file_encoding, ...])

VectorStore([persist_directory, ...])

Vector store for managing document embeddings using ChromaDB.

class thoth.ingestion.vector_store.VectorStore(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)

Bases: object

Vector store for managing document embeddings using ChromaDB.

Provides CRUD operations for document storage and similarity search.

__init__(persist_directory: str = './chroma_db', collection_name: str = 'thoth_documents', embedder: Embedder | None = None)

Initialize the ChromaDB vector store.

Parameters:
  • persist_directory – Directory path for ChromaDB persistence

  • collection_name – Name of the ChromaDB collection

  • embedder – Optional Embedder instance for generating embeddings. If not provided, a default Embedder with all-MiniLM-L6-v2 will be created.
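A minimal sketch of constructing a store with a non-default persistence directory, using only the constructor parameters listed above. The helper name `open_store` and the `data_dir` layout are hypothetical, and the import is deferred so the helper can be defined even where the package is not installed.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def open_store(data_dir: str, collection_name: str = "thoth_documents") -> Any:
    """Create a VectorStore that persists under <data_dir>/chroma_db.

    `open_store` and the directory layout are illustrative assumptions,
    not part of the documented API.
    """
    # Deferred import: only the documented class and parameters are used.
    from thoth.ingestion.vector_store import VectorStore

    persist_directory = str(Path(data_dir) / "chroma_db")
    # No embedder is passed, so the documented default Embedder is used.
    return VectorStore(
        persist_directory=persist_directory,
        collection_name=collection_name,
    )
```

Passing `persist_directory` as a string matches the documented `str` type; converting from `Path` at the call site keeps path handling in one place.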

add_documents(documents: list[str], metadatas: list[dict[str, Any]] | None = None, ids: list[str] | None = None, embeddings: list[list[float]] | None = None) → None

Add documents to the vector store.

Parameters:
  • documents – List of document texts to add

  • metadatas – Optional list of metadata dicts for each document

  • ids – Optional list of unique IDs for each document. If not provided, IDs will be auto-generated.

  • embeddings – Optional pre-computed embeddings. If not provided, embeddings will be generated using the configured Embedder.

Raises:

ValueError – If metadatas, ids, or embeddings is provided and its length does not match the length of documents
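The contract above (parallel lists, optional IDs, a ValueError on mismatched lengths) can be illustrated with a standalone helper. This is a sketch of the documented behaviour, not the library's implementation: `prepare_batch` is a hypothetical name, and generating IDs with `uuid4` is an assumption, since the actual ID scheme is not documented here.

```python
from __future__ import annotations

import uuid
from typing import Any


def prepare_batch(
    documents: list[str],
    metadatas: list[dict[str, Any]] | None = None,
    ids: list[str] | None = None,
    embeddings: list[list[float]] | None = None,
) -> dict[str, Any]:
    """Validate parallel lists and fill in missing IDs (illustrative only)."""
    expected = len(documents)
    optional_lists = (
        ("metadatas", metadatas),
        ("ids", ids),
        ("embeddings", embeddings),
    )
    for name, values in optional_lists:
        # Every list that is provided must have one entry per document.
        if values is not None and len(values) != expected:
            raise ValueError(
                f"{name} has {len(values)} items, expected {expected}"
            )
    if ids is None:
        # Assumption: random UUID strings stand in for auto-generated IDs.
        ids = [str(uuid.uuid4()) for _ in documents]
    return {
        "documents": documents,
        "metadatas": metadatas,
        "ids": ids,
        "embeddings": embeddings,
    }
```

Supplying explicit, stable IDs is usually preferable when the same document may be ingested more than once, because auto-generated IDs cannot be used to detect duplicates.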

search_similar(query: str, n_results: int = 5, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, query_embedding: list[float] | None = None) → dict[str, Any]

Search for similar documents using semantic similarity.

Parameters:
  • query – Query text to search for

  • n_results – Number of results to return (default: 5)

  • where – Optional metadata filter conditions

  • where_document – Optional document content filter conditions

  • query_embedding – Optional pre-computed query embedding. If not provided, embedding will be generated from the query text.

Returns:

Dict containing:

  • ids: List of document IDs

  • documents: List of document texts

  • metadatas: List of metadata dicts

  • distances: List of distance scores

Return type:

dict[str, Any]
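A small sketch of turning the returned dict into one record per hit. It assumes the four lists are flat and aligned by position, as described above, and that a smaller distance means a closer match; `top_hits` and `max_distance` are hypothetical names, and `store` is any object exposing the documented `search_similar` method.

```python
from __future__ import annotations

from typing import Any


def top_hits(
    store: Any,
    query: str,
    n_results: int = 5,
    max_distance: float | None = None,
) -> list[dict[str, Any]]:
    """Run a similarity search and return one dict per matching document."""
    results = store.search_similar(query, n_results=n_results)
    # Assumption: the four lists are flat and aligned by position.
    hits = [
        {"id": doc_id, "text": text, "metadata": metadata, "distance": distance}
        for doc_id, text, metadata, distance in zip(
            results["ids"],
            results["documents"],
            results["metadatas"],
            results["distances"],
        )
    ]
    if max_distance is not None:
        # Assumption: lower distance means more similar.
        hits = [hit for hit in hits if hit["distance"] <= max_distance]
    return hits
```

If the underlying collection returns nested lists (one inner list per query), the result would need to be unwrapped before zipping.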

delete_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None) → None

Delete documents from the vector store.

Parameters:
  • ids – Optional list of document IDs to delete

  • where – Optional metadata filter for documents to delete

Raises:

ValueError – If neither ids nor where is provided
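A sketch of deleting by metadata filter rather than by ID. The metadata key `source` and the helper name `delete_by_source` are hypothetical; the filter uses the simple equality form `{"key": value}`, and `store` is any object exposing the documented `delete_documents` and `get_document_count` methods.

```python
from typing import Any


def delete_by_source(store: Any, source: str) -> int:
    """Delete every document whose metadata 'source' equals `source`.

    Returns the number of documents removed, measured as the change in
    the collection's document count.
    """
    if not source:
        # Guard against an empty filter value before touching the store.
        raise ValueError("source must be a non-empty string")
    before = store.get_document_count()
    # Assumption: documents were added with a 'source' metadata key.
    store.delete_documents(where={"source": source})
    return before - store.get_document_count()
```

Comparing counts before and after gives a quick check that the filter matched what was expected; it is not reliable if other writers modify the collection concurrently.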

get_document_count() → int

Get the total number of documents in the collection.

Returns:

Number of documents in the collection

get_documents(ids: list[str] | None = None, where: dict[str, Any] | None = None, limit: int | None = None) → dict[str, Any]

Retrieve documents from the vector store.

Parameters:
  • ids – Optional list of document IDs to retrieve

  • where – Optional metadata filter

  • limit – Optional maximum number of documents to return

Returns:

Dict containing:

  • ids: List of document IDs

  • documents: List of document texts

  • metadatas: List of metadata dicts

Return type:

dict[str, Any]
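A sketch of looking up document texts by ID with the returned dict. It assumes the `ids` and `documents` lists are aligned by position; `texts_by_id` is a hypothetical helper, and `store` is any object exposing the documented `get_documents` method.

```python
from __future__ import annotations

from typing import Any


def texts_by_id(store: Any, ids: list[str]) -> dict[str, str]:
    """Return a mapping of document ID to document text."""
    result = store.get_documents(ids=ids)
    # Assumption: result["ids"] and result["documents"] are aligned lists.
    # Only IDs present in the result appear in the mapping.
    return dict(zip(result["ids"], result["documents"]))
```

Building the mapping from the returned `ids` rather than the requested ones avoids pairing a text with the wrong ID if the store returns results in a different order or omits IDs it does not hold.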

reset() → None

Reset the collection by deleting all documents.

Warning: This operation cannot be undone.
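Because the operation is irreversible, callers may want an explicit confirmation step. The following is one possible guard, not part of the documented API: `reset_if_confirmed` and its `confirm` flag are hypothetical, and `store` is any object exposing the documented `get_document_count` and `reset` methods.

```python
from typing import Any


def reset_if_confirmed(store: Any, confirm: bool = False) -> int:
    """Reset the collection only when the caller opts in explicitly.

    Returns the number of documents that were in the collection
    immediately before the reset.
    """
    if not confirm:
        # Refuse by default so an accidental call cannot wipe the data.
        raise RuntimeError("reset refused: pass confirm=True to delete all documents")
    removed = store.get_document_count()
    store.reset()
    return removed
```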