thoth.ingestion.embedder

Embedder module for generating document embeddings.

This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.

Classes

`Any`(*args, **kwargs)	Special type indicating an unconstrained type.
`Embedder`([model_name, device, batch_size])	Generate embeddings from text using sentence-transformers.
`SentenceTransformer`([model_name_or_path, ...])	Loads or creates a SentenceTransformer model that can be used to map sentences / text to embeddings.

class thoth.ingestion.embedder.Embedder(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)

Generate embeddings from text using sentence-transformers.

Supports batch processing with progress tracking for efficient embedding generation.

__init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)

Initialize the Embedder with a sentence-transformers model.

Parameters:

model_name – Name of the sentence-transformers model to use. Default is ‘all-MiniLM-L6-v2’ for a good balance of speed and quality. Other options: ‘all-mpnet-base-v2’ (higher quality, slower).
device – Device to use for inference (‘cuda’, ‘cpu’, or None for auto-detect).
batch_size – Number of texts to process in each batch (default: 32).
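
To make the batch_size parameter concrete, here is a minimal sketch of how a list of texts can be split into batches of at most batch_size items. The helper name iter_batches is hypothetical and is not part of thoth; how Embedder batches internally is not shown in this reference, so treat this as an illustration of the idea only.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 70 chunks with the default batch size give two full batches and one partial.
chunks = [f"chunk {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(chunks, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

Larger batches generally reduce per-call overhead at the cost of more memory per batch, so the right value depends on the device in use.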

embed(texts: list[str], show_progress: bool = False, normalize: bool = True) → list[list[float]]

Generate embeddings for a list of texts.

Parameters:

texts – List of text strings to embed.
show_progress – Whether to show a progress bar during batch processing.
normalize – Whether to normalize embeddings to unit length (default: True). With unit-length vectors, cosine similarity reduces to a plain dot product.

Returns:

List of embedding vectors, where each vector is a list of floats.

Raises:

ValueError – If texts list is empty or contains empty/whitespace-only strings.
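
The effect of the normalize option can be illustrated with plain Python lists. The sketch below uses short stand-in vectors rather than real model output, and the helper names l2_normalize, dot, and cosine_similarity are hypothetical. It shows that once two vectors are scaled to unit length, their dot product matches the cosine similarity of the original vectors.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        # A zero vector has no direction; return a copy unchanged.
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Stand-in vectors, not real embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization, the dot product agrees with cosine similarity.
print(math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b)))  # True
```

This is why normalized vectors are convenient for similarity search: a store that only computes dot products still ranks results by cosine similarity.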

embed_single(text: str, normalize: bool = True) → list[float]

Generate embedding for a single text.

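Parameters:

text – Text string to embed.
normalize – Whether to normalize the embedding to unit length (default: True).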

Returns:

Embedding vector as a list of floats.

Raises:

ValueError – If text is empty.
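
Because embed and embed_single are documented to raise ValueError on empty input, callers may want to filter chunks before embedding them. The following is a minimal caller-side sketch; drop_blank_texts is a hypothetical helper, not part of thoth.

```python
def drop_blank_texts(texts: list[str]) -> list[str]:
    """Keep only texts with at least one non-whitespace character.

    Hypothetical pre-filter for illustration; not part of the library.
    """
    return [text for text in texts if text.strip()]


raw_chunks = ["First paragraph.", "", "   \n", "Second paragraph."]
clean_chunks = drop_blank_texts(raw_chunks)
print(clean_chunks)  # ['First paragraph.', 'Second paragraph.']

# An empty result would still be rejected by embed(), so check it first.
has_content = len(clean_chunks) > 0
```

Filtering changes list positions, so keep the original indices alongside the texts if the embeddings must be mapped back to their source chunks.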

get_embedding_dimension() → int

Get the dimension of embeddings produced by this model.
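
One common use for the embedding dimension is a sanity check before writing vectors to a store with a fixed vector size. The sketch below assumes the expected dimension has already been obtained (for example from get_embedding_dimension()) and uses short stand-in vectors. The helper check_dimensions is hypothetical.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have `expected_dim` entries.

    Hypothetical helper for illustration; not part of the library.
    """
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, "
                f"expected {expected_dim}"
            )


# Stand-in vectors with 3 entries each; real embeddings are much longer.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
check_dimensions(vectors, expected_dim=3)  # passes silently

try:
    check_dimensions(vectors + [[0.7, 0.8]], expected_dim=3)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A mismatch usually means the vectors were produced by a different model than the one the store was created for.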

get_model_info() → dict[str, Any]

Get information about the loaded model.

Returns:

Dictionary containing model metadata.

Return type:

dict[str, Any]