thoth.ingestion.embedder
Embedder module for generating document embeddings.
This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.
Classes
Any | Special type indicating an unconstrained type.
Embedder | Generate embeddings from text using sentence-transformers.
SentenceTransformer | Loads or creates a SentenceTransformer model that can be used to map sentences / text to embeddings.
- class thoth.ingestion.embedder.Embedder(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)
Bases: object
Generate embeddings from text using sentence-transformers.
Supports batch processing with progress tracking for efficient embedding generation.
- __init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)
Initialize the Embedder with a sentence-transformers model.
- Parameters:
model_name – Name of the sentence-transformers model to use. Defaults to ‘all-MiniLM-L6-v2’, which offers a good balance of speed and quality. ‘all-mpnet-base-v2’ is a higher-quality but slower alternative.
device – Device to use for inference (‘cuda’, ‘cpu’, or None for auto-detect).
batch_size – Number of texts to process in each batch (default: 32).
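To make the batch_size parameter concrete, here is a minimal sketch of how a list of texts can be split into batches of at most batch_size items. The helper name iter_batches is hypothetical and is not part of thoth; the actual batching inside Embedder may be delegated to sentence-transformers and could differ in detail.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long.

    Hypothetical helper for illustration only; not part of thoth.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(chunks, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

A larger batch_size generally means fewer model calls at the cost of more memory per call, so the last batch is simply whatever remains.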
- embed(texts: list[str], show_progress: bool = False, normalize: bool = True) → list[list[float]]
Generate embeddings for a list of texts.
- Parameters:
texts – List of text strings to embed.
show_progress – Whether to show a progress bar during batch processing.
normalize – Whether to normalize embeddings to unit length (default: True). For unit-length embeddings, the dot product of two vectors equals their cosine similarity.
- Returns:
List of embedding vectors, where each vector is a list of floats.
- Raises:
ValueError – If texts list is empty or contains empty/whitespace-only strings.
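The effect of the normalize flag can be illustrated without loading a model. The sketch below uses plain Python lists and hypothetical helper names (l2_normalize, dot, cosine_similarity) to show that, once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity. It is not the library's implementation.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# Both expressions evaluate to 24/25 = 0.96, up to floating-point rounding.
print(cosine_similarity(a, b))
print(dot(a_unit, b_unit))
```

This is why a vector store configured for dot-product search can behave like a cosine-similarity store when all stored embeddings are normalized.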
- embed_single(text: str, normalize: bool = True) → list[float]
Generate embedding for a single text.
- Parameters:
text – Text string to embed.
normalize – Whether to normalize embedding to unit length (default: True).
- Returns:
Embedding vector as a list of floats.
- Raises:
ValueError – If text is empty.
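Because embed is documented to raise ValueError for an empty list or for empty/whitespace-only strings, a caller may want to filter chunks before embedding so that one bad chunk does not abort a whole batch. The following is a minimal caller-side sketch; split_valid_texts is a hypothetical helper, not part of thoth, and the exact check performed inside Embedder is an assumption based on the documented error conditions.

```python
def split_valid_texts(texts: list[str]) -> tuple[list[str], list[int]]:
    """Return (usable texts, indexes of empty or whitespace-only entries).

    Hypothetical helper for illustration only; not part of thoth.
    """
    valid: list[str] = []
    rejected: list[int] = []
    for index, text in enumerate(texts):
        if text.strip():
            valid.append(text)
        else:
            rejected.append(index)
    return valid, rejected


chunks = ["First chunk.", "", "   \n", "Second chunk."]
valid, rejected = split_valid_texts(chunks)
print(valid)     # ['First chunk.', 'Second chunk.']
print(rejected)  # [1, 2]
```

If valid comes back empty, the caller should skip the embed call entirely, since an empty list is itself one of the documented error conditions.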
- get_embedding_dimension() → int
Get the dimension of embeddings produced by this model.
- Returns:
Integer dimension of the embedding vectors.
- get_model_info() → dict[str, Any]
Get information about the loaded model.
- Returns:
model_name: Name of the model
embedding_dimension: Dimension of embeddings
max_seq_length: Maximum sequence length the model can handle
device: Device the model is running on
batch_size: Configured batch size for processing
- Return type:
Dictionary containing model metadata
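One possible use of this metadata is to confirm that the model's embedding dimension matches the dimension a vector index was created with before writing to it. The sketch below assumes a dictionary with the keys documented above. check_index_dimension is a hypothetical helper, and the values in sample_info are illustrative rather than output captured from the library.

```python
from typing import Any


def check_index_dimension(info: dict[str, Any], index_dimension: int) -> None:
    """Raise ValueError if the model's dimension differs from the index's.

    Hypothetical helper for illustration only; not part of thoth.
    """
    model_dimension = info["embedding_dimension"]
    if model_dimension != index_dimension:
        raise ValueError(
            f"{info['model_name']} produces {model_dimension}-dimensional "
            f"vectors, but the index expects {index_dimension}"
        )


# Illustrative values shaped like the documented get_model_info() result.
sample_info: dict[str, Any] = {
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "max_seq_length": 256,
    "device": "cpu",
    "batch_size": 32,
}

check_index_dimension(sample_info, 384)  # matching dimensions: no error

try:
    check_index_dimension(sample_info, 768)
except ValueError as error:
    message = str(error)
    print(message)
```

Running a check like this at startup surfaces a model/index mismatch early, for example after switching model_name to a model with a different output size.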