Embedder
thoth.shared.embedder
Embedder module for generating document embeddings.
This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.
logger = setup_logger(__name__) (module attribute)
Embedder
Generate embeddings from text using sentence-transformers.
Supports batch processing with progress tracking for efficient embedding generation.
model_name = model_name (instance attribute)
batch_size = batch_size (instance attribute)
logger = logger_instance or logger (instance attribute)
model = SentenceTransformer(model_name, device=device) (instance attribute)
__init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)
Initialize the Embedder with a sentence-transformers model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Name of the sentence-transformers model to use. Default is 'all-MiniLM-L6-v2' for a good balance of speed and quality. Other options: 'all-mpnet-base-v2' (higher quality, slower). | `'all-MiniLM-L6-v2'` |
| `device` | `str \| None` | Device to use for inference ('cuda', 'cpu', or None for auto-detect). | `None` |
| `batch_size` | `int` | Number of texts to process in each batch (default: 32). | `32` |
| `logger_instance` | `Logger \| LoggerAdapter \| None` | Optional logger instance to use. | `None` |
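The `device` argument is passed on to `SentenceTransformer`, and this page does not say how `None` is resolved. As a rough sketch only, auto-detection commonly amounts to preferring CUDA when PyTorch reports it as available and falling back to the CPU otherwise. The helper `resolve_device` below is hypothetical and is not part of `thoth.shared.embedder`.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or pick one when device is None.

    Hypothetical helper: it illustrates a common auto-detect rule,
    not the library's actual logic.
    """
    if device is not None:
        return device  # an explicit choice such as "cpu" or "cuda" wins
    try:
        import torch  # only needed for the auto-detect branch
    except ImportError:
        return "cpu"  # assumption: without PyTorch, fall back to the CPU
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("cpu"))  # an explicit device is returned unchanged
print(resolve_device())       # "cuda" or "cpu", depending on the machine
```

Passing an explicit device keeps behavior reproducible across machines, while `None` is convenient for code that should run on whatever hardware is present.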
embed(texts: list[str], show_progress: bool = False, normalize: bool = True) -> list[list[float]]
Generate embeddings for a list of texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | List of text strings to embed. | *required* |
| `show_progress` | `bool` | Whether to show a progress bar during batch processing. | `False` |
| `normalize` | `bool` | Whether to normalize embeddings to unit length (default: True). Normalized embeddings work better with cosine similarity. | `True` |
Returns:
| Type | Description |
|---|---|
| `list[list[float]]` | List of embedding vectors, where each vector is a list of floats. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If texts list is empty or contains empty/whitespace-only strings. |
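The input checks and the batching described for `embed` can be sketched without loading a model. The functions `validate_texts` and `iter_batches` are hypothetical names chosen for illustration; the real method may order its checks differently or delegate batching to sentence-transformers.

```python
from collections.abc import Iterator


def validate_texts(texts: list[str]) -> None:
    """Raise ValueError under the conditions documented for embed()."""
    if not texts:
        raise ValueError("texts must not be empty")
    for index, text in enumerate(texts):
        if not text.strip():
            raise ValueError(f"text at index {index} is empty or whitespace-only")


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices holding at most batch_size texts each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"chunk {i}" for i in range(70)]
validate_texts(texts)
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=32)]
print(batch_lengths)  # [32, 32, 6]
```

With 70 texts and a batch size of 32, the last batch holds the 6 leftover texts. Rejecting blank strings up front avoids producing vectors for chunks that carry no content.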
embed_single(text: str, normalize: bool = True) -> list[float]
Generate embedding for a single text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text string to embed. | *required* |
| `normalize` | `bool` | Whether to normalize embedding to unit length (default: True). | `True` |
Returns:
| Type | Description |
|---|---|
| `list[float]` | Embedding vector as a list of floats. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If text is empty. |
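To make the effect of `normalize` concrete, the sketch below scales two small hand-written vectors to unit length and shows that their plain dot product then equals their cosine similarity. The vectors and the helpers `l2_normalize`, `dot`, and `cosine_similarity` are illustrative stand-ins written in pure Python; they are not output from a model and not the library's implementation.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# For unit-length vectors the denominator of cosine similarity is 1,
# so the dot product alone gives the same value.
print(math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b)))  # True
```

This is why stores that rank by dot product can be used for cosine search when the embeddings were normalized at creation time.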
get_embedding_dimension() -> int
Get the dimension of embeddings produced by this model.
Returns:
| Type | Description |
|---|---|
| `int` | Integer dimension of the embedding vectors. |
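One plausible use of the reported dimension is checking vectors before they are written to a fixed-width store. In the sketch below, `check_dimensions` is a hypothetical helper, and `expected_dim` is hard-coded to 3 so the example runs without a model; in practice the value would come from `get_embedding_dimension()`.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from expected_dim."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )


expected_dim = 3  # stand-in for embedder.get_embedding_dimension()
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
check_dimensions(vectors, expected_dim)  # passes silently

try:
    check_dimensions([[0.1, 0.2]], expected_dim)
except ValueError as error:
    message = str(error)
    print(message)  # vector 0 has 2 dimensions, expected 3
```

A check like this catches the case where an index was built with one model and later queried with another whose vectors have a different width.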
get_model_info() -> dict[str, Any]
Get information about the loaded model.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary containing model metadata: `model_name` (name of the model), `embedding_dimension` (dimension of embeddings), `max_seq_length` (maximum sequence length the model can handle), `device` (device the model is running on), `batch_size` (configured batch size for processing). |
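The shape of the returned dictionary can be written down as a `TypedDict` for readers who want static checking. The method itself is typed as `dict[str, Any]`, so the value types below are assumptions, `ModelInfo` and `summarize` are hypothetical names, and the values in `example_info` are illustrative rather than captured from a real call.

```python
from typing import TypedDict


class ModelInfo(TypedDict):
    """Assumed value types for the keys documented for get_model_info()."""

    model_name: str
    embedding_dimension: int
    max_seq_length: int
    device: str
    batch_size: int


def summarize(info: ModelInfo) -> str:
    """Format the metadata as a single log-friendly line."""
    return (
        f"{info['model_name']} on {info['device']}: "
        f"dim={info['embedding_dimension']}, "
        f"max_seq_length={info['max_seq_length']}, "
        f"batch_size={info['batch_size']}"
    )


# Illustrative values only, not the output of a real get_model_info() call.
example_info: ModelInfo = {
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "max_seq_length": 256,
    "device": "cpu",
    "batch_size": 32,
}

print(summarize(example_info))
# all-MiniLM-L6-v2 on cpu: dim=384, max_seq_length=256, batch_size=32
```

Logging a line like this alongside stored embeddings records which model and settings produced them.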