Embedder

thoth.shared.embedder

Embedder module for generating document embeddings.

This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.

logger = setup_logger(__name__) module-attribute

Embedder

Generate embeddings from text using sentence-transformers.

Supports batch processing with progress tracking for efficient embedding generation.

model_name = model_name instance-attribute

batch_size = batch_size instance-attribute

logger = logger_instance or logger instance-attribute

model = SentenceTransformer(model_name, device=device) instance-attribute

__init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)

Initialize the Embedder with a sentence-transformers model.

Parameters:

model_name (str, default: 'all-MiniLM-L6-v2'): Name of the sentence-transformers model to use. The default, 'all-MiniLM-L6-v2', offers a good balance of speed and quality. Other options: 'all-mpnet-base-v2' (higher quality, slower).

device (str | None, default: None): Device to use for inference ('cuda', 'cpu', or None for auto-detect).

batch_size (int, default: 32): Number of texts to process in each batch.

logger_instance (Logger | LoggerAdapter | None, default: None): Optional logger instance to use.
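To illustrate what batch_size controls, here is a minimal sketch of splitting a list of texts into batches. The helper `iter_batches` is hypothetical and not part of this module; the real class presumably leaves batching to the underlying sentence-transformers model.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long.

    Hypothetical helper for illustration only; not part of thoth.shared.embedder.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 70 texts with the default batch size of 32 give two full batches and a remainder.
texts = [f"chunk {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=32)]
print(batch_lengths)  # [32, 32, 6]
```

A larger batch_size generally means fewer model calls at the cost of more memory per call, so the best value depends on the device in use.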

embed(texts: list[str], show_progress: bool = False, normalize: bool = True) -> list[list[float]]

Generate embeddings for a list of texts.

Parameters:

texts (list[str], required): List of text strings to embed.

show_progress (bool, default: False): Whether to show a progress bar during batch processing.

normalize (bool, default: True): Whether to normalize embeddings to unit length. With unit-length embeddings, cosine similarity can be computed as a plain dot product.
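The effect of normalize can be sketched in plain Python. The functions below are illustrative stand-ins, not this module's implementation: they show that once two vectors are scaled to unit length, their dot product equals the cosine similarity of the original vectors.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit vectors the dot product and the cosine similarity coincide.
assert math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b))
```

This is why a vector store that ranks by inner product can be used for cosine-style search when embeddings are generated with normalize=True.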

Returns:

list[list[float]]: List of embedding vectors, where each vector is a list of floats.

Raises:

ValueError: If texts list is empty or contains empty/whitespace-only strings.
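Because embed rejects empty input, callers may want to check or clean their text chunks first. The sketch below mirrors the documented error conditions; `validate_texts` and `drop_blank` are hypothetical helpers, and the actual checks and error messages inside the module may differ.

```python
def validate_texts(texts: list[str]) -> None:
    """Raise ValueError under the conditions documented for embed().

    Hypothetical helper; the wording of the real error messages is an assumption.
    """
    if not texts:
        raise ValueError("texts must not be empty")
    for index, text in enumerate(texts):
        if not text.strip():
            raise ValueError(f"texts[{index}] is empty or whitespace-only")


def drop_blank(texts: list[str]) -> list[str]:
    """Return only the texts that contain non-whitespace characters."""
    return [text for text in texts if text.strip()]


raw_chunks = ["first chunk", "   ", "", "second chunk"]
cleaned = drop_blank(raw_chunks)
validate_texts(cleaned)  # passes: no blank entries remain
print(cleaned)  # ['first chunk', 'second chunk']
```

Filtering before the call avoids losing a whole batch to a single blank chunk, though the caller then has to keep track of which original chunks were dropped.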

embed_single(text: str, normalize: bool = True) -> list[float]

Generate embedding for a single text.

Parameters:

text (str, required): Text string to embed.

normalize (bool, default: True): Whether to normalize embedding to unit length.

Returns:

list[float]: Embedding vector as a list of floats.

Raises:

ValueError: If text is empty.

get_embedding_dimension() -> int

Get the dimension of embeddings produced by this model.

Returns:

int: Integer dimension of the embedding vectors.
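The reported dimension is useful for sanity-checking vectors before they are stored; for the default 'all-MiniLM-L6-v2' model it is 384. The sketch below relies only on the two method signatures documented on this page. `SupportsEmbedding` and `embed_checked` are hypothetical names introduced for illustration.

```python
from typing import Protocol


class SupportsEmbedding(Protocol):
    """Structural type covering the two documented methods used below."""

    def embed(
        self, texts: list[str], show_progress: bool = False, normalize: bool = True
    ) -> list[list[float]]: ...

    def get_embedding_dimension(self) -> int: ...


def embed_checked(embedder: SupportsEmbedding, texts: list[str]) -> list[list[float]]:
    """Embed texts and verify every vector has the dimension the model reports."""
    expected = embedder.get_embedding_dimension()
    vectors = embedder.embed(texts)
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise RuntimeError(
                f"vector {index} has length {len(vector)}, expected {expected}"
            )
    return vectors
```

With the real class this would presumably be called as embed_checked(Embedder(), chunks), with Embedder imported from thoth.shared.embedder.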

get_model_info() -> dict[str, Any]

Get information about the loaded model.

Returns:

dict[str, Any]: Dictionary containing model metadata:

- model_name: Name of the model
- embedding_dimension: Dimension of embeddings
- max_seq_length: Maximum sequence length the model can handle
- device: Device the model is running on
- batch_size: Configured batch size for processing
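The shape of the returned dictionary can be sketched as a TypedDict. The key names come from the description above; the value types and the example values are assumptions, since the documented return type is only dict[str, Any]. `ModelInfo` and `summarize` are hypothetical names.

```python
from typing import TypedDict


class ModelInfo(TypedDict):
    """Assumed value types for the documented keys of get_model_info()."""

    model_name: str
    embedding_dimension: int
    max_seq_length: int
    device: str
    batch_size: int


def summarize(info: ModelInfo) -> str:
    """Format model metadata as a single log-friendly line."""
    return (
        f"{info['model_name']} ({info['embedding_dimension']}-dim, "
        f"max {info['max_seq_length']} tokens) on {info['device']}, "
        f"batch size {info['batch_size']}"
    )


# Illustrative values only; not captured from a real get_model_info() call.
example: ModelInfo = {
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "max_seq_length": 256,
    "device": "cpu",
    "batch_size": 32,
}
print(summarize(example))
```

Logging a line like this at startup makes it easier to confirm which model and device an indexing run actually used.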