Embedder

`thoth.shared.embedder` ¶

Embedder module for generating document embeddings.

This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.

`logger = setup_logger(__name__)` `module-attribute` ¶

`Embedder` ¶

Generate embeddings from text using sentence-transformers.

Supports batch processing with progress tracking for efficient embedding generation.

`model_name = model_name` `instance-attribute` ¶

`batch_size = batch_size` `instance-attribute` ¶

`logger = logger_instance or logger` `instance-attribute` ¶

`model = SentenceTransformer(model_name, device=device)` `instance-attribute` ¶

`__init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32, logger_instance: logging.Logger | logging.LoggerAdapter | None = None)` ¶

Initialize the Embedder with a sentence-transformers model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `str` | Name of the sentence-transformers model to use. Default is 'all-MiniLM-L6-v2' for a good balance of speed and quality. Other options: 'all-mpnet-base-v2' (higher quality, slower). | `'all-MiniLM-L6-v2'` |
| `device` | `str \| None` | Device to use for inference ('cuda', 'cpu', or None for auto-detect). | `None` |
| `batch_size` | `int` | Number of texts to process in each batch (default: 32). | `32` |
| `logger_instance` | `Logger \| LoggerAdapter \| None` | Optional logger instance to use. | `None` |
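
To make the `batch_size` parameter concrete, the following is a minimal sketch of how a list of texts can be split into fixed-size batches. The helper `iter_batches` is hypothetical and not part of `thoth.shared.embedder`; this page does not document how the class batches internally, so treat it as an illustration of the idea only.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 70 texts with the default batch size of 32 give batches of 32, 32, and 6.
texts = [f"chunk {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(texts)]
```

A larger `batch_size` generally means fewer model calls at the cost of more memory per call, so the right value depends on the device and the text lengths.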

`embed(texts: list[str], show_progress: bool = False, normalize: bool = True) -> list[list[float]]` ¶

Generate embeddings for a list of texts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `list[str]` | List of text strings to embed. | required |
| `show_progress` | `bool` | Whether to show a progress bar during batch processing. | `False` |
| `normalize` | `bool` | Whether to normalize embeddings to unit length (default: True). Normalized embeddings work better with cosine similarity. | `True` |

Returns:

| Type | Description |
| --- | --- |
| `list[list[float]]` | List of embedding vectors, where each vector is a list of floats. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If texts list is empty or contains empty/whitespace-only strings. |
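
The `normalize` flag matters because, for unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with hand-written two-dimensional vectors; `l2_normalize`, `dot`, and `cosine_similarity` are hypothetical helpers written for this example, not functions of this module.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the dot product alone gives the cosine similarity.
similarity_from_units = dot(l2_normalize(a), l2_normalize(b))
similarity_direct = cosine_similarity(a, b)
```

Both values agree up to floating-point rounding, which is why vector stores that rank by dot product can be used directly on normalized embeddings.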

`embed_single(text: str, normalize: bool = True) -> list[float]` ¶

Generate embedding for a single text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Text string to embed. | required |
| `normalize` | `bool` | Whether to normalize embedding to unit length (default: True). | `True` |

Returns:

| Type | Description |
| --- | --- |
| `list[float]` | Embedding vector as a list of floats. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If text is empty. |
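
Callers that may receive empty input can handle the documented `ValueError` explicitly. The sketch below is one possible pattern, not part of the module: `SingleTextEmbedder` and `embed_or_none` are hypothetical names, and the protocol only mirrors the `embed_single` signature shown above so that the example runs without loading a model.

```python
from __future__ import annotations

from typing import Protocol


class SingleTextEmbedder(Protocol):
    """Structural type mirroring the documented `embed_single` signature."""

    def embed_single(self, text: str, normalize: bool = True) -> list[float]: ...


def embed_or_none(embedder: SingleTextEmbedder, text: str) -> list[float] | None:
    """Return the embedding, or None when the embedder rejects the text."""
    try:
        return embedder.embed_single(text)
    except ValueError:
        # Documented failure mode: the text is empty.
        return None
```

Any object with a matching `embed_single` method satisfies the protocol, so an `Embedder` instance could be passed as the first argument.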

`get_embedding_dimension() -> int` ¶

Get the dimension of embeddings produced by this model.

Returns:

| Type | Description |
| --- | --- |
| `int` | Integer dimension of the embedding vectors. |
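
One common use of the embedding dimension is to check vectors before writing them to a fixed-width index. The following sketch assumes `expected_dim` would come from `get_embedding_dimension()`; the function `check_dimensions` is hypothetical and exists only for this illustration.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have `expected_dim` entries."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, expected {expected_dim}"
            )


# Passes silently: both vectors have three entries.
check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3)
```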

`get_model_info() -> dict[str, Any]` ¶

Get information about the loaded model.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Dictionary containing model metadata. |

The dictionary contains the following keys:

- `model_name`: Name of the model
- `embedding_dimension`: Dimension of embeddings
- `max_seq_length`: Maximum sequence length the model can handle
- `device`: Device the model is running on
- `batch_size`: Configured batch size for processing
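
The metadata dictionary is convenient for logging which model produced a set of embeddings. The sketch below formats the five documented keys into one line; `summarize_model_info` is a hypothetical helper, and the values in `example_info` are illustrative placeholders rather than the output of a real call.

```python
from typing import Any

# The keys documented for the dictionary returned by get_model_info().
EXPECTED_KEYS = (
    "model_name",
    "embedding_dimension",
    "max_seq_length",
    "device",
    "batch_size",
)


def summarize_model_info(info: dict[str, Any]) -> str:
    """Format the documented metadata keys as a single log-friendly line."""
    missing = [key for key in EXPECTED_KEYS if key not in info]
    if missing:
        raise KeyError(f"model info is missing keys: {missing}")
    return ", ".join(f"{key}={info[key]}" for key in EXPECTED_KEYS)


# Illustrative values only; a real call may report different numbers.
example_info = {
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "max_seq_length": 256,
    "device": "cpu",
    "batch_size": 32,
}
summary = summarize_model_info(example_info)
```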
