thoth.ingestion.embedder

Embedder module for generating document embeddings.

This module provides the Embedder class for generating embeddings from text chunks using sentence-transformers models with batch processing support.

Classes

Embedder([model_name, device, batch_size])

Generate embeddings from text using sentence-transformers.

class thoth.ingestion.embedder.Embedder(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)

Bases: object

Generate embeddings from text using sentence-transformers.

Supports batch processing with progress tracking for efficient embedding generation.

__init__(model_name: str = 'all-MiniLM-L6-v2', device: str | None = None, batch_size: int = 32)

Initialize the Embedder with a sentence-transformers model.

Parameters:
  • model_name – Name of the sentence-transformers model to use. Defaults to ‘all-MiniLM-L6-v2’, which balances speed and quality. ‘all-mpnet-base-v2’ is a higher-quality but slower alternative.

  • device – Device to use for inference (‘cuda’, ‘cpu’, or None for auto-detect).

  • batch_size – Number of texts to process in each batch (default: 32).
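
As a rough illustration of what batch_size controls, the sketch below splits a list of texts into consecutive batches. The helper name iter_batches is hypothetical and not part of thoth. Since sentence-transformers' encode method accepts its own batch_size argument, the class may simply pass the value through rather than slice the list itself.

```python
from collections.abc import Iterator


def iter_batches(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size texts (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 70 chunks with the default batch size give two full batches and one partial one.
chunks = [f"chunk {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(chunks, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

Larger batches usually improve throughput on a GPU at the cost of more memory per forward pass.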

embed(texts: list[str], show_progress: bool = False, normalize: bool = True) -> list[list[float]]

Generate embeddings for a list of texts.

Parameters:
  • texts – List of text strings to embed.

  • show_progress – Whether to show a progress bar during batch processing.

  • normalize – Whether to normalize embeddings to unit length (default: True). Normalized embeddings work better with cosine similarity.
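
To make the normalize option concrete: once vectors are scaled to unit length, their plain dot product equals their cosine similarity, so downstream comparisons need no extra division. The sketch below shows this in pure Python. The function names are illustrative and are not taken from thoth.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit vectors the dot product and the cosine similarity coincide.
same = math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b))
print(same)  # True
```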

Returns:

List of embedding vectors, where each vector is a list of floats.

Raises:

ValueError – If the texts list is empty or contains empty or whitespace-only strings.
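
The documented error conditions could be checked along the following lines. This is a minimal sketch of such validation, not the library's actual code; validate_texts and its error messages are assumptions.

```python
def validate_texts(texts: list[str]) -> None:
    """Raise ValueError for the inputs the documentation says embed() rejects."""
    if not texts:
        raise ValueError("texts must not be empty")
    for index, text in enumerate(texts):
        # str.strip() removes all surrounding whitespace, so a
        # whitespace-only string becomes "" and is treated as empty.
        if not text.strip():
            raise ValueError(f"text at index {index} is empty or whitespace-only")


# Callers can drop blank chunks up front instead of catching the error.
raw_chunks = ["First chunk.", "   ", "", "Second chunk."]
clean_chunks = [chunk for chunk in raw_chunks if chunk.strip()]
validate_texts(clean_chunks)  # passes silently
print(clean_chunks)  # ['First chunk.', 'Second chunk.']
```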

embed_single(text: str, normalize: bool = True) -> list[float]

Generate an embedding for a single text.

Parameters:
  • text – Text string to embed.

  • normalize – Whether to normalize the embedding to unit length (default: True).

Returns:

Embedding vector as a list of floats.

Raises:

ValueError – If text is empty.
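
A typical way to combine embed and embed_single is to embed a corpus once and then score individual queries against it. The sketch below relies only on the two signatures documented above, expressed as a Protocol. ToyEmbedder is a hypothetical stand-in that counts vowels so the example runs without downloading a model. In real use, an Embedder instance would be passed instead.

```python
import math
from typing import Protocol


class EmbedderLike(Protocol):
    """The subset of the documented Embedder interface this sketch relies on."""

    def embed(self, texts: list[str], show_progress: bool = False,
              normalize: bool = True) -> list[list[float]]: ...

    def embed_single(self, text: str, normalize: bool = True) -> list[float]: ...


def rank_by_similarity(embedder: EmbedderLike, query: str,
                       documents: list[str]) -> list[tuple[str, float]]:
    """Rank documents by dot product with the query, highest score first.

    With normalize=True the dot product equals cosine similarity.
    """
    query_vec = embedder.embed_single(query, normalize=True)
    doc_vecs = embedder.embed(documents, normalize=True)
    scored = [(doc, sum(q * d for q, d in zip(query_vec, vec)))
              for doc, vec in zip(documents, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


class ToyEmbedder:
    """Stand-in for illustration only: a 5-dim vector of vowel counts."""

    def embed_single(self, text: str, normalize: bool = True) -> list[float]:
        counts = [float(text.count(vowel)) for vowel in "aeiou"]
        norm = math.sqrt(sum(c * c for c in counts))
        if normalize and norm > 0.0:
            return [c / norm for c in counts]
        return counts

    def embed(self, texts: list[str], show_progress: bool = False,
              normalize: bool = True) -> list[list[float]]:
        return [self.embed_single(text, normalize) for text in texts]


ranked = rank_by_similarity(ToyEmbedder(), "aa", ["eee", "aae", "aaa"])
print([doc for doc, _ in ranked])  # ['aaa', 'aae', 'eee']
```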

get_embedding_dimension() -> int

Get the dimension of embeddings produced by this model.

Returns:

Integer dimension of the embedding vectors.
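
One plausible use of the reported dimension is to verify vectors before writing them to a fixed-width index. In the sketch below, check_dimensions is a hypothetical helper, and expected_dim stands in for the value returned by get_embedding_dimension(). The toy vectors use a dimension of 3 purely for readability.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have the expected length."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, expected {expected_dim}"
            )


# In real use: expected_dim = embedder.get_embedding_dimension()
expected_dim = 3
check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim)  # passes silently
```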

get_model_info() -> dict[str, Any]

Get information about the loaded model.

Returns:

Dictionary containing model metadata, with the following keys:

  • model_name: Name of the model

  • embedding_dimension: Dimension of embeddings

  • max_seq_length: Maximum sequence length the model can handle

  • device: Device the model is running on

  • batch_size: Configured batch size for processing

Return type:

dict[str, Any]
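
The metadata dictionary lends itself to logging or to a quick sanity check at startup. The sketch below formats a dictionary with the five documented keys. summarize_model_info is a hypothetical helper, and the sample values are illustrative rather than captured from a real call (the 384 and 256 figures follow the published model card for all-MiniLM-L6-v2).

```python
from typing import Any

# The keys listed in the documentation above.
EXPECTED_KEYS = ("model_name", "embedding_dimension", "max_seq_length",
                 "device", "batch_size")


def summarize_model_info(info: dict[str, Any]) -> str:
    """Return a one-line summary, failing loudly if a documented key is absent."""
    missing = [key for key in EXPECTED_KEYS if key not in info]
    if missing:
        raise KeyError(f"model info is missing keys: {missing}")
    return (f"{info['model_name']} on {info['device']}: "
            f"{info['embedding_dimension']}-dim vectors, "
            f"max {info['max_seq_length']} tokens, "
            f"batch size {info['batch_size']}")


# Illustrative values only; in real use: info = embedder.get_model_info()
sample_info = {
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dimension": 384,
    "max_seq_length": 256,
    "device": "cpu",
    "batch_size": 32,
}
summary = summarize_model_info(sample_info)
print(summary)
```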