Config

thoth.shared.sources.config

Source configuration for multi-source ingestion.

This module provides configuration management for different data sources (handbook, D&D, personal documents) with configurable GCS locations and supported file formats.

logger = setup_logger(__name__) module-attribute

DEFAULT_SOURCES: dict[str, SourceConfig] = {
    'handbook': SourceConfig(name='handbook', collection_name='handbook_documents', gcs_prefix='handbook', supported_formats=['.md'], description='GitLab Handbook documentation'),
    'dnd': SourceConfig(name='dnd', collection_name='dnd_documents', gcs_prefix='dnd', supported_formats=['.md', '.pdf', '.txt'], description='D&D game materials and rulebooks'),
    'personal': SourceConfig(name='personal', collection_name='personal_documents', gcs_prefix='personal', supported_formats=['.md', '.pdf', '.txt', '.docx'], description='Personal documents and notes'),
} module-attribute

SourceConfig dataclass

Configuration for a single data source (handbook, D&D, personal, etc.).

Each source has a unique name, a LanceDB table (collection) name, a GCS prefix for stored files, and a list of supported file extensions. Used by the ingestion pipeline and MCP server to route and filter documents.

Attributes:

name (str): Unique identifier for the source (e.g., 'handbook', 'dnd', 'personal').

collection_name (str): LanceDB table name for this source (e.g., 'handbook_documents').

gcs_prefix (str): GCS path prefix where source files are stored in the bucket.

supported_formats (list[str]): File extensions supported for ingestion (e.g., ['.md', '.pdf']).

description (str): Human-readable description of the source for logging and UI.

name: str instance-attribute

collection_name: str instance-attribute

gcs_prefix: str instance-attribute

supported_formats: list[str] = field(default_factory=list) class-attribute instance-attribute

description: str = '' class-attribute instance-attribute

__init__(name: str, collection_name: str, gcs_prefix: str, supported_formats: list[str] = list(), description: str = '') -> None

supports_format(extension: str) -> bool

Check if this source supports a file format.

Parameters:

extension (str, required): File extension including the leading dot (e.g., '.md').

Returns:

bool: True if the format is supported.
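The following is a minimal sketch of how the dataclass and supports_format could fit together, reconstructed only from the signatures and attribute descriptions above. The class is a stand-in, not the real thoth.shared.sources.config.SourceConfig, and the plain membership test is an assumption: the actual method may normalize case or a missing leading dot.

```python
from dataclasses import dataclass, field


@dataclass
class SourceConfig:
    """Stand-in mirroring the documented fields; not the real class."""

    name: str
    collection_name: str
    gcs_prefix: str
    # default_factory gives each instance its own empty list.
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""

    def supports_format(self, extension: str) -> bool:
        # Assumption: exact match on the extension, dot included.
        return extension in self.supported_formats


dnd = SourceConfig(
    name="dnd",
    collection_name="dnd_documents",
    gcs_prefix="dnd",
    supported_formats=[".md", ".pdf", ".txt"],
)

pdf_ok = dnd.supports_format(".pdf")
docx_ok = dnd.supports_format(".docx")
```

Under these assumptions, pdf_ok is True and docx_ok is False, because '.docx' is listed only for the 'personal' source in DEFAULT_SOURCES.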

SourceRegistry

Registry for managing data source configurations.

The registry loads the default configurations and supports environment variable overrides for each source's GCS prefix and collection name.

Environment variables

THOTH_SOURCE_{NAME}_GCS_PREFIX: Override the GCS prefix for a source.

THOTH_SOURCE_{NAME}_COLLECTION: Override the collection name for a source.

Example

THOTH_SOURCE_HANDBOOK_GCS_PREFIX=custom_handbook

THOTH_SOURCE_DND_COLLECTION=my_dnd_collection
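As a rough illustration of the override scheme, the sketch below looks up a per-source variable and falls back to the default when it is unset. The helper resolve_override is hypothetical and not part of the module; it assumes the variable name is built from the upper-cased source name, which matches the examples above but is not confirmed by the source.

```python
import os
from collections.abc import Mapping


def resolve_override(
    source_name: str,
    field_suffix: str,
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the override for a source field, or the default if unset.

    Hypothetical helper: the naming rule is inferred from the documented
    examples, not taken from the real SourceRegistry.
    """
    var = f"THOTH_SOURCE_{source_name.upper()}_{field_suffix}"
    return env.get(var, default)


# A plain dict stands in for the process environment here.
fake_env = {"THOTH_SOURCE_HANDBOOK_GCS_PREFIX": "custom_handbook"}

handbook_prefix = resolve_override("handbook", "GCS_PREFIX", "handbook", fake_env)
dnd_collection = resolve_override("dnd", "COLLECTION", "dnd_documents", fake_env)
```

Here handbook_prefix picks up the override, while dnd_collection keeps its default because no matching variable is set.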

__init__() -> None

Initialize the source registry with defaults and overrides.

get(name: str) -> SourceConfig | None

Get source configuration by name.

Parameters:

name (str, required): Source identifier (e.g., 'handbook', 'dnd', 'personal').

Returns:

SourceConfig | None: The SourceConfig if found, None otherwise.

list_sources() -> list[str]

List all registered source names.

Returns:

list[str]: List of source names.

list_configs() -> list[SourceConfig]

List all source configurations.

Returns:

list[SourceConfig]: List of SourceConfig instances.

register(config: SourceConfig) -> None

Register a new source configuration.

Parameters:

config (SourceConfig, required): SourceConfig to register.

Raises:

ValueError: If a source with the same name already exists.
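The contract of get and register described above can be sketched with a dict-backed stand-in. Both classes below are illustrative stand-ins rather than the real implementation, and the dict keyed on source name and the error message text are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceConfig:
    """Stand-in mirroring the documented fields; not the real class."""

    name: str
    collection_name: str
    gcs_prefix: str
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""


class SourceRegistry:
    """Stand-in registry; assumes a dict keyed on source name."""

    def __init__(self) -> None:
        self._sources: dict[str, SourceConfig] = {}

    def get(self, name: str) -> Optional[SourceConfig]:
        # Unknown names yield None instead of raising.
        return self._sources.get(name)

    def register(self, config: SourceConfig) -> None:
        # Duplicate names are rejected, as the Raises section documents.
        if config.name in self._sources:
            raise ValueError(f"Source '{config.name}' already exists")
        self._sources[config.name] = config


registry = SourceRegistry()
wiki = SourceConfig("wiki", "wiki_documents", "wiki", [".md"])
registry.register(wiki)

try:
    registry.register(wiki)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Callers that need to replace an existing entry would use update rather than catching the ValueError from register.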

update(config: SourceConfig) -> None

Update an existing source configuration.

Parameters:

config (SourceConfig, required): SourceConfig with updated values.

get_all_collections() -> list[str]

Get all collection names.

Returns:

list[str]: List of collection names from all sources.
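The three listing methods (list_sources, list_configs, get_all_collections) can be thought of as different views over the same mapping. The sketch below uses a trimmed stand-in dataclass and a plain dict in place of the registry's internal state, both of which are assumptions; the result order shown also assumes the registry preserves insertion order.

```python
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Trimmed stand-in: only the two fields this example needs."""

    name: str
    collection_name: str


# Assumption: configs are stored in a dict keyed by source name.
sources = {
    "handbook": SourceConfig("handbook", "handbook_documents"),
    "dnd": SourceConfig("dnd", "dnd_documents"),
    "personal": SourceConfig("personal", "personal_documents"),
}

source_names = list(sources)                        # cf. list_sources()
configs = list(sources.values())                    # cf. list_configs()
collections = [c.collection_name for c in configs]  # cf. get_all_collections()
```

With the default sources, collections would contain 'handbook_documents', 'dnd_documents', and 'personal_documents', unless a THOTH_SOURCE_{NAME}_COLLECTION override changes one of them.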