Config
thoth.shared.sources.config
Source configuration for multi-source ingestion.
This module provides configuration management for different data sources (handbook, D&D, personal documents) with configurable GCS locations and supported file formats.
logger = setup_logger(__name__)
module-attribute
DEFAULT_SOURCES: dict[str, SourceConfig] = {
    'handbook': SourceConfig(name='handbook', collection_name='handbook_documents', gcs_prefix='handbook', supported_formats=['.md'], description='GitLab Handbook documentation'),
    'dnd': SourceConfig(name='dnd', collection_name='dnd_documents', gcs_prefix='dnd', supported_formats=['.md', '.pdf', '.txt'], description='D&D game materials and rulebooks'),
    'personal': SourceConfig(name='personal', collection_name='personal_documents', gcs_prefix='personal', supported_formats=['.md', '.pdf', '.txt', '.docx'], description='Personal documents and notes'),
}
module-attribute
SourceConfig
dataclass
Configuration for a single data source (handbook, D&D, personal, etc.).
Each source has a unique name, a LanceDB table (collection) name, a GCS prefix for stored files, and a list of supported file extensions. Used by the ingestion pipeline and MCP server to route and filter documents.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Unique identifier for the source (e.g., 'handbook', 'dnd', 'personal'). |
| `collection_name` | `str` | LanceDB table name for this source (e.g., 'handbook_documents'). |
| `gcs_prefix` | `str` | GCS path prefix where source files are stored in the bucket. |
| `supported_formats` | `list[str]` | File extensions supported for ingestion (e.g., ['.md', '.pdf']). |
| `description` | `str` | Human-readable description of the source for logging and UI. |
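To illustrate how the required and optional attributes interact, here is a minimal sketch. `SourceConfigSketch` is a hypothetical stand-in that only mirrors the documented fields; it is not the class from `thoth.shared.sources.config`, and the `notes` and `other` sources are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SourceConfigSketch:
    """Hypothetical stand-in mirroring the documented SourceConfig fields."""

    name: str
    collection_name: str
    gcs_prefix: str
    # default_factory gives every instance its own list instead of a shared one
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""


# Only the three required fields are passed; the rest fall back to defaults.
notes = SourceConfigSketch(
    name="notes",  # illustrative name, not one of the default sources
    collection_name="notes_documents",
    gcs_prefix="notes",
)
other = SourceConfigSketch(
    name="other",
    collection_name="other_documents",
    gcs_prefix="other",
)

# Mutating one instance's list leaves the other instance untouched.
notes.supported_formats.append(".md")
print(notes.supported_formats)  # ['.md']
print(other.supported_formats)  # []
```

Because `supported_formats` defaults to an empty list, a source created this way accepts no formats until extensions are added explicitly.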
name: str
instance-attribute
collection_name: str
instance-attribute
gcs_prefix: str
instance-attribute
supported_formats: list[str] = field(default_factory=list)
class-attribute
instance-attribute
description: str = ''
class-attribute
instance-attribute
__init__(name: str, collection_name: str, gcs_prefix: str, supported_formats: list[str] = list(), description: str = '') -> None
supports_format(extension: str) -> bool
Check if this source supports a file format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `extension` | `str` | File extension including dot (e.g., '.md') | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if format is supported |
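The following sketch shows one way a format check like this could be used to filter candidate files before ingestion. It assumes an exact, case-sensitive membership test, which this page does not confirm; `SourceConfigSketch` and `filter_supported` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class SourceConfigSketch:
    """Hypothetical stand-in for the documented SourceConfig."""

    name: str
    collection_name: str
    gcs_prefix: str
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""

    def supports_format(self, extension: str) -> bool:
        # Assumption: exact, case-sensitive match against the configured list.
        return extension in self.supported_formats


def filter_supported(config: SourceConfigSketch, object_names: list[str]) -> list[str]:
    """Keep only the object names whose extension the source accepts."""
    return [
        name
        for name in object_names
        if config.supports_format(PurePosixPath(name).suffix)
    ]


dnd = SourceConfigSketch(
    name="dnd",
    collection_name="dnd_documents",
    gcs_prefix="dnd",
    supported_formats=[".md", ".pdf", ".txt"],
)

files = ["dnd/rules.pdf", "dnd/map.png", "dnd/notes.md"]
print(filter_supported(dnd, files))  # ['dnd/rules.pdf', 'dnd/notes.md']
```

Note that `PurePosixPath(...).suffix` already includes the leading dot, which matches the extension format the parameter expects.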
SourceRegistry
Registry for managing data source configurations.
The registry loads the default configurations and applies environment variable overrides for each source's GCS prefix and collection name. In the variable names, {NAME} is the source name in uppercase.
Environment variables
THOTH_SOURCE_{NAME}_GCS_PREFIX: Override the GCS prefix for a source
THOTH_SOURCE_{NAME}_COLLECTION: Override the collection name for a source
Example
THOTH_SOURCE_HANDBOOK_GCS_PREFIX=custom_handbook
THOTH_SOURCE_DND_COLLECTION=my_dnd_collection
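The override scheme could be resolved roughly as follows. This is a sketch under assumptions, not the registry's actual code: `SourceConfigSketch` and `apply_env_overrides` are hypothetical names, and the environment is passed in as a plain mapping so the example is deterministic (`os.environ` could be passed instead).

```python
from collections.abc import Mapping
from dataclasses import dataclass, field, replace


@dataclass
class SourceConfigSketch:
    """Hypothetical stand-in for the documented SourceConfig."""

    name: str
    collection_name: str
    gcs_prefix: str
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""


def apply_env_overrides(
    sources: dict[str, SourceConfigSketch],
    environ: Mapping[str, str],
) -> dict[str, SourceConfigSketch]:
    """Return copies of the configs with any matching overrides applied."""
    resolved = {}
    for name, config in sources.items():
        # Assumption: the variable name uses the uppercased source name.
        key = f"THOTH_SOURCE_{name.upper()}"
        resolved[name] = replace(
            config,
            gcs_prefix=environ.get(f"{key}_GCS_PREFIX", config.gcs_prefix),
            collection_name=environ.get(f"{key}_COLLECTION", config.collection_name),
        )
    return resolved


defaults = {
    "handbook": SourceConfigSketch("handbook", "handbook_documents", "handbook", [".md"]),
    "dnd": SourceConfigSketch("dnd", "dnd_documents", "dnd", [".md", ".pdf", ".txt"]),
}
env = {
    "THOTH_SOURCE_HANDBOOK_GCS_PREFIX": "custom_handbook",
    "THOTH_SOURCE_DND_COLLECTION": "my_dnd_collection",
}

resolved = apply_env_overrides(defaults, env)
print(resolved["handbook"].gcs_prefix)   # custom_handbook
print(resolved["dnd"].collection_name)   # my_dnd_collection
```

Using `dataclasses.replace` keeps the default configurations untouched, so the unmodified values remain available for comparison or logging.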
__init__() -> None
Initialize the source registry with defaults and overrides.
get(name: str) -> SourceConfig | None
Get source configuration by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Source identifier (e.g., 'handbook', 'dnd', 'personal') | required |
Returns:
| Type | Description |
|---|---|
| `SourceConfig \| None` | SourceConfig if found, None otherwise |
list_sources() -> list[str]
List all registered source names.
Returns:
| Type | Description |
|---|---|
| `list[str]` | List of source names |
list_configs() -> list[SourceConfig]
List all source configurations.
Returns:
| Type | Description |
|---|---|
| `list[SourceConfig]` | List of SourceConfig instances |
register(config: SourceConfig) -> None
Register a new source configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SourceConfig` | SourceConfig to register | required |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If source with same name already exists |
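The documented contract for `get` and `register` (a `None` result for unknown names, a `ValueError` on duplicates) can be sketched as below. `SourceRegistrySketch` and `SourceConfigSketch` are hypothetical stand-ins; unlike the real registry, this sketch starts empty rather than loading the defaults, and the error message text is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceConfigSketch:
    """Hypothetical stand-in for the documented SourceConfig."""

    name: str
    collection_name: str
    gcs_prefix: str
    supported_formats: list[str] = field(default_factory=list)
    description: str = ""


class SourceRegistrySketch:
    """Hypothetical registry keyed by source name (starts empty here)."""

    def __init__(self) -> None:
        self._sources: dict[str, SourceConfigSketch] = {}

    def get(self, name: str) -> SourceConfigSketch | None:
        # dict.get returns None for unknown names, matching the documented contract
        return self._sources.get(name)

    def register(self, config: SourceConfigSketch) -> None:
        if config.name in self._sources:
            raise ValueError(f"Source '{config.name}' already exists")
        self._sources[config.name] = config


registry = SourceRegistrySketch()
notes = SourceConfigSketch("notes", "notes_documents", "notes", [".md"])
registry.register(notes)

print(registry.get("notes") is notes)  # True
print(registry.get("missing"))         # None

try:
    registry.register(notes)  # same name a second time
except ValueError as exc:
    print(f"rejected: {exc}")
```

Callers that need to change an existing source would use `update` rather than `register`, since `register` refuses to overwrite.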
update(config: SourceConfig) -> None
Update an existing source configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SourceConfig` | SourceConfig with updated values | required |
get_all_collections() -> list[str]
Get all collection names.
Returns:
| Type | Description |
|---|---|
| `list[str]` | List of collection names from all sources |