thoth.shared.utils.logger

Structured Logging Utilities for GCP Cloud Logging and Grafana Loki.

This module provides a structured JSON logging framework that is compatible with:
  • Google Cloud Logging (Cloud Run, GKE, Cloud Functions)
  • Grafana Loki
  • Any JSON-aware log aggregation system

Key Features:
  • Structured JSON output with consistent field schema
  • GCP Cloud Logging special fields (sourceLocation, trace, labels)
  • Job/request correlation via JobLoggerAdapter
  • Automatic sensitive data redaction
  • Verbose source location (file, line, function)
  • Metrics-ready numeric fields for dashboards

Example

>>> from thoth.shared.utils.logger import setup_logger, get_job_logger
>>>
>>> # Basic usage
>>> logger = setup_logger("myapp")
>>> logger.info("Server started", extra={"port": 8080})
>>>
>>> # Job-scoped logging
>>> job_logger = get_job_logger(logger, job_id="job_123", source="handbook")
>>> job_logger.info("Processing file", extra={"file_path": "docs/readme.md"})

Functions

configure_root_logger([level, json_output])

Configure the root logger for the application.

extract_trace_id_from_header(header_value)

Extract trace ID from X-Cloud-Trace-Context header.

get_job_logger(base_logger, job_id[, ...])

Create a job-scoped logger adapter.

get_trace_context()

Get the current trace context.

set_trace_context(trace_id[, project_id])

Set the trace context for the current request/task.

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

GCPStructuredFormatter(*args, **kwargs)

JSON formatter compatible with GCP Cloud Logging and Grafana Loki.

JobLoggerAdapter(logger, job_id[, source, ...])

Logger adapter that automatically includes job context in all log messages.

SecureLogger(name[, level])

Legacy SecureLogger class for backward compatibility.

SensitiveDataFormatter

alias of SimpleFormatter

SimpleFormatter([fmt])

Simple text formatter for local development/debugging.

thoth.shared.utils.logger.set_trace_context(trace_id: str | None, project_id: str | None = None) → None

Set the trace context for the current request/task.

Call this at the start of each request handler to enable log correlation.

Parameters:
  • trace_id – The trace ID from X-Cloud-Trace-Context header

  • project_id – GCP project ID for constructing full trace URL

thoth.shared.utils.logger.get_trace_context() → str | None

Get the current trace context.

thoth.shared.utils.logger.extract_trace_id_from_header(header_value: str | None) → str | None

Extract trace ID from X-Cloud-Trace-Context header.

The header format is: TRACE_ID/SPAN_ID;o=TRACE_TRUE

Parameters:

header_value – The X-Cloud-Trace-Context header value

Returns:

The trace ID portion, or None if header is missing/invalid

class thoth.shared.utils.logger.GCPStructuredFormatter(*args: Any, **kwargs: Any)

Bases: JsonFormatter

JSON formatter compatible with GCP Cloud Logging and Grafana Loki.

This formatter produces structured JSON logs with:
  • Standard fields (timestamp, severity, message, logger)
  • Verbose source location (pathname, filename, lineno, funcName)
  • GCP special fields (sourceLocation, trace, labels)
  • Custom context fields (job_id, source, operation, etc.)
  • Automatic sensitive data redaction

The output is compatible with:
  • GCP Cloud Logging (jsonPayload with special field recognition)
  • Grafana Loki (JSON parsing and label extraction)
  • Any JSON-aware log aggregation system

Example output:

{
  "timestamp": "2026-01-30T10:15:30.123456Z",
  "severity": "INFO",
  "message": "Processing file",
  "logger": "thoth.ingestion.pipeline",
  "pathname": "/app/thoth/ingestion/pipeline.py",
  "filename": "pipeline.py",
  "lineno": 456,
  "funcName": "_process_file",
  "module": "pipeline",
  "logging.googleapis.com/sourceLocation": {
    "file": "thoth/ingestion/pipeline.py",
    "line": "456",
    "function": "_process_file"
  },
  "job_id": "job_xyz789",
  "source": "handbook"
}

SENSITIVE_KEYWORDS: ClassVar[list[str]] = ['password', 'passwd', 'pwd', 'secret', 'token', 'apikey', 'api_key', 'auth', 'authorization', 'credential', 'key', 'private', 'session', 'cookie', 'jwt', 'bearer', 'oauth']

__init__(*args: Any, **kwargs: Any) → None

Initialize the formatter with GCP-compatible settings.

add_fields(log_record: dict[str, Any], record: LogRecord, message_dict: dict[str, Any]) → None

Add custom fields to the JSON log record.

This method is called by python-json-logger to populate the log record. We add all our custom fields here.
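
A stdlib-only sketch of the kind of record this formatter emits may help when reading the example output above. The real class builds on python-json-logger; `MiniGCPFormatter` below is a hypothetical stand-in whose field names follow the documented example, with the rest illustrative:

```python
import json
import logging

class MiniGCPFormatter(logging.Formatter):
    """Illustrative JSON formatter mapping LogRecord fields to GCP conventions."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "severity": record.levelname,  # GCP reads 'severity', not 'levelname'
            "message": record.getMessage(),
            "logger": record.name,
            "logging.googleapis.com/sourceLocation": {
                "file": record.pathname,
                "line": str(record.lineno),
                "function": record.funcName,
            },
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(MiniGCPFormatter())
demo = logging.getLogger("demo")
demo.addHandler(handler)
demo.setLevel(logging.INFO)
demo.info("Processing file")
```

Emitting one JSON object per line like this is what lets Cloud Logging and Loki parse each entry without extra agent configuration.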

class thoth.shared.utils.logger.SimpleFormatter(fmt: str | None = None, **kwargs: Any)

Bases: Formatter

Simple text formatter for local development/debugging.

Uses a human-readable format without JSON structure. Still includes sensitive data redaction.

SENSITIVE_KEYWORDS: ClassVar[list[str]] = ['password', 'passwd', 'pwd', 'secret', 'token', 'apikey', 'api_key', 'auth', 'authorization', 'credential', 'key', 'private', 'session', 'cookie', 'jwt', 'bearer', 'oauth']

__init__(fmt: str | None = None, **kwargs: Any) → None

Initialize with default format if not provided.

format(record: LogRecord) → str

Format the record with sensitive data redaction.
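
The redaction shared by these formatters is keyword-based: a field whose name contains one of SENSITIVE_KEYWORDS gets its value masked. A minimal sketch of the idea (the masking token and exact matching rules are assumptions; the keyword list is abridged from SENSITIVE_KEYWORDS above):

```python
SENSITIVE_KEYWORDS = ["password", "secret", "token", "api_key", "authorization"]

def redact(fields: dict) -> dict:
    """Mask values whose key name contains a sensitive keyword (case-insensitive)."""
    return {
        key: "[REDACTED]"
        if any(word in key.lower() for word in SENSITIVE_KEYWORDS)
        else value
        for key, value in fields.items()
    }

print(redact({"user": "alice", "api_key": "sk-123", "Password": "hunter2"}))
# {'user': 'alice', 'api_key': '[REDACTED]', 'Password': '[REDACTED]'}
```

Matching on key names rather than values keeps the check cheap, but it also means a secret passed under an innocuous key name would not be caught.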

class thoth.shared.utils.logger.JobLoggerAdapter(logger: Logger, job_id: str, source: str | None = None, collection: str | None = None, **extra_context: Any)

Bases: LoggerAdapter

Logger adapter that automatically includes job context in all log messages.

This adapter enriches log messages with job-specific context like job_id, source, and collection. Use this when processing a specific job to ensure all logs can be correlated.

Example

>>> base_logger = setup_logger("thoth.worker")
>>> job_logger = JobLoggerAdapter(base_logger, job_id="job_123", source="handbook")
>>> job_logger.info("Processing started")
>>> job_logger.info("File processed", extra={"file_path": "readme.md"})

__init__(logger: Logger, job_id: str, source: str | None = None, collection: str | None = None, **extra_context: Any) → None

Initialize the job logger adapter.

Parameters:
  • logger – The base logger to wrap

  • job_id – Unique identifier for the job/run

  • source – Source being processed (e.g., “handbook”, “dnd”)

  • collection – Collection name being used

  • **extra_context – Additional context to include in all logs

process(msg: str, kwargs: MutableMapping[str, Any]) → tuple[str, MutableMapping[str, Any]]

Process the log message to include job context.

Parameters:
  • msg – The log message

  • kwargs – Keyword arguments for the log call

Returns:

Tuple of (message, kwargs) with context added to extra

with_operation(operation: str) → JobLoggerAdapter

Create a child logger for a specific operation.

Parameters:

operation – The operation name (e.g., “chunking”, “embedding”, “storing”)

Returns:

A new JobLoggerAdapter with the operation context added
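
The stdlib `logging.LoggerAdapter` already supports this composition pattern. A sketch of how process() and with_operation() might fit together (`MiniJobAdapter` is illustrative, not the module source):

```python
import logging

class MiniJobAdapter(logging.LoggerAdapter):
    def __init__(self, logger, job_id, **extra_context):
        super().__init__(logger, {"job_id": job_id, **extra_context})

    def process(self, msg, kwargs):
        # Merge the adapter's stored context into the call-site extra;
        # call-site keys win on conflict.
        kwargs["extra"] = {**self.extra, **kwargs.get("extra", {})}
        return msg, kwargs

    def with_operation(self, operation):
        # Child adapter carrying all parent context plus the operation name.
        return MiniJobAdapter(self.logger, operation=operation, **self.extra)

base = logging.getLogger("thoth.worker")
job = MiniJobAdapter(base, job_id="job_123", source="handbook")
chunking = job.with_operation("chunking")
chunking.info("Chunking started")  # extra includes job_id, source, operation
```

Because with_operation() returns a new adapter rather than mutating the parent, the same job logger can safely fan out into several operation-scoped children.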

class thoth.shared.utils.logger.SecureLogger(name: str, level: int = 0)

Bases: Logger

Legacy SecureLogger class for backward compatibility.

New code should use setup_logger() which returns a standard Logger with GCPStructuredFormatter attached.

This class is maintained for backward compatibility with existing code that checks isinstance(logger, SecureLogger).

SENSITIVE_KEYWORDS: ClassVar[list[str]] = ['password', 'passwd', 'pwd', 'secret', 'token', 'apikey', 'api_key', 'auth', 'authorization', 'credential', 'key', 'private', 'session', 'cookie', 'jwt', 'bearer', 'oauth']

__init__(name: str, level: int = 0) → None

Initialize the SecureLogger.

debug(msg: Any, *args: Any, **kwargs: Any) → None

Log a debug message with safe formatting.

info(msg: Any, *args: Any, **kwargs: Any) → None

Log an info message with safe formatting.

warning(msg: Any, *args: Any, **kwargs: Any) → None

Log a warning message with safe formatting.

error(msg: Any, *args: Any, **kwargs: Any) → None

Log an error message with safe formatting.

critical(msg: Any, *args: Any, **kwargs: Any) → None

Log a critical message with safe formatting.

thoth.shared.utils.logger.SensitiveDataFormatter

alias of SimpleFormatter

thoth.shared.utils.logger.setup_logger(name: str, level: int = 20, simple: bool = False, json_output: bool | None = None) → Logger

Create and configure a logger with structured JSON output.

This function creates a logger that outputs structured JSON logs compatible with GCP Cloud Logging and Grafana Loki. By default, it auto-detects whether to use JSON output based on the environment.

Parameters:
  • name – Name of the logger (typically __name__)

  • level – Logging level (default: INFO)

  • simple – If True, use simple text format instead of JSON (for local dev)

  • json_output – Explicit control over JSON output. If None, auto-detects:
      - True in Cloud Run (GCS_BUCKET_NAME set)
      - True if LOG_FORMAT=json
      - False otherwise (local development)

Returns:

Configured logger instance

Example

>>> logger = setup_logger(__name__)
>>> logger.info("Server started", extra={"port": 8080})
>>> # With job context
>>> logger.info("Processing", extra={"job_id": "abc123", "source": "handbook"})

thoth.shared.utils.logger.get_job_logger(base_logger: Logger, job_id: str, source: str | None = None, collection: str | None = None, **extra_context: Any) → JobLoggerAdapter

Create a job-scoped logger adapter.

This is the recommended way to create loggers for job processing. All log messages will automatically include the job context.

Parameters:
  • base_logger – The base logger (from setup_logger)

  • job_id – Unique identifier for the job

  • source – Source being processed (e.g., “handbook”)

  • collection – Collection name

  • **extra_context – Additional context fields

Returns:

JobLoggerAdapter with job context

Example

>>> logger = setup_logger("thoth.worker")
>>> job_logger = get_job_logger(logger, job_id="job_123", source="handbook")
>>> job_logger.info("Starting ingestion")
>>> job_logger.info("Processed file", extra={"file_path": "readme.md", "chunks_created": 15})

thoth.shared.utils.logger.configure_root_logger(level: int = 20, json_output: bool | None = None) → None

Configure the root logger for the application.

Call this once at application startup to configure global logging behavior.

Parameters:
  • level – Root logging level

  • json_output – Whether to use JSON output (auto-detects if None)