thoth.ingestion.repo_manager

Repository manager for cloning and tracking the GitLab handbook.

Functions

setup_logger(name[, level, simple, json_output])

Create and configure a logger with structured JSON output.

Classes

Any(*args, **kwargs)

Special type indicating an unconstrained type.

CloneProgress(logger)

Progress handler for git clone operations.

HandbookRepoManager([repo_url, clone_path, ...])

Manages the GitLab handbook repository.

Path(*args, **kwargs)

PurePath subclass that can make system calls.

RemoteProgress()

Handler providing an interface to parse progress information emitted by git-push(1) and git-fetch(1) and to dispatch callbacks allowing subclasses to react to the progress.

Repo(path, odbt, search_parent_directories, ...)

Represents a git repository and allows you to query references, create commit information, generate diffs, create and clone repositories, and query the log.

Exceptions

GitCommandError(command[, status, stderr, ...])

Thrown if execution of the git command fails with non-zero status code.

InvalidGitRepositoryError

Thrown if the given repository appears to have an invalid format.

class thoth.ingestion.repo_manager.CloneProgress(logger: Logger | LoggerAdapter)

Bases: RemoteProgress

Progress handler for git clone operations.

Logs progress updates during clone/fetch operations to provide visibility into long-running git operations.

OP_NAMES: ClassVar[dict[int, str]] = {4: 'Counting objects', 8: 'Compressing objects', 16: 'Writing objects', 32: 'Receiving objects', 64: 'Resolving deltas', 128: 'Finding sources', 256: 'Checking out files'}
__init__(logger: Logger | LoggerAdapter) → None

Initialize the progress handler.

Parameters:

logger – Logger instance for progress messages

update(op_code: int, cur_count: str | float, max_count: str | float | None = None, message: str = '') → None

Called for each progress update from git.

Parameters:
  • op_code – Operation code indicating the current stage

  • cur_count – Current progress count

  • max_count – Maximum count (if known)

  • message – Optional message from git
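As an illustration of how the update() arguments can be turned into a log message, here is a minimal sketch. It does not reproduce the actual CloneProgress implementation: the function name describe_progress and the message layout are assumptions. The op codes come from OP_NAMES above, and the two low bits are the BEGIN and END stage flags that GitPython's RemoteProgress combines with the op code.

```python
from __future__ import annotations

# Op codes as listed in OP_NAMES above.
OP_NAMES = {
    4: "Counting objects",
    8: "Compressing objects",
    16: "Writing objects",
    32: "Receiving objects",
    64: "Resolving deltas",
    128: "Finding sources",
    256: "Checking out files",
}

# GitPython's RemoteProgress uses bit 1 (BEGIN) and bit 2 (END) as stage flags.
STAGE_MASK = 0b11


def describe_progress(
    op_code: int,
    cur_count: str | float,
    max_count: str | float | None = None,
    message: str = "",
) -> str:
    """Format one progress update as text (hypothetical helper)."""
    # Strip the stage flags so only the operation bit remains.
    operation = op_code & ~STAGE_MASK
    name = OP_NAMES.get(operation, f"Operation {operation}")
    if max_count:
        percent = float(cur_count) / float(max_count) * 100
        text = f"{name}: {percent:.0f}% ({cur_count}/{max_count})"
    else:
        # The maximum is unknown, so only the raw count can be reported.
        text = f"{name}: {cur_count}"
    if message:
        text = f"{text}, {message}"
    return text
```

A subclass of RemoteProgress could call a helper like this from update() and pass the result to its logger.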

error_lines: List[str]
other_lines: List[str]
class thoth.ingestion.repo_manager.HandbookRepoManager(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | LoggerAdapter | None = None)

Bases: object

Manages the GitLab handbook repository.

__init__(repo_url: str = 'https://gitlab.com/gitlab-com/content-sites/handbook.git', clone_path: Path | None = None, logger: Logger | LoggerAdapter | None = None)

Initialize the repository manager.

Parameters:
  • repo_url – URL of the GitLab handbook repository

  • clone_path – Local path to clone/store the repository

  • logger – Logger instance for logging messages

logger: Logger | LoggerAdapter
is_valid_repo() → bool

Check if clone_path contains a valid git repository.

Returns:

True if a valid repository exists at clone_path, False otherwise

clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5, shallow: bool = True) → Path

Clone the GitLab handbook repository.

Parameters:
  • force – If True, remove existing repository and re-clone

  • max_retries – Maximum number of clone attempts

  • retry_delay – Delay in seconds between retries

  • shallow – If True, perform shallow clone (depth=1) for faster cloning. Shallow clones only fetch the latest commit, significantly reducing clone time for large repositories.

Returns:

Path to the cloned repository

Raises:
  • RuntimeError – If repository exists and force=False

  • GitCommandError – If cloning fails after all retries
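The interaction of max_retries and retry_delay can be sketched as a generic retry loop. This is an assumption about the general pattern, not the module's actual code: the helper name run_with_retries and the injectable sleep parameter are hypothetical, and the operation stands in for the real clone call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 5,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call operation up to max_retries times, pausing between attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception:
            # On the final attempt, let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(retry_delay)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the sketch testable without real delays; with the defaults, three failing attempts wait twice for five seconds before the last error is raised.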

update_repository() → bool

Update the repository by pulling latest changes.

For shallow clones, this fetches only the latest changes while maintaining the shallow history.

Returns:

True if update successful, False otherwise

Raises:

RuntimeError – If repository doesn’t exist

get_current_commit() → str | None

Get the current commit SHA of the repository.

Returns:

Commit SHA as string, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist

save_metadata(commit_sha: str) → bool

Save repository metadata to a JSON file.

Parameters:

commit_sha – Current commit SHA to save

Returns:

True if save successful, False otherwise

load_metadata() → dict[str, Any] | None

Load repository metadata from JSON file.

Returns:

Metadata dictionary with the keys commit_sha, clone_path, and repo_url, or None if an error occurs
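The save and load contract described above (True or False on save, a dictionary or None on load) can be sketched with the standard library. The function names write_metadata and read_metadata and the explicit metadata_path argument are hypothetical; the page does not say where the real manager stores its file or how it serializes it beyond JSON. Only the three keys are taken from the documentation.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def write_metadata(
    metadata_path: Path, commit_sha: str, clone_path: Path, repo_url: str
) -> bool:
    """Write the three documented keys as JSON; return False on I/O failure."""
    data = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),  # Path objects are not JSON serializable
        "repo_url": repo_url,
    }
    try:
        metadata_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def read_metadata(metadata_path: Path) -> dict[str, Any] | None:
    """Read the metadata back; return None if the file is missing or invalid."""
    try:
        return json.loads(metadata_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
```

Returning a status value instead of raising lets an ingestion run continue when metadata cannot be persisted, at the cost of requiring callers to check the result.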

get_changed_files(since_commit: str) → list[str] | None

Get list of files changed since a specific commit.

Note: For shallow clones, this may fail if the comparison commit is not in the shallow history. In this case, None is returned and callers should fall back to full processing.

Parameters:

since_commit – Commit SHA to compare against

Returns:

List of changed file paths, or None if error occurs

Raises:

RuntimeError – If repository doesn’t exist
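The fallback described in the note above can be sketched as follows. The helper files_to_process is hypothetical and not part of the module; it only shows how a caller might distinguish None (comparison failed, process everything) from an empty list (nothing changed).

```python
from __future__ import annotations

from collections.abc import Iterable


def files_to_process(
    changed_files: list[str] | None, all_files: Iterable[str]
) -> list[str]:
    """Pick the files for an ingestion run (hypothetical helper)."""
    if changed_files is None:
        # The diff could not be computed, for example because the comparison
        # commit is outside the shallow history: fall back to a full run.
        return list(all_files)
    # An empty list is a valid result and means there is nothing to do.
    return list(changed_files)
```

Testing with "is None" rather than truthiness matters here, because an empty list would otherwise trigger an unnecessary full reprocessing.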

get_file_changes(since_commit: str) → dict[str, list[str]] | None

Get categorized file changes since a specific commit.

Note: For shallow clones, this may fail if the comparison commit is not in the shallow history. In this case, None is returned and callers should fall back to full processing.

Parameters:

since_commit – Commit SHA to compare against

Returns:

Dictionary with the keys 'added', 'modified', and 'deleted', each mapping to a list of file paths, or None if an error occurs

Raises:

RuntimeError – If repository doesn’t exist
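To illustrate the shape of the returned dictionary, the sketch below categorizes the tab-separated output of git diff --name-status. Whether the module uses the git CLI or GitPython diff objects is not stated on this page, so this is an assumption for illustration only; the function name categorize_name_status is hypothetical, and treating a rename as one deletion plus one addition is a choice made for this sketch.

```python
from __future__ import annotations


def categorize_name_status(output: str) -> dict[str, list[str]]:
    """Group `git diff --name-status` lines into added/modified/deleted."""
    changes: dict[str, list[str]] = {"added": [], "modified": [], "deleted": []}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is "<status>\t<path>", or "<status>\t<old>\t<new>" for renames.
        status, *paths = line.split("\t")
        if not paths:
            continue
        if status.startswith("A"):
            changes["added"].append(paths[0])
        elif status.startswith("M"):
            changes["modified"].append(paths[0])
        elif status.startswith("D"):
            changes["deleted"].append(paths[0])
        elif status.startswith("R") and len(paths) == 2:
            # Rename statuses carry a similarity score, e.g. "R100".
            changes["deleted"].append(paths[0])
            changes["added"].append(paths[1])
    return changes
```

Other status letters, such as C for copies or T for type changes, are ignored in this sketch.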