Repo manager

thoth.ingestion.repo_manager

Repository manager for cloning and tracking the GitLab handbook.

DEFAULT_REPO_URL = 'https://gitlab.com/gitlab-com/content-sites/handbook.git' module-attribute

DEFAULT_CLONE_PATH = Path.home() / '.thoth' / 'handbook' module-attribute

METADATA_FILE = 'repo_metadata.json' module-attribute

MSG_REPO_EXISTS = 'Repository already exists at {path}. Use force=True to re-clone.' module-attribute

MSG_CLONE_FAILED = 'Failed to clone repository after {attempts} attempts' module-attribute

MSG_UPDATE_FAILED = 'Failed to update repository' module-attribute

MSG_NO_REPO = 'No repository found at {path}. Clone the repository first.' module-attribute

MSG_METADATA_SAVE_FAILED = 'Failed to save metadata' module-attribute

MSG_METADATA_LOAD_FAILED = 'Failed to load metadata' module-attribute

MSG_DIFF_FAILED = 'Failed to get changed files' module-attribute

CloneProgress

Progress handler for git clone operations.

Logs progress updates during clone/fetch operations to provide visibility into long-running git operations.

OP_NAMES: dict[int, str] = {RemoteProgress.COUNTING: 'Counting objects', RemoteProgress.COMPRESSING: 'Compressing objects', RemoteProgress.WRITING: 'Writing objects', RemoteProgress.RECEIVING: 'Receiving objects', RemoteProgress.RESOLVING: 'Resolving deltas', RemoteProgress.FINDING_SOURCES: 'Finding sources', RemoteProgress.CHECKING_OUT: 'Checking out files'} class-attribute

logger = logger instance-attribute

__init__(logger: logging.Logger | logging.LoggerAdapter) -> None

Initialize the progress handler.

Parameters:

logger (Logger | LoggerAdapter, required): Logger instance for progress messages

update(op_code: int, cur_count: str | float, max_count: str | float | None = None, message: str = '') -> None

Called for each progress update from git.

Parameters:

op_code (int, required): Operation code indicating the current stage

cur_count (str | float, required): Current progress count

max_count (str | float | None, default: None): Maximum count (if known)

message (str, default: ''): Optional message from git
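As a rough illustration of how an op_code could be turned into a readable log line, the sketch below masks off the stage bits and looks the operation up in a name table like OP_NAMES. The numeric constants are stand-in values chosen for this sketch, and format_progress is a hypothetical helper, not part of this module; real code would read the flags from git.RemoteProgress.

```python
# Stand-in bit flags for this sketch (assumed values; real code would use
# the constants defined on git.RemoteProgress instead).
BEGIN, END = 1, 2
COUNTING, COMPRESSING, WRITING, RECEIVING = 4, 8, 16, 32
RESOLVING, FINDING_SOURCES, CHECKING_OUT = 64, 128, 256
STAGE_MASK = BEGIN | END

OP_NAMES = {
    COUNTING: "Counting objects",
    COMPRESSING: "Compressing objects",
    WRITING: "Writing objects",
    RECEIVING: "Receiving objects",
    RESOLVING: "Resolving deltas",
    FINDING_SOURCES: "Finding sources",
    CHECKING_OUT: "Checking out files",
}


def format_progress(op_code, cur_count, max_count=None, message=""):
    """Build a log line for one progress update (hypothetical helper)."""
    # Drop the BEGIN/END stage bits so only the operation flag remains.
    op = op_code & ~STAGE_MASK
    name = OP_NAMES.get(op, f"Operation {op}")

    # Counts may arrive as strings or floats, so normalize before dividing.
    total = float(max_count) if max_count else 0.0
    if total > 0:
        percent = float(cur_count) / total * 100
        text = f"{name}: {percent:.0f}% ({cur_count}/{max_count})"
    else:
        text = f"{name}: {cur_count}"

    if message:
        text += f", {message}"
    return text
```

A handler's update method could pass the result to logger.info; whether the real CloneProgress formats its messages this way is not stated here.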

HandbookRepoManager

Manages the GitLab handbook repository.

repo_url = repo_url instance-attribute

clone_path = clone_path or DEFAULT_CLONE_PATH instance-attribute

metadata_path = self.clone_path.parent / METADATA_FILE instance-attribute

logger: logging.Logger | logging.LoggerAdapter = logger or setup_logger(__name__) instance-attribute

__init__(repo_url: str = DEFAULT_REPO_URL, clone_path: Path | None = None, logger: logging.Logger | logging.LoggerAdapter | None = None)

Initialize the repository manager.

Parameters:

repo_url (str, default: DEFAULT_REPO_URL): URL of the GitLab handbook repository

clone_path (Path | None, default: None): Local path to clone/store the repository

logger (Logger | LoggerAdapter | None, default: None): Logger instance for logging messages

is_valid_repo() -> bool

Check if clone_path contains a valid git repository.

Returns:

bool: True if valid repo exists, False otherwise

clone_handbook(force: bool = False, max_retries: int = 3, retry_delay: int = 5, shallow: bool = True) -> Path

Clone the GitLab handbook repository.

Parameters:

force (bool, default: False): If True, remove existing repository and re-clone

max_retries (int, default: 3): Maximum number of clone attempts

retry_delay (int, default: 5): Delay in seconds between retries

shallow (bool, default: True): If True, perform shallow clone (depth=1) for faster cloning. Shallow clones only fetch the latest commit, significantly reducing clone time for large repositories.

Returns:

Path: Path to the cloned repository

Raises:

RuntimeError: If repository exists and force=False

GitCommandError: If cloning fails after all retries
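The interaction between max_retries and retry_delay can be sketched as a plain retry loop. This is a minimal stdlib-only approximation, not the module's implementation: clone_with_retries and the clone_once callable are hypothetical, and OSError stands in for GitCommandError so the example runs without GitPython installed.

```python
import time
from collections.abc import Callable
from pathlib import Path


def clone_with_retries(
    clone_once: Callable[[], Path],
    max_retries: int = 3,
    retry_delay: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Path:
    """Call clone_once up to max_retries times, pausing between failures."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    last_error: Exception | None = None
    for attempt in range(1, max_retries + 1):
        try:
            return clone_once()
        except OSError as exc:  # stand-in for git.exc.GitCommandError
            last_error = exc
            # Only wait if another attempt is still to come.
            if attempt < max_retries:
                sleep(retry_delay)

    # All attempts failed: surface the last error to the caller.
    assert last_error is not None
    raise last_error
```

Passing sleep as a parameter keeps the loop testable without real delays; with the defaults, three failed attempts wait twice (5 seconds each) before the final error is raised.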

update_repository() -> bool

Update the repository by pulling latest changes.

For shallow clones, this fetches only the latest changes while maintaining the shallow history.

Returns:

bool: True if update successful, False otherwise

Raises:

RuntimeError: If repository doesn't exist

get_current_commit() -> str | None

Get the current commit SHA of the repository.

Returns:

str | None: Commit SHA as string, or None if error occurs

Raises:

RuntimeError: If repository doesn't exist

save_metadata(commit_sha: str) -> bool

Save repository metadata to a JSON file.

Parameters:

commit_sha (str, required): Current commit SHA to save

Returns:

bool: True if save successful, False otherwise

load_metadata() -> dict[str, Any] | None

Load repository metadata from JSON file.

Returns:

dict[str, Any] | None: Metadata dictionary with commit_sha, clone_path, repo_url, or None if error
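The save/load pair amounts to a JSON round trip that reports failure through its return value instead of raising. The sketch below assumes the file holds exactly the three keys named above (the real file may contain more) and uses hypothetical free functions that take the metadata path explicitly, rather than the manager's methods.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def save_metadata(
    metadata_path: Path, commit_sha: str, clone_path: Path, repo_url: str
) -> bool:
    """Write metadata as JSON; return False instead of raising on failure."""
    payload = {
        "commit_sha": commit_sha,
        "clone_path": str(clone_path),  # Path objects are not JSON-serializable
        "repo_url": repo_url,
    }
    try:
        metadata_path.parent.mkdir(parents=True, exist_ok=True)
        metadata_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    except OSError:
        return False
    return True


def load_metadata(metadata_path: Path) -> dict[str, Any] | None:
    """Read metadata back; return None if the file is missing or corrupt."""
    try:
        return json.loads(metadata_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
```

With the default clone path, metadata_path resolves to the clone path's parent directory, so the file sits beside the checkout rather than inside it.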

get_changed_files(since_commit: str) -> list[str] | None

Get list of files changed since a specific commit.

Note: For shallow clones, this may fail if the comparison commit is not in the shallow history. In this case, None is returned and callers should fall back to full processing.

Parameters:

since_commit (str, required): Commit SHA to compare against

Returns:

list[str] | None: List of changed file paths, or None if error occurs

Raises:

RuntimeError: If repository doesn't exist
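The fallback described in the note can be sketched from the caller's side. plan_ingestion is a hypothetical helper, and the get_changed_files argument is any callable with the same contract as the method above (a list of paths, or None on error).

```python
from __future__ import annotations

from collections.abc import Callable


def plan_ingestion(
    get_changed_files: Callable[[str], list[str] | None],
    last_commit: str | None,
    all_files: list[str],
) -> tuple[str, list[str]]:
    """Choose between incremental and full processing (hypothetical helper)."""
    # No recorded commit means there is nothing to diff against.
    if last_commit is None:
        return "full", all_files

    changed = get_changed_files(last_commit)

    # None signals a failed diff, e.g. the commit is outside a shallow history.
    # An empty list is a valid result meaning nothing changed, so test for
    # None explicitly rather than for truthiness.
    if changed is None:
        return "full", all_files
    return "incremental", changed
```

Checking `is None` rather than `not changed` matters here: treating an empty list as a failure would trigger a full reprocess every time the repository is already up to date.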

get_file_changes(since_commit: str) -> dict[str, list[str]] | None

Get categorized file changes since a specific commit.

Note: For shallow clones, this may fail if the comparison commit is not in the shallow history. In this case, None is returned and callers should fall back to full processing.

Parameters:

since_commit (str, required): Commit SHA to compare against

Returns:

dict[str, list[str]] | None: Dictionary with keys 'added', 'modified', 'deleted' containing lists of file paths, or None if error occurs

Raises:

RuntimeError: If repository doesn't exist
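One way to produce a dictionary of this shape is to parse the tab-separated output of `git diff --name-status`. How this module actually derives the categories is not stated here, so the sketch below is an assumption throughout: categorize_name_status is a hypothetical function, and counting a rename as a deletion of the old path plus an addition of the new one is a choice made for this example.

```python
def categorize_name_status(diff_output: str) -> dict[str, list[str]]:
    """Group `git diff --name-status` lines into added/modified/deleted."""
    changes: dict[str, list[str]] = {"added": [], "modified": [], "deleted": []}

    for line in diff_output.splitlines():
        if not line.strip():
            continue
        # Each line is "<status>\t<path>", or "<status>\t<old>\t<new>"
        # for renames and copies (status such as R100 or C75).
        status, *paths = line.split("\t")
        if not paths:
            continue

        code = status[0]
        if code == "A":
            changes["added"].append(paths[-1])
        elif code == "D":
            changes["deleted"].append(paths[0])
        elif code == "R":
            # Assumption: a rename removes the old path and adds the new one.
            changes["deleted"].append(paths[0])
            changes["added"].append(paths[-1])
        elif code == "C":
            # A copy leaves the source in place and adds the destination.
            changes["added"].append(paths[-1])
        else:
            # M (modified), T (type change), and anything else.
            changes["modified"].append(paths[-1])

    return changes
```

A caller would feed it the text returned by the diff command and could then re-ingest the 'added' and 'modified' paths while dropping stored entries for the 'deleted' ones.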