diff --git a/cmd/auto-config-brancher/README.md b/cmd/auto-config-brancher/README.md index 8732eb58606..046de9811dd 100644 --- a/cmd/auto-config-brancher/README.md +++ b/cmd/auto-config-brancher/README.md @@ -1,5 +1,57 @@ # auto-config-brancher +## What +Orchestrator that runs a sequence of CI configuration tools in order, commits each tool's changes separately, pushes the result, and creates a PR to openshift/release. This is the periodic automation that keeps release branch configs, job definitions, and private org mirrors up to date. + +## How it works — tool sequence + +Executes these tools in order (each one commits its own changes if any): + +| Step | Tool | What it does | +|---|---|---| +| 1 | `config-brancher` | Branch ci-operator configs for future releases (`--skip-periodics`) | +| 2 | `ci-operator-config-mirror` | Mirror configs to openshift-priv (`--to-org openshift-priv --only-org openshift`) | +| 3 | `determinize-ci-operator` | Normalize ci-operator config YAML formatting | +| 4 | `ci-operator-prowgen` | Generate Prow jobs from ci-operator configs | +| 5 | `private-prow-configs-mirror` | Mirror Prow configs to private org | +| 6 | `determinize-prow-config` | Normalize Prow config YAML | +| 7 | `sanitize-prow-jobs` | Validate and format Prow job configs | +| 8 | `clusterimageset-updater` | Update Hive ClusterImageSet resources | +| 9 | `promoted-image-governor` | Validate image promotion configs (dry-run) | + +If `--rebalancer-cron` is set and the current time is within +/-1 hour of the cron schedule, `rebalancer` is prepended as step 0. + +### Change detection +1. Records git HEAD SHA before running tools +2. Each tool runs, and if `git diff-index --quiet HEAD` shows changes, stages all and commits with message `"{tool} {args}"` +3. After all tools: compares overall `git diff` from start SHA to current HEAD +4. If no overall diff (changes cancelled out): skips push +5. 
If changes: pushes to remote branch `auto-config-brancher` + +### Authentication +- **Token auth** (`--github-token-path`): pushes via `https://{login}:{token}@github.com/...` +- **GitHub App auth** (no token): uses `GitClientFactory` for authenticated push via app installation + +### PR creation +- Title: `"Automate config brancher by auto-config-brancher job at {timestamp}"` +- Labels: `tide/merge-method-merge`, `rehearsals-ack`, `priority/ci-critical` +- If `--self-approve`: adds `approved` and `lgtm` labels +- Body: `/cc @{assign}` +- Updates existing PR if one already exists (matched by title prefix) + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--target-dir` | — | Root directory of target repository | +| `--github-login` | `openshift-bot` | GitHub username for PR/push | +| `--assign` | `openshift/test-platform` | PR assignee | +| `--rebalancer-cron` | — | Cron expression for rebalancer (+/-1h window) | +| `--self-approve` | false | Auto-add approved+lgtm labels | +| `--current-release` | — | Current OCP release version | +| `--future-release` | — | Target release versions (repeatable) | + +## Deployment +Periodic Prow job. `auto-config-brancher` is a tool that reconciles various parts of CI config in [openshift/release](https://github.com/openshift/release/) repository. diff --git a/cmd/auto-peribolos-sync/README.md b/cmd/auto-peribolos-sync/README.md index 1993de1a4df..97b96d6dde7 100644 --- a/cmd/auto-peribolos-sync/README.md +++ b/cmd/auto-peribolos-sync/README.md @@ -1,5 +1,53 @@ # auto-peribolos-sync +## What +Automation wrapper that runs `private-org-peribolos-sync`, detects changes, commits them, and creates or updates a pull request on the `openshift/config` repository. This is the outer orchestration layer; the actual Peribolos config generation is delegated to `private-org-peribolos-sync`. + +## How it works -- full flow + +1. Initialize GitHub client and secrets +2. 
Shell out to `/usr/bin/private-org-peribolos-sync` with the provided arguments: + - `--destination-org openshift-priv` + - `--peribolos-config {path}` + - `--release-repo-path {path}` + - `--github-token-path {path}` + - Optionally `--whitelist-file`, `--only-org`, `--flatten-org` +3. After the subprocess completes, check for git changes using `bumper.HasChanges()` +4. If no changes, exit cleanly +5. If changes exist: + - Create a commit with title: `"Automate peribolos configuration sync {RFC1123 timestamp}"` + - Push to the `auto-peribolos-sync` branch on the `openshift/config` repo using HTTPS with token auth + - Create or update a pull request via `bumper.UpdatePullRequestWithLabels()`: + - Target: `openshift/config` default branch + - Source: `{github-login}:auto-peribolos-sync` + - Description: "Updates the repositories of the openshift-priv organization" + - If `--self-approve` is set, add `approved` and `lgtm` labels to the PR + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | When true, uses API tokens but does not create the PR | +| `--self-approve` | `false` | Add `approved` and `lgtm` labels to the PR | +| `--github-login` | `openshift-bot` | GitHub username for push and PR creation | +| `--git-name` | `""` | Git commit author name (must pair with `--git-email`) | +| `--git-email` | `""` | Git commit author email (must pair with `--git-name`) | +| `--peribolos-config` | (required) | Path to the Peribolos config file to update | +| `--release-repo-path` | (required) | Path to openshift/release repository directory | +| `--whitelist-file` | `""` | Path to whitelist file, passed through to `private-org-peribolos-sync` | +| `--only-org` | `""` | Only sync repos from this org, passed through | +| `--flatten-org` | (repeatable) | Additional flatten orgs, passed through | +| GitHub flags | | Standard Prow GitHub options | + +## Key files +- `cmd/auto-peribolos-sync/main.go` -- all logic in this single file +- 
`cmd/private-org-peribolos-sync/main.go` -- the actual config generation tool it shells out to + +## Deployment +Periodic Prow job. The container image (`ci_auto-peribolos-sync_latest`) bundles both `auto-peribolos-sync` and `private-org-peribolos-sync` binaries. The PR is created against the `openshift/config` repository (not `openshift/release`). + +## Related +- `cmd/private-org-peribolos-sync` -- the tool this wrapper executes +- The PR targets `openshift/config`, which holds the Peribolos config for all OpenShift GitHub orgs `auto-peribolos-sync` is a wrapper over the [private-org-peribolos-sync](../private-org-peribolos-sync) tool. ## What it does diff --git a/cmd/auto-testgrid-generator/README.md b/cmd/auto-testgrid-generator/README.md new file mode 100644 index 00000000000..d1d9aecc940 --- /dev/null +++ b/cmd/auto-testgrid-generator/README.md @@ -0,0 +1,54 @@ +# auto-testgrid-generator + +## What +Orchestrator that runs `testgrid-config-generator` to produce updated TestGrid dashboard configurations, then creates or updates a pull request against `kubernetes/test-infra` with the changes. This is the automation that keeps TestGrid dashboards in sync with the current set of OpenShift CI periodic jobs. + +## How it works -- full flow + +1. **Parse flags**: collects paths to the testgrid config directory, release controller config, Prow jobs directory, allow-list file, and git/PR creation options. + +2. **Run testgrid-config-generator**: executes `/usr/bin/testgrid-config-generator` as a subprocess with the following arguments: + - `-testgrid-config {dir}` -- where to write TestGrid YAML + - `-release-config {dir}` -- release controller configuration + - `-prow-jobs-dir {dir}` -- Prow periodic job definitions + - `-allow-list {file}` -- job classification overrides + +3. 
**Create or update PR**: uses `prcreation.PRCreationOptions.UpsertPR()` to: + - Check the git working directory for changes + - Create a branch and commit if changes exist + - Push to the fork and create a PR against `kubernetes/test-infra` (configurable with `--github-org`) + - If a PR with matching title already exists, force-push to update it + - PR title format: `Update OpenShift testgrid definitions by auto-testgrid-generator job at {timestamp}` + - Assigns the PR to `--assign` (default: `openshift/test-platform`) + +### PR matching +The tool uses title matching (`matchTitle`) to find existing PRs. If a PR with the prefix "Update OpenShift testgrid definitions by auto-testgrid-generator job" already exists, it updates that PR rather than creating a new one. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--testgrid-config` | (none) | Directory where TestGrid output YAML is stored | +| `--release-config` | (none) | Directory of release controller config files | +| `--prow-jobs-dir` | (none) | Directory of Prow job config files | +| `--allow-list` | (none) | File with release-type overrides | +| `--working-dir` | `.` | Git working directory | +| `--github-login` | `openshift-bot` | GitHub username for PR creation | +| `--github-org` | `kubernetes` | GitHub org (override for testing) | +| `--upstream-branch` | `master` | Target branch for the PR | +| `--assign` | `openshift/test-platform` | GitHub user or team to assign the PR to | +| `--git-name` | (from GitAuthorOptions) | Git author name for commits | +| `--git-email` | (from GitAuthorOptions) | Git author email for commits | +| (PRCreationOptions flags) | | GitHub token path, etc. 
| + +## Key files +- `cmd/auto-testgrid-generator/main.go` -- entry point, subprocess execution of testgrid-config-generator, PR creation +- `cmd/testgrid-config-generator/main.go` -- the actual TestGrid config generation logic (invoked as a binary) +- `pkg/github/prcreation/` -- PR creation/update library + +## Deployment +Periodic Prow job. The container image includes both the `auto-testgrid-generator` and `testgrid-config-generator` binaries (the latter at `/usr/bin/testgrid-config-generator`). The job runs with a GitHub token for PR creation against `kubernetes/test-infra`. + +## Related +- `cmd/testgrid-config-generator` -- the actual generation logic, invoked as a subprocess +- Target repo: `kubernetes/test-infra` (TestGrid configuration lives there) +- TestGrid dashboards: `https://testgrid.k8s.io/redhat-openshift-*` diff --git a/cmd/autoowners/README.md b/cmd/autoowners/README.md index 84f5ddbfe76..36d284c49dc 100644 --- a/cmd/autoowners/README.md +++ b/cmd/autoowners/README.md @@ -1,7 +1,57 @@ -# Populating `OWNERS` and `OWNERS_ALIASES` +# autoowners -This utility updates the `OWNERS` files from remote OpenShift repositories. +## What +Batch job that syncs OWNERS files from upstream source repositories into the openshift/release repository. For every org/repo that has CI configuration in openshift/release (under `ci-operator/jobs`, `ci-operator/config`, or `ci-operator/templates`), autoowners fetches the root-level `OWNERS` and `OWNERS_ALIASES` files from the upstream repo on GitHub, resolves aliases, filters out users who are not members of the downstream GitHub organization, and writes the resolved OWNERS files into the corresponding directories in openshift/release. It then commits and opens (or updates) a pull request with the changes. +This ensures that the people who own code upstream also control CI job approvals downstream, without manual synchronization. + +## How it works -- full flow + +1. 
**Discover repos**: Walk the `ci-operator/{jobs,config,templates}` subdirectories (and any `--extra-config-dir` paths) under `--target-dir` to build a list of org/repo pairs that have CI configuration. Skip the target repo itself (openshift/release) and any repos or orgs on the blocklist (`--ignore-repo`, `--ignore-org`). Each org/repo maps to one or more directories where an OWNERS file should be written. + +2. **Fetch upstream OWNERS**: For each discovered org/repo, use the GitHub API (`GetFile`) to fetch the root `OWNERS` file and `OWNERS_ALIASES` file. If no OWNERS file exists upstream, log a warning and skip the repo. Strip `@` prefixes from usernames (common in upstream OWNERS files). Quote purely numeric GitHub usernames with double quotes so YAML parsers treat them as strings instead of integers. + +3. **Resolve aliases**: If an `OWNERS_ALIASES` file was found, expand all alias references in the OWNERS file to their constituent usernames. The plugin supports both simple OWNERS format (flat approvers/reviewers lists) and full OWNERS format (filter-based with path patterns). + +4. **Filter to org members**: Query the downstream GitHub org (default: `openshift`) for its complete member list via `ListOrgMembers(org, "all")`. Remove any user from the resolved OWNERS who is not a member of the downstream org. All comparisons are case-insensitive. + +5. **Set reviewers fallback**: If the resolved OWNERS has approvers but an empty reviewers list, copy the approvers into reviewers. This matches the behavior Prow expects. + +6. **Write OWNERS files**: For each directory associated with the org/repo, delete the existing OWNERS file and write the resolved content. Prepend a header comment with five lines: the "DO NOT EDIT" warning, the source URL, a note about alias expansion, a note about org member filtering, and a link to the OWNERS docs. + +7. **Detect changes**: Run `git status --porcelain` to find modified OWNERS files. 
Verify that only OWNERS files were modified (error out if anything else changed). If nothing changed, exit cleanly. + +8. **Commit and push**: Use `bumper.GitCommitSignoffAndPush` to commit changes and push to a remote branch (`autoowners`) on the bot's fork. The commit message includes the PR title with a timestamp. + +9. **Create or update PR**: Call `bumper.UpdatePullRequestWithLabels` to open or update a pull request against the target repo's base branch. The PR body lists all directories that had OWNERS changes and `/cc`s the assignee. The `rehearsals-ack` label is always added (to skip pj-rehearse). If `--self-approve` is set, `approved` and `lgtm` labels are also added for auto-merge. The PR body is truncated to 65535 characters if it exceeds GitHub's limit. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | When true, do not actually create the PR | +| `--github-login` | `openshift-bot` | GitHub username for push and PR creation | +| `--org` | `openshift` | Downstream GitHub org name (used for member filtering) | +| `--repo` | `release` | Downstream GitHub repository name | +| `--git-name` | (system default) | Git author name for commits | +| `--git-email` | (system default) | Git author email for commits | +| `--git-signoff` | `false` | Whether to add `Signed-off-by` to commits | +| `--assign` | `openshift/test-platform` | GitHub user or team to `/cc` on the PR | +| `--target-dir` | (required) | Path to the local clone of the target repo | +| `--target-subdir` | `ci-operator` | Subdirectory under target-dir where configs live | +| `--config-subdir` | `jobs,config,templates` | Subdirectories to scan for org/repo dirs (repeatable) | +| `--extra-config-dir` | (none) | Additional directories to scan (repeatable) | +| `--ignore-repo` | (none) | Repos to skip, in `org/repo` format (repeatable) | +| `--ignore-org` | (none) | Entire orgs to skip (repeatable) | +| `--debug-mode` | `false` | Enable DEBUG-level logging | +| 
`--self-approve` | `false` | Add `approved` and `lgtm` labels to the PR | +| `--pr-base-branch` | `master` | Base branch for the PR | +| `--plugin-config` | (none) | Path to Prow plugin config (for custom OWNERS filenames per repo) | + +## Key files +- `cmd/autoowners/main.go` -- entire implementation: repo discovery, OWNERS fetching, alias resolution, member filtering, PR creation + +## Deployment +Runs as a periodic Prow job (not a long-lived service). Typically scheduled to run on a regular cadence (e.g., daily) against a checkout of openshift/release. ```console $ autoowners -h Usage of autoowners: diff --git a/cmd/autopublicizeconfig/README.md b/cmd/autopublicizeconfig/README.md index f1451ebcf25..6a82c9fe102 100644 --- a/cmd/autopublicizeconfig/README.md +++ b/cmd/autopublicizeconfig/README.md @@ -1,4 +1,61 @@ -# Auto publicize config +# autopublicizeconfig -This tool re-generates the publicize plugin configuration file and -submits a pull request against the openshift/release repository. +## What +Automation tool that generates the configuration file for the `publicize` plugin by discovering all repos that need private-to-public mirroring. It scans ci-operator configs and the whitelist to find repos building official images, computes the `openshift-priv/{repo}` to `{org}/{repo}` mapping, writes the config, and creates a PR on `openshift/release`. + +This is the config generator for the `publicize` webhook plugin. It keeps the publicize config in sync as repos are added to or removed from the private org. + +## How it works -- full flow + +1. Initialize GitHub client and secrets +2. Scan ci-operator configs at `{release-repo-path}/ci-operator/config/` for repos that build official images (`api.BuildsAnyOfficialImages` with `WithoutOKD`) +3. Add all repos from the whitelist file +4. 
For each discovered `{org}/{repo}`: + - Compute the private repo name using `MirroredRepoName()` with the flattened orgs set + - Create the mapping: `openshift-priv/{mirroredName}` -> `{org}/{repo}` +5. Marshal the config as YAML and write to `--publicize-config` path, creating directories as needed +6. Check for git changes using `bumper.HasChanges()` +7. If no changes, exit cleanly +8. If running in dry-run mode, log and exit +9. If changes exist and not dry-run: + - Create a commit with title: `"Automate publicize configuration sync {RFC1123 timestamp}"` + - Push to the `auto-publicize-sync` branch on `openshift/release` + - Create or update a PR via `bumper.UpdatePullRequestWithLabels()`: + - Target: `openshift/release` default branch + - Source: `{github-login}:auto-publicize-sync` + - Description: "Updates the publicize plugin configuration" + - If `--self-approve` is set, add `approved` and `lgtm` labels + +## Generated config format +```yaml +repositories: + openshift-priv/installer: openshift/installer + openshift-priv/cluster-version-operator: openshift/cluster-version-operator + openshift-priv/stolostron-multicloud-operators-subscription: stolostron/multicloud-operators-subscription +``` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | When true, writes config but does not create PR | +| `--self-approve` | `false` | Add `approved` and `lgtm` labels to the PR | +| `--github-login` | `openshift-bot` | GitHub username for push and PR creation | +| `--git-name` | `""` | Git commit author name (must pair with `--git-email`) | +| `--git-email` | `""` | Git commit author email (must pair with `--git-name`) | +| `--publicize-config` | (required) | Path where the generated publicize config will be written | +| `--release-repo-path` | (required) | Path to openshift/release repository directory | +| `--flatten-org` | (repeatable) | Additional orgs whose repos should not have org prefix | +| `--whitelist-file` | `""` | Path 
to YAML file listing repos to include | +| GitHub flags | | Standard Prow GitHub options | + +## Key files +- `cmd/autopublicizeconfig/main.go` -- all logic in this single file +- `cmd/publicize/` -- the webhook plugin that consumes the generated config +- `pkg/privateorg/flatten.go` -- `MirroredRepoName()` naming logic + +## Deployment +Periodic Prow job. Creates PRs against the `openshift/release` repository on branch `auto-publicize-sync`. + +## Related +- `cmd/publicize` -- the webhook plugin that uses the generated config +- `cmd/ci-operator-config-mirror` -- uses the same repo discovery logic for a different purpose diff --git a/cmd/backport-verifier/README.md b/cmd/backport-verifier/README.md new file mode 100644 index 00000000000..72e8c6f8d8e --- /dev/null +++ b/cmd/backport-verifier/README.md @@ -0,0 +1,71 @@ +# backport-verifier + +## What +Prow webhook plugin that validates backport pull requests by checking whether their commits reference merged PRs in a configured upstream repository. It enforces the convention that each backport commit includes `UPSTREAM: <PR number>: ` at the start of its commit message, then verifies that the referenced upstream PR actually exists and has merged. + +This provides automated provenance checking: downstream maintainers can see at a glance whether a backport legitimately comes from a merged upstream change. + +## User-facing commands + +| Command | What it does | +|---|---| +| `/validate-backports` | Manually trigger backport validation on the current PR | + +Validation also runs automatically on PR open (`opened`) and on every push (`synchronize`). + +## How it works -- full flow + +### Configuration +The plugin is configured with a YAML file mapping downstream `org/repo` to upstream `org/repo`: +```yaml +repositories: + openshift/kubernetes: kubernetes/kubernetes +``` +The config file is hot-reloaded via a ConfigMap mount watcher (`prowconfig.GetCMMountWatcher`). Changes are picked up automatically without restarts. 
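
As a rough illustration of how the `repositories` map is consulted, the sketch below resolves the upstream for a pull request's repository. It is an assumption-laden example, not the plugin's Go code: `CONFIG` stands in for the already-parsed YAML, and `upstream_for` is a hypothetical helper name.

```python
# Stand-in for the parsed YAML config shown above (assumed shape).
CONFIG = {
    "repositories": {
        "openshift/kubernetes": "kubernetes/kubernetes",
    }
}


def upstream_for(config, org, repo):
    """Return (upstream_org, upstream_repo) for a downstream repo, or None.

    Hypothetical helper: None means no upstream is configured, in which
    case automatic validation has nothing to check.
    """
    upstream = config.get("repositories", {}).get(f"{org}/{repo}")
    if upstream is None:
        return None
    # Split only on the first slash so the result is always a pair.
    upstream_org, upstream_repo = upstream.split("/", 1)
    return upstream_org, upstream_repo
```

For example, `upstream_for(CONFIG, "openshift", "kubernetes")` resolves to the `kubernetes/kubernetes` pair, while a repository missing from the map yields `None`.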
+ +### On PR open or push +1. Check if the PR's repo has an upstream configured in the `repositories` map. If not, do nothing silently. +2. List all commits in the PR via `ListPullRequestCommits`. +3. For each commit, take the first line of the commit message and match against the regex `^UPSTREAM: ([0-9]+): `. +4. If a commit does not match the pattern, classify it as **invalid** with the reason "does not specify an upstream backport in the commit message." +5. For commits that do match, extract the upstream PR number and call `GetPullRequest` on the upstream repo. +6. Classify each commit: + - **valid**: upstream PR exists and has merged (`pr.Merged == true`) + - **invalid**: upstream PR does not exist (404) or has not yet merged + - **error**: non-404 API failure when fetching the upstream PR + +### On `/validate-backports` command +Same flow as above, but if the repo has no upstream configured, posts a comment saying so and applies the `unvalidated` label. + +### Labels +Based on the validation result, the plugin ensures exactly one of two labels is present: + +| Label | Meaning | +|---|---| +| `backports/validated-commits` | All commits reference merged upstream PRs and no errors occurred | +| `backports/unvalidated-commits` | At least one commit is invalid or encountered an error | + +When one label is added, the other is removed. The validated label is only applied when there are zero invalid commits AND zero errors. + +### Comment format +The plugin posts a detailed comment listing each commit with its validation status, organized into three sections: +- **Valid commits** ("are valid"): each with a link to the merged upstream PR +- **Invalid commits** ("could not be validated and must be approved by a top-level approver"): with explanation (PR not found, not merged, or missing pattern) +- **Errored commits** ("could not be processed"): with the error message + +Each commit entry includes a shortened SHA link and the first line of the commit message. 
The comment ends with instructions to re-run `/validate-backports` when upstream PRs merge. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-path` | (required) | Path to the YAML config mapping downstream to upstream repos | +| `--hmac-secret-file` | (required) | GitHub webhook HMAC secret for signature verification | + +Standard Prow GitHub flags and `githubeventserver.Options` are also supported. + +## Key files +- `cmd/backport-verifier/main.go` -- entry point, flag parsing, config loading with hot-reload via ConfigMap mount watcher +- `cmd/backport-verifier/server.go` -- webhook handlers for issue comments and PR events, commit validation logic, upstream PR verification, label management, comment formatting + +## Deployment +Long-lived webhook Deployment. Listens for GitHub `issue_comment` (created) and `pull_request` (opened, synchronize) events. diff --git a/cmd/blocking-issue-creator/README.md b/cmd/blocking-issue-creator/README.md index bf19bf872b0..7982eed3c47 100644 --- a/cmd/blocking-issue-creator/README.md +++ b/cmd/blocking-issue-creator/README.md @@ -1,5 +1,58 @@ # blocking-issue-creator +## What +Creates and maintains GitHub issues labeled `tide/merge-blocker` on OCP repositories whose future release branches are frozen for merging. Tide treats these issues as hard merge blockers -- no PRs can merge into the affected branches while the issue exists. This prevents accidental merges into branches that are being fast-forwarded from the development branch. + +## How it works -- full flow + +1. **Discover repos and branches**: Iterates over all ci-operator config files in the config directory (via `promotion.FutureOptions.OperateOnCIOperatorConfigDir`). For each config that promotes images (excluding OKD), determines which branches correspond to `--future-release` versions using `promotion.DetermineReleaseBranch()`. Skips branches that are the same as the current development branch (those are open for merges). + +2. 
**Search for existing blocker issues**: For each repo, queries GitHub for open issues labeled `tide/merge-blocker` authored by the bot user. The query: `is:issue state:open label:"tide/merge-blocker" repo:{org}/{repo} author:{botLogin}`. + +3. **Reconcile issue state**: + - **No frozen branches**: close all existing blocker issues for the repo. + - **Frozen branches exist, >1 blocker issue**: close all but the most recently updated one, then update or leave it. + - **Frozen branches exist, 1 blocker issue**: compare title and body. Update if they changed, otherwise do nothing. + - **Frozen branches exist, 0 blocker issues**: create a new issue with the `tide/merge-blocker` label. + +4. **Rate limiting**: Sleeps 5 seconds between repos to stay within GitHub's 30 requests/minute secondary rate limit. + +### Issue format +- **Title**: `Future Release Branches Frozen For Merging | branch:release-4.18 branch:release-4.19` +- **Body**: Lists all frozen branches with a link to the branching documentation. +- **Labels**: `tide/merge-blocker` + +### How branch selection works +The tool uses the same branch-naming logic as `repo-brancher`: +- `master`/`main` branches map to `release-{futureVersion}` +- `openshift-{currentVersion}` branches map to `openshift-{futureVersion}` +- Other branch naming patterns are handled by `promotion.DetermineReleaseBranch()` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP development version, e.g. 
`4.17` | +| `--future-release` | (required, repeatable) | Future release versions to create blockers for | +| `--config-dir` | (required, via `FutureOptions`) | Path to ci-operator configuration directory | +| `--current-promotion-namespace` | (empty) | Promotion namespace filter | +| `--confirm` | false | Actually write changes to disk (dry-run by default) | +| `--dry-run` | true | Use dry-run mode for GitHub API (creates client but does not mutate) | +| `--github-token-path` | (Prow default) | Path to GitHub token | +| `--github-endpoint` | (Prow default) | GitHub API endpoint | + +## Key files +- `cmd/blocking-issue-creator/main.go` -- full implementation (single file) +- `pkg/promotion/promotion.go` -- `FutureOptions`, `DetermineReleaseBranch()` +- `pkg/config/options.go` -- `ConfirmableOptions` base + +## Deployment +Runs as the periodic Prow job [`periodic-openshift-release-merge-blockers`](https://prow.ci.openshift.org/?job=periodic-openshift-release-merge-blockers). Defined in `ci-operator/jobs/infra-periodics.yaml` in the openshift/release repo. + +GitHub API throttle: 300 requests/minute burst, 300 sustained. + +## Related +- `cmd/repo-brancher` -- the tool that actually fast-forwards the frozen branches +- `cmd/branchingconfigmanagers/fast-forwarding-config-manager` -- updates repo-brancher's job args ## What it does The `blocking-issue-creator` tool maintains diff --git a/cmd/branchingconfigmanagers/README.md b/cmd/branchingconfigmanagers/README.md index 3c76fdff7ed..b1c111039e6 100644 --- a/cmd/branchingconfigmanagers/README.md +++ b/cmd/branchingconfigmanagers/README.md @@ -1,11 +1,256 @@ -# Branching Config Managers +# branchingconfigmanagers -This directory contains individual components of the Automated Branch Cuts project, specifically the so-called "Config -Managers". Config Managers are programs that reconcile CI configuration based on the current state of the OCP lifecycle. 
-This document describes the contract and conventions for individual Config Managers, so they fit into the wider -Automated Branch Cuts system. In other words, Config Manager authors should refer to this document for guidance about -what are Config Managers expected to do. +## What +A family of config managers that reconcile CI configuration in the openshift/release repository based on the current phase of the OCP product lifecycle. Each sub-manager owns a specific, independent area of CI config (Bugzilla plugin settings, Tide merge criteria, fast-forward job args, release controller configs, RPM mirror repos, test frequencies, and release gating job configs). They are the automation backbone of OpenShift's branching and release lifecycle. +All sub-managers follow a common contract: given the current CI configuration (openshift/release working copy) and product lifecycle data, update the configuration to match the policy expected at the current point in time. They never modify git state -- committing and PR creation are handled separately by `prcreator`. + +## Shared concepts + +### OCP lifecycle data +All managers consume a lifecycle YAML file describing events per OCP version: + +```yaml +ocp: + "4.17": + - event: end-of-life + when: "2025-11-15T16:00:00Z" + - event: generally-available + when: "2024-10-15T16:00:00Z" + - event: code-freeze + when: "2024-09-20T02:00:00Z" + - event: feature-freeze + when: "2024-07-19T02:00:00Z" + - event: open + when: "2024-05-13T00:00:00Z" +``` + +Events are sorted descending by time. An event without a `when` field is assumed to be in the future unless provably in the past (a later event already occurred). The lifecycle config is confidential -- managers must avoid leaking dates. + +### `--overwrite-time` +Most managers accept `--overwrite-time` (RFC3339) to simulate running at a different point in time, useful for testing lifecycle transitions without waiting for them. 
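
A minimal sketch of how a manager could derive the most recent past event for a version from the lifecycle data above. This is an illustration only, not the code in `pkg/api/ocplifecycle`: the names `parse_when`, `most_recent_past_event`, and `EVENTS_4_17` are hypothetical, and events without a `when` field are simply treated as future, which simplifies the "provably in the past" rule. The `now` argument plays the role of `--overwrite-time`.

```python
from datetime import datetime

# The 4.17 events from the example lifecycle YAML, already parsed (assumed shape).
EVENTS_4_17 = [
    {"event": "end-of-life", "when": "2025-11-15T16:00:00Z"},
    {"event": "generally-available", "when": "2024-10-15T16:00:00Z"},
    {"event": "code-freeze", "when": "2024-09-20T02:00:00Z"},
    {"event": "feature-freeze", "when": "2024-07-19T02:00:00Z"},
    {"event": "open", "when": "2024-05-13T00:00:00Z"},
]


def parse_when(value):
    """Parse an RFC3339 UTC timestamp such as 2024-10-15T16:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def most_recent_past_event(events, now):
    """Return the name of the latest event whose time is at or before `now`.

    Simplification: events with no `when` are treated as future and skipped.
    Returns None if nothing has happened yet for this version.
    """
    past = [
        (parse_when(e["when"]), e["event"])
        for e in events
        if e.get("when") and parse_when(e["when"]) <= now
    ]
    if not past:
        return None
    return max(past, key=lambda pair: pair[0])[1]
```

Passing a simulated time, for instance `most_recent_past_event(EVENTS_4_17, parse_when("2024-08-01T00:00:00Z"))`, selects `feature-freeze`, which is the kind of check `--overwrite-time` makes possible without waiting for a real lifecycle transition.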
+ +--- + +## Sub-managers + +### bugzilla-config-manager + +#### What +Reconciles the Prow Bugzilla plugin configuration (`_pluginconfig.yaml`) to match the current release lifecycle. Sets `targetRelease`, `dependentBugStates`, `dependentBugTargetReleases`, and `validateByDefault` per branch. + +#### How it works +1. Loads lifecycle config and Prow plugin config from the sharded plugin config directory. +2. Classifies each OCP version into one of three categories: + - **Development version**: the version whose most recent past event is `open` (exactly one expected). Sets `targetRelease` on `main`/`master` branches to `X.Y.0`. + - **GA versions**: versions whose most recent past event is `generally-available`. For each, sets `targetRelease` to `X.Y.z`, `dependentBugTargetReleases` to `X.(Y+1).0`, and `validateByDefault: true`. Non-latest GA versions also add `X.(Y+1).z` to dependent targets. + - **Non-EOL, non-GA versions** (pre-GA, e.g. in feature/code freeze): sets `dependentBugStates` to `[MODIFIED, ON_QA, VERIFIED]`, `dependentBugTargetReleases` to `X.(Y+1).0`, `targetRelease` to `X.Y.0`, and `validateByDefault: true`. +3. Writes both `release-X.Y` and `openshift-X.Y` branch configs for each version. +4. Writes the updated sharded plugin config back to disk. + +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--lifecycle-config` | (required) | Path to lifecycle YAML | +| `--prow-plugin-config-dir` | (required) | Path to Prow plugin config directory | +| `--overwrite-time` | (now) | Simulate a different current time (RFC3339) | + +#### Key files +- `cmd/branchingconfigmanagers/bugzilla-config-manager/main.go` +- `pkg/api/ocplifecycle/` -- lifecycle config types and parsing +- `pkg/prowconfigsharding/` -- sharded plugin config read/write + +--- + +### tide-config-manager + +#### What +Adjusts Prow Tide merge query labels and branch targeting to enforce merge criteria appropriate for each OCP lifecycle phase. 
This is how labels like `staff-eng-approved`, `backport-risk-assessed`, `acknowledge-critical-fixes-only`, and `verified` get added to or removed from Tide queries at the right time. + +#### How it works +1. Loads the main Prow config and sharded Prow configs. +2. Based on `--lifecycle-phase`, creates the appropriate event handler and calls `shardprowconfig.ShardProwConfig()` which iterates over all Tide queries per org/repo and invokes `ModifyQuery()` on each. +3. Writes the modified sharded configs back to disk. + +#### Lifecycle phases and their effects + +| Phase | Effect on Tide queries | +|---|---| +| `branching` | On `release-X.Y` / `openshift-X.Y` branches: replace `staff-eng-approved` with `backport-risk-assessed` | +| `pre-general-availability` | On `release-X.Y` / `openshift-X.Y` branches: if `backport-risk-assessed` is present, also add `staff-eng-approved` | +| `general-availability` | Remove `backport-risk-assessed` from current release branch queries. Move `staff-eng-approved` queries from current to future release branch. Update excluded branches to include current and future versions. 
| +| `acknowledge-critical-fixes-only` | Add `acknowledge-critical-fixes-only` label to `main`/`master` branch queries for repos listed in the guard file | +| `revert-critical-fixes-only` | Remove `acknowledge-critical-fixes-only` label from `main`/`master` branch queries | +| `verified` | Add `verified` label to `main`/`master` and versioned branch queries for repos that promote to `ocp` namespace (auto-discovered from ci-operator configs), plus explicit opt-in repos, minus opt-out repos | + +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-config-dir` | (required) | Path to Prow configuration directory | +| `--sharded-prow-config-base-dir` | (required) | Base dir for sharded prow config output | +| `--lifecycle-phase` | (required) | One of: `branching`, `pre-general-availability`, `general-availability`, `acknowledge-critical-fixes-only`, `revert-critical-fixes-only`, `verified` | +| `--current-release` | (required for branching/pre-GA/GA) | Current OCP version, e.g. 
`4.17` | +| `--excluded-repos-config` | (empty) | Path to GA excluded repos config (repos allowed to skip future-branch exclusion checks) | +| `--repos-guarded-by-ack-critical-fixes` | (required for ack-critical-fixes) | Path to newline-separated list of repos | +| `--verified-opt-in` | (required for verified) | YAML file of repos to opt into verified label (`org: [repo1, repo2]`) | +| `--verified-opt-out` | (required for verified) | YAML file of repos to opt out of verified label | +| `--ci-operator-config-dir` | (required for verified) | Path to ci-operator config dir for auto-discovering OCP-promoting repos | + +#### Key files +- `cmd/branchingconfigmanagers/tide-config-manager/main.go` +- `pkg/api/shardprowconfig/shardprowconfig.go` -- `ShardProwConfig()` and `ShardProwConfigFunctors` interface + +--- + +### fast-forwarding-config-manager + +#### What +Updates the `periodic-openshift-release-fast-forward` periodic Prow job's `--current-release` and `--future-release` arguments based on which OCP version is currently in the "open" development phase. This keeps the fast-forward job (run by `repo-brancher`) pointing at the correct versions automatically. + +#### How it works +1. Loads the lifecycle config and reads the infra periodics job config file. +2. Finds the `periodic-openshift-release-fast-forward` job by name. +3. Builds a timeline of `open` and `feature-freeze` events using `lifecycleConfig.GetTimeline()`. +4. Determines where "now" falls in the timeline: + - If the next event is `open` for version X.Y: append `--future-release=X.Y` to the job args (a new version is about to open). + - If the current/previous event is `open` or `feature-freeze`: replace both `--current-release` and `--future-release` with the version from that event. +5. Writes the updated job config back to disk. 
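
The two argument rewrites in step 4 could look roughly like the following. This is a sketch under assumptions: it assumes the job passes its flags in `--flag=value` form, and `setReleases` and `addFutureRelease` are hypothetical helper names, not functions from `fast-forwarding-config-manager`.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	currentFlag = "--current-release="
	futureFlag  = "--future-release="
)

// setReleases drops any existing release flags and points both
// --current-release and --future-release at the given version.
func setReleases(args []string, version string) []string {
	out := make([]string, 0, len(args)+2)
	for _, a := range args {
		if strings.HasPrefix(a, currentFlag) || strings.HasPrefix(a, futureFlag) {
			continue
		}
		out = append(out, a)
	}
	return append(out, currentFlag+version, futureFlag+version)
}

// addFutureRelease appends --future-release=version unless that exact
// flag is already present. The input slice is never modified.
func addFutureRelease(args []string, version string) []string {
	want := futureFlag + version
	for _, a := range args {
		if a == want {
			return args
		}
	}
	return append(append([]string{}, args...), want)
}

func main() {
	args := []string{"--config-dir=./ci-operator/config", "--current-release=4.16", "--future-release=4.16"}
	// A new version is about to open: keep the current one, add the future one.
	fmt.Println(addFutureRelease(args, "4.17"))
	// The new version is open: both flags point at it.
	fmt.Println(setReleases(args, "4.17"))
}
```

Both helpers are idempotent, so running the manager repeatedly at the same point in the timeline leaves the job config unchanged.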
+ +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--lifecycle-config` | (required) | Path to lifecycle YAML | +| `--infra-periodics-path` | (empty) | Path to the infra periodic jobs config file | +| `--overwrite-time` | (now) | Simulate a different current time (RFC3339) | + +#### Key files +- `cmd/branchingconfigmanagers/fast-forwarding-config-manager/main.go` +- `pkg/api/ocplifecycle/` -- `GetTimeline()`, `DeterminePlaceInTime()` + +--- + +### release-controller-config-manager + +#### What +Bumps release controller configuration files for a new OCP version. Takes existing release controller configs for the current version and creates configs for the next version by replacing version references throughout the file (name, message, mirror prefix, CLI image, check/publish/verification sections). + +#### How it works +1. Parses the current release version from `--current-release`. +2. Scans `core-services/release-controller/_releases/` in the release repo for config files matching the current version pattern. +3. Uses the generic `bumper.Bump()` pipeline: finds matching files, reads each one, bumps version references from X.Y to X.(Y+1), and writes the result (with an updated filename) back to disk. +4. Version bumping is done via `ReplaceWithNextVersionInPlace()` which finds all `Major.Minor` patterns and increments the minor version. + +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP version, e.g. `4.17` | +| `--release-repo` | (required) | Absolute path to `openshift/release` working copy | +| `--log-level` | 5 (debug) | Log verbosity level | + +#### Key files +- `cmd/branchingconfigmanagers/release-controller-config-manager/main.go` +- `pkg/branchcuts/bumper/release-controller-config-bumper.go` +- `pkg/branchcuts/bumper/bumper.go` -- generic `Bump()` pipeline + +--- + +### rpm-deps-mirroring-services + +#### What +Bumps RPM mirror `.repo` files for a new OCP version. 
These files define RHEL RPM repositories used during OCP builds. When a new version is cut, the mirror configs need corresponding entries for the next version. + +#### How it works +1. Parses the current release version. +2. Scans `core-services/release-controller/_repos/` for `.repo` files matching the glob `ocp-X.Y*.repo`. +3. For each file, parses it as an INI file and replaces version strings (e.g. `4.17` to `4.18`) in section names and `baseurl` values. +4. Writes updated files with bumped filenames. + +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP version | +| `--release-repo` | (required) | Absolute path to `openshift/release` working copy | +| `--log-level` | 5 (debug) | Log verbosity level | + +#### Key files +- `cmd/branchingconfigmanagers/rpm-deps-mirroring-services/main.go` +- `pkg/branchcuts/bumper/repo/repo-bumper.go` + +--- + +### generated-release-gating-jobs + +#### What +Bumps ci-operator config files that define release gating jobs (e.g. `openshift/release` configs for aggregated jobs) from the current OCP version to the next. Updates base images, release references, test definitions, metadata variants, branch references, step environment variables, and the `interval` field. + +#### How it works +1. Parses the current release version. +2. Finds ci-operator config files that provide gating signals for the current version. Two file-finding strategies: + - `signal` (default): discovers files from any repo that provide a signal for the given OCP version via ci-operator config metadata. + - `regexp`: regex-matches files in `ci-operator/config/openshift/release/` by version pattern. +3. For each file, bumps all version references from X.Y to X.(Y+1) in: base images, releases, tests, metadata variant, metadata branch, and step environment variables. +4. Sets the test `interval` to the specified value (default 168 hours = 1 week). +5. Writes updated configs with bumped filenames. 
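
The X.Y to X.(Y+1) bump that the bumper-based managers share can be illustrated as follows. This is a sketch only: `nextMinor` and `bumpVersion` are hypothetical helpers, not the `pkg/branchcuts/bumper` implementation, and the digit-boundary check is an assumption about how to avoid rewriting unrelated numbers such as `4.170` or `14.17`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextMinor turns "X.Y" into "X.(Y+1)", e.g. "4.9" into "4.10".
func nextMinor(current string) (string, error) {
	major, minor, ok := strings.Cut(current, ".")
	if !ok {
		return "", fmt.Errorf("version %q is not in X.Y form", current)
	}
	n, err := strconv.Atoi(minor)
	if err != nil {
		return "", fmt.Errorf("invalid minor version in %q: %w", current, err)
	}
	return fmt.Sprintf("%s.%d", major, n+1), nil
}

// digitAt reports whether s[i] exists and is an ASCII digit.
func digitAt(s string, i int) bool {
	return i >= 0 && i < len(s) && s[i] >= '0' && s[i] <= '9'
}

// bumpVersion replaces each occurrence of current with next, skipping
// occurrences that are part of a longer number.
func bumpVersion(s, current, next string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		if strings.HasPrefix(s[i:], current) && !digitAt(s, i-1) && !digitAt(s, i+len(current)) {
			b.WriteString(next)
			i += len(current)
			continue
		}
		b.WriteByte(s[i])
		i++
	}
	return b.String()
}

func main() {
	next, err := nextMinor("4.17")
	if err != nil {
		panic(err)
	}
	// The file name and the content are bumped with the same rule.
	fmt.Println(bumpVersion("openshift-release-master__nightly-4.17.yaml", "4.17", next))
	fmt.Println(bumpVersion("name: 4.17.0-0.nightly", "4.17", next))
}
```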
+ +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP version | +| `--release-repo` | (required) | Absolute path to `openshift/release` working copy | +| `--interval` | 168 | New interval value (hours) to set on gating jobs | +| `--log-level` | 5 (debug) | Log verbosity level | +| `--file-finder` | `signal` | Method to find gating job files: `regexp` or `signal` | + +#### Key files +- `cmd/branchingconfigmanagers/generated-release-gating-jobs/main.go` +- `pkg/branchcuts/bumper/gen-release-jobs-bumper.go` + +--- + +### frequency-reducer + +#### What +Reduces the execution frequency of periodic CI tests for older OCP versions. As versions age, their tests run less frequently to save CI resources while maintaining coverage. Only affects `openshift` and `openshift-priv` org repos. + +#### How it works +1. Iterates over all ci-operator config files in the config directory. +2. For each test with a `cron` or `interval` field (excluding `mirror-nightly-image` and `promote-` tests), determines the test's OCP version from the branch name. +3. Applies frequency reduction based on age relative to the current version: + +| Version age | Target frequency | Cron threshold | +|---|---|---| +| Older than current-2 | Monthly (1x/month) | Reduced only if currently >1x/month | +| current-2 (past-past) | Bi-weekly (2x/month) | Reduced only if currently >2x/month | +| current-1 (past) | Weekly (1x/week, weekends) | Reduced only if currently >4x/month | +| Current or newer | No change | -- | + +4. When reducing frequency, `interval`-based schedules are converted to `cron` expressions. Generated cron times are randomized to spread load. +5. Writes updated configs back to disk. 
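
The age-based policy in the table above can be written as a small decision function. This is an illustrative sketch, assuming all versions share the same major version and that a test's current frequency is already known as runs per month; `targetFor` and `shouldReduce` are hypothetical names, not functions from `frequency-reducer`.

```go
package main

import "fmt"

// Target describes the frequency a test should be reduced to.
type Target struct {
	Name            string
	MaxRunsPerMonth int // reduce only if the test currently runs more often
}

var (
	noChange = Target{Name: "no change"}
	weekly   = Target{Name: "weekly", MaxRunsPerMonth: 4}
	biWeekly = Target{Name: "bi-weekly", MaxRunsPerMonth: 2}
	monthly  = Target{Name: "monthly", MaxRunsPerMonth: 1}
)

// targetFor maps a test's minor version to a target frequency relative
// to the current minor version, following the table above.
func targetFor(currentMinor, testMinor int) Target {
	age := currentMinor - testMinor
	switch {
	case age <= 0:
		return noChange
	case age == 1:
		return weekly
	case age == 2:
		return biWeekly
	default:
		return monthly
	}
}

// shouldReduce reports whether a test running runsPerMonth times per
// month exceeds the threshold of its target frequency.
func shouldReduce(t Target, runsPerMonth int) bool {
	return t != noChange && runsPerMonth > t.MaxRunsPerMonth
}

func main() {
	// With 4.17 as the current release, a daily 4.15 test is two versions old.
	t := targetFor(17, 15)
	fmt.Println(t.Name, shouldReduce(t, 30))
}
```

Tests that already run at or below the threshold are left alone, so the reducer only ever lowers a frequency.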
+ +#### Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP version | +| `--config-dir` | (required, via `ConfirmableOptions`) | Path to ci-operator config directory | +| `--confirm` | false | Actually write changes (dry-run by default) | + +#### Key files +- `cmd/branchingconfigmanagers/frequency-reducer/main.go` + +--- + +## Deployment +All sub-managers run as periodic Prow jobs. The typical pattern is: +1. Periodic job runs the config manager against a checkout of openshift/release +2. If it exits 0 (config now matches policy), the `prcreator` tool commits changes and opens a PR + +Managers that take `--overwrite-time` (see the flag tables above) can be run at a simulated time to test lifecycle transitions. + +## Related +- `cmd/repo-brancher` -- fast-forwards branches, configured by `fast-forwarding-config-manager` +- `cmd/blocking-issue-creator` -- creates merge-blocker issues for frozen branches +- `pkg/api/ocplifecycle/` -- lifecycle config types, timeline computation +- `pkg/branchcuts/bumper/` -- generic bumper framework used by release-controller, rpm-deps, and gating-jobs managers ## Config Manager Contract Given the current state of the CI configuration (openshift/release repository working copy) and product lifecycle data, @@ -98,4 +343,3 @@ it ended with zero exit code, it should use the `prcreator` tool to commit the c changes. TODO: What period - diff --git a/cmd/bugzilla-backporter/README.md b/cmd/bugzilla-backporter/README.md new file mode 100644 index 00000000000..25d50447792 --- /dev/null +++ b/cmd/bugzilla-backporter/README.md @@ -0,0 +1,37 @@ +# bugzilla-backporter + +## What +**DEPRECATED.** A web UI server for cloning (backporting) Bugzilla bugs to different target releases. Users could look up a bug by ID, see its current clones, and create new clones targeting other OCP releases. Deprecated because OpenShift moved from Bugzilla to Jira for bug tracking. + +## How it works +1. Starts an HTTP server on the configured address.
Reads the Prow plugin config (`plugins.yaml`) to extract all known Bugzilla `targetRelease` values for the `openshift` org, sorted for the UI dropdown. +3. Connects to the Bugzilla API using a secret API key, with a caching HTTP transport layer to reduce API calls. +4. Exposes these HTTP endpoints: + +| Endpoint | What it does | +|---|---| +| `/` | Landing page | +| `/clones` | Look up existing clones of a bug | +| `/clones/create` | Create a new clone targeting a different release | +| `/bug` | Return bug details as JSON | +| `/help` | Help/debug endpoint | + +5. Exports Prometheus metrics under `bugzillabackporter` (request duration, response size). + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log verbosity | +| `--address` | `:8080` | Server listen address | +| `--gracePeriod` | `10s` | Shutdown grace period | +| `--plugin-config` | `/etc/plugins/plugins.yaml` | Path to Prow plugin config for target release discovery | +| `--bugzilla-api-key-path` | (Prow default) | Path to Bugzilla API key secret | +| `--bugzilla-endpoint` | (Prow default) | Bugzilla API URL | + +## Key files +- `cmd/bugzilla-backporter/main.go` -- server setup and endpoint wiring +- `pkg/backporter/` -- handler implementations, caching transport, sorting logic + +## Deployment +Appears decommissioned — no Deployment manifest exists in the release repo. A stale Ingress entry for `bugs.ci.openshift.org` remains in `clusters/app.ci/cert-manager/prow_ingress.yaml` but the actual Deployment and Service have been removed. diff --git a/cmd/check-cluster-profiles-config/README.md b/cmd/check-cluster-profiles-config/README.md new file mode 100644 index 00000000000..d09afd193aa --- /dev/null +++ b/cmd/check-cluster-profiles-config/README.md @@ -0,0 +1,37 @@ +# check-cluster-profiles-config + +## What +Validates the cluster profile configuration file used by OpenShift CI. 
Checks for structural correctness (no duplicate profiles or duplicate orgs within a profile) and verifies that the required Kubernetes Secret for each profile exists in the `ci` namespace. Uses the ci-operator config resolver to look up the secret name for each profile. + +## How it works -- full flow + +1. **Load config**: Read and unmarshal the cluster profiles YAML file from `--config-path`. The file contains a `ClusterProfilesList` -- an array of profile definitions, each with a profile name and a list of owners (org references). + +2. **Validate structure** (`Validate`): + - For each profile in the list: + - Check that no org appears more than once within the same profile's owners list + - Check that the profile name has not already been defined earlier in the file (no duplicate profiles) + - Build an internal `ClusterProfilesMap` for subsequent checks. + +3. **Verify secrets** (`checkCiSecrets`): + - For each validated profile: + - Query the ci-operator config resolver (`config.ci.openshift.org`) via `NewResolverClient().ClusterProfile()` to get the profile's details, including its secret name + - Attempt to `GET` the corresponding Secret from the `ci` namespace on the local cluster + - Fatal if the secret does not exist or cannot be retrieved + +4. If all checks pass, log success and exit 0. If any check fails, fatal with the error. + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--config-path` | `""` | Path to the cluster profile configuration YAML file | + +## Key files + +- `cmd/check-cluster-profiles-config/main.go` -- all logic: config loading, validation, secret verification +- `pkg/api/types.go` -- `ClusterProfilesMap`, `ClusterProfilesList` types +- `pkg/registry/server/` -- `NewResolverClient()` for querying the config resolver + +## Deployment +One-shot CLI tool, typically run in CI (presubmit or postsubmit) to validate changes to the cluster profiles config. 
Requires in-cluster access to the `ci` namespace for secret verification and network access to the config resolver service. diff --git a/cmd/check-gh-automation/README.md b/cmd/check-gh-automation/README.md index 4bc172d5b11..54d0f48d134 100644 --- a/cmd/check-gh-automation/README.md +++ b/cmd/check-gh-automation/README.md @@ -1,6 +1,68 @@ -# Check GH Automation -A tool to check that our bots (`openshift-merge-robot`, `openshift-ci-robot`, and `openshift-cherrypick-robot`) have access to repositories that have CI configured. It also checks that the app used to run it is installed in the repositories. This tool also verifies if `openshift-cherrypick-robot` is an organization member for repos that have the `cherrypick` external prow plugin configured. +# check-gh-automation +## What +Validates that GitHub automation (bots, apps, branch protection) is correctly configured for repositories that use OpenShift CI. Checks bot collaborator access, CI app installation, branch protection admin requirements, cherrypick robot permissions, and automated branching prerequisites. Can run against all Prow-configured repos, a specific list, or only repos modified in a PR. + +## How it works -- full flow + +### Repository selection +The tool determines which repos to check using three strategies (in priority order): +1. **Explicit list** (`--repo org/repo`): check only the specified repos +2. **PR-scoped** (`--candidate-path`): resolve the PR's `JobSpec` from environment, diff against base SHA to find added/modified ci-operator configs and prow configs, extract unique `org/repo` pairs. If more than 10 repos are found, skip all checks (likely a bulk config update, not a new repo). +3. 
**All Prow repos** (default): check every repo referenced in the Prow config's `AllRepos` set + +### Checks performed for each repo + +#### Bot access (`--bot`, repeatable) +For each specified bot username: +- Check if the bot is an org member (`IsMember`) +- If not an org member, check if it is a repository collaborator (`IsCollaborator`) +- Fail the repo if any bot has neither org membership nor collaborator access + +#### App installation (`--app`, default `openshift-ci`) +Two modes controlled by `--app-check-mode`: +- **`standard`** (default): always check if the app is installed on the repo +- **`tide`**: only check if at least one Tide query exists for the repo; skip otherwise + +Calls `IsAppInstalled()` on the GitHub client. + +#### Branch protection (`--check-branch-protection`, default true) +If branch protection is configured for the repo (not explicitly set to `unmanaged`): +- Verify the repo is public (or the org has a paid GitHub plan; free plan orgs cannot use branch protection on private repos) +- Verify `openshift-merge-robot` has `admin` permission on the repo (required to manage branch protection rules) + +#### Cherrypick robot (when plugin config is loaded) +If the `cherrypick` external plugin is configured for the repo (at org or repo level): +- Check that `openshift-cherrypick-robot` is either an org member or has read/write/admin access on the repo + +#### Automated branching (when `--candidate-path` is set) +If a ci-operator config is found for the repo and it promotes to the `ocp` namespace: +- Verify that GitHub Issues are enabled on the repository (required for automated branching notifications) + +### Output +Collects all failing repos into a set. If any repos fail, exits with a fatal log listing them all. 
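
The bot access check and the failing-repo set described above can be sketched as follows. This is a minimal illustration: `accessChecker` is a hypothetical local interface and `fakeGitHub` an in-memory stand-in, both assumptions for this example rather than the client the tool actually uses; only the method names `IsMember` and `IsCollaborator` come from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// accessChecker is a hypothetical, minimal view of a GitHub client.
type accessChecker interface {
	IsMember(org, user string) (bool, error)
	IsCollaborator(org, repo, user string) (bool, error)
}

// fakeGitHub is an in-memory accessChecker used only for this example.
type fakeGitHub struct {
	members       map[string]bool // keyed by "org/user"
	collaborators map[string]bool // keyed by "org/repo/user"
}

func (f fakeGitHub) IsMember(org, user string) (bool, error) {
	return f.members[org+"/"+user], nil
}

func (f fakeGitHub) IsCollaborator(org, repo, user string) (bool, error) {
	return f.collaborators[org+"/"+repo+"/"+user], nil
}

// missingBots returns the bots that are neither org members nor
// collaborators on the repo. Membership is checked first.
func missingBots(c accessChecker, org, repo string, bots []string) ([]string, error) {
	var missing []string
	for _, bot := range bots {
		member, err := c.IsMember(org, bot)
		if err != nil {
			return nil, fmt.Errorf("checking membership of %s in %s: %w", bot, org, err)
		}
		if member {
			continue
		}
		collab, err := c.IsCollaborator(org, repo, bot)
		if err != nil {
			return nil, fmt.Errorf("checking %s on %s/%s: %w", bot, org, repo, err)
		}
		if !collab {
			missing = append(missing, bot)
		}
	}
	return missing, nil
}

func main() {
	gh := fakeGitHub{
		members:       map[string]bool{"openshift/openshift-ci-robot": true},
		collaborators: map[string]bool{"openshift/ci-tools/openshift-merge-robot": true},
	}
	bots := []string{"openshift-ci-robot", "openshift-merge-robot"}

	// Collect every failing repo before reporting, as the tool does.
	failing := map[string]struct{}{}
	for _, r := range [][2]string{{"openshift", "ci-tools"}, {"example-org", "example-repo"}} {
		missing, err := missingBots(gh, r[0], r[1], bots)
		if err != nil || len(missing) > 0 {
			failing[r[0]+"/"+r[1]] = struct{}{}
		}
	}
	names := make([]string, 0, len(failing))
	for name := range failing {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println("repos failing checks:", names)
}
```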
+ +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--bot` | (none, repeatable) | Bot username to check for collaborator/member access | +| `--app` | `openshift-ci` | GitHub App name to check for installation | +| `--app-check-mode` | `standard` | `standard`: always check app; `tide`: only check if Tide queries exist for the repo | +| `--check-branch-protection` | `true` | Verify `openshift-merge-robot` has admin access where branch protection is enabled | +| `--ignore` | (none, repeatable) | Org or org/repo to skip. Formatted as `org` or `org/repo` | +| `--repo` | (none, repeatable) | Specific org/repo to check (overrides auto-detection) | +| `--candidate-path` | `""` | Path to openshift/release working copy; enables PR-scoped repo detection | +| Prow config flags | -- | `--config-path`, `--job-config-path`, `--supplemental-prow-config-dir` via `ConfigOptions` | +| Plugin config flags | -- | `--plugin-config` via `PluginOptions`; if not set, cherrypick checks are skipped | +| GitHub flags | -- | `--github-token-path`, `--github-endpoint`, etc. via `GitHubOptions` | + +## Key files + +- `cmd/check-gh-automation/main.go` -- all logic: repo determination (`determineRepos`, `gatherModifiedRepos`), checks (`checkRepos`), GitHub API interactions + +## Deployment +CLI tool. Typically run as a presubmit job on openshift/release PRs (with `--candidate-path` pointing to the PR checkout) and as a periodic job (checking all Prow-configured repos). Requires a GitHub token with read access to check org membership, collaborator status, app installation, and permissions. 
This can be run in multiple modes: ## Pass Prow Config Options diff --git a/cmd/ci-images-mirror/README.md b/cmd/ci-images-mirror/README.md index 7275366722d..5688a1652d3 100644 --- a/cmd/ci-images-mirror/README.md +++ b/cmd/ci-images-mirror/README.md @@ -1,9 +1,83 @@ # ci-images-mirror -This tool mirrors images used by tests in OpenShift CI from `app.ci`'s integrated registry to `quay.io/openshift/ci`. It is a temporary automation while migrating all users of the CI images -from `app.ci` to `quay.io`. When the migration process is complete, i.e., all clients uses images from `quay.io`, -this tool will be decommissioned and CI images will stop promoting images to `app.ci`. +## What +Mirrors CI images from the app.ci OpenShift image registry to `quay.io/openshift/ci`. This ensures CI images are available in Quay for consumers that cannot directly pull from the internal cluster registry. It consists of two cooperating systems: a controller-runtime controller that watches ImageStreams for changes and queues mirror tasks, and a consumer loop that processes the queue by executing `oc image mirror` commands. +## How it works -- full flow + +### Architecture overview +The tool runs three concurrent subsystems: + +1. **ImageStream controller** (`quay_io_ci_images_distributor`): watches ImageStream events on the app.ci cluster and determines which tags need mirroring based on ci-operator configs and the step registry +2. **Mirror consumer**: a background goroutine that continuously takes tasks from the mirror store and executes `oc image mirror` commands +3. 
**Supplemental/ART image service**: periodic tickers that mirror images defined in the config file (supplemental CI images and ART images) + +### ImageStream controller +- Watches ImageStream resources and maps events to individual ImageStreamTag reconcile requests +- Filters tags through `testInputImageStreamTagFilterFactory`: only mirrors tags that are referenced by ci-operator test configs (as base images, test inputs, or release inputs) or are in additional namespaces/imagestreams/tags specified via flags +- Tags in the `--ignore-image-stream-tag` list and supplemental CI image targets are excluded from controller mirroring (they are handled separately) +- On reconcile, compares the source image digest with the existing Quay target digest. If they differ, queues a `MirrorTask` in the mirror store +- Only mirrors images with valid manifest v2 by default (`--only-valid-manifest-v2-images`) + +### Mirror store and consumer +- The `MirrorStore` is an in-memory queue (keyed by destination) of `MirrorTask` objects +- The `MirrorConsumerController` runs in a loop: takes batches of 10 tasks, executes `oc image mirror` with the configured registry credentials +- Includes metrics tracking for queue depth and mirror operations + +### Supplemental CI images +Defined in the config file under `supplementalCIImages`. These are images not discovered by the controller but explicitly configured for mirroring (e.g., third-party images needed by CI). +- Runs on an hourly ticker +- Compares source and target digests before mirroring (skip if already in sync) +- Creates a backup/prune task before each mirror to preserve the old image + +### ART images +Defined in the config file under `artImages`. These are ART (Automated Release Team) images resolved from ImageStreams on the cluster. 
+- Also runs on an hourly ticker +- Uses the same mirror-store mechanism + +### Image naming convention +Target images in Quay follow the pattern: `quay.io/openshift/ci:{namespace}_{name}_{tag}` (slashes and colons in the ImageStreamTag name are replaced with underscores). + +### Config validation mode +With `--validate-config-only`, the tool validates the config file against the release repo (checks that source images are accessible, targets don't overwrite promoted tags, etc.) and exits. + +### HTTP API +Exposes an API server for monitoring: +- `GET /api/health` -- health check +- `GET /api/v1/mirrors?action=summarize` -- summary of the mirror queue +- `GET /api/v1/mirrors?action=show&limit=N` -- show N pending mirror tasks + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--leader-election-namespace` | `ci` | Namespace for leader election | +| `--leader-election-suffix` | (empty) | Suffix for leader election lock (for local testing, requires `--dry-run`) | +| `--enable-controller` | (none) | Enable specific controllers. 
Available: `quay_io_ci_images_distributor` | +| `--dry-run` | `false` | Dry-run mode | +| `--release-repo-git-sync-path` | (required) | Path to the release repository (for ci-operator configs and step registry) | +| `--config` | (empty) | Path to the CIImagesMirrorConfig file (supplemental and ART images) | +| `--registry-config` | (required) | Path to Docker registry credentials file | +| `--only-valid-manifest-v2-images` | `true` | Skip images with invalid manifest v2 | +| `--port` | `8090` | HTTP API server port | +| `--gracePeriod` | `10s` | Server shutdown grace period | +| `--validate-config-only` | `false` | Validate config and exit | +| `--quayIOCIImagesDistributorOptions.additional-image-stream-tag` | (none) | Extra ISTs to mirror (can repeat) | +| `--quayIOCIImagesDistributorOptions.additional-image-stream` | (none) | Extra ISs to mirror (can repeat) | +| `--quayIOCIImagesDistributorOptions.additional-image-stream-namespace` | (none) | Extra namespaces to mirror (can repeat) | +| `--quayIOCIImagesDistributorOptions.ignore-image-stream-tag` | (none) | ISTs to skip mirroring (can repeat) | + +## Key files +- `cmd/ci-images-mirror/main.go` -- entry point, manager setup, supplemental image service, HTTP API, config validation +- `pkg/controller/quay_io_ci_images_distributor/quay_io_ci_images_distributor.go` -- ImageStream controller, tag filtering, reconciler +- `pkg/controller/quay_io_ci_images_distributor/mirror.go` -- MirrorStore, MirrorConsumerController, `oc image mirror` execution +- `pkg/controller/quay_io_ci_images_distributor/oc_quay_io_image_helper.go` -- `oc image info` wrapper +- `pkg/controller/quay_io_ci_images_distributor/supplemental_images.go` -- config file loading and types +- `pkg/controller/quay_io_ci_images_distributor/metrics.go` -- Prometheus metrics + +## Deployment +Long-lived controller-runtime Deployment on app.ci with leader election. 
Requires in-cluster access to the OpenShift image registry and registry credentials for Quay push access. + +Uses a git-sync sidecar to keep the release repo up-to-date for ci-operator config and step registry resolution. The tool watches all image stream tags on `app.ci` and compares each tag's digest with that of the target image on `quay.io`. If the digests differ or the target image does not exist, a mirroring task is added to a queue, to be picked up and executed afterwards. @@ -73,4 +147,4 @@ $ curl -s http://localhost:28090/api/v1/mirrors\?action\=show\&limit\=1 | jq } $ oc logs -n ci -l app=ci-images-mirror -c ci-images-mirror -f | grep -i keep-manifest-list -``` \ No newline at end of file +``` diff --git a/cmd/ci-operator-checkconfig/README.md b/cmd/ci-operator-checkconfig/README.md index 584a8eb99a7..f5f5ed2c4c6 100644 --- a/cmd/ci-operator-checkconfig/README.md +++ b/cmd/ci-operator-checkconfig/README.md @@ -1,37 +1,81 @@ -`ci-operator-checkconfig` -========================= - -This program can be used to perform validation over a set of `ci-operator` -configuration files. It is used in [`openshift/release`][openshift_release] to -enforce the correctness of all configuration files present there, via a -[pre-submit job][presubmit_job]. - -It acts mostly as a front-end for the validation code in -[`pkg/validation`][pkg_validation], which is also used by other components, -guaranteeing the configuration files will be usable by them. Since it operates -on several thousands of files, the validation code must be efficient and work at -scale. Files are validated in parallel and work is reused between them as much -as possible. - -Validation is performed after loading information from `openshift/release` and -is based on the resolved contents of the configuration files (meaning -multi-stage tests are fully expanded), so the same checks done just prior to the -actual execution of the test can also be done here.
Since all configuration -files are loaded, cross-configuration validation can also be performed. - -Testing locally ---------------- - -To validate a local copy of `openshift/release`, simply execute: - -```console -ci-operator-checkconfig \ - --config-dir path/to/release/ci-operator/config \ - --registry path/to/release/ci-operator/step-registry \ - --cluster-profiles-config path/to/release/ci-operator/step-registry/cluster-profiles/cluster-profiles-config.yaml - … -``` - -[openshift_release]: https://github.com/openshift/release.git -[pkg_validation]: https://github.com/openshift/ci-tools/tree/master/pkg/validation -[presubmit_job]: https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-release-master-ci-operator-config +# ci-operator-checkconfig + +## What +Validates ci-operator configuration files for correctness. This is the primary static analysis gate that catches configuration mistakes before they can break CI jobs. It runs all structural, semantic, and cross-config validations in parallel, including checking for duplicate promotion targets across the entire configuration corpus. + +Used in presubmit checks on openshift/release to prevent invalid ci-operator configs from merging. + +## How it works -- full flow + +### Startup +1. Parse flags: `--config-dir` (ci-operator configs), `--registry` (step registry), `--cluster-profiles-config`, `--cluster-claim-owners-config`, `--cluster-profile-set-details`, plus filtering flags `--org` and `--repo` +2. If `--registry` is provided, load the full step registry (references, chains, workflows, observers) and build a `Resolver` from it +3. Load cluster profiles config and cluster claim owners config from their respective paths +4. Create a `ConfigAgent` that loads all ci-operator configs from `--config-dir`, optionally filtered by `--org`/`--repo` +5. 
Optionally load cluster profile set details (JSON file mapping profiles to available sets) + +### Validation (parallel produce-map-reduce) +The validation runs as a concurrent pipeline using `ProduceMapReduce`: + +**Produce phase**: Iterates over all loaded configs from the ConfigAgent and sends them to worker goroutines. + +**Map phase** (per config, concurrent): Each configuration is validated through multiple layers: + +1. **Registry resolution validation** (if `--registry` is set): Resolves all multi-stage test references through the step registry, then validates the fully-resolved configuration via `IsValidResolvedConfiguration()`. This catches references to nonexistent steps, chains, or workflows. + +2. **Config agent matching**: Verifies the config can be matched by the ConfigAgent (catches filename/metadata mismatches where the YAML content disagrees with the filesystem path). + +3. **Graph configuration validation**: Converts the config to a static graph via `FromConfigStatic()` and validates it with `IsValidGraphConfiguration()`, which checks: + - No duplicate build targets across the entire pipeline + - Container test `from` images exist in the pipeline + - Multi-stage test step `from` images reference known pipeline images + - Shard counts are valid (>1, not allowed on postsubmits) + +4. **Promoted tag collection**: Extracts all promoted image tags and sends them to the reduce phase for cross-config duplicate detection. + +5. **Registry override check**: Rejects any config that sets `promotion.registry_override` (this field is not allowed). + +**Reduce phase**: Collects all promoted tags across all configs and checks for duplicates -- the same `ImageStreamTag` being promoted from multiple org/repo/branch combinations is an error. 
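
The reduce phase's duplicate detection can be illustrated with a small grouping function. This is a sketch under assumptions: the `promotion` type, its string encodings, and `duplicatePromotions` are hypothetical and do not mirror the types used in `cmd/ci-operator-checkconfig/main.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// promotion pairs a promoted tag with the config that promotes it.
// Both string formats are illustrative assumptions.
type promotion struct {
	Tag    string // e.g. "ocp/4.17:cli"
	Source string // e.g. "openshift/oc@release-4.17"
}

// duplicatePromotions groups promotions by tag and returns only the tags
// promoted from more than one distinct source, with sources sorted.
func duplicatePromotions(ps []promotion) map[string][]string {
	sources := map[string]map[string]struct{}{}
	for _, p := range ps {
		if sources[p.Tag] == nil {
			sources[p.Tag] = map[string]struct{}{}
		}
		sources[p.Tag][p.Source] = struct{}{}
	}
	dups := map[string][]string{}
	for tag, set := range sources {
		if len(set) < 2 {
			continue
		}
		for s := range set {
			dups[tag] = append(dups[tag], s)
		}
		sort.Strings(dups[tag])
	}
	return dups
}

func main() {
	collected := []promotion{
		{Tag: "ocp/4.17:cli", Source: "openshift/oc@release-4.17"},
		{Tag: "ocp/4.17:cli", Source: "example-org/fork@release-4.17"},
		{Tag: "ocp/4.17:installer", Source: "openshift/installer@release-4.17"},
	}
	for tag, srcs := range duplicatePromotions(collected) {
		fmt.Printf("%s is promoted from multiple sources: %v\n", tag, srcs)
	}
}
```

Because the map workers only emit tags and the grouping happens in one place, the per-config validation stays independent while the cross-config check still sees the whole corpus.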
+ +### Specific validations performed +The `Validator` validates (non-exhaustive highlights): + +- **`build_root`**: exactly one of `image_stream_tag`, `project_image`, or `from_repository` must be set; `image_stream_tag` requires namespace/name/tag +- **`base_images`/`base_rpm_images`**: tags must be set, names cannot be `root` or reserved bundle prefixes (`src-bundle-*`, `ci-index-*`) +- **`images`**: `to` must be set, no duplicate pipeline image names, `dockerfile_literal` is mutually exclusive with `context_dir`/`dockerfile_path`, valid architectures only (amd64, arm64, ppc64le, s390x), `run_if_changed`/`skip_if_only_changed`/`pipeline_run_if_changed`/`pipeline_skip_if_only_changed` are mutually exclusive +- **`promotion`**: namespace required, cannot promote to `kube*`/`openshift*`/`default`/`redhat*` namespaces (with exceptions), no duplicate targets, official image promoters must import a release stream +- **`releases`**: exactly one of integration/candidate/prerelease/release per entry, valid products/streams/architectures/versions (minor version format X.Y), `latest`/`initial` cannot coexist with `tag_specification` +- **`resources`**: must have a `*` blanket policy, quantities must be positive and parseable, valid resource keys only (cpu, memory, ephemeral-storage, devices.kubevirt.io/kvm, /dev/shm, nvidia.com/gpu) +- **`tests`**: names must be valid DNS subdomains with length limits (61 chars general, 42 for claim tests), cron expressions must parse, `run_if_changed` and friends are mutually exclusive +- **`operator`**: bundle `base_index` and `skip_building_index` require `as`, substitution `with` must resolve to a known image, valid `update_graph` values +- **`external_images`**: registry must be `quay.io/*`, cannot collide with `base_images` keys +- **General**: must define at least one test or image, `rpm_build_location` requires `rpm_build_commands`, `canonical_go_repository` must not duplicate the default value, step dependencies must resolve + 
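As an illustration of the "exactly one of" style of rule listed above, here is a minimal sketch of the `build_root` check. It is in Python rather than the validator's Go, the function name and error strings are hypothetical, and it assumes a field counts as set when it is truthy (so `from_repository: false` is treated as unset).

```python
def validate_build_root(build_root):
    """Return a list of error strings for a build_root mapping (sketch)."""
    options = ("image_stream_tag", "project_image", "from_repository")
    # Assumption: a field is "set" when its value is truthy.
    set_fields = [name for name in options if build_root.get(name)]

    errors = []
    if len(set_fields) != 1:
        errors.append(
            "build_root: exactly one of %s must be set, found %d"
            % (", ".join(options), len(set_fields))
        )

    # image_stream_tag additionally requires namespace, name, and tag.
    image_stream_tag = build_root.get("image_stream_tag")
    if image_stream_tag:
        for key in ("namespace", "name", "tag"):
            if not image_stream_tag.get(key):
                errors.append("build_root.image_stream_tag.%s must be set" % key)
    return errors
```

Returning a list instead of raising on the first problem mirrors the behaviour described in this README, where every validation error is logged before the process exits.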
+### Exit +If any validation errors are found, each is logged individually and the process exits with a fatal error. Exit code 0 means all configs are valid. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-dir` | (required) | Path to ci-operator configuration directory | +| `--registry` | `""` | Path to step registry directory; enables registry resolution validation | +| `--cluster-profiles-config` | `""` | Path to cluster profiles config file for profile validation | +| `--cluster-claim-owners-config` | `""` | Path to cluster claim owners config file | +| `--cluster-profile-set-details` | `""` | Path to JSON file with cluster profile set details | +| `--org` | `""` | Limit validation to configs in this org | +| `--repo` | `""` | Limit validation to configs in this repo | +| `--log-level` | `info` | Log verbosity level | +| `--only-process-changes` | `false` | Only validate files modified vs. the upstream branch | + +## Key files +- `cmd/ci-operator-checkconfig/main.go` -- entry point, flag parsing, produce-map-reduce orchestration, promoted tag deduplication +- `pkg/validation/config.go` -- core configuration validation (build root, images, promotion, resources, releases, operator, base images) +- `pkg/validation/test.go` -- test step validation (names, cron, multi-stage steps, parameters, leases, cluster profiles) +- `pkg/validation/release.go` -- release specification validation (candidate, prerelease, release, integration) +- `pkg/validation/graph.go` -- graph-level validation (duplicate targets, from-image resolution in container and multi-stage tests) +- `pkg/defaults/defaults.go` -- `FromConfigStatic()` converts config to graph representation for graph validation +- `pkg/registry/resolver.go` -- resolves multi-stage test references through chains/workflows +- `pkg/config/options.go` -- shared `Options` struct providing `--config-dir`, `--org`, `--repo`, `--only-process-changes` filtering + +## Deployment +Runs as a presubmit 
check on openshift/release PRs that modify `ci-operator/config/` or the step registry. Also used in local `make validate` targets. diff --git a/cmd/ci-operator-config-mirror/README.md b/cmd/ci-operator-config-mirror/README.md new file mode 100644 index 00000000000..c5e6f38d9e7 --- /dev/null +++ b/cmd/ci-operator-config-mirror/README.md @@ -0,0 +1,50 @@ +# ci-operator-config-mirror + +## What +CLI tool that mirrors ci-operator configuration files from public organizations into a private organization (typically `openshift-priv`). It transforms all image references, namespaces, and promotion targets so that private CI builds use private image streams instead of public ones. + +This is a critical part of the private CI pipeline: repos that build official OCP images have their CI configs duplicated into the private org so that embargoed fixes can be tested with private image streams before public disclosure. + +## How it works -- full flow + +1. Iterate over every ci-operator config file in `--config-dir` using `OperateOnCIOperatorConfigDir()` +2. For each config, apply filtering: + - Skip configs already belonging to the destination org (`--to-org`) + - If `--only-org` is set, skip configs from other orgs (unless whitelisted via `--whitelist-file`) + - Skip repos that don't build any official images (`api.BuildsAnyOfficialImages` with `WithoutOKD`) and aren't whitelisted +3. For each eligible config, apply transformations: + - Set `canonical_go_repository` to the original public `github.com/org/repo` path (so Go imports resolve correctly in private builds) + - **Release tag configuration**: if namespace is `ocp`, change to `ocp-private` and append `-priv` to the name (e.g. 
`4.16` becomes `4.16-priv`) + - **Integration releases**: same `ocp` to `ocp-private` / `-priv` transformation + - **Build root**: if the ImageStreamTag is in `ocp`, rewrite to `ocp-private` with `-priv` suffix + - **Base images / base RPM images**: rewrite any `ocp` namespace references that are valid OCP versions to `ocp-private` with `-priv` + - **Promotion configuration**: for targets in `ocp` namespace, append `-priv` to name/tag, set namespace to `ocp-private`, disable `tag_by_commit`; for targets in other namespaces, set `Disabled: true` to prevent conflicts with the public counterpart. For whitelisted repos that don't build official images, disable all non-official promotion targets. + - **Tests**: strip all periodic and postsubmit tests (only presubmits are needed in private repos) + - Rewrite the org to `--to-org` and rename the repo using `MirroredRepoName()` (flattened orgs keep the original repo name; non-flattened orgs get `{org}-{repo}`) +4. Collect the transformed configs grouped by repo +5. If `--clean` is true (default), delete all existing subdirectories under the destination org's config directory to remove stale configs +6. Write all transformed configs to disk using `DataWithInfo.CommitTo()` + +### Repo naming convention +- **Flattened orgs** (default: `openshift`, `openshift-eng`, `operator-framework`, `redhat-cne`, `openshift-assisted`, `ViaQ`): repo name stays the same (e.g. `openshift/installer` -> `openshift-priv/installer`) +- **Non-flattened orgs**: repo name is prefixed with the org (e.g. 
`stolostron/multicloud-operators-subscription` -> `openshift-priv/stolostron-multicloud-operators-subscription`) +- Additional orgs can be flattened with `--flatten-org` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-dir` | (required) | Directory containing ci-operator config files | +| `--to-org` | (required) | Destination organization name for mirrored configs | +| `--only-org` | `""` | Only mirror configs from this source organization | +| `--flatten-org` | (repeatable) | Additional orgs whose repos should not have org prefix in private | +| `--clean` | `true` | Delete all subdirectories under `--to-org` before generating new configs | +| `--whitelist-file` | `""` | Path to YAML file listing repos to include even if they don't build official images | + +## Key files +- `cmd/ci-operator-config-mirror/main.go` -- all logic is in this single file +- `pkg/privateorg/flatten.go` -- `MirroredRepoName()` naming logic and `DefaultFlattenOrgs` list +- `pkg/config/whitelist.go` -- whitelist file loading and matching +- `pkg/api/promotion.go` -- `BuildsAnyOfficialImages()` used for filtering + +## Deployment +CLI tool. Not a long-running service. Invoked by `auto-config-brancher` as part of the periodic config generation pipeline in `openshift/release`. Runs in the `ci_auto-config-brancher_latest` container image. diff --git a/cmd/ci-operator-configresolver/README.md b/cmd/ci-operator-configresolver/README.md new file mode 100644 index 00000000000..de9ecb1d2ef --- /dev/null +++ b/cmd/ci-operator-configresolver/README.md @@ -0,0 +1,106 @@ +# ci-operator-configresolver + +## What +Long-running HTTP service that loads, resolves, and serves ci-operator configurations and the multi-stage step registry on demand. It is the central config resolution service for OpenShift CI: `ci-operator` calls it at runtime to get its fully-resolved configuration (with all step registry references expanded inline). 
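A minimal sketch of how a client might build such a request, assuming the `org`, `repo`, `branch`, and `variant` query parameters listed for `/config` in the API table of this README. The helper name and the base URL in the usage note are hypothetical, and the sketch is in Python rather than the Go used by `ci-operator`.

```python
from urllib.parse import urlencode


def build_config_url(base_url, org, repo, branch, variant=""):
    """Build a /config request URL for a stored ci-operator config (sketch)."""
    params = {"org": org, "repo": repo, "branch": branch}
    # Assumption: variant is optional and omitted when empty.
    if variant:
        params["variant"] = variant
    return base_url.rstrip("/") + "/config?" + urlencode(params)
```

For example, `build_config_url("http://localhost:8080", "openshift", "installer", "main")` targets a resolver assumed to be listening locally on the default API port.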
+ +It also hosts a web UI (the "Step Registry" browser) on a separate port where users can browse, search, and view documentation for steps, chains, workflows, and job configurations. + +## How it works -- full flow + +### Startup +1. Parse and validate flags +2. Add the OpenShift `imagev1` scheme (needed for ImageStream lookups) +3. Initialize config and registry agents: + - **Config agent** (`agents.NewConfigAgent`): loads all ci-operator YAML configs from `--config` (or `{release-repo-path}/ci-operator/config/`) + - **Registry agent** (`agents.NewRegistryAgent`): loads the multi-stage step registry from `--registry` (or `{release-repo-path}/ci-operator/step-registry/`) +4. If `--release-repo-git-sync-path` is used, both agents share a single `UniversalSymlinkWatcher` that watches for git-sync symlink changes and triggers reload of both configs and registry simultaneously +5. If `--validate-only` is set, exit after loading (used for CI validation of configs) +6. Create a Kubernetes client for ImageStream lookups (in-cluster config) +7. Start two HTTP servers on separate ports + +### API server (port 8080 by default) + +| Endpoint | Method | What it does | +|---|---|---| +| `/config` | GET | Resolve a stored config by metadata query params (`org`, `repo`, `branch`, `variant`). Looks up the config, resolves all registry references, returns fully-resolved JSON. | +| `/resolve` | POST | Resolve a literal (inline) config. Accepts an unresolved `ReleaseBuildConfiguration` JSON in the request body, resolves registry references, returns resolved JSON. | +| `/mergeConfigsWithInjectedTest` | GET | Merge multiple configs (specified via repeated query params) and inject a test from one config into the merged result. Used for cross-repo test injection. | +| `/clusterProfile` | GET | Return details about a cluster profile by `name` query param. | +| `/configGeneration` | GET | Return the current generation counter for configs (increments on reload). 
| +| `/registryGeneration` | GET | Return the current generation counter for registry (increments on reload). | +| `/integratedStream` | GET | Return information about an integrated ImageStream by `namespace` and `name` query params. Responses are cached in memory with 1-minute TTL. Validates against an allowlist of stream patterns (e.g. `ocp/4.12+`, `origin/4.12+`, `ocp-private/4.12+`, `origin/scos-*`, `origin/sriov-*`, `origin/metallb-*`, `origin/ptp-*`). | +| `/readyz` | GET | Readiness probe (always 200). | + +### UI server (port 8082 by default) + +| Path | What it shows | +|---|---| +| `/` | Main page listing all references, chains, and workflows in the registry | +| `/search` | Search across all configs by org/repo/branch/test name | +| `/job` | View a specific job's resolved config with all steps expanded | +| `/reference/{name}` | View a specific step reference with documentation, code, and usage | +| `/chain/{name}` | View a specific chain with its step sequence and documentation | +| `/workflow/{name}` | View a specific workflow with pre/test/post chains and documentation | +| `/ci-operator-reference` | Syntax-highlighted YAML reference for ci-operator configuration | +| `/static/...` | Static assets (CSS, JS) | + +### Config resolution flow (what `/config` does internally) +1. Extract `org`, `repo`, `branch`, `variant` from query parameters +2. Look up the matching config via `configAgent.GetMatchingConfig()` (supports regex matching on branch names) +3. Call `registryAgent.ResolveConfig()` which expands all step registry references: + - Inline the commands from referenced steps + - Expand chains into their constituent steps + - Expand workflows into pre/test/post step sequences + - Resolve cluster profile references +4. Return the fully-resolved config as indented JSON + +### Integrated stream cache +The `/integratedStream` endpoint fetches ImageStream data from the cluster and caches it in memory for 1 minute to avoid excessive API calls. 
The cache key is `{namespace}/{name}`. Concurrent access is protected by a mutex. + +### File watching and hot-reload +- When `--release-repo-git-sync-path` is set, a single `fsnotify` watcher monitors the git-sync symlink. When git-sync updates the symlink (on new commits), both the config agent and registry agent reload simultaneously. +- When `--config` and `--registry` are set separately, each agent watches its own directory independently via `ConfigMap` mount watchers. +- Each reload increments a generation counter accessible via `/configGeneration` and `/registryGeneration`. + +### Metrics +- Prometheus metrics exposed on the default metrics port under `ci-operator-configresolver` prefix +- HTTP request duration and response size tracked per endpoint +- Error rate counters for config resolution failures + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config` | `""` | Path to ci-operator config directory (mutually exclusive with `--release-repo-git-sync-path`) | +| `--registry` | `""` | Path to step registry directory (mutually exclusive with `--release-repo-git-sync-path`) | +| `--release-repo-git-sync-path` | `""` | Path to a git-synced release repo; derives config and registry paths automatically | +| `--log-level` | `info` | Log level (`debug`, `info`, `warn`, `error`) | +| `--port` | `8080` | API server port | +| `--ui-port` | `8082` | UI server port | +| `--address` | `:8080` | DEPRECATED: use `--port` | +| `--ui-address` | `:8082` | DEPRECATED: use `--ui-port` | +| `--gracePeriod` | `10s` | Grace period for server shutdown | +| `--validate-only` | `false` | Load and validate configs/registry, then exit | +| `--flat-registry` | `false` | Disable directory-structure-based registry validation | +| Instrumentation flags | | `--health-port`, metrics port, etc. 
| + +## Key files +- `cmd/ci-operator-configresolver/main.go` -- entry point, server setup, endpoint wiring, integrated stream cache +- `pkg/registry/server/server.go` -- HTTP handler implementations for `/config`, `/resolve`, `/mergeConfigsWithInjectedTest`, `/clusterProfile` +- `pkg/webreg/webreg.go` -- web UI handler (`WebRegHandler`) with routing for `/`, `/search`, `/job`, `/reference`, `/chain`, `/workflow` +- `pkg/load/agents/configAgent.go` -- config loading agent with file watching +- `pkg/load/agents/registryAgent.go` -- registry loading agent with file watching +- `pkg/api/configresolver/` -- `LocalIntegratedStream()` for ImageStream lookups + +## Deployment +Long-lived Deployment on `app.ci`, namespace `ci`. Two ports exposed: +- Port 8080: API (consumed by `ci-operator` at runtime) +- Port 8082: UI (the Step Registry browser, accessible to users) + +Health check: readiness probe hits `/readyz` on the API port. The health endpoint gates on the API server being responsive. + +Uses git-sync sidecar to keep a local copy of `openshift/release` up to date. + +## Related +- `cmd/ci-operator` -- primary consumer of the `/config` and `/resolve` API endpoints +- `cmd/generate-registry-metadata` -- generates metadata consumed by the UI +- The public UI is available at `https://steps.ci.openshift.org/` diff --git a/cmd/ci-operator-prowgen/README.md b/cmd/ci-operator-prowgen/README.md index 9f70cb73ec9..66cbcb8bc98 100644 --- a/cmd/ci-operator-prowgen/README.md +++ b/cmd/ci-operator-prowgen/README.md @@ -1,53 +1,98 @@ -prowgen -======= +# ci-operator-prowgen -Prowgen is a tool that generates [job configurations](https://docs.prow.k8s.io/docs/jobs/) based on -[ci-operator configuration](https://docs.ci.openshift.org/architecture/ci-operator/) and its own -configuration file named `.config.prowgen`. +## What +Generates Prow job YAML definitions (presubmits, postsubmits, periodics) from ci-operator configs and `.config.prowgen` files. 
This is the `make jobs` engine in the openshift/release repo — every Prow job definition is produced by this tool. -The contents of `.config.prowgen` will be appended to every job configuration during Prowgen execution: +## How it works — full flow -**Example:** +### Config loading +- Reads ci-operator configs from `--from-dir` or `--from-release-repo` (resolves to `$GOPATH/.../release/ci-operator/config`) +- For each config, loads `.config.prowgen` from two levels: + - **Org-level**: `{configDir}/{org}/.config.prowgen` + - **Repo-level**: `{configDir}/{org}/{repo}/.config.prowgen` + - Repo merges onto org via `MergeDefaults()`: booleans OR'd, lists concatenated +- Output written to `--to-dir` or `--to-release-repo` (resolves to `$GOPATH/.../release/ci-operator/jobs`) +- Uses `--known-infra-file` to preserve hand-maintained infra job files -```yaml -slack_reporter: -- channel: "#ops-testplatform" - job_states_to_report: - - failure - - error - report_template: ':failed: Job *{{.Spec.Job}}* ended with *{{.Status.State}}*. <{{.Status.URL}}|View logs> {{end}}' - job_names: - - images - # job_name_patterns supports regex patterns for matching job names - job_name_patterns: - - "^unit-.*" - - "^e2e-.*-serial$" - # excluded_job_patterns excludes jobs matching these patterns (similar to excluded_variants) - excluded_job_patterns: - - ".*-skip$" - - "^nightly-.*" -skip_operator_presubmits: -- branch: release-4.19 - variant: periodics -``` +### Job generation (`GenerateJobs()` in pkg/prowgen/prowgen.go) +For each test in config: -Most of the time, Prowgen will overwrite configurations on `openshift/ci-operator/jobs/` with the ones -defined in `openshift/ci-operator/jobs/`. +1. **Periodic detection**: `IsPeriodic()` returns true if ANY of Interval, MinimumInterval, Cron, or ReleaseController is set + - Periodic tests get `GeneratePeriodicForTest()` + - If test also has `Presubmit: true`, a presubmit is generated too +2. 
**Postsubmit**: If `Postsubmit: true`, generates postsubmit with `MaxConcurrency=1` +3. **Presubmit**: Default for everything else -Prowgen is tipically run using `make update` or `make jobs` from within `openshift/release` directoy. +### Image test generation +- Checks `ImageTargets(configSpec)` for `[images]` target +- Creates presubmit with targets `[images]` plus any additional presubmit targets +- Propagates `Images.RunIfChanged`, `Images.SkipIfOnlyChanged`, `Images.PipelineRunIfChanged`, `Images.PipelineSkipIfOnlyChanged` +- If `PromotionConfiguration` exists: creates postsubmit for actual image targets, periodic if `PromotionConfiguration.Cron` set -Testing -------- +### Operator bundle handling +- If `configSpec.Operator` defined: creates presubmit for each bundle build/index -`Prowgen` is hardcoded to use `GOPATH` + `src/github.com/openshift/release`, if you want to -test it on your machine you can run the tool directly from the `openshift/release` repository -root path or use a symbolic link ponting to your `openshift/release` clone: +### Presubmit details +- `AlwaysRun`: true if no run_if_changed, no skip_if_only_changed, not defaultDisable, no pipeline conditions +- Trigger: default regex `(?m)^/test( | .* )(shortName|remaining-required),?($|\s.*)` or explicit `/test variant-name` for disabled defaults +- Branches: exact match + feature branch patterns +- Context: `ci/prow/{shortName}` -```bash -# generally GOPATH=~/go -ln -s ~/cloned-repos/openshift/release ~/go/src/github.com/openshift/release +### Periodic details +- `@daily` cron: deterministic hash from job name (minutes 0-59, hours 22-4 UTC) +- `ReleaseController=true`: overrides to `@yearly`, adds `release-controller=true` label +- Adds `ExtraRefs` with repo/branch info + +### Slack reporter config matching (`GetSlackReporterConfigForJobName()` in pkg/config/load.go) +Matching order for each config entry: +1. Skip if variant in `excluded_variants` +2. 
Skip if full job name matches any `excluded_job_patterns` regex +3. Match if test name in `job_names` (exact match against testName, not full job name) +4. Match if test name matches any `job_name_patterns` regex +5. First match wins + +### Variant handling +- If variant is set: label `ci-operator.openshift.io/variant = variant` added to job base +- `SkipPresubmits()` checks branch+variant against `SkipOperatorPresubmits` list + +### .config.prowgen structure +```yaml +private: false # Hide jobs in Deck +expose: false # Override private to show in Deck +rehearsals: + disable_all: false + disabled_rehearsals: [] # Test names to skip rehearsal +slack_reporter_configs: + - channel: "#channel" + job_states_to_report: [failure, error] + job_names: [test-name] + job_name_patterns: ["regex.*"] + excluded_variants: [variant1] + excluded_job_patterns: ["^periodic-ci-org-repo-branch-"] +skip_operator_presubmits: + - branch: release-4.14 + variant: some-variant +enable_secrets_store_csi_driver: false ``` +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--from-dir` | — | Source ci-operator config directory | +| `--from-release-repo` | false | Use release repo config path | +| `--to-dir` | — | Target Prow job config directory | +| `--to-release-repo` | false | Use release repo jobs path | +| `--registry` | — | Step registry path for workflow/chain resolution | +| `--known-infra-file` | — | Infra filenames to skip (repeatable) | + +## Key files +- `cmd/ci-operator-prowgen/main.go` — entry point, config loading, org->repo merge +- `pkg/prowgen/prowgen.go` — `GenerateJobs()`, presubmit/postsubmit/periodic generation +- `pkg/prowgen/jobbase.go` — `NewProwJobBaseBuilder()`, variant labels, Private/Expose +- `pkg/config/load.go` — `.config.prowgen` loading, `GetSlackReporterConfigForJobName()` + +## Deployment +CLI tool. Run via `make jobs` in openshift/release. Also called by `auto-config-brancher` during automated branch cutting. 
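The Slack reporter matching order described earlier in this README can be sketched as follows. This is a Python illustration, not the Go code in `pkg/config/load.go`: the function name and dictionary layout are hypothetical, and unanchored regex search is an assumption about how the patterns are applied.

```python
import re


def find_slack_reporter_config(entries, variant, test_name, job_name):
    """Return the first matching slack reporter entry, or None (sketch)."""
    for entry in entries:
        # 1. Skip entries that exclude this variant.
        if variant in entry.get("excluded_variants", []):
            continue
        # 2. Skip entries whose exclusion patterns match the full job name.
        if any(re.search(p, job_name) for p in entry.get("excluded_job_patterns", [])):
            continue
        # 3. Exact match on the test name (not the full job name).
        if test_name in entry.get("job_names", []):
            return entry
        # 4. Regex match on the test name.
        if any(re.search(p, test_name) for p in entry.get("job_name_patterns", [])):
            return entry
    # 5. First match wins; no entry matched.
    return None
```

Note the asymmetry the list calls out: exclusion patterns are checked against the full job name, while `job_names` and `job_name_patterns` are checked against the test name.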
Then you can execute `ci-operator-prowgen`: ```bash diff --git a/cmd/ci-operator-yaml-creator/README.md b/cmd/ci-operator-yaml-creator/README.md index b2f6238e3db..963cd3cf9db 100644 --- a/cmd/ci-operator-yaml-creator/README.md +++ b/cmd/ci-operator-yaml-creator/README.md @@ -1,7 +1,62 @@ -# CI-Operator-Yaml Creator +# ci-operator-yaml-creator -A small tool that will create a PullRequest with a `.ci-operator.yaml` file for the `main`/`master` branch of all repositories -that are built by ART, don't have `build_root_image.from_repository: true` and where there is currently no `.ci-operator.yaml` -file matching the `build_root` configured in openshift/release. +## What +Creates or updates `.ci-operator.yaml` files in ART-built repositories to declare their `build_root_image`. This enables reading the build root configuration directly from the component repository rather than from the central ci-operator config in openshift/release. +When the in-repo `.ci-operator.yaml` already matches the central config, the tool also updates the central config to set `build_root.from_repository: true`, completing the migration. + +## How it works -- full flow + +1. **Build ART repo filter**: Load all image configs from the ocp-build-data directory (all versions) and build a set of `org/repo` strings that are ART-built. Only repositories in this set are processed. + +2. **Iterate ci-operator configs**: Walk the `--ci-operator-config-dir` directory using `config.OperateOnCIOperatorConfigDir()`. For each config file: + - Skip if the repo is not ART-built (not in the filter set) + - Skip if `build_root_image` is nil, already `from_repository`, has a variant, or is not on the `master`/`main` branch + - Skip if `build_root_image.image_stream_tag` is nil (other build root types are not handled) + +3. **Check in-repo config**: Fetch the existing `.ci-operator.yaml` from the component repo (on its default branch) via `github.FileGetterFactory`. + +4. 
**Compare**: Build the expected `CIOperatorInrepoConfig` from the central config's `ImageStreamTagReference` and diff against the in-repo version. + +5. **If they match**: The in-repo file is already correct. Update the central ci-operator config file in place: + - Clear `image_stream_tag` from `build_root_image` + - Set `from_repository: true` + - Write the modified config back to the ci-operator config directory + +6. **If they differ**: The in-repo file needs updating. + - Serialize the expected config to YAML + - Clone the repo (up to `--push-ceiling` repos) + - Checkout the default branch + - Write the new `.ci-operator.yaml` to the repo checkout + - Call the PR creation function to push and open a PR with a descriptive body explaining the change, linking to documentation, and noting this is mandatory for OCP components with ART build configs + +### PR body content +The PR explains: +- `.ci-operator.yaml` references the `build_root_image` from openshift/release +- This enables updating the build root in lockstep with code changes +- It is mandatory for all OCP components with an ART build config +- Links to the docs at `docs.ci.openshift.org/architecture/ci-operator/#build-root-image` +- A second auto-generated PR to openshift/release will follow once this one merges + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--ci-operator-config-dir` | (required) | Base path to ci-operator config directory (e.g. `ci-operator/config` in openshift/release) | +| `--ocp-build-data-dir` | `../ocp-build-data` | Path to ocp-build-data repo checkout | +| `--push-ceiling` | `1` | Max number of repos to push updated `.ci-operator.yaml` to; 0 = unlimited | +| `--create-prs` | `false` | Whether to create GitHub PRs after pushing | +| `--max-concurrency` | `4` | Legacy flag, does nothing (tool cannot run concurrently) | +| PR creation flags | -- | `--self-approve`, `--github-token-path`, `--pr-source-mode`, etc. 
via `PRCreationOptions` | + +## Key files + +- `cmd/ci-operator-yaml-creator/main.go` -- all logic: ART filter construction, config iteration, in-repo file comparison, PR creation +- `pkg/api/ocpbuilddata/` -- `LoadImageConfigs` for ART repo discovery +- `pkg/api/types.go` -- `CIOperatorInrepoConfig`, `CIOperatorInrepoConfigFileName` (`.ci-operator.yaml`) +- `pkg/config/` -- `OperateOnCIOperatorConfigDir`, `Info` metadata +- `pkg/github/prcreation/prcreation.go` -- `PRCreationOptions.UpsertPR()` + +## Deployment +Runs as a periodic Prow job. Requires GitHub token for fetching in-repo files and creating PRs, plus read access to ocp-build-data and the ci-operator config directory. If the `.ci-operator.yaml` is already up-to-date, it will set `build_root.from_repository: true` diff --git a/cmd/ci-operator/README.md b/cmd/ci-operator/README.md new file mode 100644 index 00000000000..a738d23a03a --- /dev/null +++ b/cmd/ci-operator/README.md @@ -0,0 +1,132 @@ +# ci-operator + +## What +Core CI execution engine. Every OpenShift CI job runs inside ci-operator. It reads a declarative YAML config, builds a DAG of steps (source clone, image builds, test execution, promotion), creates an ephemeral namespace, executes the graph with maximum parallelism, collects artifacts, and tears down. + +This is the beating heart of OpenShift CI: thousands of jobs per day, all driven by this single binary. + +## How it works — full flow + +### 1. Startup +1. Read config from: `--config` flag > `CONFIG_SPEC` env var (supports base64+gzip) > `CONFIG_SPEC_GCS_URL` > configresolver API +2. Resolve job spec from `JOB_SPEC` environment variable (Prow injects this into every job pod) +3. Set up dual logging: human-friendly console on stdout + JSON file (`ci-operator.log` in artifacts dir) +4. Create `secrets.DynamicCensor` for automatic credential censoring in all output +5. Initialize MetricsAgent with plugins: insights, events, builds, nodes, leases, pods, machines, images + +### 2. 
Namespace lifecycle +Namespace name: `ci-op-{hash}` where hash = SHA256 of all inputs, base32-encoded, 5 bytes. + +Steps: +1. ProjectRequest creation with retry loop (waits for TerminatingPhase to clear if a stale namespace exists) +2. RBAC readiness check: SelfSubjectAccessReview for "create rolebindings" (30 retries, 1s each) +3. Annotations: `ci.openshift.io/idle-cleanup-duration-ttl` (default 1h), `ci.openshift.io/cleanup-duration-ttl` (default 72h) +4. Heartbeat goroutine updates `ci.openshift.io/namespace-last-active` every 10 minutes +5. Wait for ServiceAccount imagePullSecrets (299 retries, 1s each = ~5 min timeout) +6. PR author access: RoleBinding `ci-op-author-access` granting admin to `{author}@github` group +7. Secrets: pull, push, upload, clone auth (SSH or OAuth), external image pull, promotion kubeconfig +8. Pipeline ImageStream creation with local lookup policy +9. PodDisruptionBudget: maxUnavailable=0 for all CI pods + +### 3. Step graph +The core abstraction. Every action is a Step with: +- `Requires()` — dependency StepLinks (image tags, parameters, etc.) +- `Creates()` — output StepLinks +- `Run(ctx)` — execution + +Step types include: source clone, binary build, test-binary build, RPM build, image builds (BuildConfig-based), multi-stage tests (pre/test/post phases), input image tags (external imports), output image tags, index generation, bundle builds. + +`BuildGraph()` creates the full DAG. `BuildPartialGraph()` creates a subset for named targets. `TopologicalSort()` detects cycles and orders execution. + +Execution (`steps.Run()`): launches goroutines per step, DAG-scheduled. When a step completes, its output links are marked satisfied, unblocking children. Context cancellation propagates to all steps. + +### 4. Multi-stage tests +Three phases executed in order: +- **pre**: Setup steps. Short-circuits to post on failure. +- **test**: Main test steps. Short-circuits to post on failure. +- **post**: Cleanup. 
Best-effort by default — failures don't fail the overall test. + +Each step runs in its own pod via entrypoint-wrapper. Key mounts: +- `/var/run/secrets/ci.openshift.io/cluster-profile` — cloud credentials +- `/var/run/secrets/ci.openshift.io/multi-stage` — shared dir snapshot +- `/cli` — oc/kubectl binaries +- `/var/run/configmaps/ci.openshift.io/multi-stage` — step command script + +Supports: observers (concurrent monitoring pods), VPN sidecar injection (from `vpn.yaml` in cluster profile), Google Secret Manager via CSI driver, cluster claims from Hive pools. + +### 5. Lease management (Boskos) +- Owner: `{namespace}-{jobSpecHash}` +- `Acquire()`: blocks up to 120 minutes, retries 60 times +- `Heartbeat()`: every 30 seconds in background goroutine. Persistent failure cancels all dependent steps. +- `ReleaseAll()` on cleanup + +### 6. Promotion +After tests pass, if `--promote` flag set: +- Runs promotion steps concurrently via goroutines +- Each target gets its own promotion step +- Channel-based result collection +- Any promotion failure fails the whole job + +### 7. 
Artifacts +- `ARTIFACT_DIR` env var mounted in test pods +- Post-completion: `tar czf` streams artifacts from pod to local storage +- Searches for `custom-prow-metadata.json` and merges into final `metadata.json` +- Per-container JUnit via `ci-operator.openshift.io/container-sub-tests` annotation + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config` | — | Path to ci-operator config YAML | +| `--unresolved-config` | — | Unresolved config for configresolver | +| `--namespace` | `ci-op-{hash}` | Namespace template | +| `--lease-server` | app.ci boskos URL | Boskos server address | +| `--lease-server-credentials-file` | — | Format: `username:password` | +| `--promote` | false | Enable image promotion after tests | +| `--pod-pending-timeout` | 60m | Max time waiting for pod to start | +| `--delete-when-idle` | 1h | Idle TTL before namespace cleanup | +| `--delete-after` | 72h | Hard TTL before namespace cleanup | +| `--give-pr-author-access-to-namespace` | true | Grant PR author admin in test namespace | +| `--restrict-network-access` | false | Egress firewall to 10.0.0.0/8 | +| `--enable-secrets-store-csi-driver` | false | GSM secret injection via CSI | +| `--registry` | — | Step registry path for local resolution | +| `--node` | — | Restrict pod scheduling to named node | + +## Key env vars +| Variable | What it does | +|---|---| +| `JOB_SPEC` | Prow-injected JSON with org, repo, PR number, base/head SHA, job type | +| `CONFIG_SPEC` | Inline ci-operator config (supports base64+gzip encoding) | +| `CONFIG_SPEC_GCS_URL` | GCS URL to fetch ci-operator config from | +| `UNRESOLVED_CONFIG` | Unresolved config (needs registry resolution) | +| `ARTIFACT_DIR` | Directory for test artifacts, mounted into test pods | + +## Gotchas +- The namespace TTL (`--delete-after`) controls cleanup — if a job is killed, the namespace may linger until TTL expires +- `--pod-pending-timeout` (default 60m) controls how long to wait for pods to schedule before failing 
+- Promotion only happens when `--promote` is passed AND the job succeeds +- Multi-stage test steps are wrapped by `entrypoint-wrapper` +- Namespace hash is deterministic from inputs — rerunning the same job may reuse a namespace if the previous one hasn't been cleaned up yet (waits for TerminatingPhase) + +## Key files +- `cmd/ci-operator/main.go` — entry point, namespace lifecycle, flag parsing (~2600 lines) +- `pkg/api/types.go` — ReleaseBuildConfiguration schema (~2980 lines) +- `pkg/api/graph.go` — Step interface, StepLink, BuildGraph, TopologicalSort +- `pkg/steps/run.go` — concurrent DAG executor +- `pkg/steps/multi_stage/multi_stage.go` — multi-stage test orchestration +- `pkg/steps/pod.go` — PodStep (single container test) +- `pkg/steps/source.go` — source clone step +- `pkg/steps/project_image.go` — BuildConfig-based image builds +- `pkg/steps/lease.go` — LeaseStep wrapper +- `pkg/steps/artifacts.go` — artifact collection and JUnit +- `pkg/lease/client.go` — Boskos lease client +- `pkg/defaults/defaults.go` — graph generation from config (FromConfig) +- `pkg/metrics/` — MetricsAgent and plugins + +## Deployment +Not deployed as a service. Runs as the main process inside every Prow job pod. ci-operator is the binary that Prow invokes. The binary is baked into the test pod image. + +## Related +- `cmd/entrypoint-wrapper` — wraps each multi-stage test step +- `cmd/ci-operator-configresolver` — serves configs to ci-operator at runtime +- ci-docs: `architecture/ci-operator.md` — deep dive on config and execution model +- ci-docs: `architecture/step-registry.md` — multi-stage test architecture diff --git a/cmd/ci-scheduling-webhook/README.md b/cmd/ci-scheduling-webhook/README.md new file mode 100644 index 00000000000..a4dd477b7ed --- /dev/null +++ b/cmd/ci-scheduling-webhook/README.md @@ -0,0 +1,70 @@ +# ci-scheduling-webhook + +## What +Kubernetes mutating admission webhook that controls where OpenShift CI workload pods land. 
It classifies every pod into a workload class (builds, tests, longtests, prowjobs), injects the appropriate nodeSelector, tolerations, and affinity rules so pods are routed to dedicated machinesets, and then actively manages node scale-down by tainting and cordoning underutilized nodes. + +This is the primary cost-efficiency mechanism for the CI build farm: it consolidates pods onto fewer nodes, makes idle nodes candidates for removal, and manages the MachineSet replica count directly rather than relying on the cluster autoscaler. + +## How it works -- full flow + +### Pod classification +When a pod admission request arrives at `/mutate`, the webhook inspects the pod's namespace and labels to assign a `PodClass`: + +| PodClass | Criteria | +|---|---| +| `builds` | Namespace starts with `ci-op-` or `ci-ln-` and pod has label `openshift.io/build.name` | +| `tests` | Namespace starts with `ci-op-` or `ci-ln-` and pod does NOT have the build label | +| `longtests` | A `tests` pod whose name matches long-running patterns (e.g. `release-images-`, `e2e-aws-upgrade`, `rpm-repo`, various `ovn-upgrade` patterns) | +| `prowjobs` | Namespace is `ci` and pod has label `created-by-prow` | +| (none) | Everything else -- the webhook adds safe-to-evict annotations but no scheduling changes | + +Pods requesting special resources (anything beyond cpu/memory/ephemeral-storage) are skipped from classification to avoid conflicts with device plugins. + +### Pod mutations applied +For classified pods, the webhook applies these JSON patches: + +1. **RuntimeClassName**: set to `ci-scheduler-runtime-{podClass}` which carries the tolerations for the corresponding machineset taints +2. **NodeSelector**: `ci-workload={podClass}` to target the correct node pool +3. 
**Node affinity (anti-affinity to scale-down candidates)**: the node with the fewest pods (most likely to scale down next) is precluded via `RequiredDuringSchedulingIgnoredDuringExecution` with a `kubernetes.io/hostname NotIn` expression +4. **Labels**: `ci-workload` and `ci-workload-namespace` labels are added to the pod +5. **DNS wait init container**: a `ci-scheduling-dns-wait` init container is prepended that polls `static.redhat.com` for up to 120 seconds, working around DNS races on newly provisioned nodes +6. **Build pod spot preference**: build pods get `PreferredDuringScheduling` affinity for `spotinst.io/node-lifecycle=spot` nodes for cost savings +7. **High-perf builds**: build pods requesting >= 32Gi memory or >= 13 CPU cores get an additional `ci-instance-type=high-perf` nodeSelector and toleration +8. **Min build CPU**: `docker-build` containers get their CPU request/limit bumped to at least `--min-build-millicores` (default 4000m) +9. **NET_ADMIN injection**: test containers with env `TEST_REQUIRES_BUILDFARM_NET_ADMIN=true` get `NET_ADMIN`, `NET_RAW`, `SETUID`, `SETGID` capabilities, `runAsUser=0`, and CPU request/limit set to 12 cores +10. **safe-to-evict annotation**: all pods in `openshift-*` namespaces, specific non-openshift namespaces (`rh-corp-logging`, `ocp`, `cert-manager`), and catalog pods get `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` + +### Node mutation +The webhook also intercepts Node admission requests. For nodes with a `ci-workload` label, it adds the `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotation to prevent the cluster autoscaler from interfering with the webhook's own scale-down logic. + +### Node scale-down management +A background goroutine runs per pod class on a 1-minute polling interval (`evaluateNodeClassScaleDown`): + +1. **Avoidance ordering**: nodes (at least 15 minutes old) are sorted by ascending active CI pod count, then by age. The bottom ~25% are marked as "avoidance" targets. 
+2. **PreferNoSchedule taint**: avoidance nodes that still have pods get a `ci-workload-avoid` taint with `PreferNoSchedule` effect, encouraging new pods to land elsewhere. +3. **Cordon (NoSchedule)**: avoidance nodes with zero active CI pods are cordoned (`spec.unschedulable=true`). This is the point where no new pods can arrive. +4. **Scale-down trigger**: once a cordoned node has zero active CI workload pods (verified via a live API query, not just the informer cache), a separate goroutine (`evaluateNodeScaleDown`) initiates removal: + - Sets a `ci-workload-evict` NoExecute taint (to trigger DNS pod graceful shutdown -- workaround for OCPBUGS-488) + - Sleeps 40 seconds for DNS pod graceful termination + - Annotates the Machine with `machine.openshift.io/delete-machine: "true"` (and the older annotation key) + - Decrements the owning MachineSet's replica count by 1 + - Waits up to 1 hour for the node to disappear, checking MachineSet reconciliation status + +Scale-down operations are serialized per pod class via `nodeClassScaleDownLock` mutexes. Nodes with the `ci-scheduling.ci.openshift.io/keep-node` annotation/label set to `"true"` are excluded from all avoidance and scale-down logic. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--tls-cert` | (none) | TLS certificate file. If omitted with `--tls-key`, a self-signed test cert is generated | +| `--tls-key` | (none) | TLS private key file. Must be specified with `--tls-cert` or both omitted | +| `--port` | 443 | HTTPS listen port | +| `--as` | (none) | Impersonate this user for k8s API calls (e.g. 
`system:admin`) | +| `--min-build-millicores` | 4000 | Minimum CPU millicores enforced for `docker-build` containers | + +## Key files +- `cmd/ci-scheduling-webhook/main.go` -- entry point, flag parsing, TLS setup, HTTP server on `/mutate` +- `cmd/ci-scheduling-webhook/mutation.go` -- pod and node mutation logic, pod classification, DNS init container injection, NET_ADMIN/high-perf patching +- `cmd/ci-scheduling-webhook/prioritization.go` -- node/pod informers with custom indexes, scale-down evaluation loop, MachineSet interaction, avoidance taint management, spot instance encouragement (disabled) + +## Deployment +Long-lived Deployment on build farm clusters, registered as a `MutatingWebhookConfiguration` for both Pod and Node resources. Uses in-cluster kubeconfig by default (or `KUBECONFIG` env var). Requires RBAC to read/patch nodes, list pods in all namespaces, and read/patch Machine and MachineSet resources in `openshift-machine-api`. diff --git a/cmd/ci-secret-bootstrap/README.md b/cmd/ci-secret-bootstrap/README.md index 40fff157526..731f9fb2ab1 100644 --- a/cmd/ci-secret-bootstrap/README.md +++ b/cmd/ci-secret-bootstrap/README.md @@ -1,8 +1,70 @@ -# CI-Secret-Bootstrap - -This tool extends the [populate-secrets-from-bitwarden.sh](https://github.com/openshift/release/blob/c8c89d08c56c653b91eb8c7580657f7ce522253f/ci-operator/populate-secrets-from-bitwarden.sh) -to support mirroring secrets cross Kubernetes/OpenShift-clusters. - +# ci-secret-bootstrap + +## What +Provisions secrets from Vault and Google Secret Manager (GSM) to Kubernetes clusters across the CI infrastructure. Reads a mapping config that defines which Vault items/fields map to which Kubernetes Secrets on which clusters, then creates or updates them. + +## How it works — full flow + +### 1. 
Configuration +The config file (`--config`) defines: +- **cluster_groups**: named groups of clusters (e.g., `build_farm: [app.ci, build01, build02]`) +- **secrets**: array of source-to-target mappings: + - `from`: map of field names to Vault item+field references + - `to`: array of target cluster/namespace/name/type specifications + - Target can specify `cluster` (direct) or `cluster_groups` (expanded) +- **user_secrets_target_clusters**: clusters receiving self-service user secrets + +### 2. Secret construction from Vault (`constructSecretsFromVault()`) +- Fetches all fields in parallel via goroutines +- Supports `.dockerconfigjson` construction from multiple registry auth fields +- Supports `base64_decode` for pre-encoded values +- User secrets fetched separately via `client.GetUserSecrets()` — self-service secrets with `secretsync/target-clusters` targeting + +### 3. Secret construction from GSM (if `--enable-gsm`) +- Bundles define grouped secrets with components, docker configs, and targets +- Field auto-discovery: lists all fields in a collection/group if not explicitly specified +- `${CLUSTER}` variable substitution for cluster-specific secrets +- Component inheritance for reusability + +### 4. Conflict detection +Prevents the same secret (cluster/namespace/name) from being managed by both Vault and GSM. Vault takes precedence. + +### 5. Writing to clusters (`updateSecrets()`) +- Creates the namespace if it is missing +- For existing secrets: checks whether an update is needed +- Handles immutable field changes (requires `--force`) +- Special case: OSD global pull secret (`openshift-config/pull-secret`) — mutates in-place, only updating specific registry entries +- All secrets get `ci-secret-bootstrap` label for tracking + +### 6. 
Validation modes +- `--validate-only`: load and validate config, check Vault items exist, exit +- `--validate-bitwarden-items-usage`: report unused Vault items (older than 7 days) + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config` | — | Vault bootstrap config file | +| `--vault-addr` | `VAULT_ADDR` env | Vault server address | +| `--vault-token-file` | `VAULT_TOKEN` env | Token file | +| `--vault-prefix` | — | Vault key path prefix | +| `--vault-role` | — | Kubernetes auth role (alternative to token) | +| `--dry-run` | true | Preview mode | +| `--confirm` | true | Actually mutate secrets (requires dry-run=false) | +| `--force` | false | Force update even if different | +| `--validate-only` | false | Exit after validation | +| `--cluster` | — | Only provision to this cluster | +| `--secret-names` | — | Only provision these secrets | +| `--enable-gsm` | false | Enable GSM bundle mechanism | +| `--gsm-config` | — | GSM config file | +| `--gsm-credentials-file` | — | GSM service account credentials | + +## Key files +- `cmd/ci-secret-bootstrap/main.go` — orchestration, Vault/GSM construction, cluster writing +- `pkg/api/secretbootstrap/secretboostrap.go` — Config, SecretConfig, ItemContext types +- `pkg/secrets/flags.go` — Vault client options + +## Deployment +Periodic Prow job. ## Args and config.yaml We use `--kubeconfig` to specify the path to a [kube config](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) diff --git a/cmd/ci-secret-generator/README.md b/cmd/ci-secret-generator/README.md index 75ee43f17f8..7f0330a96e5 100644 --- a/cmd/ci-secret-generator/README.md +++ b/cmd/ci-secret-generator/README.md @@ -1,7 +1,74 @@ -# CI-Secret-Generator +# ci-secret-generator -This tool aims to automate the process of deployment of secrets to Bitwarden while also providing a platform to document the commands used to generate these secrets. 
+## What +Generates secrets by executing shell commands defined in a configuration file, then uploads the results to a secrets backend (Vault) and optionally syncs them to Google Secret Manager (GSM). Includes validation that every generated secret is consumed by either the ci-secret-bootstrap config or the GSM config, preventing orphaned secrets. +## How it works -- full flow + +1. **Load and validate config.** Reads the generator config (`--config`), which defines items with fields. Each item has: + - `ItemName`: the secret collection/item name in the backend. + - `Fields`: list of `{Name, Cmd, Cluster}` -- each field is generated by running a shell command. + - `Params`: parameter sets (must include `cluster`) for cartesian expansion of items. + - `Notes`: optional metadata attached to the item. + Validation ensures every item has a name, every named field has a command, and a `cluster` param exists. + +2. **Load bootstrap config.** If `--validate` is true (default), loads the ci-secret-bootstrap config (`--bootstrap-config`) and optionally the GSM config (`--gsm-config`). + +3. **Query Prow disabled clusters.** Fetches the list of clusters disabled in Prow. Fields targeting disabled clusters are skipped during generation. + +4. **Validate references.** Checks that every `(ItemName, FieldName)` pair from the generator config appears in either: + - The ci-secret-bootstrap config (as an `ItemContext` in any secret's `from` list, including DockerConfigJSON data items), or + - The GSM config bundles (matching by normalized group and field names). + Any item not found in either config produces a fatal validation error. + +5. **Exit early if `--validate-only`.** If set, stops after validation without generating or uploading secrets. + +6. **Generate secrets.** For each item and field: + - Skips fields targeting disabled clusters. + - Constructs a GSM secret name as `__`. + - Executes the field's command via `bash -o errexit -o nounset -o pipefail -c ""`. 
+ - Validates the command output: must have zero stderr, non-empty stdout, and stdout must not be literally `"null"`. + - On command failure, if the field previously existed in the GSM index, preserves its index entry (does not remove it). + +7. **Upload to backend.** For each item: + - Calls `client.SetFieldOnItem(itemName, fieldName, payload)` for each generated field. + - If the item has `Notes`, calls `client.UpdateNotesOnItem(itemName, notes)`. + - Updates the GSM index secret with the list of all successfully generated field names. + +8. **GSM sync (optional).** When `--enable-gsm-sync` is true: + - Wraps the secrets client with a `GSMSyncDecorator` that mirrors writes to GSM. + - Reads the existing GSM index secret to track which fields were previously generated (used for index preservation on failure). + - Requires `--gsm-credentials-file` and `--gsm-project-config`. + +9. **Dry-run mode.** When `--dry-run` is true (default), writes generated secrets to a temp file or `--output-file` instead of uploading to Vault/GSM. 
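The command execution and output checks from step 6 can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `runFieldCommand` is a hypothetical helper name, and the exact comparison against `"null"` (done here on the raw, untrimmed stdout) is an assumption.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// runFieldCommand is a hypothetical helper (not taken from the tool's source).
// It runs a field's command under bash with strict options and applies the
// three output checks described in step 6.
func runFieldCommand(command string) ([]byte, error) {
	cmd := exec.Command("bash", "-o", "errexit", "-o", "nounset", "-o", "pipefail", "-c", command)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	// A non-zero exit status (including a failing pipeline stage) is an error.
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("command failed: %w (stderr: %q)", err, stderr.String())
	}
	// Any stderr output is treated as a failure, even on exit status 0.
	if stderr.Len() != 0 {
		return nil, fmt.Errorf("command wrote to stderr: %q", stderr.String())
	}
	// Empty stdout means no secret was produced.
	if stdout.Len() == 0 {
		return nil, errors.New("command produced no output")
	}
	// Assumption: the check is an exact match on the raw output.
	if stdout.String() == "null" {
		return nil, errors.New(`command output is literally "null"`)
	}
	return stdout.Bytes(), nil
}

func main() {
	out, err := runFieldCommand("printf 'example-value'")
	fmt.Printf("out=%q err=%v\n", out, err)
}
```

Rejecting any stderr output is strict, but it surfaces commands such as a `jq` filter that emits a warning and still exits 0.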
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config` | (required) | Path to the secret generator config file | +| `--bootstrap-config` | `""` | Path to ci-secret-bootstrap config (required if `--validate` is true) | +| `--validate` | `true` | Validate that generated items are consumed by bootstrap/GSM configs | +| `--validate-only` | `false` | Exit after validation without generating secrets | +| `--dry-run` | `true` | Write secrets to file instead of uploading | +| `--output-file` | `""` | Output file path for dry-run mode (temp file if empty) | +| `--log-level` | `info` | Log level | +| `--concurrency` | `1` | Max concurrent goroutines for secret generation | +| `--enable-gsm-sync` | `false` | Enable syncing cluster-init secrets to GSM | +| `--gsm-credentials-file` | `""` | Path to GCP service account credentials (required if GSM sync enabled and not dry-run) | +| `--gsm-project-config` | `""` | Path to GCP project config file | +| `--gsm-config` | `""` | Path to GSM bootstrap config (required if GSM sync enabled and `--validate` is true) | +| Vault/secrets flags | -- | Standard secrets client flags (Vault address, token, etc.) | + +## Key files +- `cmd/ci-secret-generator/main.go` -- all logic: config loading/validation, command execution, secret upload, GSM sync, bootstrap validation +- `pkg/api/secretgenerator/` -- generator config types and loading +- `pkg/api/secretbootstrap/` -- bootstrap config types (used for validation) +- `pkg/secrets/` -- secrets client interface (Vault, dry-run, GSM sync decorator) +- `pkg/gsm-secrets/` -- GSM index secret management +- `pkg/gsm-validation/` -- name normalization for GSM + +## Deployment +Runs as a periodic Prow job. The job has access to Vault credentials and (optionally) GCP service account credentials for GSM sync. 
## Args and config.yaml The tool expects a configuration like the one below which specifies the mapping between the `itemName`+`attributeName`/`attachmentName`/`fieldName` and the command used to generate the secret. diff --git a/cmd/cluster-display/README.md b/cmd/cluster-display/README.md new file mode 100644 index 00000000000..66a10b4ea61 --- /dev/null +++ b/cmd/cluster-display/README.md @@ -0,0 +1,54 @@ +# cluster-display + +## What +HTTP API server that provides a dashboard view of CI clusters and Hive cluster pools. It queries every configured build cluster plus the Hive cluster for version, console URL, image registry host, cloud provider, product type, and HyperShift supported versions. Results are cached in memory and served as JSON, with optional JSONP callback support. + +## How it works -- full flow + +### Startup +1. Loads multi-cluster kubeconfigs via Prow's `KubernetesOptions`. Builds a controller-runtime client for each cluster. +2. Identifies the `hive` context (required, fatal if missing) and the `app.ci` context (falls back to in-cluster config if not explicitly present). +3. Removes `InClusterContext` and `DefaultClusterAlias` from the client map to avoid duplicates. +4. Registers Hive v1, Route v1, and Config v1 schemes. +5. Sets up kubeconfig change detection: if the kubeconfig file changes on disk, the server terminates so the Kubelet can restart it and pick up the new credentials. + +### Caching +In-memory cache with two independent sections: +- **Cluster data:** refreshed every **1 hour**. On refresh, queries all clusters in parallel. +- **Cluster pool data:** refreshed every **1 minute**. On fetch error, stale cached data is served with a logged error. + +### Prow disabled cluster tracking +A `prowClient` wrapper queries Prow for disabled clusters, caching the result for **3 minutes**. On error, requests are rate-limited (10/sec with burst of 5) and the client waits until the limiter allows, to avoid overwhelming Prow. 
The cache update happens asynchronously in a goroutine so concurrent readers are not blocked. + +### Cluster detail collection +For each cluster, the server: +1. Resolves the console Route host via `api.ResolveConsoleHost`. +2. Resolves the image registry host via `api.ResolveImageRegistryHost` (skipped for hive). +3. Reads the `ClusterVersion` resource for the current version (from `status.history[0]`). +4. Reads the `Infrastructure` resource for the cloud platform type. +5. Detects the product: checks for `configure-alertmanager-operator` Service in `openshift-monitoring` -- if present, it is OSD. Otherwise, checks the version string for "okd" (OKD) or defaults to OCP. +6. For the hive cluster specifically, reads the `hypershift/supported-versions` ConfigMap and includes the HyperShift supported versions array. + +### Endpoints + +| Endpoint | Method | Response | +|---|---|---| +| `/api/health` | GET | `{"ok": true}` | +| `/api/v1/clusters` | GET | JSON `{"data": [...]}` with cluster info maps. Accepts `?skipHive=true`. Disabled clusters are appended with `"error": "disabled cluster in Prow"`. | +| `/api/v1/clusterpools` | GET | JSON `{"data": [...]}` with pool info maps (namespace, name, ready, size, maxSize, imageSet, labels, releaseImage, owner, standby). | + +Both data endpoints support `?callback=` for JSONP responses (wraps JSON in `callbackName(json);`). + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log verbosity | +| `--port` | `8090` | HTTP listen port | +| `--gracePeriod` | `10s` | Graceful shutdown period | +| Prow kubernetes flags | -- | Multi-cluster kubeconfig paths (`--kubeconfig`, etc.) | + +## Key files +- `cmd/cluster-display/main.go` -- all logic: server setup, caching, cluster/pool data collection, HTTP handlers, Prow disabled cluster detection + +## Deployment +Runs as a sidecar container inside the `ci-docs` Deployment on app.ci (not a standalone Deployment). 
Served behind a Route to provide a dashboard for the Test Platform team. Requires kubeconfig access to all build clusters and the hive cluster. Port 8090. diff --git a/cmd/cluster-init/README.md b/cmd/cluster-init/README.md index 9985125b1c6..1174b9f91b1 100644 --- a/cmd/cluster-init/README.md +++ b/cmd/cluster-init/README.md @@ -1,6 +1,100 @@ -# Cluster Init -`cluster-init` is a tool for creating and managing build clusters. It generates and updates yaml configurations for the clusters in the `openshift/release` repo, and, if desired, will create a self-merging PR for these configurations. This tool operates in one of two modes: +# cluster-init +## What +CLI tool that manages the full lifecycle of a Test Platform (TP) build cluster: provisioning infrastructure on a cloud provider, installing OCP, and generating/updating all the onboarding configuration needed to integrate the cluster into CI. Uses a cobra command tree with two top-level subcommands: `provision` and `onboard`. + +## How it works -- full flow + +### `provision` subcommand tree + +#### `provision aws create-stacks` +Creates CloudFormation stacks required for a build cluster on AWS. Loads a `cluster-install.yaml` spec, initializes an AWS provider with default config, and executes the `CreateAWSStacksStep`. Requires a properly configured AWS profile (named profile or environment variables). + +#### `provision ocp create ` +Runs the OCP installer in stages. The `` argument selects which stage to execute: +1. `install-config` -- generates `install-config.yaml` from the cluster-install spec +2. `manifests` -- runs `openshift-install create manifests` +3. `cluster` -- runs `openshift-install create cluster` + +Each stage is a `types.Step` that shells out to the installer binary. Commands must be run in sequence (install-config, then manifests, then cluster). + +### `onboard` subcommand tree + +#### `onboard config generate` +Generates all configuration files for a newly provisioned cluster: +1. 
Loads the `cluster-install.yaml` spec and resolves the `--install-base` working directory. +2. Connects to the cluster using the admin kubeconfig from the install directory. +3. Pulls runtime info from the live cluster: `Infrastructure` CR, `install-config` from `kube-system/cluster-config-v1`, CoreOS stream metadata from `openshift-machine-config-operator/coreos-bootimages`. +4. Runs a sequence of onboarding steps (each a `types.Step`): + - ProwJob configuration + - Build cluster directory scaffolding + - OAuth template generation + - ci-secret-bootstrap config update + - ci-secret-generator config update + - Sanitize prowjob config + - Sync rover group + - Prow plugin config + - Dex, certificates, Cloudability agent manifests + - Common symlinks + - Multi-arch builder controller, tuning operator + - Image registry, OpenShift monitoring, passthrough, nested podman manifests + - Cloud credential manifests (if `CredentialsMode == Manual`) + - Cloud-specific steps (AWS: CI scheduling webhook, machine sets via CloudFormation) + - Build cluster step and cert-manager (generate-only, skipped during update) + +#### `onboard config update` +Bulk-updates configuration for multiple existing clusters: +1. Loads all `cluster-install.yaml` files from `--cluster-install-dir` (defaults to `/clusters/`). +2. Loads kubeconfigs for all clusters via standard Prow kubernetes flags. +3. For each cluster with a valid kubeconfig, connects, pulls runtime info, and runs the same config steps as `generate` -- but with `update=true`, which: + - Skips the build-cluster and cert-manager steps. + - Uses cluster-sourced AWS config (reads from cluster objects) instead of hardcoded defaults. +4. Clusters with missing or invalid kubeconfigs are skipped with a warning (non-fatal). + +### Registered API schemes +Route v1, ImageRegistry v1, Image v1, Config v1, Auth v1, CloudCredential v1 -- these are registered at startup so the tool can interact with OpenShift-specific resources. 
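The difference between `onboard config generate` and `onboard config update` described above (same step sequence, with generate-only steps skipped when `update=true`) might look roughly like this. The `step` struct, its fields, and the step names are hypothetical stand-ins for illustration; the real `types.Step` interface is not reproduced here, and aborting on the first failed step is an assumption.

```go
package main

import "fmt"

// step is a hypothetical stand-in for an onboarding step; the real tool
// uses its own types.Step, whose shape is not shown in this README.
type step struct {
	name         string
	generateOnly bool // true for steps that must not run during update
	run          func() error
}

// selectSteps drops generate-only steps when running in update mode.
func selectSteps(all []step, update bool) []step {
	selected := make([]step, 0, len(all))
	for _, s := range all {
		if update && s.generateOnly {
			continue
		}
		selected = append(selected, s)
	}
	return selected
}

// runSteps executes steps in order and stops at the first failure
// (an assumption made for this sketch).
func runSteps(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("step %q failed: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	noop := func() error { return nil }
	all := []step{
		{name: "prowjob-config", run: noop},
		{name: "build-cluster", generateOnly: true, run: noop},
		{name: "cert-manager", generateOnly: true, run: noop},
	}
	for _, update := range []bool{false, true} {
		steps := selectSteps(all, update)
		if err := runSteps(steps); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("update=%v ran %d steps\n", update, len(steps))
	}
}
```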
+ +## Flags + +### Global (persistent) +| Flag | Default | What it controls | +|---|---|---| +| `--cluster-install` | `""` | Path to `cluster-install.yaml` | +| `--install-base` | `""` | Working directory for install artifacts | + +### `onboard config generate` +| Flag | Default | What it controls | +|---|---|---| +| `--release-repo` | (required) | Path to local openshift/release checkout | +| `--release-branch` | `main` | Branch name in release repo | + +### `onboard config update` +| Flag | Default | What it controls | +|---|---|---| +| `--release-repo` | (required) | Path to local openshift/release checkout | +| `--release-branch` | `main` | Branch name in release repo | +| `--cluster-install-dir` | `""` | Directory containing cluster-install files (defaults to `/clusters/`) | +| Prow kubernetes flags | -- | Standard multi-cluster kubeconfig flags | + +## Key files +- `cmd/cluster-init/main.go` -- entry point, scheme registration, root command +- `cmd/cluster-init/cmd/onboard/onboard.go` -- `onboard` subcommand +- `cmd/cluster-init/cmd/onboard/config/config.go` -- `config` subcommand, step orchestration, cloud-specific step registration +- `cmd/cluster-init/cmd/onboard/config/generate.go` -- `generate` subcommand +- `cmd/cluster-init/cmd/onboard/config/update.go` -- `update` subcommand (bulk multi-cluster) +- `cmd/cluster-init/cmd/provision/provision.go` -- `provision` subcommand +- `cmd/cluster-init/cmd/provision/aws.go` -- `provision aws create-stacks` +- `cmd/cluster-init/cmd/provision/ocp.go` -- `provision ocp create` +- `cmd/cluster-init/runtime/runtime.go` -- shared runtime utilities (`BuildCmd`, `RunCmd`) +- `cmd/cluster-init/runtime/aws/` -- AWS config providers (from-cluster vs. 
from-defaults) +- `pkg/clusterinit/clusterinstall/` -- cluster-install spec loading and finalization +- `pkg/clusterinit/onboard/` -- all onboarding step implementations +- `pkg/clusterinit/provision/` -- provisioning step implementations (AWS, OCP) + +## Deployment +- **Periodic job:** `onboard config update` runs as a periodic Prow job for continuous config reconciliation across all managed clusters. +- **Manual invocation:** `provision` and `onboard config generate` are run interactively by engineers when standing up new clusters. +- Integration tests use the `CITOOLS_CLUSTERINIT_INTEGRATIONTEST` environment variable to toggle test-specific behavior. ## Create In order to create a new build cluster the tool can be used like: `cluster-init --release-repo= --cluster-name=`. @@ -13,4 +107,4 @@ If it is desired to only update a single cluster, then `--cluster-name=` -args will also need to be provided. If you would like the PR to be self-merging the `--self-approve=true` argument will also need to be provided. \ No newline at end of file +args will also need to be provided. If you would like the PR to be self-merging the `--self-approve=true` argument will also need to be provided. diff --git a/cmd/clusterimageset-updater/README.md b/cmd/clusterimageset-updater/README.md new file mode 100644 index 00000000000..0b9b6d28d93 --- /dev/null +++ b/cmd/clusterimageset-updater/README.md @@ -0,0 +1,46 @@ +# clusterimageset-updater + +## What +Batch tool that synchronizes Hive `ClusterImageSet` resources with the latest OCP pre-release images. It reads cluster pool YAML specs from a directory, resolves the latest release image matching each pool's version bounds, writes updated `ClusterImageSet` YAML files, and patches the pool specs to reference them. Designed to run as a periodic Prow job that commits changes back to openshift/release. + +## How it works -- full flow + +1. **Ensure labels on pools.** Walks `--pools` directory for `*_clusterpool.yaml` files. 
For each pool that has an `owner` label on the resource metadata, ensures `tp.openshift.io/owner` is propagated into `spec.labels` (so claimed clusters inherit the owner). Writes the file back if modified. + +2. **Collect version bounds.** Re-walks the pools directory. Each pool may have `version_lower` and `version_upper` labels (and optionally `version_stream`) that define which OCP release range the pool targets. Both lower and upper must be set or both absent. Groups pool file paths by their `VersionBounds`. + +3. **Resolve pull specs.** For each unique `VersionBounds`: + - Determines architecture: `multi` for version_lower >= 4.12, `amd64` for older versions (multi payload not available before 4.12). Returns an error if version_lower cannot be parsed as `major.minor`, so misconfigured pools fail fast. + - Queries the OCP release controller HTTP API (`prerelease.ResolvePullSpec`) for the latest release image matching the version range. + +4. **Merge colliding bounds.** If multiple pools with different `version_stream` values resolve to the same pull spec with the same lower/upper bounds, they are merged into a single canonical bounds entry (keeping the lexicographically greatest stream) to avoid duplicate ClusterImageSets. + +5. **Identify outdated ClusterImageSets.** Walks `--imagesets` directory for existing `*_clusterimageset.yaml` files. Compares each against the newly resolved pull specs using its `version_lower`/`version_upper` annotations. Marks stale ones for deletion. + +6. **Write new ClusterImageSets.** For each resolved bounds (in sorted order), creates a `ClusterImageSet` YAML with: + - Name derived from the pull spec (e.g., `ocp-release-4.14.1-multi-for-4.14.0-0-to-4.15.0-0`) + - Annotations recording `version_lower`, `version_upper`, and optionally `version_stream` for future reconciliation + - `spec.releaseImage` set to the resolved pull spec + +7. **Delete stale files.** Removes the ClusterImageSet YAML files identified as outdated in step 5. 
+ +8. **Update pool specs.** For each pool file, updates `spec.imageSetRef.name` to point at the newly written ClusterImageSet name. + +### Naming convention +ClusterImageSet names follow the pattern: `ocp-release--for--to-`, with colons replaced by `-`, `@` replaced by `-`, and underscores replaced by `-` to produce valid Kubernetes object names. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--pools` | (required) | Directory containing cluster pool specs (`*_clusterpool.yaml`) | +| `--imagesets` | (required) | Directory containing ClusterImageSet definitions (`*_clusterimageset.yaml`) | + +## Key files +- `cmd/clusterimageset-updater/main.go` -- all logic: pool parsing, release resolution, architecture selection, ClusterImageSet generation, pool patching, colliding bounds merging + +## Deployment +Runs as a Prow periodic job. Reads from and writes to the openshift/release repository, then the changes are committed and PR'd automatically. + +## Gotchas +- Only files ending in `_clusterpool.yaml` (with underscore) are processed -- files like `_cluster-pool.yaml` (with hyphen) are silently skipped. +- Both `version_lower` and `version_upper` must be set together or not at all; setting only one causes a fatal error. diff --git a/cmd/config-brancher/README.md b/cmd/config-brancher/README.md index 698f151bc02..3ddb67c4ecc 100644 --- a/cmd/config-brancher/README.md +++ b/cmd/config-brancher/README.md @@ -1,5 +1,57 @@ # config-brancher +## What +Creates release branch ci-operator configs from dev branch configs. When OCP cuts a new release, this tool copies configs to the new branch, bumps version references in images and promotions, and manages which branch promotes to which release. 
+ +## How it works — full flow + +### Branch identification +- Dev branches are those promoting to `--current-release` but NOT named `openshift-{currentRelease}` (checked via `IsBumpable()`) +- Branch mapping via `DetermineReleaseBranch()`: + - `master` -> `release-{futureRelease}` + - `main` -> `release-{futureRelease}` + - `openshift-{current}` -> `openshift-{future}` + - `release-{current}` -> `release-{future}` + +### Two modes + +**Mirror mode** (no `--bump-release`): +- For each `--future-release`, copies dev config to the future release branch +- Updates version references via `updateRelease()`, `updateImages()`, `updatePromotion()` +- Dev branch unchanged +- Future branches get enabled promotion; current dev gets promotion disabled for that release + +**Bump mode** (`--bump-release` set): +- Additionally bumps the dev branch to the `--bump-release` version +- Promotion targets in dev branch updated to new version +- Used during code freeze to move dev to the next release + +### Config transformation details + +**`updateRelease()`**: For each PromotionConfiguration target with Name containing currentRelease, replaces with futureRelease. Same for ReleaseTagConfiguration and Releases map (Integration.Name, Candidate.Version). + +**`updateImages()`**: For BaseImages, BaseRPMImages, BuildRootImage — if image references an official image and Name contains currentRelease, replaces suffix with futureRelease. + +**`updatePromotion()`**: Filters targets to those containing devRelease. Sets `Disabled = (futureRelease == devRelease)` — ensures only one branch promotes per release version. + +**`removePeriodics()`**: Removes tests where `IsPeriodic() && !Portable`. Portable tests survive across release branches. 
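The branch mapping listed under "Branch identification" can be sketched as a small lookup. This is a minimal illustration only: `determineReleaseBranch` below is a hypothetical stand-in written for this README, not the real `DetermineReleaseBranch()` from `pkg/promotion`, and its signature and error behavior are assumptions.

```go
package main

import "fmt"

// determineReleaseBranch is a hypothetical re-statement of the mapping table
// above. It is NOT the real promotion.DetermineReleaseBranch; the signature
// and the error for unknown branches are assumptions made for illustration.
func determineReleaseBranch(currentRelease, futureRelease, devBranch string) (string, error) {
	switch devBranch {
	case "master", "main":
		// Both default branch names map to the release-* branch.
		return "release-" + futureRelease, nil
	case "openshift-" + currentRelease:
		return "openshift-" + futureRelease, nil
	case "release-" + currentRelease:
		return "release-" + futureRelease, nil
	default:
		return "", fmt.Errorf("branch %q is not a known dev branch for release %s", devBranch, currentRelease)
	}
}

func main() {
	for _, branch := range []string{"master", "main", "openshift-4.15", "release-4.15", "feature-x"} {
		target, err := determineReleaseBranch("4.15", "4.16", branch)
		if err != nil {
			fmt.Println("skipped:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", branch, target)
	}
}
```

The release numbers `4.15` and `4.16` are placeholders; in the tool they come from `--current-release` and `--future-release`.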
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | — | Current release determining dev branch | +| `--future-release` | — | Future release versions (repeatable, required) | +| `--bump-release` | — | Release to bump dev branch to (must be in future-release) | +| `--skip-periodics` | false | Don't duplicate periodics for current and future releases | +| `--config-dir` | — | CI operator config directory | +| `--confirm` | false | Write changes to disk (dry-run if false) | + +## Key files +- `cmd/config-brancher/main.go` — `generateBranchedConfigs()`, update functions +- `pkg/promotion/promotion.go` — `IsBumpable()`, `DetermineReleaseBranch()`, FutureOptions + +## Deployment +CLI tool. Called by `auto-config-brancher` as part of the automated branch cutting sequence. This tool is intended to make the process of branching and duplicating configuration for the CI Operator easy across many repositories. diff --git a/cmd/config-change-trigger/README.md b/cmd/config-change-trigger/README.md new file mode 100644 index 00000000000..b2207ef9eca --- /dev/null +++ b/cmd/config-change-trigger/README.md @@ -0,0 +1,55 @@ +# config-change-trigger + +## What +Prow job that detects ci-operator configuration changes in an openshift/release pull request and triggers the affected image-building postsubmit jobs. When a PR modifies ci-operator configs that affect image builds, this tool ensures the corresponding postsubmit jobs run immediately so that updated images are available without waiting for the next natural trigger (e.g., a merge to the affected repo). + +This is the postsubmit counterpart to `pj-rehearse` (which handles presubmit rehearsals). While pj-rehearse validates that config changes do not break jobs, config-change-trigger ensures the side effects of those changes (new images) are realized promptly. + +## How it works -- full flow + +1. 
**Read job context**: Resolve the Prow `JOB_SPEC` environment variable to determine the PR under test (org, repo, base SHA, PR refs). + +2. **Load configurations**: Load two complete snapshots of all CI configuration from the openshift/release checkout: + - **PR version** (`prConfig`): the current working copy at `--candidate-path` (includes the PR's changes) + - **Base version** (`masterConfig`): the configuration as it existed at the base SHA (`baseSHA^1`, i.e., the merge base) + +3. **Diff ci-operator configs**: Use `diffs.GetChangedCiopConfigs()` to compare the base and PR versions of all ci-operator configuration files. This returns a set of changed configs keyed by org/repo/branch metadata. + +4. **Find affected postsubmits**: Call `diffs.GetImagesPostsubmitsForCiopConfigs()` to find all postsubmit jobs in the PR's Prow config that build images and are associated with the changed ci-operator configs. These are the jobs whose output images would be affected by the config changes. + +5. **Resolve current SHAs**: For each affected postsubmit, call the GitHub API (`GetRef`) to get the current HEAD SHA of the target branch (e.g., `heads/main` for openshift/some-repo). This ensures the triggered postsubmit runs against the latest code. + +6. **Create ProwJobs**: For each affected postsubmit (up to `--limit`): + - Build a `Refs` struct with the org, repo, branch, and current HEAD SHA + - Create a ProwJob using `pjutil.NewProwJob(pjutil.PostsubmitSpec(...))` with the postsubmit's labels and annotations + - Set the namespace to the Prow job namespace from the PR's config + - Submit the ProwJob to the Kubernetes API + +7. **Truncate if needed**: If more postsubmits are affected than the limit, only trigger the first N and log that truncation occurred. 
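Steps 6 and 7 above amount to capping the list of affected postsubmits at `--limit`. A minimal sketch of that capping, assuming the jobs are already in the order they should be triggered; `selectJobs` and the job names are hypothetical and do not appear in `cmd/config-change-trigger/main.go`:

```go
package main

import "fmt"

// selectJobs returns at most limit entries from affected, preserving order,
// and reports whether anything was dropped. Hypothetical helper used only to
// illustrate the --limit behavior described above.
func selectJobs(affected []string, limit int) (selected []string, truncated bool) {
	if limit < 0 {
		limit = 0
	}
	if len(affected) <= limit {
		return affected, false
	}
	return affected[:limit], true
}

func main() {
	// Placeholder job names, not real postsubmits.
	affected := []string{"job-a-images", "job-b-images", "job-c-images"}

	selected, truncated := selectJobs(affected, 2)
	for _, name := range selected {
		fmt.Println("would trigger:", name)
	}
	if truncated {
		fmt.Printf("truncated: %d affected jobs, limit is 2\n", len(affected))
	}
}
```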
+ +### Error handling +- If the PR config cannot be loaded, it logs a warning but continues (the tool still needs the base config) +- If the base config cannot be loaded, it fatals (both versions are required for diffing) +- Individual job creation failures are collected and reported as a fatal aggregate error at the end + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | When true, use a fake Kubernetes client (prints but does not create ProwJobs) | +| `--limit` | `30` | Maximum number of postsubmit jobs to trigger | +| `--candidate-path` | (required) | Path to the openshift/release working copy with the PR's changes | + +Standard Prow GitHub flags (`--github-token-path`, etc.) are also supported. Anonymous GitHub access is allowed. + +## Key files +- `cmd/config-change-trigger/main.go` -- entire implementation: JOB_SPEC parsing, config loading, diffing, SHA resolution, ProwJob creation + +## Deployment +Runs as a Prow job (not a long-lived service). Reads the `JOB_SPEC` environment variable set by Prow. Typically configured as a postsubmit on openshift/release that runs after config PRs merge. Requires a checkout of openshift/release at `--candidate-path`. + +When `--dry-run=false`, requires in-cluster kubeconfig for ProwJob creation. 
+ +## Related +- `cmd/pj-rehearse` -- similar concept but for presubmit rehearsals of config changes +- `pkg/diffs/diffs.go` -- shared diff detection logic (`GetChangedCiopConfigs`, `GetImagesPostsubmitsForCiopConfigs`) +- `pkg/config/load.go` -- config loading (`GetAllConfigs`, `GetAllConfigsFromSHA`) diff --git a/cmd/config-shard-validator/README.md b/cmd/config-shard-validator/README.md new file mode 100644 index 00000000000..40abf349c99 --- /dev/null +++ b/cmd/config-shard-validator/README.md @@ -0,0 +1,61 @@ +# config-shard-validator + +## What +Validates that ci-operator configuration files and Prow job configuration files are correctly sharded into ConfigMaps according to Prow's `config_updater` plugin rules. This ensures that every config file is automatically synced to the right ConfigMap in the cluster and that Prow job specs reference the correct ConfigMap shard for their `CONFIG_SPEC` environment variable. + +This is a safety net: if a config file does not match any `config_updater` glob, or matches more than one, the ConfigMap sync will silently miss it or create conflicts. + +## How it works -- full flow + +### Startup +1. Parse flags: `--release-repo-dir` (root of openshift/release checkout), plus `--org`/`--repo`/`--log-level` for filtering. +2. Derive paths: ci-operator configs at `{release-repo-dir}/ci-operator/config`, Prow jobs at `{release-repo-dir}/ci-operator/jobs`, plugin config at `{release-repo-dir}/core-services/prow/02_config/_plugins.yaml`. +3. Load the Prow plugin config to get the `config_updater.maps` section, which defines glob-to-ConfigMap mappings. + +### Phase 1: Collect paths and config info +4. 
Walk all ci-operator config files via `OperateOnCIOperatorConfigDir()`: + - For each config, record the relative path (relative to release repo root) and the expected ConfigMap name (computed from org/repo metadata via `info.ConfigMapName()`) + - Build a lookup map of config basenames to their `Info` for cross-referencing in Phase 2 + +5. Walk all Prow job config files via `OperateOnJobConfigDir()`: + - For each job config file, record the relative path and expected ConfigMap name + - For every presubmit, postsubmit, and periodic job in every file, inspect the `PodSpec` + +### Phase 2: Validate PodSpec CONFIG_SPEC references +6. For each job's PodSpec containers, check for `CONFIG_SPEC` environment variables that reference a ConfigMap via `ValueFrom.ConfigMapKeyRef`: + - Look up the referenced key in the config info map to find which config file it points to + - If the key does not correspond to any known ci-operator config file, report an error + - If the ConfigMap name in the reference does not match the expected ConfigMap shard for that config, report an error (e.g., a job referencing `ci-operator-misc-configs` when the config should be in `ci-operator-4.15-configs`) + +### Phase 3: Validate glob coverage +7. Compile all `config_updater.maps` glob patterns using `zglob`. +8. Perform additional validations on the `config_updater.maps` themselves: + - **No `default` cluster alias**: Glob entries must use explicit cluster names, not the `default` alias + - **GZIP required for job configs**: Any glob matching `ci-operator/jobs` must have `gzip: true` set +9. For each collected path (both ci-operator configs and job configs), check against all compiled globs: + - **Zero matches**: The file does not belong to any auto-updating ConfigMap -- it will not be synced. Error. + - **Exactly one match**: Verify the matched glob's ConfigMap name equals the expected ConfigMap name. If not, the file would land in the wrong ConfigMap. Error. 
+ - **Multiple matches**: The file matches globs from more than one ConfigMap -- ambiguous sync target. Error for each match. + +All glob checking runs concurrently using a producer-consumer pattern. + +### Exit +If any validation failures were found, the tool exits with a fatal error. Exit code 0 means all config files are correctly mapped to exactly one ConfigMap and all job `CONFIG_SPEC` references point to the right shard. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--release-repo-dir` | (required) | Path to the root of the openshift/release repository checkout | +| `--org` | `""` | Limit validation to configs in this org | +| `--repo` | `""` | Limit validation to configs in this repo | +| `--log-level` | `info` | Log verbosity level | +| `--only-process-changes` | `false` | Only validate files modified vs. the upstream branch | + +## Key files +- `cmd/config-shard-validator/main.go` -- entry point, all three validation phases, glob compilation and matching +- `pkg/config/options.go` -- shared `Options` for config directory walking with org/repo filtering +- `pkg/config/load.go` -- `Info.ConfigMapName()` computes expected ConfigMap shard name from config metadata +- `pkg/jobconfig/files.go` -- `Info.ConfigMapName()` computes expected ConfigMap shard name for job config files + +## Deployment +Runs as a presubmit check on openshift/release PRs. Prevents merging config changes that would break the ConfigMap auto-sync mechanism. diff --git a/cmd/cvp-trigger/README.md b/cmd/cvp-trigger/README.md index d0a4348fa27..e24e01c79c8 100644 --- a/cmd/cvp-trigger/README.md +++ b/cmd/cvp-trigger/README.md @@ -1,8 +1,70 @@ -# CVP Trigger +# cvp-trigger -CVP Trigger tool will be used by the CVP pipeline to parametrize and trigger -the verification jobs for optional operator artifacts built internally in RH. +## What +CLI tool that triggers Container Verification Pipeline (CVP) operator verification jobs in Prow. 
Given a bundle image, index image, OCP version, and other operator metadata, it finds the corresponding periodic ProwJob definition, injects the operator-specific parameters, submits the job to the cluster, and watches it until completion. Outputs a JSON result file with the job status and artifacts URL. +CVP is the system that validates ISV (Independent Software Vendor) operators for Red Hat certification. + +## How it works -- full flow + +1. **Parse and validate options**: All required parameters are validated -- bundle image ref, channel, index image ref, operator package name, OCP version (must be `X.Y` where X >= 4), job name, and Prow config paths. The output path directory must exist (unless `--dry-run`). + +2. **Load Prow config**: Use the Prow config agent to load job definitions from the specified config and job-config paths. + +3. **Find the periodic job**: Search all periodic jobs in the loaded Prow config for one matching `--job-name`. Fatal if not found. + +4. **Construct a ProwJob**: Convert the periodic job spec into a ProwJob resource using `pjutil.NewProwJob()`. + +5. **Inject parameters**: + - **Multi-stage params** (`--multi-stage-param=KEY=VALUE`): `OO_CHANNEL`, `OO_PACKAGE`, and optionally `OO_INSTALL_NAMESPACE`, `OO_TARGET_NAMESPACES`, `CUSTOM_SCORECARD_TESTCASE` + - **Dependency overrides** (`--dependency-override-param=KEY=VALUE`): `BUNDLE_IMAGE`, `OO_INDEX`, `INDEX_IMAGE` + - **Environment variables**: `CLUSTER_TYPE=aws`, `OCP_VERSION`, and optionally `RELEASE_IMAGE_LATEST` + - **Input hash**: Computed from sorted env vars, appended as `--input-hash=...` for deduplication + +6. **Dry-run mode**: If `--dry-run`, marshal the ProwJob to YAML, print it, and exit. + +7. 
**Submit and watch**: + - Create a ProwJob client from in-cluster config + - Submit the ProwJob via `Create()` + - Watch the ProwJob using a field selector on its name, with exponential backoff for watch creation (10 steps, factor 2, starting at 1s) + - On terminal state (`Success`, `Failure`, `Aborted`, `Error`): + - Compute the GCS artifacts URL using `gcsupload.PathsForJob()` and `Spyglass.GCSBrowserPrefix` + - Write a JSON result to `--output-path`: + ```json + { + "status": "", + "prowjob_artifacts_url": "", + "prowjob_url": "" + } + ``` + - Exit 0 on success, fatal on failure + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--bundle-image-ref` | (required) | URL for the operator bundle image | +| `--channel` | (required) | Operator channel to test | +| `--index-image-ref` | (required) | URL for the operator index image | +| `--operator-package-name` | (required) | Operator package name | +| `--ocp-version` | (required) | OCP version in X.Y format (X >= 4) | +| `--job-name` | (required) | Name of the periodic ProwJob to trigger | +| `--prow-config-path` | (required) | Path to Prow config YAML | +| `--job-config-path` | (required) | Path to Prow job config directory | +| `--output-path` | (required unless dry-run) | File path to write the JSON result | +| `--release-image-ref` | `""` | Pull spec of a specific release payload for OCP deployment | +| `--install-namespace` | `""` | Namespace for operator/catalog installation | +| `--target-namespaces` | `""` | Comma-separated list of target namespaces for the operator | +| `--custom-scorecard-testcase` | `""` | Custom scorecard test case name | +| `--enable-hybrid-overlay` | `false` | Enable hybrid overlay feature on the test cluster | +| `--dry-run` | `false` | Print ProwJob YAML without submitting | + +## Key files + +- `cmd/cvp-trigger/main.go` -- all logic: option parsing, ProwJob construction, parameter injection, submission, watch loop, result output + +## Deployment +CLI tool. 
Not deployed as a service. Invoked by external systems (e.g. CVP pipeline) that need to trigger operator verification jobs in Prow. Requires in-cluster access to the Prow API server for ProwJob creation. ## High-level CVP ↔ Prow Job Architecture To test the optional operator images built internally in Red Hat, CVP triggers diff --git a/cmd/determinize-ci-operator/README.md b/cmd/determinize-ci-operator/README.md new file mode 100644 index 00000000000..14413e565eb --- /dev/null +++ b/cmd/determinize-ci-operator/README.md @@ -0,0 +1,47 @@ +# determinize-ci-operator + +## What +Normalizes ci-operator configuration YAML files to enforce consistent, deterministic formatting. This is a pure formatting tool -- it does not change semantics. Every config file is read, unmarshalled into Go structs, and re-serialized back to YAML, which eliminates formatting inconsistencies like field ordering, whitespace, and quoting differences. + +The tool also ensures that each config file's `zz_generated_metadata` field matches the metadata derived from its filesystem path (org/repo/branch/variant), treating the filepath as the source of truth. + +## How it works -- full flow + +1. Parse flags via `ConfirmableOptions`: `--config-dir`, `--confirm`, `--org`, `--repo`, `--log-level`, `--only-process-changes` +2. Walk all ci-operator config files in `--config-dir` using `OperateOnCIOperatorConfigDir()`: + - Each `.yaml` file under the config directory is loaded and unmarshalled into a `ReleaseBuildConfiguration` struct + - Files are filtered by `--org`/`--repo` if specified + - If `--only-process-changes` is set, only files modified relative to the upstream branch are processed +3. For each config file: + - If `--confirm` is **not** set: log "Would re-format file" and skip (dry-run mode) + - If `--confirm` is set: overwrite the config's `Metadata` field with the metadata extracted from the filepath, then collect for writing +4. 
After walking all configs, batch-write all collected configs back to disk via `CommitTo()`: + - Marshal the `ReleaseBuildConfiguration` to YAML using `sigs.k8s.io/yaml` (which produces deterministic output) + - Write to the canonical filepath derived from the config's metadata: `{config-dir}/{org}/{repo}/{org}-{repo}-{branch}[__variant].yaml` + +### What "determinize" means in practice +- Field ordering is fixed by Go struct tag ordering +- YAML formatting (indentation, quoting, flow vs block style) is standardized by the marshaller +- The `zz_generated_metadata` block is always regenerated from the filepath, fixing any drift +- Empty/nil fields are omitted consistently via `omitempty` tags + +### Dry-run vs. confirm +Without `--confirm`, the tool only logs what it would do without writing any files. This is useful for CI checks that verify configs are already determinized (if the tool would change anything, the check fails). + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-dir` | (required) | Path to the ci-operator configuration directory | +| `--confirm` | `false` | Actually write reformatted files; without this, dry-run only | +| `--org` | `""` | Limit to configs in this GitHub org | +| `--repo` | `""` | Limit to configs in this GitHub repo | +| `--log-level` | `info` | Log verbosity level | +| `--only-process-changes` | `false` | Only process files modified vs. the upstream branch | + +## Key files +- `cmd/determinize-ci-operator/main.go` -- entry point, walks configs and re-serializes them +- `pkg/config/options.go` -- `ConfirmableOptions` struct with `--confirm` flag, `OperateOnCIOperatorConfigDir()` for filtered walking +- `pkg/config/load.go` -- `DataWithInfo.CommitTo()` serializes and writes a config to its canonical path + +## Deployment +CLI tool. Run as part of the config generation pipeline in openshift/release (typically via `make jobs` or `auto-config-brancher`). 
Also used in presubmit checks to verify configs are already in determinized form. diff --git a/cmd/determinize-peribolos/README.md b/cmd/determinize-peribolos/README.md new file mode 100644 index 00000000000..34ffd96f0a3 --- /dev/null +++ b/cmd/determinize-peribolos/README.md @@ -0,0 +1,34 @@ +# determinize-peribolos + +## What +Normalizes Peribolos GitHub organization configuration YAML files to enforce deterministic formatting. Peribolos is the tool that manages GitHub org membership, team structure, and repository settings declaratively. This determinizer ensures the config YAML has consistent field ordering, indentation, and formatting by doing a read-unmarshal-marshal-write roundtrip. + +## How it works -- full flow + +1. Parse the `--config-path` flag (required), which points to the Peribolos org config file (e.g., `config.yaml` for a GitHub organization). +2. Read the file from disk. The reader supports **gzip-compressed files** transparently via `gzip.ReadFileMaybeGZIP()` -- if the file has a gzip header, it is decompressed automatically; otherwise it is read as plain text. +3. Unmarshal the raw bytes into a Prow `org.FullConfig` struct. This struct represents the complete Peribolos configuration for an organization, including: + - Organization metadata (name, description, billing email, company, etc.) + - Member lists (members, admins) + - Team hierarchy (team names, members, maintainers, privacy, child teams) + - Repository settings (if present) +4. Marshal the struct back to YAML using `sigs.k8s.io/yaml`, which produces deterministic output with consistent field ordering based on Go struct tags. +5. Write the normalized YAML back to the same file path with `0666` permissions. + +### What changes in practice +- Field ordering is standardized (follows Go struct field declaration order) +- YAML formatting (indentation, quoting style, flow vs. 
block) is made consistent +- Empty or default-value fields may be added or removed based on `omitempty` tags +- No semantic changes to the org configuration + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-path` | (required) | Path to the Peribolos org `config.yaml` file (supports gzip) | + +## Key files +- `cmd/determinize-peribolos/main.go` -- entire tool implementation in a single file +- `pkg/util/gzip/gzip.go` -- `ReadFileMaybeGZIP()` for transparent gzip support + +## Deployment +CLI tool. Run as part of org configuration management workflows. Ensures that changes to Peribolos configs produce minimal, meaningful diffs by keeping formatting deterministic. diff --git a/cmd/determinize-prow-config/README.md b/cmd/determinize-prow-config/README.md new file mode 100644 index 00000000000..ac006c1c45e --- /dev/null +++ b/cmd/determinize-prow-config/README.md @@ -0,0 +1,64 @@ +# determinize-prow-config + +## What +Normalizes Prow configuration and plugin configuration YAML files to enforce deterministic formatting. Optionally shards monolithic Prow config into per-org and per-org/repo files for scalability, extracting org- and repo-specific settings from the main config into a directory tree. + +This tool handles two separate config files: the Prow config (`_config.yaml`) and the plugin config (`_plugins.yaml`). + +## How it works -- full flow + +### Prow config normalization +1. Load `{prow-config-dir}/_config.yaml` in **strict mode** via Prow's `LoadStrict()`. If `--sharded-prow-config-base-dir` is set, also load supplemental `_prowconfig.yaml` files from that directory tree. +2. If `--sharded-prow-config-base-dir` is set, shard the Prow config: + - **Branch protection**: Extract per-org and per-org/repo branch protection rules into separate files. Org-level policies are written to `{org}/_prowconfig.yaml`, repo-level to `{org}/{repo}/_prowconfig.yaml`. The rules are removed from the main config. 
+ - **Tide merge types**: Extract per-org and per-org/repo merge method configurations into the shard files. + - **Tide queries**: Split queries by org and repo scope. Each org-scoped query goes to `{org}/_prowconfig.yaml`, each repo-scoped query to `{org}/{repo}/_prowconfig.yaml`. Query copies are deep-copied to avoid mutation. + - **Slack reporter configs**: Extract per-org/repo Slack reporter configurations (the global `*` config stays in main). +3. Marshal the (now stripped) main config to YAML and write it back to `{prow-config-dir}/_config.yaml`. + +### Plugin config normalization +1. Load `{prow-config-dir}/_plugins.yaml` using the Prow plugin agent, with supplemental `_pluginconfig.yaml` files from the config directory. +2. If `--sharded-plugin-config-base-dir` is set, shard the plugin config: + - **Plugins**: Per-org/repo plugin enablement goes to `{org/repo}/_pluginconfig.yaml` + - **Bugzilla**: Org-level defaults go to `{org}/_pluginconfig.yaml`, repo-level to `{org}/{repo}/_pluginconfig.yaml` + - **Approve**: Per-repo approve configs are extracted + - **LGTM**: Per-repo LGTM configs are extracted + - **Triggers**: Per-repo trigger configs are extracted + - **Welcome**: Per-repo welcome message configs are extracted + - **External plugins**: Per-org/repo external plugin registrations are extracted + - **Restricted labels**: Per-org/repo restricted label configs are extracted (global `*` stays in main) +3. Marshal the (now stripped) main plugin config to YAML and write it back. + +### What "determinize" means +Without sharding flags, this is a read-parse-serialize roundtrip that standardizes YAML formatting (field order, indentation, quoting). With sharding flags, it additionally redistributes config into a normalized directory structure. 
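The read-parse-serialize roundtrip can be illustrated with a reduced example. To stay within the standard library, this sketch swaps in `encoding/json` for the YAML marshaller the tool uses, and `tideQuery` is a hypothetical, heavily reduced type rather than the real Prow config struct. The point is only that two inputs with different key order, spacing, and empty fields serialize to identical bytes after the roundtrip.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tideQuery is a hypothetical, reduced config type for illustration only.
// Field order in the output follows the struct declaration order, and
// omitempty drops empty slices.
type tideQuery struct {
	Orgs   []string `json:"orgs,omitempty"`
	Repos  []string `json:"repos,omitempty"`
	Labels []string `json:"labels,omitempty"`
}

// determinize parses raw into the struct and serializes it again, which
// discards the input's key order and whitespace.
func determinize(raw []byte) ([]byte, error) {
	var q tideQuery
	if err := json.Unmarshal(raw, &q); err != nil {
		return nil, err
	}
	return json.MarshalIndent(q, "", "  ")
}

func main() {
	a := []byte(`{"labels":["lgtm"],   "orgs":["openshift"]}`)
	b := []byte(`{"orgs": ["openshift"], "labels": ["lgtm"], "repos": []}`)

	outA, err := determinize(a)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	outB, err := determinize(b)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("identical after roundtrip:", string(outA) == string(outB))
}
```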
+ +### Sharding file structure +``` +{sharded-prow-config-base-dir}/ + {org}/ + _prowconfig.yaml # org-level branch protection, tide queries, merge methods + {repo}/ + _prowconfig.yaml # repo-level branch protection, tide queries, slack reporter, merge methods + +{sharded-plugin-config-base-dir}/ + {org}/ + _pluginconfig.yaml # org-level bugzilla defaults, plugins + {repo}/ + _pluginconfig.yaml # repo-level plugins, approve, lgtm, triggers, welcome, external plugins, bugzilla, restricted labels +``` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-config-dir` | (required) | Path to the Prow configuration directory containing `_config.yaml` and `_plugins.yaml` | +| `--sharded-prow-config-base-dir` | `""` | Base directory for sharded Prow config output; if set, org/repo-specific config is extracted from main config and written here | +| `--sharded-plugin-config-base-dir` | `""` | Base directory for sharded plugin config output; if set, org/repo-specific plugin config is extracted and written here | + +## Key files +- `cmd/determinize-prow-config/main.go` -- entry point, orchestrates config and plugin config normalization +- `pkg/api/shardprowconfig/shardprowconfig.go` -- `ShardProwConfig()` implements Prow config sharding (branch protection, tide, slack reporter, merge methods) +- `pkg/prowconfigsharding/prowconfigsharding.go` -- `WriteShardedPluginConfig()` implements plugin config sharding (plugins, bugzilla, approve, lgtm, triggers, welcome, external plugins, restricted labels) +- `pkg/config/release.go` -- constants: `ProwConfigFile` (`_config.yaml`), `PluginConfigFile` (`_plugins.yaml`), `SupplementalProwConfigFileName` (`_prowconfig.yaml`), `SupplementalPluginConfigFileName` (`_pluginconfig.yaml`) + +## Deployment +CLI tool. Run as part of the config generation pipeline in openshift/release (via `make jobs` or `auto-config-brancher`). Also used to verify Prow config is determinized in presubmit checks. 
diff --git a/cmd/docgen/README.md b/cmd/docgen/README.md new file mode 100644 index 00000000000..c47ea821321 --- /dev/null +++ b/cmd/docgen/README.md @@ -0,0 +1,42 @@ +# docgen + +## What +Generates a Go source file containing an annotated YAML reference for the `ReleaseBuildConfiguration` struct (the ci-operator config schema). The output is a string constant used by the step registry web UI to display interactive, commented documentation of every ci-operator config field. + +## How it works -- full flow + +1. Glob all `.go` files in `./pkg/api/` to find the Go source files that define the ci-operator configuration types. +2. Build a `CommentMap` using Prow's `genyaml.NewCommentMap()`: + - Parses the Go source files' AST to extract doc comments from every struct field + - Maps each struct field's JSON/YAML tag to its corresponding doc comment + - Uses a trivial import path resolver (identity function that returns the directory unchanged) +3. Generate an annotated reference YAML by calling `commentMap.GenYaml()` with a fully-populated instance of `ReleaseBuildConfiguration`: + - `genyaml.PopulateStruct()` uses reflection to create an instance of the struct with all fields set to non-zero values (including nested structs, slices, and maps), so every possible field appears in the output + - `GenYaml()` serializes this populated struct to YAML and injects the doc comments as `#`-prefixed comment lines above each field +4. Post-process the generated YAML string into a Go string constant: + - Escape all double quotes (`"` -> `\"`) + - Split the YAML into individual lines + - Wrap each line into a Go string concatenation expression (`"line1\n" + \n"line2\n" + ...`) + - Prepend with `package webreg` declaration and `const ciOperatorReferenceYaml = "..."` +5. Write the result to `./pkg/webreg/zz_generated.ci_operator_reference.go` with `0644` permissions. 
+ +### Output file +The generated file `zz_generated.ci_operator_reference.go` contains a single large string constant `ciOperatorReferenceYaml` in the `webreg` package. This constant is used by the configresolver web UI to render the ci-operator configuration reference documentation page. + +The `zz_generated.` prefix follows the Go convention for generated files that should not be manually edited and are typically excluded from linting. + +### Important: must run from repo root +The tool uses relative paths (`./pkg/api/*.go` and `./pkg/webreg/...`), so it **must** be executed from the ci-tools repository root directory. + +## Flags +None. This tool takes no command-line flags. + +## Key files +- `cmd/docgen/main.go` -- entire tool: source parsing, YAML generation, Go source output +- `pkg/api/types.go` -- the `ReleaseBuildConfiguration` struct and all nested types whose doc comments become the reference documentation +- `pkg/webreg/zz_generated.ci_operator_reference.go` -- the generated output file consumed by the web UI +- `vendor/sigs.k8s.io/prow/pkg/genyaml/genyaml.go` -- `NewCommentMap()` and `GenYaml()` that parse Go comments and produce annotated YAML +- `vendor/sigs.k8s.io/prow/pkg/genyaml/populate_struct.go` -- `PopulateStruct()` that creates a fully-populated struct instance via reflection + +## Deployment +CLI tool. Run as part of the code generation pipeline (typically `make generate` or equivalent). Must be re-run whenever the `pkg/api/` type definitions or their doc comments change. Not deployed as a service. diff --git a/cmd/dptp-controller-manager/README.md b/cmd/dptp-controller-manager/README.md index e48127a1328..ca17e0a0246 100644 --- a/cmd/dptp-controller-manager/README.md +++ b/cmd/dptp-controller-manager/README.md @@ -1,3 +1,79 @@ -# DPTP controller manager +# dptp-controller-manager -Contains controllers owned by dptp. 
You wil find their code and more detailled READMEs below pkg/controller/ +## What +Multi-controller Kubernetes manager handling image promotion, test image distribution across build clusters, service account secret lifecycle, test image import cleanup, and ephemeral cluster provisioning. Connects to multiple clusters simultaneously. + +## How it works + +### Multi-cluster architecture +- **Main manager**: Created for app.ci cluster with leader election +- **Build cluster managers**: One per build cluster kubeconfig, no leader election, metrics disabled +- **Registry manager**: Special manager for registry cluster (default: app.ci) with 24-hour cache sync period +- All build cluster managers are added as Runnables to the main manager + +### Controllers + +#### promotionreconciler +Watches ImageStreamTags on the registry cluster. When a promotion target IST exists but its commit is stale (doesn't match current branch HEAD on GitHub), enqueues a ProwJob to re-promote. + +- Indexes CI operator configs by promotion target IST +- Subscribes to config changes — when new promotion configs appear, checks if IST exists and creates ProwJob if missing +- Only reconciles tags younger than `--promotionReconcilerOptions.since` (default: 15 days) +- MaxConcurrentReconciles: 100 +- Ignore patterns via `--promotionReconcilerOptions.ignore-image-stream` (regex) + +#### testimagesdistributor +Distributes ImageStreamTags from the registry cluster to all build clusters. When ci-operator configs reference test input images, this controller ensures those images are available on the cluster where the job runs. 
+ +- Creates ImageStreamImport requests on target clusters to sync images +- Creates namespaces and RBAC roles on target clusters as needed +- Blocks imports from forbidden registries (`--testImagesDistributorOptions.forbidden-registry`) +- Can ignore specific clusters (`--testImagesDistributorOptions.ignore-cluster-name`) +- MaxConcurrentReconciles: 1 (conflicts on ImageStream level) + +#### serviceaccount_secret_refresher +Manages service account token and pull secret lifecycle across all clusters. + +- Watches ServiceAccounts and Secrets +- Checks expiration via TTL annotation (`serviaccount-secret-rotator.openshift.io/delete-after`) +- Default TTL: 30 days +- If `--serviceAccountRefresherOptions.remove-old-secrets`: deletes secrets older than 60 days +- Runs per-cluster (controller added for each build cluster) +- MaxConcurrentReconciles: 20 + +#### testimagestreamimportcleaner +Cleans up stale TestImageStreamTagImport objects older than 7 days. If younger, requeues for cleanup at expiry time. + +- Runs per-cluster +- MaxConcurrentReconciles: 10 + +#### ephemeral_cluster_provisioner +Provisions ephemeral clusters for testing via ProwJobs. 
+ +- Watches EphemeralCluster resources in `ephemeral-cluster` namespace +- Creates ProwJobs from PR metadata stored in annotations +- Generates ci-operator config with cluster claim and CLI image +- Manages finalizer: `ephemeralcluster.ci.openshift.io/dependent-prowjob` +- Polling interval configurable (default: 5s) +- MaxConcurrentReconciles: 1 + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--enable-controller` | `promotionreconciler` | Controllers to enable (repeatable) | +| `--ci-operator-config-path` | — | Path to CI operator config | +| `--step-config-path` | — | Path to step registry config | +| `--registry-cluster-name` | `app.ci` | Cluster with CI central registry | +| `--leader-election-namespace` | `ci` | Namespace for leader election | +| `--dry-run` | true | Dry-run mode | +| `--release-repo-git-sync-path` | — | Path to release repo (alternative to config paths) | + +## Key files +- `cmd/dptp-controller-manager/main.go` — controller registration, multi-cluster setup +- `pkg/controller/promotionreconciler/reconciler.go` — promotion logic +- `pkg/controller/test-images-distributor/test_images_distributor.go` — image distribution +- `pkg/controller/serviceaccount_secret_refresher/` — SA secret lifecycle +- `pkg/controller/ephemeralcluster/reconciler.go` — ephemeral cluster provisioning + +## Deployment +Long-lived Deployment on app.ci, namespace ci. Leader election enabled. RBAC distributed across build clusters. diff --git a/cmd/dptp-pools-cm/README.md b/cmd/dptp-pools-cm/README.md new file mode 100644 index 00000000000..7f261a356ec --- /dev/null +++ b/cmd/dptp-pools-cm/README.md @@ -0,0 +1,61 @@ +# dptp-pools-cm + +## What +Controller-manager that runs on the Hive cluster to manage Hive cluster pool infrastructure. It hosts two controllers: + +1. 
**cluster_pools_pull_secret_provider** -- keeps pull secrets in sync across cluster pool namespaces by copying a source pull secret into every namespace that has a ClusterPool referencing it. +2. **hypershift_namespace_reconciler** -- ensures HyperShift hosted control plane namespaces are excluded from monitoring to prevent alert noise from transient control plane namespaces. + +## How it works -- full flow + +### cluster_pools_pull_secret_provider (default: enabled) + +**Watches:** `ClusterPool` resources and `Secret` resources. + +**Trigger conditions:** +- A `ClusterPool` is created or updated in any namespace except the source pull secret namespace. +- The source pull secret itself changes in the source namespace -- triggers reconciliation for all pools across all namespaces. +- A pull secret copy in a target namespace changes -- triggers reconciliation for pools in that specific namespace. + +**Reconciliation:** +1. Gets the `ClusterPool` from the reconcile request. +2. If the pool is deleted, does nothing (returns nil). +3. Checks if the pool has `spec.pullSecretRef` set. Skips if nil. +4. Checks if `spec.pullSecretRef.name` matches the configured source pull secret name. Skips if it does not. +5. Reads the source pull secret from the configured source namespace. +6. Constructs a copy with the same data, labels, and annotations but targeted at the pool's namespace. +7. Adds the label `dptp-requester: cluster_pools_pull_secret_provider` to the copy. +8. Upserts the secret using `util.UpsertImmutableSecret` (create-or-replace for immutable secrets). + +### hypershift_namespace_reconciler (default: not enabled) + +**Watches:** `Namespace` resources. + +**Predicate:** Only namespaces with the label `hypershift.openshift.io/hosted-control-plane` (value empty or `"true"`) are processed. Delete events are ignored. + +**Reconciliation:** +1. 
Calls `controllerutil.EnsureNamespaceNotMonitored` on the namespace to remove any monitoring configuration, preventing alert noise from transient HyperShift control plane namespaces. + +### Controller manager setup +- Uses in-cluster config (runs on the Hive cluster itself). +- Leader election is enabled in the `ci` namespace with lock ID `dptp-pools-cm`. +- Registers the Hive v1 scheme for ClusterPool watches. +- Supports dry-run mode (enabled by default). + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--leader-election-namespace` | `ci` | Namespace for leader election | +| `--leader-election-suffix` | `""` | Suffix for leader election lock name (useful for local testing, requires `--dry-run`) | +| `--enable-controller` | `cluster_pools_pull_secret_provider` | Which controllers to enable. Repeatable. Valid values: `cluster_pools_pull_secret_provider`, `hypershift_namespace_reconciler` | +| `--poolsPullSecretProviderOptions.sourcePullSecretNamespace` | `ci-cluster-pool` | Namespace containing the source pull secret | +| `--poolsPullSecretProviderOptions.sourcePullSecretName` | `pull-secret` | Name of the source pull secret | +| `--dry-run` | `true` | Dry-run mode for the controller manager | + +## Key files +- `cmd/dptp-pools-cm/main.go` -- entry point, flag parsing, controller manager setup +- `pkg/controller/cluster_pools_pull_secret_provider/cluster_pools_pull_secret_provider.go` -- pull secret sync controller: reconciler, watch handlers +- `pkg/controller/hypershift_namespace_reconciler/hypershift_namespace_reconciler.go` -- HyperShift namespace controller: reconciler, label predicate + +## Deployment +Long-lived Deployment on the Hive cluster (hosted-mgmt), namespace `ci`. Uses leader election so multiple replicas can run for HA. 
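The reconciliation steps listed for `cluster_pools_pull_secret_provider` can be sketched roughly as below. This is a minimal illustration, not the controller's code: `Secret` and `Pool` are hypothetical, simplified stand-ins for `corev1.Secret` and the Hive `ClusterPool`, and `copyFor` is an invented helper name. The real controller finishes by calling `util.UpsertImmutableSecret`, which is not modeled here.

```go
package main

import "fmt"

// Secret is a simplified, hypothetical stand-in for corev1.Secret.
type Secret struct {
	Namespace   string
	Name        string
	Labels      map[string]string
	Annotations map[string]string
	Data        map[string][]byte
}

// Pool is a simplified, hypothetical stand-in for a Hive ClusterPool.
type Pool struct {
	Namespace     string
	PullSecretRef *string // nil models an unset spec.pullSecretRef
}

const (
	requesterLabel = "dptp-requester"
	controllerName = "cluster_pools_pull_secret_provider"
)

// cloneStrings copies a string map so the source secret is never mutated.
func cloneStrings(in map[string]string) map[string]string {
	out := make(map[string]string, len(in)+1)
	for k, v := range in {
		out[k] = v
	}
	return out
}

// copyFor returns the secret that would be upserted into the pool's namespace.
// The boolean is false when the pool has no pullSecretRef or references a
// different secret name, in which case reconciliation skips the pool.
func copyFor(pool Pool, source Secret) (Secret, bool) {
	if pool.PullSecretRef == nil || *pool.PullSecretRef != source.Name {
		return Secret{}, false
	}
	labels := cloneStrings(source.Labels)
	labels[requesterLabel] = controllerName
	return Secret{
		Namespace:   pool.Namespace,
		Name:        source.Name,
		Labels:      labels,
		Annotations: cloneStrings(source.Annotations),
		Data:        source.Data,
	}, true
}

func main() {
	name := "pull-secret"
	source := Secret{
		Namespace: "ci-cluster-pool",
		Name:      name,
		Data:      map[string][]byte{".dockerconfigjson": []byte("{}")},
	}
	if c, ok := copyFor(Pool{Namespace: "team-a-pools", PullSecretRef: &name}, source); ok {
		fmt.Printf("upsert %s/%s with labels %v\n", c.Namespace, c.Name, c.Labels)
	}
}
```

Keeping the skip decision and the copy construction in one pure function makes the behavior easy to unit test without a cluster.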
diff --git a/cmd/entrypoint-wrapper/README.md b/cmd/entrypoint-wrapper/README.md new file mode 100644 index 00000000000..1bbbd8a002b --- /dev/null +++ b/cmd/entrypoint-wrapper/README.md @@ -0,0 +1,68 @@ +# entrypoint-wrapper + +## What +Wraps every multi-stage test step executed by ci-operator. Manages secret copying to writable temp locations, kubeconfig isolation with polling fallback, HOME directory fixups, Git safe directory config, signal forwarding to child processes, and post-execution artifact upload as Kubernetes Secrets. + +Not a standalone service. ci-operator injects this binary as the entrypoint of every test step container. + +## How it works — full flow + +### 1. Parse and validate +- Reads required env vars: `SHARED_DIR`, `NAMESPACE`, `JOB_NAME_SAFE` +- Validates mode flag (one of three modes below) +- Creates Kubernetes client unless dry-run or skip-kubeconfig mode + +### 2. Copy SHARED_DIR to writable temp +`copyDir()` copies files from the mounted `SHARED_DIR` to `$TMPDIR/secret`. This is a **non-recursive, top-level-only copy** — subdirectories are skipped. The `SHARED_DIR` env var is then updated to point to the temp copy. This is how steps pass data to subsequent steps: each step reads the previous step's shared dir snapshot from a mounted Secret, writes to the temp copy, and the wrapper uploads it back. + +### 3. Wait for file (optional) +If `--wait-for-file` is set, blocks until the file appears (polling). Used for observer pods that need to wait for cluster install to complete. `--wait-timeout` caps how long to wait. + +### 4. Spawn kubeconfig upload goroutine +If `uploadKubeconfig` is true, starts a background goroutine that polls for `kubeconfig` (and `kubeconfig-minimal`) in the shared dir and uploads them as Kubernetes Secrets. Uses `wait.PollUntil()` with 1-second intervals. This runs concurrently with the test — observers can read the kubeconfig as soon as it appears. + +### 5. 
Set up environment and execute child process +Before exec'ing the actual test command: + +- **HOME fixup** (`manageHome()`): If `HOME` is unset or not writable (checked via `syscall.Access`), sets `HOME=/alabama`. This ensures kubectl/oc can write discovery cache. The `/alabama` path is a deliberate magic constant. + +- **Git config** (`manageGitConfig()`): Creates `$HOME/.gitconfig` with `[safe] directory = *` if it doesn't already exist. This disables Git's ownership verification — necessary because build containers run with different UIDs than the mounted repo. + +- **CLI_DIR in PATH**: If `CLI_DIR` env var is set, prepends it to `PATH` so tools installed by previous steps are available. + +- **Kubeconfig isolation** (`manageKubeconfig()`): Creates a temp file copy of the original `KUBECONFIG`. If the original doesn't exist yet (common for observer pods where kubeconfig arrives later), starts a background polling goroutine using `wait.PollImmediateInfinite(time.Second)` that copies the kubeconfig as soon as it appears. Updates `KUBECONFIG` env var to point to the isolated copy. + +- **Signal forwarding**: Registers handlers for `SIGINT` and `SIGTERM`, forwards them to the child process. Runs in a goroutine that exits when the child process completes. + +- **Exec**: Starts the child command with the modified environment, waits for completion. + +### 6. 
Cleanup and upload +After the child exits: +- Cancels the kubeconfig upload goroutine +- If `updateSharedDir` is true: reads all files from the temp shared dir via `util.SecretFromDir()` (skips directories, broken symlinks, symlinks to directories) and creates/updates a Kubernetes Secret named after `JOB_NAME_SAFE` +- Returns the child's exit code + +## Three modes + +| Mode | Flag value | uploadKubeconfig | updateSharedDir | rwKubeconfig | Use case | +|---|---|---|---|---|---| +| **manage-kubeconfig** | `manage-kubeconfig` (default) | true | true | true | Normal test steps | +| **skip-kubeconfig** | `skip-kubeconfig` | false | false | false | Steps that don't need cluster access | +| **observer** | `observer` | false | false | true | Observer pods — read-only kubeconfig, no upload | + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--mode` | `manage-kubeconfig` | Kubeconfig management mode (see table above) | +| `--dry-run` | `false` | Print the secret instead of creating it | +| `--wait-for-file` | `""` | Path to file to wait for before starting the child | +| `--wait-timeout` | `""` | Max wait duration; requires `--wait-for-file` | + +## Key files +- `cmd/entrypoint-wrapper/main.go` — all logic in one file (~480 lines) +- `pkg/util/secret.go` — `SecretFromDir()` reads directory contents into a k8s Secret + +## Deployment +Not independently deployed. Injected into every multi-stage test step container by ci-operator. + +Container image: built from `images/entrypoint-wrapper/Dockerfile`, base `ubi9/ubi-minimal`. diff --git a/cmd/generate-registry-metadata/README.md b/cmd/generate-registry-metadata/README.md new file mode 100644 index 00000000000..02a17f7d0cc --- /dev/null +++ b/cmd/generate-registry-metadata/README.md @@ -0,0 +1,57 @@ +# generate-registry-metadata + +## What +Generates `.metadata.json` files for each step registry component (references, chains, workflows, observers). 
These metadata files contain the relative path and OWNERS information for each component and are consumed by the ci-operator-configresolver's web UI to display ownership and navigation information. + +## How it works -- full flow + +### Metadata generation +1. Parse the `--registry` flag (required), pointing to the step registry directory. +2. Walk the entire registry directory tree recursively using `filepath.WalkDir()`. +3. For each `.yaml` file found (which represents a registry component -- a step reference, chain, workflow, or observer definition): + a. Compute the file's path relative to the registry root directory. + b. Look for an `OWNERS` file in the **same directory** as the component. Every registry component directory is **required** to have an OWNERS file. + c. If the OWNERS file is missing, record an error (but continue processing other files). + d. If the OWNERS file exists, read and unmarshal it as a Prow `repoowners.Config` struct (supports `approvers`, `reviewers`, `labels`, etc.). + e. Store the metadata: `{filename.yaml -> {path: relative_path, owners: owners_config}}`. +4. If any errors occurred (missing OWNERS files, read failures, unmarshal failures), aggregate them and exit with a fatal error. + +### Metadata writing +5. For each collected metadata entry, write a `.metadata.json` file: + - The output file is placed in the same directory as the source `.yaml` file + - The filename is derived from the component filename: `{component-name}.metadata.json` (the `.yaml` extension is replaced with `.metadata.json`) + - Example: `ci-operator/step-registry/ipi/install/ipi-install-ref.yaml` produces `ipi-install-ref.metadata.json` in the same directory + - The JSON is pretty-printed with tab indentation via `json.MarshalIndent()` +6. Written with `0644` permissions. 
+ +### Output format +Each `.metadata.json` file contains: +```json +{ + "path": "ipi/install/ipi-install-ref.yaml", + "owners": { + "approvers": ["user1", "user2"], + "reviewers": ["user3"] + } +} +``` + +The `path` field is the component's path relative to the registry root, and `owners` is the full parsed OWNERS config from the component's directory. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--registry` | (required) | Path to the step registry directory | + +## Key files +- `cmd/generate-registry-metadata/main.go` -- entire tool: directory walking, OWNERS parsing, JSON output +- `pkg/api/types.go` -- `RegistryMetadata` (map of filename to `RegistryInfo`) and `RegistryInfo` (path + owners) type definitions +- `pkg/load/load.go` -- `MetadataSuffix` constant (`.metadata.json`), used for consistency with the registry loader that reads these files back + +## Gotchas +- Every registry component directory **must** have an OWNERS file. If any are missing, the tool reports errors for all of them but still processes the rest. +- The tool reads OWNERS files using `gzip.ReadFileMaybeGZIP()`, so gzip-compressed OWNERS files are supported (though uncommon). +- Only `.yaml` files are considered as registry components. Other files (`.md` docs, `.metadata.json` from previous runs, etc.) are ignored. + +## Deployment +CLI tool. Run as part of the registry metadata generation pipeline. The output `.metadata.json` files are checked into the repository alongside the registry components and consumed by the configresolver web UI at runtime. 
diff --git a/cmd/github-ldap-user-group-creator/README.md b/cmd/github-ldap-user-group-creator/README.md index e1410d62ea9..9af269292ba 100644 --- a/cmd/github-ldap-user-group-creator/README.md +++ b/cmd/github-ldap-user-group-creator/README.md @@ -1,5 +1,78 @@ # github-ldap-user-group-creator +## What +Batch job that synchronizes OpenShift `Group` resources across all CI clusters to match the canonical set of Rover/LDAP group memberships and GitHub-to-Kerberos identity mappings. It creates three categories of groups: + +1. **Per-GitHub-user groups** (`github-com-`) -- one member, the user's Kerberos ID, on all clusters except `hive` +2. **Rover groups** (e.g. `ci-admins`, team groups) -- resolved memberships from LDAP, with optional cluster targeting and renaming +3. **openshift-priv admins group** (`ocp-private-admins`) -- admins/members of the openshift-priv GitHub org, mapped to Kerberos IDs, on `app.ci` only + +It also optionally deletes OpenShift `User` and `Identity` objects for people no longer present in Rover, and uploads user records to BigQuery for analytics. + +This is the second half of a two-stage pipeline: `sync-rover-groups` resolves LDAP memberships and writes YAML files, then this tool consumes those files and applies the resulting groups to clusters. + +## How it works -- full flow + +1. **Load inputs**: + - Read the GitHub-to-Kerberos user mapping from `--github-users-file` (YAML array of `rover.User` with `uid`, `github_username`, `cost_center`) + - Read the resolved Rover group memberships from `--groups-file` (YAML map of group name to Kerberos ID list) + - Optionally load peribolos config to extract openshift-priv org admins/members + - Optionally load group config for cluster targeting and renaming + +2. **BigQuery upload**: Insert all user records into the `ci_analysis_us.users` table in the `openshift-gce-devel` GCP project with the current timestamp. This enables analytics on CI user populations over time. + +3. 
**Build GitHub-to-Kerberos mapping**: Call `rover.MapGithubToKerberos()` to create a `map[string]string` from GitHub login to Kerberos ID. + +4. **Safety checks**: + - Verify the `ci-admins` group exists in the groups file and has at least 3 members; fatal if not + - Warn if no openshift-priv admins were found + +5. **Optionally delete invalid users** (`--delete-invalid-users`): For each cluster, list all OpenShift `User` objects. Delete any whose name is not a known Kerberos ID, along with their associated `Identity` objects. Hard-coded exceptions: `backplane-cluster-admin` is always skipped, ci-admins members are never deleted (safety valve). + +6. **Build group map** (`makeGroups`): + - Per-GitHub-user groups on all clusters except `hive` + - openshift-priv admins group on `app.ci` only (requires `--peribolos-config`). Logs errors for admins with no GitHub-to-Kerberos mapping unless they are in `--skip-ocp-priv-admin`. + - Rover groups on their configured clusters (all clusters by default, overridable via config). Groups can be renamed via `rename_to` in the config, with the original name stored in a `rover-group-name` label. + +7. **Ensure groups on clusters** (`ensureGroups`): + - First pass: list all groups on each cluster labeled `dptp-requester: github-ldap-user-group-creator`. Delete any that are no longer in the desired set or no longer targeted to that cluster. Never deletes `ci-admins`. + - Second pass: concurrently (up to `--concurrency` goroutines via semaphore), upsert each group on its target clusters. Validates group names are non-empty and members have no duplicates or blanks before upserting. + - Upsert logic: attempt `Create`; if `AlreadyExists` and members differ, `Delete` then `Create` (OpenShift Group objects cannot be updated for the `Users` field). If members match, no-op. + - Retries with exponential backoff (4 steps, factor 2, starting at 1 second). + - Skips clusters disabled by Prow config. 
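The validate-then-upsert behavior from step 7 can be illustrated with a small in-memory model. Everything here is a hedged sketch: `fakeCluster`, `validate`, `sameMembers`, and `upsert` are hypothetical names, the cluster is a plain map rather than an OpenShift client, and comparing members without regard to order is an assumption. Retries with backoff and the dry-run path are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

var errAlreadyExists = errors.New("already exists")

// fakeCluster is a hypothetical in-memory stand-in: group name -> members.
type fakeCluster map[string][]string

// create fails with errAlreadyExists when the group is already present.
func (c fakeCluster) create(name string, members []string) error {
	if _, ok := c[name]; ok {
		return errAlreadyExists
	}
	c[name] = append([]string(nil), members...)
	return nil
}

// validate rejects empty group names and blank or duplicate members.
func validate(name string, members []string) error {
	if name == "" {
		return errors.New("group name is empty")
	}
	seen := map[string]bool{}
	for _, m := range members {
		if m == "" {
			return fmt.Errorf("group %s has a blank member", name)
		}
		if seen[m] {
			return fmt.Errorf("group %s has duplicate member %s", name, m)
		}
		seen[m] = true
	}
	return nil
}

// sameMembers compares two member lists ignoring order (an assumption).
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// upsert tries create first; on a conflict it either leaves the group alone
// (members match) or deletes and recreates it (members differ).
func upsert(c fakeCluster, name string, members []string) (string, error) {
	if err := validate(name, members); err != nil {
		return "", err
	}
	err := c.create(name, members)
	if err == nil {
		return "created", nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return "", err
	}
	if sameMembers(c[name], members) {
		return "unchanged", nil
	}
	delete(c, name)
	if err := c.create(name, members); err != nil {
		return "", err
	}
	return "replaced", nil
}

func main() {
	c := fakeCluster{}
	for _, members := range [][]string{{"alice", "bob"}, {"bob", "alice"}, {"alice", "carol"}} {
		action, err := upsert(c, "ci-admins", members)
		fmt.Println(action, err)
	}
}
```

Returning the action taken makes it straightforward to log intended changes in a dry-run mode without touching the cluster.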
+ +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--github-users-file` | (required) | YAML file with GitHub-to-Kerberos user mappings (produced by `sync-rover-groups`) | +| `--groups-file` | (required) | YAML file with resolved Rover group memberships (produced by `sync-rover-groups`) | +| `--gcp-credentials-file` | (required) | GCP service account JSON for BigQuery writes | +| `--config-file` | `""` | Group config YAML for per-group cluster targeting and renaming | +| `--peribolos-config` | `""` | Peribolos org config file; used to extract openshift-priv admins | +| `--org-from-peribolos-config` | `openshift-priv` | Org to read admins/members from in the peribolos config | +| `--skip-ocp-priv-admin` | (empty) | GitHub logins to exclude from the ocp-private-admins group (repeatable) | +| `--dry-run` | `true` | When true, log intended changes without modifying clusters | +| `--delete-invalid-users` | `false` | Delete OpenShift User/Identity objects for users not in Rover | +| `--concurrency` | `60` | Max concurrent goroutines for group upsert operations | +| `--log-level` | `info` | Log verbosity | +| Kubernetes flags | (in-cluster) | `--kubeconfig`, `--context`, etc. via Prow's `KubernetesOptions` | + +## Key files + +- `cmd/github-ldap-user-group-creator/main.go` -- all logic: option parsing, group construction (`makeGroups`), cluster synchronization (`ensureGroups`, `upsertGroup`), BigQuery upload, user deletion +- `pkg/group/config.go` -- `Config` type with `ClusterGroups` and per-group `GroupConfig` (cluster targeting via `ResolveClusters`, `RenameTo`) +- `pkg/rover/types.go` -- `User` type (`UID`, `GitHubUsername`, `CostCenter`), `MapGithubToKerberos()` helper +- `pkg/rover/bigquery.go` -- `UserItem` type with `Save()` for BigQuery insertion + +## Deployment +Runs as a periodic Prow job (CronJob) on app.ci. Consumes output files from `sync-rover-groups` which runs earlier in the pipeline. 
Requires kubeconfigs for all CI build clusters and GCP credentials for BigQuery. + +All groups it creates are labeled `dptp-requester: github-ldap-user-group-creator` for ownership tracking and cleanup. + +## Related +- `cmd/sync-rover-groups` -- upstream: produces the `--groups-file` and `--github-users-file` this tool consumes +- `pkg/api` -- constants: `CIAdminsGroupName`, `DPTPRequesterLabel`, `GitHubUserGroup()`, cluster names ## What it does `github-ldap-user-group-creator` is a tool to maintain the groups on CI clusters. diff --git a/cmd/gpu-scheduling-webhook/README.md b/cmd/gpu-scheduling-webhook/README.md new file mode 100644 index 00000000000..c3c992234c3 --- /dev/null +++ b/cmd/gpu-scheduling-webhook/README.md @@ -0,0 +1,38 @@ +# gpu-scheduling-webhook + +## What +Kubernetes mutating admission webhook that automatically injects tolerations onto pods requesting GPU or KVM device resources. Without this webhook, pods requesting `nvidia.com/gpu` or `devices.kubevirt.io/kvm` would remain unschedulable because the nodes providing those resources are tainted to prevent non-GPU/KVM workloads from landing on them. + +## How it works -- full flow + +### Webhook registration +The webhook is built with controller-runtime's `WebhookManagedBy` builder, which registers it as a defaulting webhook for Pod resources. When a pod is created, the API server sends an admission review to this webhook. + +### GPU toleration injection +1. The webhook checks all init containers and regular containers for resource requests or limits containing `nvidia.com/gpu` (defined as `api.NvidiaGPUResource`). +2. If any container needs a GPU, the webhook checks whether the pod already has the required toleration. +3. If not present, it appends a toleration: `nvidia.com/gpu=true:NoSchedule`. + +### KVM/virt-launcher toleration injection +1. The webhook checks all init containers and regular containers for resource requests or limits containing `devices.kubevirt.io/kvm`. +2. 
If any container needs KVM, the webhook checks whether the pod already has the required toleration. +3. If not present, it appends a toleration: `ci-workload=virt-launcher:NoSchedule`. + +Both checks are idempotent -- if the toleration already exists, the webhook logs that fact and does nothing. + +### Health probes +The webhook registers `/healthz` and `/readyz` endpoints on the health probe address for liveness and readiness checks. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--cert-dir` | (none) | Directory containing `tls.crt` and `tls.key` for the webhook server | +| `--port` | 0 (controller-runtime default: 9443) | Port the webhook server listens on | +| `--health-probe-addr` | `:8081` | Address for health probe endpoints (`/healthz`, `/readyz`) | + +## Key files +- `cmd/gpu-scheduling-webhook/main.go` -- everything: entry point, cobra command, toleration definitions, pod mutation logic, health probes, webhook server setup +- `pkg/api/types.go` -- defines `NvidiaGPUResource` constant used for the GPU resource name + +## Deployment +Long-lived Deployment on build farm clusters, registered as a `MutatingWebhookConfiguration` for Pod resources. Uses controller-runtime's webhook server with TLS certificates from `--cert-dir`. Requires in-cluster kubeconfig. diff --git a/cmd/gsm-secret-sync/README.md b/cmd/gsm-secret-sync/README.md index 3ad3eabf6f9..95eabd19726 100644 --- a/cmd/gsm-secret-sync/README.md +++ b/cmd/gsm-secret-sync/README.md @@ -1,7 +1,56 @@ -# GSM Secret Sync - -A tool for synchronizing Google Secret Manager (GSM) resources with OpenShift CI secret collections based on the configuration defined in [_config.yaml](https://github.com/openshift/release/blob/main/core-services/sync-rover-groups/_config.yaml). +# gsm-secret-sync + +## What +Reconciler that manages GCP-side resources for Google Secret Manager (GSM): service accounts, IAM policy bindings, and secrets. 
It reads a declarative configuration file, compares the desired state against the actual state in GCP, computes a diff, and applies the necessary create/delete operations. This is the infrastructure counterpart to ci-secret-generator -- it ensures the GCP project has the right service accounts and secrets in place before ci-secret-generator writes secret values. + +## How it works -- full flow + +1. **Load configuration.** Reads command-line flags and validates them. Loads GCP credentials from the service account key file (censored from logs). Loads GCP project configuration from environment via `gsm.GetConfigFromEnv()`. + +2. **Parse desired state.** Calls `gsm.GetDesiredState(configFile, projectConfig)` which parses the config file and produces four sets: + - Desired service accounts (one per secret collection) + - Desired secrets (individual secret entries within collections) + - Desired IAM bindings (granting service accounts access to secrets) + - Desired collections (groupings of secrets) + +3. **Fetch actual state.** Creates three GCP API clients and fetches current state: + - **Resource Manager client** (`resourcemanager.NewProjectsClient`): retrieves the current project IAM policy. + - **IAM client** (`iamadmin.NewIamClient`): lists existing "updater" service accounts (those managed by this tool) in the GCP project. + - **Secret Manager client** (`secretmanager.NewClient`): lists all existing secrets in the GSM project. + +4. **Compute diff.** Calls `gsm.ComputeDiff(...)` which compares desired vs. actual state across all four dimensions: + - Service accounts to create (in desired but not actual) + - Service accounts to delete (in actual but not desired) + - Secrets to create (in desired but not actual) + - Secrets to delete (in actual but not desired) + - Consolidated IAM policy (if bindings differ from current policy) + +5. 
**Log change summary.** Logs the total number of changes and details at debug/trace level: + - Service accounts to create/delete + - Secrets to create/delete + - IAM policy binding updates + +6. **Apply changes.** If `--dry-run` is false, calls `actions.ExecuteActions(ctx, iamClient, secretsClient, projectsClient)` which: + - Creates new service accounts + - Deletes obsolete service accounts + - Creates new secrets + - Deletes obsolete secrets + - Updates the project IAM policy with consolidated bindings + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config` | (required) | Path to the GSM config file defining desired service accounts, secrets, and collections | +| `--gcp-service-account-key-file` | (required) | Path to GCP service account key file (JSON format) with permissions to manage IAM, secrets, and projects | +| `--dry-run` | `false` | When true, compute and log changes without applying them | +| `--log-level` | `info` | Log verbosity level | + +## Key files +- `cmd/gsm-secret-sync/main.go` -- entry point: config loading, GCP client creation, diff computation, action execution +- `pkg/gsm-secrets/` -- core library: `GetDesiredState`, `ComputeDiff`, `Actions`, `ExecuteActions`, `GetAllSecrets`, `GetUpdaterServiceAccounts`, `GetProjectIAMPolicy` +## Deployment +Runs as a postsubmit Prow job (`branch-ci-openshift-release-master-gsm-secrets-reconciler`), triggered when the GSM config in openshift/release changes. Requires a GCP service account with broad permissions: `iam.admin`, `secretmanager.admin`, and `resourcemanager.projectIamAdmin` on the target GCP project. ## Overview Secrets in Google Secret Manager are organized into **secret collections** - logical groupings that define access boundaries and management scope. 
GSM Secret Sync manages the lifecycle of these secret collections in Google Cloud Platform by: diff --git a/cmd/helpdesk-faq/README.md b/cmd/helpdesk-faq/README.md new file mode 100644 index 00000000000..3f4b2b81b5f --- /dev/null +++ b/cmd/helpdesk-faq/README.md @@ -0,0 +1,60 @@ +# helpdesk-faq + +## What +HTTP server that exposes FAQ items stored in a Kubernetes ConfigMap as a JSON API. Provides a read-only endpoint for the CI helpdesk FAQ system, with a 15-minute in-memory cache to reduce API server load. + +The FAQ data originates from Slack interactions (via a separate Slack bot component) and is stored in the `helpdesk-faq` ConfigMap in the `ci` namespace. This server makes that data available over HTTP for web frontends and other consumers. + +## How it works -- full flow + +### Startup +1. Parse flags, validate options, load in-cluster kubeconfig. +2. Create a controller-runtime client for the local cluster. +3. Initialize a `ConfigMapClient` that wraps the kube client with a 15-minute cache. +4. Start the HTTP server with graceful shutdown support. + +### API endpoints + +#### `GET /api/health` +Returns `{"ok": true}`. Used for liveness/readiness probes. + +#### `GET /api/v1/faq-items` +1. Call `GetSerializedFAQItems()` on the ConfigMap client. +2. If cached items exist and the cache is less than 15 minutes old, return cached data. +3. Otherwise, fetch the `helpdesk-faq` ConfigMap from the `ci` namespace. +4. Sort entries by key (timestamp) in reverse chronological order (newest first). +5. Deserialize each entry from JSON into a `FaqItem` struct. +6. Wrap all items in a `Page` object: `{"data": [...]}` +7. If a `?callback=` query parameter is present, return JSONP (wrapping the response in `();` with `Content-Type: application/javascript`). Otherwise return plain JSON. 
+ +### Data model +Each FAQ item is stored as a JSON value in the ConfigMap, keyed by its Slack timestamp: +``` +FaqItem: + question: {author, topic, subject, body} + timestamp: string (Slack message timestamp) + thread_link: string (URL to Slack thread) + contributing_info: [{author, timestamp, body}, ...] + answers: [{author, timestamp, body}, ...] +``` + +### ConfigMap client operations +The `ConfigMapClient` also supports `UpsertItem`, `RemoveItem`, and `GetFAQItemIfExists` for the Slack bot's write path (those are not exposed via the HTTP API in this server, but are used by the Slack integration). + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--port` | `8080` | HTTP server port | +| `--log-level` | `info` | Log verbosity | +| `--gracePeriod` | `10s` | Grace period for server shutdown | +| Kubernetes flags | (in-cluster) | Standard Prow `KubernetesOptions` | + +## Key files + +- `cmd/helpdesk-faq/main.go` -- HTTP server setup, route handlers, JSONP support +- `pkg/helpdesk-faq/client.go` -- `ConfigMapClient`: ConfigMap read/write, 15-minute caching, sorted retrieval +- `pkg/helpdesk-faq/types.go` -- `FaqItem`, `Question`, `Reply` types + +## Deployment +Long-lived Deployment on app.ci in the `ci` namespace, port 8080. Uses in-cluster service account for ConfigMap access. diff --git a/cmd/image-graph-generator/README.md b/cmd/image-graph-generator/README.md index ef82c4b85b7..406508c0fd0 100644 --- a/cmd/image-graph-generator/README.md +++ b/cmd/image-graph-generator/README.md @@ -1,8 +1,71 @@ -# Image Graph Generator +# image-graph-generator -This tool is responsible for reading multiple ci-operator configurations and generating a graph based on the connections -of all the organizations, repositories, branches, and images that are specified in each configuration. 
+## What +CLI tool that builds and maintains an image dependency graph in a Dgraph database by reading ci-operator configurations, image mirroring mappings, and OpenShift manifests from the openshift/release repository. The resulting graph enables querying parent/child relationships between container images across the entire CI system. +## How it works -- full flow + +### 1. Initialize Dgraph client +Connects to the Dgraph GraphQL endpoint specified by `--graphql-endpoint-address` using the `shurcooL/graphql` client. + +### 2. Load existing graph state +The `Operator.Load()` method populates in-memory caches by querying Dgraph for all existing data: +- **Images**: queries all `Image` nodes, caching `name -> id` mappings +- **Organizations**: queries all `Organization` nodes +- **Repositories**: queries all `Repository` nodes +- **Branches**: queries all `Branch` nodes +- **Manifests**: walks `clusters/app.ci/` in the release repo, parsing YAML files for `ImageStream` and `BuildConfig` objects + +### 3. Process mirror mappings +`UpdateMirrorMappings()` walks `core-services/image-mirroring/` in the release repo, reading files prefixed with `mapping_`. Each line maps a source image to a destination in the app.ci registry (`registry.ci.openshift.org`). For each destination image: +- Parses the namespace, imagestream name, and tag +- Creates or updates the image node in Dgraph with the source URL + +### 4. Process OpenShift manifests +`AddManifestImages()` processes the ImageStream and BuildConfig objects loaded in step 2: +- **ImageStreams**: for each tag, creates/updates an image node with the tag's `from` reference as the source +- **BuildConfigs**: for each BuildConfig, creates/updates the output image node. If the build uses a DockerStrategy with a `from` reference, that reference becomes a parent image in the graph + +### 5. Process ci-operator configurations +`OperateOnCIOperatorConfigs()` walks `ci-operator/config/` in the release repo. 
For each config file (skipping `openshift-priv`): +- Creates/updates branch references for the org/repo/branch +- For each promotion target, processes every image in `images`: + - Determines the full image name: `{namespace}/{name}:{tag}` or `{namespace}/{tag}:{imageName}` + - Identifies parent images from `from` references and `inputs.as` entries + - Creates or updates the image node in Dgraph with parent/child relationships + - Records whether the image is multi-arch (has `additional_architectures`) + - Skips images listed in `excluded_images` + - Internal base images (`root`, `src`, `bin`) are marked with `fromRoot: true` + +### Graph data model +The Dgraph schema includes these node types: +- **Organization**: org name and ID +- **Repository**: repo name, linked to organization +- **Branch**: branch name, linked to repository +- **Image**: name, namespace, imageStreamRef, source URL, fromRoot flag, multiArch flag, linked to branches and parent images + +All mutations use GraphQL via the `Client` interface (supports both real Dgraph and a fake in-memory client for testing). 
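The destination parsing in step 3 (namespace, imagestream name, and tag out of an app.ci registry reference) might look roughly like this. It is a sketch under assumptions: `parseDestination` and `imageRef` are hypothetical names, the example mapping line is made up, and the real parser in `mirror_mappings.go` may handle more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// appCIRegistry is the app.ci registry host named in step 3.
const appCIRegistry = "registry.ci.openshift.org"

// imageRef is a hypothetical holder for the parsed destination parts.
type imageRef struct {
	Namespace, Name, Tag string
}

// parseDestination splits registry.ci.openshift.org/<namespace>/<name>:<tag>.
func parseDestination(dest string) (imageRef, error) {
	rest := strings.TrimPrefix(dest, appCIRegistry+"/")
	if rest == dest {
		return imageRef{}, fmt.Errorf("%q is not in %s", dest, appCIRegistry)
	}
	path, tag, ok := strings.Cut(rest, ":")
	if !ok || tag == "" {
		return imageRef{}, fmt.Errorf("%q has no tag", dest)
	}
	namespace, name, ok := strings.Cut(path, "/")
	if !ok || namespace == "" || name == "" {
		return imageRef{}, fmt.Errorf("%q is not namespace/name:tag", dest)
	}
	return imageRef{Namespace: namespace, Name: name, Tag: tag}, nil
}

func main() {
	// Hypothetical mapping line of the form "<source> <destination>".
	line := "quay.io/example/tool:latest registry.ci.openshift.org/ci/tool:latest"
	fields := strings.Fields(line)
	ref, err := parseDestination(fields[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("source=%s namespace=%s name=%s tag=%s\n", fields[0], ref.Namespace, ref.Name, ref.Tag)
}
```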
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--release-repo` | (required) | Path to the local checkout of the openshift/release repository | +| `--graphql-endpoint-address` | (required) | URL of the Dgraph GraphQL endpoint | + +## Key files +- `cmd/image-graph-generator/main.go` -- entry point, Dgraph client initialization, orchestration of Load/UpdateMirrorMappings/AddManifestImages/OperateOnCIOperatorConfigs +- `pkg/image-graph-generator/operator.go` -- `Operator` struct, `Load()`, `OperateOnCIOperatorConfigs()`, ci-operator config callback +- `pkg/image-graph-generator/images.go` -- `UpdateImage()`, `addImageRef()`, `updateImageRef()`, `loadImages()`, image URL parsing +- `pkg/image-graph-generator/mirror_mappings.go` -- `UpdateMirrorMappings()`, mirror mapping file parsing +- `pkg/image-graph-generator/manifests.go` -- `loadManifests()`, `AddManifestImages()`, ImageStream and BuildConfig processing +- `pkg/image-graph-generator/organizations.go` -- organization CRUD operations +- `pkg/image-graph-generator/repositories.go` -- repository CRUD operations +- `pkg/image-graph-generator/branches.go` -- branch CRUD operations +- `pkg/image-graph-generator/client.go` -- `Client` interface, `fakeClient` for testing +- `pkg/image-graph-generator/graphql.go` -- GraphQL mutation/query type definitions + +## Deployment +CLI tool. Requires a running Dgraph instance with the appropriate GraphQL schema deployed. Typically run as a periodic Prow job with a local checkout of openshift/release. The `image-graph-generator` is expected to operate against a [Dgraph](https://dgraph.io/) database. The schema is defined and maintained in the `types.graphql` file. @@ -23,4 +86,4 @@ Usage of image-graph-generator: Address of the Dgraph's graphql endpoint. -image-mirroring-path string Path to image mirroring mapping files. 
-``` \ No newline at end of file +``` diff --git a/cmd/job-run-aggregator/README.md b/cmd/job-run-aggregator/README.md index 4c9c4cae8b2..c6b30e07bfa 100644 --- a/cmd/job-run-aggregator/README.md +++ b/cmd/job-run-aggregator/README.md @@ -1,32 +1,111 @@ -The job-run-aggregator finds multiple runs of the same job for the same payload and analyzes the overall result -and the individual junit results. - -The analysis allows failures within (ideally) a standard deviation of the norm for individual tests. -This will allow a single payload to checked by multiple parallel job runs and the average results for each test -checked to ensure that a regression hasn't happened. -That property allows us to have less than perfect test results to start and still be able to latch improvements into -the failure percentages, giving a path to improvement. - -## Development and Debugging tips - -When you run this in debugging mode, use a credential that has read-only access. This way, you -can set breakpoints and study the behavior without risk of overwriting anything. -You will, of course, get "permission denied" errors if write access is required. -At that point, you can (cautiously) switch to a credential that has write access. - -This is a way to build (without optimazations) for debugging: - -``` -cd ci-tools -go build -gcflags='-N -l' `grep "module " go.mod |awk '{print $2}'`/cmd/job-run-aggregator -``` - -Example command lines: - -``` -# Run in dry run mode with read credential -dlv exec ./job-run-aggregator -- upload-test-runs --dry-run --bigquery-dataset ci_data --google-service-account-credential-file ~/project-reader.json +# job-run-aggregator + +## What +Cobra-based multi-command CLI for statistical analysis of CI job runs. It ingests job run data from GCS into BigQuery, analyzes pass/fail rates across multiple runs of the same job (or across jobs for a payload), and detects regressions by comparing against historical baselines. 
Used by the release controller to gate payload acceptance. + +## Subcommands + +### `upload-disruptions` +Uploads backend disruption data from GCS job artifacts into the `BackendDisruption` BigQuery table. +- Reads `backend-disruption` prefixed files from each job run's GCS artifacts. +- Uses 10 concurrent worker goroutines to process job runs. +- Tracks which job runs have already been uploaded to avoid duplicate inserts (BigQuery does not prevent duplicates). +- Only processes jobs where `CollectDisruption` is true in the jobs table. + +### `upload-alerts` +Uploads alert firing data from GCS job artifacts into the `Alerts` BigQuery table. +- Reads `alert` prefixed files from each job run's GCS artifacts. +- Populates zeros for known alerts that were not observed in a run, ensuring correct percentile calculations. +- Maintains a `KnownAlertsCache` of all alert/namespace/level combinations ever seen per release. +- Processes all jobs (not filtered by `CollectDisruption`). + +### `analyze-job-runs` +Aggregates multiple runs of a single job to determine pass/fail for the overall payload. +- Locates job runs by either `--payload-tag` (release controller payloads) or `--aggregation-id` (per-PR payload promotion). +- Waits up to `--timeout` (default 5h30m) for all job runs to complete. +- Uses a `weeklyAverageFromTenDaysAgo` pass/fail calculator that compares current results against a 6-window weekly average baseline. +- Queries job run states from BigQuery or directly from the ProwJob cluster (`--query-source`). +- Writes JUnit XML and a spyglass summary to the working directory. + +### `analyze-test-case` +Analyzes test case pass rates across multiple different jobs for a payload, used to gate payloads on cross-job test stability. +- Filters jobs by `--platform`, `--network`, `--infrastructure`, and `--include-job-names`/`--exclude-job-names`. +- Supports test groups: `install`, `upgrade`, `overall`. 
+- Enforces `--minimum-successful-count` (default 1) successful test runs across matching jobs. +- For PR payloads, uses `--payload-invocation-id` and `--explicit-gcs-prefixes`. + +### `analyze-historical-data` +Compares new BigQuery query results against a current baseline file for alerts, disruptions, or test pass rates. +- Supports data types: `alerts`, `disruptions`, `tests`. +- For alerts and disruptions: compares new vs current data with `--leeway` percentage threshold. +- For tests: generates historical test data without comparison. +- Outputs results to `--output-file` (default `results_{datatype}.json`). + +### `prime-job-table` +Inserts or updates job metadata in the BigQuery jobs table. +- Reads all job names and generates entries with release, platform, network, and other variant data. +- Supports `--dry-run` mode. + +### `create-releases` +Creates the BigQuery release tables schema. + +### `upload-releases` +Uploads release/changelog data to BigQuery for specified `--releases` and `--architectures`. +- Parses release changelogs from the release controller API. 
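The zero-fill described for `upload-alerts` can be illustrated with a small sketch. The types and the `zeroFill` helper are hypothetical, and the integer value stands in for whatever per-run measurement the real table stores; the point is only that every known alert gets a row, observed or not, so percentiles are computed over all runs.

```go
package main

import "fmt"

// alertKey is a hypothetical key for one alert/namespace/level combination.
type alertKey struct {
	Name, Namespace, Level string
}

// zeroFill returns the observed values for one job run plus an explicit
// zero for every known alert that did not fire in that run.
func zeroFill(known []alertKey, observed map[alertKey]int) map[alertKey]int {
	filled := make(map[alertKey]int, len(known)+len(observed))
	for _, k := range known {
		filled[k] = 0
	}
	for k, v := range observed {
		filled[k] = v
	}
	return filled
}

func main() {
	known := []alertKey{
		{Name: "AlertA", Namespace: "ns-a", Level: "warning"},
		{Name: "AlertB", Namespace: "ns-b", Level: "critical"},
	}
	observed := map[alertKey]int{
		{Name: "AlertA", Namespace: "ns-a", Level: "warning"}: 120,
	}
	for k, v := range zeroFill(known, observed) {
		fmt.Printf("%s/%s/%s = %d\n", k.Name, k.Namespace, k.Level, v)
	}
}
```

Without the explicit zeros, a run in which an alert never fired would simply be missing from the data, and the percentile for that alert would be skewed toward runs where it did fire.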
+ +## Common flags (shared across subcommands via BigQuery coordinates and authentication) +| Flag | Default | What it controls | +|---|---|---| +| `--bigquery-project` | (from DataCoordinates) | Google Cloud project ID for BigQuery | +| `--bigquery-dataset` | (from DataCoordinates) | BigQuery dataset ID | +| `--google-service-account-credential-file` | (from Authentication) | Path to GCP service account key file | +| `--google-storage-bucket` | `test-platform-results` | GCS bucket holding test artifacts | + +### analyze-job-runs specific flags +| Flag | Default | What it controls | +|---|---|---| +| `--job` | (required) | Name of the job to inspect | +| `--payload-tag` | (one required) | Payload tag to aggregate (mutually exclusive with `--aggregation-id`) | +| `--aggregation-id` | (one required) | Matches `release.openshift.io/aggregation-id` label on ProwJobs | +| `--explicit-gcs-prefix` | (none) | Override GCS prefix for per-PR payload jobs | +| `--working-dir` | `job-aggregator-working-dir` | Directory for caches and output | +| `--timeout` | `5h30m` | Maximum wait time for aggregation | +| `--job-start-time` | now | Estimated job start time in RFC3339 format | +| `--query-source` | `bigquery` | Source for job states: `bigquery` or `cluster` | + +### analyze-test-case specific flags +| Flag | Default | What it controls | +|---|---|---| +| `--test-group` | `install` | Test group: `install`, `upgrade`, or `overall` | +| `--platform` | (none) | Filter by platform: aws, gcp, azure, vsphere, metal, ovirt, libvirt | +| `--network` | (none) | Filter by network: ovn, sdn | +| `--infrastructure` | (none) | Filter by infrastructure: ipi, upi | +| `--minimum-successful-count` | 1 | Minimum required successful test runs | +| `--payload-invocation-id` | (none) | For PR payloads, matches the prowjob label | +| `--explicit-gcs-prefixes` | (none) | Comma-separated `jobname=prefix` pairs for PR payloads | +| `--timeout` | `3h30m` | Maximum wait time | + +## Key files +- 
`cmd/job-run-aggregator/main.go` -- entry point, delegates to `pkg/jobrunaggregator.NewJobAggregatorCommand()` +- `pkg/jobrunaggregator/cmd.go` -- cobra command tree assembly, registers all subcommands +- `pkg/jobrunaggregator/jobrunaggregatoranalyzer/cmd.go` -- `analyze-job-runs` flag definitions and options builder +- `pkg/jobrunaggregator/jobrunaggregatoranalyzer/analyzer.go` -- core analysis: wait for jobs, collect results, calculate pass/fail +- `pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go` -- `weeklyAverageFromTenDaysAgo` statistical calculator +- `pkg/jobrunaggregator/jobruntestcaseanalyzer/cmd.go` -- `analyze-test-case` flag definitions +- `pkg/jobrunaggregator/jobruntestcaseanalyzer/analyzer.go` -- cross-job test case analysis +- `pkg/jobrunaggregator/jobrunbigqueryloader/disruption.go` -- `upload-disruptions` implementation +- `pkg/jobrunaggregator/jobrunbigqueryloader/alert.go` -- `upload-alerts` implementation, known alerts zero-fill +- `pkg/jobrunaggregator/jobrunbigqueryloader/uploader.go` -- generic job run upload orchestration, 10-worker concurrency +- `pkg/jobrunaggregator/jobtableprimer/cmd.go` -- `prime-job-table` implementation +- `pkg/jobrunaggregator/releasebigqueryloader/cmd.go` -- `create-releases` and `upload-releases` implementations +- `pkg/jobrunaggregator/jobrunhistoricaldataanalyzer/cmd.go` -- `analyze-historical-data` implementation +- `pkg/jobrunaggregator/jobrunaggregatorlib/` -- shared utilities: BigQuery client, GCS client, job run locators, Google auth + +## Deployment +CLI tool. Invoked via periodic Prow jobs and by the release controller for payload gating. Requires GCP service account credentials with BigQuery and GCS access. 
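The upload pattern noted above (10 workers, skipping job runs that were already uploaded because BigQuery does not prevent duplicates) could be sketched as follows. `uploadAll` and its parameters are hypothetical and only illustrate the shape; the real orchestration lives in `uploader.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadAll fans job run IDs out to a fixed number of workers and skips IDs
// that are already marked as uploaded. It returns the number of successful
// uploads and any errors.
func uploadAll(jobRunIDs []string, uploaded map[string]bool, workers int, upload func(id string) error) (int, []error) {
	if workers < 1 {
		workers = 1
	}
	work := make(chan string)
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
		errs  []error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range work {
				err := upload(id)
				mu.Lock()
				if err != nil {
					errs = append(errs, fmt.Errorf("job run %s: %w", id, err))
				} else {
					count++
				}
				mu.Unlock()
			}
		}()
	}
	for _, id := range jobRunIDs {
		if uploaded[id] {
			continue // inserting again would create duplicate rows
		}
		work <- id
	}
	close(work)
	wg.Wait()
	return count, errs
}

func main() {
	uploaded := map[string]bool{"run-1": true}
	n, errs := uploadAll([]string{"run-1", "run-2", "run-3"}, uploaded, 10, func(id string) error {
		fmt.Println("uploading", id)
		return nil
	})
	fmt.Println("uploaded:", n, "errors:", len(errs))
}
```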
+## Development +```sh # Run command to create tables in write mode (in a dataset called "my_dataset", someone's dataset created for testing) dlv exec ./job-run-aggregator -- create-releases --bigquery-dataset my_dataset --google-service-account-credential-file ~/project-write.json diff --git a/cmd/job-trigger-controller-manager/README.md b/cmd/job-trigger-controller-manager/README.md new file mode 100644 index 00000000000..51a7e9cedd9 --- /dev/null +++ b/cmd/job-trigger-controller-manager/README.md @@ -0,0 +1,68 @@ +# job-trigger-controller-manager + +## What +Kubernetes controller that reconciles `PullRequestPayloadQualificationRun` (PRPQR) custom resources into ProwJobs. This is the execution engine for payload qualification testing: when a PRPQR is created (typically by the `payload-testing` Prow plugin), this controller resolves the ci-operator configuration, generates the appropriate ProwJobs, creates them in the cluster, and tracks their status back to the PRPQR status. + +## How it works -- full flow + +### Controller setup +1. Registers a controller-runtime manager watching PRPQR objects in the configured namespace (default: `ci`) +2. Also registers a `pjstatussyncer` sub-controller that watches ProwJobs and syncs their status back to the owning PRPQR +3. Watches for PRPQR create and update events only (not deletes) + +### Reconciliation loop +When a PRPQR is created or updated: + +1. **Fetch existing state**: Get the PRPQR and list any ProwJobs already created for it (matched by the `pullrequestpayloadqualificationrun` label) + +2. **Handle deletion**: If the PRPQR is being deleted, abort all associated ProwJobs by setting their state to `Aborted` + +3.
**Trigger jobs**: For each job spec in `prpqr.Spec.Jobs.Jobs`: + - Skip if a ProwJob already exists for this job (matched by name hash) + - Resolve the ci-operator config by calling the config-resolver service, injecting the test from `MetadataWithTest` + - For aggregated jobs (`AggregatedCount > 0`): generate an aggregator ProwJob plus N child ProwJobs + - For multi-ref jobs: generate a ProwJob that tests multiple PRs together + - For regular jobs: generate a single ProwJob + - Query the prowjob-dispatcher for cluster assignment + - Create the ProwJob(s) in the cluster + - Wait `--job-trigger-wait-seconds` (default 20s) after creation to let the ProwJob controller pick it up + +4. **Build ProwJob specs**: Each ProwJob is built by: + - Creating a `periodic` Prow job spec with the resolved ci-operator config + - Setting extra refs for the PR(s) under test + - Applying Prow config defaults + - Adding labels linking back to the PRPQR + +5. **Update PRPQR status**: Write job statuses (ProwJob name, state, description) back to the PRPQR's `.status.jobs` field. Set the `AllJobsTriggered` condition when complete. + +### ProwJob status syncer +A separate sub-controller (`pjstatussyncer`) watches ProwJobs with the PRPQR label and syncs status changes back to the parent PRPQR. This keeps the PRPQR status up-to-date as jobs progress through Pending, Running, Success, Failure, etc. + +### Job timeout +Aggregator and multi-ref jobs have configurable timeouts (`--aggregator-job-timeout`, `--multi-ref-job-timeout`). Jobs that exceed their timeout are marked as timed out in the PRPQR status. 
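The "Trigger jobs" step of the reconciliation loop can be sketched as a planning function: skip jobs that already have a ProwJob, and expand aggregated jobs into one aggregator plus N children. Everything here is illustrative. `jobSpec`, `planProwJobs`, and the generated names are hypothetical, and existing jobs are keyed by plain name where the controller matches by name hash.

```go
package main

import "fmt"

// jobSpec is a hypothetical, reduced view of one entry in prpqr.Spec.Jobs.Jobs.
type jobSpec struct {
	Name            string
	AggregatedCount int
}

// planProwJobs returns the names of the ProwJobs that still need to be created.
func planProwJobs(jobs []jobSpec, existing map[string]bool) []string {
	var planned []string
	for _, job := range jobs {
		if existing[job.Name] {
			continue // a ProwJob was already created for this job
		}
		if job.AggregatedCount > 0 {
			// One aggregator plus AggregatedCount child jobs.
			planned = append(planned, "aggregator-"+job.Name)
			for i := 0; i < job.AggregatedCount; i++ {
				planned = append(planned, fmt.Sprintf("%s-%d", job.Name, i))
			}
			continue
		}
		planned = append(planned, job.Name)
	}
	return planned
}

func main() {
	jobs := []jobSpec{
		{Name: "e2e-aws"},
		{Name: "e2e-gcp", AggregatedCount: 2},
		{Name: "e2e-azure"},
	}
	existing := map[string]bool{"e2e-azure": true}
	fmt.Println(planProwJobs(jobs, existing))
}
```

Because already-created jobs are skipped, running the plan again after a partial failure only yields the jobs that are still missing, which is what makes the reconcile loop safe to repeat.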
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | Dry-run mode: does not create real ProwJobs | +| `--namespace` | `ci` | Namespace to watch for PRPQR objects | +| `--job-trigger-wait-seconds` | `20` | Seconds to wait after creating a ProwJob for status to appear | +| `--aggregator-job-timeout` | `6` | Hours before an aggregator job is considered timed out | +| `--multi-ref-job-timeout` | `6` | Hours before a multi-ref job is considered timed out | +| `--dispatcher-address` | `http://prowjob-dispatcher.ci.svc.cluster.local:8080` | Address of the prowjob-dispatcher service | +| `--config-path` | (Prow) | Prow config path (for job defaults) | + +## Key files +- `cmd/job-trigger-controller-manager/main.go` -- entry point, controller-runtime manager setup +- `pkg/controller/prpqr_reconciler/prpqr_reconciler.go` -- reconciler logic: job generation, creation, status management +- `pkg/controller/prpqr_reconciler/pjstatussyncer/` -- ProwJob-to-PRPQR status sync sub-controller +- `pkg/api/pullrequestpayloadqualification/v1/` -- PRPQR CRD types + +## Related +- `cmd/payload-testing-prow-plugin` -- creates PRPQR objects this controller reconciles +- `cmd/multi-pr-prow-plugin` -- also creates PRPQR objects + +## Deployment +Long-lived controller-runtime Deployment on app.ci in the `ci` namespace. Requires in-cluster access with permissions to create/list/update ProwJobs and PRPQR resources. + +Uses the ci-operator config resolver service (`https://config.ci.openshift.org`) and the prowjob-dispatcher for cluster assignment. 
diff --git a/cmd/ldap-users-from-github-owners-files/README.md b/cmd/ldap-users-from-github-owners-files/README.md new file mode 100644 index 00000000000..64e15e64a92 --- /dev/null +++ b/cmd/ldap-users-from-github-owners-files/README.md @@ -0,0 +1,47 @@ +# ldap-users-from-github-owners-files + +## What +Maps GitHub usernames found in OWNERS files to LDAP (Kerberos) IDs by cross-referencing against LDAP directory export data. Used to determine which Red Hat employees are approvers for a given repository subtree. Has two operating modes: either output a JSON array of LDAP usernames (for RBAC provisioning), or output a YAML mapping file of GitHub login to Kerberos ID (for other tools). + +## How it works -- full flow + +### LDAP mapping construction +1. Read the file specified by `--ldap-file`, which is expected to be the output of an `ldapsearch` command: + ``` + ldapsearch -LLL -x -h ldap.corp.redhat.com -b ou=users,dc=redhat,dc=com '(rhatSocialURL=GitHub*)' rhatSocialURL uid + ``` +2. Split the file by double newlines (LDAP entry separator). +3. For each entry, extract: + - `uid: ` from lines starting with `uid: ` + - GitHub username from lines starting with `rhatSocialURL: GitHub->`, parsed from the URL path (handles trailing slashes) +4. Build a `map[string]string` from GitHub username to Kerberos ID. Duplicate GitHub usernames produce warnings. + +### Mapping-only mode (`--mapping-file`) +If `--mapping-file` is set, write the GitHub-to-Kerberos mapping as YAML to the specified path and exit. This mode skips OWNERS file processing entirely. + +### OWNERS file processing (default mode) +1. Read `OWNERS_ALIASES` from the repository root (`--repo-base-dir`) to resolve alias references. +2. Walk the subdirectory (`--repo-sub-dir`, default `.`) recursively, finding all `OWNERS` files. +3. 
For each OWNERS file (processed concurrently with goroutines): + - Unmarshal the YAML and extract the `approvers` list + - Resolve any aliases by looking up names in `OWNERS_ALIASES` + - Look up each resolved approver (case-insensitive) in the GitHub-to-Kerberos mapping + - Collect the Kerberos IDs (lowercased) +4. Deduplicate all collected Kerberos IDs. +5. Output a sorted JSON array of unique LDAP usernames to stdout. + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--ldap-file` | `""` | Path to ldapsearch output file (required) | +| `--repo-base-dir` | `""` | Base directory of the target repository (used to find `OWNERS_ALIASES`) | +| `--repo-sub-dir` | `.` | Subdirectory relative to `--repo-base-dir` to walk for OWNERS files | +| `--mapping-file` | `""` | If set, output the GitHub-to-Kerberos mapping YAML to this path and exit | + +## Key files + +- `cmd/ldap-users-from-github-owners-files/main.go` -- all logic: LDAP file parsing (`createLDAPMapping`), OWNERS walking (`getAllSecretUsers`), mapping output + +## Deployment +CLI tool. Not deployed as a long-running service. Typically invoked as part of automation pipelines that need to resolve repository ownership to corporate identities. diff --git a/cmd/lensserver/README.md b/cmd/lensserver/README.md index b681073db6c..5d9ed3eb1fc 100644 --- a/cmd/lensserver/README.md +++ b/cmd/lensserver/README.md @@ -1,7 +1,63 @@ -# Lensserver +# lensserver -This binary provides additional [lenses][0] for Prows [spyglass log viewer][1]. +## What +Spyglass lens server that provides a "CI-Operator steps" visualization for Prow's log viewer (Deck/Spyglass). It renders an interactive step graph showing the execution timeline, dependencies, duration, and status of each ci-operator step in a job run. +Spyglass is Prow's artifact viewer. Lenses are plugins that render specific artifact types. 
This server hosts the `steps` lens which reads the step graph JSON artifact produced by ci-operator and renders it as an interactive HTML visualization. + +## How it works -- full flow + +1. Initialize Prow config agent from the provided config options +2. Create a `jobs.JobAgent` with a fake (no-op) ProwJob listing client -- real ProwJob data is not needed for artifact rendering +3. Create a storage artifact opener for GCS/S3 using the provided credentials +4. Register the `stepgraph.Lens` as a local lens with: + - Name: `steps` + - Title: `CI-Operator steps` + - Priority: `6` +5. Create a `common.LensServer` from Prow's spyglass library, configured with: + - Listen address: `127.0.0.1:1235` + - The storage artifact fetcher (reads artifacts from GCS/S3) + - A pod log artifact fetcher (unused in practice) + - The step graph lens +6. Start the HTTP server + +### Step graph lens rendering +The `stepgraph.Lens` (in `pkg/lenses/stepgraph/`) works as follows: +1. Expects exactly one artifact (the step graph JSON file produced by ci-operator, typically at `artifacts/ci-operator-step-graph.json`) +2. Reads and unmarshals the artifact into a `[]Step` slice, where each step contains: + - Step name, description, dependencies + - Start time, finish time, duration + - Success/failure status + - Kubernetes manifests applied by the step +3. Sorts steps by start time +4. Serializes any embedded Kubernetes manifests to YAML for display +5. Renders an HTML template (`static/template.html`) with the step data, producing an interactive graph visualization + +### Why a separate server +Prow's Spyglass architecture supports "local lenses" that run as sidecar HTTP servers alongside Deck. The lens server communicates with Deck over localhost. This allows custom lens implementations without modifying Deck itself. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| Prow config flags | | Standard `configflagutil.ConfigOptions` (`--config-path`, `--job-config-path`, etc.) 
| +| `--gcs-credentials-file` | `""` | GCS service account key for reading artifacts | +| `--s3-credentials-file` | `""` | S3 credentials for reading artifacts | + +## Key files +- `cmd/lensserver/main.go` -- entry point, server setup, fake PJ client +- `pkg/lenses/stepgraph/stepgraph.go` -- lens implementation (Header, Body, Callback, Config) +- `pkg/lenses/stepgraph/static/template.html` -- HTML/CSS/JS template for the step graph visualization +- `pkg/api/graph.go` -- `CIOperatorStepDetails` and `CIOperatorStepGraph` types + +## Deployment +Not currently deployed. Historically designed as a sidecar container to Prow Deck (listening on `127.0.0.1:1235`), but no lensserver sidecar exists in the current Deck Deployment — the Deck pod only contains `deck` and `git-sync` containers. + +For local development, use `hack/run-lens.sh`. + +## Related +- Deck references this lens server at `127.0.0.1:1235` in its Spyglass lens configuration +- ci-operator writes the step graph artifact to `artifacts/ci-operator-step-graph.json` during job execution +- The step graph shows step dependencies, execution order, timing, and pass/fail status with expandable manifest details Run it together with [Deck][3] via the hack script: ```sh diff --git a/cmd/multi-arch-builder-controller/README.md b/cmd/multi-arch-builder-controller/README.md index 91f0ec85e02..943892d61f6 100644 --- a/cmd/multi-arch-builder-controller/README.md +++ b/cmd/multi-arch-builder-controller/README.md @@ -1,8 +1,70 @@ # multi-arch-builder-controller -The controller reconciles the MultiArchBuildConfig and generates multiple builds one for each architecture that exists on the cluster. Once all builds succeed, the controller uses the manifest-tool binary to create a new image based on the output configuration that includes the manifest list with all images that have been built per architecture correspondingly. 
+## What +Kubernetes controller that reconciles `MultiArchBuildConfig` custom resources into per-architecture OpenShift `Build` objects, assembles a multi-architecture container image manifest, and optionally mirrors the result to external registries. This enables building and publishing multi-arch images (e.g., amd64 + arm64 + s390x) within the CI infrastructure. +## How it works -- full flow +### Startup +1. Discovers available architectures by listing cluster nodes and collecting their `kubernetes.io/arch` labels +2. Registers the controller with watches on `MultiArchBuildConfig` resources and owned `Build` resources + +### Reconciliation loop +When a `MultiArchBuildConfig` (MABC) is created or updated, the controller progresses through a state machine: + +#### Phase 1: Create builds +- For each discovered architecture, create an OpenShift `Build` object +- Each Build is a copy of the MABC's `BuildSpec` with `nodeSelector` set to `kubernetes.io/arch: ` +- The output image tag is suffixed with the architecture name (e.g., `myimage-amd64`, `myimage-arm64`) +- Builds are created with an owner reference back to the MABC + +#### Phase 2: Wait for builds to complete +- On each reconcile, list Builds owned by the MABC +- If not all builds are finished, return and wait for the next event (Build status change triggers re-reconcile via owner reference) +- If any build fails, set MABC state to `Failure` and stop + +#### Phase 3: Push manifest list +- Once all per-arch builds succeed, use `manifestpusher.PushImageWithManifest()` to create a multi-architecture manifest list +- The manifest list points to each architecture-specific image in the internal registry (`image-registry.openshift-image-registry.svc:5000`) +- The target image reference is derived from the MABC's output spec +- On failure, set MABC state to `Failure` + +#### Phase 4: Mirror to external registries +- If `spec.externalRegistries` is set, use `oc image mirror` to push the multi-arch manifest to each 
external registry +- If no external registries are configured, skip with a success condition +- On failure, set MABC state to `Failure` + +#### Completion +- Set MABC state to `Success` +- On subsequent reconciles, skip MABCs already in `Success` or `Failure` state + +### Status conditions +The MABC status tracks progress through conditions: +- `CreateBuildsDone` -- all per-architecture Builds created +- `BuildsCompleted` -- all Builds finished (success or failure) +- `PushManifestDone` -- manifest list pushed to internal registry +- `ImageMirrorDone` -- manifest mirrored to external registries (or skipped) + +### Concurrency +Max concurrent reconciles is 1 to avoid conflicts on shared resources. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | Dry-run mode: no actual resource creation | +| `--docker-cfg` | `/.docker/config.json` | Path to registry credentials for manifest push and image mirror | + +## Key files +- `cmd/multi-arch-builder-controller/main.go` -- entry point, node architecture discovery, manager setup +- `pkg/controller/multiarchbuildconfig/multiarchbuildconfig.go` -- reconciler: Build creation, manifest push orchestration, image mirroring, status management +- `pkg/controller/multiarchbuildconfig/mirror.go` -- `oc image mirror` wrapper for external registry mirroring +- `pkg/api/multiarchbuildconfig/v1/` -- MABC CRD types +- `pkg/manifestpusher/` -- multi-arch manifest list assembly and push + +## Deployment +Long-lived controller-runtime Deployment on a heterogeneous (multi-arch) build cluster. Requires in-cluster access with permissions to create Builds, read nodes (for architecture discovery), and access the internal image registry. + +Registry credentials (`--docker-cfg`) must include push access to both the internal registry and any configured external registries. 
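Startup discovery and Phase 1 can be sketched together: collect the distinct `kubernetes.io/arch` node labels, then derive one build per architecture with a suffixed output name and a matching node selector. The types and function names below are hypothetical stand-ins for the real `Build` objects.

```go
package main

import (
	"fmt"
	"sort"
)

// archLabel is the node label the controller reads at startup.
const archLabel = "kubernetes.io/arch"

// buildPlan is a hypothetical, reduced stand-in for an OpenShift Build.
type buildPlan struct {
	OutputName   string
	NodeSelector map[string]string
}

// discoverArchitectures returns the sorted set of architectures seen in node labels.
func discoverArchitectures(nodeLabels []map[string]string) []string {
	seen := map[string]bool{}
	var arches []string
	for _, labels := range nodeLabels {
		arch, ok := labels[archLabel]
		if !ok || seen[arch] {
			continue
		}
		seen[arch] = true
		arches = append(arches, arch)
	}
	sort.Strings(arches)
	return arches
}

// planBuilds derives one build per architecture, suffixing the output name
// and pinning the build to nodes of that architecture.
func planBuilds(outputName string, arches []string) []buildPlan {
	plans := make([]buildPlan, 0, len(arches))
	for _, arch := range arches {
		plans = append(plans, buildPlan{
			OutputName:   outputName + "-" + arch,
			NodeSelector: map[string]string{archLabel: arch},
		})
	}
	return plans
}

func main() {
	nodes := []map[string]string{
		{archLabel: "arm64"},
		{archLabel: "amd64"},
		{archLabel: "arm64"},
	}
	for _, p := range planBuilds("myimage", discoverArchitectures(nodes)) {
		fmt.Println(p.OutputName, p.NodeSelector)
	}
}
```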
```console $ ./multi-arch-builder-controller --help Usage of ./multi-arch-builder-controller: @@ -14,4 +76,4 @@ Usage of ./multi-arch-builder-controller: ## Requirements - `manifest-tool` binary included in the container image -- target registry credentials mounted on /.docker/config.json \ No newline at end of file +- target registry credentials mounted on /.docker/config.json diff --git a/cmd/multi-pr-prow-plugin/README.md b/cmd/multi-pr-prow-plugin/README.md index 4cf192459dc..f72a4ed712c 100644 --- a/cmd/multi-pr-prow-plugin/README.md +++ b/cmd/multi-pr-prow-plugin/README.md @@ -1,3 +1,97 @@ The `multi-pr-prow-plugin` is an external prow plugin that facilitates running presubmit tests from sources built using multiple pull requests. The included pull requests can be from the same, or a different, repo. It creates and manages GitHub `check_runs` to share the state and logs of the jobs with the user. + +# multi-pr-prow-plugin + +## What +Prow webhook plugin that enables testing changes from multiple pull requests together in a single CI job. When a user comments `/testwith /// /# ...`, the plugin constructs a ProwJob (as a periodic) that checks out code from all specified PRs and runs the requested test against the combined sources. This is essential for validating cross-repository changes that must land together. + +Results are reported back to the origin PR as GitHub Check Runs. + +## User-facing commands +All triggered via PR comments: + +| Command | What it does | +|---|---| +| `/testwith /// ` | Run the named test with code from multiple PRs | +| `/testwith //// ` | Same, but for a variant-qualified test | +| `/testwith abort` | Abort all active multi-PR jobs where this PR is the origin | + +PR references use the format `org/repo#number`. Full GitHub URLs (`https://github.com/org/repo/pull/123`) are also accepted and auto-converted. Up to 20 PRs can be included per command. Multiple `/testwith` lines in a single comment are supported.
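The PR reference rules above (`org/repo#number`, full GitHub URLs converted to that form, at most 20 PRs) can be sketched as a small parser. This is an assumption-laden illustration: `parsePRRef`, `parsePRRefs`, and `prRef` are hypothetical names, and the plugin's actual parsing in `server.go` may differ in details such as how the limit is counted.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxPRs is the per-command limit stated above.
const maxPRs = 20

// prRef is a hypothetical parsed form of org/repo#number.
type prRef struct {
	Org, Repo string
	Number    int
}

// parsePRRef accepts org/repo#number or https://github.com/org/repo/pull/N.
func parsePRRef(s string) (prRef, error) {
	if rest := strings.TrimPrefix(s, "https://github.com/"); rest != s {
		parts := strings.Split(strings.TrimSuffix(rest, "/"), "/")
		if len(parts) != 4 || parts[2] != "pull" {
			return prRef{}, fmt.Errorf("unrecognized pull request URL %q", s)
		}
		// Convert the URL to the short form and fall through.
		s = parts[0] + "/" + parts[1] + "#" + parts[3]
	}
	orgRepo, num, ok := strings.Cut(s, "#")
	if !ok {
		return prRef{}, fmt.Errorf("%q is not of the form org/repo#number", s)
	}
	org, repo, ok := strings.Cut(orgRepo, "/")
	if !ok || org == "" || repo == "" || strings.Contains(repo, "/") {
		return prRef{}, fmt.Errorf("%q is not of the form org/repo#number", s)
	}
	n, err := strconv.Atoi(num)
	if err != nil || n <= 0 {
		return prRef{}, fmt.Errorf("%q has an invalid PR number", s)
	}
	return prRef{Org: org, Repo: repo, Number: n}, nil
}

// parsePRRefs parses all references of one command and enforces the limit.
func parsePRRefs(args []string) ([]prRef, error) {
	if len(args) > maxPRs {
		return nil, fmt.Errorf("%d PRs given, at most %d are allowed", len(args), maxPRs)
	}
	refs := make([]prRef, 0, len(args))
	for _, arg := range args {
		ref, err := parsePRRef(arg)
		if err != nil {
			return nil, err
		}
		refs = append(refs, ref)
	}
	return refs, nil
}

func main() {
	refs, err := parsePRRefs([]string{
		"openshift/kubernetes#1234",
		"https://github.com/openshift/installer/pull/999",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range refs {
		fmt.Printf("%s/%s#%d\n", r.Org, r.Repo, r.Number)
	}
}
```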
+ +### Examples +``` +/testwith openshift/kubernetes/master/e2e openshift/kubernetes#1234 openshift/installer#999 +/testwith openshift/origin/master/gcp/e2e-gcp openshift/origin#5678 +/testwith abort +``` + +## How it works -- full flow + +### On `/testwith` command +1. **Trust check**: Verify the commenter is a trusted user for the repo via Prow's `TrustedUser` mechanism. Untrusted users are rejected with an error comment. + +2. **Parse command**: Extract the job specification from the comment. The job spec format is `///` or `////`. Also extract the list of additional PRs. + +3. **Resolve PRs**: For each referenced PR, call `GetPullRequest` to get its current state (base ref, head SHA, user login, title). For PRs targeting the same org/repo as the test, normalize the base branch to the job metadata's branch to handle renamed default branches (e.g., `master` to `main`). + +4. **Resolve CI config**: Call the ci-operator config resolver service to get the `ReleaseBuildConfiguration` for the test's org/repo/branch/variant. Find the matching test definition by `As` name. + +5. **Build ProwJob**: Generate a periodic-type ProwJob using `prowgen`: + - Inject `--test-from` via `InjectTestFrom` to run the specific test from the resolved config + - Set a custom hash input matching the job name for deterministic naming + - Query the prowjob-dispatcher for cluster assignment (results cached for 15 minutes) + - Enforce a minimum timeout of 8 hours (`defaultMultiRefJobTimeout`) + - Build `Refs` and `ExtraRefs` from all PRs, grouped by base org/repo/branch. The test's own org/repo becomes the primary `Refs`; all others become `ExtraRefs`. Path aliases are resolved via the config resolver + - Fetch base SHAs via `GetRef("heads/")` for each ref + - Apply Prow defaults via `DefaultPeriodic` + +6. **Submit ProwJob**: Create the ProwJob in Kubernetes in the configured namespace with a `ci.openshift.io/testwith` label encoding the origin PR. + +7. 
**Report via Check Runs**: Asynchronously wait for the ProwJob to get a status URL (polling every 5s for up to 60s), then create a GitHub Check Run on the origin PR with `status: in_progress`. The Check Run body links to the job logs and lists all included PRs. + +### Job naming +Jobs are named: `multi-pr----[-]` + +### On `/testwith abort` +1. List ProwJobs matching the origin PR's org, repo, number, and the `ci.openshift.io/testwith` label. +2. Set `status.state = Aborted` on all non-complete jobs via `Update` (not Patch, to avoid overwriting concurrent state changes by the ProwJob controller). + +### Reporter sync loop +A background goroutine runs every `--job-sync-seconds` (default 60s). For each tracked job: +- Fetch the ProwJob's current state from Kubernetes +- If completed (success, failure, error, aborted), update the Check Run with the mapped conclusion (`success`/`failure`/`failure`/`cancelled`) and remove the job from tracking +- If the ProwJob is not found, remove from tracking +- If the job was created over 25 hours ago, remove from tracking (stale entry) + +Job tracking state is persisted to a JSON file (`--job-config`) for crash recovery. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log verbosity | +| `--hmac-secret-file` | `/etc/webhook/hmac` | GitHub webhook HMAC secret | +| `--namespace` | `ci` | Namespace for ProwJob creation | +| `--ci-op-config-dir` | (required) | Path to CI operator configuration directory | +| `--dispatcher-address` | `http://prowjob-dispatcher.ci.svc.cluster.local:8080` | Address of the prowjob-dispatcher for cluster assignment | +| `--job-config` | (required) | Path to JSON file tracking active Check Runs | +| `--job-sync-seconds` | `60` | Interval in seconds between Check Run sync cycles | + +Standard Prow config flags (`--config-path`), GitHub flags, Kubernetes flags, and `githubeventserver.Options` are also supported. 
+ +## Key files +- `cmd/multi-pr-prow-plugin/main.go` -- entry point, flag parsing, kube client setup, reporter and server initialization +- `cmd/multi-pr-prow-plugin/server.go` -- webhook handler, command parsing, ProwJob generation (config resolution, ref building, cluster dispatch), abort logic +- `cmd/multi-pr-prow-plugin/report.go` -- Check Run reporter: creation on job start, periodic sync loop, state-to-conclusion mapping, JSON config persistence + +## Deployment +Long-lived webhook Deployment on app.ci. Requires: +- Kubeconfig with access to the app.ci cluster (for ProwJob CRUD) +- GitHub App or token with Check Run permissions +- Network access to the prowjob-dispatcher and ci-operator config resolver services + +Listens for GitHub `issue_comment` events. + +## Related +- `cmd/job-trigger-controller-manager` -- reconciles PRPQR objects into ProwJobs (used by payload-testing, not this plugin) diff --git a/cmd/ocp-build-data-enforcer/README.md b/cmd/ocp-build-data-enforcer/README.md index 75928515fd8..d7cf8310961 100644 --- a/cmd/ocp-build-data-enforcer/README.md +++ b/cmd/ocp-build-data-enforcer/README.md @@ -1,7 +1,58 @@ -# OCP Build data enforcer +# ocp-build-data-enforcer -This tool: +## What +Validates that Dockerfiles in OCP component repositories have `FROM` directives that match the base image configuration in the [ocp-build-data](https://github.com/openshift/ocp-build-data) repository. When mismatches are found, it can either print unified diffs or create pull requests to fix them. +This ensures the Dockerfiles used for CI match the ART (Automated Release Tooling) build configuration used for producing official release artifacts. + +## How it works -- full flow + +1. **Load ocp-build-data configs**: Read all image configs from the ocp-build-data directory for the specified major.minor version. Each config defines the expected `FROM` images (stages) for a component. + +2. 
**Process each config concurrently** (via `errgroup`): + - Skip repos in the `openshift-priv` org + - Fetch the Dockerfile from GitHub using `github.FileGetterFactory` (from the `release-4.6` branch by default) + - Parse the Dockerfile using `imagebuilder.ParseDockerfile` and extract stages via `imagebuilder.NewStages` + - Compare the number of stages against what ocp-build-data expects; error if they differ + - For each stage, compare the `FROM` image reference against the expected value from `config.Stages()` + - Build a list of replacements needed + +3. **Apply replacements**: For each `FROM` mismatch, perform a smart line-level replacement in the Dockerfile content (preserving comments, whitespace, and non-FROM lines). The replacement searches forward from the `FROM` directive's line until it finds the actual text to replace. + +4. **Output**: + - If `--create-prs` is false (default) or `--pr-creation-ceiling` is 0: print unified diffs for each file + - If `--create-prs` is true: for each diff (up to `--pr-creation-ceiling`): + - Clone the repo via `git.ClientFactory` + - Checkout the target branch (from `content.source.git.branch.target` if set, otherwise the repo default branch) + - Write the updated Dockerfile + - Use `PRCreationOptions.UpsertPR()` to create or update a PR with a descriptive body explaining the change and linking to the ocp-build-data config + +### PR body template +The auto-generated PR body explains: +- This is autogenerated by ocp-build-data-enforcer +- It updates base images to match ocp-build-data +- The user can create an alternate PR with the same changes instead of merging +- Contact #aos-art for questions + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--ocp-build-data-repo-dir` | `../ocp-build-data` | Path to the ocp-build-data repository checkout | +| `--major` | `4` | Major OCP version to target | +| `--minor` | `6` | Minor OCP version to target | +| `--create-prs` | `false` | Whether to create GitHub PRs 
 for mismatches |
+| `--pr-creation-ceiling` | `5` | Maximum number of PRs to create (0 = print diffs only) |
+| PR creation flags | -- | `--self-approve`, `--github-token-path`, etc. via `PRCreationOptions` |
+
+## Key files
+
+- `cmd/ocp-build-data-enforcer/main.go` -- all logic: option parsing, Dockerfile comparison (`processDockerfile`, `updateDockerfile`), diff/PR processing (`diffProcessor`)
+- `pkg/api/ocpbuilddata/` -- `LoadImageConfigs`, `OCPImageConfig`, `MajorMinor` types
+- `pkg/github/prcreation/prcreation.go` -- `PRCreationOptions.UpsertPR()` for PR creation
+
+## Deployment
+CLI tool. Can be run manually or as a periodic Prow job. Requires a GitHub token for fetching Dockerfiles and (if creating PRs) for pushing changes and creating pull requests.
 * Iterates over the content in https://github.com/openshift/ocp-build-data/tree/openshift-4.21/images for all openshift versions
 * Downloads the Dockerfile specified in `content.source.Dockerfile` (Default: `Dockerfile`)
 * Checks if its `FROM` directive matches the build-cluster equivalent of the de-referenced `from.stream`
diff --git a/cmd/payload-testing-prow-plugin/README.md b/cmd/payload-testing-prow-plugin/README.md
new file mode 100644
index 00000000000..c29d470f049
--- /dev/null
+++ b/cmd/payload-testing-prow-plugin/README.md
@@ -0,0 +1,103 @@
+# payload-testing-prow-plugin
+
+## What
+Prow webhook plugin that triggers release payload qualification tests against PR code. Users comment `/payload`, `/payload-job`, `/payload-aggregate`, or their `-with-prs` variants on a PR, and the plugin creates `PullRequestPayloadQualificationRun` (PRPQR) custom resources. These are reconciled by the `job-trigger-controller-manager` into actual ProwJobs that build a payload with the PR's changes and run the specified release qualification jobs against it.
+
+This is how OpenShift developers validate that their changes do not break release-blocking or informing jobs before merging.
+ +## User-facing commands +All triggered via PR comments: + +| Command | What it does | +|---|---| +| `/payload ` | Run qualification jobs for a release (e.g., `/payload 4.10 nightly informing`) | +| `/payload-with-prs ` | Same, but include additional PRs in the payload | +| `/payload-job [...]` | Run specific named jobs (space-separated) | +| `/payload-job-with-prs ` | Run a specific job with additional PRs | +| `/payload-aggregate ` | Run a job N times with aggregation | +| `/payload-aggregate-with-prs ` | Aggregated run with additional PRs | +| `/payload-abort` | Abort all active payload jobs for this PR | + +### Parameters +- ``: OCP version like `4.10`, `4.14` +- ``: `nightly`, `ci`, or `konflux-nightly` +- ``: `blocking`, `informing`, `periodics`, or `all` +- ``: Additional PRs in `org/repo#number` format +- ``: Number of aggregated runs (integer) + +### Constraints +- Only works on repos that promote official OpenShift images (checked via `PromotesOfficialImages`) +- The `-with-prs` commands can only be used once per comment (no multi-line batching) +- User must be trusted (org member or trusted GitHub App) +- Sharded jobs are automatically expanded to all shards when no specific shard suffix is provided + +## How it works -- full flow + +### On `/payload` command +1. **Trust check**: Verify the commenter is trusted for the repo. GitHub Apps listed in `--trusted-app` are also accepted. + +2. **Validate repo**: Fetch the PR, resolve the ci-operator config for the repo's base branch, and verify it promotes official images. If not, respond that the repo does not contribute to OpenShift images. + +3. **Resolve jobs**: For `/payload` commands, call the release controller's API to get the list of jobs matching the OCP version, release type (nightly/ci), and job category (blocking/informing/all). Filter out any jobs matching `SKIP_JOB_REGEX_*` environment variables (with `SKIP_JOB_EXPIRE_*` expiration dates). + +4. 
**Resolve tests**: For `/payload-job` commands (no OCP/release specified), resolve each job name to its ci-operator test metadata by scanning all configs under `openshift` and `openshift-eng` orgs via the config agent. + +5. **Handle sharding**: If a job has `ShardCount > 1` in its config and neither the user specified a shard suffix nor an aggregation count, automatically expand to all shards (e.g., `job-1of3`, `job-2of3`, `job-3of3`). + +6. **Resolve additional PRs**: For `-with-prs` variants, fetch each additional PR's details (base ref, base SHA, head SHA, author, title). + +7. **Build PRPQR**: Create a `PullRequestPayloadQualificationRun` custom resource with: + - Labels: `dptp-requester: payload-testing`, org/repo/pull/baseRef/event-GUID labels + - Name: `-` (counter increments for multiple specs in one comment) + - Spec contains: list of `ReleaseJobSpec` entries, release controller config (OCP/release/specifier), and `PullRequestUnderTest` entries for the origin PR and any additional PRs + +8. **Create in Kubernetes**: Submit the PRPQR to the cluster. The `job-trigger-controller-manager` watches for these and creates actual ProwJobs. + +9. **Post comment**: Post a comment listing the triggered jobs with a link to the payload-testing UI: `https://pr-payload-tests.ci.openshift.org/runs//`. If additional PRs were included, also comment on those PRs with a cross-reference. + +### On `/payload-abort` +1. List all PRPQRs for the PR using label selectors (org, repo, pull number). +2. For each PRPQR, find ProwJobs in triggered or pending state. +3. Abort each active ProwJob by setting `status.state = Aborted`. +4. For aggregator jobs (identified by the `aggregator-` prefix and `AggregationIDLabel`), also abort all aggregated job runs sharing the same aggregation ID. + +### On PR close/merge +Automatically prune (delete) all PRPQRs associated with the closed PR to prevent accumulation of stale resources. 
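Reviewer sketch: the shard expansion from step 5 of the `/payload` flow above can be illustrated as follows. The suffix pattern and the function signature are assumptions for this example; only the `job-1of3` naming and the two skip conditions come from the README text.

```go
package main

import (
	"fmt"
	"regexp"
)

// shardSuffix detects a user-specified shard such as "-2of3".
// The exact pattern is an assumption based on the README's examples.
var shardSuffix = regexp.MustCompile(`-\d+of\d+$`)

// expandShards returns every shard name for a sharded job, or the job name
// unchanged when the job is not sharded, the user already picked a shard,
// or an aggregation count was requested.
func expandShards(job string, shardCount, aggregateCount int) []string {
	if shardCount <= 1 || aggregateCount > 0 || shardSuffix.MatchString(job) {
		return []string{job}
	}
	names := make([]string, 0, shardCount)
	for i := 1; i <= shardCount; i++ {
		names = append(names, fmt.Sprintf("%s-%dof%d", job, i, shardCount))
	}
	return names
}

func main() {
	fmt.Println(expandShards("job", 3, 0))      // expanded to all shards
	fmt.Println(expandShards("job-2of3", 3, 0)) // explicit shard is kept as-is
	fmt.Println(expandShards("job", 3, 10))     // aggregated runs are not expanded
}
```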
+ +### Job skip mechanism +Environment variables control temporary job skips: +- `SKIP_JOB_REGEX_`: regex pattern matching job names to skip +- `SKIP_JOB_EXPIRE_`: RFC3339 expiration date for the skip + +Expired skips produce a warning log but do not block jobs. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log verbosity | +| `--hmac-secret-file` | `/etc/webhook/hmac` | GitHub webhook HMAC secret | +| `--namespace` | `ci` | Namespace for PRPQR creation | +| `--ci-op-config-dir` | (required) | Path to CI operator configuration directory | +| `--release-repo-git-sync-path` | `/var/repo/release` | Path to git-synced openshift/release repo for config agent | +| `--trusted-app` | (none) | GitHub App slug allowed to issue /payload (repeatable) | + +Standard Prow GitHub flags, Kubernetes flags, and `githubeventserver.Options` are also supported. + +## Key files +- `cmd/payload-testing-prow-plugin/main.go` -- entry point, flag parsing, kube client setup, config agent initialization with universal symlink watcher +- `cmd/payload-testing-prow-plugin/server.go` -- webhook handlers, command parsing (7 regex patterns), PRPQR building, abort logic, PR close pruning, error formatting +- `cmd/payload-testing-prow-plugin/rcjobresolver.go` -- release controller job resolver: fetches job lists from release controller API, applies skip filters +- `cmd/payload-testing-prow-plugin/filetestresolver.go` -- resolves job names to ci-operator test metadata by scanning config files + +## Deployment +Long-lived webhook Deployment on app.ci. Requires: +- Kubeconfig with access to the app.ci cluster (for PRPQR CRUD) +- Network access to the release controller API +- Git-synced copy of openshift/release at `--release-repo-git-sync-path` + +Listens for GitHub `issue_comment` and `pull_request` (closed) events. 
+ +## Related +- `cmd/payload-testing-ui` -- read-only web UI displaying PRPQR results +- `cmd/job-trigger-controller-manager` -- reconciles PRPQR objects into actual ProwJobs +- [Docs](https://docs.ci.openshift.org/release-oversight/payload-testing/) diff --git a/cmd/payload-testing-ui/README.md b/cmd/payload-testing-ui/README.md index 2cfe96ce8a2..a3b146a2636 100644 --- a/cmd/payload-testing-ui/README.md +++ b/cmd/payload-testing-ui/README.md @@ -1,52 +1,56 @@ -payload-testing-ui -================== - -`payload-testing-ui` is the visualization web server for the -`PullRequestPayloadQualificationRun` `CustomResource`. Its purpose is to be -linked from pull requests in Github from which qualification runs are created to -display information about them. - -Testing -------- - -The server only requires a read-only `kubeconfig` targeting a cluster where the -`CustomResource` objects are configured. Only `list`, `get`, and `watch` -permissions are required (the UI is entirely read-only). The production DPTP -deployment lives in [`app.ci`][deployment] and uses a service account with only -those permissions. - -If no changes to the CRD are necessary, the easiest local setup is to create a -`kubeconfig` for that same service account targeting the same cluster: - -```console -$ cat > kubeconfig.yaml </`) +1. Fetch the specific PRPQR by namespace/name from the Kubernetes API. +2. Render a detail page showing: + - **Sources**: for each PR under test -- repository link, PR link with title and author, head SHA, base ref and SHA + - **Release controller configuration**: OCP version, release type, specifier, with links to the release status page and the configuration JSON in openshift/release + - **Jobs**: each job's test name with color-coded status (green for success, red for failure, yellow for aborted). 
If a ProwJob URL exists, the job name links to the logs + - **Status conditions**: timestamps, types, reasons, and messages from the PRPQR status + +### Status mapping +Job names in the detail view are derived from `ReleaseJobSpec.JobName(PeriodicPrefix)`. The status is matched by comparing job names, with aggregator jobs having their `aggregator-` prefix stripped for matching. + +Color classes: +- `text-success` -- SuccessState +- `text-danger` -- FailureState +- `text-warning` -- AbortedState + +### Static assets +Static files (CSS, JS) are served from an embedded filesystem (`pkg/html/StaticFS`) at the `/static/` URL path. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log verbosity | +| `--port` | `8080` | HTTP server port | +| `--namespace` | (required) | Kubernetes namespace where PRPQR objects live | + +Standard Prow instrumentation flags (`--health-port`) are also supported. + +## Key files +- `cmd/payload-testing-ui/main.go` -- entry point, flag parsing, kubeconfig loading, HTTP route setup +- `cmd/payload-testing-ui/server.go` -- server implementation: list and detail handlers, HTML templates with Go template functions for PR/repo/author/SHA links + +## Deployment +Long-lived Deployment on app.ci. Runs in-cluster with a kubeconfig that has read access to PRPQR objects. Serves HTTP on port 8000, with health probes on port 8081. 
+ +Health check endpoint: `/readyz` + +## Related +- `cmd/payload-testing-prow-plugin` -- creates PRPQR objects and links to this UI in PR comments +- `cmd/job-trigger-controller-manager` -- reconciles PRPQRs into ProwJobs (populates the status that this UI displays) diff --git a/cmd/pipeline-controller/README.md b/cmd/pipeline-controller/README.md new file mode 100644 index 00000000000..06d1e5a4f1d --- /dev/null +++ b/cmd/pipeline-controller/README.md @@ -0,0 +1,139 @@ +# pipeline-controller + +## What +Combined Prow webhook plugin and Kubernetes controller that manages two-stage CI test pipelines. Repositories opt in via configuration files, and the controller automatically triggers second-stage tests (non-always-run presubmits) after all first-stage tests (always-run presubmits) pass. This enables a "fast feedback first, then full validation" pattern where expensive tests only run after quick sanity checks succeed. + +Supports three trigger modes: +- **Automatic**: second stage triggers automatically when first stage passes +- **LGTM**: second stage triggers when the `lgtm` label is added (or when `pipeline-auto` label is set) +- **Manual**: users trigger second stage via `/pipeline required` + +## User-facing commands + +| Command | What it does | +|---|---| +| `/pipeline required` | Immediately trigger all second-stage tests for this PR | +| `/pipeline auto` | Add the `pipeline-auto` label so second-stage tests trigger automatically (LGTM-mode repos only) | + +On PR open in configured repos (automatic or LGTM mode), a notification comment is posted explaining the pipeline behavior. 
+ +## How it works -- full flow + +### Configuration +Two YAML config files define which repos are enrolled: + +**Main config** (`--config-file`): repos with automatic or manual trigger modes +```yaml +orgs: + - org: openshift + repos: + - name: installer + mode: + trigger: auto + branches: + - main + - release-4.17 +``` + +**LGTM config** (`--lgtm-config-file`): repos where second stage triggers on LGTM label +```yaml +orgs: + - org: openshift + repos: + - name: cluster-etcd-operator +``` + +Both files are hot-reloaded via polling every 3 minutes. The `branches` field is optional; when empty, all branches are enabled. The default trigger mode is `auto` if not specified. + +### Job categorization +The `ConfigDataProvider` runs every 10 minutes and categorizes each repo's static presubmits into five groups: + +| Category | Criteria | Role | +|---|---|---| +| `alwaysRequired` | `always_run: true`, `optional: false` | First stage -- must all pass | +| `conditionallyRequired` | `run_if_changed` or `skip_if_only_changed` set, `optional: false` | First stage -- must pass if triggered | +| `protected` | `always_run: false`, no run conditions, `optional: false`, no pipeline annotations | Second stage -- always triggered | +| `pipelineConditionallyRequired` | Has `pipeline_run_if_changed` annotation | Second stage -- triggered if files match pattern | +| `pipelineSkipOnlyRequired` | Has `pipeline_skip_if_only_changed` annotation | Second stage -- triggered unless all files match skip pattern | + +### Webhook handlers + +#### PR opened (`handlePullRequestCreation`) +For configured repos in automatic or LGTM mode: +1. Check if the PR's base branch is enabled +2. If pipeline-controlled jobs exist, post an informational comment explaining the pipeline behavior and the mode (automatic/LGTM/manual) +3. 
For manual mode repos with non-always-run jobs, post a different comment suggesting `/test ?` and `/pipeline required` + +#### PR labeled with LGTM (`handleLabelAddition`) +For repos in LGTM config when `lgtm` label is added: +1. Skip if `pipeline-auto` label is already present (reconciler handles automatic triggering) +2. Check branch enablement +3. Post a comment with `/test` commands for second-stage tests + +#### PR opened/pushed/reopened (`handlePipelineContextCreation`) +For configured repos: +1. Get changed files for the PR +2. Evaluate `pipeline_run_if_changed` patterns against changed files; create pending GitHub status contexts for matching tests +3. Evaluate `pipeline_skip_if_only_changed` patterns; create pending contexts for tests where NOT all files match the skip pattern +4. Create pending contexts for all `protected` tests +5. All pending contexts use the message "Waiting for pipeline condition to trigger this job" + +#### `/pipeline required` (`handleIssueComment`) +1. Verify the repo is in either config +2. Fetch PR details and check branch enablement +3. For LGTM repos, `/pipeline auto` adds the `pipeline-auto` label and caches it +4. For `/pipeline required`, generate and post `/test` commands for all second-stage tests (forced, ignoring manual-control detection) + +### Reconciler (controller-runtime) +Watches all ProwJobs in the configured namespace. On each reconciliation: + +1. **Filter**: Only process presubmit ProwJobs with `Refs` for repos in the config +2. **Check trigger mode**: For main config repos, require `trigger: auto`. For LGTM config repos, require the `pipeline-auto` label (checked via cache, then GitHub API) +3. 
**Verify first stage complete**: + - List all ProwJobs for the same PR/SHA + - Build a map of the latest ProwJob per job name for the current SHA + - `alwaysRequired` tests must all exist and be in SuccessState + - `conditionallyRequired` tests must be in SuccessState if they exist (not required to exist) + - `protected` tests must NOT already exist (prevents re-triggering) +4. **Check PR state**: Query GitHub (with 5-minute cache) to verify the PR is open and not a draft +5. **Deduplicate**: Use a sync.Map keyed by `org/repo/number/baseRef/SHA` to ensure second-stage tests are triggered only once per push +6. **Send comment**: Post a comment with `/test` rerun commands for protected jobs and conditionally-matching pipeline jobs + +The reconciler uses `MaxConcurrentReconciles: 1` to avoid races. + +### Pipeline-auto cache +A `sync.Map`-based cache tracks PRs with the `pipeline-auto` label. Entries have a 48-hour TTL and are cleaned hourly. This avoids repeated GitHub API calls during reconciliation. + +### Closed PR cache +A separate cache tracks PR open/closed state with a 5-minute refresh interval and 24-hour retention. Draft PRs are treated as closed to prevent unintended test triggering. + +### Deduplication cleanup +A background goroutine runs every 24 hours to clean up old entries from the `ids` sync.Map that prevents duplicate second-stage triggers. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `false` | Run in dry-run mode | +| `--config-file` | (required) | YAML config with repos in automatic/manual mode | +| `--lgtm-config-file` | (required) | YAML config with repos in LGTM-triggered mode | +| `--hmac-secret-file` | `/etc/webhook/hmac` | GitHub webhook HMAC secret | + +Standard Prow config flags (`--config-path`), GitHub flags, Kubernetes flags, and `githubeventserver.Options` are also supported. 
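Reviewer sketch: the first-stage verification in step 3 of the reconciler boils down to three rules over the latest ProwJob per job name. A minimal version, assuming a plain map of job name to state instead of real ProwJob objects, might look like this (all names here are hypothetical):

```go
package main

import "fmt"

// state stands in for the ProwJob state; only success matters for this check.
type state string

const success state = "success"

// firstStageComplete applies the three documented rules. latest maps a job
// name to the state of its most recent ProwJob for the current SHA.
func firstStageComplete(latest map[string]state, alwaysRequired, conditionallyRequired, protected []string) bool {
	// alwaysRequired: must exist and must have succeeded.
	for _, name := range alwaysRequired {
		if s, ok := latest[name]; !ok || s != success {
			return false
		}
	}
	// conditionallyRequired: may be absent, but must have succeeded if present.
	for _, name := range conditionallyRequired {
		if s, ok := latest[name]; ok && s != success {
			return false
		}
	}
	// protected: must not exist yet, otherwise the second stage already ran.
	for _, name := range protected {
		if _, ok := latest[name]; ok {
			return false
		}
	}
	return true
}

func main() {
	latest := map[string]state{"unit": success, "lint": success}
	fmt.Println(firstStageComplete(latest, []string{"unit", "lint"}, []string{"docs"}, []string{"e2e"}))
}
```

The third rule is what makes the check idempotent at the job level; the `sync.Map` deduplication described in step 5 then covers the window before the second-stage ProwJobs have been created.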
+ +## Key files +- `cmd/pipeline-controller/main.go` -- entry point, flag parsing, handler registration, manager setup, watcher and config data provider initialization +- `cmd/pipeline-controller/reconciler.go` -- controller-runtime reconciler: first-stage completion check, deduplication, PR state validation, comment posting +- `cmd/pipeline-controller/helpers.go` -- `sendComment` and `acquireConditionalContexts`: evaluate `pipeline_run_if_changed`/`pipeline_skip_if_only_changed` patterns, detect manual intervention, generate test commands +- `cmd/pipeline-controller/config_watcher.go` -- YAML config file watcher with 3-minute polling, supports both string and object repo formats +- `cmd/pipeline-controller/config_data_provider.go` -- periodic presubmit categorization (every 10 minutes) into the five job groups +- `cmd/pipeline-controller/pipeline_auto_cache.go` -- 48-hour TTL cache for `pipeline-auto` label status + +## Deployment +Long-lived Deployment on app.ci combining a webhook server and a controller-runtime manager. The webhook server handles GitHub events; the manager watches ProwJobs for reconciliation. + +Requires: +- Kubeconfig with access to ProwJobs in the ProwJob namespace +- GitHub token or App for creating comments, statuses, and adding labels +- Both config files mounted (typically via ConfigMaps with git-sync) + +Listens for GitHub `pull_request` (opened, synchronize, reopened, labeled) and `issue_comment` events. diff --git a/cmd/pj-rehearse/README.md b/cmd/pj-rehearse/README.md new file mode 100644 index 00000000000..51251ad9be8 --- /dev/null +++ b/cmd/pj-rehearse/README.md @@ -0,0 +1,129 @@ +# pj-rehearse + +## What +Prow webhook plugin that rehearses proposed Prow job config changes before they merge into openshift/release. 
When a PR modifies ci-operator configs, Prow job definitions, or step registry content, pj-rehearse detects which jobs are affected, creates temporary ProwJob copies with the proposed changes, and runs them to verify nothing breaks. + +This is a safety net: it catches configuration mistakes before they land in production and start failing real CI jobs. + +## User-facing commands +All triggered via PR comments on openshift/release: + +| Command | What it does | +|---|---| +| `/pj-rehearse` | Rehearse up to 10 affected jobs | +| `/pj-rehearse more` | Rehearse up to 20 affected jobs | +| `/pj-rehearse max` | Rehearse up to 35 affected jobs | +| `/pj-rehearse {test-name} {other}` | Rehearse specific named jobs only | +| `/pj-rehearse auto-ack` | Rehearse and auto-add `rehearsals-ack` label if all pass | +| `/pj-rehearse list` | Re-post the list of affected jobs | +| `/pj-rehearse skip` | Opt out of rehearsals (adds `rehearsals-ack` label) | +| `/pj-rehearse ack` | Acknowledge results (adds `rehearsals-ack` label) | +| `/pj-rehearse reject` | Remove `rehearsals-ack` label | +| `/pj-rehearse abort` | Abort all active rehearsal jobs for this PR | +| `/pj-rehearse network-access-allowed` | Allow rehearsing jobs with `restrict_network_access: false` (must be org member, cannot be PR author) | + +## How it works — full flow + +### On PR creation +1. Detect all affected jobs by comparing base branch vs PR branch configs +2. Post a GitHub comment with a markdown table listing affected jobs +3. If more jobs than `--max-limit` (35), upload the full list to GCS and link it +4. If zero jobs affected, auto-add `rehearsals-ack` label (nothing to rehearse) + +### On new push to PR +1. Abort all existing rehearsal ProwJobs for this PR +2. Remove `network-access-rehearsals-ok` label +3. Remove `rehearsals-ack` label (unless PR author is in `--sticky-label-authors`) +4. Recompute affected jobs and post updated list + +### On `/pj-rehearse` command +1. 
Block if `needs-ok-to-test` label is present (untrusted PR) +2. Post acknowledgement comment +3. Clone the repo, checkout the PR, rebase onto base branch (up to 4 retries for transient git failures) +4. Run `DetermineAffectedJobs()` to find what changed (see below) +5. Run `SetupJobs()`: inline ci-operator config into job specs, resolve registry references, upload resolved configs to GCS, create temporary ConfigMaps for changed templates/profiles, convert periodics to presubmits +6. If more affected jobs than the limit, intelligently select a subset balancing across source types (so each category of change is represented) +7. Validate the assembled Prow config +8. Create ProwJob resources in the cluster +9. If `auto-ack` mode: wait for all jobs to complete (up to 4h), add `rehearsals-ack` on success + +### How it decides which jobs to rehearse +A job is "affected" if any of these changed in the PR: +- **Prow presubmit/periodic config**: spec, agent, cluster, optional/always-run settings +- **CI-operator config**: new config files, changed tests or build settings +- **Step registry**: any step, chain, workflow, or observer that a job references (transitively) + +Jobs are **excluded** if: +- `Hidden: true` +- Missing the `pj-rehearse.openshift.io/can-be-rehearsed: "true"` label +- Presubmit has empty `Branches` list +- `restrict_network_access: false` without both `network-access-rehearsals-ok` AND `approved` labels + +### Smart subset selection (when over limit) +When there are more affected jobs than the limit, it doesn't just take the first N alphabetically. It: +1. Groups jobs by source type (ChangedPresubmit, ChangedPeriodic, ChangedCiopConfig, ChangedRegistryContent, etc.) +2. Allocates budget evenly: `limit / numSourceTypes` per type +3. Fills remaining slots from underutilized types +This ensures all categories of change get coverage, not just the alphabetically first ones. 
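Reviewer sketch: one way to read the subset selection above is an even first pass followed by a round-robin pass over the leftover slots. The code below is an interpretation of that description, not the actual `pkg/rehearse` implementation; the function name, the map-based input, and the sorted iteration order are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// selectSubset picks at most limit jobs, spreading the budget across source
// types. bySource maps a source type name to its affected job names.
func selectSubset(bySource map[string][]string, limit int) []string {
	if limit <= 0 || len(bySource) == 0 {
		return nil
	}
	// Sort the type names so the result is deterministic.
	types := make([]string, 0, len(bySource))
	for t := range bySource {
		types = append(types, t)
	}
	sort.Strings(types)

	// First pass: every type gets an equal share of the budget.
	perType := limit / len(types)
	taken := make(map[string]int, len(types))
	var selected []string
	for _, t := range types {
		n := perType
		if n > len(bySource[t]) {
			n = len(bySource[t])
		}
		selected = append(selected, bySource[t][:n]...)
		taken[t] = n
	}

	// Second pass: slots left unused by small types go, one at a time,
	// to types that still have unselected jobs.
	for len(selected) < limit {
		progressed := false
		for _, t := range types {
			if len(selected) == limit {
				break
			}
			if taken[t] < len(bySource[t]) {
				selected = append(selected, bySource[t][taken[t]])
				taken[t]++
				progressed = true
			}
		}
		if !progressed {
			break // every job is already selected
		}
	}
	return selected
}

func main() {
	jobs := map[string][]string{
		"ChangedPresubmit":  {"p1", "p2", "p3", "p4", "p5"},
		"ChangedPeriodic":   {"c1"},
		"ChangedCiopConfig": {"o1", "o2"},
	}
	fmt.Println(selectSubset(jobs, 6))
}
```

With a limit of 6 and three source types, each type starts with a budget of 2; the single periodic job leaves one slot unused, which the second pass hands to the presubmit group.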
+ +## Label workflow +``` +PR opened + ├─ no affected jobs → auto-add rehearsals-ack + └─ affected jobs found → post list, wait for user + ├─ /pj-rehearse → run rehearsals (label not added) + ├─ /pj-rehearse auto-ack → run + add label on all-pass + ├─ /pj-rehearse skip → add label (opt out) + ├─ /pj-rehearse ack → add label (manual ack) + └─ no action → label absent, blocks merge + +New push + ├─ abort active rehearsals + ├─ remove rehearsals-ack (unless sticky-label-authors) + ├─ remove network-access-rehearsals-ok + └─ recompute + repost affected list +``` + +The `rehearsals-ack` label is a Tide merge requirement. PRs cannot merge without it. + +## What rehearsal ProwJobs look like +- Name: `rehearse-{prNumber}-{originalJobName}` +- Labels: `ci.openshift.io/rehearse: {prNumber}`, source type tracking labels +- Context: `ci/rehearse/{repo}/{branch}/{shortName}` +- `Optional: true` — rehearsal failures don't block merge via status checks (only the label workflow does) +- No `reporter_config` — rehearsals never send Slack/email notifications +- All ConfigMap volume mounts are replaced with temporary copies containing the PR's proposed changes + +## Key files +- `cmd/pj-rehearse/main.go` — entry point, flag parsing +- `cmd/pj-rehearse/server.go` — webhook handlers, GitHub comment/label logic +- `pkg/rehearse/rehearse.go` — core orchestration: `DetermineAffectedJobs()`, `SetupJobs()`, `RehearseJobs()` +- `pkg/rehearse/jobs.go` — job setup, config inlining, periodic-to-presubmit conversion +- `pkg/rehearse/configmaps.go` — temporary ConfigMap management for changed profiles/templates +- `pkg/diffs/diffs.go` — diff detection between base and PR configs + +## Deployment +Long-lived webhook Deployment on app.ci (RBAC/ServiceAccount). Plugin also deployed on core-ci. + +Listens for GitHub webhook events: PR `opened`/`synchronize`, issue comments `created`. 
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--normal-limit` | 10 | Max jobs for `/pj-rehearse` | +| `--more-limit` | 20 | Max jobs for `/pj-rehearse more` | +| `--max-limit` | 35 | Max jobs for `/pj-rehearse max`, also threshold for GCS upload of full list | +| `--gcs-bucket` | `test-platform-results` | Where resolved configs and large job lists are stored | +| `--gcs-credentials-file` | `/etc/gcs/service-account.json` | GCS service account key | +| `--gcs-browser-prefix` | `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/` | URL prefix for GCS links in PR comments | +| `--prowjob-kubeconfig` | (in-cluster) | Override kubeconfig for the cluster where ProwJobs are created | +| `--no-registry` | false | Disable step registry comparison (skip registry-sourced affected jobs) | +| `--no-templates` | false | Disable template comparison | +| `--sticky-label-authors` | (empty) | Comma-separated PR authors whose `rehearsals-ack` label survives new pushes | +| `--hmac-secret-file` | `/etc/webhook/hmac` | GitHub webhook HMAC secret for signature verification | +| `--dry-run` | true | When true: fake k8s client, prints YAML instead of creating jobs | + +## Related +- `cmd/config-change-trigger` — similar concept but for postsubmit jobs, not rehearsals +- GCS path pattern: `pj-rehearse/{org}/{repo}/{prNumber}/{sha}/...` +- Prometheus metric: `pj_rehearse_handlers_in_flight` gauge diff --git a/cmd/pod-scaler/README.md b/cmd/pod-scaler/README.md index c1443b3f120..7adcc3b9222 100644 --- a/cmd/pod-scaler/README.md +++ b/cmd/pod-scaler/README.md @@ -1,7 +1,75 @@ # pod-scaler -The `pod-scaler` component automatically applies resource requests and limits to containers in batch workloads. Data is fetched from Prometheus, aggregated by using Pod labels stored in the the `kube_pod_labels` metric, and served in a mutating admission webhook for future workloads. 
- +## What +Three-mode resource optimization system: a producer queries Prometheus for historical CPU and memory usage, an admission webhook mutates pod resource requests based on 80th percentile of historical data, and a UI serves a dashboard of collected metrics. Prevents both over-provisioning (wasted cluster capacity) and under-provisioning (OOMKills and throttling). + +## How it works + +### Producer mode (`--mode producer`) +Queries Prometheus on every build cluster every 2 hours (or once with `--produce-once`). Three metric prefixes, each with CPU and memory: + +| Prefix | Filter | Labels collected | +|---|---|---| +| `prowjobs` | `label_created_by_prow="true"` | context, org, repo, base_ref, job, type | +| `pods` | `label_created_by_ci="true", step=""` | org, repo, branch, variant, target, build, release, app | +| `steps` | `label_created_by_ci="true", step!=""` | org, repo, branch, variant, target, step | + +CPU query: `rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[3m])` +Memory query: `container_memory_working_set_bytes{container!="POD",container!=""}` + +Data stored as Circonus log-linear histograms (no lookup table, compact) in GCS cache (or local dir). Pruning: max 25 entries per metadata key, entries older than 90 days removed. + +### Admission webhook mode (`--mode consumer.admission`) +Registered on `/pods` endpoint. For each pod admission: + +1. Check annotation `ci-workload-autoscaler.openshift.io/scale` (default: true) +2. Handle Build pods: backfill labels from Build object +3. Handle rehearsal pods: extract actual config from CONFIG_SPEC +4. Determine if measured: random `--percentage-measured` chance. Measured pods get label `pod-scaler.openshift.io/measured=true` and anti-affinity against unmeasured pods +5. 
Calculate resources via `mutatePodResources()`: + - Query both measured and unmeasured historical data + - Take **maximum** of both + - 80th percentile of merged histogram + - Apply **120% multiplier** + - Apply **never-reduce rule**: `max(determined, configured)` + - Cap: `--cpu-cap` (default 10 cores), `--memory-cap` (default 20Gi) + - Remove all CPU limits (don't throttle) + - Ensure memory limit >= 200% of request +6. High-priority scheduling: if CPU >= `--cpu-priority-scheduling` (default 8), set priority class `high-priority-nonpreempting` +7. Measured pod CPU increase: `--measured-pod-cpu-increase` (default 50%) additional CPU for measurement accuracy, capped by node allocatable + +Node allocatable cache refreshes every 15 minutes, groups nodes by `ci-workload` label. + +### UI mode (`--mode consumer.ui`) +Serves web dashboard with org/repo/branch/variant/target/step hierarchy for browsing resource usage data. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--mode` | — | `producer`, `consumer.admission`, or `consumer.ui` | +| `--cache-dir` | — | Local cache directory (dev) | +| `--cache-bucket` | — | GCS bucket for cache | +| `--gcs-credentials-file` | — | GCS service account credentials | +| `--produce-once` | false | Exit after single query cycle | +| `--ignore-latest` | 0 | Duration of latest data to ignore | +| `--port` | — | Webhook port (admission mode) | +| `--serving-cert-dir` | — | TLS cert directory (admission mode) | +| `--cpu-cap` | 10 | Max CPU request in cores | +| `--memory-cap` | 20Gi | Max memory request | +| `--cpu-priority-scheduling` | 8 | CPU threshold for high-priority class | +| `--percentage-measured` | 0 | Percentage of pods to measure (0-100) | +| `--measured-pod-cpu-increase` | 50 | Extra CPU % for measured pods | +| `--ui-port` | — | UI server port | + +## Key files +- `cmd/pod-scaler/main.go` — mode routing, flag parsing +- `cmd/pod-scaler/producer.go` — Prometheus queries, cache updates +- 
`cmd/pod-scaler/admission.go` — webhook handler, resource mutation +- `cmd/pod-scaler/frontend.go` — UI serving +- `pkg/pod-scaler/types.go` — CachedQuery, histogram storage, pruning + +## Deployment +Three separate Deployments on app.ci: producer, admission webhook, UI. Producer connects to all build cluster Prometheus instances. ## Producer The producer reads Prometheus data a couple times daily and updates a static data store in GCS after digesting the metrics. The storage format records time periods for which data fetching failed, to enable eventually consistent data collection in the face of Prometheus errors or network outages. @@ -40,4 +108,4 @@ Run end-to-end tests as normal: ```shell make local-e2e TESTFLAGS='-run TestAdmission' -``` \ No newline at end of file +``` diff --git a/cmd/pr-reminder/README.md b/cmd/pr-reminder/README.md new file mode 100644 index 00000000000..226c255b123 --- /dev/null +++ b/cmd/pr-reminder/README.md @@ -0,0 +1,61 @@ +# pr-reminder + +## What +Sends daily Slack DMs to team members reminding them about open pull requests that need their review. Also posts unassigned/unreviewed PRs to configured team channels. Designed to be run as a periodic CronJob. + +## How it works -- full flow + +### User resolution +1. Load the team config file (teams, members, repos) +2. Load the rover groups config to build a GitHub-username-to-Kerberos-ID mapping +3. For each team member (identified by Kerberos ID), look up their Slack user by email (`{kerberos}@redhat.com`) +4. Match Kerberos IDs to GitHub usernames via the rover mapping +5. Users without a GitHub ID or Slack ID are dropped with a warning + +### PR discovery +1. Collect the union of all repos across all teams +2. Fetch open PRs from each repo via the GitHub API +3. 
For each user, check each PR in their configured repos: + - Skip if the PR has unactionable labels (`do-not-merge/work-in-progress`, `needs-rebase`) + - Skip if the PR is ready to merge (has both `lgtm` and `approved`) + - Check if the user was requested to review (by direct request, team request, or assignment) + - If requested, check if the PR **requires attention**: the user has never reviewed it, or new commits have been pushed since their last review +4. For channel notifications, find PRs that have zero reviews and no assignees (truly unreviewed) + +### Message formatting +- Each user receives a DM listing their pending PRs, sorted by most recent update first +- PRs are color-coded by age: green circle (< 2 days), orange circle (< 1 week), red circle (> 1 week) +- A "NEW" badge appears for PRs updated in the last 24 hours +- Relevant labels (`approved`, `lgtm`, `do-not-merge/hold`) are shown per PR +- If a user has more than 40 PRs, the message is split into multiple chunks to stay within Slack's block limits + +### Channel notifications +Teams can optionally configure a `channel` field. For each such channel, the tool posts unassigned/unreviewed PRs from the team's repos. The `omitBots` option suppresses PRs from GitHub bot accounts. + +### Validate-only mode +With `--validate-only`, the tool validates the config (team structure, user resolution) and exits without sending messages. 
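
The age color-coding, the "NEW" badge, and the message splitting described under message formatting can be sketched as follows. This is an illustrative Python sketch, not the tool's Go implementation. The helper names are hypothetical, and two details are assumptions: a PR that is exactly one week old is treated as red, and each chunk holds at most 40 PRs.

```python
from datetime import timedelta

MAX_PRS_PER_MESSAGE = 40  # assumption: chunk size equals the 40-PR threshold


def age_color(age: timedelta) -> str:
    """Map a PR's age to the circle color used in the reminder."""
    if age < timedelta(days=2):
        return "green"
    if age < timedelta(weeks=1):
        return "orange"
    return "red"  # assumption: exactly one week also counts as red


def is_new(since_update: timedelta) -> bool:
    """A PR gets the NEW badge if it was updated in the last 24 hours."""
    return since_update < timedelta(hours=24)


def chunk(prs: list, size: int = MAX_PRS_PER_MESSAGE) -> list:
    """Split a user's PR list into message-sized chunks."""
    return [prs[i:i + size] for i in range(0, len(prs), size)]
```

With these assumptions, a user with 85 pending PRs would receive three messages of 40, 40, and 5 entries.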
+
+## Config format
+```yaml
+teams:
+  - teamMembers: [kerberos1, kerberos2]
+    teamNames: [team-slug] # GitHub team slugs for team-level review requests
+    repos: [org/repo]
+    channel: "#team-channel" # optional: post unassigned PRs here
+    omitBots: true # optional: skip bot PRs in channel posts
+```
+
+## Flags
+| Flag | Default | What it controls |
+|---|---|---|
+| `--config-path` | (required) | Path to the team/repo config file |
+| `--github-users-file` | (required) | Path to the rover groups config (GitHub-to-Kerberos mapping) |
+| `--slack-token-path` | (required) | Path to the Slack bot token file |
+| `--validate-only` | `false` | Validate config and exit without sending messages |
+| `--log-level` | `info` | Log output level |
+
+## Key files
+- `cmd/pr-reminder/main.go` -- entry point, config loading, PR discovery, Slack message construction and sending
+
+## Deployment
+Runs as a periodic CronJob (typically daily on weekday mornings). Not a long-lived service.
diff --git a/cmd/prcreator/README.md b/cmd/prcreator/README.md
index 15c44b4d6eb..caf03ebfd38 100644
--- a/cmd/prcreator/README.md
+++ b/cmd/prcreator/README.md
@@ -1,3 +1,59 @@
-# PRCreator
+# prcreator
 
-A simple CLI that will upsert a GitHub pr if the current working directory is in an unclean Git state.
+## What
+Generic CLI utility that creates or updates a GitHub pull request from uncommitted changes in the current working directory. It is the final step used by many CI automation tools that generate changes locally and need to open a PR.
+
+The tool handles the full lifecycle: checking for changes, committing, pushing (to a fork or directly to the upstream repo), and creating or updating the PR via the GitHub API. If no changes are detected, it exits cleanly without creating a PR.
+
+## How it works -- full flow
+
+1. **Gather options**: Parse required flags (`--pr-title`, `--organization`, `--repo`, `--branch`) and PR creation options (GitHub auth, self-approve, source mode).
+
+2.
**Delegate to `PRCreationOptions.UpsertPR()`**: Pass the current directory (`.`), org, repo, branch, and PR title along with optional body, assignee, and git commit message. + +3. **Inside UpsertPR** (from `pkg/github/prcreation`): + - Check for uncommitted changes via `bumper.HasChanges()`. If none, log and return. + - Determine the bot username from the GitHub token. + - Configure local git user/email. + - Derive a branch name from the PR match title (lowercased, spaces/colons replaced with hyphens). + +4. **Push strategy** (controlled by `--pr-source-mode`): + - **`fork`** (default, requires `--github-token-path`): + - Check if a fork exists for the bot user; create one if not (waits up to 6 minutes for GitHub to provision it) + - Commit and push to the fork via `bumper.GitCommitAndPush()` using the PAT + - Create a cross-repo PR (head: `:`) + - **`branch`** (requires `--github-app-id` and `--github-app-private-key-path`): + - Create a local branch, stage all changes, commit + - Push directly to the upstream repo using Prow's `GitClientFactory` with App auth + - Create a same-repo PR (head: ``) + +5. 
**PR creation/update** (`ensurePR`): + - Use `bumper.UpdatePullRequestWithLabels()` to create a new PR or update an existing one matching the title + - If `--self-approve`, add `lgtm` and `approved` labels + - If `--pr-assignee` is set, append `/cc @assignee` to the PR body + +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--pr-title` | (required) | Title of the PR to create | +| `--pr-message` | `""` | Body/description of the PR | +| `--git-message` | `""` | Git commit message (if different from PR body) | +| `--pr-assignee` | `""` | Comma-separated list of GitHub usernames to assign/cc on the PR | +| `--organization` | `openshift` | GitHub organization for the PR | +| `--repo` | `release` | GitHub repository for the PR | +| `--branch` | `main` | Target branch for the PR | +| `--self-approve` | `false` | Add `lgtm` and `approved` labels to the PR | +| `--pr-source-mode` | `fork` | How to push: `fork` (cross-repo via PAT) or `branch` (same-repo via App auth) | +| `--github-token-path` | `""` | Path to GitHub PAT (required for fork mode) | +| `--github-app-id` | `""` | GitHub App ID (required for branch mode) | +| `--github-app-private-key-path` | `""` | Path to GitHub App private key (required for branch mode) | +| `--github-endpoint` | `https://api.github.com` | GitHub API endpoint | + +## Key files + +- `cmd/prcreator/main.go` -- entry point: option parsing, delegates to `UpsertPR()` +- `pkg/github/prcreation/prcreation.go` -- `PRCreationOptions`, `UpsertPR()`, fork/branch push strategies, PR creation/update logic + +## Deployment +CLI tool. Not deployed as a service. Invoked by other automation jobs (e.g. `auto-config-brancher`, `private-org-sync`, `ci-operator-yaml-creator`) as the final step to turn local changes into a GitHub PR. 
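
Two rules from the `prcreator` flow are easy to misread in prose: how the branch name is derived from the PR title, and which credentials each `--pr-source-mode` needs. The Python sketch below illustrates both under stated assumptions. The function names are hypothetical, and the exact character handling in `pkg/github/prcreation` may differ from this simplified version.

```python
def branch_name_from_title(title: str) -> str:
    """Lowercase the title and replace spaces and colons with hyphens."""
    # Assumption: each space or colon becomes one hyphen, with no collapsing.
    return title.lower().replace(" ", "-").replace(":", "-")


def validate_source_mode(mode: str, token_path: str = "",
                         app_id: str = "", app_key_path: str = "") -> None:
    """Raise ValueError if the credentials required by the mode are missing."""
    if mode == "fork":
        # Fork mode pushes with a personal access token.
        if not token_path:
            raise ValueError("fork mode requires --github-token-path")
    elif mode == "branch":
        # Branch mode pushes to the upstream repo with GitHub App auth.
        if not (app_id and app_key_path):
            raise ValueError(
                "branch mode requires --github-app-id and "
                "--github-app-private-key-path"
            )
    else:
        raise ValueError(f"unknown --pr-source-mode: {mode}")
```

Checking the mode and credentials up front keeps a misconfigured job from failing halfway through a commit and push.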
diff --git a/cmd/private-org-peribolos-sync/README.md b/cmd/private-org-peribolos-sync/README.md index fbfc0259023..751d9c13b0e 100644 --- a/cmd/private-org-peribolos-sync/README.md +++ b/cmd/private-org-peribolos-sync/README.md @@ -1,12 +1,51 @@ -# Private org peribolos sync +# private-org-peribolos-sync -This tool generates a mapping table of peribolos repository configurations for the given private -organization. -It walks through the release repository path, given by `--release-repo-path`, and detects which of the repositories -are promoting official images. The repositories that are specified in `--include-repo`, will be included if they exist in the release repository's path as well. Furthermore, it will get the required information for each of them from GitHub -and will generate its peribolos configuration. -Finally, the tool will update the peribolos configuration given by `--peribolos-config`. +## What +CLI tool that generates the repository section of a Peribolos configuration file for the `openshift-priv` GitHub organization. It discovers which repos need private mirrors (by scanning ci-operator configs for repos that build official images, plus a whitelist), fetches their GitHub metadata, and writes the resulting repo definitions into the Peribolos config. +Peribolos is the tool that manages GitHub organization membership and repository settings declaratively. This tool ensures that every repo with a private mirror has correct settings (description, merge methods, archive status, etc.) in the `openshift-priv` org. + +## How it works -- full flow + +1. Read the existing Peribolos config file (`--peribolos-config`), which may be gzip-compressed +2. Initialize a GitHub client for API calls +3. Walk the ci-operator config directory (`{release-repo-path}/ci-operator/config/`) to find all repos that build official images (`api.BuildsAnyOfficialImages` with `WithoutOKD`) +4. Add all repos from the whitelist file (these bypass the `--only-org` filter) +5. 
For each discovered `{org}/{repo}`, call `gc.GetRepo(org, repo)` to fetch the current GitHub repository metadata: + - Description, Homepage, HasIssues, HasProjects, HasWiki + - AllowMergeCommit, AllowSquashMerge, AllowRebaseMerge + - Archived, DefaultBranch +6. Compute the private repo name using `MirroredRepoName()`: + - Flattened orgs: repo name unchanged + - Non-flattened orgs: `{org}-{repo}` prefix +7. Build an `org.Repo` for each, always setting `Private: true` +8. Apply `org.PruneRepoDefaults()` to remove values that match Peribolos defaults +9. Replace the `Repos` map in the destination org's config with the generated repos +10. Marshal and write the updated Peribolos config back to the same file + +### Important behavior +- The tool completely replaces the `Repos` section of the destination org. It does not merge with existing repo entries. +- The GitHub API call for each repo can be slow for large numbers of repos, as they are fetched sequentially. +- If a repo fails to fetch (e.g. deleted or renamed), the tool fatals. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--peribolos-config` | (required) | Path to the Peribolos YAML config file to update | +| `--release-repo-path` | (required) | Path to openshift/release repository directory | +| `--destination-org` | (required) | Name of the Peribolos org to populate (typically `openshift-priv`) | +| `--only-org` | `""` | Only include repos from this source organization | +| `--flatten-org` | (repeatable) | Additional orgs whose repos should not have org prefix | +| `--whitelist-file` | `""` | Path to YAML file listing repos to include regardless of official image status | +| GitHub flags | | Standard Prow GitHub options (`--github-token-path`, etc.) 
| + +## Key files +- `cmd/private-org-peribolos-sync/main.go` -- all logic in this single file +- `pkg/privateorg/flatten.go` -- `MirroredRepoName()` and `DefaultFlattenOrgs` +- `pkg/config/whitelist.go` -- whitelist configuration + +## Deployment +CLI tool. Not a long-running service. Called by `auto-peribolos-sync` which wraps it in a PR workflow. Packaged in the `ci_auto-peribolos-sync_latest` container image alongside `auto-peribolos-sync`. ## Repository Naming Convention To prevent collisions when multiple organizations have repositories with the same name, this tool uses a special naming convention: diff --git a/cmd/private-org-sync/README.md b/cmd/private-org-sync/README.md index a58f3dde123..37fd7dadb0c 100644 --- a/cmd/private-org-sync/README.md +++ b/cmd/private-org-sync/README.md @@ -1,10 +1,80 @@ -# Private Org Sync +# private-org-sync -This tool automatically syncs code from which "official" images are built -from their public git repo locations to their mirrors in private GitHub -organization. It is intended to be run in a Prow periodic job with an interval -of less than an hour. +## What +Syncs git repository content from public organizations (openshift, openshift-eng, operator-framework, etc.) to their private mirrors in openshift-priv. Handles shallow clones with exponential depth retry, merge fallback for diverged histories, and org name flattening. +## How it works — full flow + +### 1. Discover repositories +- Walks ci-operator config directory to find repos that build official images (`api.BuildsAnyOfficialImages()`) +- Merges with whitelist for repos not in config +- Groups by org/repo, then iterates branches + +### 2. 
Repository naming +- **Flattened orgs** (openshift, openshift-eng, operator-framework, redhat-cne, openshift-assisted, ViaQ by default + `--flatten-org` values): repo name unchanged in openshift-priv + - Example: `openshift/installer` -> `openshift-priv/installer` +- **Non-flattened orgs**: prefixed with org name + - Example: `migtools/filebrowser` -> `openshift-priv/migtools-filebrowser` + +### 3. Branch filtering +- Checks source branch existence via `git ls-remote --heads` (retries 5x with exponential backoff) +- **Release branches** (`openshift-*`, `release-*`): error if source doesn't exist +- **Misc branches**: skip with warning if source doesn't exist +- If source and destination already at same commit: skip (no-op) + +### 4. Exponential depth retry +Starts with a shallow fetch and progressively deepens until push succeeds: + +| Depth level | Git flag | Commits fetched | +|---|---|---| +| 1 | `--depth=2` | 2 | +| 2 | `--depth=4` | 4 | +| 3 | `--depth=8` | 8 | +| 4 | `--depth=16` | 16 | +| 5 | `--depth=32` | 32 | +| 6 | `--depth=64` | 64 | +| 7 | `--unshallow` | Full history | + +At each level: +1. Fetch from source at current depth +2. Push to destination (`git push --tags {destURL} FETCH_HEAD:refs/heads/{branch}`) +3. If push fails with "shallow update not allowed" or "rejected because remote contains work": increase depth and retry + +Each shallow fetch has 3 internal retries for transient "shallow file has changed" errors. + +### 5. Merge fallback +When push still fails after unshallowing (histories have diverged): +1. `git fetch {destURL} {branch}` — get destination state +2. `git checkout FETCH_HEAD` — switch to destination +3. `git merge {srcRemote}/{branch} -m "DPTP reconciliation from upstream"` — merge source +4. If merge conflict: `git merge --abort`, log warning, return nil (graceful skip) +5. `git push --tags {destURL} HEAD:{branch}` — push merged result + +### 6. 
Error handling +- Fatal: token file unreadable, git dir creation fails, invalid options +- Skipped: misc branch missing, destination repo missing (unless `--fail-on-missing-destination`), merge conflicts +- All errors accumulated and reported at end + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--token-path` | — | GitHub token file path | +| `--target-org` | — | Destination organization (e.g., openshift-priv) | +| `--config-dir` | — | CI operator config directory | +| `--git-name` | — | Git author name for merge commits | +| `--git-email` | — | Git author email | +| `--flatten-org` | — | Orgs that keep original repo names (repeatable) | +| `--only-org` | — | Mirror only repos from this org | +| `--only-repo` | — | Mirror only this specific repo (org/repo format) | +| `--confirm` | false | Execute real operations | +| `--fail-on-missing-destination` | false | Error if destination repo doesn't exist | + +## Key files +- `cmd/private-org-sync/main.go` — full sync logic, mirror(), exponential retry, merge fallback +- `pkg/privateorg/flatten.go` — `MirroredRepoName()`, default flattened orgs + +## Deployment +Periodic Prow job. Both source and destination are queried with `git ls-remote`: if they are already in sync, no further git operations are done. When the destination is detected to be an empty repo, full `git fetch` is done against a source, diff --git a/cmd/private-prow-configs-mirror/README.md b/cmd/private-prow-configs-mirror/README.md index 79ed19c894b..aa40d757e92 100644 --- a/cmd/private-prow-configs-mirror/README.md +++ b/cmd/private-prow-configs-mirror/README.md @@ -1,8 +1,57 @@ -# Private Prow Configs Mirror +# private-prow-configs-mirror -The purpose of this tool is to generate the prow's configuration for the repositories of the `openshift-priv` -organization. +## What +CLI tool that mirrors Prow configuration (branch protection, Tide settings, plugin configs, etc.) from public repositories to their `openshift-priv` equivalents. 
This ensures that private forks have the same Prow policies as their public counterparts without manual duplication. +It reads the full Prow config and plugin config from the release repo, determines which repos have private mirrors (by scanning ci-operator configs for repos that build official images + the whitelist), and injects matching entries for `openshift-priv`. + +## How it works -- full flow + +1. Load all Prow config and plugin config from `--release-repo-path` +2. Strip all job definitions (presubmits, postsubmits, periodics) from the loaded Prow config -- this tool only handles non-job config +3. Load the plugins config from `core-services/prow/02_config/_plugins.yaml` +4. Query the GitHub API via `ghClient.GetRepos("openshift-priv", false)` to get the list of repos that actually exist in the private org +5. Scan ci-operator configs to find all repos building official images + add whitelisted repos, then cross-reference against repos that actually exist in `openshift-priv` to build the `orgReposWithOfficialImages` mapping +6. Clean all existing `openshift-priv` entries from plugin configs to avoid retaining stale names +7. 
For each Prow config section, inject private org equivalents: + +### Config sections mirrored +| Section | Behavior | +|---|---| +| **Branch protection** | Delete existing `openshift-priv` org entry, then copy per-repo rules from source orgs into `openshift-priv` repos | +| **Tide context policy** | Copy per-repo context policies to `openshift-priv` repos | +| **Tide merge type** | Mirror org/repo and org/repo@branch merge method settings; drop stale `openshift-priv` repo-level entries | +| **Tide queries** | Add `openshift-priv/{repo}` to every Tide query that includes the public `{org}/{repo}`, remove stale priv entries | +| **PR status base URLs** | Copy per-repo PR status URLs to `openshift-priv` | +| **Plank decoration configs** | Copy per-repo decoration configs to `openshift-priv` | +| **Job URL prefix config** | Copy per-repo job URL prefixes to `openshift-priv` | +| **Approve plugin** | Add `openshift-priv/{repo}` to approve plugin repo lists | +| **LGTM plugin** | Add `openshift-priv/{repo}` to LGTM plugin repo lists | +| **Bugzilla plugin** | Copy per-repo Bugzilla options to `openshift-priv` | +| **Plugins** | Compute the union of org-level and repo-level plugins for each mirrored repo, extract common plugins to `openshift-priv` org level, keep repo-specific differences at repo level | + +8. Write updated Prow config to `core-services/prow/02_config/_config.yaml` +9. 
Write updated plugin config (sharded) via `WriteShardedPluginConfig` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--release-repo-path` | (required) | Path to openshift/release repository directory | +| `--config-dir` | (derived from release-repo-path) | ci-operator config directory | +| `--whitelist-file` | `""` | Path to YAML file listing repos to include | +| `--flatten-org` | (repeatable) | Additional orgs whose repos should not have org prefix | +| `--dry-run` | `true` | Use API tokens but do not mutate | +| GitHub flags | | Standard Prow GitHub options (`--github-token-path`, etc.) | + +## Key files +- `cmd/private-prow-configs-mirror/main.go` -- all logic in this single file +- `pkg/privateorg/flatten.go` -- `MirroredRepoName()` and `DefaultFlattenOrgs` +- `pkg/config/whitelist.go` -- whitelist configuration +- `pkg/prowconfigsharding/` -- plugin config sharding for output +- `pkg/prowconfigutils/` -- `ExtractOrgRepoBranch()` for parsing merge type keys + +## Deployment +CLI tool. Not a long-running service. Invoked by `auto-config-brancher` as part of the periodic config generation pipeline. Runs in the `ci_auto-config-brancher_latest` container image. It walks through all the ci-operator configuration files to get the information of which repositories are promoting official images. When a configuration is detected, the tool generates the configuration for the corresponding repository of the `openshift-priv` organization instead. diff --git a/cmd/promoted-image-governor/README.md b/cmd/promoted-image-governor/README.md index 6490480723a..3e0a31e5cb0 100644 --- a/cmd/promoted-image-governor/README.md +++ b/cmd/promoted-image-governor/README.md @@ -1,5 +1,58 @@ # promoted-image-governor +## What +Garbage collector for promoted ImageStreamTags on the app.ci cluster and build farm clusters. 
It compares the set of tags that _should_ exist (as determined by ci-operator promotion configs) against what _actually_ exists in the cluster, and deletes orphaned tags that are no longer promoted by any configuration. Also optionally generates image mirroring mapping files for the `periodic-image-mirroring-openshift` job. + +## How it works -- full flow + +### Determine promoted tags +1. Walk the ci-operator config directory (`--ci-operator-config-path`) +2. For each config, call `release.PromotedTags()` to collect all ImageStreamTagReferences that the config promotes +3. If any promotion target uses `TagByCommit`, compile a regex to ignore commit-hash tags (`namespace/name:[0-9a-f]{5,40}`) + +### Determine mirrored tags +1. Walk the release controller mirror config directory (`--release-controller-mirror-config-dir`) +2. Parse each JSON config to extract `ImageStreamRef` entries (namespace, name, excluded tags) +3. Tags mirrored by the release controller are exempt from deletion + +### Find tags to delete (on app.ci) +1. For each ImageStream that has at least one promoted tag, fetch the full ImageStream from the cluster +2. Collect all tags present in the ImageStream's `.status.tags` +3. Remove tags that are in the promoted set, in the ignored-tags regex set, or mirrored by the release controller +4. The remaining tags are orphans -- delete them + +### Delete tags on build farm clusters +1. For each ImageStream with promoted tags, check if it exists on each build farm cluster +2. If the ImageStream doesn't exist on app.ci at all, delete the entire ImageStream on the build farm +3. Otherwise, compare tags: any tag present on the build farm but not on app.ci gets deleted + +### Explain mode (`--explain`) +Instead of deleting, prints a table showing each queried ImageStreamTag and which ci-operator config promotes it (or "unknown" / "does not exist"). 
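
The orphan detection on app.ci amounts to a set difference with a regex filter. The following Python sketch shows that idea only. It assumes tags are plain `namespace/name:tag` strings, that mirrored tags are already expanded into a flat set, and that ignore patterns must match the whole tag. The function name `orphaned_tags` is hypothetical, and the real tool works on ImageStream objects instead.

```python
import re


def orphaned_tags(existing, promoted, mirrored, ignored_patterns):
    """Return the tags present in the cluster that nothing accounts for."""
    # Assumption: a pattern must match the entire tag reference.
    ignored = [re.compile(p) for p in ignored_patterns]
    return sorted(
        tag for tag in existing
        if tag not in promoted
        and tag not in mirrored
        and not any(p.fullmatch(tag) for p in ignored)
    )


# Illustrative data only; these tag names are made up.
existing = {"ci/tool:latest", "ci/tool:old", "ci/tool:abc1234", "ci/tool:mirrored"}
promoted = {"ci/tool:latest"}
mirrored = {"ci/tool:mirrored"}
patterns = [r"ci/tool:[0-9a-f]{5,40}"]  # commit-hash tags, as with TagByCommit
to_delete = orphaned_tags(existing, promoted, mirrored, patterns)
```

In this example only `ci/tool:old` is left for deletion: the commit-hash tag is ignored and the other two tags are promoted or mirrored.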
+ +### Mapping file generation (`--openshift-mapping-dir` + `--openshift-mapping-config`) +When these flags are set, the tool generates image mirroring mapping files instead of performing deletions: +1. For each promoted tag in the configured source namespace, look up the target mappings in the mapping config +2. Write mapping files to `--openshift-mapping-dir` in the format `source destination1 destination2...` +3. These files are consumed by the `periodic-image-mirroring-openshift` job to mirror images from the internal registry to external registries (e.g., `quay.io/openshift`) + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--ci-operator-config-path` | (required) | Path to the ci-operator config directory | +| `--dry-run` | `true` | Print tags to delete without actually deleting | +| `--ignored-image-stream-tags` | (none) | Regex patterns for tags to skip (can repeat) | +| `--release-controller-mirror-config-dir` | (required) | Path to release controller mirror config JSON files | +| `--openshift-mapping-dir` | (empty) | Output directory for mirroring mapping files | +| `--openshift-mapping-config` | (empty) | Path to the openshift mapping config (must pair with `--openshift-mapping-dir`) | +| `--explain` | (none) | Print promotion source for specific ISTs (namespace/name:tag format, can repeat) | +| `--log-level` | `info` | Log output level | +| `--kubeconfig` / `--build-cluster-kubeconfig` | (various) | Kubeconfigs for app.ci and build farm clusters | + +## Key files +- `cmd/promoted-image-governor/main.go` -- entry point, promoted tag collection, orphan detection, deletion, mapping generation + +## Deployment +Runs as a periodic CronJob. When used for tag cleanup, it connects to app.ci (via in-cluster config or kubeconfig) and optionally to build farm clusters. When used for mapping file generation, it only reads configs and writes files (no cluster access needed beyond app.ci for the explain mode). 
## What it does `promoted-image-governor` is a tool with the following features: @@ -50,4 +103,4 @@ uses `promoted-image-governor` to ensure the mapping files to be aligned with th $ istag=ocp/4.9:cli make explain tag explanation ocp/4.9:cli openshift/oc@release-4.9 -``` \ No newline at end of file +``` diff --git a/cmd/prow-job-dispatcher/README.md b/cmd/prow-job-dispatcher/README.md index bdc80199516..3735a26fde0 100644 --- a/cmd/prow-job-dispatcher/README.md +++ b/cmd/prow-job-dispatcher/README.md @@ -1,7 +1,69 @@ # prow-job-dispatcher -As designed in [[DPTP-1152] Choose a cluster for prow jobs](https://docs.google.com/document/d/1aiuZ70jtvZiQBo2P8NgacRj0GmqUH6DRxE4KFFph1RM/edit) this tool chooses a cluster in the CI build farm for Prow jobs. +## What +Assigns Prow CI jobs to build farm clusters based on cloud provider affinity, cluster capacity, capability matching, and historical job volume. Exposes an HTTP API for real-time scheduling decisions and periodically rebalances the full assignment map. +## How it works + +### Job assignment algorithm (`findClusterForJobConfig()`) +For each Prow job config file: +1. Determine cloud provider from job's e2e tests (checks `ci-operator.openshift.io/cloud` label or `CLUSTER_TYPE` env var) +2. If all tests target same cloud provider, prefer clusters on that provider +3. Check if the most-used cluster has spare capacity (< 75% of fair share distribution) +4. If yes: assign to most-used cluster (locality benefit) +5. If no: find cluster with minimum current volume across all providers (or specific provider if determined) +6. Only assign to clusters with 100% capacity + +### Matching priority (in `DetermineClusterForJob()`) +Jobs are routed through this priority chain (first match wins): +1. Non-Kubernetes jobs — skip +2. vSphere jobs — direct assignment to vsphere cluster +3. SSH Bastion jobs — assign to configured bastion cluster +4. Explicit `ci.openshift.io/cluster` label — direct assignment +5. 
`capability/*` labels — match to cluster with all required capabilities +6. KVM device requests — route to KVM cluster list +7. Cloud provider mapping — deterministic by e2e test cloud +8. No-builds jobs — route to NoBuilds cluster list +9. Job/Path groups — match from config Groups section (job names or path regexes) +10. Build farm files — check BuildFarm configurations +11. Default cluster — fallback + +### Scheduling modes +- **Full dispatch** (weekly, Sundays 7 AM UTC, or on config change): queries Prometheus for 14 days of job volumes, calculates fair share per cluster based on capacity, assigns all jobs +- **Delta dispatch** (every 5 minutes): assigns new/modified jobs not in full dispatch +- **Config monitoring** (every 1 minute): detects cluster config changes, triggers re-dispatch if capacity/capabilities changed +- **HTTP scheduling** (on-demand): POST `/` with `{"job": "name"}` returns `{"cluster": "name"}` + +### Ephemeral cluster handling +Round-robin scheduling for Konflux ephemeral clusters with 24-hour TTL cache. + +### PR creation +After full dispatch, optionally creates PR to openshift/release with updated job assignments. Labels: `rehearsals-ack`, `priority/ci-critical`. Sends Slack notification to ops channel. 
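
The matching priority chain is a first-match-wins sequence of checks. The Python sketch below covers only three of the eleven steps (explicit cluster label, capability labels, default cluster) to show the shape of the chain. It is not the logic of `DetermineClusterForJob()`. The function name, the data structures, and the assumption that the capability name is the part of the label key after `capability/` are all hypothetical.

```python
CLUSTER_LABEL = "ci.openshift.io/cluster"
CAPABILITY_PREFIX = "capability/"


def determine_cluster(labels: dict, cluster_capabilities: dict, default: str) -> str:
    """Pick a cluster for a job; the first matching rule wins."""
    # Rule: an explicit cluster label is a direct assignment.
    explicit = labels.get(CLUSTER_LABEL)
    if explicit:
        return explicit

    # Rule: capability labels need a cluster offering every capability.
    required = {
        key[len(CAPABILITY_PREFIX):]
        for key in labels
        if key.startswith(CAPABILITY_PREFIX)
    }
    if required:
        for name in sorted(cluster_capabilities):
            if required <= cluster_capabilities[name]:
                return name

    # Rule: nothing matched, so fall back to the default cluster.
    return default
```

Ordering the checks from most to least specific means an explicit label always overrides inferred placement, which matches the "first match wins" description above.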
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-jobs-dir` | — | Root directory of Prow job configs | +| `--config-path` | — | Dispatcher config file | +| `--cluster-config-path` | `core-services/sanitize-prow-jobs/_clusters.yaml` | Cluster configuration | +| `--jobs-storage-path` | — | Gob file for job assignments | +| `--prometheus-days-before` | 14 | Historical days for volume query (1-15) | +| `--prometheus-url` | thanos-querier URL | Prometheus endpoint | +| `--create-pr` | false | Create PR with updated assignments | +| `--slack-token-path` | — | Slack API token | +| `--ops-channel-id` | `CHY2E1BL4` | Slack ops channel | +| `--enable-cluster` | — | Enable disabled clusters (repeatable) | +| `--disable-cluster` | — | Disable enabled clusters (repeatable) | + +## Key files +- `cmd/prow-job-dispatcher/main.go` — orchestration, cron scheduling, HTTP server +- `pkg/dispatcher/config.go` — `DetermineClusterForJob()`, matching chain +- `pkg/dispatcher/server.go` — HTTP API handlers +- `pkg/dispatcher/prowjobs.go` — thread-safe job assignment storage +- `pkg/dispatcher/prometheus_volumes.go` — volume distribution calculation + +## Deployment +Long-lived Deployment on app.ci, namespace ci. Container listens on port 8888; Service exposes it on port 8080. * It starts off by figuring out how many runs of each Prow job we had in the last seven days by querying the Prometheus instance in the Prow-monitoring stack. * It groups all jobs from a Prow job file together and will always try to put all of them on the same cluster. * If a job has config stating it must be on a specific cluster, that will always be respected. This could lead to a job with tests on different clusters. We should not have many of those cases. 
diff --git a/cmd/publicize/README.md b/cmd/publicize/README.md new file mode 100644 index 00000000000..b63d649fc3a --- /dev/null +++ b/cmd/publicize/README.md @@ -0,0 +1,70 @@ +# publicize + +## What +Prow webhook plugin that merges commit history from private (`openshift-priv`) repositories back into their public upstream counterparts. When a PR is merged in a private repo, an org member can comment `/publicize` to push those changes to the public repo. + +This is the reverse direction of the private org mirroring pipeline: while `ci-operator-config-mirror` and `private-org-sync` push configs and repos into `openshift-priv`, `publicize` pushes merged code back out to the public. + +## User-facing commands + +| Command | What it does | +|---|---| +| `/publicize` | Merge the private repo's commit history into the configured public upstream repo | + +## How it works -- full flow + +### Prerequisites checked before publicizing +1. The comment must be on a pull request (not an issue) +2. The comment author must be a member of the repo's organization (`ghc.IsMember()`) +3. The pull request must be merged +4. The private repo must have an upstream mapping in the publicize config (e.g. `openshift-priv/installer` -> `openshift/installer`) + +### Merge and push flow +1. Clone the **destination** (public) repo using the git client +2. Checkout the PR's base branch (e.g. `main`, `release-4.16`) +3. Fetch the base branch from the **source** (private) repo using HTTPS with token auth +4. Configure git user name and email, disable GPG signing +5. Merge `FETCH_HEAD` using the `merge` strategy with commit message `"DPTP reconciliation from downstream"` +6. If merge succeeds and not dry-run, push to the public repo's central remote +7. 
Post a comment on the PR with a link to the new merge commit in the public repo + +### Error handling +- If the merge has conflicts, the tool posts an error comment explaining that a new PR must be created in the private repo to resolve the conflicts first, then `/publicize` can be used on that PR +- If the repo is not configured, the comment explains that no upstream mapping exists + +### Config hot-reload +The publicize config (`repositories` mapping) is watched via `prowconfig.GetCMMountWatcher`. When the ConfigMap is updated, the config is automatically reloaded without restarting the pod. + +## Config format +```yaml +repositories: + openshift-priv/installer: openshift/installer + openshift-priv/cluster-version-operator: openshift/cluster-version-operator +``` +Keys are `{private-org}/{repo}`, values are `{public-org}/{repo}`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-path` | `/etc/config/config.yaml` | Path to the publicize config YAML (mounted from ConfigMap) | +| `--hmac-secret-file` | `/etc/webhook/hmac` | GitHub webhook HMAC secret for signature verification | +| `--github-login` | (required) | GitHub username for push operations | +| `--git-name` | `""` | Git commit author name | +| `--git-email` | `""` | Git commit author email | +| `--dry-run` | `true` | Use API tokens but do not push or create comments | +| GitHub flags | | Standard Prow GitHub options | +| Event server flags | | Standard `githubeventserver.Options` (port, etc.) | + +## Key files +- `cmd/publicize/main.go` -- entry point, flag parsing, server setup, config watcher +- `cmd/publicize/server.go` -- webhook handler, prerequisite checks, merge/push logic, help provider +- `pkg/util/gzip/gzip.go` -- `ReadFileMaybeGZIP()` for config loading + +## Deployment +Long-lived webhook Deployment on `app.ci`. Listens for GitHub `issue_comment` events. 
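The prerequisite checks and the `repositories` lookup described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the plugin's code: the `commentEvent` type, its fields, the `upstreamFor` function, and the error messages are hypothetical, and the real plugin obtains these facts from GitHub webhook payloads and API calls.

```go
package main

import (
	"errors"
	"fmt"
)

// commentEvent is a hypothetical, flattened view of what the plugin learns
// about a `/publicize` comment.
type commentEvent struct {
	Org, Repo         string
	IsPullRequest     bool
	AuthorIsOrgMember bool
	Merged            bool
}

// upstreamFor applies the four prerequisites in the order listed in the README
// and returns the public repo that the private repo maps to.
func upstreamFor(repositories map[string]string, e commentEvent) (string, error) {
	switch {
	case !e.IsPullRequest:
		return "", errors.New("/publicize only works on pull requests")
	case !e.AuthorIsOrgMember:
		return "", errors.New("comment author is not a member of the organization")
	case !e.Merged:
		return "", errors.New("pull request is not merged")
	}
	upstream, ok := repositories[e.Org+"/"+e.Repo]
	if !ok {
		return "", fmt.Errorf("no upstream mapping configured for %s/%s", e.Org, e.Repo)
	}
	return upstream, nil
}

func main() {
	// Mapping taken from the config format example in this README.
	repositories := map[string]string{
		"openshift-priv/installer": "openshift/installer",
	}
	upstream, err := upstreamFor(repositories, commentEvent{
		Org: "openshift-priv", Repo: "installer",
		IsPullRequest: true, AuthorIsOrgMember: true, Merged: true,
	})
	fmt.Println(upstream, err)
}
```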
+ +Registers: +- `handleIssueComment` for `/publicize` commands +- Help provider for Prow's `/help` command + +Health endpoint served via `pjutil.NewHealth()`. diff --git a/cmd/qci-appci/README.md b/cmd/qci-appci/README.md index 467790b7dc8..04f18a6292e 100644 --- a/cmd/qci-appci/README.md +++ b/cmd/qci-appci/README.md @@ -1,12 +1,64 @@ # qci-appci -The name `qci-appci` comes from this tool being a reverse proxy of the image repository `quay.io/openshift/ci` to which -all images used by tests in the integrated registry of the CI cluster `app.ci` are being mirrored. -The proxy is developed in the context of migrating CI registry from `app.ci` to `quay.io` and -works as the _face_ of CI registry for human users and for some cases, a component running in the CI infrastructure, -e.g., a container on a CI build-farm, referring an image that is promoted during CI. - -# Functionality +## What +TLS-terminating reverse proxy that fronts `quay.io/openshift/ci` and authenticates users via OpenShift cluster tokens. This allows CI workloads and developers authenticated on the app.ci cluster to pull images from Quay without needing separate Quay credentials. The proxy translates OCP bearer tokens into Quay robot account credentials transparently. + +## How it works -- full flow + +### Authentication flow +1. Client sends a Docker registry auth request: `GET /v2/auth` with Basic auth (username = anything, password = OCP token) +2. The proxy validates the OCP token via a `TokenReview` against the app.ci cluster +3. For human users (not service accounts), it additionally performs a `SubjectAccessReview` to check `get` permission on `imagestreams/layers` in the `ocp` namespace +4. If valid, the proxy generates a short-lived JWT token (signed with `--token-secret-file`) and returns it +5. Client uses this JWT for subsequent pull requests + +### Proxying flow +1. Client sends a pull request (e.g., `GET /v2/openshift/ci/manifests/...`) with `Authorization: Bearer ` +2. 
The proxy validates the JWT +3. If valid, replaces the JWT with the Quay robot account's bearer token and forwards the request to `quay.io` +4. Returns the response from Quay to the client + +### Robot token maintenance +A background goroutine (`robotTokenMaintainer`) runs every `--interval` (default 30s): +1. Checks if the current Quay robot token is still valid by hitting `GET https://quay.io/v2` +2. If expired or invalid, renews it by authenticating with robot username/password against `https://quay.io/v2/auth?service=quay.io&scope=repository:openshift/ci:pull` +3. Uses exponential backoff (3 retries) on failures + +### Token types +- **Cluster token**: an OCP bearer token from the app.ci cluster, validated via TokenReview + SubjectAccessReview +- **App token (JWT)**: a short-lived HMAC-signed JWT issued by this proxy, containing the authenticated user's ID and expiry +- **Robot token**: a Quay.io bearer token obtained using the robot account credentials, used for actual Quay API calls + +### Special cases +- The Quay robot account itself can authenticate directly (username/password checked against the robot credentials) +- Service accounts (`system:serviceaccount:*`) bypass the SubjectAccessReview check -- only TokenReview is needed +- Health check endpoint: `GET /healthz` returns 200 OK + +### Request logging +All requests are logged with method, URI, status code, response size, duration, and whether a bearer token was present. 
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--listen-addr` | `127.0.0.1:8400` | Address to listen on | +| `--exposed-host` | `quay-proxy.ci.openshift.org` | Hostname used in `Www-Authenticate` headers | +| `--gracePeriod` | `10s` | Graceful shutdown duration | +| `--robot-username-file` | (required) | Path to the Quay robot username file | +| `--robot-password-file` | (required) | Path to the Quay robot password file | +| `--token-secret-file` | (required) | Path to the HMAC secret for signing JWTs | +| `--token-validity` | `21600s` (6h) | How long issued JWTs are valid | +| `--tls-cert-file` | (required) | Path to the TLS certificate | +| `--tls-key-file` | (required) | Path to the TLS private key | +| `--interval` | `30s` | How often to refresh the Quay robot bearer token | + +## Key files +- `cmd/qci-appci/main.go` -- entry point, reverse proxy setup, token services (robot, cluster, app), auth handlers, request routing + +## Deployment +Long-lived Deployment on app.ci, exposed via a Route. Serves TLS directly. Requires in-cluster access for TokenReview and SubjectAccessReview API calls. + +The external hostname (`quay-proxy.ci.openshift.org`) must be configured in the OCP cluster as an additional image registry so that `oc image` and container runtimes can pull from it. +## Functionality - The human users from `app.ci` can pull the images in the repo `quay.io/openshift/ci`: @@ -20,7 +72,7 @@ where `ci_ci-operator_latest` stands for the image stream tag `ci-operator:lates - The robot from the `openshift` org can too. This robot provides the read-only access to the repo. More details about this come later. -# How the authentication of `podman` works +## How the authentication of `podman` works [This article](https://access.redhat.com/solutions/3625131) illustrates how a client is authenticated against quay.io. More verbose output of `podman` with `--log-level=trace` below shows that `podman` follows a similar process. 
@@ -58,7 +110,7 @@ Unauthorized Then, `podman` did a basic auth as it was instructed. The bearer token is returned from the server in the body. The second attempt to access `/v2` was done with the bearer token and this time, it passed as expected. The bearer token is used for authorization to access any other endpoint to `quay.io`. -# How `qci-appci` works +## How `qci-appci` works The proxy manipulates the above process: `app.ci` maintains a valid token to QCI with the provided robot's username and password. @@ -70,7 +122,7 @@ Otherwise, the request will be denied with `401`. The generated token is a [JWT token](https://jwt.io/) signed by a secret provided to `qci-appci`. Any request to `qci-appci` other than path `/v2/auth` will require a valid JWT token. Otherwise, the request gets `401`. `qci-appci` replaces the valid token in the request with the QCI token and forwards the request to `quay.io` and forwards the response from `quay.io` to the client. -# Authorization for human users +## Authorization for human users For human users, `qci-appci` authorizes requests whose token can pull images from `ocp` on `app.ci`. [Our document](https://docs.ci.openshift.org/how-tos/use-registries-in-build-farm/#human-users) tells our users to bind their group to the role `qci-image-puller`. In reality, this condition becomes unnecessary as `ocp` allows all authenticated users to pull its images by the following `RoleBinding`: diff --git a/cmd/rebalancer/README.md b/cmd/rebalancer/README.md index 5f2c91004e3..fd35c3efd4e 100644 --- a/cmd/rebalancer/README.md +++ b/cmd/rebalancer/README.md @@ -1,14 +1,42 @@ -Rebalancer -========== +# rebalancer -Rebalancer is a tool used to rebalance jobs between specific cluster profiles. -This tool is needed when Boskos leases are in short supply for some profiles. +## What +CLI tool that redistributes CI tests across equivalent cluster profiles to balance Boskos lease usage. 
When multiple cloud profiles can serve the same purpose (e.g. `aws-1` and `aws-2`), rebalancer assigns each test to the profile with the least accumulated workload, then rewrites the ci-operator config files in place. -Usage ------ +## How it works -- full flow -```bash -oc --context app.ci whoami -t > /tmp/token -# go to release repository folder and execute: -/path/to/rebalancer --profiles='azure4,azure-2' --prometheus-bearer-token-path=/tmp/token -``` +1. **Parse profile groups**: the `--profiles` flag is specified one or more times, each value being a comma-separated list of equivalent profiles (e.g. `--profiles aws-1,aws-2 --profiles gcp-1,gcp-2`). These form groups within which load is balanced. + +2. **Query Prometheus for job volumes**: using the `--prometheus-*` flags, it queries Prometheus for job execution volumes over the past `--prometheus-days-before` days (default 14). The result is a `map[string]float64` mapping full job names to their execution weight (volume). This uses `pkg/dispatcher.NewPrometheusVolumes` and `GetJobVolumes()`. + +3. **Walk ci-operator configs**: it reads all ci-operator configuration from `ci-operator/config/` in the current working directory via `config.OperateOnCIOperatorConfigDir`. + +4. **Greedy assignment**: for each test in each config that has a `MultiStageTestConfiguration` with a `ClusterProfile` belonging to one of the configured groups: + - Find the profile in the group with the minimum accumulated workload (bucket value). + - If the test's current profile differs from the best choice, reassign it and log the change. + - Add the test's Prometheus-derived volume weight to the chosen profile's bucket. + +5. **Write updated configs**: modified configs are committed back to disk via `config.DataWithInfo.CommitTo()`. + +6. **Log final weights**: the accumulated weight per profile is logged for visibility. 
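The greedy assignment in step 4 can be sketched as follows for a single profile group. This is a simplified illustration rather than the tool's implementation: the `test` type and `rebalance` function are hypothetical, volumes are keyed directly by test name instead of the full Prow job name, and ties are broken by the order of profiles in the group, which the README does not specify.

```go
package main

import "fmt"

// test is a hypothetical, reduced view of a ci-operator test entry.
type test struct {
	Name    string
	Profile string
}

// rebalance walks the tests in order and moves each one that belongs to the
// group onto the profile with the smallest accumulated volume so far. It
// mutates tests in place and returns the final weight per profile.
func rebalance(group []string, tests []test, volumes map[string]float64) map[string]float64 {
	buckets := make(map[string]float64, len(group))
	inGroup := make(map[string]bool, len(group))
	for _, p := range group {
		buckets[p] = 0
		inGroup[p] = true
	}
	if len(group) == 0 {
		return buckets
	}
	for i := range tests {
		if !inGroup[tests[i].Profile] {
			continue // test uses a profile outside this equivalence group
		}
		// Pick the least-loaded profile; earlier profiles win ties (assumption).
		best := group[0]
		for _, p := range group[1:] {
			if buckets[p] < buckets[best] {
				best = p
			}
		}
		if tests[i].Profile != best {
			fmt.Printf("moving %s: %s -> %s\n", tests[i].Name, tests[i].Profile, best)
			tests[i].Profile = best
		}
		buckets[best] += volumes[tests[i].Name]
	}
	return buckets
}

func main() {
	tests := []test{{"e2e-a", "aws-1"}, {"e2e-b", "aws-1"}, {"e2e-c", "aws-2"}, {"e2e-d", "gcp-1"}}
	volumes := map[string]float64{"e2e-a": 10, "e2e-b": 5, "e2e-c": 3, "e2e-d": 1}
	weights := rebalance([]string{"aws-1", "aws-2"}, tests, volumes)
	fmt.Println(weights, tests)
}
```

Because the assignment is greedy and order-dependent, the result is balanced but not necessarily optimal; a test seen early can end up on a profile that later tests overload.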
+ +### Job name construction +The tool constructs the expected Prow job name for each test to look it up in the Prometheus volume data. The format is: `{periodic|branch|pull}-ci-{org}-{repo}-{branch}[-{variant}]-{testname}`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--profiles` | (required, repeatable) | Comma-separated list of profiles forming one equivalence group. Specify multiple times for multiple groups | +| `--prometheus-days-before` | 14 | Number of days of historical data to query from Prometheus (range [1,15]) | +| `--prometheus-username` | (from PrometheusOptions) | Prometheus basic auth username | +| `--prometheus-password-path` | (from PrometheusOptions) | Path to file containing Prometheus password | +| `--prometheus-bearer-token-path` | (from PrometheusOptions) | Path to file containing Prometheus bearer token | + +## Key files +- `cmd/rebalancer/main.go` -- entry point, profile group parsing, greedy assignment loop, config rewriting +- `pkg/dispatcher/prometheus_volumes.go` -- Prometheus client for job volume queries +- `pkg/dispatcher/prometheus.go` -- `GetJobVolumesFromPrometheus()` query implementation +- `pkg/config/load.go` -- `OperateOnCIOperatorConfigDir()` for walking ci-operator configs + +## Deployment +CLI tool. Typically invoked by `auto-config-brancher` as part of an automated periodic job that proposes changes to the openshift/release repository. Must be run from a directory containing `ci-operator/config/` (the release repo). diff --git a/cmd/registry-replacer/README.md b/cmd/registry-replacer/README.md index 0dc2d3a45d9..4a3189c4ebe 100644 --- a/cmd/registry-replacer/README.md +++ b/cmd/registry-replacer/README.md @@ -1,7 +1,63 @@ -# Registry replacer +# registry-replacer -A small utility used to make sure that all builds use a cluster-local registry. 
It: +## What +Automated tool that ensures all `FROM` directives in Dockerfiles used by ci-operator image builds reference images through the CI build cluster's internal registry rather than external registries (like `registry.ci.openshift.org`). It scans ci-operator configs, fetches the corresponding Dockerfiles from GitHub, extracts registry references, and adds `inputs[].as` replacement directives to the ci-operator config so that images are pulled from the build cluster during CI. +It can also prune unused replacements, prune unused base images, ensure Dockerfiles match the ocp-build-data repo, and optionally create a PR with the changes. + +## How it works -- full flow + +### Per-config processing +For each ci-operator config in `--config-dir`, the tool: + +1. **Skip check**: Skip if the repo is in `--ignore-repos` or the org is in `--ignore-orgs` +2. **Dockerfile alignment** (if `--ensure-correct-promotion-dockerfile`): Update `contextDir` and `dockerfilePath` in image build steps to match what's defined in the [ocp-build-data](https://github.com/openshift/ocp-build-data) repo +3. **Fetch Dockerfiles**: For each image build step, fetch the Dockerfile from GitHub (using HTTP, not git clone, for performance) +4. **Apply existing replacements**: Simulate what the build tools would do -- apply `from` replacements and `inputs[].as` replacements to the Dockerfile +5. **Extract registry references**: Parse the Dockerfile to find all `FROM` directives referencing external registries (e.g., `registry.ci.openshift.org/ocp/builder:...`) +6. **Ensure replacements**: For each external registry reference found, add an `inputs` entry mapping the image to a `base_images` entry so the build cluster pulls it locally +7. **Prune unused replacements** (if `--prune-unused-replacements`): Remove `inputs[].as` entries that don't match any `FROM` directive in the Dockerfile +8. 
**Prune ocp/builder replacements** (if `--prune-ocp-builder-replacements`): Remove replacements targeting `ocp/builder` for configs that promote to `ocp` +9. **Prune unused base images** (if `--prune-unused-base-images`): Resolve the full config (including step registry) and remove `base_images` entries not referenced by any image build, test step, or operator substitution +10. **Write changes**: If the config changed, write it back to disk + +### PR creation (if `--create-pr`) +After processing all configs: +1. Check if any files changed +2. Commit and push to a `registry-replacer` branch on the user's fork +3. Create or update a PR against `openshift/release` with a description of what was changed + +### Concurrency +Processing is parallelized with `--concurrency` (default 500) goroutines via a semaphore, since each config independently fetches its Dockerfile over HTTP. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--config-dir` | (required) | Path to ci-operator config directory | +| `--create-pr` | `false` | Automatically create/update a PR with changes | +| `--github-user-name` | `openshift-bot` | GitHub username for PR creation | +| `--self-approve` | `false` | Add `approved` and `lgtm` labels to the PR | +| `--ensure-correct-promotion-dockerfile` | `false` | Align Dockerfiles with ocp-build-data | +| `--ensure-correct-promotion-dockerfile-ignored-repos` | (none) | Repos to skip for Dockerfile alignment (can repeat) | +| `--ignore-repos` | (none) | Repos to skip entirely (can repeat) | +| `--ignore-orgs` | (none) | Orgs to skip entirely (can repeat) | +| `--concurrency` | `500` | Max concurrent goroutines | +| `--ocp-build-data-repo-dir` | `../ocp-build-data` | Path to ocp-build-data repo | +| `--current-release-major` | `4` | Current release major version | +| `--current-release-minor` | `6` | Current release minor version | +| `--prune-unused-replacements` | `false` | Remove replacements that don't match any FROM directive | +| 
`--prune-ocp-builder-replacements` | `false` | Remove all ocp/builder-targeting replacements | +| `--prune-unused-base-images` | `false` | Remove base_images not referenced anywhere | +| `--apply-replacements` | `true` | Whether to apply Dockerfile replacements (false also disables pruning) | +| `--registry` | (empty) | Path to step registry (needed for `--prune-unused-base-images`) | + +## Key files +- `cmd/registry-replacer/main.go` -- entry point, the `replacer()` function, Dockerfile parsing, replacement logic, PR creation +- `pkg/dockerfile/` -- Dockerfile parsing utilities, registry reference extraction +- `pkg/api/ocpbuilddata/` -- ocp-build-data config loading + +## Deployment +Runs as a periodic CronJob. Fetches Dockerfiles via GitHub HTTP API (not git), so it needs a GitHub token. When creating PRs, it pushes to a fork and uses the GitHub API to create/update the PR. * Finds all ci-operator configs with at least one images directive * Downloads the corresponding Dockerfile * If it has a reference to the api.ci registry, updates the ci-operator config to replace that with a `base_image` diff --git a/cmd/release/README.md b/cmd/release/README.md index 73feee9db66..597b694fc71 100644 --- a/cmd/release/README.md +++ b/cmd/release/README.md @@ -1,8 +1,85 @@ -# `release` - -`release` is a command-line program that can be used to interact with and -extract various types of data from the `openshift/release` repository. - +# release + +## What +Developer CLI for inspecting and querying openshift/release repository data. Built with Cobra subcommands, it can list and print ci-operator configs (with optional registry resolution), Prow job configs, step registry components (steps, chains, workflows with tree display), and cluster profile details. Designed for local use -- not deployed as a service. + +## How it works -- subcommands + +### `release config [paths...]` +Operates on ci-operator configuration files. 
+ +| Flag | What it does | +|---|---| +| `--list` / `-l` | Print file paths only (no contents) | +| `--resolve` / `-r` | Resolve registry references before printing (loads step registry) | +| (default) | Print raw YAML of each config | + +If no paths are given, defaults to `ci-operator/config` under the root directory. Paths can be relative to the config directory. + +### `release job [paths...]` +Operates on Prow job configuration files. + +| Flag | What it does | +|---|---| +| `--list` / `-l` | Print file paths only | +| (default) | Print raw YAML of each job config | + +Defaults to `ci-operator/jobs` under the root directory. + +### `release registry` +Operates on step registry components. Without a subcommand, lists all steps, chains, and workflows with type prefixes. + +#### `release registry step [names...]` +| Flag | What it does | +|---|---| +| `--list` / `-l` | List all step names | +| `--resolve` / `-r` | Print resolved step (for steps, same as unresolved) | +| `--tree` | Print step name in tree format | +| (default) | Print YAML of the step definition | + +#### `release registry chain [names...]` +| Flag | What it does | +|---|---| +| `--list` / `-l` | List all chain names | +| `--resolve` / `-r` | Resolve chain (inline all referenced steps and sub-chains) | +| `--tree` | Display chain hierarchy with indentation | +| (default) | Print YAML of the chain definition | + +#### `release registry workflow [names...]` +| Flag | What it does | +|---|---| +| `--list` / `-l` | List all workflow names | +| `--resolve` / `-r` | Resolve workflow (inline all pre/test/post steps and chains) | +| `--tree` | Display workflow hierarchy: pre/test/post sections with chains and steps indented | +| (default) | Print YAML of the workflow definition | + +### `release profile [names...]` +Lists cluster profiles with their details. If no names given, lists all known profiles. 
For each profile, outputs: +- `profile`: name +- `cluster_type`: associated cloud provider type +- `lease_type`: Boskos lease type + +## Flags (global) + +| Flag | Default | What it controls | +|---|---|---| +| `-C` / `--root-dir` | `.` | Path to the root of the openshift/release repository | +| `--config-dir` | `ci-operator/config` (relative to root) | Override path to ci-operator config directory | +| `--job-config` | `ci-operator/jobs` (relative to root) | Override path to Prow job config directory | +| `--registry` | `ci-operator/step-registry` (relative to root) | Override path to step registry directory | + +## Key files + +- `cmd/release/main.go` -- entry point, Cobra root command with pflag integration +- `pkg/cmd/release/release.go` -- root command definition, global flags, subcommand registration +- `pkg/cmd/release/config.go` -- `config` subcommand: list/print/resolve ci-operator configs +- `pkg/cmd/release/jobs.go` -- `job` subcommand: list/print Prow job configs +- `pkg/cmd/release/registry.go` -- `registry` subcommand and sub-subcommands (step/chain/workflow): list/print/resolve/tree display +- `pkg/cmd/release/profile.go` -- `profile` subcommand: cluster profile listing +- `pkg/cmd/release/util.go` -- shared utilities: path joining, registry loading, YAML printing + +## Deployment +Not deployed. Local developer tool for inspecting release repo data. Install with `go install ./cmd/release` from the ci-tools repo. ## Arguments `release` expects to be executed from the root of the repository. The diff --git a/cmd/repo-brancher/README.md b/cmd/repo-brancher/README.md index cfb199224a0..b6dad55e4d2 100644 --- a/cmd/repo-brancher/README.md +++ b/cmd/repo-brancher/README.md @@ -1,5 +1,64 @@ # repo-brancher +## What +Fast-forwards future OCP release branches to match the current development branch HEAD. 
During the OCP lifecycle, future release branches exist as placeholders that are kept in sync with the development branch until code freeze, when they diverge. This tool performs that continuous synchronization via `git push` of the dev branch content to the future branch. + +## How it works -- full flow + +1. **Discover repos**: Iterates over all ci-operator config files in the config directory (via `promotion.FutureOptions.OperateOnCIOperatorConfigDir`, excluding OKD). Each config that promotes images to the current release identifies a repo and its development branch. + +2. **Skip ignored repos**: Repos or orgs listed in `--ignore` are skipped entirely. + +3. **Determine target branches**: For each repo, calls `promotion.DetermineReleaseBranch()` to map development branches to future release branches: + - `master`/`main` maps to `release-{futureVersion}` + - `openshift-{currentVersion}` maps to `openshift-{futureVersion}` + - If the future branch would be the same as the current branch, it is skipped. + +4. **Clone and push** (per repo): + - Creates a directory under `--git-dir` (or a temp dir). + - Runs `git init` and `git fetch --depth 1` to shallow-clone the development branch. + - For each future branch, runs `git ls-remote` to check if the branch exists. + - If `--confirm` is set, pushes `FETCH_HEAD` to `refs/heads/{futureBranch}`. + +5. **Progressive deepening on push failure**: If a push fails because the remote has diverged (non-fast-forward), the tool progressively deepens the clone: + - Depths increase exponentially: 1, 2, 4, 8, 16, 32, 64, 128 additional commits (via `git fetch --deepen`). + - After 8 attempts, falls back to `git fetch --unshallow` (full history). + - Maximum 9 retry iterations. If all fail, the config is recorded as failed. + +6. **Error handling**: Failed configs are tracked in a set. If any config fails, the tool exits with code 1 after processing all repos. + +7. 
**Retry logic**: Every `git` command is retried up to 3 times with exponential backoff (1s, 2s, 4s) to handle transient network errors. + +8. **Token censoring**: When `--confirm` is set and a token is loaded, log output is censored to prevent token leakage. + +### Authentication +When `--confirm` is set, the tool constructs HTTPS remote URLs with embedded credentials: `https://{username}:{token}@github.com/{org}/{repo}`. The token is read from `--token-path`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--current-release` | (required) | Current OCP development version | +| `--future-release` | (required, repeatable) | Future release versions whose branches should be fast-forwarded | +| `--config-dir` | (required, via `FutureOptions`) | Path to ci-operator config directory | +| `--confirm` | false | Actually push branches (dry-run by default, logs "Would create new branch") | +| `--username` | (required with `--confirm`) | GitHub username for push authentication | +| `--token-path` | (required with `--confirm`) | Path to file containing GitHub token | +| `--git-dir` | (temp dir) | Directory for git operations. If unset, creates and cleans up a temp dir. | +| `--ignore` | (empty, repeatable) | Org or org/repo to skip. Can be passed multiple times. | +| `--current-promotion-namespace` | (empty) | Promotion namespace filter | + +## Key files +- `cmd/repo-brancher/main.go` -- full implementation (single file) +- `pkg/promotion/promotion.go` -- `FutureOptions`, `DetermineReleaseBranch()` + +## Deployment +Runs as the periodic Prow job [`periodic-openshift-release-fast-forward`](https://prow.ci.openshift.org/?job=periodic-openshift-release-fast-forward). Defined in `ci-operator/jobs/infra-periodics.yaml` in the openshift/release repo. + +The `fast-forwarding-config-manager` (under `branchingconfigmanagers`) automatically updates this job's `--current-release` and `--future-release` arguments as the OCP lifecycle progresses. 
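The progressive deepening schedule from step 5 of the flow can be sketched like this. It is a hedged illustration: `deepenFlag` and `maxDeepenAttempts` are hypothetical names, and only the flag portion of the `git fetch` invocation is shown, since the README does not spell out the remaining arguments.

```go
package main

import "fmt"

// maxDeepenAttempts mirrors the "maximum 9 retry iterations" in the README:
// eight exponential deepen steps followed by one full unshallow.
const maxDeepenAttempts = 9

// deepenFlag returns the `git fetch` flag for a zero-based retry attempt, and
// false once the schedule is exhausted.
func deepenFlag(attempt int) (string, bool) {
	switch {
	case attempt < 0 || attempt >= maxDeepenAttempts:
		return "", false
	case attempt == maxDeepenAttempts-1:
		return "--unshallow", true // last resort: fetch full history
	default:
		return fmt.Sprintf("--deepen=%d", 1<<attempt), true // 1, 2, 4, ... 128
	}
}

func main() {
	for attempt := 0; ; attempt++ {
		flag, ok := deepenFlag(attempt)
		if !ok {
			break
		}
		fmt.Printf("attempt %d: git fetch %s\n", attempt+1, flag)
	}
}
```

Doubling the depth keeps the common case cheap (most branches diverge by a handful of commits) while bounding the number of round trips before a full fetch.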
+ +## Related +- `cmd/blocking-issue-creator` -- creates merge-blocker issues on the branches this tool fast-forwards +- `cmd/branchingconfigmanagers/fast-forwarding-config-manager` -- manages this job's version arguments ## What it does The `repo-brancher` automatically fast forwards the git content of future release branches (which diff --git a/cmd/repo-init/README.md b/cmd/repo-init/README.md index 450bfbb5931..c76f3b65fa4 100644 --- a/cmd/repo-init/README.md +++ b/cmd/repo-init/README.md @@ -1,5 +1,101 @@ # repo-init +## What +Interactive onboarding tool for bootstrapping new repositories into OpenShift CI. Generates Prow configuration (Tide queries, plugin config) and ci-operator configuration (build root, tests, images, promotions) from user-provided inputs. Operates in three modes: + +- **CLI**: Interactive terminal prompts for local use +- **API**: REST server for programmatic config generation, validation, and PR creation +- **UI**: Embedded React/PatternFly web wizard (serves static frontend assets) + +## How it works -- full flow + +### CLI mode (`--mode cli`) + +1. **Collect information** interactively (or via `--config` JSON flag): + - Org, repo, branch + - Whether it promotes images, promotes with OpenShift, needs base/OS images + - Go version, build commands, test build commands, canonical Go import path + - Unit/integration test scripts (name, from-image, command) + - End-to-end tests (name, cluster profile, command, workflow, CLI requirement) + - Operator bundle configuration (optional) + - Release type and version for non-promoting repos with e2e tests + +2. **Check for existing config**: If a config already exists at `ci-operator/config//` in the release repo, abort. + +3. 
**Update Prow config** (`updateProwConfig`): + - Load existing Prow config from the release repo + - Check if Tide queries already exist for this org/repo; if so, skip + - Copy Tide queries from a reference repo: `openshift/cluster-version-operator` for OCP components, `openshift/ci-tools` for non-OCP repos + - Replace the org/repo in the copied queries + - Write the per-repo prowconfig YAML to the appropriate sharded path + +4. **Update plugin config** (`updatePluginConfig`): + - Load the plugin config from `core-services/prow/02_config` + - If neither org nor repo has plugins configured, add all plugins from `openshift` + `openshift/origin` + - If org has plugins but repo doesn't, add only the repo-specific plugins missing from org-level + - Add external plugins if not configured at org level + - Add `approve` (self-approval disabled) and `lgtm` (review acts as lgtm) config for the repo + +5. **Generate ci-operator config** (`generateCIOperatorConfig`): + - Build root: Go image from `openshift/release:golang-` + - Base images: `base` (from promotion target) and/or `os` (centos:7) if needed + - Promotion: configure targets matching openshift/origin's namespace/name + - Releases: `initial` and `latest` integration releases for promoting repos + - Tests: container tests from user input, e2e tests with multi-stage configurations + - Operator bundle: OLM operator testing configuration + - Resources: default limits (4Gi memory) and requests (200Mi memory, 100m CPU) + - Write to `ci-operator/config///--.yaml` + +6. **Print replay command**: Output the JSON config so the run can be reproduced non-interactively. 
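As an illustration of the Tide query copying in step 3 of the CLI flow, the sketch below copies every query that covers the reference repo and retargets it at the new repo. The `tideQuery` type is a deliberately reduced stand-in for Prow's real query type, and `referenceRepo` and `copyQueries` are hypothetical names; the actual `updateProwConfig` may differ in how it matches and rewrites queries.

```go
package main

import "fmt"

// tideQuery is a simplified stand-in with only the fields needed here.
type tideQuery struct {
	Repos  []string
	Labels []string
}

// referenceRepo returns the repo whose queries are used as a template, as
// listed in the README.
func referenceRepo(ocpComponent bool) string {
	if ocpComponent {
		return "openshift/cluster-version-operator"
	}
	return "openshift/ci-tools"
}

// copyQueries returns new queries for target modeled on those that include
// reference. It returns nil when target already appears in any query.
func copyQueries(existing []tideQuery, reference, target string) []tideQuery {
	for _, q := range existing {
		for _, repo := range q.Repos {
			if repo == target {
				return nil // queries already exist for this org/repo: skip
			}
		}
	}
	var copied []tideQuery
	for _, q := range existing {
		for _, repo := range q.Repos {
			if repo != reference {
				continue
			}
			copied = append(copied, tideQuery{
				Repos:  []string{target},
				Labels: append([]string(nil), q.Labels...), // copy, do not alias
			})
			break
		}
	}
	return copied
}

func main() {
	existing := []tideQuery{
		{Repos: []string{"openshift/cluster-version-operator"}, Labels: []string{"lgtm", "approved"}},
	}
	// "example-org/example-repo" is a placeholder for the repo being onboarded.
	fmt.Println(copyQueries(existing, referenceRepo(true), "example-org/example-repo"))
}
```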
+ +### API mode (`--mode api`) + +Runs an HTTP server with the following endpoints: + +| Endpoint | Method | What it does | +|---|---|---| +| `POST /api/auth` | POST | OAuth flow: exchange GitHub code for access token, return user info | +| `GET /api/cluster-profiles` | GET | List available cluster profiles | +| `GET /api/configs?org=X&repo=Y` | GET | Load existing ci-operator configs for an org/repo | +| `POST /api/configs` | POST | Generate config from `initConfig` JSON; optionally create PR (`?generatePR=true`) or just convert (`?conversionOnly=true`) | +| `POST /api/config-validations` | POST | Validate partial or full configs (base images, container images, tests, operator bundles, operator substitutions) | +| `GET /api/server-configs` | GET | Return non-secret server config (GitHub client ID, redirect URI) | + +The API server maintains a pool of `--num-repos` (default 4) local clones of openshift/release, with locking to prevent concurrent access conflicts. When generating configs with PR creation: +- Runs `ci-operator-checkconfig`, `ci-operator-prowgen`, and `sanitize-prow-jobs` to mimic `make jobs` +- Pushes changes to the user's fork and returns a PR creation URL + +### UI mode (`--mode ui`) + +Serves an embedded React application (built from `cmd/repo-init/frontend/dist`) as static assets. The UI communicates with the API server for all backend operations. 
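The pool of release-repo clones used in API mode can be pictured as a blocking queue of working directories. The sketch below is an assumption about the general pattern only; the class name, method names, and paths are hypothetical, and the real server's locking may be structured differently.

```python
import queue
from contextlib import contextmanager


class RepoPool:
    """Hands out exclusive access to one of N pre-cloned working copies."""

    def __init__(self, paths):
        self._free = queue.Queue()
        for path in paths:
            self._free.put(path)

    @contextmanager
    def checkout(self, timeout=None):
        # Blocks until a clone is free; raises queue.Empty on timeout.
        path = self._free.get(timeout=timeout)
        try:
            yield path
        finally:
            self._free.put(path)  # always return the clone to the pool
```

A request handler would wrap its config generation in `with pool.checkout() as repo_dir:` so that two concurrent requests never write to the same clone.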
+ +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--mode` | `cli` | Operating mode: `cli`, `api`, or `ui` | +| `--release-repo` | `""` | Path to the root of the openshift/release repository (required for CLI mode) | +| `--config` | `""` | JSON configuration to use instead of interactive prompts (CLI mode) | +| `--port` | `0` | HTTP server port (required for API and UI modes) | +| `--num-repos` | `4` | Number of openshift/release clones to maintain for API mode | +| `--server-config-path` | `""` | Directory containing server config files: `github-client-id`, `github-client-secret`, `github-redirect-uri` | +| `--disable-cors` | `false` | Disable CORS restrictions (for local development) | +| `--loglevel` | `debug` | Log level | +| `--log-style` | `json` | Log format: `json` or `text` | +| GitHub flags | -- | `--github-token-path`, `--github-endpoint`, etc. via `GitHubOptions` | +| Instrumentation flags | -- | `--health-port`, `--metrics-port` via `InstrumentationOptions` | + +## Key files + +- `cmd/repo-init/main.go` -- entry point, CLI mode implementation: interactive prompts, config generation (`generateCIOperatorConfig`), Prow/plugin config updates +- `cmd/repo-init/api.go` -- API server: REST handlers, config validation, PR creation, release repo pool management +- `cmd/repo-init/frontend.go` -- UI server: embedded static asset serving +- `cmd/repo-init/frontend/` -- React/PatternFly web application source + +## Deployment +- **API server**: Deployment on app.ci (`repo-init-apiserver`), `ci` namespace, port 8080. Maintains multiple clones of openshift/release for concurrent requests. +- **UI server**: Deployment on app.ci (`repo-init-ui`), `ci` namespace, port 8080. Serves the React frontend. +- **CLI**: Local developer usage only. The `repo-init` component allows a user to on-board a new repository to the CI Test Platform. 
## CLI @@ -91,4 +187,4 @@ and make pr-deploy-repo-init-ui ``` -After this, you should have a working copy of the `repo-init` component deployed that you can test with. \ No newline at end of file +After this, you should have a working copy of the `repo-init` component deployed that you can test with. diff --git a/cmd/result-aggregator/README.md b/cmd/result-aggregator/README.md new file mode 100644 index 00000000000..bda4e88212b --- /dev/null +++ b/cmd/result-aggregator/README.md @@ -0,0 +1,60 @@ +# result-aggregator + +## What +HTTP server that collects CI failure reasons and pod-scaler resource configuration warnings from ci-operator and pod-scaler, exposing them as Prometheus counters. This is the central collection point for CI error classification metrics, enabling dashboards and alerting on failure patterns. + +## How it works -- full flow + +### Server startup +1. Parse flags and validate: `--passwd-file` is required. +2. Register two Prometheus counter vectors: `ci_operator_error_rate` and `pod_scaler_admission_high_determined_resource`. +3. Set up HTTP routes with basic auth middleware. +4. Start serving metrics on the default Prow metrics port (via `metrics.ExposeMetrics`). +5. Listen on `--address` (default `:8080`) with graceful shutdown support. + +### POST /result -- ci-operator error reporting +ci-operator calls this endpoint at the end of each job run to report success or failure. + +1. Authenticate via HTTP basic auth against the password file. +2. Decode the JSON body into a `results.Request` struct. +3. Validate required fields: `job_name`, `type`, `state`, `reason`, `cluster`. +4. Increment the `ci_operator_error_rate` Prometheus counter with labels: `job_name`, `type` (presubmit/postsubmit/periodic), `state` (succeeded/failed), `reason` (colon-delimited chain of failure reasons), `cluster`. 
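The validate-then-count steps for `/result` can be sketched as below. A `collections.Counter` stands in for the Prometheus counter vector, and the returned status codes and messages are assumptions for illustration; only the required field names and the label set come from the steps above.

```python
from collections import Counter

REQUIRED_FIELDS = ("job_name", "type", "state", "reason", "cluster")

# Stand-in for the `ci_operator_error_rate` counter vector: one count per
# unique label combination.
error_rate = Counter()


def handle_result(body):
    """Validate a decoded /result body and bump the matching counter.

    Returns an (HTTP status, message) pair; the status codes are assumptions.
    """
    missing = [field for field in REQUIRED_FIELDS if not body.get(field)]
    if missing:
        return 400, "missing required fields: " + ", ".join(missing)
    labels = tuple(body[field] for field in REQUIRED_FIELDS)
    error_rate[labels] += 1
    return 200, "ok"
```

Because every distinct `reason` string becomes its own label value, each reason chain produces a separate time series in the real counter.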
+ +### POST /pod-scaler -- resource configuration warnings +pod-scaler calls this endpoint when it determines a higher resource amount than what was configured. + +1. Authenticate via HTTP basic auth. +2. Decode the JSON body into a `results.PodScalerRequest` struct. +3. Validate required fields: `workload_name`, `workload_type`, `configured_amount`, `determined_amount`, `resource_type`. +4. Increment the `pod_scaler_admission_high_determined_resource` counter with labels: `workload_name`, `workload_type`, `configured_amount`, `determined_amount`, `resource_type`, `measured` (true/false), `workload_class`. + +### Authentication +Authentication uses a password file where each line is `username:password` (CSV format, colon-delimited). The `multi` validator wraps one or more `passwdFile` validators, returning true if any delegate validates the credentials. The file is re-read on every request (no caching). + +### Client-side reporting +The `pkg/results` package provides the client side: `Options.Reporter()` creates a `Reporter` that POSTs to the result-aggregator's `/result` endpoint. Each ci-operator invocation calls `reporter.Report(err)` at completion, which extracts the chain of `results.Reason` values from the error and sends one request per reason chain. + +The default server address is `https://result-aggregator-ci.apps.ci.l2s4.p1.openshiftapps.com`. 
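The password-file check described under Authentication can be sketched as follows. The function names are hypothetical, the file contents are passed in through a `read_text` callable so the example needs no real file, and the constant-time comparison is a choice made for this sketch rather than something the description above states.

```python
import hmac


def load_passwd(text):
    """Parse `username:password` lines into a dict, skipping blank lines."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        user, sep, password = line.partition(":")
        if not sep:
            raise ValueError(f"expected username:password, got {line!r}")
        creds[user] = password
    return creds


def passwd_validator(read_text):
    """Build a validator that re-reads the credentials on every call."""
    def validate(user, password):
        expected = load_passwd(read_text()).get(user)
        if expected is None:
            return False
        # Constant-time comparison (a choice made for this sketch).
        return hmac.compare_digest(expected.encode(), password.encode())
    return validate


def multi(*validators):
    """Accept the credentials if any delegate validator accepts them."""
    def validate(user, password):
        return any(v(user, password) for v in validators)
    return validate
```

Re-reading on every call means a rotated Secret takes effect without restarting the server, at the cost of one file read per request.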
+ +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--address` | `:8080` | Address to listen on | +| `--log-level` | `info` | Log level | +| `--gracePeriod` | `10s` | Grace period for server shutdown | +| `--passwd-file` | (required) | Path to file with `username:password` lines for basic auth | + +## Key files +- `cmd/result-aggregator/main.go` -- HTTP server, request handlers, Prometheus counter registration, validation +- `cmd/result-aggregator/validator.go` -- password file parsing (`passwdFile`), multi-validator support +- `pkg/results/report.go` -- client-side `Reporter` and `PodScalerReporter` implementations, `Request` and `PodScalerRequest` types +- `pkg/results/error.go` -- `Error` type with `Reason` chains, `ForReason()` builder, `Reasons()` extractor +- `pkg/results/results.go` -- `Reason` type definition + +## Deployment +Long-lived Deployment on app.ci in the `ci` namespace. Exposed via a Route. The password file is mounted from a Secret. Prometheus scrapes the metrics port. + +## Related +- Every ci-operator invocation reports to this server via the `--report-address` and `--report-credentials-file` flags. +- pod-scaler-admission reports resource warnings via the `/pod-scaler` endpoint. +- Prometheus counter `ci_operator_error_rate` powers CI failure dashboards. diff --git a/cmd/retester/README.md b/cmd/retester/README.md new file mode 100644 index 00000000000..efcfb7212de --- /dev/null +++ b/cmd/retester/README.md @@ -0,0 +1,78 @@ +# retester + +## What +Automated retest controller that periodically scans open PRs matching Tide merge criteria and issues `/retest-required` comments on PRs where required Prow jobs have failed. This eliminates the need for humans to manually babysit flaky test failures. + +It uses a backoff mechanism to avoid infinite retest loops: it tracks how many times each PR has been retested (per PR SHA and per base SHA combination) and eventually puts PRs on hold if they exceed configured limits. 
+ +## How it works -- full flow + +### Periodic sync loop +1. On startup, load the retester config file (org/repo policies) and restore the backoff cache from disk or S3 +2. Every `--interval` (default 1h), run the sync: + - Query GitHub (via Tide's GraphQL queries from Prow config) for all open PRs that match Tide merge criteria but have failing status contexts + - Filter to only PRs in enabled orgs/repos (per retester config) + - Filter to only PRs that have at least one failing context corresponding to a non-optional (required) Prow presubmit + - For each remaining candidate, decide whether to retest, pause, or hold + +### Retest-or-backoff decision +For each candidate PR, the controller: +1. Gets the current base branch SHA +2. Looks up the retester policy for the org/repo (checking repo-level, then org-level, then global defaults) +3. Checks the backoff cache for this PR: + - If `RetestsForPrSha` >= `max_retests_for_sha`: post `/hold` with explanation (PR is consistently failing) + - If `RetestsForBaseSha` >= `max_retests_for_sha_and_base`: pause (wait for a new base SHA from merges) + - Otherwise: post `/retest-required` comment and increment the counters +4. If the PR SHA changes (new push), counters reset for the new SHA +5. If the base SHA changes (something merged into the target branch), the base-SHA counter resets + +### Cache persistence +The backoff cache maps PR keys to retest counts and SHAs. It can be stored as: +- A local JSON file (`--cache-file`) +- An S3 object (`--cache-file-on-s3` with `--cache-file` as the S3 key) + +Cache records expire after `--cache-record-age` (default 168h / 7 days) since the last time they were considered. 
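The retest-or-backoff decision above can be condensed into a small state machine. This is a sketch under assumptions: the `Record` type and function name are hypothetical, and applying the SHA resets before the threshold checks is one plausible ordering, not necessarily the one in `pkg/retester/retester.go`.

```python
from dataclasses import dataclass


@dataclass
class Record:
    pr_sha: str
    base_sha: str
    retests_for_pr_sha: int = 0
    retests_for_base_sha: int = 0


def retest_or_backoff(record, pr_sha, base_sha, max_for_sha, max_for_sha_and_base):
    """Return "hold", "pause", or "retest" and update `record` in place."""
    if record.pr_sha != pr_sha:
        # New push: both counters start over for the new commit.
        record.pr_sha, record.base_sha = pr_sha, base_sha
        record.retests_for_pr_sha = record.retests_for_base_sha = 0
    elif record.base_sha != base_sha:
        # Something merged into the target branch: only the base counter resets.
        record.base_sha = base_sha
        record.retests_for_base_sha = 0

    if record.retests_for_pr_sha >= max_for_sha:
        return "hold"   # consistently failing: stop retesting this commit
    if record.retests_for_base_sha >= max_for_sha_and_base:
        return "pause"  # wait for a new base SHA
    record.retests_for_pr_sha += 1
    record.retests_for_base_sha += 1
    return "retest"
```

With `max_retests_for_sha: 3` and `max_retests_for_sha_and_base: 1`, a PR gets one retest per base SHA and is put on hold once three retests have been spent on the same commit.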
+ +## Configuration hierarchy +The retester config file has three levels, where more specific levels override general ones: + +``` +retester: + enabled: true + max_retests_for_sha: 3 + max_retests_for_sha_and_base: 1 + orgs: + openshift: + enabled: true + max_retests_for_sha: 5 + repos: + origin: + enabled: true + max_retests_for_sha_and_base: 2 +``` + +- `max_retests_for_sha`: maximum total retests for a given PR commit SHA before putting the PR on hold +- `max_retests_for_sha_and_base`: maximum retests for a given PR SHA + base branch SHA combination before pausing (waiting for a new base SHA) +- `enabled`: must be true at some level for the org/repo to be processed. A repo-level `false` overrides an org-level `true`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--dry-run` | `true` | Uses API tokens but does not post comments | +| `--run-once` | `false` | Run a single sync then exit | +| `--interval` | `1h` | Time between sync cycles | +| `--cache-file` | (empty) | Path to persist the backoff cache (local file or S3 key) | +| `--cache-file-on-s3` | `false` | Use AWS S3 for cache persistence instead of local file | +| `--cache-record-age` | `168h` | How long a cache record lives after last consideration | +| `--config-file` | (required) | Path to the retester policy config file | +| `--config-path` | (Prow) | Prow config path (for Tide queries and presubmit definitions) | + +## Key files +- `cmd/retester/main.go` -- entry point, flag parsing, periodic tick loop +- `pkg/retester/retester.go` -- core logic: `Run()`, `findCandidates()`, `retestOrBackoff()`, `enabledPRs()`, config policy resolution +- `pkg/retester/cache.go` -- backoff cache interface and implementations (file-based, S3-based) + +## Deployment +Runs as a long-lived Deployment on app.ci. Exposes Prometheus metrics on the default Prow metrics port. + +Key metric: `retest_total` counter (labels: org, repo) tracking how many `/retest-required` commands have been issued. 
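The three-level lookup from the Configuration hierarchy section can be sketched as "most specific level that sets a key wins". The dict below mirrors the contents under the `retester:` key of the example config; the function name and the assumption that unlisted orgs and repos inherit the global values are illustrative, not confirmed behavior.

```python
CONFIG = {
    "enabled": True,
    "max_retests_for_sha": 3,
    "max_retests_for_sha_and_base": 1,
    "orgs": {
        "openshift": {
            "enabled": True,
            "max_retests_for_sha": 5,
            "repos": {
                "origin": {"enabled": True, "max_retests_for_sha_and_base": 2},
            },
        },
    },
}

KEYS = ("enabled", "max_retests_for_sha", "max_retests_for_sha_and_base")


def resolve_policy(config, org, repo):
    """Resolve each key from the repo level, then the org level, then global."""
    org_cfg = config.get("orgs", {}).get(org, {})
    repo_cfg = org_cfg.get("repos", {}).get(repo, {})
    policy = {}
    for key in KEYS:
        for level in (repo_cfg, org_cfg, config):
            if key in level:
                policy[key] = level[key]
                break
    return policy
```

Under these assumptions `openshift/origin` resolves to 5 retests per SHA (from the org) and 2 per SHA and base (from the repo), and a repo-level `enabled: false` wins over an org-level `true` because the repo level is consulted first.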
diff --git a/cmd/sanitize-prow-jobs/README.md b/cmd/sanitize-prow-jobs/README.md index fcde20d27e4..5378d778af1 100644 --- a/cmd/sanitize-prow-jobs/README.md +++ b/cmd/sanitize-prow-jobs/README.md @@ -1,6 +1,71 @@ -# Sanitize prow jobs +# sanitize-prow-jobs -`sanitize-prow-jobs` is a small tool that: +## What +Deterministically formats Prow job configuration files and assigns clusters to jobs based on dispatcher rules. Unlike the other `determinize-*` tools which only normalize formatting, this tool actively modifies job configurations: it sets the `cluster` field on every job according to the dispatcher's routing logic and normalizes branch regexes on presubmits and postsubmits. +This is the tool that decides **where** each Prow job runs (which build farm cluster). + +## How it works -- full flow + +### Startup +1. Parse flags: `--prow-jobs-dir` (root of Prow job configs), `--config-path` (dispatcher config), `--cluster-config-path` (cluster metadata) +2. Load the dispatcher config from `--config-path` and validate it. This config defines job routing rules: default cluster, SSH bastion cluster, KVM clusters, build farm mappings, job groups, cloud mappings. +3. Load cluster metadata from `--cluster-config-path`, which provides per-cluster info (provider, capacity, capabilities) and returns a set of **blocked** clusters (clusters temporarily unable to accept jobs). + +### Per-file processing +4. If positional arguments are given, they are treated as subdirectories under `--prow-jobs-dir` to process. If none are given, the entire `--prow-jobs-dir` is processed. +5. For each subdirectory, call `sanitizer.DeterminizeJobs()`: + - Walk all `.yaml` files in the directory tree concurrently (producer-consumer pattern) + - For each file: + a. Read and unmarshal the Prow `JobConfig` + b. Apply `defaultJobConfig()` which processes every job in the file + c. 
Marshal and write the normalized YAML back + +### Cluster assignment logic (`determineCluster`) +For each job (presubmit, postsubmit, periodic), the cluster is determined by this priority chain: + +1. **Non-kubernetes agents**: Jobs with agent != `kubernetes` (or empty) get no cluster assignment +2. **vSphere jobs**: Jobs with "vsphere" in the name go to `vsphere02`, unless they have the `vsphere-elastic-poc` cluster profile (those can be relocated) +3. **SSH bastion jobs**: Jobs requiring an SSH bastion go to the configured `sshBastion` cluster +4. **Explicit cluster label**: Jobs with `ci-operator.openshift.io/cluster` label get that cluster directly +5. **Capability-based routing**: Jobs with `capability/*` labels are matched to clusters that have all required capabilities. If `DetermineE2EByJob` is enabled and a cloud mapping exists, prefer clusters from the matching cloud provider. Distribution across matching clusters is deterministic based on `len(filepath) % len(clusters)`. +6. **KVM jobs**: Jobs with the KVM device label go to configured KVM clusters (deterministic distribution) +7. **E2E cloud matching**: If `DetermineE2EByJob` is true, route to build farm clusters matching the job's cloud provider (detected from `ci-operator.openshift.io/cloud` label or `CLUSTER_TYPE` env var, with cloud mapping applied) +8. **No-builds jobs**: Jobs with the no-builds label go to configured `noBuilds` clusters +9. **Job name match**: Explicit job-name-to-cluster mappings in `config.Groups[cluster].Jobs` +10. **Path regex match**: File path regex patterns in `config.Groups[cluster].Paths` +11. **Build farm filename match**: Exact filename matches in `BuildFarm` config (these jobs can be relocated) +12. **Default**: Falls back to `config.Default` + +If a job's determined cluster is in the **blocked** set, the job is relocated to the most-used cluster in the same file (or the default cluster), provided it is marked as relocatable. 
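Two pieces of the routing above lend themselves to a short sketch: the deterministic `len(filepath) % len(clusters)` distribution and the relocation of jobs off blocked clusters. The function names are hypothetical, and both the sorting of candidates and the exclusion of blocked clusters from the most-used count are assumptions made for this example.

```python
from collections import Counter


def pick_cluster(filepath, clusters):
    """Deterministically spread jobs over candidates by file path length."""
    ordered = sorted(clusters)  # assumption: a stable order, so reruns agree
    return ordered[len(filepath) % len(ordered)]


def relocate_if_blocked(cluster, relocatable, file_clusters, blocked, default):
    """Move a relocatable job off a blocked cluster.

    `file_clusters` lists the clusters assigned to the jobs in the same file.
    """
    if cluster not in blocked or not relocatable:
        return cluster
    usable = Counter(c for c in file_clusters if c not in blocked)
    if usable:
        return usable.most_common(1)[0][0]  # most-used cluster in the file
    return default
```

Keying the choice on the file path rather than on a random value means regenerating the jobs produces no diff unless the inputs change.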
+ +If a job already has a valid, non-blocked build farm cluster assigned and no dispatcher data overrides exist, the existing assignment is preserved to avoid churn. + +### Branch regex normalization +- **Presubmits**: Each branch pattern gets two regexes -- an exact match (`^branch$`) and a feature branch match (`^branch-`) -- ensuring presubmits also trigger on feature branches. Patterns that already look like regexes are left unchanged. +- **Postsubmits**: Each branch pattern gets only the exact match regex (`^branch$`), since postsubmits should not trigger on feature branches. + +### ARM64 image substitution +If a job is assigned to the `arm01` cluster and uses the standard `ci-operator:latest` container image (or the quay proxy variant), the image is replaced with `ci-operator-arm64:latest`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-jobs-dir` | (required) | Root directory of Prow job config files (`ci-operator/jobs` in openshift/release) | +| `--config-path` | (required) | Path to the dispatcher config (`core-services/sanitize-prow-jobs/_config.yaml` in openshift/release) | +| `--cluster-config-path` | `core-services/sanitize-prow-jobs/_clusters.yaml` | Path to cluster metadata config with provider info, capacity, capabilities, and blocked clusters | +| `-h` | `false` | Show help | + +Positional arguments after flags are treated as subdirectories of `--prow-jobs-dir` to process. If none given, all of `--prow-jobs-dir` is processed. 
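The branch regex normalization described above can be illustrated with a few lines. This is a sketch, not the code in `pkg/jobconfig/files.go`: the rule for what counts as a "simple" branch name (word characters, dots, and dashes) and the helper names are assumptions.

```python
import re

# Assumption: a "simple" branch name has only word characters, dots, dashes.
SIMPLE_BRANCH = re.compile(r"^[\w.-]+$")


def _quote(branch):
    # Only "." needs escaping among the characters SIMPLE_BRANCH allows.
    return branch.replace(".", r"\.")


def exactly_branch(branch):
    """Exact-match regex, or the input unchanged if it already looks like a regex."""
    return f"^{_quote(branch)}$" if SIMPLE_BRANCH.match(branch) else branch


def feature_branch(branch):
    """Feature-branch prefix regex, or the input unchanged if not simple."""
    return f"^{_quote(branch)}-" if SIMPLE_BRANCH.match(branch) else branch


def normalize(branches, presubmit):
    """Presubmits get exact + feature-branch regexes; postsubmits exact only."""
    out = []
    for branch in branches:
        candidates = [exactly_branch(branch)]
        if presubmit:
            candidates.append(feature_branch(branch))
        for pattern in candidates:
            if pattern not in out:  # keep the result free of duplicates
                out.append(pattern)
    return out
```

For a presubmit on `main` this yields `^main$` and `^main-`, so a feature branch such as `main-my-feature` also triggers the job, while a postsubmit on `main` only gets `^main$`.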
+ +## Key files +- `cmd/sanitize-prow-jobs/main.go` -- entry point, flag parsing, subdirectory iteration +- `pkg/sanitizer/determinize.go` -- `DeterminizeJobs()` walks files concurrently, `defaultJobConfig()` applies cluster assignment and branch normalization per job +- `pkg/dispatcher/config.go` -- `Config` struct and `DetermineClusterForJob()` with the full cluster routing priority chain +- `pkg/dispatcher/helpers.go` -- `LoadClusterConfig()`, `FindMostUsedCluster()`, `DetermineTargetCluster()` for blocked cluster relocation +- `pkg/jobconfig/files.go` -- `ExactlyBranch()` and `FeatureBranch()` for branch regex generation + +## Deployment +CLI tool. Run as part of the config generation pipeline in openshift/release (via `make jobs` or `auto-config-brancher`). Ensures all generated Prow jobs have deterministic cluster assignments and formatting. * Makes sure all jobs are formatted the same way to keep diffs small * Applies defaults to them diff --git a/cmd/serviceaccount-secret-rotation-trigger/README.md b/cmd/serviceaccount-secret-rotation-trigger/README.md index ffcc91396ec..7d99f923ba2 100644 --- a/cmd/serviceaccount-secret-rotation-trigger/README.md +++ b/cmd/serviceaccount-secret-rotation-trigger/README.md @@ -1,5 +1,39 @@ # serviceaccount-secret-rotation-trigger -A small tool that will take a list of namespaces and: -* Add a TTL annotation to all serviceaccount secrets in them with a value of now + 24h: This will trigger the `serviceaccount_secret_refresher` controller to delete them as soon as that TTL is in the past -* Will update all ServiceAccounts in those namespaces to have empty Secrets and ImagePullSecrets fields: This will trigger an immediate recreation of those secrets by the corresponding minters +## What +One-shot tool that triggers ServiceAccount (SA) token secret rotation across multiple clusters and namespaces. 
It forces the Kubernetes control plane to regenerate SA token secrets by adding TTL annotations to existing secrets and clearing SA secret references, which causes the `serviceaccount_secret_refresher` controller to rotate them. + +## How it works -- full flow + +1. **Load kubeconfigs.** Loads multi-cluster kubeconfigs via Prow's `KubernetesOptions` (defaults to no in-cluster config). Sets QPS to 50 and burst to 500 per cluster for high-throughput operations. Constructs a controller-runtime client for each cluster. If `--dry-run` is true (default), wraps each client in a dry-run decorator. + - Clusters that fail client construction are skipped with a warning (non-fatal). + - If no clients are available at all, exits fatally. + +2. **Process each namespace on each cluster.** Launches goroutines in parallel across all cluster/namespace combinations using `errgroup`. + +3. **Add TTL annotations to SA secrets.** For each namespace: + - Lists all secrets in the namespace. + - Filters for secrets that have the `kubernetes.io/service-account.uid` annotation (SA token secrets) but do NOT already have the TTL annotation (`servicesecret.openshift.io/expiry-time`). + - For each matching secret, patches it to add a TTL annotation set to `now + 24 hours` (RFC3339 format). + - The `serviceaccount_secret_refresher` controller watches for this annotation and handles the actual rotation. + +4. **Clear SA secret references.** For each namespace: + - Lists all ServiceAccounts. + - For each SA, patches it to clear both `secrets` and `imagePullSecrets` lists to `nil`. + - This forces the Kubernetes control plane to re-create the default token secret and image pull secrets for the ServiceAccount. + +All operations within a namespace are parallelized using `errgroup`. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--namespace` | (required) | Namespace to process. Repeatable (must pass at least one). 
| +| `--dry-run` | `true` | When true, uses dry-run client (no actual changes). | +| Prow kubernetes flags | -- | Multi-cluster kubeconfig flags. Default: no in-cluster config. | + +## Key files +- `cmd/serviceaccount-secret-rotation-trigger/main.go` -- all logic: client setup, secret TTL annotation, SA reference clearing +- `pkg/controller/serviceaccount_secret_refresher/` -- the controller that watches for the TTL annotation and performs the actual secret rotation (this tool just triggers it) + +## Deployment +Runs as a periodic Prow job or via manual invocation. Targets specific namespaces across all configured build clusters. Typically used when SA token secrets need to be rotated (e.g., credential compromise, certificate renewal). diff --git a/cmd/sippy-config-generator/README.md b/cmd/sippy-config-generator/README.md index 48b401a287f..15d1171dd33 100644 --- a/cmd/sippy-config-generator/README.md +++ b/cmd/sippy-config-generator/README.md @@ -1,7 +1,71 @@ -# Updating Sippy configuration +# sippy-config-generator -This utility updates the Sippy configuration based on our job annotations and [release-controller configuration][release-controller]. +## What +CLI tool that generates Sippy monitoring configuration by combining release controller job definitions with Prow periodic job metadata. Sippy uses this config to know which jobs belong to which OpenShift release and whether they are blocking or informing. Output is YAML written to stdout. +## How it works -- full flow + +### 1. Load customization file (optional) +If `--customization-file` is provided, reads a YAML file containing a pre-populated `SippyConfig` struct. This allows manually specified releases, regexp patterns, or job overrides to be preserved and merged with the generated output. + +### 2. Parse release controller configuration +Walks the `--release-config` directory for JSON files. 
For each release controller config, extracts the `verify` map: +- Optional jobs go into the `informingJobs` set +- Non-optional jobs go into the `blockingJobs` set +- Jobs with `AggregatedProwJob` settings generate aggregate job names in the format `{verifyName}-{aggregateProwJobName}` (defaulting to `release-openshift-release-analysis-aggregator`) + +### 3. Load and sort Prow periodic jobs +Reads all Prow job configs from `--prow-jobs-dir` and sorts them alphabetically by name for deterministic output. + +### 4. Build release config +For each periodic job that has a `job-release` label: +- Determines the release version from the label value +- If the job name contains `-okd`, appends `-okd` to the release name (e.g. `4.14-okd`) +- Adds the job to `releases[version].jobs` map (value `true`) +- If the job has aggregate jobs, adds those too +- If the job is in the `blockingJobs` set, appends it to `releases[version].blockingJobs` +- If the job is in the `informingJobs` set or matches `IsSpecialInformingJobOnTestGrid()`, appends it to `releases[version].informingJobs` + +### 5. Output +Prints a YAML header comment with the generation timestamp, then marshals and prints the complete `SippyConfig` struct to stdout. + +### SippyConfig structure +```yaml +prow: + url: +releases: + "4.16": + jobs: + periodic-ci-openshift-release-master-nightly-4.16-e2e-aws: true + ... + regexp: [] + blockingJobs: + - periodic-ci-openshift-release-master-nightly-4.16-e2e-aws + informingJobs: + - periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-upgrade +``` + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-jobs-dir` | (required) | Path to Prow job config directory (`ci-operator/jobs` in openshift/release) | +| `--release-config` | (required) | Path to release controller configuration directory | +| `--customization-file` | (none) | Path to YAML file with additional config to merge (e.g. 
manually maintained regexp patterns) | + +## Key files +- `cmd/sippy-config-generator/main.go` -- all logic: flag parsing, release config loading, Prow job iteration, YAML output +- `pkg/api/sippy/v1/types.go` -- `SippyConfig`, `ReleaseConfig`, `ProwConfig` type definitions +- `pkg/util/testgrid.go` -- `IsSpecialInformingJobOnTestGrid()` shared with testgrid-config-generator +- `pkg/jobconfig/files.go` -- `ReadFromDir()` for loading Prow job configs +- `pkg/release/config/` -- release controller config types + +## Deployment +CLI tool. Run as part of a periodic Prow job or manually. Output is piped/redirected to a config file consumed by the Sippy service. + +## Related +- Sippy service: monitors CI job health and payload readiness +- `cmd/testgrid-config-generator` -- similar input processing but different output format +- The `IsSpecialInformingJobOnTestGrid()` function is shared between sippy-config-generator and testgrid-config-generator Example invocation: ``` diff --git a/cmd/slack-bot/README.md b/cmd/slack-bot/README.md index 61f577a6ad2..90448853aaa 100644 --- a/cmd/slack-bot/README.md +++ b/cmd/slack-bot/README.md @@ -1,13 +1,73 @@ -# Slack-bot +# slack-bot -This is a Slack bot that helps facilitate common tasks like reporting issues. -Currently, the bot can do the following: -- When the bot is explicitly mentioned in a message (`@DPTP bot`), it lists all available actions it knows how to do, like file a bug, request a consultation, and more. -- When a specific job link is included in a message, the bot responds with helpful information related to that job. -- In the `CoreOS` slack space, when someone tags `@dptp-helpdesk` in the `forum-ocp-testplatform` channel, the bot sends an automatic reply containing helpful basic information in a new thread. 
-- Support-request mode (enabled by default in `#forum-ocp-testplatform`): if a thread exceeds `--support-request-threshold` messages (default `12`), the bot creates a Jira issue in `DPTP`, posts the link in the thread, and closes that Jira with `Done` when `:closed:` is added to the root thread message. +## What +Interactive Slack bot for OpenShift CI operations. It listens for Slack events (messages, app mentions) and interaction callbacks (shortcuts, modal submissions) and provides helpdesk workflows, Jira issue filing, keyword auto-responses, job link enrichment, and FAQ management. -# Local testing +## How it works -- full flow + +### HTTP server +The bot runs an HTTP server with two Slack-facing endpoints: +- `/slack/events-endpoint` -- receives Slack Events API callbacks (messages, app mentions) +- `/slack/interactive-endpoint` -- receives Slack interaction payloads (shortcut triggers, modal view submissions, block actions) + +Both endpoints verify request authenticity using the Slack signing secret before processing. + +### Event handling +Events are routed through a `MultiHandler` that dispatches to these handlers in order: + +1. **helpdesk.MessageHandler** -- monitors the forum channel (`#forum-ocp-testplatform`) for messages. When `--require-workflows-in-forum` is enabled, it directs users to use the Slack workflow for new posts. Also responds to keyword matches from the keywords config. +2. **helpdesk.FAQHandler** -- manages FAQ items stored as Kubernetes ConfigMaps in the `ci` namespace. Authorized users (from the `test-platform-ci-admins` group) can create, update, and delete FAQ entries via thread reactions. +3. **mention.Handler** -- when the bot is @-mentioned, it responds with contextual suggestions for interactive workflows (bug report, consultation, enhancement, incident, helpdesk, triage) based on the phrasing used. +4. 
**joblink.Handler** -- detects Prow job URLs in messages and posts enriched information: job status, GCS artifact links, ci-operator config metadata. + +### Interaction handling +Interactions are routed through a modal router that manages Slack modal flows for: +- **Bug** -- file a Jira bug report +- **Consultation** -- request a consultation +- **Enhancement** -- file an enhancement request +- **Helpdesk** -- file a helpdesk ticket +- **Incident** -- report an incident +- **Triage** -- triage an issue + +Each modal flow is triggered via Slack shortcuts or message button presses and progresses through multi-step modal views, ultimately creating Jira issues in the DPTP project. + +### Response pattern +Events are always acknowledged with HTTP 200 immediately. Actual handling runs in a background goroutine so Slack's 3-second timeout is never exceeded. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--port` | `8888` | HTTP server listen port | +| `--log-level` | `info` | Log output level | +| `--grace-period` | `180s` | Graceful shutdown duration | +| `--slack-token-path` | (required) | Path to file containing the Slack bot token | +| `--slack-signing-secret-path` | (required) | Path to file containing the Slack signing secret | +| `--keywords-config-path` | (empty) | Path to the keywords auto-response config file | +| `--helpdesk-alias` | `@dptp-helpdesk` | Alias for the helpdesk user(s) | +| `--forum-channel-id` | `CBN38N3MW` | Slack channel ID for `#forum-ocp-testplatform` | +| `--review-request-workflow-id` | `B06T46F374N` | Slack workflow ID for review requests | +| `--namespace` | `ci` | Kubernetes namespace for storing helpdesk FAQ ConfigMaps | +| `--require-workflows-in-forum` | `true` | Require use of Slack workflows in the forum channel | +| `--prow-config-path` | (Prow) | Path to Prow config (for job link enrichment) | +| `--prow-job-config-path` | (Prow) | Path to Prow job configs | +| `--jira-*` | (various) | Jira connection options 
for issue filing | + +## Key files +- `cmd/slack-bot/main.go` -- entry point, HTTP server setup, request verification +- `pkg/slack/events/router/router.go` -- event handler multiplexer +- `pkg/slack/events/helpdesk/helpdesk-message.go` -- forum channel message handling, keyword responses +- `pkg/slack/events/helpdesk/helpdesk-faq.go` -- FAQ management via ConfigMaps +- `pkg/slack/events/mention/mention.go` -- @-mention response with workflow suggestions +- `pkg/slack/events/joblink/link.go` -- Prow job URL detection and enrichment +- `pkg/slack/interactions/router/router.go` -- interaction callback router for modal flows +- `pkg/slack/modals/bug/`, `consultation/`, `enhancement/`, `helpdesk/`, `incident/`, `triage/` -- individual modal flow implementations +- `pkg/jira/issue_filer.go` -- Jira issue creation backend + +## Deployment +Long-lived Deployment on app.ci in the `ci` namespace. Requires in-cluster access for Kubernetes API (FAQ ConfigMap storage) and `userv1` scheme (authorized user lookup). Uses GCS client (unauthenticated, read-only) for job artifact lookups. + +Slack app must be configured to send Events API and Interactivity payloads to this service's endpoints. +## Local testing There is an alpha instance of Slack Bot running on the app.ci cluster that you can use for testing by running a mitmproxy and reverse tunneling requests to your local machine. - Make sure to join the `dptp-robot-testing` slack space. diff --git a/cmd/sprint-automation/README.md b/cmd/sprint-automation/README.md index 3426e9ebec9..250f927cf77 100644 --- a/cmd/sprint-automation/README.md +++ b/cmd/sprint-automation/README.md @@ -1,13 +1,67 @@ -# Sprint-automation +# sprint-automation -Test Platform's daily helper. 
Utilizes slack and pager duty to do things such as: -- Remind `team-dp-testplatform` of our rotating positions, and cards awaiting acceptance -- Send the daily intake digest to the intake role -- Send reminders about next week's roles -- Ensure that our aliases are staffed -- Remind triage of necessary upgrades +## What +Daily automated digest tool for the DPTP (Developer Productivity Test Platform) team. It queries PagerDuty for on-call assignments, posts team digests to Slack, manages Slack user group membership for rotating roles, handles Jira intake ticket assignment, and optionally notifies about build cluster upgrades. Designed to run as a daily CronJob. -# Local testing +## How it works -- full flow + +### On every run (daily) + +#### 1. Resolve on-call users from PagerDuty +Queries PagerDuty for three rotating roles using schedule names: +- `@dptp-triage Primary` (schedule query: "DPTP Primary On-Call") +- `@dptp-helpdesk` (schedule query: "DPTP Help Desk") +- `@dptp-intake` (schedule query: "DPTP Intake") + +For each schedule, it queries who is on-call between 8:00-21:00 UTC. If multiple users are returned (indicating an override), it resolves the override user. Each PagerDuty user is then mapped to their Slack ID via email lookup. + +#### 2. Post team digest to Slack +Posts a message to the `team-dp-testplatform` private channel containing: +- Today's rotating positions (triage, helpdesk, intake) with Slack user mentions +- Links to role manuals and team definition docs +- Cards awaiting acceptance: Jira issues in the DPTP project with status "Review", grouped by assignee + +#### 3. Ensure Slack user group membership +Updates the `@dptp-triage` and `@dptp-helpdesk` Slack user groups to contain exactly the on-call user for that role. + +#### 4. 
Assign and send intake digest +- Queries Jira for unassigned DPTP issues created in the last 30 days with status "To Do" (excluding sub-tasks and issues labeled `ready` or `no-intake`) +- Auto-assigns all matching issues to the current intake person (looked up via their email in Jira) +- Sends a DM to the intake person listing the issues they need to review + +### On Monday runs only (`--week-start`) + +#### 5. Send next week's role digest +Queries PagerDuty for on-call assignments one week from now. DMs each user about their upcoming roles so they can prepare. + +#### 6. Notify triage of handover +DMs the current triage engineer a link to the Triage Handover Document with a reminder to review ongoing incidents. + +### Build cluster upgrade notification (`--enable-build02-upgrade-notification`) +Compares the OCP version of build01 and build02 clusters: +- If build01's version has been stable (Z-stream: 1 day soak; Y-stream: 7 day soak) and is newer than build02 +- Posts a notification to `alerts-testplatform-build-farms` channel with the `oc adm upgrade` command + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--log-level` | `info` | Log output level | +| `--slack-token-path` | (required) | Path to the Slack bot token file | +| `--week-start` | `false` | Enable Monday-only activities (next week roles, triage handover) | +| `--enable-build02-upgrade-notification` | `false` | Check if build02 needs upgrading to match build01 | +| `--jira-*` | (various) | Jira connection options | +| `--pagerduty-*` | (various) | PagerDuty API options | +| `--kubeconfig` | (various) | Kubeconfig for build01 and build02 clusters | + +## Key files +- `cmd/sprint-automation/main.go` -- entry point, PagerDuty queries, Slack digest posting, user group management, Jira intake, cluster version comparison +- `cmd/sprint-automation/jira_search_v3.go` -- paginated Jira JQL search using v3 REST API + +## Deployment +Runs as a periodic CronJob on app.ci. 
Two instances: one daily (all activities), one on Mondays only (with `--week-start`). + +Requires kubeconfigs for build01 and build02 clusters (for upgrade notification), PagerDuty API credentials, Jira credentials, and a Slack bot token. +## Local testing You can test out `sprint-automation` utilizing the `dptp-robot-testing` and the `hack/local-sprint-automation.sh` script: - Make sure to join the `dptp-robot-testing` slack space. - Run the `hack/local-sprint-automation.sh` script like so: `RELEASE_REPO_DIR= bash hack/local-sprint-automation.sh` diff --git a/cmd/sync-rover-groups/README.md b/cmd/sync-rover-groups/README.md index 1aa60aad0b2..8cea34553fc 100644 --- a/cmd/sync-rover-groups/README.md +++ b/cmd/sync-rover-groups/README.md @@ -1,5 +1,74 @@ # sync-rover-groups +## What +Resolves Red Hat LDAP Rover group memberships by scanning Kubernetes manifest directories for group references and querying the corporate LDAP server to resolve each group's member list. Outputs two YAML files consumed by `github-ldap-user-group-creator`: one with resolved group memberships and one with GitHub-to-Kerberos user mappings. + +This is the first half of the group sync pipeline. It runs on the Red Hat intranet (PSI cluster) because it needs direct LDAP access. + +## How it works -- full flow + +### Group name collection +1. Walk each `--manifest-dir` directory recursively, parsing every `.yaml` file (skipping symlinks, `_`-prefixed dirs/files). +2. Decode each YAML document using the Kubernetes codec. Supported resource types: + - **RoleBinding / ClusterRoleBinding**: extract group names from `subjects` where `kind: Group` (ignoring `system:` prefixed groups and template variables `${...}`) + - **List**: recurse into list items + - **Template**: recurse into template objects (tolerates template processing errors like `${{REPLICAS}}`) + - **Group** (userv1): detected but only used for validation mode +3. 
Always include `ci-admins` in the group set regardless of what was found in manifests. +4. If a group config file is provided, add any groups defined there and remove groups that have been renamed (the original name is resolved instead). + +### Validation mode (`--validate-subjects`) +When run as a presubmit (no intranet access), the tool validates manifests without connecting to LDAP: +- Ensures no `User` subjects appear in RoleBindings/ClusterRoleBindings +- Ensures no `Group` resources are created +- Does not resolve groups or generate output files + +### Group resolution +For each collected group name, query LDAP: +- Filter: `(&(objectClass=rhatRoverGroup)(cn=))` +- Base DN: `dc=redhat,dc=com` +- Extract `uniqueMember` attributes, parse UIDs from DNs (`uid=,ou=users,...`) +- Groups not found in LDAP are logged as warnings and skipped (except `ci-admins`, which is fatal) +- `ci-admins` must have at least 3 members + +### GitHub user collection (`--github-users-file`) +When this flag is set, additionally query LDAP for all users with a GitHub social URL: +- Filter: `(rhatSocialURL=GitHub*)` +- Base DN: `ou=users,dc=redhat,dc=com` +- Extract `uid`, `rhatSocialURL` (parsed to get GitHub username), `rhatCostCenter` +- Write the result as a YAML array of `rover.User` objects + +### Config printing (`--print-config`) +If `--print-config` is set with `--config-file`, print the normalized config (sorted, no comments) and exit. 
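The `uniqueMember` parsing described under group resolution can be illustrated with a minimal Go sketch. The helper name `uidFromDN` and the sample DNs are hypothetical and are not taken from `ldap.go`. The sketch splits on plain commas, so it ignores the escaped commas that a full DN parser would handle.

```go
package main

import (
	"fmt"
	"strings"
)

// uidFromDN returns the value of the first "uid=" component of an LDAP DN
// such as "uid=jdoe,ou=users,dc=redhat,dc=com".
// Hypothetical helper: it splits on plain commas and does not handle
// escaped commas inside attribute values.
func uidFromDN(dn string) (string, bool) {
	for _, part := range strings.Split(dn, ",") {
		part = strings.TrimSpace(part)
		if strings.HasPrefix(part, "uid=") {
			uid := strings.TrimPrefix(part, "uid=")
			if uid != "" {
				return uid, true
			}
		}
	}
	return "", false
}

func main() {
	// Sample uniqueMember values (made up for illustration).
	members := []string{
		"uid=jdoe,ou=users,dc=redhat,dc=com",
		"cn=not-a-user,ou=groups,dc=redhat,dc=com",
	}
	for _, dn := range members {
		if uid, ok := uidFromDN(dn); ok {
			fmt.Println("member:", uid)
		} else {
			fmt.Println("skipping DN without uid:", dn)
		}
	}
}
```

Returning a boolean alongside the UID lets the caller decide whether a member without a `uid` component should be skipped or treated as an error.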
+ +## Flags + +| Flag | Default | What it controls | +|---|---|---| +| `--manifest-dir` | (required, repeatable) | Directories containing Kubernetes manifests to scan for group references | +| `--groups-file` | `/tmp/groups.yaml` | Output path for resolved group memberships YAML | +| `--github-users-file` | `""` | Output path for GitHub-to-Kerberos user mapping YAML | +| `--config-file` | `""` | Group config YAML for cluster targeting and renaming | +| `--ldap-server` | `ldap.corp.redhat.com` | LDAP server hostname | +| `--validate-subjects` | `false` | Run in validation mode (no LDAP, check manifests only) | +| `--print-config` | `false` | Print normalized config and exit | +| `--log-level` | `info` | Log verbosity | + +## Key files + +- `cmd/sync-rover-groups/main.go` -- entry point, option parsing, orchestration (`roverGroups()`) +- `cmd/sync-rover-groups/ldap.go` -- `ldapGroupResolver`: LDAP queries for group resolution and GitHub user collection, `getGitHubID()` parser +- `cmd/sync-rover-groups/yamlgroupcollector.go` -- `yamlGroupCollector`: walks manifest directories, decodes YAML, extracts group names from RBAC subjects +- `pkg/group/config.go` -- `Config`, `GroupConfig` types + +## Deployment +Runs as a CronJob on the PSI cluster (`ocp-test-platform--runtime-int` namespace) because it requires Red Hat intranet access to reach the LDAP server. Its output (`groups.yaml`, GitHub users file) is stored in a ConfigMap (`sync-rover-groups` in the `ci` namespace) and consumed by `github-ldap-user-group-creator`. + +When run as a presubmit with `--validate-subjects`, it checks that manifests don't create Groups or use User subjects directly (since the presubmit environment has no intranet access). 
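The `User` subject check in validation mode might look roughly like the following Go sketch. The `subject` type and the `validateSubjects` function are hypothetical stand-ins for the RBAC types and the tool's own logic, and the real check may apply exceptions that are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// subject holds the two RBAC subject fields this sketch needs.
// Hypothetical type; the real tool works on decoded Kubernetes objects.
type subject struct {
	Kind string
	Name string
}

// validateSubjects returns an error naming every subject of kind "User".
// Group and ServiceAccount subjects are accepted.
func validateSubjects(subjects []subject) error {
	var users []string
	for _, s := range subjects {
		if s.Kind == "User" {
			users = append(users, s.Name)
		}
	}
	if len(users) > 0 {
		return fmt.Errorf("User subjects are not allowed: %s", strings.Join(users, ", "))
	}
	return nil
}

func main() {
	// Made-up subjects, as they might appear in a RoleBinding.
	subjects := []subject{
		{Kind: "Group", Name: "some-rover-group"},
		{Kind: "User", Name: "jdoe"},
	}
	if err := validateSubjects(subjects); err != nil {
		fmt.Println("validation failed:", err)
	}
}
```

Collecting all offending names before returning gives a presubmit author the full list in one run instead of one failure at a time.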
+ +## Related +- `cmd/github-ldap-user-group-creator` -- downstream: consumes the output files +- `pkg/rover/types.go` -- `User` type used for the GitHub users file ## What it does `sync-rover-groups` is a tool to resolve the groups in [the manifests](https://github.com/openshift/release/tree/main/clusters) of CI clusters diff --git a/cmd/testgrid-config-generator/README.md b/cmd/testgrid-config-generator/README.md index c43555efe23..f8c2a6b165e 100644 --- a/cmd/testgrid-config-generator/README.md +++ b/cmd/testgrid-config-generator/README.md @@ -1,7 +1,80 @@ -# Updating TestGrid configuration +# testgrid-config-generator -This utility updates the TestGrid configuration based on our job annotations and [release-controller configuration][release-controller]. +## What +CLI tool that generates TestGrid dashboard configuration files from Prow periodic job definitions and release controller configuration. It determines which jobs appear on which dashboards (blocking, informing, broken) and for which OpenShift version, producing YAML files consumable by the TestGrid service. +## How it works -- full flow + +### 1. Load release controller configuration +Walks the `--release-config` directory for JSON files. For each release controller config, extracts the `verify` map to determine job classifications: +- Non-optional, non-upgrade jobs are `blocking` +- Optional jobs are `informing` +- Upgrade jobs are `informing` +- Jobs with `AggregatedProwJob` settings generate aggregate dashboard entries (prefixed with the verify name) + +### 2. Load and validate the allow-list +Reads the `--allow-list` YAML file, which maps job names to override classifications. Valid values: `informing`, `broken`, `generic-informing`, `osde2e`, `olm`. The value `blocking` is forbidden in the allow-list (blocking status must come from the release controller config). Jobs present in both the allow-list and the release controller config as blocking cause a fatal error. + +### 3. 
Load Prow periodic jobs +Reads all Prow job configs from `--prow-jobs-dir` via `jobconfig.ReadFromDir()`. + +### 4. Assign jobs to dashboards +For each periodic job, `addDashboardTab()` determines dashboard placement: + +**Classification priority:** +1. Allow-list override (if present) +2. Release controller config (blocking/informing) +3. Special informing prefixes (hardcoded list in `pkg/util/testgrid.go`) +4. Layered product interop patterns (`-lp-interop`, `-lp-rosa-hypershift`, `-lp-rosa-classic`, `CSPI-QE-MSI`) +5. If none match, the job is excluded from TestGrid + +**Dashboard naming:** `redhat-openshift-{stream}-release-{version}-{role}` where: +- Stream is determined by job name patterns: `ocp`, `okd`, `lp-interop`, `lp-rosa-hypershift`, `lp-rosa-classic`, `CSPI-QE-MSI` +- Version comes from the `job-release` Prow label, or is extracted from the job name via `-X.Y-` regex +- Role is `blocking`, `informing`, or `broken` + +**Generic dashboards** (no version): `redhat-openshift-informing`, `redhat-openshift-osd`, `redhat-openshift-olm` + +**Retention tuning:** for jobs running at 12h+ intervals, `daysOfResults` is calculated to show ~100 entries, capped at 7-60 days. + +### 5. 
Write output files +- Updates the `groups.yaml` file in `--testgrid-config` to add/remove dashboard names from the `redhat` dashboard group +- Writes one YAML file per dashboard: `{dashboard-name}.yaml` containing `TestGroup` and `Dashboard` definitions +- Removes stale dashboard YAML files that are no longer generated + +### Dashboard tab defaults +Each dashboard tab includes: +- Open test template linking to Prow +- File bug template linking to Bugzilla with pre-filled fields (classification: Red Hat, product: OpenShift Container Platform) +- Results URL template linking to Prow job history +- Code search linking to github.com/openshift/origin +- Base filter options excluding Monitor and operator template test noise + +### Validation-only mode +With `--validate`, the tool only validates the allow-list entries (no Prow jobs dir or TestGrid output needed) and exits. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--prow-jobs-dir` | (required unless `--validate`) | Path to Prow job config directory (`ci-operator/jobs` in openshift/release) | +| `--release-config` | (required) | Path to release controller configuration directory | +| `--testgrid-config` | (required unless `--validate`) | Path to TestGrid configuration output directory | +| `--allow-list` | (required) | Path to YAML file with job classification overrides | +| `--validate` | false | Only validate the allow-list, skip generation | +| `--google-storage-bucket` | `test-platform-results` | GCS bucket for test artifact links | + +## Key files +- `cmd/testgrid-config-generator/main.go` -- all logic: flag parsing, release config loading, allow-list validation, dashboard generation, file output +- `pkg/util/testgrid.go` -- `IsSpecialInformingJobOnTestGrid()` with hardcoded prefix list +- `pkg/release/config/` -- release controller config types +- `pkg/jobconfig/files.go` -- `ReadFromDir()` for loading Prow job configs + +## Deployment +CLI tool. 
Not run directly in production -- instead invoked by `auto-testgrid-generator` which wraps it and creates PRs against kubernetes/test-infra. + +## Related +- `cmd/auto-testgrid-generator` -- orchestrates this tool and creates PRs +- TestGrid dashboards: `https://testgrid.k8s.io/redhat-openshift-*` Blocking jobs are those that signal widespread failure of the platform. These are traditionally the core end-to-end test runs on our major platforms and upgrades from previous versions. Informing jobs are a broader suite that tests the variety of environments and configurations our customers expect. Broken jobs are those that have a known, triaged failure that prevents their function for a sustained period of time (more than a week). The release config and the job annotation combine to determine the dashboard. If a job in the release definition is an upgrade job it goes into diff --git a/cmd/vault-secret-collection-manager/README.md b/cmd/vault-secret-collection-manager/README.md index 4effc027ec0..4c2aeaff858 100644 --- a/cmd/vault-secret-collection-manager/README.md +++ b/cmd/vault-secret-collection-manager/README.md @@ -1,5 +1,100 @@ -# Vault secret collection manager +# vault-secret-collection-manager +## What +Self-service web application for creating and managing isolated Vault secret collections with policy-based access control. Teams can create named collections, add/remove members, and store secrets -- all without DPTP intervention. Each collection is backed by a Vault KV path, a Vault identity group, and a Vault policy that grants the group CRUD access to that path. + +## How it works -- full flow + +### Authentication +The service expects an OAuth2 proxy (e.g., oauth2-proxy) in front. User identity is extracted from the `X-Forwarded-Email` header (the part before `@`). Requests without this header are rejected with HTTP 400. + +### Secret collection lifecycle + +#### Creating a collection (`PUT /secretcollection/:name`) +1.
Validates the name matches `^[a-z0-9-]+$`. +2. Checks for an existing Vault group with the prefixed name (`secret-collection-manager-managed-`). If it exists, returns 409 Conflict (idempotency is on group creation, not policy). +3. Looks up the requesting user by alias in Vault. If the user does not exist, creates a new Vault identity entity and alias for them. +4. Creates a Vault policy granting: + - `list`, `delete` on `secret/metadata/self-managed//*` + - `create`, `update`, `read` on `secret/data/self-managed//*` +5. Creates a Vault identity group named `secret-collection-manager-managed-` with the requesting user as the sole member and the policy attached. +6. Creates a placeholder KV entry at `secret/self-managed//placeholder` so the collection is visible in the Vault web console (bypasses the censoring plugin's minimum-length rules). + +#### Listing collections (`GET /secretcollection`) +1. Looks up the user by alias. Creates the user if they do not exist yet. +2. Iterates the user's group memberships, filtering for groups prefixed with `secret-collection-manager-managed`. +3. For each matching group, reads the group's policy to extract the collection path and resolves member names from entity IDs. +4. Returns a sorted JSON array of `{name, path, members}` objects. +5. If `?ui=true` is set, renders the embedded HTML template instead of raw JSON. + +#### Updating members (`PUT /secretcollection/:name/members`) +1. Verifies the requesting user is a member of the collection (otherwise 404). +2. Accepts a JSON body: `{"members": ["user1", "user2"]}`. At least one member required. +3. Resolves each member name to a Vault entity ID (via alias lookup). +4. Calls `UpdateGroupMembers` to replace the group's member list. + +#### Deleting a collection (`DELETE /secretcollection/:name`) +1. Verifies the requesting user is a member. +2. Lists all KV entries under the collection path recursively. +3. Irreversibly destroys each KV entry (not just soft-delete). +4. 
Deletes the Vault identity group. + +### Policy reconciliation +Every hour, the manager reconciles all policies prefixed with `secret-collection-manager-managed`: +1. Lists all Vault policies. +2. For each managed policy, compares the current policy document against the expected one (re-derived from the collection name). +3. Updates any policies that have drifted. +4. Logs which policies were reconciled. + +This runs at startup and then hourly via `interrupts.TickLiteral`. + +### User management +- Users are auto-created on first access: a Vault identity entity is created with a `default` policy, and an alias is created linking the username to the configured auth backend (default: `oidc`). +- User and group lookups are cached in memory (`idNameCache`) to avoid redundant Vault API calls. The cache maps both name-to-ID and ID-to-name. + +### Embedded frontend +The service embeds static assets (`style.css`, `index.js`, `index.template.html`) via Go's `//go:embed` directive. The root path `/` redirects to `/secretcollection?ui=true`. + +### Endpoints + +| Method | Path | What it does | +|---|---|---| +| `GET` | `/` | Redirects to `/secretcollection?ui=true` | +| `GET` | `/style.css` | Serves embedded CSS | +| `GET` | `/index.js` | Serves embedded JavaScript | +| `GET` | `/healthz` | Health check (200 OK) | +| `GET` | `/secretcollection` | List collections for the authenticated user. `?ui=true` returns HTML. 
| +| `PUT` | `/secretcollection/:name` | Create a new collection | +| `PUT` | `/secretcollection/:name/members` | Update collection members | +| `DELETE` | `/secretcollection/:name` | Delete a collection and all its secrets | +| `GET` | `/users` | List all Vault user aliases | + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--kv-store-prefix` | `secret/self-managed` | Vault KV folder under which all collections are created | +| `--listen-addr` | `127.0.0.1:8080` | Address to listen on | +| `--vault-addr` | `http://127.0.0.1:8300` | Upstream Vault address | +| `--vault-token` | `""` | Privileged Vault token (mutually exclusive with `--vault-role`) | +| `--vault-role` | `""` | Vault role for Kubernetes SA auth (mutually exclusive with `--vault-token`) | +| `--auth-backend-type` | `oidc` | Vault auth backend type used for user authentication | +| Instrumentation flags | -- | `--health-port`, `--metrics-port` (standard Prow instrumentation) | + +## Key files +- `cmd/vault-secret-collection-manager/main.go` -- server setup, all HTTP handlers, collection CRUD, policy reconciliation, user/group management +- `cmd/vault-secret-collection-manager/types.go` -- data types: `secretCollection`, `managedVaultPolicy`, request/response bodies +- `cmd/vault-secret-collection-manager/middleware.go` -- logging, instrumentation (Prometheus histograms), UUID-based request tracking +- `cmd/vault-secret-collection-manager/frontend.go` -- embedded static assets via `//go:embed` +- `pkg/vaultclient/` -- Vault client library (identity, group, KV, policy operations) + +## Deployment +Long-lived Deployment on app.ci, namespace `ci`. Fronted by an OAuth2 proxy that handles authentication and sets the `X-Forwarded-Email` header. Exposes a health endpoint on the instrumentation health port and Prometheus metrics on the metrics port. 
+ +If using `--vault-role`, authenticates to Vault via Kubernetes service account auth and monitors credential expiry (health check returns 500 when expired). + +## Related +- `cmd/vault-subpath-proxy` -- reverse proxy that complements this by enabling subpath discovery in the Vault UI +- ci-docs: `how-tos/adding-a-new-secret-to-ci.md` ## Description A webservice that allows to manage secret collections in Vault. A secret collection is diff --git a/cmd/vault-subpath-proxy/README.md b/cmd/vault-subpath-proxy/README.md index 81f9af4460c..e3d1bd4e679 100644 --- a/cmd/vault-subpath-proxy/README.md +++ b/cmd/vault-subpath-proxy/README.md @@ -1,11 +1,83 @@ -# Vault Subpath Proxy +# vault-subpath-proxy -A small proxy that we run in front of Vault. It solves the problem of "If user has no list perm in parent directory, -user can't find any subdirectory they might have access to.". To do so it: -* Checks if a request got a 403 -* If so, it will call out to the `/v1/sys/internal/ui/resultant-acl` endpoint, using the provided token -* That endpoint returns effective permissions, so if a user has access to a subpath, it is there -* If any result is found this way, it will be added to the response and the status code is changed from 403 to 200 +## What +Reverse proxy for Vault that adds two capabilities the upstream Vault server lacks: +1. **Subpath discovery:** when a user gets a 403 on a KV metadata list request, the proxy inspects their effective ACL permissions and injects accessible subpaths into the response, making folders visible in the Vault UI even without list permission on the parent. +2. **KV write validation and secret sync:** intercepts KV write requests (PUT/POST/PATCH/DELETE), validates secret keys against Kubernetes naming rules, checks for conflicting keys across secrets targeting the same namespace/name, and synchronously syncs the secret data to Kubernetes clusters. +## How it works -- full flow + +### Subpath injection (ModifyResponse) +1. 
The proxy forwards all requests to the upstream Vault server. +2. On the response path, if the response is a **403** to a **GET** request on a KV **metadata** path with `?list=true`: + - Reads and buffers the original response body. + - Checks if Vault already returned data (non-empty `keys` array). If so, does nothing. + - Extracts the user's Vault token from the `X-Vault-Token` header. + - Calls Vault's `/v1/sys/internal/ui/resultant-acl` API with the user's token to get their effective permissions. + - Scans the glob paths in the ACL result for paths that: + - Start with the requested folder's metadata prefix + - End with `/` (indicating a folder) + - Have the `list` capability + - If any matching folders are found, replaces the 403 response with a 200 containing the discovered subpaths as `keys`. + +### KV write validation and sync (RoundTripper) +For PUT/POST/PATCH requests to KV paths: + +1. **Key validation.** Reads and parses the request body as `{"data": {...}}`. For each key: + - `secretsync/target-namespace`: validates each comma-separated namespace is a valid DNS-1123 label. + - `secretsync/target-name`: validates as DNS-1123 label. + - `secretsync/target-cluster`: skipped (no validation needed). + - All other keys: must match `^[a-zA-Z0-9\.\-_]+$`. + - Validates the secret can be used in a CI step (`ci_validation.ValidateSecretInStep`). + +2. **Conflict detection.** If a privileged Vault client is configured: + - On first request, populates a key cache by listing all KV entries recursively and building a map of `{namespace, name} -> set of keys` and `vaultPath -> [{namespace, name, key}]`. + - For each key in the request, checks if the same key is already claimed by a *different* Vault secret targeting the same Kubernetes namespace/name. This prevents two Vault entries from writing conflicting keys to the same Kubernetes Secret. + - The cache is updated after successful writes and deletes. + +3. 
**Upstream forwarding.** If validation passes, forwards the request to Vault. + +4. **Secret sync.** After a successful write (2xx response), asynchronously syncs the secret data to all Kubernetes clusters: + - Iterates all configured Kubernetes clients. + - For each cluster that the secret targets (per `secretsync/target-cluster`), and for each target namespace (comma-separated): + - Creates or updates a Kubernetes Secret with the data keys from the Vault entry. + - Uses `controllerutil.CreateOrUpdate` with retry on conflict. + - Sync has a 5-minute timeout per operation. + +For DELETE requests: +- Forwards to Vault, then clears the key cache entry for the deleted path. + +### TLS support +Supports serving over TLS with certificate hot-reloading: the cert/key pair is reloaded from disk every hour. + +### Kubeconfig hot-reloading +Kubernetes clients are loaded at startup and reloaded via fsnotify when the kubeconfig file changes. Client access is protected by a read-write mutex. + +## Flags +| Flag | Default | What it controls | +|---|---|---| +| `--vault-addr` | `http://127.0.0.1:8300` | Upstream Vault address | +| `--kv-mount-path` | `secret` | KV secret engine mount path | +| `--listen-addr` | `127.0.0.1:8400` | Proxy listen address | +| `--tls-cert-file` | `""` | TLS certificate file path (requires `--tls-key-file`) | +| `--tls-key-file` | `""` | TLS key file path (requires `--tls-cert-file`) | +| `--vault-token` | `""` | Privileged Vault token for conflict detection (mutually exclusive with `--vault-role`) | +| `--vault-role` | `""` | Vault role for Kubernetes SA auth for conflict detection (mutually exclusive with `--vault-token`) | +| Prow kubernetes flags | -- | Multi-cluster kubeconfig for secret sync (`--kubeconfig`, etc.). Default: no in-cluster config. 
| + +## Key files +- `cmd/vault-subpath-proxy/main.go` -- server setup, reverse proxy creation, subpath injection logic, TLS reloader, kubeconfig loading +- `cmd/vault-subpath-proxy/kv_update_transport.go` -- KV write interception: key validation, conflict detection, key cache management, Kubernetes secret sync + +## Deployment +Runs as a sidecar container inside the `vault` StatefulSet on app.ci (namespace `vault`, 3 replicas), not a standalone Deployment. Listens on port 8300 for TLS termination, forwarding to Vault at `127.0.0.1:8200`. Requires: +- Network access to the upstream Vault server. +- A privileged Vault token or role with read access to the entire KV store (for conflict detection). +- Kubeconfig access to build clusters (for secret sync). +- TLS cert/key if serving HTTPS. + +## Related +- `cmd/vault-secret-collection-manager` -- the UI that users interact with to manage secret collections; this proxy sits between the Vault UI/CLI and the actual Vault server. +- `pkg/api/vault/` -- constants for secret sync target keys (`secretsync/target-namespace`, etc.) Careful: the `resultant-acl` API is internal and undocumented, and no stability guarantee is provided. Ideally, this functionality will be included in Vault itself one day.
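The ACL scan in the subpath injection step can be sketched in Go as below. This is an illustration under assumptions, not the code in `main.go`: the function name `discoverSubpaths`, the shape of the glob path map, and the choice to return only the first path segment below the requested folder are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// discoverSubpaths scans ACL glob paths (path -> capabilities) and returns
// the folders directly below metadataPrefix that carry the "list" capability.
// Hypothetical helper; reducing deeper paths to their first segment is an
// assumption made for this sketch.
func discoverSubpaths(globPaths map[string][]string, metadataPrefix string) []string {
	if !strings.HasSuffix(metadataPrefix, "/") {
		metadataPrefix += "/"
	}
	seen := map[string]struct{}{}
	for path, capabilities := range globPaths {
		// Must be below the requested folder and denote a folder itself.
		if !strings.HasPrefix(path, metadataPrefix) || !strings.HasSuffix(path, "/") {
			continue
		}
		if !contains(capabilities, "list") {
			continue
		}
		rel := strings.TrimPrefix(path, metadataPrefix)
		if rel == "" {
			continue // the requested folder itself, not a subpath
		}
		first := strings.SplitN(rel, "/", 2)[0] + "/"
		seen[first] = struct{}{}
	}
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output regardless of map iteration order
	return keys
}

// contains reports whether want is present in values.
func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	// Made-up ACL result for illustration.
	acl := map[string][]string{
		"secret/metadata/self-managed/team-a/": {"list", "delete"},
		"secret/metadata/self-managed/team-b/": {"read"},
		"secret/data/self-managed/team-a/*":    {"create", "update", "read"},
	}
	fmt.Println(discoverSubpaths(acl, "secret/metadata/self-managed"))
}
```

Sorting the result keeps the injected `keys` deterministic, which makes the proxy's responses easier to compare in tests.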