[NVIDIA-850] Move signed precompiled drivers to registry.stage.redhat.io by josecastillolema · Pull Request #79860 · openshift/release

josecastillolema · 2026-05-29T08:51:37Z

Updates OpenShift CI configuration for the NVIDIA GPU Operator E2E pipelines to use signed precompiled drivers from the Red Hat staging registry and to add an opt-in step that merges staging registry credentials into the cluster pull secret.

Opted for the credential-merging strategy to avoid handling secrets; it takes only about 50 seconds and needs no reboot.

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>

coderabbitai · 2026-05-29T08:52:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3fba693a-0656-49f2-ad45-d82990d61cdf

📥 Commits

Reviewing files that changed from the base of the PR and between 34ae982 and 12e7fa9.

📒 Files selected for processing (1)

ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.yaml

Walkthrough

Adds a CI step that conditionally merges Red Hat staging registry credentials into the cluster pull secret, inserts that step into the NVIDIA GPU Operator E2E workflow, and enables it in the job while switching the GPU driver repository to the staging registry.

Changes

Staging Registry Credentials Integration

Layer / File(s)	Summary
Step definition, metadata, and OWNERS `ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.yaml`, `ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.metadata.json`, `ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/OWNERS`	Adds the `nvidia-gpu-operator-merge-stage-credentials` step registry YAML, a metadata JSON pointing to the ref, and an OWNERS file with approvers/reviewers.
Merge script implementation `ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-commands.sh`	Implements a bash script that, when `MERGE_STAGE_REGISTRY_CREDENTIALS=="true"`, reads `/var/run/vault/mirror-registry/registry_stage.json`, merges `registry.stage.redhat.io` auth into the extracted `.dockerconfigjson` using `jq`, updates `openshift-config/pull-secret`, and polls MCP worker rollout (`Updating` then `Updated`).
Workflow insertion and job enablement `ci-operator/step-registry/nvidia-gpu-operator/e2e-aws/nvidia-gpu-operator-e2e-aws-workflow.yaml`, `ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml`	Inserts the merge step between `aws-secureboot-verify` and `gpu-operator-e2e` in the workflow, sets `MERGE_STAGE_REGISTRY_CREDENTIALS: "true"` in the job, and updates `NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH` driver `repository` to `registry.stage.redhat.io/nvidia`.
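
The merge described in the table above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name `merge_auths` is hypothetical, and it assumes that `registry_stage.json` already has the dockerconfigjson shape (`{"auths": {...}}`), which the conversation does not confirm. Passing the credentials to `jq` by file path keeps them off the command line.

```shell
#!/usr/bin/env bash

# merge_auths CLUSTER_JSON STAGE_JSON   (hypothetical helper)
# Prints CLUSTER_JSON with the "auths" entries of STAGE_JSON merged in.
# Entries from STAGE_JSON win on key collisions; all other registries are kept.
# Assumption: both files look like {"auths": {"<registry>": {"auth": "..."}}}.
merge_auths() {
  local cluster_file="$1" stage_file="$2"
  # --slurpfile binds $stage to an array holding the JSON values in the file,
  # so the credential never appears as a command-line argument.
  jq -c --slurpfile stage "$stage_file" '.auths += $stage[0].auths' "$cluster_file"
}

# Possible usage against a cluster (assumed flow, shown as comments only):
#   workdir=$(mktemp -d)
#   oc extract secret/pull-secret -n openshift-config --to="$workdir" --confirm
#   merge_auths "$workdir/.dockerconfigjson" \
#     /var/run/vault/mirror-registry/registry_stage.json > "$workdir/merged.json"
#   oc set data secret/pull-secret -n openshift-config \
#     --from-file=.dockerconfigjson="$workdir/merged.json"
```

Using `+=` on the `auths` object leaves the existing registry entries untouched and only adds or overwrites the staging entry.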

Sequence Diagram(s)

sequenceDiagram
  participant CIJob as CI Job Step
  participant Vault as Vault (registry_stage.json)
  participant Script as merge-stage script
  participant K8sAPI as Kubernetes API (pull-secret)
  participant MCP as MachineConfigPool (worker)

  CIJob->>Script: invoke nvidia-gpu-operator-merge-stage-credentials-commands.sh
  Script->>Vault: read /var/run/vault/mirror-registry/registry_stage.json
  Script->>K8sAPI: get secret/openshift-config/pull-secret
  Script->>Script: merge creds into .dockerconfigjson (jq)
  Script->>K8sAPI: apply updated pull-secret
  K8sAPI->>MCP: trigger worker rollout
  Script->>MCP: poll MachineConfigPool status until Updated
  MCP->>Script: report rollout completion

Possibly Related PRs

openshift/release#79757: Updates driver configuration fields in the same job YAML, directly related to the staging registry driver repository change.

Suggested Labels

rehearsals-ack, tide/merge-method-squash

Suggested Reviewers

empovit
wabouhamad

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 15

✅ Passed checks (15 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main change: moving signed precompiled NVIDIA drivers to registry.stage.redhat.io, which aligns with the driver repository URL changes and new stage credentials handling across the modified CI configuration files.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR contains no Ginkgo test definitions. Changes limited to CI/operator config files and bash scripts for credential management. Check not applicable.
Test Structure And Quality	✅ Passed	PR contains no Ginkgo test code; only CI/CD YAML configs, bash scripts, and metadata files. Custom check for test code structure is not applicable.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added in this PR. Changes are CI/build configuration, workflow definitions, and credential management scripts only.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests are added in this PR. All changes are CI/operator configuration and utility scripts (YAML, JSON, OWNERS, bash). The check does not apply.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies CI configuration files only (job configs, workflows, bash script); no deployment manifests, operator code, or pod scheduling constraints introduced.
Ote Binary Stdout Contract	✅ Passed	PR only modifies CI configuration files (YAML, JSON, bash scripts, OWNERS). No Go code or OTE binaries are changed, so the stdout contract check is not applicable.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests (It(), Describe(), Context(), etc.) are present in this PR. Changes are CI infrastructure files (YAML configs, bash scripts, metadata) only.
No-Weak-Crypto	✅ Passed	No weak crypto patterns found. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or unsafe secret comparisons detected across the modified CI/CD YAML and bash script files.
Container-Privileges	✅ Passed	No privileged container configurations found. No privileged: true, hostPID/Network/IPC, SYS_ADMIN capability, or allowPrivilegeEscalation settings in any modified files.
No-Sensitive-Data-In-Logs	✅ Passed	Bash script disables debug tracing (set +x) before loading credentials and re-enables after; no echo statements expose passwords, tokens, or secrets.
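
The tracing guard that the last check refers to could look something like the sketch below. The helper name `with_tracing_off` is hypothetical and this is not the code from the PR; it only illustrates the pattern of disabling `xtrace`, running the sensitive command, and restoring the previous state afterwards.

```shell
#!/usr/bin/env bash

# with_tracing_off CMD [ARGS...]   (hypothetical helper)
# Runs CMD with xtrace disabled, then restores the previous xtrace state
# and returns CMD's exit status.
# Note: if tracing is on, the call line itself is still traced, so pass file
# paths as arguments and never the secret value itself.
with_tracing_off() {
  local was_tracing=false rc=0
  case "$-" in *x*) was_tracing=true ;; esac
  set +x
  "$@" || rc=$?
  if [ "$was_tracing" = true ]; then set -x; fi
  return "$rc"
}

# Example: every command that expands the credential stays inside the guard.
# use_credentials is a placeholder for the real work (read file, merge, apply).
#   use_credentials() { local auth; auth=$(cat "$1"); [ -n "$auth" ]; }
#   with_tracing_off use_credentials /var/run/vault/mirror-registry/registry_stage.json
```

The important design point is scope: a variable that holds a credential is only safe while tracing is off, so any later command that expands it has to run before tracing is restored.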

openshift-ci · 2026-05-29T08:52:24Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: josecastillolema
Once this PR has been reviewed and has the lgtm label, please assign empovit for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

~~ci-operator/config/rh-ecosystem-edge/nvidia-ci/OWNERS~~ [josecastillolema]
ci-operator/step-registry/nvidia-gpu-operator/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-commands.sh`:
- Around line 38-51: The polling loop uses COUNTER with a limit of 600s (while [
$COUNTER -lt 600 ]) which matches the step timeout and can cause the script to
be killed before the final diagnostics (oc get mcp worker -o yaml) run; fix by
either reducing the in-script poll budget (e.g., change the loop limit from 600
to a smaller value like 540 so diagnostic branch can run) or by increasing the
CI step timeout/grace_period to >600s (e.g., 900s) so the existing loop can
complete and the oc get mcp worker -o yaml/exit 1 diagnostics will be executed.
Ensure you update the same loop condition (COUNTER and while [ $COUNTER -lt 600
]) or the CI step timeout accordingly.
- Around line 21-29: The script restores tracing too early causing the base64
credential in stage_registry_auth to be exposed during the jq merge; keep
tracing disabled until after the jq command that references stage_registry_auth
completes. Modify the flow around WAS_TRACING/set +x and the jq invocation (the
variables WAS_TRACING, set +x, stage_registry_auth, and the jq --argjson ...
'.auths |= . + $stage' invocation) so that set -x (if WAS_TRACING was true) is
only executed after the jq merge has finished and the sensitive variable is no
longer expanded.
- Around line 35-47: The current loop uses updatedMachineCount vs machineCount
which can be equal before an actual update and may be empty; change the logic to
poll MCP conditions instead: first wait for the worker MCP's Updating condition
to become "True" (use oc get mcp worker jsonpath for .status.conditions where
type=="Updating"), then wait for the worker MCP's Updated condition to become
"True" (check .status.conditions where type=="Updated"), treating missing
condition values as not-True and keeping the existing timeout/interval mechanism
(the same COUNTER/sleep loop). Ensure you replace references to updated/total
count checks with these two condition checks and exit 0 only after Updated=True,
and error/exit nonzero on timeout.
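
A rough sketch of the two-phase condition polling suggested in the comments above, with a poll budget kept below the step timeout so that diagnostics can still run. The function names and the default values (540 s budget, 10 s interval) are illustrative assumptions, not the values used in the PR. One caveat of this approach: if the rollout finishes between two polls, the `Updating=True` phase can be missed entirely.

```shell
#!/usr/bin/env bash

# mcp_condition TYPE   (hypothetical helper)
# Prints the status of the given condition on the worker MachineConfigPool,
# or nothing if the condition is missing or the API call fails.
mcp_condition() {
  oc get mcp worker \
    -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}" 2>/dev/null || true
}

# wait_for_rollout [BUDGET_SECONDS] [INTERVAL_SECONDS]   (hypothetical helper)
# Waits for Updating=True, then for Updated=True, sharing one time budget.
# A missing or empty condition value is treated as not-True.
wait_for_rollout() {
  local budget="${1:-540}" interval="${2:-10}" elapsed=0 phase
  for phase in Updating Updated; do
    until [ "$(mcp_condition "$phase")" = "True" ]; do
      if [ "$elapsed" -ge "$budget" ]; then
        echo "Timed out waiting for MCP ${phase}=True after ${elapsed}s" >&2
        return 1
      fi
      sleep "$interval"
      elapsed=$((elapsed + interval))
    done
    echo "MCP ${phase}=True (${elapsed}s elapsed)"
  done
}

# Possible usage: dump the pool state for diagnostics before failing the step.
#   wait_for_rollout 540 10 || { oc get mcp worker -o yaml; exit 1; }
```

Keeping the budget below the step timeout leaves room for the diagnostic `oc get mcp worker -o yaml` call to complete before the step is killed.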

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: eb181fa4-beb9-4c31-a396-ff4df3fc760e

📥 Commits

Reviewing files that changed from the base of the PR and between 322149b and 7cba171.

📒 Files selected for processing (6)

ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml
ci-operator/step-registry/nvidia-gpu-operator/e2e-aws/nvidia-gpu-operator-e2e-aws-workflow.yaml
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/OWNERS
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-commands.sh
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.metadata.json
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.yaml

josecastillolema · 2026-05-29T09:51:55Z

/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver

openshift-merge-bot · 2026-05-29T09:51:57Z

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-merge-bot · 2026-05-29T12:06:59Z

[REHEARSALNOTIFIER]
@josecastillolema: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver	rh-ecosystem-edge/nvidia-ci	presubmit	Ci-operator config changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.15-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.15-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.15-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.15-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.22-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.22-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.22-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.22-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-5.0-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-5.0-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-5.0-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-5.0-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.19-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.19-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.19-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.19-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.14-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.14-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.14-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.14-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-master	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-10-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-26-3-x	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-arm64	rh-ecosystem-edge/nvidia-ci	presubmit	Registry content changed

A total of 45 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

josecastillolema · 2026-05-29T12:15:41Z

/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver

openshift-merge-bot · 2026-05-29T12:15:44Z

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

josecastillolema · 2026-05-29T13:57:07Z

Looking good!

Credential merge step completed in 50s:

Extracting current cluster pull secret...
Merging registry.stage.redhat.io credentials...
Updating cluster pull secret...
secret/pull-secret data updated
MCP Updating=True (20s elapsed)
MCP Updated=True (20s elapsed)
MCP rollout complete.

Driver image pulled from registry.stage.redhat.io (not quay.io):

nvidia-driver-daemonset-5.14.0-570.112.1.el9.6-rhel9.6-djp5p:
  registry.stage.redhat.io/nvidia/gpu-driver-rhel9:580.159.03-5.14.0-570.112.1.el9_6.x86_64-rhel9.6

ClusterPolicy confirms precompiled drivers:

usePrecompiled: true
repository: registry.stage.redhat.io/nvidia
image: gpu-driver-rhel9
version: 580.159.03
Status: ready, reconciled at 13:30:16Z (~4 min after creation)

josecastillolema · 2026-05-29T13:57:25Z

/pj-rehearse ack

openshift-merge-bot · 2026-05-29T13:57:29Z

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

josecastillolema · 2026-05-29T13:57:53Z

PTAL @empovit @ShiraEzra

josecastillolema · 2026-05-29T13:58:12Z

/assign @empovit @ShiraEzra

josecastillolema · 2026-05-29T13:59:26Z

/retest-failed

josecastillolema · 2026-05-29T14:00:31Z

/retest

openshift-ci · 2026-05-29T14:07:10Z

@josecastillolema: all tests passed!

[NVIDIA-850] Move signed precompiled drivers to registry.stage.redhat.io

7cba171

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>

openshift-ci Bot requested review from empovit and ggordaniRed May 29, 2026 08:52

coderabbitai Bot reviewed May 29, 2026

josecastillolema added 2 commits May 29, 2026 11:28

Fix possible credential leak

a438e10

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>

Adjust timeout and machineCount wait condition

34ae982

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>

Switch from cli to cli-jq image

12e7fa9

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 29, 2026

openshift-ci Bot assigned empovit and ShiraEzra May 29, 2026
