[NVIDIA-850] Move signed precompiled drivers to registry.stage.redhat.io#79860
josecastillolema wants to merge 4 commits into
Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>
|
No actionable comments were generated in the recent review. 🎉
Walkthrough
Adds a CI step that conditionally merges Red Hat staging registry credentials into the cluster pull secret, inserts that step into the NVIDIA GPU Operator E2E workflow, and enables it in the job while switching the GPU driver repository to the staging registry.
Changes
Staging Registry Credentials Integration
Sequence Diagram(s)
sequenceDiagram
participant CIJob as CI Job Step
participant Vault as Vault (registry_stage.json)
participant Script as merge-stage script
participant K8sAPI as Kubernetes API (pull-secret)
participant MCP as MachineConfigPool (worker)
CIJob->>Script: invoke nvidia-gpu-operator-merge-stage-credentials-commands.sh
Script->>Vault: read /var/run/vault/mirror-registry/registry_stage.json
Script->>K8sAPI: get secret/openshift-config/pull-secret
Script->>Script: merge creds into .dockerconfigjson (jq)
Script->>K8sAPI: apply updated pull-secret
K8sAPI->>MCP: trigger worker rollout
Script->>MCP: poll MachineConfigPool status until Updated
MCP->>Script: report rollout completion
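The merge step in the diagram (Script->>Script) could be sketched roughly as follows. This is a minimal sketch, not the script from this PR: the function name merge_stage_auths is hypothetical, and it assumes registry_stage.json holds a JSON object keyed by registry host (for example {"registry.stage.redhat.io": {"auth": "..."}}). Shell tracing is switched off for as long as the credential is held in a variable and restored only after jq has finished.

```shell
# merge_stage_auths PULL_SECRET_FILE STAGE_CREDS_FILE
# Prints the docker config from PULL_SECRET_FILE with the entries of
# STAGE_CREDS_FILE merged into .auths.
# Assumption: STAGE_CREDS_FILE is a JSON object keyed by registry host.
merge_stage_auths() {
  local pull_secret_file=$1 stage_creds_file=$2
  local was_tracing=false rc=0 stage_registry_auth

  # "$-" lists the active shell options; "x" means xtrace (set -x) is on.
  case "$-" in *x*) was_tracing=true ;; esac
  set +x  # nothing below is traced until tracing is restored

  stage_registry_auth=$(cat "$stage_creds_file") || rc=$?
  if [ "$rc" -eq 0 ]; then
    jq --argjson stage "$stage_registry_auth" \
      '.auths |= . + $stage' "$pull_secret_file" || rc=$?
  fi
  unset stage_registry_auth

  # Restore tracing only after the credential is no longer expanded.
  if [ "$was_tracing" = true ]; then set -x; fi
  return "$rc"
}
```

Both arguments are file paths, so the call itself is safe to trace, for example: merge_stage_auths current.json /var/run/vault/mirror-registry/registry_stage.json > merged.json (current.json and merged.json are placeholder names).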
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 15 passed
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: josecastillolema
|
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-commands.sh`:
- Around line 38-51: The polling loop uses COUNTER with a limit of 600s (while [
$COUNTER -lt 600 ]) which matches the step timeout and can cause the script to
be killed before the final diagnostics (oc get mcp worker -o yaml) run; fix by
either reducing the in-script poll budget (e.g., change the loop limit from 600
to a smaller value like 540 so diagnostic branch can run) or by increasing the
CI step timeout/grace_period to >600s (e.g., 900s) so the existing loop can
complete and the oc get mcp worker -o yaml/exit 1 diagnostics will be executed.
Ensure you update the same loop condition (COUNTER and while [ $COUNTER -lt 600
]) or the CI step timeout accordingly.
- Around line 21-29: The script restores tracing too early causing the base64
credential in stage_registry_auth to be exposed during the jq merge; keep
tracing disabled until after the jq command that references stage_registry_auth
completes. Modify the flow around WAS_TRACING/set +x and the jq invocation (the
variables WAS_TRACING, set +x, stage_registry_auth, and the jq --argjson ...
'.auths |= . + $stage' invocation) so that set -x (if WAS_TRACING was true) is
only executed after the jq merge has finished and the sensitive variable is no
longer expanded.
- Around line 35-47: The current loop uses updatedMachineCount vs machineCount
which can be equal before an actual update and may be empty; change the logic to
poll MCP conditions instead: first wait for the worker MCP's Updating condition
to become "True" (use oc get mcp worker jsonpath for .status.conditions where
type=="Updating"), then wait for the worker MCP's Updated condition to become
"True" (check .status.conditions where type=="Updated"), treating missing
condition values as not-True and keeping the existing timeout/interval mechanism
(the same COUNTER/sleep loop). Ensure you replace references to updated/total
count checks with these two condition checks and exit 0 only after Updated=True,
and error/exit nonzero on timeout.
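The polling changes suggested in the comments above (a poll budget below the 600s step timeout, and MCP conditions instead of machine counts) could look roughly like this. It is a sketch under assumptions, not the PR's code: the function names mcp_condition and wait_for_worker_rollout are hypothetical, the 540s budget and 10s interval are example values taken from the suggestion, and the budget counts sleep time only, as a COUNTER-style loop would.

```shell
# mcp_condition POOL TYPE
# Prints the status ("True", "False", or nothing) of one MachineConfigPool
# condition. Errors from oc are swallowed and yield empty output.
mcp_condition() {
  oc get mcp "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}" \
    2>/dev/null || true
}

# wait_for_worker_rollout [BUDGET_SECONDS] [INTERVAL_SECONDS]
# Waits for Updating=True, then Updated=True, on the worker pool.
# Both phases share one budget, kept below the step timeout so the
# diagnostics at the end still get a chance to run.
wait_for_worker_rollout() {
  local budget=${1:-540} interval=${2:-10}
  local elapsed=0 phase=Updating

  while [ "$elapsed" -lt "$budget" ]; do
    # A missing or empty status is treated as not-True.
    if [ "$(mcp_condition worker "$phase")" = "True" ]; then
      if [ "$phase" = "Updated" ]; then
        return 0
      fi
      phase=Updated
      continue
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  # Budget exhausted: dump the pool for debugging and fail.
  oc get mcp worker -o yaml
  return 1
}
```

One caveat with this shape: if a rollout starts and finishes between two polls, the first phase never observes Updating=True and the function times out, so the interval should be short relative to the expected rollout time.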
📒 Files selected for processing (6)
ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml
ci-operator/step-registry/nvidia-gpu-operator/e2e-aws/nvidia-gpu-operator-e2e-aws-workflow.yaml
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/OWNERS
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-commands.sh
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.metadata.json
ci-operator/step-registry/nvidia-gpu-operator/merge-stage-credentials/nvidia-gpu-operator-merge-stage-credentials-ref.yaml
Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>
Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>
|
/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver |
|
@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>
|
[REHEARSALNOTIFIER]
A total of 45 jobs have been affected by this change. The listing is non-exhaustive and limited to 25 jobs. |
|
/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver |
|
@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
Looking good!
nvidia-driver-daemonset-5.14.0-570.112.1.el9.6-rhel9.6-djp5p:
registry.stage.redhat.io/nvidia/gpu-driver-rhel9:580.159.03-5.14.0-570.112.1.el9_6.x86_64-rhel9.6
|
|
/pj-rehearse ack |
|
@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
PTAL @empovit @ShiraEzra |
|
/assign @empovit @ShiraEzra |
|
/retest-failed |
|
/retest |
|
@josecastillolema: all tests passed! |
Updates OpenShift CI configuration for the NVIDIA GPU Operator E2E pipelines to use signed precompiled drivers from the Red Hat staging registry and to add an opt-in step that merges staging registry credentials into the cluster pull secret.
Opted for the credential-merging strategy to avoid handling secrets in the job itself; the merge takes only 50 seconds and needs no reboot.
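For reference, reading and writing back the cluster-wide pull secret, which the credential-merging strategy relies on, is commonly done with the two oc invocations sketched below. The wrapper names fetch_pull_secret and apply_pull_secret are hypothetical, and the step script in this PR may do it differently.

```shell
# fetch_pull_secret OUT_FILE
# Writes the decoded .dockerconfigjson of the cluster-wide pull secret
# (secret/pull-secret in openshift-config) to OUT_FILE.
fetch_pull_secret() {
  oc get secret/pull-secret -n openshift-config \
    --template='{{index .data ".dockerconfigjson" | base64decode}}' > "$1"
}

# apply_pull_secret IN_FILE
# Replaces the .dockerconfigjson of the cluster-wide pull secret with
# the contents of IN_FILE.
apply_pull_secret() {
  oc set data secret/pull-secret -n openshift-config \
    --from-file=.dockerconfigjson="$1"
}
```

Both files contain registry credentials, so creating them with mktemp (owner-only permissions) and deleting them once the secret is applied keeps them from lingering in the job's workspace.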