Skip to content

feat(kcl/k8s): add ValidateWorkloadHealth step#1234

Open
xinWeiWei24 wants to merge 4 commits into
v2from
xinwei/validate_pod
Open

feat(kcl/k8s): add ValidateWorkloadHealth step#1234
xinWeiWei24 wants to merge 4 commits into
v2from
xinwei/validate_pod

Conversation

@xinWeiWei24

Copy link
Copy Markdown
Collaborator

Generic K8s workload health-check step for Azure Pipelines. Supports Deployment, StatefulSet, DaemonSet, Job, Pod, and wildcard ("*") on a namespace. Verifies controller readiness / Job completion / Pod phase, then scans container logs for panic|fatal (overridable).

Generic K8s workload health-check step for Azure Pipelines. Supports
Deployment, StatefulSet, DaemonSet, Job, Pod, and wildcard ("*") on a
namespace. Verifies controller readiness / Job completion / Pod phase,
then scans container logs for panic|fatal (overridable).
Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
displayName: str = "", \
continueOnError: bool = Undefined \
-> steps.Step {
assert kind in ["deployment", "statefulset", "daemonset", "job", "pod", "*"], \

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use can define kind as

kind: "deployment" | "statefulset" | "daemonset" | "job" | "pod"| "*"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread kcl/lib/steps/k8s/validate_workload.k
}

# 1. Resolve target controllers and pod set.
case "${kind}" in

@wonderyl wonderyl Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clearer to write the following way. You used double case, to reuse the same piece of code, but you can achieve the same with a function.

case "${kind}" in
  "*")
    check_controllers deployment  '{.spec.replicas}'                 '{.status.readyReplicas}'
    check_controllers statefulset '{.spec.replicas}'                 '{.status.readyReplicas}'
    check_controllers daemonset   '{.status.desiredNumberScheduled}' '{.status.numberReady}'
    check_jobs ""
    pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}')
    ;;
  pod)
    check_pods "${name}"
    ;;
  deployment|statefulset)
    check_controllers "${kind}" '{.spec.replicas}' '{.status.readyReplicas}' "${name}" 
    check_controller_pods "${kind}" "${name}"
    ;;
  daemonset)
    check_controllers daemonset '{.status.desiredNumberScheduled}' '{.status.numberReady}' "${name}" ;;
    check_controller_pods "${kind}" "${name}"
  job)
    check_jobs "${name}" 
    check_controller_pods "${kind}" "${name}"
    ;;
esac

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — refactored as suggested. Each case branch now calls check_pods / check_controller_pods directly, and $pods no longer crosses branches.

Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
set +e

failures=0
fail() { echo "FAIL: $*"; failures=$((failures + 1)); }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you read log and try to find the exact failure, you will search for "FAIL:", but this string to too simple, other error string may have the same pattern. I suggest to add a bit more words to make it less likely to collide with others. E.g. "FAILED WORKLOAD CHECK:"

Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
active=$(kubectl -n "${namespace}" get job "$n" -o jsonpath='{.status.active}')
failed_cond=$(kubectl -n "${namespace}" get job "$n" \\
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
[ -z "$completions" ] && completions=1

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoids silence failure, not being able to get comletions and other fields should be considered as failures.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added an existence probe (kubectl get ... -o name) at the top of check_controllers and check_jobs. If the resource can't be read (missing / API error / permission denied), it now fails loudly and skips that item instead of silently defaulting fields to 0.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the existence of the workload directly translate to these fields exist. I was more thinking about when field like completion doesn't exist, just print out an error and exit. The result is not usable when you use default anyway.

Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
echo "job ${namespace}/$n: succeeded=$succeeded/$completions active=$active failed=$failed_cond"
if [ "$failed_cond" = "True" ]; then
fail "job ${namespace}/$n failed"
elif [ "$succeeded" -lt "$completions" ] && [ "$active" = "0" ]; then

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I want 3 completions, 2 succeeded and 1 is active, it will count as succeed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this step is meant to be usable in two scenarios: (1) right after a Job is created, to confirm it has come up and is making progress (running + no failed pods); (2) after waiting for the Job to finish, to confirm everything succeeded. As long as the Job is either succeeded or still running (with no Failed condition)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what happens when you want to check in scenario 2, it actually meets scenario 1 criteria? It's cleaner just split them into two functions and make the name of the function clear, if you check succeed and active, name name it checkJobSucceedOrActive. Health mean different things in different scenarios.
This also prompt me thinking, do you need a generic check workload health function? since you need to use them separately. one for checking cl2 drivers, which are jobs, one for checking kube-system workloads, which doesn't have job at all.

Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
;;
esac

if [ -z "$pods" ]; then

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$pods is assigned in multiple places and used in here. Having a local variable that span 10s of lines of code makes it harder to read. If you use the way I suggested in https://github.com/Azure/telescope/pull/1234/changes#r3488891510, you can avoid this.

Comment thread kcl/lib/steps/k8s/validate_workload.k Outdated
check_jobs ""
pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}')
;;
pod)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking out loud, would anyone actually check for the health of a particular pod?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — generally we don’t focus on the health of a specific pod unless it represents a critical or singleton component, there are edge cases—e.g., pods deployed via custom binaries or standalone images—where they’re not tightly managed by controllers, and in those cases checking individual pod health may still be necessary.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is only a hypothetical use case, I suggest to remove it for now and add it back when someone really have a use case for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants