feat(kcl/k8s): add ValidateWorkloadHealth step by xinWeiWei24 · Pull Request #1234 · Azure/telescope

xinWeiWei24 · 2026-06-25T07:21:06Z

Generic K8s workload health-check step for Azure Pipelines. Supports Deployment, StatefulSet, DaemonSet, Job, Pod, and wildcard ("*") on a namespace. Verifies controller readiness / Job completion / Pod phase, then scans container logs for panic|fatal (overridable).

wonderyl · 2026-06-26T06:41:12Z

+    displayName: str = "", \
+    continueOnError: bool = Undefined \
+    -> steps.Step {
+    assert kind in ["deployment", "statefulset", "daemonset", "job", "pod", "*"], \


use can define kind as

kind: "deployment" | "statefulset" | "daemonset" | "job" | "pod"| "*"

wonderyl · 2026-06-29T01:29:44Z

+}
+
+# 1. Resolve target controllers and pod set.
+case "${kind}" in


It's clearer to write the following way. You used double case, to reuse the same piece of code, but you can achieve the same with a function.

case "${kind}" in "*") check_controllers deployment '{.spec.replicas}' '{.status.readyReplicas}' check_controllers statefulset '{.spec.replicas}' '{.status.readyReplicas}' check_controllers daemonset '{.status.desiredNumberScheduled}' '{.status.numberReady}' check_jobs "" pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}') ;; pod) check_pods "${name}" ;; deployment|statefulset) check_controllers "${kind}" '{.spec.replicas}' '{.status.readyReplicas}' "${name}" check_controller_pods "${kind}" "${name}" ;; daemonset) check_controllers daemonset '{.status.desiredNumberScheduled}' '{.status.numberReady}' "${name}" ;; check_controller_pods "${kind}" "${name}" job) check_jobs "${name}" check_controller_pods "${kind}" "${name}" ;; esac

Done — refactored as suggested. Each case branch now calls check_pods / check_controller_pods directly, and $pods no longer crosses branches.

wonderyl · 2026-06-29T01:35:16Z

+set +e
+
+failures=0
+fail() { echo "FAIL: $*"; failures=$((failures + 1)); }


When you read log and try to find the exact failure, you will search for "FAIL:", but this string to too simple, other error string may have the same pattern. I suggest to add a bit more words to make it less likely to collide with others. E.g. "FAILED WORKLOAD CHECK:"

wonderyl · 2026-06-29T01:38:23Z

+    active=$(kubectl -n "${namespace}" get job "$n" -o jsonpath='{.status.active}')
+    failed_cond=$(kubectl -n "${namespace}" get job "$n" \\
+      -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
+    [ -z "$completions" ] && completions=1


avoids silence failure, not being able to get comletions and other fields should be considered as failures.

Done — added an existence probe (kubectl get ... -o name) at the top of check_controllers and check_jobs. If the resource can't be read (missing / API error / permission denied), it now fails loudly and skips that item instead of silently defaulting fields to 0.

I don't think the existence of the workload directly translate to these fields exist. I was more thinking about when field like completion doesn't exist, just print out an error and exit. The result is not usable when you use default anyway.

wonderyl · 2026-06-29T01:40:14Z

+    echo "job ${namespace}/$n: succeeded=$succeeded/$completions active=$active failed=$failed_cond"
+    if [ "$failed_cond" = "True" ]; then
+      fail "job ${namespace}/$n failed"
+    elif [ "$succeeded" -lt "$completions" ] && [ "$active" = "0" ]; then


So if I want 3 completions, 2 succeeded and 1 is active, it will count as succeed?

Yes, this step is meant to be usable in two scenarios: (1) right after a Job is created, to confirm it has come up and is making progress (running + no failed pods); (2) after waiting for the Job to finish, to confirm everything succeeded. As long as the Job is either succeeded or still running (with no Failed condition)

Then what happens when you want to check in scenario 2, it actually meets scenario 1 criteria? It's cleaner just split them into two functions and make the name of the function clear, if you check succeed and active, name name it checkJobSucceedOrActive. Health mean different things in different scenarios.
This also prompt me thinking, do you need a generic check workload health function? since you need to use them separately. one for checking cl2 drivers, which are jobs, one for checking kube-system workloads, which doesn't have job at all.

wonderyl · 2026-06-29T01:49:55Z

+    ;;
+esac
+
+if [ -z "$pods" ]; then


$pods is assigned in multiple places and used in here. Having a local variable that span 10s of lines of code makes it harder to read. If you use the way I suggested in https://github.com/Azure/telescope/pull/1234/changes#r3488891510, you can avoid this.

wonderyl · 2026-06-29T01:52:19Z

+    check_jobs ""
+    pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}')
+    ;;
+  pod)


just thinking out loud, would anyone actually check for the health of a particular pod?

Good question — generally we don’t focus on the health of a specific pod unless it represents a critical or singleton component, there are edge cases—e.g., pods deployed via custom binaries or standalone images—where they’re not tightly managed by controllers, and in those cases checking individual pod health may still be necessary.

If this is only a hypothetical use case, I suggest to remove it for now and add it back when someone really have a use case for it.

xinWeiWei24 requested review from liyu-ma, vittoriasalim and wonderyl as code owners June 25, 2026 07:21

wonderyl reviewed Jun 26, 2026

View reviewed changes

Comment thread kcl/lib/steps/k8s/validate_workload.k

addressed comments

4e81ba1

wonderyl reviewed Jun 29, 2026

View reviewed changes

xinWeiWei24 added 2 commits June 30, 2026 11:06

addressed comments

026d0a2

addresed comment

64d4e5a

Uh oh!

Conversation

xinWeiWei24 commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wonderyl Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wonderyl Jun 29, 2026 •

edited

Loading