feat(kcl/k8s): add ValidateWorkloadHealth step#1234
Conversation
Generic K8s workload health-check step for Azure Pipelines. Supports
Deployment, StatefulSet, DaemonSet, Job, Pod, and wildcard ("*") on a
namespace. Verifies controller readiness / Job completion / Pod phase,
then scans container logs for panic|fatal (overridable).
| displayName: str = "", \ | ||
| continueOnError: bool = Undefined \ | ||
| -> steps.Step { | ||
| assert kind in ["deployment", "statefulset", "daemonset", "job", "pod", "*"], \ |
There was a problem hiding this comment.
use can define kind as
kind: "deployment" | "statefulset" | "daemonset" | "job" | "pod"| "*"
| } | ||
|
|
||
| # 1. Resolve target controllers and pod set. | ||
| case "${kind}" in |
There was a problem hiding this comment.
It's clearer to write the following way. You used double case, to reuse the same piece of code, but you can achieve the same with a function.
case "${kind}" in
"*")
check_controllers deployment '{.spec.replicas}' '{.status.readyReplicas}'
check_controllers statefulset '{.spec.replicas}' '{.status.readyReplicas}'
check_controllers daemonset '{.status.desiredNumberScheduled}' '{.status.numberReady}'
check_jobs ""
pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}')
;;
pod)
check_pods "${name}"
;;
deployment|statefulset)
check_controllers "${kind}" '{.spec.replicas}' '{.status.readyReplicas}' "${name}"
check_controller_pods "${kind}" "${name}"
;;
daemonset)
check_controllers daemonset '{.status.desiredNumberScheduled}' '{.status.numberReady}' "${name}" ;;
check_controller_pods "${kind}" "${name}"
job)
check_jobs "${name}"
check_controller_pods "${kind}" "${name}"
;;
esac
There was a problem hiding this comment.
Done — refactored as suggested. Each case branch now calls check_pods / check_controller_pods directly, and $pods no longer crosses branches.
| set +e | ||
|
|
||
| failures=0 | ||
| fail() { echo "FAIL: $*"; failures=$((failures + 1)); } |
There was a problem hiding this comment.
When you read log and try to find the exact failure, you will search for "FAIL:", but this string to too simple, other error string may have the same pattern. I suggest to add a bit more words to make it less likely to collide with others. E.g. "FAILED WORKLOAD CHECK:"
| active=$(kubectl -n "${namespace}" get job "$n" -o jsonpath='{.status.active}') | ||
| failed_cond=$(kubectl -n "${namespace}" get job "$n" \\ | ||
| -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}') | ||
| [ -z "$completions" ] && completions=1 |
There was a problem hiding this comment.
avoids silence failure, not being able to get comletions and other fields should be considered as failures.
There was a problem hiding this comment.
Done — added an existence probe (kubectl get ... -o name) at the top of check_controllers and check_jobs. If the resource can't be read (missing / API error / permission denied), it now fails loudly and skips that item instead of silently defaulting fields to 0.
There was a problem hiding this comment.
I don't think the existence of the workload directly translate to these fields exist. I was more thinking about when field like completion doesn't exist, just print out an error and exit. The result is not usable when you use default anyway.
| echo "job ${namespace}/$n: succeeded=$succeeded/$completions active=$active failed=$failed_cond" | ||
| if [ "$failed_cond" = "True" ]; then | ||
| fail "job ${namespace}/$n failed" | ||
| elif [ "$succeeded" -lt "$completions" ] && [ "$active" = "0" ]; then |
There was a problem hiding this comment.
So if I want 3 completions, 2 succeeded and 1 is active, it will count as succeed?
There was a problem hiding this comment.
Yes, this step is meant to be usable in two scenarios: (1) right after a Job is created, to confirm it has come up and is making progress (running + no failed pods); (2) after waiting for the Job to finish, to confirm everything succeeded. As long as the Job is either succeeded or still running (with no Failed condition)
There was a problem hiding this comment.
Then what happens when you want to check in scenario 2, it actually meets scenario 1 criteria? It's cleaner just split them into two functions and make the name of the function clear, if you check succeed and active, name name it checkJobSucceedOrActive. Health mean different things in different scenarios.
This also prompt me thinking, do you need a generic check workload health function? since you need to use them separately. one for checking cl2 drivers, which are jobs, one for checking kube-system workloads, which doesn't have job at all.
| ;; | ||
| esac | ||
|
|
||
| if [ -z "$pods" ]; then |
There was a problem hiding this comment.
$pods is assigned in multiple places and used in here. Having a local variable that span 10s of lines of code makes it harder to read. If you use the way I suggested in https://github.com/Azure/telescope/pull/1234/changes#r3488891510, you can avoid this.
| check_jobs "" | ||
| pods=$(kubectl -n "${namespace}" get pods -o jsonpath='{.items[*].metadata.name}') | ||
| ;; | ||
| pod) |
There was a problem hiding this comment.
just thinking out loud, would anyone actually check for the health of a particular pod?
There was a problem hiding this comment.
Good question — generally we don’t focus on the health of a specific pod unless it represents a critical or singleton component, there are edge cases—e.g., pods deployed via custom binaries or standalone images—where they’re not tightly managed by controllers, and in those cases checking individual pod health may still be necessary.
There was a problem hiding this comment.
If this is only a hypothetical use case, I suggest to remove it for now and add it back when someone really have a use case for it.
Generic K8s workload health-check step for Azure Pipelines. Supports Deployment, StatefulSet, DaemonSet, Job, Pod, and wildcard ("*") on a namespace. Verifies controller readiness / Job completion / Pod phase, then scans container logs for panic|fatal (overridable).