Add InstallKubePrometheusStack lib step #1237

Open
LeonardCareer wants to merge 1 commit into v2 from leonarddu/lib-install-prometheus
@LeonardCareer (Collaborator) commented Jun 25, 2026

Summary

Adds kcl/lib/steps/k8s/install_prometheus.k, providing one new step:

import lib.steps.k8s.install_prometheus as prom

prom.InstallKubePrometheusStack(
    serviceConnection = SERVICE_CONNECTION,
    valuesFile        = "kcl/<scenario>/prometheus-values.yaml",
)

The step installs the kube-prometheus-stack Helm chart (Operator + Prometheus + kube-state-metrics + Grafana + Alertmanager) into the current kubectl context.

Why

Multiple benchmark scenarios need to provision an in-cluster Prometheus on a freshly created AKS cluster — currently each scenario hand-rolls its own helm/kubectl-apply step. This step covers both cases discussed so far:

  • Xinwei's 15K-node bench (kube-prometheus-stack with cilium PodMonitor / PrometheusRule and KSM)
  • Leonard's S5 lease bench (lightweight: Operator + Prometheus only, all other components disabled in values.yaml)

What the caller must prepare

  1. Working kubeconfig — call azure.GetCredentials(...) in a prior step so kubectl works against the target cluster.
  2. values.yaml checked in to the caller's pipeline repo, e.g. kcl/<scenario>/prometheus-values.yaml. The path passed to valuesFile is repo-relative under $(Pipeline.Workspace)/s/. All workload-specific tuning (retention, storage, resource requests, scrape rules, PodMonitor/ServiceMonitor selectors, node selectors, tolerations, enabling/disabling Grafana/Alertmanager/KSM) lives in this file — see the chart's values.yaml for the full surface.
  3. Firewall egress to the chart registries (ghcr.io, quay.io, pkg-containers.githubusercontent.com, production.cloudflare.docker.com, plus the *. wildcard subdomains of each): the caller handles this in their cluster-create step.
  4. Any PodMonitor / ServiceMonitor / PrometheusRule CRs the workload needs: the caller applies them with kubectl apply -f ... after this step. They are outside the lib's scope.
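
For the lightweight case (Operator + Prometheus only), the checked-in values file might look roughly like what the sketch below writes. The function name `write_minimal_values`, the target directory, and the retention value are made up for illustration; the `enabled` toggles and `prometheusSpec.retention` are keys from the chart's values.yaml, but they should be checked against whichever chart version is pinned.

```shell
# Write a minimal prometheus-values.yaml into the given directory.
# Keeps the Operator and Prometheus, switches the other bundled components off.
write_minimal_values() {
  target_dir=$1
  mkdir -p "$target_dir"
  cat > "$target_dir/prometheus-values.yaml" <<'EOF'
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
prometheus:
  prometheusSpec:
    retention: 6h   # illustrative value, tune per workload
EOF
  echo "wrote $target_dir/prometheus-values.yaml"
}
```

Calling `write_minimal_values kcl/my-scenario` (a hypothetical scenario directory) would produce a file that can be passed to `valuesFile` as-is.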

What the step does

  1. Installs Helm v3 on the agent if helm is not on PATH.
  2. Adds / refreshes the prometheus-community helm repo.
  3. Runs helm upgrade --install <releaseName> prometheus-community/kube-prometheus-stack with the caller's values file, --create-namespace, --wait, --atomic (rolls back cleanly on failure), and a configurable timeout (waitTimeout).
  4. Echoes the installed Prometheus web service name (svc/<releaseName>-kube-prometheus-prometheus, port 9090) so the caller knows where to port-forward / scrape from next.
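
The repo and upgrade steps above can be sketched in shell as follows. This is not the script from the PR, only an approximation of it: the function names and the `HELM` override are assumptions added so the command can be previewed without a cluster, while the helm flags themselves are the ones listed in the steps.

```shell
# Add (or refresh) the prometheus-community chart repo.
add_prometheus_repo() {
  ${HELM:-helm} repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts --force-update
  ${HELM:-helm} repo update
}

# Run helm upgrade --install for kube-prometheus-stack.
# Arguments: valuesFile [namespace] [releaseName] [chartVersion] [waitTimeout]
helm_upgrade() {
  values_file=$1
  namespace=${2:-monitoring}
  release=${3:-prometheus}
  chart_version=${4:-}
  wait_timeout=${5:-10m}

  # Rebuild the positional parameters as the helm argument list.
  set -- upgrade --install "$release" prometheus-community/kube-prometheus-stack \
    --namespace "$namespace" --create-namespace \
    --values "$values_file" \
    --wait --atomic --timeout "$wait_timeout"

  # An empty chartVersion means "latest": omit --version entirely.
  if [ -n "$chart_version" ]; then
    set -- "$@" --version "$chart_version"
  fi

  # HELM can be overridden (for example HELM=echo) to preview the command.
  ${HELM:-helm} "$@"
}
```

Running `HELM=echo helm_upgrade kcl/my-scenario/prometheus-values.yaml` prints the argument list instead of invoking helm, which is a cheap way to review what a given parameter combination would do.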

Parameters

Parameter          Default        Required
serviceConnection  (none)         yes
valuesFile         (none)         yes
namespace          "monitoring"   no
releaseName        "prometheus"   no
chartVersion       "" (latest)    no
waitTimeout        "10m"          no

Validation

Tested end to end in the Telescope S5 lease benchmark pipeline (internal repo) against an AKS H8 hyperscale cluster in southeastasia:

  • helm install completes in ~2.5 min
  • the installed svc/prometheus-kube-prometheus-prometheus is port-forwardable and serves /-/ready, /api/v1/query, /api/v1/query_range
  • the caller's additionalScrapeConfigs picks up the workload pods correctly
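
A caller could repeat the readiness part of this check with something like the sketch below. The function names, the retry count, and the sleep interval are assumptions; the service name pattern and the /-/ready endpoint are the ones stated in this description.

```shell
# Service name the step echoes: svc/<releaseName>-kube-prometheus-prometheus
prom_service() {
  echo "svc/${1:-prometheus}-kube-prometheus-prometheus"
}

# Port-forward the service, poll /-/ready, then stop the port-forward.
# Arguments: [namespace] [releaseName] [localPort]
check_prometheus_ready() {
  namespace=${1:-monitoring}
  release=${2:-prometheus}
  local_port=${3:-9090}

  kubectl -n "$namespace" port-forward "$(prom_service "$release")" \
    "$local_port:9090" >/dev/null 2>&1 &
  pf_pid=$!

  ready=1
  for attempt in 1 2 3 4 5 6 7 8 9 10; do
    if curl -fsS "http://127.0.0.1:$local_port/-/ready" >/dev/null 2>&1; then
      ready=0
      break
    fi
    sleep 2
  done

  # Always clean up the background port-forward.
  kill "$pf_pid" 2>/dev/null || true
  return "$ready"
}
```

`check_prometheus_ready monitoring prometheus` returns 0 once the endpoint answers and 1 if it never does within the retries, so it can gate a later benchmark step.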

@xinWeiWei24 (Collaborator) left a comment

LGTM

import lib.steps.azure

# Install kube-prometheus-stack via Helm. Caller must run azure.GetCredentials first and check in a values.yaml.
InstallKubePrometheusStack = lambda serviceConnection: str, valuesFile: str, namespace: str = "monitoring", releaseName: str = "prometheus", chartVersion: str = "", waitTimeout: str = "10m" -> steps.Step {

The name could be more succinct, e.g. InstallPrometheus. KubePrometheus is not a thing, and Stack carries no meaning.

# Prometheus web service is svc/<releaseName>-kube-prometheus-prometheus on port 9090.
echo "==> Prometheus web service: svc/${releaseName}-kube-prometheus-prometheus in ${namespace} (port 9090)"
"""
azure.AzCli(serviceConnection, "Install kube-prometheus-stack", script)

ditto

script = """
set -euo pipefail

# Install Helm v3 if missing

Consider extracting the Helm install into a separate step. Imagine there are 10 Helm charts, each with its own way of installing Helm 3.
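
One possible reading of this suggestion is that the Helm bootstrap becomes its own idempotent unit that any chart-installing step can call first. A rough shell sketch of such a unit follows; `ensure_helm` is a made-up name that the PR does not define, and how it would be wrapped as a KCL step is left open. The installer URL is the get-helm-3 script published by the Helm project.

```shell
# Install Helm 3 only when no helm binary is on PATH.
ensure_helm() {
  if command -v helm >/dev/null 2>&1; then
    echo "helm already present: $(command -v helm)"
    return 0
  fi
  # Fetch and run the installer script from the Helm project.
  curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
}
```

Because the function returns early when helm is found, calling it from several chart steps in the same job costs nothing after the first install.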
