This document covers the periodic task of merging upstream Datadog Agent changes into the StackState Agent fork. This is not day-to-day work — see CLAUDE.md for normal development workflows.
The StackState Agent is a fork of the Datadog Agent. Periodically, upstream Datadog releases are merged into the fork to pick up new features, bug fixes, and dependency updates. This is a large, intensive task that touches most of the codebase.
Fork structure:
- Main branch is named `stackstate-<DD-version>` after the DD version it tracks (e.g., `stackstate-7.71.2`).
- Each merge produces a new main branch; the previous one is left in place as historical reference.
- A set of named scaffolding branches (`base-*`, `common-ancestor-*`, `backport-*`, `merged-*-to-*`) is used to make the merge tractable; see "Pre-merge: branch setup" below.
- A clean "compare copy" of the repo at a sibling path is useful for diffing post-merge fix-ups against the raw merge point.
The local.sh script orchestrates containerized builds. Key steps:
- `PREP`: rsyncs source into the container, runs `fix_package_paths.sh` (if relocated), runs `fix_branding.sh` (if branded)
- `DEPS_DEB`: installs dependencies, runs `inv deps`, regenerates vendor
- `BUILD_CLUSTER_AGENT` / `BUILD_AGENT`: compiles binaries
- `BUILD_DEB`: builds the .deb package via omnibus
- `UNIT_TESTS`: builds with race detector, runs full test suite
The build container image is registry.tooling.stackstate.io/quay/stackstate/datadog_build_linux_x64.
- Pipeline structure: parent pipeline triggers bridge jobs, which spawn child pipelines (`agent-x86`, `agent-arm`)
- API base: `https://gitlab.com/api/v4/projects/<PROJECT_ID>`
- Auth: `Authorization: Bearer $GITLAB_TOKEN` (token stored in `.env`)
- Use `[cluster-agent]` in commit messages to run only cluster-agent pipeline steps
- The `branded_unit_tests` job runs `fix_branding.sh` then the full test suite
- The `unbranded_unit_tests` job runs tests without branding (baseline comparison)
- Jobs have `retry: max: 2, when: always`; any single test failure triggers up to 2 retries
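The API base and auth header above combine into a small helper for inspecting pipelines from the command line. This is a minimal sketch, not something that exists in the repository: `GITLAB_PROJECT_ID`, `gitlab_api_url`, and `gitlab_get` are hypothetical names, and it assumes `GITLAB_TOKEN` has already been exported (for example by sourcing `.env`). The `bridges` and `jobs` paths are GitLab's standard endpoints for listing a parent pipeline's bridge jobs and a pipeline's jobs.

```shell
# Hypothetical helpers; GITLAB_PROJECT_ID and GITLAB_TOKEN must be set by the caller.

# Print the full API URL for a path below the project.
gitlab_api_url() {
    # $1 = path below the project, e.g. "pipelines/123/bridges"
    printf 'https://gitlab.com/api/v4/projects/%s/%s\n' "$GITLAB_PROJECT_ID" "$1"
}

# GET a project-relative API path with the bearer token.
gitlab_get() {
    curl --fail --silent --show-error \
        --header "Authorization: Bearer $GITLAB_TOKEN" \
        "$(gitlab_api_url "$1")"
}

# Example calls (IDs are placeholders):
#   gitlab_get "pipelines/<PARENT_PIPELINE_ID>/bridges"   # bridge jobs of the parent
#   gitlab_get "pipelines/<CHILD_PIPELINE_ID>/jobs"       # jobs of agent-x86 / agent-arm
```

Keeping the URL builder separate from the `curl` call makes it easy to check which endpoint a command will hit before sending the token anywhere.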
Both branded_unit_tests and unbranded_unit_tests in .gitlab-ci-agent.yml invoke inv -e test with two STS-specific flags that diverge from upstream defaults:
- `--build-exclude=$STS_UT_BUILD_EXCLUDE`: drops build tags for features StackState does not ship in the cluster-agent / node-agent images. The current set is `oracle`, `trivy`, `trivy_no_javadb`, `nvml`, `jetson`, `bundle_installer`, `systemd`. If a future upstream merge introduces a new heavy build tag for a feature StackState doesn't surface (e.g., a new database integration, GPU/hardware support, vendor SDK), consider adding it to this list to keep CI time bounded. Service-discovery integrations (`consul`, `etcd`, `zk`, `ncm`) are deliberately kept in.
- `--timeout=600`: bumps Go's per-package test timeout from 180s to 600s. Required because we run `go clean -modcache` at job start, so subprocess-heavy tests like `pkg/collector/corechecks/servicediscovery/apm.TestGoDetector` (which shells out to `go build` four times to compile fixture binaries) can blow the default 3-minute timeout on a busy runner. Don't drop this without first confirming the modcache wipe is also gone.
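As an illustration of how the two flags fit together, the sketch below keeps the exclude list in one variable and prints the resulting command instead of running it. It assumes `STS_UT_BUILD_EXCLUDE` is a comma-separated list (the actual format in `.gitlab-ci-agent.yml` may differ); `add_build_exclude`, `unit_test_command`, and the tag `new_vendor_sdk` are hypothetical.

```shell
# Assumption: the exclude list is comma-separated.
STS_UT_BUILD_EXCLUDE=oracle,trivy,trivy_no_javadb,nvml,jetson,bundle_installer,systemd

# Append a build tag to the exclude list unless it is already present.
add_build_exclude() {
    case ",$STS_UT_BUILD_EXCLUDE," in
        *",$1,"*) ;;                                          # already listed
        *) STS_UT_BUILD_EXCLUDE="$STS_UT_BUILD_EXCLUDE,$1" ;;
    esac
}

# Print (not run) the test invocation with both STS-specific flags.
unit_test_command() {
    printf 'inv -e test --build-exclude=%s --timeout=600\n' "$STS_UT_BUILD_EXCLUDE"
}

add_build_exclude new_vendor_sdk   # hypothetical tag from a future merge
unit_test_command
```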
All branding transformations live in fix_branding.sh. This script runs at build time and must NOT be applied as permanent local code changes — the source tree stays close to upstream for easier future merges.
- gofmt rule: `gofmt -r '"datadoghq.com" -> "stackstate.io"'` changes exact standalone Go string literals (e.g., `DefaultSite`). Also applies other gofmt rules for `localhost:7077` URL substitutions in specific directories.
- Catch-all sed: `sed 's/datadoghq\.com/stackstate.io/g'` on all `*.go` files catches `datadoghq.com` as a substring in URLs like `api.datadoghq.com`, `intake.profile.datadoghq.com`, etc.
- Targeted reverts: patterns that must NOT be branded are reverted back to `datadoghq.com`.
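The catch-all substitution can be tried in isolation on single lines. The sketch below uses the same sed expression as the list above; the wrapper name `brand` and the two sample Go lines are made up for illustration. The second sample is a regex source whose dots are escaped with a backslash in the file.

```shell
# Same expression as the catch-all sed, applied to stdin.
brand() {
    sed 's/datadoghq\.com/stackstate.io/g'
}

# A Go string constant: the file contains plain dots, so the pattern matches.
printf '%s\n' 'const prefix = "ad.datadoghq.com/"' | brand
# prints: const prefix = "ad.stackstate.io/"

# A Go regex source: the file contains backslash + dot, so nothing matches
# and the line comes back unchanged.
printf '%s\n' 'var re = regexp.MustCompile(`ad\.datadoghq\.com/`)' | brand
```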
The catch-all sed does NOT match datadoghq\.com (with backslash-dot) in source files, because \. in the file is two characters (backslash + dot), not a literal dot. This means:
- Go regex patterns like `ad\.datadoghq\.com` are NOT changed by the sed
- But string constants like `"ad.datadoghq.com/"` ARE changed
- This creates regex/constant mismatches that must be fixed by reverting the string constants
K8s annotations (must stay datadoghq.com — K8s protocol):
- `ad`, `internal.dd`, `tags`, `apm`, `internal.apm`
- `admission`, `autoscaling`, `service-discovery`, `k8s.csi`, `external-metrics`, `custom-metrics`
CRD API groups (K8s CustomResourceDefinition registrations):
- Version suffixes: `v1alpha1`, `v1alpha2`, `v1beta1`, `v2alpha1`
- Standalone `"datadoghq.com"` in orchestrator CRD files (Group, Name, groupName fields, `datadogAPIGroup` constant)
Package repository URLs (reference real Datadog infrastructure):
- `apt`, `yum`, `keys`: global revert
- `install`: scoped to `pkg/fleet/` only (diagnose/connectivity needs branded URLs)
Documentation URLs: docs.datadoghq.com
Regex patterns (must add stackstate.io as recognized domain):
- `wellKnownSitesRe` in `pkg/config/utils/endpoints.go`: trailing FQDN dot
- `ddURLRegexp` in `pkg/config/utils/endpoints.go`: `AddAgentVersionToDomain`
- `ddURLRegexp` + `ddNoSubDomainRegexp` in `pkg/trace/api/tracer_flare.go`: separate file from `endpoints.go`
- Forwarder health domain regex in `comp/forwarder/defaultforwarder/forwarder_health.go`
Constants overridden by gofmt to localhost:7077 (must be fixed to branded URLs):
- `DefaultProcessEndpoint` → `https://process.stackstate.io.`
- `DefaultProcessEventsEndpoint` → `https://process-events.stackstate.io.`
- `defaultEndpoint` (orchestrator) → `https://orchestrator.stackstate.io`
- Test expected values using `url.Parse` in orchestrator `config_test.go`
YAML fixture files (catch-all sed only targets *.go):
- `pkg/config/utils/tests/datadog_secrets.yaml`: branded explicitly
- `pkg/util/scrubber/test/datadog.yaml`: NOT branded; Go expected value reverted instead
Compression: serializer_max_payload_size 250 → 200 (zstd → zlib CompressBound difference)
Test DNS resolution: npcollector tests override site to datadoghq.com so the event platform forwarder constructs resolvable intake endpoints (netpath-intake.datadoghq.com instead of netpath-intake.stackstate.io)
When upstream introduces new datadoghq.com references, most are handled automatically by the catch-all sed. You only need to add to fix_branding.sh when:
- A reference must NOT be branded (add a revert)
- A Go regex pattern needs to recognize `stackstate.io` (add the domain to the regex)
- A non-`.go` fixture file needs branding (add explicit sed for that file)
- A `gofmt` rule produces `localhost:7077` but the correct value is a branded URL (add a fixup)
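The first case (a reference that must not be branded) amounts to a second substitution that undoes the catch-all for one specific host. A minimal sketch of that ordering follows, using the documentation and package-repository hosts named earlier as examples. The function name `brand_with_reverts` is hypothetical and the exact expressions in `fix_branding.sh` may differ.

```shell
# Catch-all first, then targeted reverts for hosts that must keep datadoghq.com.
brand_with_reverts() {
    sed -e 's/datadoghq\.com/stackstate.io/g' \
        -e 's/docs\.stackstate\.io/docs.datadoghq.com/g' \
        -e 's/apt\.stackstate\.io/apt.datadoghq.com/g' \
        -e 's/yum\.stackstate\.io/yum.datadoghq.com/g' \
        -e 's/keys\.stackstate\.io/keys.datadoghq.com/g'
}

printf '%s\n' 'https://api.datadoghq.com'        | brand_with_reverts   # branded
printf '%s\n' 'https://docs.datadoghq.com/agent' | brand_with_reverts   # reverted, unchanged overall
```

Because each revert is keyed on the already-branded host, a new revert only needs one more `-e` expression after the catch-all.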
When RELOCATED=true, the source is moved from the Datadog import path to the StackState path:
`github.com/DataDog/datadog-agent` → `github.com/StackVista/stackstate-agent`
This involves rewriting Go import paths, cleaning the module cache, removing go.sum and vendor, then re-syncing go work and re-vendoring.
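The import-path rewrite itself is a plain textual substitution. The sketch below shows it on a single import line; `relocate_imports` is a hypothetical name, and `fix_package_paths.sh` may implement the rewrite differently (for example across `go.mod` files as well).

```shell
# Rewrite the Datadog module path to the StackState one on stdin.
relocate_imports() {
    sed 's#github\.com/DataDog/datadog-agent#github.com/StackVista/stackstate-agent#g'
}

printf '%s\n' 'import "github.com/DataDog/datadog-agent/pkg/config/setup"' | relocate_imports
# prints: import "github.com/StackVista/stackstate-agent/pkg/config/setup"
```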
Upstream Datadog merges can silently drop StackState-specific code blocks (usually marked with // sts begin / // sts end or // [sts] comments). These are modifications to upstream files that don't exist in Datadog's codebase. After every merge, verify these are still present:
- File: `comp/core/tagger/collectors/workloadmeta_extract.go` (old path: `pkg/tagger/collectors/workloadmeta_extract.go`)
- What: Adds `kube_cluster_name` tag (from `clustername.GetClusterName()`) to all Kubernetes pod tags
- Why: vmagent relabel rules derive `cluster_name`, `_k8s_cluster_`, and `_scope_` labels from this tag. Without it, the StackState UI cannot display CPU/memory metrics for containers because MetricBindings use `${tags.cluster-name}` to scope queries.
- Symptom if missing: Container CPU/memory columns empty in StackState UI; `cluster_name`, `_k8s_cluster_`, `_scope_` labels absent from all container/kubelet metrics in VictoriaMetrics.
- Note: Datadog doesn't need this because they use `expected_tags_duration` to inject host tags at flush time. StackState relies on the tagger instead.
- File: `pkg/config/setup/config.go`, `DefaultCompressorKind` constant (handled by `fix_branding.sh`, NOT in `stackstate()`)
- What: `fix_branding.sh` changes `DefaultCompressorKind = "zstd"` to `"zlib"` and adjusts `serializer_max_payload_size` in tests from 250 → 200
- Why: The StackState receiver does not support zstd decompression. It returns HTTP 400 for zstd-compressed payloads, silently breaking host metadata ingestion (`/intake/` endpoint).
- Important: Do NOT override this in the `stackstate()` function. It must be done via `fix_branding.sh` because the payload size test tuning (250 vs 200) must match the compressor. Branded tests get both changes; unbranded tests keep zstd + 250.
- Symptom if missing: Receiver returns 400 for all agent payloads; host metadata not ingested; metric enrichment stops.
- File: `pkg/config/setup/config.go`, in the `stackstate()` function
- What: Various StackState-specific defaults (skip leader election, batcher config, transactional forwarder, check state, etc.)
- Why: These configure StackState-specific components and disable Datadog-only features.
- File: `cmd/agent/subcommands/run/command.go`: `fx.Supply(resourcesimpl.Disabled())` is supplied before `metadata.Bundle()`.
- What: Suppresses the gohai-derived "resources" payload that the node-agent would otherwise post to `/intake/` every 5 minutes (`comp/metadata/resources/resourcesimpl/resources.go`, `defaultCollectInterval = 300s`).
- Why: The StackState receiver decodes the payload through `case class Intake` (`stackstate-receiver-project/.../apimodel/Intake.scala`) which mandates a top-level `internalHostname: String`. The resources payload places hostname under `meta.host` instead, so spray-json returns 400 with `"Object is missing required member 'internalHostname'"`. Even when the field is added, the receiver's `Intake.resources: Option[Resources]` is parsed but never read by any processor, so the payload is wasted bandwidth. 7.51.1 prod has been silently 400ing on this for years; rather than perpetuate the noise, we disable the producer.
- Do NOT replace this with a serializer-side `internalHostname` injection. Earlier rebase commits (`2881df138d`, `d9f478c698`) added that injection; it was removed in `8802b7a3` and replaced with this provider-disable in STAC-24623. Generic post-marshal byte injection is the wrong layer: STS payloads carry `internalHostname` structurally (see `pkg/batcher/batcher.go:150`, `comp/metadata/host/hostimpl/utils/common.go:20`, `pkg/serializer/internal/metrics/events.go`).
- Pattern to watch in future merges: Any new metadata payload component added to `comp/metadata/` that calls `serializer.SendMetadata` / `SendProcessesMetadata` must either embed `hostMetadataUtils.CommonPayload` (which has `InternalHostname`) or be disabled if the receiver doesn't consume it. Grep `SendMetadata\|SendProcessesMetadata\|SendHostMetadata\|SendAgentchecksMetadata` for new call sites.
- Cluster-agent and dogstatsd are unaffected: cluster-agent does not wire `metadata.Bundle()`; dogstatsd already supplies `Disabled()` upstream (`cmd/dogstatsd/subcommands/start/command.go:161`).
- File: `pkg/config/setup/config.go`, in `serializer()` function
- What: Forces `use_v2_api.series` to `false`
- Why: The StackState receiver only supports the v1 series API.
- File: `pkg/serializer/internal/metrics/events.go`
- What: Serializes `EventContext` field in event payloads
- Why: StackState topology events require the context field for proper processing.
- File: `pkg/config/setup/config.go`, in the `stackstate()` function
- File: `comp/connectivitychecker/impl/connectivitychecker.go`
- What: `connectivity_checker.enabled` defaults to `false`; the component skips lifecycle/timer registration when disabled.
- Why: DD 7.71.2 added a periodic connectivity checker that probes all DD endpoints every 10 minutes. The STS receiver doesn't support many of these endpoints, causing 404s in receiver logs. The `// sts begin/end` guard in `NewComponent` must be preserved.
- File: `fix_branding.sh` (applied at build time)
- What: Brands C++ rtloader files (header paths, module names)
- Why: Python checks loaded via rtloader won't work if the C++ layer references `datadog_agent` instead of `stackstate_agent`.
- File: `pkg/config/setup/config.go`
- What: Default must stay at `1` (StackState override). DD upstream changed it from unset to `10` in commit `f4b1c7cc17`.
- Why: With concurrent requests > 1, topology snapshot batches (`SnapshotStart` → data → `SnapshotStop`) can arrive out of order at the receiver, causing `DuplicateSnapshotItem` errors in the sync processor. The larger the cluster, the more batches per snapshot, the more likely reordering occurs.
- Symptom if wrong: Topology sync thrashing on large clusters: `DuplicateSnapshotItem` and `ComponentForRelationMissing` errors in the sync processor, create/delete churn on topology components.
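Part of the post-merge verification can be automated with a fixed-string search per file. The script below is a sketch only: `check_marker` and `verify_sts_blocks` are hypothetical names, and the search strings are assumptions chosen from the identifiers mentioned in this document. A hit shows the text is present, not that the surrounding logic is intact, so it complements a manual review rather than replacing it.

```shell
# check_marker FILE TEXT: report whether TEXT (a fixed string) occurs in FILE.
check_marker() {
    if grep -qF -- "$2" "$1" 2>/dev/null; then
        printf 'ok       %s (%s)\n' "$1" "$2"
    else
        printf 'MISSING  %s (%s)\n' "$1" "$2"
        return 1
    fi
}

# Run from the repository root; returns non-zero if any marker is missing.
verify_sts_blocks() {
    rc=0
    check_marker comp/core/tagger/collectors/workloadmeta_extract.go 'kube_cluster_name' || rc=1
    check_marker cmd/agent/subcommands/run/command.go 'resourcesimpl.Disabled()' || rc=1
    check_marker comp/connectivitychecker/impl/connectivitychecker.go 'sts begin' || rc=1
    check_marker pkg/serializer/internal/metrics/events.go 'EventContext' || rc=1
    return "$rc"
}
```

Calling `verify_sts_blocks` after each round of conflict resolution gives a quick first signal before the slower CI and sandbox checks.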
pkg/logs/client/http/worker_pool_test.go carries an STS-specific driveUntil helper plus an absDuration utility, used to absorb a goroutine-scheduling race in TestRetryableError, TestNonRetryableError, and TestWorkerCounts. Without these, the tests flake on busy CI runners with off-by-one worker counts and millisecond-level assert.InDelta mismatches on virtualLatency. An upstream merge into pkg/logs/client/http/ may overwrite this patch — verify the helpers are still present and the Test* functions still call driveUntil(...) rather than the original fixed-iteration loops. The original assertions (for i := 0; i < 100; i++ { pool.run(...) }; require.Equal(t, 8, pool.inUseWorkers)) compile but flake in CI.
- Not an agent code issue: this is a stackpacks/platform concern, but it is triggered by agent version changes.
- What: The threshold monitor function (`urn:stackpack:common:monitor-function:threshold`) derives `healthStateId` from ALL metric label values. If the agent version adds, removes, or changes any label (e.g., `orch_cluster_id` appearing, `status` flip-flopping), the platform creates duplicate monitor instances for the same component.
- Affected monitors: Node Disk/Memory/PID Pressure, Node Readiness, Available Endpoints (fixed in stackpacks MR 1332 by adding `max by (...)` aggregation). Desired-replicas monitors (daemonset/deployment/replicaset/statefulset) are theoretically vulnerable but not currently affected.
- After merge: Check if new KSM metrics add labels that differ from the labels used in monitor `urnTemplate` fields. If so, the monitor queries in the kubernetes-v2 stackpack need `by (...)` aggregation to strip volatile labels.
Before any conflict resolution, set up the branches that the merge will run on. The strategy is to give git a meaningful three-way merge base by replaying StackState's changes onto the upstream commit that the source and target DD versions share. Without this, git treats every line of every file StackState ever touched as a potential conflict.
There is no upstream remote in this repo. Pristine DataDog code is fetched from a separate DataDog clone and pushed to origin as base-* branches.
For a merge from current DD version <CURRENT> to target DD version <NEXT>:
```text
base-<CURRENT>                              ← pristine DD <CURRENT> upstream (no STS code)
    ↓
stackstate-<CURRENT>                        ← current fork main = base-<CURRENT> + STS changes
    ↓
common-ancestor-<CURRENT>-<NEXT>            ← upstream commit shared by both DD tags
    ↓
backport-<CURRENT>-common-ancestor-<NEXT>   ← STS changes replayed onto the common ancestor
    ↓
base-<NEXT>                                 ← pristine DD <NEXT> upstream
    ↓
merged-<CURRENT>-to-<NEXT>                  ← merge of backport into base-<NEXT>;
                                              conflict resolution and fix-ups land here
    ↓
stackstate-<NEXT>                           ← new fork main (created at cutover)
```
For a solo merge, fix-up commits go directly on merged-<CURRENT>-to-<NEXT>. When more than one person is contributing, open per-developer feature branches (any naming) off merged-<CURRENT>-to-<NEXT> and merge them back via MR.
| Branch | Contents | Created when |
|---|---|---|
| `base-<CURRENT>` | Pristine DD `<CURRENT>` upstream commit, no StackState code | Already exists from the previous merge |
| `stackstate-<CURRENT>` | Current fork main (= `base-<CURRENT>` + all STS changes) | Already exists; this is the live main branch |
| `common-ancestor-<CURRENT>-<NEXT>` | Output of `git merge-base base-<CURRENT> base-<NEXT>`, i.e. the upstream commit shared by both DD versions | New, this merge |
| `backport-<CURRENT>-common-ancestor-<NEXT>` | `common-ancestor-...` + every StackState change from `stackstate-<CURRENT>` replayed on top | New, this merge |
| `base-<NEXT>` | Tip of DD's `<MAJOR.MINOR>.x` release branch at prep time, named after the latest released patch (NOT the version tag; see Prep commands note below) | New, this merge |
| `merged-<CURRENT>-to-<NEXT>` | Result of merging `backport-...` into `base-<NEXT>` plus all conflict-resolution and fix-up commits | New, this merge |
| `stackstate-<NEXT>` | Final post-merge state, becomes the new fork main | At cutover |
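Since every branch name is derived from the two version strings, the full set can be generated up front to avoid typos. A small sketch, where `merge_branches` is a hypothetical helper and the versions passed to it are only illustrative:

```shell
# Print the scaffolding branch names for a merge from $1 (current) to $2 (next).
merge_branches() {
    current=$1
    next=$2
    printf '%s\n' \
        "base-$current" \
        "stackstate-$current" \
        "common-ancestor-$current-$next" \
        "backport-$current-common-ancestor-$next" \
        "base-$next" \
        "merged-$current-to-$next" \
        "stackstate-$next"
}

merge_branches 7.71.2 7.78.2
```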
These assume a separate DataDog clone exists somewhere on disk (e.g., a clone of https://github.com/DataDog/datadog-agent). If you don't have one, clone it once — it's a large repo, treat it as a long-lived workspace.
Why the release branch tip and not the tag: DataDog's release tags often point at release-prep commits that are off the <MAJOR.MINOR>.x branch line of history (changelog generators, version bumpers, etc.). Using a tag commit as base-<NEXT> can push git merge-base base-<CURRENT> base-<NEXT> further back in history than necessary — sometimes to the previous DD minor version's branch point — yielding a less useful three-way merge base. The <MAJOR.MINOR>.x branch tip is on the "real" line of history and matches what the previous merge cycle did (verify by git branch -r --contains <previous base-* tip>).
```shell
# 1. In the DataDog clone: push the DD release-branch tip as base-<NEXT>.
#    Name the branch after the latest released patch version (e.g., base-7.78.2,
#    even when origin/7.78.x has moved a few backports past the 7.78.2 tag).
cd /path/to/datadog-agent
git fetch origin
git push <stackstate-gitlab-remote> origin/<MAJOR.MINOR>.x:refs/heads/base-<NEXT>

# 2. Back in the StackState fork: get the new base branch locally
cd /path/to/stackstate-agent
git fetch origin
git checkout base-<NEXT>

# 3. Compute and push the common ancestor
COMMON=$(git merge-base base-<CURRENT> base-<NEXT>)
git push origin "$COMMON":refs/heads/common-ancestor-<CURRENT>-<NEXT>
git fetch origin

# 4. Build the backport branch: STS changes replayed onto the common ancestor
git checkout -b backport-<CURRENT>-common-ancestor-<NEXT> common-ancestor-<CURRENT>-<NEXT>
# Bring over every file StackState changed vs. base-<CURRENT>.
# Caveat: paths StackState deleted are listed by the diff but do not exist in
# stackstate-<CURRENT>, so the checkout fails on them; drop those from the list
# and `git rm` them instead. Paths containing whitespace also need extra care.
git diff --name-only base-<CURRENT>..stackstate-<CURRENT> > /tmp/sts-files.txt
git checkout stackstate-<CURRENT> -- $(cat /tmp/sts-files.txt)
git commit -m "All StackState changes replayed on top of common-ancestor-<CURRENT>-<NEXT>"
git push -u origin backport-<CURRENT>-common-ancestor-<NEXT>

# 5. Open the merge branch and do the actual merge
git checkout -b merged-<CURRENT>-to-<NEXT> base-<NEXT>
git merge backport-<CURRENT>-common-ancestor-<NEXT>
# resolve conflicts (this is the big sit-down), then commit
git push -u origin merged-<CURRENT>-to-<NEXT>
```

After step 5, the merge tree is in place and you move on to the workflow below. Fix-ups can be committed directly on `merged-<CURRENT>-to-<NEXT>`, or via feature branches if multiple people are working in parallel.
Worth setting up at this point: a second clone at a sibling path checked out at the merge commit (the tip of merged-<CURRENT>-to-<NEXT> before any fix-up commits land). Useful for diffing "what have I changed since the merge point" without polluting the working tree. Naming convention: <repo-path>-compare.
Pre-merge branch setup (above) is a prerequisite. By the time you're here, merged-<CURRENT>-to-<NEXT> exists with the conflict-resolution merge committed.
- Fix compilation errors on `merged-<CURRENT>-to-<NEXT>` (or on feature branches off it)
- Update `fix_branding.sh` to handle new branding patterns
- Iterate on CI until `branded_unit_tests` and `unbranded_unit_tests` pass on both x86 and ARM
- Verify all StackState-specific code blocks (see "StackState-Specific Code That Can Be Lost During Merge" above)
- Run integration tests (beest) against the produced container images
- Deploy to sandbox and verify metric enrichment (cluster_name, k8s_cluster, scope labels present)
- Fix any runtime issues
- Cut over the new branch to be the fork's main — see "Cutover" below
Once the merge branch (e.g., stackstate-7.71.2) has clean CI, sandbox verification is healthy, and the team is ready to retire the previous main, four repositories need coordinated changes in lockstep. Without coordination, the nightly promoter pipeline, beest CI gating, and the helm chart appVersion silently desynchronize and you end up debugging "why did my dev tag get clobbered overnight?" the morning after.
Agent repository:
- Set the new branch as the GitLab default branch (Settings → Repository → Branch defaults).
- Update protected branches: add the new branch, optionally remove the old one (or keep it for a grace period).
- The branch name pattern `stackstate-<DD-version>` is the convention; keep it.

agent-promoter:
- `main.py:103` hard-codes the agent's main branch: `AgentOps("stackvista/agent/stackstate-agent", "stackstate-agent", "stackstate-7.51.1")`. Update the third argument to the new branch.
- The nightly `promote_agent_master_to_promoter_dev` pipeline reads this and rewrites `config.yml` and `deploy/argocd/common/apps/dev-agent/values.yaml` with the latest commit on that branch. If you skip this step, every overnight run will continue setting dev tags from the old branch, clobbering whatever dev verification you're trying to do on the new one.
- `.github/copilot-instructions.md` (if tracked in your local copy) also references the branch name; check and update.
helm-charts:
- Bump `Chart.yaml`'s `appVersion`. Convention: `<STS-major>.<DD-minor>.<DD-patch>`. The StackState major (currently `3`) tracks DD's major-version family: DD v5/v6/v7 mapped to STS v1/v2/v3 historically. So DD `7.71.2` → STS appVersion `3.71.2`. The chart `version` (separate from `appVersion`) follows its own SemVer cadence and must be bumped, per the `verify_versions_bumped.sh` rules, whenever any file under `stable/<chart>/` changes.
- Audit `templates/_container-agent.yaml` and `templates/checks-agent-deployment.yaml` for env vars deprecated by the new agent. Concrete example from the 7.71.2 cutover: removed `STS_PROCESS_AGENT_ENABLED` (the deprecated `process_config.enabled` key). The replacement pair `STS_PROCESS_CONFIG_PROCESS_COLLECTION_ENABLED` + `STS_PROCESS_CONFIG_CONTAINER_COLLECTION_ENABLED` had been added alongside it earlier, so the removal was a no-op deletion. Look for similar deprecation pairs introduced upstream during the merge.
- `nodes/stats` RBAC entry must be present in `templates/node-agent-clusterrole.yaml` (was missing pre-cutover; verify it's still there).
- Image tags in `values.yaml` are managed by the agent-promoter nightly; leave them alone in the cutover MR. Once the agent-promoter change above is merged, the next nightly will write a tag from the new branch.
- Pre-commit hooks must run for every commit in this repo (helm-docs, shellcheck, helm-lint). Don't squash commits past hook runs.
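The `appVersion` convention above is mechanical enough to express as a one-line mapping. The sketch below covers only the DD 7 → STS 3 family described in the text; `to_app_version` is a hypothetical helper, not a script in the helm-charts repo.

```shell
# Map a DD 7.x.y version to the StackState chart appVersion (3.x.y).
to_app_version() {
    case "$1" in
        7.*.*) printf '3.%s\n' "${1#7.}" ;;
        *) printf 'unsupported DD version: %s\n' "$1" >&2; return 1 ;;
    esac
}

to_app_version 7.71.2   # prints 3.71.2
```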
In beest, the agent's main branch name is referenced in roughly 30 places, all needing the same find-and-replace:
- 5 CI rule files use `merge_train_always` rules pinned to the agent's main branch:
  - `.gitlab-ci-rancher-tests.yml`
  - `.gitlab-ci-suse-observability-cli-tests.yml`
  - `.gitlab-ci-agent-x86-tests.yml`
  - `.gitlab-ci-agent-arm-tests.yml`
  - `.gitlab-ci-suse-observability-ui-inspection.yml`
- `Makefile:21`: last-resort `GIT_BRANCH ?= ... echo "<branch>"` fallback
- `helpers/resolve-agent-hashes.sh:48`: `AGENT_DEFAULT_BRANCH` fallback (and the matching comment on line 47)
- `README.md:143`: example value for the `AGENT_BRANCH_UNDER_TEST` env var
- `docs/setup-locally.md` references a non-existent `beest/` subfolder of the agent repo at the old branch. That link has been dead for a while; either fix or delete it as a separate cleanup.
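One way to approach the find-and-replace is to list the affected files first and then rewrite them one at a time, reviewing the diff before committing. The helpers below are a sketch with hypothetical names; they assume branch names contain no regex or sed metacharacters other than dots, which holds for the `stackstate-<DD-version>` pattern.

```shell
# List files under a directory that mention the old branch name (fixed string).
list_branch_refs() {
    # $1 = directory, $2 = old branch name
    grep -rlF -- "$2" "$1"
}

# Replace the old branch name with the new one in a single file.
replace_branch_in_file() {
    # $1 = file, $2 = old branch name, $3 = new branch name
    old_re=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for sed
    # Write back with cat so the file keeps its mode (e.g. executable scripts).
    sed "s/$old_re/$3/g" "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
}

# Example (run from a checked-out subdirectory, then inspect `git diff`):
#   list_branch_refs helpers stackstate-7.51.1
#   replace_branch_in_file Makefile stackstate-7.51.1 stackstate-7.71.2
```

Listing before rewriting also surfaces references that should not be changed mechanically, such as the dead link in `docs/setup-locally.md`.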
Order matters slightly. Recommended:
- Merge the agent default-branch change AND the beest CI change on roughly the same day.
- Merge the agent-promoter change next — its nightly run that night will start writing tags from the new branch.
- Merge the helm-charts appVersion bump whenever convenient (independent).
Don't merge the agent-promoter change before the agent default branch flips, or the next nightly will fail to find commits to promote.