.github/workflows/rolling-update.yml (new file, 199 additions)
name: Rolling update

# Manually-triggered production rollout. Joins the Tailnet, SSHes over
# MagicDNS into each node, and invokes scripts/rolling-update.sh.
# See docs/design/2026_04_24_proposed_deploy_via_tailscale.md.

on:
  workflow_dispatch:
    inputs:
      ref:
        description: Git ref (tag or sha) to deploy. Also used as the image tag unless image_tag is set.
        required: true
        type: string
      image_tag:
        description: Override the image tag (default = ref). Used for rollbacks.
        required: false
        type: string
        default: ""
      nodes:
        description: Comma-separated raft IDs to roll (e.g. "n1,n2"). Empty = all nodes in NODES_RAFT_MAP.
        required: false
        type: string
        default: ""
      dry_run:
        description: Render the plan and run a reachability check only; do NOT touch containers.
        required: true
        type: boolean
        default: true

permissions:
  contents: read
  id-token: write # required by tailscale/github-action OIDC flow
Comment on lines +30 to +32

P1: Add packages:read for GHCR manifest check

This workflow narrows GITHUB_TOKEN to `contents` and `id-token`, which implicitly removes the packages scope, but the "Verify image exists on ghcr.io" step authenticates to GHCR and inspects a manifest with that token. In environments where the image is private (or package auth is otherwise required), `docker login` / `docker manifest inspect` will fail with authorization errors before rollout begins, so deploys are blocked even when the image exists.
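A minimal sketch of the permissions block this comment suggests, assuming the GITHUB_TOKEN is acceptable for GHCR reads in this repository:

```yaml
permissions:
  contents: read
  id-token: write  # required by tailscale/github-action OIDC flow
  packages: read   # lets the GITHUB_TOKEN-based docker login read the GHCR manifest
```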


concurrency:
  group: rolling-update
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Approval gate — see GitHub environment settings for required reviewers.
    # Dry-runs also use this environment so the secret wiring is identical;
    # the environment's approval rule should be configured to auto-approve
    # dry-runs if that distinction is desired (GitHub UI: "Deployment
    # protection rules").
    environment: production
    timeout-minutes: 60

    steps:
      - name: Checkout
        uses: actions/checkout@v6
        with:
          ref: ${{ inputs.ref }}
Comment on lines +85 to +87

P1: Restrict deploy script checkout to trusted refs

The workflow checks out ${{ inputs.ref }} and later executes ./scripts/rolling-update.sh after loading production credentials (SSH key, Tailscale OAuth secret) and joining the tailnet, so `ref` is effectively arbitrary code execution, not just a deployment selector. If someone can dispatch runs and supply a branch/tag they control, a modified script in that ref can exfiltrate secrets or perform unintended actions once the environment is approved. Keep execution on a trusted protected revision (or strictly validate allowed refs) and use the input only for image selection.
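One hedged way to address this: run the deploy scripts from a protected revision and keep `inputs.ref` purely as an image selector. A sketch (the branch name `main` is an assumption, not the committed workflow):

```yaml
      - name: Checkout deploy scripts (trusted)
        uses: actions/checkout@v6
        with:
          # always execute scripts/rolling-update.sh from the protected default branch
          ref: main
      # inputs.ref / inputs.image_tag are then used only to select the container image
```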


      - name: Install jq
        run: sudo apt-get install -y --no-install-recommends jq

      - name: Verify image exists on ghcr.io
        env:
          IMAGE_BASE: ${{ vars.IMAGE_BASE }}
          IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}
          GHCR_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          if [[ -z "$IMAGE_BASE" ]]; then
            echo "::error::IMAGE_BASE repository variable is not set"
            exit 1
          fi
          echo "Checking $IMAGE_BASE:$IMAGE_TAG"
          echo "$GHCR_TOKEN" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin >/dev/null
          if ! docker manifest inspect "$IMAGE_BASE:$IMAGE_TAG" >/dev/null; then
            echo "::error::image $IMAGE_BASE:$IMAGE_TAG not found on ghcr.io"
            exit 1
          fi

      - name: Join Tailnet (ephemeral)
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci-deploy

      - name: Configure SSH
        env:
          SSH_KEY: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY }}
          KNOWN_HOSTS: ${{ secrets.DEPLOY_KNOWN_HOSTS }}
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh
          chmod 700 ~/.ssh
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
          printf '%s\n' "$KNOWN_HOSTS" > ~/.ssh/known_hosts
          chmod 644 ~/.ssh/known_hosts
          # Sanity: no stray CRLF in the key, no empty file.
          test -s ~/.ssh/id_ed25519 || { echo "::error::DEPLOY_SSH_PRIVATE_KEY is empty"; exit 1; }
          ssh-keygen -lf ~/.ssh/id_ed25519 >/dev/null

      - name: Render NODES and SSH_TARGETS
        id: render
        env:
          NODES_RAFT_MAP: ${{ vars.NODES_RAFT_MAP }}
          SSH_TARGETS_MAP: ${{ vars.SSH_TARGETS_MAP }}
          NODES_FILTER: ${{ inputs.nodes }}
        run: |
          set -euo pipefail
          if [[ -z "$NODES_RAFT_MAP" || -z "$SSH_TARGETS_MAP" ]]; then
            echo "::error::NODES_RAFT_MAP or SSH_TARGETS_MAP is not set in the production environment variables"
            exit 1
          fi
          if [[ -n "$NODES_FILTER" ]]; then
            # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset.
            filter_csv() {
              local all="$1"
              local filter="$2"
              local out=""
              IFS=',' read -r -a entries <<< "$all"
              IFS=',' read -r -a wanted <<< "$filter"
              for e in "${entries[@]}"; do
                key="${e%%=*}"
                for w in "${wanted[@]}"; do
                  if [[ "$key" == "$w" ]]; then
                    out+="${e},"
                    break
                  fi
                done
              done
              echo "${out%,}"
            }
            NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")"
            SSH_TARGETS_MAP="$(filter_csv "$SSH_TARGETS_MAP" "$NODES_FILTER")"
            if [[ -z "$NODES_RAFT_MAP" ]]; then
              echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP"
              exit 1
Comment on lines +189 to +191

P2: Reject partially invalid node filters

The filter logic silently drops unknown raft IDs and only errors when no IDs match, so an input like `nodes: n1,n9` proceeds as a rollout of n1 only. That can leave operators believing multiple nodes were upgraded when one was skipped, which is risky for staged deploys/rollbacks and can leave the cluster in an unintended mixed-version state. Treat `inputs.nodes` as an exact set and fail if any requested ID is missing from the configured map(s).

Comment on lines +189 to +191

P1: Fail when requested node IDs are only partially matched

The nodes filter currently errors only when zero IDs match, so a typo like `n1,n9` (or a whitespace variant, `n1, n2`) silently drops unmatched IDs and proceeds with a partial rollout. That means operators can believe they updated a specific subset while one or more intended nodes were skipped, which is risky during security or incident-driven deploys. Please validate that every requested ID is present in NODES_RAFT_MAP and fail if any are missing.

            fi
          fi
          {
            echo "NODES=$NODES_RAFT_MAP"
            echo "SSH_TARGETS=$SSH_TARGETS_MAP"
          } >> "$GITHUB_OUTPUT"
          echo "::group::Deploy plan"
          echo "NODES=$NODES_RAFT_MAP"
          echo "SSH_TARGETS=$SSH_TARGETS_MAP"
          echo "::endgroup::"

      - name: Tailscale reachability check
        env:
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
        run: |
          set -euo pipefail
          IFS=',' read -r -a entries <<< "$SSH_TARGETS"
          failed=0
          for e in "${entries[@]}"; do
Comment on lines +208 to +210

P2: Check reachability for all rollout nodes, not only SSH map entries

The reachability step iterates only over SSH_TARGETS, but rolling-update.sh resolves missing SSH mappings by falling back to each node host from NODES (ssh_target_by_id). If SSH_TARGETS_MAP is incomplete, dry-run can report success without probing some actual rollout targets, and the job can then fail mid-roll when it reaches an unvalidated node. Preflight should derive targets from NODES + SSH_TARGETS using the same fallback semantics, or enforce one-to-one mapping coverage first.

            host="${e##*=}"
            host="${host%%:*}"
            # strip user@ if present
            host="${host##*@}"
            if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then
              echo " ok $host"
            else
              echo "::error::$host not reachable over tailnet"
              failed=1
            fi
          done
          if [[ "$failed" -ne 0 ]]; then
            exit 1
          fi
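Tying into the coverage comment above: a hedged sketch that derives probe targets from NODES with SSH_TARGETS as an override, mirroring the fallback the comment describes (this assumes the step also receives NODES from steps.render.outputs; the script's real resolution lives in scripts/rolling-update.sh and may differ):

```bash
# Prefer the SSH mapping; fall back to the raft host when a node has no entry.
declare -A ssh_by_id
IFS=',' read -r -a ssh_entries <<< "$SSH_TARGETS"
for e in "${ssh_entries[@]}"; do ssh_by_id["${e%%=*}"]="${e#*=}"; done

IFS=',' read -r -a node_entries <<< "$NODES"
for e in "${node_entries[@]}"; do
  id="${e%%=*}"
  host="${ssh_by_id[$id]:-${e#*=}}"   # fallback: raft host when no SSH mapping exists
  host="${host%%:*}"; host="${host##*@}"
  tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1 || echo "::error::$host unreachable"
done
```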

      - name: Dry-run summary
        if: ${{ inputs.dry_run }}
        env:
          NODES: ${{ steps.render.outputs.NODES }}
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
          IMAGE_BASE: ${{ vars.IMAGE_BASE }}
          IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}
          SSH_USER: ${{ vars.SSH_USER }}
        run: |
          set -euo pipefail
          cat <<EOF
          ==== DRY RUN — no containers were touched ====
          image: ${IMAGE_BASE}:${IMAGE_TAG}
          SSH user: ${SSH_USER}
          NODES: ${NODES}
          SSH_TARGETS: ${SSH_TARGETS}
          ref: ${{ inputs.ref }}
          Re-run with dry_run=false to apply.
          EOF

      - name: Roll cluster
        if: ${{ !inputs.dry_run }}
        env:
          NODES: ${{ steps.render.outputs.NODES }}
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
          SSH_USER: ${{ vars.SSH_USER }}
          IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }}
          SSH_STRICT_HOST_KEY_CHECKING: "yes"
        run: |
          set -euo pipefail
          ./scripts/rolling-update.sh
docs/deploy_via_tailscale_runbook.md (new file, 153 additions)
# Deploy via Tailscale + GitHub Actions — Runbook

Companion doc to `docs/design/2026_04_24_proposed_deploy_via_tailscale.md`. This
runbook is for operators: what to configure on GitHub and Tailscale so the
`rolling-update` workflow can execute a production deploy.

## 1. Precondition: Tailscale on every node

Each cluster node must have `tailscale` installed, logged into the tailnet, and
tagged so the CI runner's ACL can reach it.

```
# on each kv0X node
sudo tailscale up \
  --ssh=false \
  --advertise-tags=tag:elastickv-node \
  --accept-routes=false
```

Contributor comment on lines +14 to +17 (medium):

The instructions specify `--ssh=false`, which disables Tailscale SSH. This approach requires the standard system SSH daemon (sshd) to be running and correctly configured to accept connections on the Tailscale interface. It would be safer to explicitly mention this requirement to prevent operators from being locked out of the deployment flow if they rely solely on Tailscale SSH for access.

Verify the node is reachable by MagicDNS from another tailnet peer:

```
tailscale status | grep kv0X
ping kv0X.<tailnet>.ts.net
```
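As the review comment above notes, with `--ssh=false` the deploy path depends on the system sshd. A hedged spot-check from another tailnet peer (user and hostname are placeholders):

```
# confirm sshd answers on the node's tailnet address
nc -vz kv0X.<tailnet>.ts.net 22
ssh -o BatchMode=yes <deploy-user>@kv0X.<tailnet>.ts.net true && echo "ssh ok"
```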

## 2. Tailscale ACL

In the Tailscale admin console, add the deploy rule to the tailnet ACL:

```jsonc
"tagOwners": {
"tag:ci-deploy": ["autogroup:admin"],
"tag:elastickv-node": ["autogroup:admin"],
},
"acls": [
{
"action": "accept",
"src": ["tag:ci-deploy"],
"dst": ["tag:elastickv-node:22"],
},
],
```

`tag:ci-deploy` must NOT have access to any other port on the tailnet. The
deploy workflow only needs SSH.
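Tailscale's policy file also supports ACL tests; a hedged sketch that would catch accidental widening of this rule (the denied port is only an example of something that must stay closed):

```jsonc
"tests": [
  {
    "src": "tag:ci-deploy",
    "accept": ["tag:elastickv-node:22"],
    // 6379 stands in for "any non-SSH port"; adjust to your policy
    "deny": ["tag:elastickv-node:6379"],
  },
],
```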

## 3. Tailscale OAuth client

Admin console → Settings → OAuth clients → New client:

- Description: `elastickv GitHub Actions deploy`
- Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions
  may additionally require `devices:core` (write); enable that if the
  join step fails with an authorization error. The action's README is
  the definitive source for current scope requirements.

Contributor comment (medium):

The scope `devices:core` is not a standard Tailscale OAuth scope. For the tailscale/github-action to successfully join the tailnet and manage ephemeral nodes, it typically requires the `devices:write` scope (along with `auth_keys:write`). Using an incorrect scope name will cause the authentication step to fail.

Suggested change:
- may additionally require `devices:core` (write); enable that if the
+ may additionally require devices (write); enable that if the

Contributor comment (medium):

When using an OAuth client with the tailscale/github-action, the devices scope with write access is generally required to allow the action to register the ephemeral runner as a node. I recommend updating the instructions to include this scope as a requirement to avoid authorization errors during the tailnet join step.

Suggested change:
- Scopes: auth_keys (write) and devices (write). The devices scope (or devices:core in newer API versions) is required for the action to register the ephemeral runner as a node in your tailnet.

- Tags: `tag:ci-deploy`

Copy the client ID and secret; they go into GitHub in the next step.

## 4. GitHub environment: `production`

Repo → Settings → Environments → New environment: `production`.

### Required reviewers
Configure "Required reviewers" on the environment. Non-dry-run deploys will
pause until one of the reviewers approves. Configure "Deployment protection
rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts
friction for previews).
Contributor comment (medium):

GitHub's native environment protection rules do not support conditional auto-approval based on workflow inputs. Implementing this would require a custom GitHub Action for Deployment Protection Rules. It would be more accurate to state that dry-runs will still require manual approval if they target the production environment, or suggest using a separate, unprotected environment for previews.


### Environment secrets

| Name | Value |
|------|-------|
| `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 |
| `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 |
| `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user |
| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net …` output (one host per line) |
Contributor comment (security, medium):

Using ssh-keyscan without the -H flag stores hostnames in plain text in the known_hosts file. For better security hygiene, it is recommended to use the -H flag to hash hostnames, which helps prevent host discovery if the secrets or the runner environment are compromised.


The SSH key should be ed25519, dedicated to CI (not a reused developer key).
Regenerate on operator rotation.
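A hedged sketch of producing both secrets, folding in the -H recommendation above (hostnames are placeholders):

```
# dedicated CI deploy key, no passphrase
ssh-keygen -t ed25519 -N "" -C "elastickv-ci-deploy" -f ./deploy_key
# paste deploy_key into DEPLOY_SSH_PRIVATE_KEY; add deploy_key.pub to the deploy
# user's authorized_keys on every node

# hashed known_hosts output for DEPLOY_KNOWN_HOSTS
ssh-keyscan -H kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net kv03.<tailnet>.ts.net
```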

### Environment variables

| Name | Value | Example |
|------|-------|---------|
| `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` |
| `SSH_USER` | SSH login on every node | `bootjp` |
| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |
| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |
Contributor comment (medium):

The environment variable names NODES_RAFT_MAP and SSH_TARGETS_MAP differ from the names expected by the scripts/rolling-update.sh script (NODES and SSH_TARGETS). While the workflow likely performs a mapping, this discrepancy can lead to confusion if an operator attempts to run the script manually using the variables defined in this runbook. Aligning the names or explicitly noting the mapping would improve maintainability and reduce friction during manual troubleshooting.
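For reference, the workflow's "Roll cluster" step performs exactly that mapping; a hedged sketch of the equivalent manual invocation from a tailnet-joined operator machine (values are examples):

```
NODES="$NODES_RAFT_MAP" \
SSH_TARGETS="$SSH_TARGETS_MAP" \
SSH_USER=bootjp \
IMAGE=ghcr.io/bootjp/elastickv:v1.2.3 \
SSH_STRICT_HOST_KEY_CHECKING=yes \
./scripts/rolling-update.sh
```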


## 5. Running a deploy

Actions tab → "Rolling update" → Run workflow.

Inputs:

- `ref` — the git tag or sha to deploy (also used as the container image tag)
- `image_tag` — override only for rollbacks, when the image to run differs from
  the ref being checked out (e.g., roll back to the previous known-good image
  while keeping `ref` on the current release)
- `nodes` — subset of raft IDs, e.g., `n1,n2`. Empty rolls all nodes.
- `dry_run` — default `true`. Renders the plan and checks reachability without
touching containers.

Recommended first-run sequence:

1. `dry_run: true`, `nodes: n1`, `ref: <target>` — confirms tailnet join,
SSH config, image availability, target mapping. No production impact.
2. `dry_run: false`, `nodes: n1` — roll a single node, verify the cluster
stays healthy and the image is correct.
3. `dry_run: false`, `nodes:` (empty) — full roll.
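The Actions tab is the canonical entry point; if you prefer the CLI, the same dispatch works through gh (a sketch of step 1 above, assuming gh is authenticated against this repository):

```
gh workflow run rolling-update.yml \
  -f ref=v1.2.3 \
  -f nodes=n1 \
  -f dry_run=true
```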

## 6. Rollback

Re-run the workflow with `image_tag` set to the previous-known-good sha. The
`nodes` input can target specific nodes if only some carry the bad image.
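A hedged example of a targeted rollback dispatch (the shas are placeholders):

```
gh workflow run rolling-update.yml \
  -f ref=<current-release-sha> \
  -f image_tag=<previous-known-good-sha> \
  -f nodes=n3 \
  -f dry_run=false
```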

## 7. What the workflow does NOT do (yet)

- **No post-deploy health verification beyond tailnet reachability.** The
script itself blocks on `raftadmin` leadership transfer and health-gate
timeouts, but the workflow does not independently probe Prometheus or
Redis after the roll. Add this when we have a canonical post-deploy
assertion suite.
- **No auto-rollback on failure.** If the script exits non-zero mid-roll,
the cluster is left in whatever state the script reached. The operator
must inspect and either re-roll or roll back manually.
- **No Jepsen gate.** The deploy does not require a green Jepsen run on
`ref` before proceeding.
- **No image-signature check.** `cosign verify` on the image is a follow-up.
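Until a canonical post-deploy assertion suite exists, a hedged manual spot-check after a roll might look like this (the Redis-protocol port and the expectation that every node answers PING are assumptions about the cluster's setup):

```
# expect PONG from every node and the new image in `docker ps` on each host
for h in kv01 kv02 kv03 kv04 kv05; do
  redis-cli -h "$h.<tailnet>.ts.net" -p 6379 ping
  ssh "$SSH_USER@$h.<tailnet>.ts.net" "docker ps --format '{{.Image}}' | grep elastickv"
done
```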

## 8. Troubleshooting

### Job pauses indefinitely at "Waiting for approval"
Expected for non-dry-run deploys — a reviewer from the `production` environment
must click Approve. Check the "Required reviewers" list in the environment
settings.

P2: Fix approval troubleshooting text for dry-run workflows

This troubleshooting guidance is incorrect for the workflow as implemented: .github/workflows/rolling-update.yml sets `environment: production` unconditionally, so dry-run and non-dry-run executions both pause for environment approval in v1. Telling operators this is only expected for non-dry-runs can send them down the wrong debugging path when a dry-run is legitimately waiting for approval.

### `tailscale ping` fails for a node
The node may not be running `tailscaled`, not tagged `tag:elastickv-node`, or
the tailnet ACL may have drifted. `tailscale status` on the node should show
the tag; the admin console should show the IP in the `tag:elastickv-node`
group.
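A hedged check to run on the affected node (the JSON field names are those current tailscale releases emit and may differ on older versions):

```
# on the node
sudo systemctl is-active tailscaled
tailscale status --json | jq '.Self.Online, .Self.Tags'
```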

### `image ... not found on ghcr.io`
The verification step hit the ghcr manifest API and got a 404. Either the
image tag was not pushed (check the `Docker Image CI` workflow for `ref`) or
the tag is a moving tag (`latest`) that the verification step can't
distinguish from stale. Specify an immutable tag.
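To reproduce the check locally, a sketch using a personal access token with read:packages:

```
echo "$GITHUB_PAT" | docker login ghcr.io -u <github-username> --password-stdin
docker manifest inspect ghcr.io/bootjp/elastickv:<tag>
```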

### SSH `Host key verification failed`
`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan` against every node and
update the secret.
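A hedged one-liner to refresh it (assumes gh is authenticated and the secret lives in the production environment):

```
ssh-keyscan -H kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net kv03.<tailnet>.ts.net \
  | gh secret set DEPLOY_KNOWN_HOSTS --env production --repo bootjp/elastickv
```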