Skip to content

Dispatch cancel command for standalone activities#10218

Draft
rkannan82 wants to merge 74 commits into
kannan/move-dispatch-response-to-errorfrom
kannan/standalone-activity-cancel-commands
Draft

Dispatch cancel command for standalone activities#10218
rkannan82 wants to merge 74 commits into
kannan/move-dispatch-response-to-errorfrom
kannan/standalone-activity-cancel-commands

Conversation

@rkannan82
Copy link
Copy Markdown
Contributor

@rkannan82 rkannan82 commented May 11, 2026

What

When a standalone (CHASM) activity is cancel-requested or terminated while running on a worker, dispatch a cancel command via the Nexus worker commands control queue — same mechanism added for workflow-based activities in #9233 and #10047.

Why

Without this, standalone activities only discover cancellation passively via heartbeat or completion. If heartbeat isn't enabled, the activity runs to completion uselessly.

How did you test it?

Unit tests (chasm/lib/activity):

  • Worker control task queue stored from poll request on activity start
  • Cancel request dispatches cancel command task when activity is started with a control queue
  • Terminate dispatches cancel command task for started and cancel-requested activities

Functional tests (tests/standalone_activity_test.go):

  • TestDispatchCancelCommandToWorker/CancelRequest: start activity with control queue, cancel, verify cancel command arrives on Nexus control queue with correct task token
  • TestDispatchCancelCommandToWorker/Terminate: same flow but with terminate

🤖 Generated with Claude Code

stephanos and others added 7 commits May 11, 2026 18:48
## What changed?

Port `WaitForChannel` and `SendToChannel` to TestEnv.

## Why?

Safer / simpler.
When a standalone activity's cancellation is requested or it is terminated
while running on a worker, proactively dispatch a cancel command via the
Nexus worker commands control queue. This avoids relying on the worker to
discover cancellation only through heartbeat responses.

Changes:
- Add worker_control_task_queue field to ActivityAttemptState proto
- Store control queue from poll request in TransitionStarted
- Add CancelCommandDispatchTask side-effect task
- Schedule dispatch task on cancel request and terminate
- Dispatch cancel command via Nexus to matching service

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests both cancel-request and terminate paths: start activity with
worker control queue, trigger cancellation/termination, verify cancel
command arrives on the Nexus control queue with correct task token.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the duplicated cancelCommandDispatchResponseToError with the
shared commonnexus.DispatchResponseToError now available from the
merged kannan/move-dispatch-response-to-error branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkannan82 rkannan82 force-pushed the kannan/standalone-activity-cancel-commands branch from 77e0a54 to d68c034 Compare May 11, 2026 19:26
@rkannan82 rkannan82 changed the base branch from main to kannan/move-dispatch-response-to-error May 11, 2026 19:26
rkannan82 and others added 21 commits May 11, 2026 13:27
…10219)

## What

Change the `DispatchNexusTask` call in `workerCommandsTaskDispatcher` to
use `TASK_QUEUE_KIND_WORKER_COMMANDS` instead of
`TASK_QUEUE_KIND_NORMAL`.

## Why

Worker commands queues use a dedicated partition type
(`WorkerCommandsPartition`) with different properties than normal
partitions. This new kind was introduced in
#9899.

## How did you test it?

Unit test: Updated e2e test and added an assertion on the dispatched
request's task queue kind.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?

Adds a new package `testing/testcontext` for creating and managing test
`context.Context`s.

No behavior change expected/intended.

## Why?

tl;dr encapsulating the test context behavior and decouple from TestEnv

Bigger picture: Right now the test context behavior is coupled to
TestEnv; and legacy suites use `testcore.NewContext()`. That served us
well so far, but it's suboptimal. A test context should be available
outside of TestEnv and `tests/` package; ie unit and integration tests
can also benefit it.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)
…line WFT path (#10217)

## Summary
- Added revision number mechanics to strengthen checks while deciding
between clearing/setting the LastNotifiedTargetVersion!

## Problem
- The inline WFT path in `RespondWorkflowTaskCompleted` (both
`bypassTaskGeneration` at line 596 and speculative at line 722) calls
`AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and
`targetRevisionNumber=0` because matching is never consulted on this
path.
- In AddWorkflowTaskStartedEvent, case 3 compared only `buildId +
deploymentName`, so a `nil` target fails that case and the default fires
— also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For
update- or signal-driven PINNED workflows (where the updates and the
signals were buffered on workflow task completion) this manifested as a
continuous CaN loop on a stable deployment.
- A separate but related hazard: even on the normal poll path, a stale
matching partition that happens to report `target == effective` by
buildId would spuriously hit case 3 and clear legitimate notifications,
silently defeating the trampolining-suppression mechanism from #9895.

## Test plan
- [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration
test that exercises the user-reported scenario:
- Sends a buffered signal during a regular WFT, completes with
`ReturnNewWorkflowTask=true`.
- Asserts the inline follow-up WFT's WFT-Started event has `requestId ==
"request-from-RespondWorkflowTaskCompleted"` (self-verification that the
test actually exercises the inline code path).
- Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline
WFT-Started event.
- Confirmed to **fail** before the state machine change (test commit is
on top of the fix commit) and **pass** after.
- [x] All 8 related existing tests still pass:
  - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895)
  - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop`
  - `TestOverride_SuppressesTargetVersionChangedSignal`
  - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal`
  - `TestPinnedCaN_TargetChangesAgain_SignalsTrue`
  - `TestRemoveOverride_ClearsDeclinedState`
  - `TestRetryOfDeclinedCaN_SignalsOnNewTarget`
  - `TestPinnedCaN_RollbackResetsDeclined`

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches workflow-task start/versioning decision logic in history
service; mistakes could suppress legitimate upgrade notifications or
alter trampolining behavior, but changes are localized and covered by a
new integration test.
> 
> **Overview**
> Fixes a trampolining loop where inline workflow tasks created in
`RespondWorkflowTaskCompleted` could incorrectly set
`targetWorkerDeploymentVersionChanged` and corrupt
`LastNotifiedTargetVersion` despite not consulting matching.
> 
> Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now
pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task
state machine strengthens its decision logic by tracking the *highest
seen* matching revision and refusing to clear/set notification state
based on stale (older-revision) reports.
> 
> Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the
buffered-signal inline-WFT scenario and assert the flag remains false on
stable routing.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
baf0f63. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What changed?
Added `context.Context` as the first parameter to the
`callback.Validator` interface and threaded it through all call
sites (`WorkflowHandler`, `activity.FrontendHandler`, and their internal
validation/preparation methods). The base validator ignores the context.
No behavioral change.

## Why?
Deployments that decorate the validator (via `fx.Decorate`) need access
to the gRPC context to check caller identity — e.g. to allow internal
callers (scheduler) to attach `temporal://internal` callbacks while
rejecting them from external users.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
Port error handling pattern from workerCommandsTaskDispatcher: distinguish
UpstreamTimeout (no_poller metric), non-retryable handler errors, transport
errors, and permanent worker-returned failures. Fix misleading comment on
buildCancelCommandTaskToken.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test was using s.OverrideDynamicConfig, s.FrontendClient(), s.tv etc.
which don't exist on the suite — must use env from newTestEnv(). Also fix
gofmt formatting in activity_test.go and library.go.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…10201)

## Summary
- Fix `schedule_action_delay` for CHASM schedules: `DesiredTime` is nil
for most starts (only set when blocked behind overlap), causing the
metric to record ~56 years (now minus epoch). Use
`cmp.Or(start.DesiredTime, start.ActualTime)` to fall back to
`ActualTime`, matching V1 behavior.
- Add `schedule_generate_latency` timer metric to measure the delay
between when a scheduled action was due and when the generator buffered
it. Only recorded for non-manual (non-backfill) actions.
## Summary

- Emit the `EventBlobSize` for `UpdateWorkflowExecution` requess, tagged
with `namespace`.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What
Deduplicates `CancelOutstandingWorkerPolls` RPCs by destination matching
host during `ShutdownWorker`. Uses `Route()` on the matching client to
determine which host each partition maps to, then sends only one RPC per
unique host instead of one per partition.

## Why
With N partitions across H matching hosts (H << N), the current code
sends N RPCs per task type when H would suffice — the RPC cancels all
pollers for the `workerInstanceKey` on the target host regardless of
which partition was used for routing.

## How did you test it?
Unit test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What

Clear `StartedClock` on activity retry/pause. To do this, refactored the
code that clears per attempt field into a single
`ClearActivityStartedState` helper, and updated all code paths.

## Why

`StartedClock` is a per-attempt field introduced in #9233 to reconstruct
task tokens for cancel worker commands. It was not being cleared when
the activity leaves the started state (retry or pause), leaving a stale
value during backoff. This can cause cancel commands to be unnecessarily
dispatched for activities not currently running on any worker.

## How did you test it?

- Unit tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?

Replaces (almost) all use of `s.Eventually` with `s.EventuallyWithT`.

## Why?

Assertions are often used inside `s.Eventually` here and that's not safe
as it aborts the test immediately.
## What changed?
Use a metrics handler without `header_callsite` tag because Prometheus
rejects re-registering the same metric with a different label set and
logs `error in prometheus reporter ... event_blob_size ... has different
label names`.

## Why?
Fix found from regression

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Fixes a regression introduced in #10223
## What changed?
Added a short protorequire package subsection to
docs/development/testing.md documenting protorequire.ProtoEqual and the
new protorequire.IgnoreFields option, with a minimal usage example.

## Why?
Follow-up to PR #9937. Without a doc entry, the new IgnoreFields helper
is undiscoverable and contributors will keep reaching for the verbose
cmp.Diff pattern.


## How did you test it?
- [X] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What
Add `SyncMatchOutcome` enum to the hooks API (NotMatched, Success,
RateLimited) and plumb rate limiting signal from the matcher through to
hooks. Keep `IsSyncMatch` as deprecated for backwards compatibility.

## Why
Hook consumers (e.g. scaling operators) need to distinguish rate
limiting from genuine lack of pollers when deciding whether to scale up
workers.

## How did you test it?
Unit tests — rate-limited and non-rate-limited scenarios, multiple hooks
invocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?
Log message for nexus operation cancellation invocation

## Why?
Makes the log distinct from errors during operation invocation. 

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Reverting dropping tasks when feature flag disabled -> returning error

Interface change for active or not based on business id

## Why?
regression

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Adds TestPGXSimpleProtocol — a new entry in the persistence integration
suite that runs `PostgreSQLSuite` under the `postgres12_pgx` plugin with
`default_query_exec_mode=simple_protocol`.

## Why?
Regression coverage for issues like
[#9804](#9804). With pgx ≤
v5.9.1, current_executions.state/status (proto-enum-typed int32 fields
with a String() method) were text-encoded via fmt.Stringer and rejected
by `Postgres` on simple/exec protocol, the path users land on behind
PgBouncer in transaction pooling. pgx v5.9.2 fixed it upstream; this
test makes sure we notice if pgx is ever downgraded or if a similar
issue sneaks in.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [X] added new functional test(s)

Verified locally: passes on pgx v5.9.2; fails reproducibly on v5.9.1
with invalid input syntax for type integer: "Created" matching
[#9804](#9804).

## Potential risks
Adds one additional pass through PostgreSQLSuite to the Integration test
job. Job timeout is 15 min, so should be fine, but worth observing.
## What changed?
Add flag to skip setting up ES cluster settings. Eg:
`temporal-elasticsearch-tool setup-schema --skip-cluster-settings`

## Why?
Allow users to not use our provided cluster settings (eg: avoid
overwriting their cluster settings).
#9857

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
## What changed?
Revert `OperatorRateBurstImpl.Burst()` to `baseRateBurstFn.Burst()`.

## Why?
We don't have to reduce the burst value for operator priority

## How did you test it?
- [x] built
- [x] covered by existing tests
## What changed?

Added validation of the user metadata in the
`StartNexusOperationExecutionRequest`.

## Why?

All other fields are validated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)
## What changed?

Introduces shared constants for default long poll timeout/buffer and
applies them to SAA and SANO.

## Why?

After a long internal technical discussion, 60s was determined as the
default for the timeout.
awln-temporal and others added 27 commits May 15, 2026 18:35
## What changed?
Add a Scheduler specific query converter to handle `ScheduleId` search
attribute alias to underlying `WorkflowId` system search attribute
field. If chasm is enabled, the query converter will match against both
V1 and V2 Scheduler workflowId formats. V1 prefixes the workflowId with
`temporal-sys-scheduler`, while V2 doesn't, so to handle either case, we
need to add a OR/AND statement to match both V1 and V2 Scheduler
BusinessId. OR will be used if the operator's boolean is positive, AND
if the operator's boolean is negative.

## Why?
Unable to retrieve V2 Schedules that query using `ScheduleId` as a
search attribute.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [X] added new functional test(s)
## Summary

Replace the pseudo-version pin of `go.temporal.io/api` with the
freshly-tagged `v1.62.12`. This is a prerequisite for cutting cloud
release 3.156, which requires all OSS dependencies to be on tagged
versions (runbook step 2).

## Details
- `v1.62.12` tags api-go commit
`0a978d4fd72ccadc7666d7f19aa6df9b335b3133` — the same commit the
previous pseudo-version pinned
(`v1.62.12-0.20260511225354-0a978d4fd72c`).
- No source code changes — the module content is identical (the
`/go.mod` h1 hash didn't change in go.sum, only the version-identifier
hash).
- Tag created via `temporalio/api`'s `create-release.yml` workflow.

## Test plan
- [x] `go mod tidy` clean
- [ ] CI passes
…`callback.allowedAddresses`. (#10234)

## What changed?

Use the new dynamic config instead of leaving it as a placeholder and
update its documentation.

Also bumps the server version constant to `1.32.0`. This was supposed to
be done after the `1.31.0` release and was missed.

## Why?

Remove duplication as we migrate the code to the CHASM backed
implementation.

## Potential risks

This config is only relevant for external endpoints which are
experimental or older server versions, needs to be called out in the
release notes.
## What changed
- Replaced per-key trailer format with a single protobuf
`ContextMetadata` message serialized into `contextmetadata-bin` trailer
key
- gRPC automatically base64-encodes the `-bin` value, making arbitrary
bytes (including HTTP/2-unsafe control chars) transport-safe
- Writer emits both proto format and legacy per-key format for backward
compatibility during rolling deploys
- Reader prefers proto key, falls back to legacy per-key format for old
writers
- Wired `TrailerToContextMetadataInterceptor` in test server to match
production behavior

## Why
Workflow type names containing control characters (newlines, NUL, etc.)
cause the gRPC HTTP/2 framer to reject trailer values. A single proto
message in a `-bin` key is simpler than per-key `-bin` suffixes: one
trailer key, one serialization, no key naming constraints, cleaner
backward compat removal.

## How tested
- Unit tests for proto round-trip, dual-format emission, reader
preference, legacy fallback, HTTP/2 safety
- Integration test suite (TestWorkflowTypeEncodingSuite) with control
chars, UTF-8, long names, -bin suffix workflow types
- All existing tests pass

## Risks
- During rolling deploy, old writers emit only legacy keys. New readers
handle this via fallback path. No data loss.
- After full rollout, legacy key emission can be removed in a follow-up.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes how context metadata is encoded/decoded in gRPC trailers,
which can affect cross-version compatibility and observability of
propagated metadata. Backward-compatible legacy fallback and extensive
unit/integration tests reduce the rollout risk.
> 
> **Overview**
> **Switches context-metadata propagation in gRPC trailers to a single
proto-encoded payload.** Server-side `ContextMetadataInterceptor` now
serializes all context metadata into a new `ContextMetadata` protobuf
and emits it under `contextmetadata-bin`, avoiding HTTP/2-unsafe control
characters in values.
> 
> **Maintains rolling-deploy compatibility.** Writers still emit legacy
per-key trailers (skipping unsafe values), and the client-side
`TrailerToContextMetadataInterceptor` now *prefers* the proto trailer
and falls back to legacy keys (including unprefixed well-known keys)
when needed.
> 
> Adds the new `contextpropagation/v1` proto + generated Go types, plus
unit tests around proto/legacy behavior and an integration suite
(`WorkflowTypeEncodingSuite`) covering control characters, UTF-8, long
names, and `-bin` suffix workflow types.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e1d772f. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## What changed?
Adding support for a fallback/default TLS section in the client config
for remote clusters

## Why?
When adding a new server, a user would have to add the new hostname to
the config. This change allows for new servers to have a default cert,
while still preserving the existing behavior.
Closes #9881 

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: stuart-wells <stuart.well@temporal.io>
## What changed?
Improve chasm Map deserialization/serialization logic

## Why?
chasm.Map should not be added to DeletedNodes if never persisted. When
deserializing, it should always be initialized to avoid excessive nil
checks.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Added missing versionedQueuesLock.RUnlock() in describe() before the
early return when defaultQueue() returns nil. (This can't happen in production.)

## Why?
The function acquires RLock at the top but does not use defer instead
releases the lock manually at each exit point. One early return path was
missing the RUnlock: when buildIds contains "" and the default queue is
not yet initialized, the function returned errDefaultQueueNotInit
without releasing the lock, leaving the mutex permanently locked and
blocking any concurrent writer.

## How did you test it?
- [X] built
## What changed?
Add a dedicated DeleteChasmExecution RPC to the history service for
deleting CHASM executions by namespace, execution key, and archetype ID.
Uses the CHASM engine's DeleteExecution path (terminate-if-running +
async DeleteExecutionTask), replacing the ForceDeleteWorkflowExecution
workaround in the delete namespace activity which bypassed the engine.

Also adds NewComponentRefByArchetypeID to chasm/ref.go to construct a
ComponentRef from a runtime archetype ID without a compile-time type
parameter.

## Why?
Remove dependency on force deletion API.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [X] added new functional test(s)
)

When the matching client returns an error from DispatchNexusTask in
StartOperation and CancelOperation, tag the metrics with
outcome=matching_timeout instead of falling through to the default
internal_error outcome. This lets us distinguish failures originating
from the matching dispatch (typically transient timeouts) from other
internal errors, which are considered more severe.
…10289)

## What changed?
Add `WorkerControllerPerNSWorkerTaskQueue`
("temporal-sys-worker-controller-per-ns-tq") to the set of internal
per-namespace task queues and treat it the same as
`PerNSWorkerTaskQueue` in `IsInternalPerNsTaskQueue`, so user workflows
cannot start workflows, schedule activities, start child workflows,
continue-as-new, or update activity options targeting it.

## Why?
The Worker Controller internal task queue has the same security concerns
as the shared one, so treating it the same seems appropriate.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [X] added new functional test(s)
## What changed?

Pass custom `DCRedirectionPolicy` directly instead of going through
`WithFxOptionsForService`.

## Why?

We want to eliminate `WithFxOptionsForService` as it is blocking us from
migrating away from the `onebox.go` approach (which duplicates the fx
setup) since we don't want to expose an equivalent method in
`temporal/fx.go`.
…uce allocations (#9620)

## Summary

- Cache `WithTags()` child handlers via `sync.Map` to eliminate repeated
`tagsToMap()`, `scope.Tagged()`, and handler struct allocations on the
hot path
- Cache `scope.Tagged()` results per unique inline tag combination in
`cachedTaggedScope()`, bounded to 1024 entries with graceful degradation
- Normalize excluded tags before cache key computation so
high-cardinality excluded values (e.g. `activityType`) share a single
cache entry, preventing unbounded cache growth

## Design

Two complementary caching layers in `tallyMetricsHandler`:

1. **`childCache`** (`sync.Map`): caches entire handler subtrees
returned by `WithTags()`. On cache hit: zero allocations.
2. **`scopeCache`** (`sync.Map` + `atomic.Int64` size bound): caches
`tally.Scope` objects returned by `scope.Tagged()` for inline tags
passed to `Counter`/`Gauge`/`Timer`/`Histogram` `Record()` calls.
Bounded to 1024 entries; beyond that, scopes are created but not cached.

Both caches use `LoadOrStore` for safe concurrent access. Tag
normalization via `normalizeTagsForCaching()` ensures excluded tag
variants collapse to the same cache key. The normalization has a
zero-alloc fast path when no tags need substitution.

## Allocation Reduction (pprof alloc_space, 5min ScyllaDB workload)

### Commit 1: WithTags handler cache
| Metric | Before | After | Reduction |
|--------|--------|-------|-----------|
| WithTags cumulative | 1,930 MB | 316 MB | -83.6% |
| Total server allocs | 18,030 MB | 16,481 MB | -8.6% |

### Commit 2: Scope cache for inline tags
| Metric | Before | After | Reduction |
|--------|--------|-------|-----------|
| tagsToMap.func1 | 1,101 MB | 0 MB | -100% |
| tally Subscope | 1,012 MB | 0 MB | -100% |
| Total server allocs | 18,465 MB | 16,511 MB | -10.6% |

## Benchmark (omes throughput_stress, mc150, 5 min)

Host networking, i7-1270P 4 cores/component, inter-run data resets:

| Database | Baseline | After commit 1 | After commit 2 |
|----------|----------|----------------|----------------|
| Cassandra | 280 | 294 (+5.0%) | 270 (-3.6%) |
| ScyllaDB | 290 | 296 (+2.1%) | 298 (+2.8%) |

Note: Throughput variance at mc150 is ~5-10%. The allocation reduction
is confirmed by pprof but throughput gains are within noise at this
concurrency level.

## Testing

- Unit tests for all 4 metric types (Counter, Gauge, Timer, Histogram)
with inline tags
- Concurrency tests with race detector (32 goroutines × 100 iterations)
- Cache bound enforcement test
- Exclude-tag normalization tests (merge, allowed values, zero-alloc
fast path)
- Independent per-handler scope cache verification
- All existing tests continue to pass

---

## v2 — addressing review feedback

All 6 review comments addressed. Rebased on `origin/main`, squashed into
a single commit.

### Changes

1. **`tagsCacheKey`: Use `strings.Builder` with `Grow` pre-allocation**
— replaced manual `[]byte` construction with `strings.Builder`. A sizing
pass pre-computes the exact capacity via `Grow()` to avoid internal
reallocation (1 alloc/op).

2. **`tagsCacheKey`: Remove single-tag special case, uniform `\x00`
separator** — removed the `len(tags) == 1` branch. Every tag pair now
unconditionally appends `\x00` after both key and value, making the
format uniform and the code simpler.

3. **`normalizeTagsForCaching`: Use `slices.Clone`** — replaced
`make([]Tag, len(tags))` + `copy(tags[:i])` with `slices.Clone(tags)`,
which copies the entire slice upfront. This eliminates the `if
normalized != nil { normalized[i] = t }` guard for unchanged tags after
the clone point.

4. **Extract shared `normalizeTag` function** — the exclude-tag check
was duplicated between `normalizeTagsForCaching` and the `convert`
closure in `tagsToMap`. Extracted `normalizeTag(tag Tag, excl
excludeTags) (Tag, bool)` used by both, removing the duplication.

5. **Bound `childCache` to `scopeCacheMaxSize`** — `childCache` (used by
`WithTags`) was previously unbounded. Applied the same bounding strategy
as `scopeCache`: atomic counter + stop caching beyond 1024 entries.
Added `childCacheSize atomic.Int64` field and
`TestWithTags_BoundedChildCacheSize` test.

### Micro-benchmark results (cached vs uncached)

```
goos: linux, goarch: amd64, cpu: 12th Gen Intel(R) Core(TM) i7-1270P

                                    ns/op     B/op   allocs/op
CounterRecord_Uncached              344       592    3
CounterRecord_CachedScope            93        48    2           ← 3.7x faster, 12x less memory
WithTags_Uncached                   325       592    3
WithTags_CacheHit                    50        16    1           ← 6.5x faster, 37x less memory
TagsCacheKey_SingleTag               29        24    1
TagsCacheKey_ThreeTags               49        64    1
```

### Not changed (deliberate)

- **Cache key is order-sensitive / does not deduplicate keys** — Tally
internally canonicalizes tag maps (sorted keys, rightmost precedence),
so the cache key could theoretically miss on reordered-but-equivalent
tag sets. Verified across 100+ call sites: tag ordering is fully
consistent and duplicate keys never appear in the codebase. Adding
sort+dedup to the hot path would add cost without real-world benefit.
## What changed?
WISOTT

## Why?
Part of our migration to `TestEnv` to speed up tests and reduce flakes.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Tests only
## What changed?
Remove keys with `nil` value from the memo when instantiating CHASM
Visibility.

## Why?
`nil` value is used to remove keys from the memo, and it needs to match
the behavior in workflows.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
## What changed?
Added support for Nexus workflow update completion callbacks via CHASM.
This allows a Nexus caller to be notified when a workflow update
completes by attaching completion callbacks to the update request.

## Why?

Nexus operations that target workflow updates need a way to receive
completion notifications. Without this, a Nexus caller that sends an
update has no async mechanism to learn when the update finishes.
Completion callbacks enable the same async notification pattern that
already exists for workflow-level Nexus operations.

## How did you test it?
- [ ] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
Touches speculative workflow updates, they are always hard to reason
about. Tried to compensate with lots of test coverage.

Note: Needs this API PR
https://github.com/temporalio/api/pull/742/changes

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> Touches workflow update state machine and mutable state event handling
to persist/trigger per-update callbacks, including close/retry/reset
paths, which is complex and can affect correctness of update outcomes
and callback delivery.
> 
> **Overview**
> Adds **workflow update completion callbacks** via CHASM so Nexus
callers can register callbacks on `UpdateWorkflowExecution` and have
them fired on update completion or workflow close.
> 
> This introduces a `WorkflowUpdate` CHASM component with new
`UpdateState` protobuf (including persisted `rejection_failure`), stores
update callbacks under `Workflow.Updates`, and extends callback
processing to handle *update-level* callbacks on update completion,
rejection (including reset/reapply), and on run transitions
(retry/timeout/continue-as-new) where update callbacks must fire even if
workflow-level callbacks are inherited.
> 
> It also adds dynamic config gates/limits
(`EnableWorkflowUpdateCallbacks`, `MaxCallbacksPerUpdateID`), updates
`DescribeWorkflow` to surface update callback triggers, extends mutable
state/history builder APIs to carry per-update callback options in
`WorkflowExecutionOptionsUpdated`, and adds `Update.AttachCallbacks`
logic to persist/flush callbacks (including buffering while `stateSent`,
request-id dedup, and stricter validation requiring `request_id` when
callbacks are present).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
4484fee. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: long-nt-tran <long.tran@temporal.io>
## What changed?

The Temporal worker service is now opt-in for `testcore.NewEnv`. Tests
that need scheduler / batcher / worker-deployment system workflows opt
in explicitly:

```go
env := testcore.NewEnv(t, testcore.WithWorkerService("V1 scheduler"))
```

`FunctionalTestBase`-based tests are unchanged.

## Why?

Profiling a 5-suite parallelsuite selection showed the system worker
service was responsible for ~55% of live memory and ~4s of test wall
time across all three persistence backends.

Measured impact on the selection (sqlite, locally):

| Metric | Before | After |
|---|---|---|
| Test wall (sqlite) | 33.0s | 29.1s |
| Live memory (inuse_space) | 200 MB | 89 MB |
## What changed?

Removed retries in test when Nexus endpoint wasn't found.

Follow-up to #10208

## Why?

Nexus lookups are now strongly consistent.
## What changed?
Handle non workflow chasm clean up logic in replication

## Why?
Support clean up logic for both workflow and chasm

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
`ClockedRateLimiter.SetBurstAt` now clamps `newBurst` to a minimum of 1
when the limiter's rate is positive. `rate=0, burst=0` is still accepted
so callers can fully pause the limiter.

## Why?
A zero or negative burst on an actively rate-limiting limiter stalls it
— `Allow`/`Wait` can never succeed and waiters block indefinitely. The
conditional clamp prevents that while preserving the existing pause
semantic (rate=0, burst=0) used by the matching task queue rate limiter.

## How did you test it?
- [x] built
- [x] covered by existing tests
## What changed?
WISOTT

## Why?
Part of our migration process to testcore.TestEnv for reliability and
speed purposes.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
None, test file changes only
## What changed?
Replace calls to `payload.Encode` with
`sadefs.EncodeValue`/`sadefs.MustEncodeValue`.

## Why?
`payload.Encode` doesn't set the metadata type, which might be useful
when decoding. Besides, encoding search attributes value should've used
`sadefs.EncodeValue` anyway.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
## What changed?
This PR adds `SignalWithStartWorkflowExecution` as a synchronous Nexus
operation exposed via `__temporal_system endpoint`, allowing workflows
to signal-with-start other workflows through the CHASM Nexus operation
framework.

Key changes:
- `chasm/lib/workflow/nexus_service.go`: `workflowServiceNexusHandler`
implements the `SignalWithStartWorkflowExecution` Nexus sync operation
by resolving the namespace and delegating to the History service. A
`SignalWithStartOperationProcessor` handles input enrichment (namespace,
request ID, links) and routing via CHASM's
`NexusOperationProcessorResult`.
- `chasm/lib/workflow/library.go` — library now holds the
`workflowServiceNexusHandler`, the workflow `Config`, the SA mapper
provider, and the SA validator. `newLibrary` (used by fx) takes those
dependencies, while the public `NewLibrary` keeps its old signature for
external callers. Adds `NexusServices()` so the library registers its
Nexus service via CHASM.
- `chasm/lib/workflow/validator.go`: `RequestValidator` consolidates the
`SignalWithStartWorkflowExecution` validation logic (previously inlined
in `WorkflowHandler`) into a reusable, injectable struct. This same
validator is used by both the frontend handler and the new CHASM
processor.
- common/dynamicconfig/constants.go — adds
`EnableSignalWithStartFromWorkflow` (namespace-scoped, default false).
- `service/frontend/workflow_handler.go`: Removed the `SignalWithStart`
validation block
- `service/history/fx.go`: Provides a `HistoryServiceServerProvider` so
the CHASM workflow library can call the history handler directly.
- `temporal/fx.go`: Removes the now-redundant `ChasmLibraryOptions`
grouping; each service module registers its own CHASM libraries.
- `components/nexusoperations/workflow/commands.go`: `NotFound` and
`InvalidArgument` errors during Nexus command handling are now surfaced
as workflow task failures instead of being treated as transient handler
errors.
- `common/payloads`: Adds `EncodeSingle`, `MustEncodeSingle`, and
`MustEncode` helpers used in tests.
- `cmd/tools/getproto`: Adds support for nexus-proto-annotations proto
imports.
- `tests/signal_with_start_from_workflow_test.go`: Functional test suite
covering the happy path, duplicate detection, conflict policies, and
validation rejection for the new Nexus operation.


## Why?
This functionality is one of our most requested GitHub issues.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [X] added new unit test(s)
- [X] added new functional test(s)

## Potential risks
The history service now directly exposes a `HistoryServiceServer`
interface via `fx` for injection into the CHASM workflow library. This
tight coupling between the CHASM workflow library and the history
handler could complicate future layering — callers outside the history
service should not adopt this pattern. The feature is gated by
`history.enableSignalWithStartFromWorkflow` for rollout.

---------

Co-authored-by: Roey Berman <roey.berman@gmail.com>
## What changed?
Remove client cache from RPCFactory.

## Why?
This cache was added to avoid creating multiple connections to the same
host for different services. But this cache does not have a way to
identify when to close a connection and remove it. So it accumulates
connections. This defeats the purpose of having this cache to reduce the
number of connections. It just keep accumulating stale connections
instead. All clients that uses RPCFactory have their own cache. They can
listen to membership changes and cleanup stale connections.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed

Replaced one `WithFxOptionsForService` with an await loop.

## Why

We want to eliminate `WithFxOptionsForService` as it is blocking us from
migrating away from the `onebox.go` approach (which duplicates the fx
setup) since we don't want to expose an equivalent method in
`temporal/fx.go`.
rkannan82 and others added 2 commits May 20, 2026 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.