Dispatch cancel command for standalone activities#10218
Draft
rkannan82 wants to merge 74 commits into
Draft
Conversation
## What changed? Port `WaitForChannel` and `SendToChannel` to TestEnv. ## Why? Safer / simpler.
When a standalone activity's cancellation is requested or it is terminated while running on a worker, proactively dispatch a cancel command via the Nexus worker commands control queue. This avoids relying on the worker to discover cancellation only through heartbeat responses. Changes: - Add worker_control_task_queue field to ActivityAttemptState proto - Store control queue from poll request in TransitionStarted - Add CancelCommandDispatchTask side-effect task - Schedule dispatch task on cancel request and terminate - Dispatch cancel command via Nexus to matching service Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests both cancel-request and terminate paths: start activity with worker control queue, trigger cancellation/termination, verify cancel command arrives on the Nexus control queue with correct task token. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the duplicated cancelCommandDispatchResponseToError with the shared commonnexus.DispatchResponseToError now available from the merged kannan/move-dispatch-response-to-error branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
77e0a54 to
d68c034
Compare
…10219) ## What Change the `DispatchNexusTask` call in `workerCommandsTaskDispatcher` to use `TASK_QUEUE_KIND_WORKER_COMMANDS` instead of `TASK_QUEUE_KIND_NORMAL`. ## Why Worker commands queues use a dedicated partition type (`WorkerCommandsPartition`) with different properties than normal partitions. This new kind was introduced in #9899. ## How did you test it? Unit test: Updated e2e test and added an assertion on the dispatched request's task queue kind. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed? Adds a new package `testing/testcontext` for creating and managing test `context.Context`s. No behavior change expected/intended. ## Why? tl;dr encapsulating the test context behavior and decouple from TestEnv Bigger picture: Right now the test context behavior is coupled to TestEnv; and legacy suites use `testcore.NewContext()`. That served us well so far, but it's suboptimal. A test context should be available outside of TestEnv and `tests/` package; ie unit and integration tests can also benefit it. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s)
…line WFT path (#10217) ## Summary - Added revision number mechanics to strengthen checks while deciding between clearing/setting the LastNotifiedTargetVersion! ## Problem - The inline WFT path in `RespondWorkflowTaskCompleted` (both `bypassTaskGeneration` at line 596 and speculative at line 722) calls `AddWorkflowTaskStartedEvent` with `targetDeploymentVersion=nil` and `targetRevisionNumber=0` because matching is never consulted on this path. - In AddWorkflowTaskStartedEvent, case 3 compared only `buildId + deploymentName`, so a `nil` target fails that case and the default fires — also corrupting `LastNotifiedTargetVersion` with `{nil, 0}`. For update- or signal-driven PINNED workflows (where the updates and the signals were buffered on workflow task completion) this manifested as a continuous CaN loop on a stable deployment. - A separate but related hazard: even on the normal poll path, a stale matching partition that happens to report `target == effective` by buildId would spuriously hit case 3 and clear legitimate notifications, silently defeating the trampolining-suppression mechanism from #9895. ## Test plan - [x] `TestInlinePath_StableRouting_NoSpuriousFlag` — new integration test that exercises the user-reported scenario: - Sends a buffered signal during a regular WFT, completes with `ReturnNewWorkflowTask=true`. - Asserts the inline follow-up WFT's WFT-Started event has `requestId == "request-from-RespondWorkflowTaskCompleted"` (self-verification that the test actually exercises the inline code path). - Asserts `TargetWorkerDeploymentVersionChanged == false` on the inline WFT-Started event. - Confirmed to **fail** before the state machine change (test commit is on top of the fix commit) and **pass** after. - [x] All 8 related existing tests still pass: - `TestStalePartition_RevisionSuppressesTrampolining` (from #9895) - `TestPinnedCaN_NoAUOnCaN_NoInfiniteLoop` - `TestOverride_SuppressesTargetVersionChangedSignal` - `TestAutoUpgrade_SuppressesTargetVersionChangedSignal` - `TestPinnedCaN_TargetChangesAgain_SignalsTrue` - `TestRemoveOverride_ClearsDeclinedState` - `TestRetryOfDeclinedCaN_SignalsOnNewTarget` - `TestPinnedCaN_RollbackResetsDeclined` 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches workflow-task start/versioning decision logic in history service; mistakes could suppress legitimate upgrade notifications or alter trampolining behavior, but changes are localized and covered by a new integration test. > > **Overview** > Fixes a trampolining loop where inline workflow tasks created in `RespondWorkflowTaskCompleted` could incorrectly set `targetWorkerDeploymentVersionChanged` and corrupt `LastNotifiedTargetVersion` despite not consulting matching. > > Inline/eager/synthetic `AddWorkflowTaskStartedEvent` call sites now pass `targetRevisionNumber=-1` as a sentinel, and the workflow-task state machine strengthens its decision logic by tracking the *highest seen* matching revision and refusing to clear/set notification state based on stale (older-revision) reports. > > Adds `TestInlinePath_StableRouting_NoSpuriousFlag` to reproduce the buffered-signal inline-WFT scenario and assert the flag remains false on stable routing. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit baf0f63. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What changed? Added `context.Context` as the first parameter to the `callback.Validator` interface and threaded it through all call sites (`WorkflowHandler`, `activity.FrontendHandler`, and their internal validation/preparation methods). The base validator ignores the context. No behavioral change. ## Why? Deployments that decorate the validator (via `fx.Decorate`) need access to the gRPC context to check caller identity — e.g. to allow internal callers (scheduler) to attach `temporal://internal` callbacks while rejecting them from external users. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
Port error handling pattern from workerCommandsTaskDispatcher: distinguish UpstreamTimeout (no_poller metric), non-retryable handler errors, transport errors, and permanent worker-returned failures. Fix misleading comment on buildCancelCommandTaskToken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test was using s.OverrideDynamicConfig, s.FrontendClient(), s.tv etc. which don't exist on the suite — must use env from newTestEnv(). Also fix gofmt formatting in activity_test.go and library.go. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…10201) ## Summary - Fix `schedule_action_delay` for CHASM schedules: `DesiredTime` is nil for most starts (only set when blocked behind overlap), causing the metric to record ~56 years (now minus epoch). Use `cmp.Or(start.DesiredTime, start.ActualTime)` to fall back to `ActualTime`, matching V1 behavior. - Add `schedule_generate_latency` timer metric to measure the delay between when a scheduled action was due and when the generator buffered it. Only recorded for non-manual (non-backfill) actions.
## Summary - Emit the `EventBlobSize` for `UpdateWorkflowExecution` requess, tagged with `namespace`. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What Deduplicates `CancelOutstandingWorkerPolls` RPCs by destination matching host during `ShutdownWorker`. Uses `Route()` on the matching client to determine which host each partition maps to, then sends only one RPC per unique host instead of one per partition. ## Why With N partitions across H matching hosts (H << N), the current code sends N RPCs per task type when H would suffice — the RPC cancels all pollers for the `workerInstanceKey` on the target host regardless of which partition was used for routing. ## How did you test it? Unit test 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What Clear `StartedClock` on activity retry/pause. To do this, refactored the code that clears per attempt field into a single `ClearActivityStartedState` helper, and updated all code paths. ## Why `StartedClock` is a per-attempt field introduced in #9233 to reconstruct task tokens for cancel worker commands. It was not being cleared when the activity leaves the started state (retry or pause), leaving a stale value during backoff. This can cause cancel commands to be unnecessarily dispatched for activities not currently running on any worker. ## How did you test it? - Unit tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed? Replaces (almost) all use of `s.Eventually` with `s.EventuallyWithT`. ## Why? Assertions are often used inside `s.Eventually` here and that's not safe as it aborts the test immediately.
## What changed? Use a metrics handler without `header_callsite` tag because Prometheus rejects re-registering the same metric with a different label set and logs `error in prometheus reporter ... event_blob_size ... has different label names`. ## Why? Fix found from regression ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Fixes a regression introduced in #10223
## What changed? Added a short protorequire package subsection to docs/development/testing.md documenting protorequire.ProtoEqual and the new protorequire.IgnoreFields option, with a minimal usage example. ## Why? Follow-up to PR #9937. Without a doc entry, the new IgnoreFields helper is undiscoverable and contributors will keep reaching for the verbose cmp.Diff pattern. ## How did you test it? - [X] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What Add `SyncMatchOutcome` enum to the hooks API (NotMatched, Success, RateLimited) and plumb rate limiting signal from the matcher through to hooks. Keep `IsSyncMatch` as deprecated for backwards compatibility. ## Why Hook consumers (e.g. scaling operators) need to distinguish rate limiting from genuine lack of pollers when deciding whether to scale up workers. ## How did you test it? Unit tests — rate-limited and non-rate-limited scenarios, multiple hooks invocation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed? Log message for nexus operation cancellation invocation ## Why? Makes the log distinct from errors during operation invocation. ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? Reverting dropping tasks when feature flag disabled -> returning error Interface change for active or not based on business id ## Why? regression ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? Adds TestPGXSimpleProtocol — a new entry in the persistence integration suite that runs `PostgreSQLSuite` under the `postgres12_pgx` plugin with `default_query_exec_mode=simple_protocol`. ## Why? Regression coverage for issues like [#9804](#9804). With pgx ≤ v5.9.1, current_executions.state/status (proto-enum-typed int32 fields with a String() method) were text-encoded via fmt.Stringer and rejected by `Postgres` on simple/exec protocol, the path users land on behind PgBouncer in transaction pooling. pgx v5.9.2 fixed it upstream; this test makes sure we notice if pgx is ever downgraded or if a similar issue sneaks in. ## How did you test it? - [X] built - [X] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [X] added new functional test(s) Verified locally: passes on pgx v5.9.2; fails reproducibly on v5.9.1 with invalid input syntax for type integer: "Created" matching [#9804](#9804). ## Potential risks Adds one additional pass through PostgreSQLSuite to the Integration test job. Job timeout is 15 min, so should be fine, but worth observing.
## What changed? Add flag to skip setting up ES cluster settings. Eg: `temporal-elasticsearch-tool setup-schema --skip-cluster-settings` ## Why? Allow users to not use our provided cluster settings (eg: avoid overwriting their cluster settings). #9857 ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks
## What changed? Revert `OperatorRateBurstImpl.Burst()` to `baseRateBurstFn.Burst()`. ## Why? We don't have to reduce the burst value for operator priority ## How did you test it? - [x] built - [x] covered by existing tests
## What changed? Added validation of the user metadata in the `StartNexusOperationExecutionRequest`. ## Why? All other fields are validated. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s)
## What changed? Introduces shared constants for default long poll timeout/buffer and applies them to SAA and SANO. ## Why? After a long internal technical discussion, 60s was determined as the default for the timeout.
## What changed? Add a Scheduler specific query converter to handle `ScheduleId` search attribute alias to underlying `WorkflowId` system search attribute field. If chasm is enabled, the query converter will match against both V1 and V2 Scheduler workflowId formats. V1 prefixes the workflowId with `temporal-sys-scheduler`, while V2 doesn't, so to handle either case, we need to add a OR/AND statement to match both V1 and V2 Scheduler BusinessId. OR will be used if the operator's boolean is positive, AND if the operator's boolean is negative. ## Why? Unable to retrieve V2 Schedules that query using `ScheduleId` as a search attribute. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s)
## Summary Replace the pseudo-version pin of `go.temporal.io/api` with the freshly-tagged `v1.62.12`. This is a prerequisite for cutting cloud release 3.156, which requires all OSS dependencies to be on tagged versions (runbook step 2). ## Details - `v1.62.12` tags api-go commit `0a978d4fd72ccadc7666d7f19aa6df9b335b3133` — the same commit the previous pseudo-version pinned (`v1.62.12-0.20260511225354-0a978d4fd72c`). - No source code changes — the module content is identical (the `/go.mod` h1 hash didn't change in go.sum, only the version-identifier hash). - Tag created via `temporalio/api`'s `create-release.yml` workflow. ## Test plan - [x] `go mod tidy` clean - [ ] CI passes
…`callback.allowedAddresses`. (#10234) ## What changed? Use the new dynamic config instead of leaving it as a placeholder and update its documentation. Also bumps the server version constant to `1.32.0`. This was supposed to be done after the `1.31.0` release and was missed. ## Why? Remove duplication as we migrate the code to the CHASM backed implementation. ## Potential risks This config is only relevant for external endpoints which are experimental or older server versions, needs to be called out in the release notes.
## What changed - Replaced per-key trailer format with a single protobuf `ContextMetadata` message serialized into `contextmetadata-bin` trailer key - gRPC automatically base64-encodes the `-bin` value, making arbitrary bytes (including HTTP/2-unsafe control chars) transport-safe - Writer emits both proto format and legacy per-key format for backward compatibility during rolling deploys - Reader prefers proto key, falls back to legacy per-key format for old writers - Wired `TrailerToContextMetadataInterceptor` in test server to match production behavior ## Why Workflow type names containing control characters (newlines, NUL, etc.) cause the gRPC HTTP/2 framer to reject trailer values. A single proto message in a `-bin` key is simpler than per-key `-bin` suffixes: one trailer key, one serialization, no key naming constraints, cleaner backward compat removal. ## How tested - Unit tests for proto round-trip, dual-format emission, reader preference, legacy fallback, HTTP/2 safety - Integration test suite (TestWorkflowTypeEncodingSuite) with control chars, UTF-8, long names, -bin suffix workflow types - All existing tests pass ## Risks - During rolling deploy, old writers emit only legacy keys. New readers handle this via fallback path. No data loss. - After full rollout, legacy key emission can be removed in a follow-up. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes how context metadata is encoded/decoded in gRPC trailers, which can affect cross-version compatibility and observability of propagated metadata. Backward-compatible legacy fallback and extensive unit/integration tests reduce the rollout risk. > > **Overview** > **Switches context-metadata propagation in gRPC trailers to a single proto-encoded payload.** Server-side `ContextMetadataInterceptor` now serializes all context metadata into a new `ContextMetadata` protobuf and emits it under `contextmetadata-bin`, avoiding HTTP/2-unsafe control characters in values. > > **Maintains rolling-deploy compatibility.** Writers still emit legacy per-key trailers (skipping unsafe values), and the client-side `TrailerToContextMetadataInterceptor` now *prefers* the proto trailer and falls back to legacy keys (including unprefixed well-known keys) when needed. > > Adds the new `contextpropagation/v1` proto + generated Go types, plus unit tests around proto/legacy behavior and an integration suite (`WorkflowTypeEncodingSuite`) covering control characters, UTF-8, long names, and `-bin` suffix workflow types. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit e1d772f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## What changed? Adding support for a fallback/default TLS section in the client config for remote clusters ## Why? When adding a new server, a user would have to add the new hostname to the config. This change allows for new servers to have a default cert, while still preserving the existing behavior. Closes #9881 ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) --------- Co-authored-by: stuart-wells <stuart.well@temporal.io>
## What changed? Improve chasm Map deserialization/serialization logic ## Why? chasm.Map should not be added to DeletedNodes if never persisted. When deserializing, it should always be initialized to avoid excessive nil checks. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s)
## What changed? Added missing versionedQueuesLock.RUnlock() in describe() before the early return when defaultQueue() returns nil. (This can't happen in production.) ## Why? The function acquires RLock at the top but does not use defer instead releases the lock manually at each exit point. One early return path was missing the RUnlock: when buildIds contains "" and the default queue is not yet initialized, the function returned errDefaultQueueNotInit without releasing the lock, leaving the mutex permanently locked and blocking any concurrent writer. ## How did you test it? - [X] built
## What changed? Add a dedicated DeleteChasmExecution RPC to the history service for deleting CHASM executions by namespace, execution key, and archetype ID. Uses the CHASM engine's DeleteExecution path (terminate-if-running + async DeleteExecutionTask), replacing the ForceDeleteWorkflowExecution workaround in the delete namespace activity which bypassed the engine. Also adds NewComponentRefByArchetypeID to chasm/ref.go to construct a ComponentRef from a runtime archetype ID without a compile-time type parameter. ## Why? Remove dependency on force deletion API. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s)
) When the matching client returns an error from DispatchNexusTask in StartOperation and CancelOperation, tag the metrics with outcome=matching_timeout instead of falling through to the default internal_error outcome. This lets us distinguish failures originating from the matching dispatch (typically transient timeouts) from other internal errors, which are considered more severe.
…10289) ## What changed? Add `WorkerControllerPerNSWorkerTaskQueue` ("temporal-sys-worker-controller-per-ns-tq") to the set of internal per-namespace task queues and treat it the same as `PerNSWorkerTaskQueue` in `IsInternalPerNsTaskQueue`, so user workflows cannot start workflows, schedule activities, start child workflows, continue-as-new, or update activity options targeting it. ## Why? The Worker Controller internal task queue has the same security concerns as the shared one, so treating it the same seems appropriate. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s)
## What changed? Pass custom `DCRedirectionPolicy` directly instead of going through `WithFxOptionsForService`. ## Why? We want to eliminate `WithFxOptionsForService` as it is blocking us from migrating away from the `onebox.go` approach (which duplicates the fx setup) since we don't want to expose an equivalent method in `temporal/fx.go`.
…uce allocations (#9620) ## Summary - Cache `WithTags()` child handlers via `sync.Map` to eliminate repeated `tagsToMap()`, `scope.Tagged()`, and handler struct allocations on the hot path - Cache `scope.Tagged()` results per unique inline tag combination in `cachedTaggedScope()`, bounded to 1024 entries with graceful degradation - Normalize excluded tags before cache key computation so high-cardinality excluded values (e.g. `activityType`) share a single cache entry, preventing unbounded cache growth ## Design Two complementary caching layers in `tallyMetricsHandler`: 1. **`childCache`** (`sync.Map`): caches entire handler subtrees returned by `WithTags()`. On cache hit: zero allocations. 2. **`scopeCache`** (`sync.Map` + `atomic.Int64` size bound): caches `tally.Scope` objects returned by `scope.Tagged()` for inline tags passed to `Counter`/`Gauge`/`Timer`/`Histogram` `Record()` calls. Bounded to 1024 entries; beyond that, scopes are created but not cached. Both caches use `LoadOrStore` for safe concurrent access. Tag normalization via `normalizeTagsForCaching()` ensures excluded tag variants collapse to the same cache key. The normalization has a zero-alloc fast path when no tags need substitution. ## Allocation Reduction (pprof alloc_space, 5min ScyllaDB workload) ### Commit 1: WithTags handler cache | Metric | Before | After | Reduction | |--------|--------|-------|-----------| | WithTags cumulative | 1,930 MB | 316 MB | -83.6% | | Total server allocs | 18,030 MB | 16,481 MB | -8.6% | ### Commit 2: Scope cache for inline tags | Metric | Before | After | Reduction | |--------|--------|-------|-----------| | tagsToMap.func1 | 1,101 MB | 0 MB | -100% | | tally Subscope | 1,012 MB | 0 MB | -100% | | Total server allocs | 18,465 MB | 16,511 MB | -10.6% | ## Benchmark (omes throughput_stress, mc150, 5 min) Host networking, i7-1270P 4 cores/component, inter-run data resets: | Database | Baseline | After commit 1 | After commit 2 | |----------|----------|----------------|----------------| | Cassandra | 280 | 294 (+5.0%) | 270 (-3.6%) | | ScyllaDB | 290 | 296 (+2.1%) | 298 (+2.8%) | Note: Throughput variance at mc150 is ~5-10%. The allocation reduction is confirmed by pprof but throughput gains are within noise at this concurrency level. ## Testing - Unit tests for all 4 metric types (Counter, Gauge, Timer, Histogram) with inline tags - Concurrency tests with race detector (32 goroutines × 100 iterations) - Cache bound enforcement test - Exclude-tag normalization tests (merge, allowed values, zero-alloc fast path) - Independent per-handler scope cache verification - All existing tests continue to pass --- ## v2 — addressing review feedback All 6 review comments addressed. Rebased on `origin/main`, squashed into a single commit. ### Changes 1. **`tagsCacheKey`: Use `strings.Builder` with `Grow` pre-allocation** — replaced manual `[]byte` construction with `strings.Builder`. A sizing pass pre-computes the exact capacity via `Grow()` to avoid internal reallocation (1 alloc/op). 2. **`tagsCacheKey`: Remove single-tag special case, uniform `\x00` separator** — removed the `len(tags) == 1` branch. Every tag pair now unconditionally appends `\x00` after both key and value, making the format uniform and the code simpler. 3. **`normalizeTagsForCaching`: Use `slices.Clone`** — replaced `make([]Tag, len(tags))` + `copy(tags[:i])` with `slices.Clone(tags)`, which copies the entire slice upfront. This eliminates the `if normalized != nil { normalized[i] = t }` guard for unchanged tags after the clone point. 4. **Extract shared `normalizeTag` function** — the exclude-tag check was duplicated between `normalizeTagsForCaching` and the `convert` closure in `tagsToMap`. Extracted `normalizeTag(tag Tag, excl excludeTags) (Tag, bool)` used by both, removing the duplication. 5. **Bound `childCache` to `scopeCacheMaxSize`** — `childCache` (used by `WithTags`) was previously unbounded. Applied the same bounding strategy as `scopeCache`: atomic counter + stop caching beyond 1024 entries. Added `childCacheSize atomic.Int64` field and `TestWithTags_BoundedChildCacheSize` test. ### Micro-benchmark results (cached vs uncached) ``` goos: linux, goarch: amd64, cpu: 12th Gen Intel(R) Core(TM) i7-1270P ns/op B/op allocs/op CounterRecord_Uncached 344 592 3 CounterRecord_CachedScope 93 48 2 ← 3.7x faster, 12x less memory WithTags_Uncached 325 592 3 WithTags_CacheHit 50 16 1 ← 6.5x faster, 37x less memory TagsCacheKey_SingleTag 29 24 1 TagsCacheKey_ThreeTags 49 64 1 ``` ### Not changed (deliberate) - **Cache key is order-sensitive / does not deduplicate keys** — Tally internally canonicalizes tag maps (sorted keys, rightmost precedence), so the cache key could theoretically miss on reordered-but-equivalent tag sets. Verified across 100+ call sites: tag ordering is fully consistent and duplicate keys never appear in the codebase. Adding sort+dedup to the hot path would add cost without real-world benefit.
## What changed? WISOTT ## Why? Part of our migration to `TestEnv` to speed up tests and reduce flakes. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Tests only
## What changed? Remove keys with `nil` value from the memo when instantiating CHASM Visibility. ## Why? `nil` value is used to remove keys from the memo, and it needs to match the behavior in workflows. ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks
## What changed? Added support for Nexus workflow update completion callbacks via CHASM. This allows a Nexus caller to be notified when a workflow update completes by attaching completion callbacks to the update request. ## Why? Nexus operations that target workflow updates need a way to receive completion notifications. Without this, a Nexus caller that sends an update has no async mechanism to learn when the update finishes. Completion callbacks enable the same async notification pattern that already exists for workflow-level Nexus operations. ## How did you test it? - [ ] built - [x] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) ## Potential risks Touches speculative workflow updates, they are always hard to reason about. Tried to compensate with lots of test coverage. Note: Needs this API PR https://github.com/temporalio/api/pull/742/changes <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **High Risk** > Touches workflow update state machine and mutable state event handling to persist/trigger per-update callbacks, including close/retry/reset paths, which is complex and can affect correctness of update outcomes and callback delivery. > > **Overview** > Adds **workflow update completion callbacks** via CHASM so Nexus callers can register callbacks on `UpdateWorkflowExecution` and have them fired on update completion or workflow close. > > This introduces a `WorkflowUpdate` CHASM component with new `UpdateState` protobuf (including persisted `rejection_failure`), stores update callbacks under `Workflow.Updates`, and extends callback processing to handle *update-level* callbacks on update completion, rejection (including reset/reapply), and on run transitions (retry/timeout/continue-as-new) where update callbacks must fire even if workflow-level callbacks are inherited. > > It also adds dynamic config gates/limits (`EnableWorkflowUpdateCallbacks`, `MaxCallbacksPerUpdateID`), updates `DescribeWorkflow` to surface update callback triggers, extends mutable state/history builder APIs to carry per-update callback options in `WorkflowExecutionOptionsUpdated`, and adds `Update.AttachCallbacks` logic to persist/flush callbacks (including buffering while `stateSent`, request-id dedup, and stricter validation requiring `request_id` when callbacks are present). > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 4484fee. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: long-nt-tran <long.tran@temporal.io>
## What changed?
The Temporal worker service is now opt-in for `testcore.NewEnv`. Tests
that need scheduler / batcher / worker-deployment system workflows opt
in explicitly:
```go
env := testcore.NewEnv(t, testcore.WithWorkerService("V1 scheduler"))
```
`FunctionalTestBase`-based tests are unchanged.
## Why?
Profiling a 5-suite parallelsuite selection showed the system worker
service was responsible for ~55% of live memory and ~4s of test wall
time across all three persistence backends.
Measured impact on the selection (sqlite, locally):
| Metric | Before | After |
|---|---|---|
| Test wall (sqlite) | 33.0s | 29.1s |
| Live memory (inuse_space) | 200 MB | 89 MB |
## What changed? Removed retries in test when Nexus endpoint wasn't found. Follow-up to #10208 ## Why? Nexus lookups are now strongly consistent.
## What changed? Handle non workflow chasm clean up logic in replication ## Why? Support clean up logic for both workflow and chasm ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? `ClockedRateLimiter.SetBurstAt` now clamps `newBurst` to a minimum of 1 when the limiter's rate is positive. `rate=0, burst=0` is still accepted so callers can fully pause the limiter. ## Why? A zero or negative burst on an actively rate-limiting limiter stalls it — `Allow`/`Wait` can never succeed and waiters block indefinitely. The conditional clamp prevents that while preserving the existing pause semantic (rate=0, burst=0) used by the matching task queue rate limiter. ## How did you test it? - [x] built - [x] covered by existing tests
## What changed? WISOTT ## Why? Part of our migration process to testcore.TestEnv for reliability and speed purposes. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks None, test file changes only
## What changed? Replace calls to `payload.Encode` with `sadefs.EncodeValue`/`sadefs.MustEncodeValue`. ## Why? `payload.Encode` doesn't set the metadata type, which might be useful when decoding. Besides, encoding search attributes value should've used `sadefs.EncodeValue` anyway. ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks
## What changed? This PR adds `SignalWithStartWorkflowExecution` as a synchronous Nexus operation exposed via `__temporal_system endpoint`, allowing workflows to signal-with-start other workflows through the CHASM Nexus operation framework. Key changes: - `chasm/lib/workflow/nexus_service.go`: `workflowServiceNexusHandler` implements the `SignalWithStartWorkflowExecution` Nexus sync operation by resolving the namespace and delegating to the History service. A `SignalWithStartOperationProcessor` handles input enrichment (namespace, request ID, links) and routing via CHASM's `NexusOperationProcessorResult`. - `chasm/lib/workflow/library.go` — library now holds the `workflowServiceNexusHandler`, the workflow `Config`, the SA mapper provider, and the SA validator. `newLibrary` (used by fx) takes those dependencies, while the public `NewLibrary` keeps its old signature for external callers. Adds `NexusServices()` so the library registers its Nexus service via CHASM. - `chasm/lib/workflow/validator.go`: `RequestValidator` consolidates the `SignalWithStartWorkflowExecution` validation logic (previously inlined in `WorkflowHandler`) into a reusable, injectable struct. This same validator is used by both the frontend handler and the new CHASM processor. - common/dynamicconfig/constants.go — adds `EnableSignalWithStartFromWorkflow` (namespace-scoped, default false). - `service/frontend/workflow_handler.go`: Removed the `SignalWithStart` validation block - `service/history/fx.go`: Provides a `HistoryServiceServerProvider` so the CHASM workflow library can call the history handler directly. - `temporal/fx.go`: Removes the now-redundant `ChasmLibraryOptions` grouping; each service module registers its own CHASM libraries. - `components/nexusoperations/workflow/commands.go`: `NotFound` and `InvalidArgument` errors during Nexus command handling are now surfaced as workflow task failures instead of being treated as transient handler errors. - `common/payloads`: Adds `EncodeSingle`, `MustEncodeSingle`, and `MustEncode` helpers used in tests. - `cmd/tools/getproto`: Adds support for nexus-proto-annotations proto imports. - `tests/signal_with_start_from_workflow_test.go`: Functional test suite covering the happy path, duplicate detection, conflict policies, and validation rejection for the new Nexus operation. ## Why? This functionality is one of our most requested GitHub issues. ## How did you test it? - [X] built - [X] run locally and tested manually - [ ] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s) ## Potential risks The history service now directly exposes a `HistoryServiceServer` interface via `fx` for injection into the CHASM workflow library. This tight coupling between the CHASM workflow library and the history handler could complicate future layering — callers outside the history service should not adopt this pattern. The feature is gated by `history.enableSignalWithStartFromWorkflow` for rollout. --------- Co-authored-by: Roey Berman <roey.berman@gmail.com>
## What changed? Remove client cache from RPCFactory. ## Why? This cache was added to avoid creating multiple connections to the same host for different services. But this cache does not have a way to identify when to close a connection and remove it. So it accumulates connections. This defeats the purpose of having this cache to reduce the number of connections. It just keep accumulating stale connections instead. All clients that uses RPCFactory have their own cache. They can listen to membership changes and cleanup stale connections. ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed Replaced one `WithFxOptionsForService` with an await loop. ## Why We want to eliminate `WithFxOptionsForService` as it is blocking us from migrating away from the `onebox.go` approach (which duplicates the fx setup) since we don't want to expose an equivalent method in `temporal/fx.go`.
…ivity-cancel-commands
…equireTrue Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
When a standalone (CHASM) activity is cancel-requested or terminated while running on a worker, dispatch a cancel command via the Nexus worker commands control queue — same mechanism added for workflow-based activities in #9233 and #10047.
Why
Without this, standalone activities only discover cancellation passively via heartbeat or completion. If heartbeat isn't enabled, the activity runs to completion uselessly.
How did you test it?
Unit tests (chasm/lib/activity):
Functional tests (tests/standalone_activity_test.go):
TestDispatchCancelCommandToWorker/CancelRequest: start activity with control queue, cancel, verify cancel command arrives on Nexus control queue with correct task tokenTestDispatchCancelCommandToWorker/Terminate: same flow but with terminate🤖 Generated with Claude Code