Record batch machine command timings#1236
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the AKS Machine scale-out path to capture per-batch BatchPutMachine timing metrics and emit them as operation metadata, with accompanying test updates to validate metric capture and metadata emission.
Changes:
- Record per-batch start/end timestamps and execution duration in
_create_batch_machines. - Emit
batch_command_execution_timesasscale_machineoperation metadata whenuse_batch_api=True. - Update machine-client tests to cover timing metric capture and metadata emission.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| modules/python/clients/aks_machine_client.py | Captures per-batch timing metrics and emits them as scale_machine operation metadata. |
| modules/python/tests/test_aks_machine_client.py | Extends tests to assert timing metrics are captured and metadata is emitted for the batch path. |
Comments suppressed due to low confidence (1)
modules/python/clients/aks_machine_client.py:921
execution_time_secondsis derived fromdatetime.now()deltas, which can go negative or be skewed if the system clock adjusts (e.g., NTP). For durations, use a monotonic timer (time.perf_counter()/time.monotonic()) and keepdatetimeonly for the wall-clock timestamps.
start_time = datetime.now(timezone.utc)
self._make_batch_request(
"PUT",
url,
body,
Code reviewNo issues found. Checked for bugs and AGENTS.md compliance. |
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
This reverts commit a21a0d1.
| execution_time_seconds = (end_time - start_time).total_seconds() | ||
| request.batch_command_execution_times[first_machine_name] = { | ||
| "start_time": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"), | ||
| "end_time": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"), | ||
| "execution_time_seconds": execution_time_seconds, | ||
| "total_machines_in_batch": len(chunk), | ||
| } |
There was a problem hiding this comment.
Should record batch timing even when the BatchPutMachine request fails.
Right now the metric is written only after _make_batch_request() returns successfully. If it raises after retries / timeout / non-2xx, _scale_machine_batch() catches that worker failure and returns [], so the operation only reports landed X/Y machines and the failed batch has no timing entry. That makes the new metric biased toward successful batches and hides the slow/failing batch we most need for debugging.
Could we wrap the request in try/finally so every batch records elapsed time, and include a small status/error field?
There was a problem hiding this comment.
The batch PutMachine command execution time is the cx-observed latency in frontend. If this step fails, we would be more interested in the error than the latency and will be manually looking into internal logs for them.
Summary
Tests