feat(file_state): guard cross-agent file edits from stale sibling writes#3316
Conversation
…tes (tinyhumansai#3253) Introduce a process-wide FileStateCoordinator that tracks per-agent read stamps and per-path write stamps across parallel subagents. Write tools (file_write, edit, apply_patch) now detect when another agent modified a file after this agent's last read and return a model-facing error requiring a re-read. Partial-read protection prevents overwrites after paginated reads. - file_state/types.rs — ReadStamp, WriteStamp, FileStateCoordinator - file_state/ops.rs — record_read/write, check_stale/partial, path locks - file_state/agent_context.rs — task-local agent ID for tool attribution - file_read records read stamps after successful reads - file_write/edit/apply_patch check staleness + partial before writing - edit/apply_patch acquire per-path async locks for read-modify-write - spawn_parallel_agents annotates results with stale_parent_reads - OPENHUMAN_FILE_STATE_GUARD=0 disables the guard entirely
📝 WalkthroughWalkthroughThis PR introduces a process-wide file-state coordinator that tracks per-agent reads and writes to detect when sibling agents modify files. It guards filesystem operations against stale overwrites, integrates into the agent execution flow, and reports to parent agents which files their children touched. ChangesFile-state guard system
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
src/openhuman/file_state/ops.rs (1)
99-134: ⚡ Quick winLog the rejection branches.
These are the core decision points for the new guard, but they currently return silently. Add stable
trace/debuglogs with correlation fields likeagent,path,writer, and the chosen decision so stale-write reports can be reconstructed from logs.As per coding guidelines "Default to verbose diagnostics on new/changed flows; log entry/exit, branches, external calls, retries/timeouts, state transitions, errors using stable grep-friendly prefixes and correlation fields; use
log/tracingatdebug/tracelevel; never log secrets or full PII".Also applies to: 157-176
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/file_state/ops.rs` around lines 99 - 134, Add structured debug/trace logging to the rejection branches of the read-guard functions so decisions can be reconstructed: in check_stale_read and check_partial_read (and the similar branch around lines 157-176) emit a stable, grep-friendly log when returning Some(...) that includes correlation fields agent=agent_id, path=resolved_path.display(), decision="stale_read" or "partial_read", and writer=ws.writer (only for stale_read). Use the tracing/log crate at debug/trace level with a consistent prefix (e.g. "guard:decision") and avoid logging sensitive data; place the log immediately before the Some(...) return so it records the exact branch taken.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/bus.rs`:
- Around line 260-264: The current file_state_id built in the file_state_id
variable ("bus:{channel_name}:{target_agent_id}") is not unique per turn; change
it to include a per-turn/session discriminator (for example a request/turn id
passed into this context, or generate a short UUID/timestamp once per request)
so each run gets its own identity. Update the construction of file_state_id to
append that discriminator (e.g.
"bus:{channel_name}:{target_agent_id}:{turn_id}") and ensure the discriminator
is created once per request/run (not per write) so other logic that checks
same-agent writes will no longer collapse concurrent turns; make this change
where file_state_id is formed and propagate the new turn/request id into this
scope if necessary.
In `@src/openhuman/file_state/ops.rs`:
- Around line 15-21: The is_disabled function uses OPENHUMAN_FILE_STATE_GUARD
but compares the env value case-sensitively, so values like "FALSE" or "Off"
don't disable the guard; update the closure inside the static
DISABLED.get_or_init to perform a case-insensitive comparison (e.g., call
v.eq_ignore_ascii_case(...) or convert v to_ascii_lowercase()) when matching
against "0", "false", "off", "no" so all capitalizations correctly disable the
guard; the change should be applied within the is_disabled function and touches
the DISABLED initialization logic only.
In `@src/openhuman/file_state/types.rs`:
- Around line 12-18: The current staleness logic relies on ReadStamp.timestamp:
Instant which can be equal across nearby events; change the ordering to use a
process-local strictly increasing sequence number instead: add a u64 sequence
field to ReadStamp (and any write stamp structures) and initialize/increment it
using a global AtomicU64.fetch_add(1, Ordering::SeqCst) on every read and write;
update all places that set timestamp (reads/writes) to also set sequence, and
change comparisons in check_stale_read, stale_reads_for_parent, and
parent_stale_files to compare sequence (or sequence as a tie-breaker with
Instant) so ordering is strict and unambiguous.
In `@src/openhuman/tools/impl/filesystem/apply_patch.rs`:
- Around line 147-169: The current lock acquisition uses caller-supplied edit
order (built into unique_paths) which can deadlock; make the order deterministic
by deduplicating and sorting paths before locking: replace the HashSet-based
unique_paths construction with a deterministic collection (e.g., a BTreeSet or
collect into a Vec then sort and dedup) derived from parsed entries’ e.path
values, then iterate that sorted list when resolving with
self.security.action_dir, tokio::fs::canonicalize, and calling
file_state::acquire_path_lock so _path_guards are always acquired in the same
global order.
In `@src/openhuman/tools/impl/filesystem/file_read.rs`:
- Around line 94-103: The read recording currently performs a separate await via
tokio::fs::metadata(...).await after read_to_string, creating a race; fix by
reusing the metadata you already obtained for the size check (or capture
metadata before the read) and call file_state::record_read(&agent_id,
resolved_path, <reuse_mtime>, false) immediately after the read completes
without performing another await. Locate the Ok(contents) branch in file_read.rs
where file_state::current_file_state_agent_id() and file_state::record_read(...)
are used and replace the extra tokio::fs::metadata(...).await call with the
previously fetched metadata's modified() time (or the metadata captured before
the read) so record_read uses that mtime synchronously in that branch.
In `@src/openhuman/tools/impl/filesystem/file_write.rs`:
- Around line 133-162: The checks and write in file_write are not serialized:
wrap the stale/partial checks, the tokio::fs::write call, and the
file_state::record_write call in a per-path lock by calling
acquire_path_lock(&resolved_target) (or its guard-returning form) so the
sequence check_stale_read(&agent_id, &resolved_target) /
check_partial_read(&agent_id, &resolved_target) /
tokio::fs::write(&resolved_target, content).await /
file_state::record_write(&agent_id, resolved_target) executes atomically under
that lock; ensure you still obtain current_file_state_agent_id() inside the
locked section and release the lock after record_write completes.
---
Nitpick comments:
In `@src/openhuman/file_state/ops.rs`:
- Around line 99-134: Add structured debug/trace logging to the rejection
branches of the read-guard functions so decisions can be reconstructed: in
check_stale_read and check_partial_read (and the similar branch around lines
157-176) emit a stable, grep-friendly log when returning Some(...) that includes
correlation fields agent=agent_id, path=resolved_path.display(),
decision="stale_read" or "partial_read", and writer=ws.writer (only for
stale_read). Use the tracing/log crate at debug/trace level with a consistent
prefix (e.g. "guard:decision") and avoid logging sensitive data; place the log
immediately before the Some(...) return so it records the exact branch taken.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9e8322c7-be12-4334-a0b4-4e3f020e110f
📒 Files selected for processing (14)
.env.examplesrc/core/jsonrpc.rssrc/openhuman/agent/bus.rssrc/openhuman/agent/harness/subagent_runner/ops.rssrc/openhuman/agent_orchestration/tools/spawn_parallel_agents.rssrc/openhuman/file_state/agent_context.rssrc/openhuman/file_state/mod.rssrc/openhuman/file_state/ops.rssrc/openhuman/file_state/types.rssrc/openhuman/mod.rssrc/openhuman/tools/impl/filesystem/apply_patch.rssrc/openhuman/tools/impl/filesystem/edit_file.rssrc/openhuman/tools/impl/filesystem/file_read.rssrc/openhuman/tools/impl/filesystem/file_write.rs
| let file_state_id = format!( | ||
| "bus:{}:{}", | ||
| channel_name, | ||
| target_agent_id.as_deref().unwrap_or("root") | ||
| ); |
There was a problem hiding this comment.
Make the bus file-state ID unique per turn.
"bus:{channel}:{target_agent}" collapses every concurrent turn on the same channel/agent into one logical writer. Since stale-read detection ignores same-agent writes, overlapping root turns can overwrite each other without tripping the new guard. Include a per-turn/session discriminator here (or generate one once per request) so each run gets its own file-state identity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/agent/bus.rs` around lines 260 - 264, The current file_state_id
built in the file_state_id variable ("bus:{channel_name}:{target_agent_id}") is
not unique per turn; change it to include a per-turn/session discriminator (for
example a request/turn id passed into this context, or generate a short
UUID/timestamp once per request) so each run gets its own identity. Update the
construction of file_state_id to append that discriminator (e.g.
"bus:{channel_name}:{target_agent_id}:{turn_id}") and ensure the discriminator
is created once per request/run (not per write) so other logic that checks
same-agent writes will no longer collapse concurrent turns; make this change
where file_state_id is formed and propagate the new turn/request id into this
scope if necessary.
| fn is_disabled() -> bool { | ||
| static DISABLED: OnceLock<bool> = OnceLock::new(); | ||
| *DISABLED.get_or_init(|| { | ||
| std::env::var("OPENHUMAN_FILE_STATE_GUARD") | ||
| .map(|v| matches!(v.as_str(), "0" | "false" | "off" | "no")) | ||
| .unwrap_or(false) | ||
| }) |
There was a problem hiding this comment.
Parse the disable flag case-insensitively.
OPENHUMAN_FILE_STATE_GUARD=FALSE or Off currently leaves the guard enabled because Line 19 only matches exact lowercase strings. For a debug/emergency kill switch, that silent miss is risky.
Suggested fix
*DISABLED.get_or_init(|| {
std::env::var("OPENHUMAN_FILE_STATE_GUARD")
- .map(|v| matches!(v.as_str(), "0" | "false" | "off" | "no"))
+ .map(|v| {
+ let v = v.trim();
+ v == "0"
+ || v.eq_ignore_ascii_case("false")
+ || v.eq_ignore_ascii_case("off")
+ || v.eq_ignore_ascii_case("no")
+ })
.unwrap_or(false)
})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| fn is_disabled() -> bool { | |
| static DISABLED: OnceLock<bool> = OnceLock::new(); | |
| *DISABLED.get_or_init(|| { | |
| std::env::var("OPENHUMAN_FILE_STATE_GUARD") | |
| .map(|v| matches!(v.as_str(), "0" | "false" | "off" | "no")) | |
| .unwrap_or(false) | |
| }) | |
| fn is_disabled() -> bool { | |
| static DISABLED: OnceLock<bool> = OnceLock::new(); | |
| *DISABLED.get_or_init(|| { | |
| std::env::var("OPENHUMAN_FILE_STATE_GUARD") | |
| .map(|v| { | |
| let v = v.trim(); | |
| v == "0" | |
| || v.eq_ignore_ascii_case("false") | |
| || v.eq_ignore_ascii_case("off") | |
| || v.eq_ignore_ascii_case("no") | |
| }) | |
| .unwrap_or(false) | |
| }) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/file_state/ops.rs` around lines 15 - 21, The is_disabled
function uses OPENHUMAN_FILE_STATE_GUARD but compares the env value
case-sensitively, so values like "FALSE" or "Off" don't disable the guard;
update the closure inside the static DISABLED.get_or_init to perform a
case-insensitive comparison (e.g., call v.eq_ignore_ascii_case(...) or convert v
to_ascii_lowercase()) when matching against "0", "false", "off", "no" so all
capitalizations correctly disable the guard; the change should be applied within
the is_disabled function and touches the DISABLED initialization logic only.
| pub struct ReadStamp { | ||
| /// Filesystem mtime at the moment of the read. | ||
| pub mtime: SystemTime, | ||
| /// Monotonic clock timestamp of the read. | ||
| pub timestamp: Instant, | ||
| /// Whether the read was partial (paginated / offset+limit). | ||
| pub partial: bool, |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does Rust's std::time::Instantguarantee that consecutiveInstant::now() calls are strictly increasing / unique, or can they be equal because of clock granularity?
💡 Result:
Rust's std::time::Instant does not guarantee that consecutive calls to Instant::now will be unique [1][2]. It guarantees that instants are monotonically nondecreasing, meaning that a newly created instant will always be greater than or equal to any previously measured instant [1][2]. While the clock is intended to move forward, it is possible for consecutive calls to Instant::now to return the exact same value because of the underlying clock's granularity [1][2]. If the time elapsed between two calls is smaller than the resolution of the platform's monotonic clock, the calls may return equal, non-decreasing values [1][2]. In summary: - Monotonicity: Instants are guaranteed to be nondecreasing (never go backward) [1][2]. - Uniqueness: Consecutive calls are not guaranteed to be unique; they can be equal if the time between calls is less than the clock's resolution [1][2]. - Reliability: These guarantees depend on the platform's OS API and hardware, and while they are generally robust on tier 1 platforms, they can theoretically be broken by rare system, hardware, or virtualization bugs [1][2].
Citations:
- 1: https://doc.rust-lang.org/stable/std/time/struct.Instant.html
- 2: https://doc.rust-lang.org/std/time/struct.Instant.html
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "timestamp\s*>\s*|>\s*.*timestamp|Instant::now|SystemTime|ReadStamp|WriteStamp" src/openhuman/file_state -S || true
echo "---- types.rs relevant excerpts ----"
FILE="src/openhuman/file_state/types.rs"
if [ -f "$FILE" ]; then
nl -ba "$FILE" | sed -n '1,120p'
fi
echo "---- ops.rs relevant excerpts ----"
FILE2="src/openhuman/file_state/ops.rs"
if [ -f "$FILE2" ]; then
nl -ba "$FILE2" | sed -n '1,140p'
fiRepository: tinyhumansai/openhuman
Length of output: 3536
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/openhuman/file_state/types.rs"
echo "---- types.rs (lines 1-120) ----"
if [ -f "$FILE" ]; then
sed -n '1,120p' "$FILE"
fi
echo "---- types.rs (around line ~65) ----"
sed -n '50,90p' "$FILE"
FILE2="src/openhuman/file_state/ops.rs"
echo "---- ops.rs (lines 1-140) ----"
if [ -f "$FILE2" ]; then
sed -n '1,160p' "$FILE2"
fi
echo "---- ops.rs (around stale checks) ----"
sed -n '90,140p' "$FILE2"
sed -n '150,230p' "$FILE2"
sed -n '230,330p' "$FILE2"Repository: tinyhumansai/openhuman
Length of output: 17895
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "check_stale_read|stale_reads_for_parent|parent_stale_files" src/openhuman/file_state -S
rg -n "acquire_path_lock|record_read|record_write" src/openhuman -SRepository: tinyhumansai/openhuman
Length of output: 5972
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "check_stale_read|parent_stale_files|check_partial_read" src/openhuman/tools/impl/filesystem -S
rg -n "check_stale_read|parent_stale_files|check_partial_read" src/openhuman -SRepository: tinyhumansai/openhuman
Length of output: 2539
Fix stale-read ordering: don’t rely on Instant-only equality semantics
check_stale_read (src/openhuman/file_state/ops.rs) and stale_reads_for_parent / parent_stale_files (src/openhuman/file_state/types.rs + ops.rs) treat staleness as ws.timestamp > read_stamp.timestamp, but both reads and writes record timestamp: Instant::now().
Instant is only monotonic/non-decreasing, not unique—two consecutive Instant::now() calls can return the same value—so a sibling write that happens after a read can still compare equal and slip past the stale-read guard.
Switch to a process-local strictly increasing sequence (e.g., AtomicU64 fetch_add) and compare (sequence) (or sequence as a tie-breaker alongside Instant) so ordering is strict and unambiguous.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/file_state/types.rs` around lines 12 - 18, The current
staleness logic relies on ReadStamp.timestamp: Instant which can be equal across
nearby events; change the ordering to use a process-local strictly increasing
sequence number instead: add a u64 sequence field to ReadStamp (and any write
stamp structures) and initialize/increment it using a global
AtomicU64.fetch_add(1, Ordering::SeqCst) on every read and write; update all
places that set timestamp (reads/writes) to also set sequence, and change
comparisons in check_stale_read, stale_reads_for_parent, and parent_stale_files
to compare sequence (or sequence as a tie-breaker with Instant) so ordering is
strict and unambiguous.
| // Acquire per-path locks for all unique paths before any reads. | ||
| let unique_paths: Vec<String> = { | ||
| let mut seen = std::collections::HashSet::new(); | ||
| parsed | ||
| .iter() | ||
| .filter_map(|e| { | ||
| if seen.insert(e.path.clone()) { | ||
| Some(e.path.clone()) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| .collect() | ||
| }; | ||
| let mut _path_guards = Vec::new(); | ||
| for p in &unique_paths { | ||
| let full = self.security.action_dir.join(p); | ||
| if let Ok(resolved) = tokio::fs::canonicalize(&full).await { | ||
| if let Some(guard) = file_state::acquire_path_lock(&resolved).await { | ||
| _path_guards.push(guard); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Acquire batch path locks in a deterministic order.
These locks are taken in caller-supplied edit order. Two concurrent patches over the same files in opposite orders will deadlock after each grabs its first lock. Resolve/sort/dedup the path list before locking so every task acquires the same global order.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/tools/impl/filesystem/apply_patch.rs` around lines 147 - 169,
The current lock acquisition uses caller-supplied edit order (built into
unique_paths) which can deadlock; make the order deterministic by deduplicating
and sorting paths before locking: replace the HashSet-based unique_paths
construction with a deterministic collection (e.g., a BTreeSet or collect into a
Vec then sort and dedup) derived from parsed entries’ e.path values, then
iterate that sorted list when resolving with self.security.action_dir,
tokio::fs::canonicalize, and calling file_state::acquire_path_lock so
_path_guards are always acquired in the same global order.
| Ok(contents) => { | ||
| if let Some(agent_id) = file_state::current_file_state_agent_id() { | ||
| let mtime = tokio::fs::metadata(&resolved_path) | ||
| .await | ||
| .ok() | ||
| .and_then(|m| m.modified().ok()) | ||
| .unwrap_or(std::time::SystemTime::UNIX_EPOCH); | ||
| file_state::record_read(&agent_id, resolved_path, mtime, false); | ||
| } | ||
| Ok(ToolResult::success(contents)) |
There was a problem hiding this comment.
Record the read without another await after read_to_string.
The extra tokio::fs::metadata(...).await creates a yield point after the contents are read but before the read stamp is recorded. If a sibling writes in that window, this read gets timestamped after the sibling write even though the returned contents are older, so the next write can slip past check_stale_read. Reuse the metadata you already fetched for the size check, or otherwise call record_read immediately after the read completes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/tools/impl/filesystem/file_read.rs` around lines 94 - 103, The
read recording currently performs a separate await via
tokio::fs::metadata(...).await after read_to_string, creating a race; fix by
reusing the metadata you already obtained for the size check (or capture
metadata before the read) and call file_state::record_read(&agent_id,
resolved_path, <reuse_mtime>, false) immediately after the read completes
without performing another await. Locate the Ok(contents) branch in file_read.rs
where file_state::current_file_state_agent_id() and file_state::record_read(...)
are used and replace the extra tokio::fs::metadata(...).await call with the
previously fetched metadata's modified() time (or the metadata captured before
the read) so record_read uses that mtime synchronously in that branch.
| // File-state guard: reject writes based on stale or partial reads. | ||
| if let Some(agent_id) = file_state::current_file_state_agent_id() { | ||
| if let Some(msg) = file_state::check_stale_read(&agent_id, &resolved_target) { | ||
| tracing::debug!( | ||
| agent = %agent_id, | ||
| path = %resolved_target.display(), | ||
| "[file_state] file_write blocked: stale read" | ||
| ); | ||
| return Ok(ToolResult::error(msg)); | ||
| } | ||
| if let Some(msg) = file_state::check_partial_read(&agent_id, &resolved_target) { | ||
| tracing::debug!( | ||
| agent = %agent_id, | ||
| path = %resolved_target.display(), | ||
| "[file_state] file_write blocked: partial read" | ||
| ); | ||
| return Ok(ToolResult::error(msg)); | ||
| } | ||
| } | ||
|
|
||
| match tokio::fs::write(&resolved_target, content).await { | ||
| Ok(()) => Ok(ToolResult::success(format!( | ||
| "Written {} bytes to {path}", | ||
| content.len() | ||
| ))), | ||
| Ok(()) => { | ||
| if let Some(agent_id) = file_state::current_file_state_agent_id() { | ||
| file_state::record_write(&agent_id, resolved_target); | ||
| } | ||
| Ok(ToolResult::success(format!( | ||
| "Written {} bytes to {path}", | ||
| content.len() | ||
| ))) | ||
| } |
There was a problem hiding this comment.
Serialize file_write's check→write section on the path lock.
Right now check_stale_read and tokio::fs::write are separated by awaits with no per-path lock. Two agents that both read the same file can both clear the guard and the later write will silently overwrite the earlier one. Wrap the stale/partial checks, the write, and record_write in acquire_path_lock(&resolved_target) so this path gets the same atomicity that edit and apply_patch now have.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/tools/impl/filesystem/file_write.rs` around lines 133 - 162,
The checks and write in file_write are not serialized: wrap the stale/partial
checks, the tokio::fs::write call, and the file_state::record_write call in a
per-path lock by calling acquire_path_lock(&resolved_target) (or its
guard-returning form) so the sequence check_stale_read(&agent_id,
&resolved_target) / check_partial_read(&agent_id, &resolved_target) /
tokio::fs::write(&resolved_target, content).await /
file_state::record_write(&agent_id, resolved_target) executes atomically under
that lock; ensure you still obtain current_file_state_agent_id() inside the
locked section and release the lock after record_write completes.
graycyrus
left a comment
There was a problem hiding this comment.
@senamakel — the PR Submission Checklist check is failing because the "Diff coverage >= 80%" checkbox in the PR description is unchecked (the Coverage Gate CI check itself passed — just tick the box).
I skimmed the diff while CI is red and have a couple of things beyond what CodeRabbit flagged:
Memory growth: reads and writes in FileStateCoordinator are unbounded HashMaps with no eviction strategy. Every read and write across all agent turns in a process lifetime accumulates here indefinitely. For long-running server deployments handling many sequential sessions this causes steady memory growth. You need a session-boundary cleanup hook or TTL-based eviction pass.
Double canonicalize in apply_patch: The lock acquisition loop and the stale-read guard loop both call tokio::fs::canonicalize on the same paths. Save the resolved paths from the first loop and reuse them in the second — two filesystem round-trips per path is redundant.
CodeRabbit's open findings — deadlock in apply_patch lock ordering, missing path lock in file_write, extra-await race window in file_read, non-unique bus agent ID — also need addressing before this is ready. Fix the CI check and those findings and I'll come back for a full pass.
| pub struct FileStateCoordinator { | ||
| /// Per-agent, per-resolved-path read stamps. | ||
| /// Key: `(agent_id, canonical_path)`. | ||
| pub(crate) reads: RwLock<HashMap<(String, PathBuf), ReadStamp>>, |
There was a problem hiding this comment.
[major] These maps accumulate entries for the entire process lifetime with no eviction or session-scoped cleanup. In a long-running server that handles many sequential agent sessions, every read and write ever recorded stays in memory until the process exits. Add a session-boundary cleanup hook (drain entries for a given session ID when it tears down) or a TTL-based eviction pass.
Summary
FileStateCoordinatorthat tracks per-agent read stamps and per-path write stamps across parallel subagents.file_write,edit,apply_patch) detect and reject writes based on stale sibling edits or partial reads.editandapply_patchacquire per-path async locks to serialize read-modify-write sections.spawn_parallel_agentsannotates results withstale_parent_readswhen child writes invalidate the parent's view.OPENHUMAN_FILE_STATE_GUARD=0(orfalse/off/no).Problem
Parallel subagents and worker threads share a workspace. Without coordination, one worker can read a file, a sibling can edit it, and the first worker later writes based on stale content. Path ownership hints reduce this risk but do not enforce it.
Solution
A new
file_statedomain module (src/openhuman/file_state/) implements a process-global singleton coordinator:file_readrecords(agent_id, resolved_path, mtime, partial)after every successful read.file_write,edit,apply_patchrecord(agent_id, resolved_path)after every successful write.editandapply_patchacquire a per-pathtokio::Mutexfor their read-modify-write sections.spawn_parallel_agentschecks if child writes touched files the parent had previously read and annotates the result JSON.FILE_STATE_AGENT_ID(following thesandbox_contextpattern) carries the currently-executing agent's identity so file tools can attribute reads/writes without widening theTooltrait.OPENHUMAN_FILE_STATE_GUARD=0disables the entire coordinator (zero overhead).Design tradeoffs:
parking_lot::RwLock<HashMap>rather than adding adashmapdependency.Submission Checklist
Closes #NNNbelowImpact
Related
Closes #3253
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
issue/3253-guard-cross-agent-file-edits-from-staleValidation Run
pnpm --filter openhuman-app format:check— passed (pre-push hook)pnpm typecheck— N/A (no TS changes)cargo test -- file_state(11 passed),cargo test -- file_read(all passed),cargo test -- spawn_parallel(9 passed)cargo fmt+cargo checkpassedcargo check --manifest-path app/src-tauri/Cargo.tomlpassedValidation Blocked
Behavior Changes
Parity Contract
OPENHUMAN_FILE_STATE_GUARD=0.Duplicate / Superseded PR Handling
Summary by CodeRabbit
Release Notes
OPENHUMAN_FILE_STATE_GUARDconfiguration option to control file state guard behavior.