feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR adds shared worker lifecycle primitives, refactors wallet sync managers to use them, and propagates incomplete-shutdown status through wallet, FFI, and Swift result handling. ChangesCoordinator Shutdown and Lifecycle Refactor
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## v3.1-dev #3954 +/- ##
============================================
+ Coverage 87.17% 87.20% +0.02%
============================================
Files 2629 2634 +5
Lines 327221 328719 +1498
============================================
+ Hits 285265 286666 +1401
- Misses 41956 42053 +97
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)
418-459: 🩺 Stability & Availability | 🟠 MajorDon't overwrite an unjoined coordinator generation.
stop()removes onlybackground_canceland leaves the previousbackground_joinunjoined. Astop()→start()sequence passes thecancel_guard.is_some()check and overwritesbackground_joinwith a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 - 459, The start() method can overwrite an existing background_join handle before it has been properly joined, causing resource leaks. Before spawning the new identity-sync thread and storing its handle in background_join, add a check to ensure any existing join handle in background_join has been properly cleaned up or joined first. This can be done either by joining the existing handle before proceeding, or by verifying that background_join is None before allowing the new thread spawn to proceed. This ensures the prior thread's OS handle is not lost and can be properly shutdown.packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)
218-255: 🩺 Stability & Availability | 🟠 MajorAdd generation guard to prevent thread handle loss after stop/start cycles.
After
stop()cancels the token,background_cancelbecomesNonewhile the old thread keeps running. A subsequentstart()seescancel_guard.is_some() == falseand spawns a new thread, unconditionally overwritingbackground_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation'sbackground_canceltoken as it shuts down, creating a race where the new loop runs but appears stopped tois_running().Both
IdentitySyncManagerandShieldedSyncManageralready implement this pattern: they incrementbackground_generationon eachstart(), passmy_gento the spawned thread, and checkbackground_generation.load() == my_genbefore clearingbackground_cancelon exit. Apply the same approach here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around lines 218 - 255, The thread handle for the platform address sync loop can be lost during stop/start cycles because a new thread spawns and unconditionally overwrites the background_join handle before the old thread finishes cleanup. Additionally, the old thread clears the new generation's background_cancel token, creating a race condition. Implement the generation guard pattern already used in IdentitySyncManager and ShieldedSyncManager: add a background_generation atomic counter to the struct, increment it at the start of the start() method, capture the current generation as my_gen before spawning the thread, pass my_gen into the spawned thread closure, and modify the cleanup code (where background_cancel is set to None) to only clear the token if background_generation.load() equals my_gen, ensuring old exiting threads do not interfere with new generations.packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)
245-300: 🩺 Stability & Availability | 🔴 CriticalDon't replace a pending shielded-sync join handle on rapid stop/start cycles.
The
start()method checks onlybackground_cancelto guard against concurrent starts. Whenstop()removes the token frombackground_cancel, a subsequentstart()proceeds to spawn a new thread and overwritesbackground_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearingbackground_cancel—it does not protectbackground_join. This leaves prior-generation handles permanently lost.To fix: either require
quiesce()beforestart()to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 - 300, The `start()` method only checks `background_cancel` to guard against concurrent starts, but when `stop()` clears this token, a subsequent `start()` can spawn a new thread and overwrite the `background_join` handle before the previous generation's thread has finished cleaning up. The generation check at the cleanup section prevents the old thread from clearing `background_cancel`, but does not protect `background_join` from being overwritten. Fix this by either adding logic to join all pending prior-generation join handles before storing the new one (by tracking handles per-generation), or by ensuring that before assigning a new join handle to `background_join` in the `start()` method, any existing pending handle from a prior generation is properly joined and waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 418-459: The start() method can overwrite an existing
background_join handle before it has been properly joined, causing resource
leaks. Before spawning the new identity-sync thread and storing its handle in
background_join, add a check to ensure any existing join handle in
background_join has been properly cleaned up or joined first. This can be done
either by joining the existing handle before proceeding, or by verifying that
background_join is None before allowing the new thread spawn to proceed. This
ensures the prior thread's OS handle is not lost and can be properly shutdown.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 218-255: The thread handle for the platform address sync loop can
be lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 245-300: The `start()` method only checks `background_cancel` to
guard against concurrent starts, but when `stop()` clears this token, a
subsequent `start()` can spawn a new thread and overwrite the `background_join`
handle before the previous generation's thread has finished cleaning up. The
generation check at the cleanup section prevents the old thread from clearing
`background_cancel`, but does not protect `background_join` from being
overwritten. Fix this by either adding logic to join all pending
prior-generation join handles before storing the new one (by tracking handles
per-generation), or by ensuring that before assigning a new join handle to
`background_join` in the `start()` method, any existing pending handle from a
prior generation is properly joined and waited for completion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e
📒 Files selected for processing (5)
packages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)
405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winHandle unclean shielded quiesce before clearing state.
quiesce()now returns a meaningful shutdown status. Ignoring it letsclear_shielded()proceed tocoord.clear()afterTimeout,Stopped, orPanicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.Proposed fix
- self.shielded_sync_manager.quiesce().await; + let status = self.shielded_sync_manager.quiesce().await; + if !status.is_clean() { + return Err(crate::error::PlatformWalletError::ShieldedStoreError( + format!("shielded sync did not stop cleanly before clear: {status:?}"), + )); + } if let Some(coord) = self.shielded_coordinator().await { coord.clear().await?; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The clear_shielded function ignores the return value from self.shielded_sync_manager.quiesce().await, which now provides meaningful shutdown status information. Instead of ignoring this result, capture the return value and check if the quiesce completed cleanly. If quiesce returns a status indicating Timeout, Stopped, or Panicked, return an appropriate error from clear_shielded rather than continuing to call coord.clear(), which could race with a still-running shielded pass. Only proceed with coord.clear() when quiesce has successfully shut down cleanly.
463-477: 🩺 Stability & Availability | 🟠 MajorWrap
quiescinginAtomicFlagGuardto ensure cancellation-safe reset in all coordinatorquiesce()implementations.The current implementations set
quiescing = truebefore the awaited drain loop and reset it only after. Iftokio::time::timeoutdrops the future during the loop, the reset never executes, permanently wedgingquiescingand blocking all future syncs.All three coordinators (
platform_address_sync.rs:291,identity_sync.rs:494,shielded_sync.rs:334) have identical patterns. Use the sameAtomicFlagGuardapproach already correctly applied tois_syncinginsync_now():pub async fn quiesce(&self) -> super::CoordinatorThreadStatus { self.quiescing.store(true, Ordering::Release); let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing); self.stop(); while self.is_syncing.load(Ordering::Acquire) { tokio::time::sleep(Duration::from_millis(20)).await; } // quiescing.store(false) removed — guard handles reset on all exit paths // ...rest of implementation }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The `quiesce()` method implementations in all three coordinators (the ones containing the timeout calls for `platform_address_sync_manager`, `identity_sync_manager`, and `shielded_sync_manager`) don't properly handle cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths including early cancellation. In each coordinator's `quiesce()` method, after setting `quiescing` to true, immediately create an `AtomicFlagGuard` using `AtomicFlagGuard::new(&self.quiescing)`, and remove any manual `quiescing.store(false)` reset call at the end since the guard will handle it automatically. Use the same pattern already correctly implemented in `sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)
8-15: 📐 Maintainability & Code Quality | 🔵 TrivialAdd
#[must_use]annotation toAtomicFlagGuard.The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with
#[must_use]to catch accidental non-binding at compile time.Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool); +#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"] +pub struct AtomicFlagGuard<'a>(&'a AtomicBool);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The AtomicFlagGuard struct can be accidentally dropped without being bound to a variable, causing the flag to be reset immediately and breaking the guarded scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct definition to make the compiler warn when the guard is not explicitly bound to a variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 405-408: The clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.
- Around line 463-477: The `quiesce()` method implementations in all three
coordinators (the ones containing the timeout calls for
`platform_address_sync_manager`, `identity_sync_manager`, and
`shielded_sync_manager`) don't properly handle cancellation when
`tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an
`AtomicFlagGuard` to ensure it's reset on all exit paths including early
cancellation. In each coordinator's `quiesce()` method, after setting
`quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
---
Nitpick comments:
In `@packages/rs-dash-async/src/atomic.rs`:
- Around line 8-15: The AtomicFlagGuard struct can be accidentally dropped
without being bound to a variable, causing the flag to be reset immediately and
breaking the guarded scope behavior. Add the #[must_use] attribute to the
AtomicFlagGuard struct definition to make the compiler warn when the guard is
not explicitly bound to a variable, ensuring the developer catches this mistake
at compile time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
packages/rs-dash-async/src/atomic.rspackages/rs-dash-async/src/lib.rspackages/rs-platform-wallet/Cargo.tomlpackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/rs-platform-wallet/src/manager/platform_address_sync.rs
- packages/rs-platform-wallet/src/manager/shielded_sync.rs
|
✅ Review complete (commit a89e0f9) |
|
🟠 MEDIUM — Sibling Stop/Clear paths call Posted at PR level because the affected call sites are not part of this PR's diff.
Fix: apply the same timeout bound (or a Stop/Clear-specific one) around these 🤖 Co-authored by Claudius the Magnificent AI Agent |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Solid lifecycle hardening overall — joining coordinator threads and the RAII is_syncing guard close real races. Three in-scope blockers undermine the new shutdown contract: the FFI destroy still returns success on non-clean shutdown (Swift may free a context a coordinator thread still owns), the bounded timeout doesn't actually bound anything because spawn_blocking tasks are non-abortable, and start-after-stop overwrites the saved JoinHandle so a later shutdown cannot join the stranded thread. Two suggestions and two test-quality nits round out the review.
🔴 3 blocking | 🟡 2 suggestion(s) | 💬 2 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet-ffi/src/manager.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:366-376: Destroy returns ok() on non-clean shutdown — re-opens callback-after-free window
`shutdown()` can now legitimately return `CoordinatorThreadStatus::Timeout` (and `Stopped`/`Panicked`/`Error`) meaning a coordinator's OS thread or the event-adapter task did not actually join. This FFI entry point logs that outcome and still returns `PlatformWalletFFIResult::ok()`. The Swift host (`PlatformWalletManager.swift` deinit) is documented to free the callback `context` once `platform_wallet_manager_destroy` returns; meanwhile the still-alive coordinator thread holds an `Arc<FFIEventHandler>` / `Arc<FFIPersister>` and can fire `on_*_sync_completed` or `persister.store(...)` through the now-dangling `context` pointer. That is precisely the use-after-free this PR set out to close — the previous unbounded wait would have hung instead of returning false success. Surface non-clean shutdowns as a distinct, non-ok result code so the host knows not to free its context (or keep the FFI-owned handler `Arc` alive on the non-clean path, e.g. `mem::forget`, so any lingering callback remains memory-safe).
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:178-192: Timeout doesn't actually bound shutdown — spawn_blocking is non-abortable
`shutdown()` wraps each `quiesce()` in `tokio::time::timeout`, but `join_coordinator_thread` moves the `std::thread::JoinHandle` into `spawn_blocking(move || handle.join())`. Tokio blocking tasks cannot be aborted once started: dropping the outer timeout future stops `await`ing the `JoinHandle` but the underlying blocking task is still parked inside `handle.join()`, keeping the coordinator thread's `Arc<…SyncManager>` (and the host callback contexts it transitively holds) alive. When the caller then drops the multi-thread runtime, `Runtime::drop` returns without waiting for blocking tasks, leaving an OS thread plus a stranded blocking thread alive on a freed runtime — which is the very `"Tokio 1.x context … being shutdown"` race the PR cites in its motivation. Make the join cancellation-aware by polling `handle.is_finished()` (with a short async sleep) before the final `handle.join()`, so dropping the timeout future actually releases all state.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:441-459: assert! on runtime flavor is incorrect — spawn_blocking works on current_thread
The promotion from `debug_assert!` to `assert!` is justified in the docstring by the claim that `spawn_blocking` 'is not available on `current_thread` runtimes and will panic there.' That's not how Tokio works: `spawn_blocking` dispatches to the runtime's shared blocking pool, which both `multi_thread` and `current_thread` runtimes provision; awaiting the returned `JoinHandle` simply yields the runtime task. Today's only in-tree caller (`platform-wallet-ffi/src/runtime.rs:34`) builds a multi-thread runtime, but `shutdown()` is now a public Rust API. Any downstream consumer using `#[tokio::main(flavor = "current_thread")]` (or `Builder::new_current_thread()`) will hit a release-mode panic on a configuration that would have worked. Revert to `debug_assert!` (or drop it entirely) and correct the docstring rationale.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-448: start-after-stop overwrites background_join, dropping the previous thread handle
`stop()` takes `background_cancel` (leaving it `None`) but never touches `background_join`. A subsequent `start()` passes the `cancel_guard.is_some()` early-return check, spawns a fresh thread, and at line 447 unconditionally overwrites `background_join` with the new handle — the previous handle is dropped (detached). The old coordinator thread is cancellation-bound but not yet exited; it can still be inside its `block_on` polling `tokio::time` and feeding the event manager when the new handle replaces it. A later `shutdown()` then can only join the new thread, so the original thread can outlive `shutdown()` and touch the host's freed callback context — recreating exactly the runtime-drop race this PR is meant to eliminate. The same pattern applies to `platform_address_sync` and `shielded_sync`. Fix by taking-and-joining (or refusing) any non-`None` handle inside the `start()` lock before installing the new one.
In `packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/platform_address_sync.rs:291-304: quiescing gate is reopened before the join completes — sync_now can slip a pass through
`quiesce()` clears `self.quiescing` to `false` *before* taking `background_join` and awaiting `join_coordinator_thread`. The join can block for up to `SHUTDOWN_JOIN_TIMEOUT_SECS` (30s). During that window the loop is cancelled and won't start new passes, but an external caller invoking `sync_now` (e.g. an FFI on-demand sync) finds a fully open gate — `quiescing=false`, `is_syncing=false` — and runs a complete pass, including the `on_platform_address_sync_completed` host callback. That breaks the documented `quiesce()` contract that no further callback can fire after it returns, undermining the manager-level shutdown guarantee. Same pattern in `identity_sync::quiesce` and `shielded_sync::quiesce`. Move `quiescing.store(false, …)` to *after* the join completes (or have `sync_now`/`sync_wallet` also consult `background_join.is_some()`).
|
Update — SEC-002 fixed in 🤖 Co-authored by Claudius the Magnificent AI Agent |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Prior finding (start()/wedge-backstop detaching a live coordinator) is FIXED at head 76c8bee via reap_prior_or_park + CoordinatorOrphans + join_detached_orphans, with Detached folded into all_clean(). However, two related lifecycle gaps in the same orphan-tracking design remain: (1) quiesce() takes the JoinHandle out of background_join before awaiting join_coordinator_thread, so a wrapping tokio::time::timeout (used by shutdown(), clear_shielded(), and the FFI shielded_sync_stop bridge) drops the live handle and detaches the thread without parking it; on the explicit retry path the next call sees background_join == None and reports NotRunning/clean while a wedged thread may still call host callbacks. (2) clear_shielded() consults only the current shielded quiesce status and wipes the commitment-tree store even when an earlier wedged shielded coordinator is still alive in coordinator_orphans, breaking the PR's 'Clear leaves the store intact whenever a sync drain is incomplete' invariant. Both are in scope for the PR's stated FFI lifecycle/UAF hardening.
🔴 2 blocking
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:240-261: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the store while parked coordinator orphans may still be alive
clear_shielded() gates the store wipe on `self.shielded_sync_manager.quiesce()` returning a clean status, but a tight stop()→start() sequence can park a previously-wedged shielded coordinator thread into coordinator_orphans (reap_prior_or_park, mod.rs:296-332). Those parked threads hold Arc references to the same shielded coordinator / persister context the clear() below is about to wipe, and shutdown() is currently the only path that drains them via join_detached_orphans. As soon as the current quiesce reports clean, this path calls coord.clear() while a parked orphan may still be inside its sync_now drop / persister fan-out, violating the PR's stated invariant that Clear leaves the store intact whenever a sync drain is incomplete and re-creating the store-desync the orphan tracking was added to surface. Mirror the shutdown() path: after quiesce, drain coordinator_orphans (with SHUTDOWN_ORPHAN_GRACE_SECS) and refuse the wipe with ShieldedShutdownIncomplete if any orphan is still Detached/Panicked.
|
@thepastaclaw please review |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.
🔴 1 blocking
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).
A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.
The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.
Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Both prior blocking findings are structurally fixed in the new ThreadRegistry engine (prior-1 by the slot-owned JoinHandle + generation-scoped Repark drop-guard; prior-2 by clear_shielded's any_alive_for(ShieldedSync) gate under a held quiescing guard). One new blocking regression was introduced by the CoordinatorLifecycle dedup: the shared quiesce() no longer raises the quiescing gate itself or drains is_syncing — it only delegates to ThreadRegistry::quiesce, whose drain hook is skipped when no slot is registered — so a direct sync_now/sync_wallet in flight (which the FFI exposes without requiring the background loop to be running) is no longer awaited before quiesce reports clean. Two narrower follow-ups (symmetric orphan check on shielded_sync_stop FFI; out-of-lock prior-handle reap racing shutdown's orphan reap) round out the review.
🔴 1 blocking | 🟡 2 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:146-149: Shared quiesce() no longer drains direct in-flight sync passes
The pre-refactor per-coordinator `quiesce()` explicitly raised `quiescing` and awaited `is_syncing` to fall before joining (`self.quiescing.store(true, ...); self.stop(); while self.is_syncing { sleep(20ms) }`). The new shared `CoordinatorLifecycle::quiesce` is `let _g = AtomicFlagGuard::new(&self.quiescing); self.registry.quiesce(self.worker).await.into()`. `AtomicFlagGuard::new` (rs-dash-async/src/atomic.rs:18-22) deliberately does **not** raise the flag — only the registry's drain hook does, and `ThreadRegistry::quiesce` (registry.rs:507-519) early-returns `NotRunning` without invoking the drain hook whenever no live slot is registered.
The FFI exposes `platform_wallet_manager_shielded_sync_sync_now` / `shielded_sync_wallet` (shielded_sync.rs:208, 590) — and the same shape for identity / platform-address — which call `shielded_sync().sync_now(true)` / `sync_wallet(..., true)` directly, with no requirement that the background loop is running. So a host-driven direct pass can be in flight while `clear_shielded` calls `quiesce()`: the registry has no slot, the drain hook never fires, the gate is never raised, and `quiesce()` returns `NotRunning` — clean — immediately. `clear_shielded` (mod.rs:531-562) then takes the `is_clean()` branch, raises the gate via `hold_quiescing_gate()` (too late — the in-flight pass already passed `begin_pass()` and does not recheck mid-pass), and `any_alive_for(ShieldedSync)` returns false (it only inspects slots/orphans, not `is_syncing`). `coord.clear().await` wipes the store while the direct pass keeps writing through `coord.sync(...)` and its persister fan-out.
This directly contradicts the documented contract on `clear_shielded` (mod.rs:517-518: "A concurrent direct `sync_now`/`sync_wallet` is held off (the quiescing gate stays raised across the liveness check and the wipe)"), and is the same store-desync class the prior F2 finding closed for parked orphans. Fix by restoring the explicit gate raise + `is_syncing` drain in the shared `quiesce`, before delegating to the registry — symmetric with what each coordinator did before the dedup.
In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:117-134: shielded_sync_stop FFI does not consult any_alive_for(ShieldedSync)
`platform_wallet_manager_shielded_sync_stop` returns `Success` whenever the per-key `quiesce()` lands clean (Ok/NotRunning). It does not consult `registry.any_alive_for(WalletWorker::ShieldedSync)` — the symmetric gate `clear_shielded` adopted at manager/mod.rs:553 (the F2 fix) and the destroy path covers via `CoordinatorExitStatus::all_clean()` which folds in `detached_threads`.
A prior tight start->stop reap that had to park a wedged shielded thread as an orphan would not block the next stop's `Success`, even though the parked thread still holds `Arc`s to the host's `FFIEventHandler::context` / `FFIPersister::context` pointers and can fire one final callback through them. The function's docstring (shielded_sync.rs:80-83) says "On this code the host must NOT free the callback context immediately" only on the non-clean branch — implying that on `Success` the host MAY free it.
Today's only consumer (Swift `deinit`) always follows stop with destroy, and destroy's `manager.shutdown()` drains orphans, so no UAF currently reaches a host. The contract gap is still worth closing symmetrically with the other two quiesce-then-touch-host-context teardown paths.
In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:406-414: start_thread's out-of-lock prior reap races shutdown's orphan reap
`start_thread` takes the prior handle under the slot lock, releases the lock at line 406, then calls `self.reap_prior_or_park(prior, key)` synchronously at line 413. `reap_prior_or_park` for an OS thread (registry.rs:737-762) spins up to `reap_backstop` (1 s in the wallet) before pushing a still-wedged handle into orphans. The `closing` latch and shutdown's tier snapshot only synchronize on the slot lock — they do not block on this out-of-lock reap. Interleaving from different threads:
1. A: `start_thread` takes the slot lock, passes the `closing` check, takes the wedged `prior`, installs the new gen handle, releases the lock. A is now spinning in `reap_prior_or_park`.
2. B: `shutdown()` takes the slot lock, sets `closing=true`, snapshots tiers (sees only the NEW gen handle — `prior` was already moved out), releases.
3. B drains the new gen via `quiesce` (clean).
4. B calls `reap_orphans_impl(reap_backstop)` — orphan list is still empty because A hasn't pushed yet.
5. B returns a clean `ShutdownReport` → `CoordinatorExitStatus::all_clean()` → FFI `destroy` returns `ok()` → host frees its callback context.
6. A's 1 s backstop expires; A pushes the still-live wedged prior into orphans and returns.
The prior thread is now a live orphan holding `Arc`s to the freed host context — the same UAF surface F1 was designed to close. Requires (a) a wedged-past-backstop prior, (b) `start_thread` invoked concurrently with `shutdown` from a different thread, and (c) shutdown's reap_orphans phase to land in A's 1 s backstop window — narrow, but the asymmetric to the closing-latch protection the PR explicitly adds. Close by parking `prior` immediately under the slot lock and reaping asynchronously, or by tracking in-flight reaps so `reap_orphans_impl` waits for them to settle.
|
@coderabbitai review all |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)
363-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUpdate the inline comment to match the new FFI behavior.
Lines 363-365 still say destroy will "just log it and drop it", but Lines 367-382 now surface non-clean shutdowns as
ErrorShutdownIncomplete. Leaving the old comment here is misleading in the callback-lifetime path this PR is hardening.✏️ Suggested edit
- // It now joins the coordinator OS threads and returns their - // per-thread exit status; the C ABI exposes none of that, so we - // just log it (a panicked loop is worth surfacing) and drop it. + // It now joins the coordinator OS threads and returns their + // per-thread exit status. The C ABI does not expose that full + // structure directly, but a non-clean aggregate is surfaced as + // `ErrorShutdownIncomplete` so the host can keep its callback + // context alive until lingering workers settle.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 363 - 365, The inline comment near the destroy/callback-lifetime handling in manager::destroy is outdated and still says the shutdown result is only logged and dropped. Update that comment to describe the current FFI behavior in destroy and the related callback-lifetime path: the coordinator OS threads are joined, and non-clean shutdowns are surfaced as ErrorShutdownIncomplete instead of being silently discarded. Keep the wording aligned with the destroy logic and ErrorShutdownIncomplete handling so the comment matches the implemented shutdown semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-dash-async/src/registry.rs`:
- Around line 674-679: The shutdown path in `Registry::shutdown` is discarding
the orphan result status from `reap_orphans_impl()`, which can hide non-clean
exits. Update `ShutdownReport` and its construction so the orphan status is
preserved alongside `detached`, and adjust `ShutdownReport::all_clean()` to
consider that stored status instead of treating parked orphans as clean by
default.
In `@packages/rs-platform-wallet/src/changeset/core_bridge.rs`:
- Around line 68-110: When the cancel branch in core_bridge.rs wins the
tokio::select! race, the loop exits immediately and can drop buffered broadcast
events, so update the cancellation handling in the main event loop to drain
receiver before breaking. Use the existing build_core_changeset and
persister.store flow for each pending event from receiver.try_recv(), preserve
the core.is_empty_no_records() skip, and keep the existing error/warn logging
style keyed by wallet_id so shutdown still persists any queued CoreChangeSet
updates.
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 407-410: Update the `stop()` documentation in `identity_sync.rs`
to reflect that background sync passes are cancellable and may exit early when
shutdown is requested. Adjust the wording around `sync_now()`/`tokio::select!`
in `stop()` and any related comments so it no longer claims a pass will always
run to completion after `stop()`, and make the doc mention that cancellation can
interrupt an in-flight `.await`.
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 515-522: The `clear_shielded`/`shielded_sync_start` path in
`manager::mod` still relies on a host-side precondition, but `start()` can race
`clear()` and re-persist into a wiped store. Add a per-key clearing latch in the
wallet manager/registry logic, have `shielded_sync_start` honor it, and keep it
held across quiesce, liveness check, and `coord.clear()` so the clear/start
exclusion is enforced inside the wallet rather than by the caller.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 206-210: Update the `stop()` documentation in
`PlatformAddressSync` to reflect that cancellation can win over `sync_now()` in
the `tokio::select!` loop, so a background pass may exit before reaching
`on_platform_address_sync_completed`. Adjust the docs/comments near `stop()`,
`sync_now()`, and the select loop to clearly state that shutdown does not
guarantee completion of an in-flight sync pass.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 226-230: Update the stop() documentation to reflect that
background-pass cancellation can interrupt an in-flight sync_now(false) call
rather than guaranteeing the persister fan-out completes. In shielded_sync.rs,
adjust the docs/comments around the tokio::select! path in
ShieldedSyncManager::stop and the background coordinator.sync() flow to state
that cancellation may win the race and the sync can be dropped at an await. Keep
the description aligned with the actual behavior of sync_now(false) and the
cancel.cancelled() branch.
---
Nitpick comments:
In `@packages/rs-platform-wallet-ffi/src/manager.rs`:
- Around line 363-365: The inline comment near the destroy/callback-lifetime
handling in manager::destroy is outdated and still says the shutdown result is
only logged and dropped. Update that comment to describe the current FFI
behavior in destroy and the related callback-lifetime path: the coordinator OS
threads are joined, and non-clean shutdowns are surfaced as
ErrorShutdownIncomplete instead of being silently discarded. Keep the wording
aligned with the destroy logic and ErrorShutdownIncomplete handling so the
comment matches the implemented shutdown semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef1d0648-a464-44ab-8151-583159ad51f5
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (20)
packages/rs-dash-async/Cargo.tomlpackages/rs-dash-async/src/atomic.rspackages/rs-dash-async/src/lib.rspackages/rs-dash-async/src/registry.rspackages/rs-platform-wallet-ffi/Cargo.tomlpackages/rs-platform-wallet-ffi/src/error.rspackages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet-ffi/src/shielded_sync.rspackages/rs-platform-wallet/Cargo.tomlpackages/rs-platform-wallet/src/changeset/core_bridge.rspackages/rs-platform-wallet/src/changeset/mod.rspackages/rs-platform-wallet/src/error.rspackages/rs-platform-wallet/src/lib.rspackages/rs-platform-wallet/src/manager/coordinator_lifecycle.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rspackages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swiftpackages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
All three prior findings (911f99f) are resolved at 748c4f8: CoordinatorLifecycle::quiesce now self-raises the quiescing gate and drains in-flight direct passes (coordinator_lifecycle.rs:166-173, with a regression test); the shielded_sync_stop FFI now consults manager.shielded_worker_alive() and returns ErrorShutdownIncomplete when a prior-generation shielded orphan is parked (shielded_sync.rs:127-157); and start_thread now parks the prior worker under the slot lock via park_prior_locked before the out-of-lock reap, with a regression test in the registry. One in-scope cumulative concern remains: the shared quiescing AtomicBool means a concurrent quiesce()'s AtomicFlagGuard drop can lower the gate that clear_shielded's hold_quiescing_gate is supposed to keep raised continuously — the mechanism does not match the doc's continuous-hold promise in clear_shielded_inner. In practice this is gated by the host's UI-thread serialization, so it is a suggestion, not a blocker.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-173: clear_shielded's "continuously held" gate can be lowered by a concurrent quiesce()
`clear_shielded_inner` holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across drain → liveness check → wipe, expecting the shielded `quiescing` boolean to stay `true` for the whole sequence (and the doc at manager/mod.rs:517-520 explicitly promises "the quiescing gate is raised continuously" so concurrent direct `sync_now`/`sync_wallet` are held off). But the same `AtomicBool` backs `CoordinatorLifecycle::quiesce`, which builds its own `AtomicFlagGuard` (coordinator_lifecycle.rs:171); `AtomicFlagGuard::drop` unconditionally stores `false` (rs-dash-async/src/atomic.rs:25-29). If `platform_wallet_manager_shielded_sync_stop` is invoked concurrently with `clear_shielded` (the FFI handle store only takes a brief read lock in `with_item`, so two calls on the same handle can interleave), the stop's `quiesce()` returns and its dropping guard clears the flag while clear is still between its drain and its wipe. A direct `sync_now`/`sync_wallet` then sees `quiescing=false` in `begin_pass`, claims `is_syncing`, and can re-persist into the store `coord.clear()` is about to wipe — a store/host desync. The host's UI-thread serialization mitigates this today, but the documented invariant on `clear_shielded` only excludes concurrent `shielded_sync_start`, not `_stop`. Fix options: refcount the gate (so the inner drop doesn't lower it under the outer hold), use a clear-specific second flag that `begin_pass` also consults, or tighten the host-serialization doc to enumerate `_stop` too.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Incremental review of 748c4f8..6ed5200. Latest delta adds the server-advertised rate-limit ban path in rs-dapi-client + dashmate gateway wiring, simplifies platform_wallet_manager_shielded_sync_stop to cancel-only, drains buffered wallet events on cancel in event-adapter (T7), and threads orphan_status through ShutdownReport. No new blocking defects found. The prior architectural finding around clear_shielded's continuously-held gate vs. CoordinatorLifecycle::quiesce on the same AtomicBool is STILL VALID but now latent: ShieldedSyncManager::quiesce remains pub on the same atomic, however no FFI export or in-tree Rust caller races it against clear_shielded (verified). Carried forward as suggestion only. Model coverage degraded: Codex (general/rust-quality/ffi-engineer) did NOT run this round due to ACP/native auth stall; review is Claude-only.
Model coverage note: Claude general, Claude rust-quality, and Claude ffi-engineer ran. Codex general/rust-quality/ffi-engineer did not run because ACP-Codex authentication stalled during the authenticate handshake and native Codex CLI returned 401/missing bearer; this review is therefore degraded from the usual Claude+Codex matrix.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:166-249: Public ShieldedSyncManager::quiesce can still lower clear_shielded's continuously-held gate (carry-forward of prior 748c4f82-1)
`clear_shielded_inner` (manager/mod.rs:548) holds `_clearing_gate = shielded_sync_manager.hold_quiescing_gate()` across the entire drain → liveness-check → store-wipe sequence, expecting the shielded `quiescing` boolean to stay raised so no direct `sync_now`/`sync_wallet` slips past and re-persists into the store being wiped (mod.rs docs 515-524). The hold guard at coordinator_lifecycle.rs:243-249 and the RAII guard inside `CoordinatorLifecycle::quiesce` at line 171 both wrap the SAME `Arc<AtomicBool>`, and `AtomicFlagGuard::drop` unconditionally `store(false, Release)`s. A concurrent invocation of `ShieldedSyncManager::quiesce` (which is still `pub` at shielded_sync.rs:280) would lower the gate mid-Clear, opening exactly the window the docstring promises is closed.
Reachability on this head: verified there is no FFI export that triggers this — `platform_wallet_manager_shielded_sync_stop` is cancel-only in this delta, `destroy` goes through the registry shutdown path whose drain hook only stores `true` with no RAII guard (lines 121-131), and `shielded_clear` uses `quiesce_under_held_gate`. The only in-tree caller of `shielded_sync().quiesce()` is the internal test at coordinator_lifecycle.rs:346. So the cross-boundary exploit path is closed by surface; the residual is a fragile in-process Rust contract that future callers can break silently. The module docs at mod.rs:515-524 already acknowledge this as a follow-up (per-key clearing latch / refcount on the registry). Suggested cheap structural fixes: narrow `ShieldedSyncManager::quiesce` to `pub(crate)` (no FFI consumer today), refcount the gate, or force every caller through `quiesce_under_held_gate` after `hold_quiescing_gate`. Non-blocking — flagged so the latent fragility does not regrow once new shielded entry points are added.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Two blocking issues found at HEAD 284752a. (1) ShieldedSyncManager::start()'s spawn_periodic_loop calls reopen_quiescing_gate() before start_thread's clearing-latch check — a racing FFI shielded_sync_start during clear_shielded silently lowers the clear's continuously-held quiescing gate, partially defeating the new latch. (2) The Swift ShieldedService.clearLocalState catches clearShielded()'s ShieldedShutdownIncomplete error and continues to wipe SwiftData, violating the explicit PlatformWalletResult contract ("must NOT commit its own persistence wipe — the Rust store was left intact"). Prior carry-forward finding (public ShieldedSyncManager::quiesce can lower the held gate) is STILL VALID as a latent suggestion. Several lower-severity issues around the ClearingGuard's set-membership semantics and its test coverage are kept as suggestions/nitpicks.
🔴 2 blocking | 🟡 1 suggestion(s) | 💬 1 nitpick(s)
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: Refused shielded (re)start still lowers clear_shielded's continuously-held quiescing gate
`spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 (which `store(false, Release)`s `quiescing`) before invoking `registry.start_thread(...)` at line 142. The new per-key clearing latch correctly refuses the start inside `start_thread` (registry.rs:414-416), but by then the gate has already been lowered. `clear_shielded_inner` (manager/mod.rs:390-415) holds the gate up via `hold_quiescing_gate()` precisely to keep direct `sync_now`/`sync_wallet` callers blocked across drain → liveness-check → wipe, and a single shared `AtomicFlagGuard` always clears the flag on drop — the clear path has no way to re-raise it. So an FFI `platform_wallet_manager_shielded_sync_start` (exported at rs-platform-wallet-ffi/src/shielded_sync.rs:57) landing during a clear will: (a) call `ShieldedSyncManager::start()` → `spawn_periodic_loop` → `reopen_quiescing_gate()` setting `quiescing=false`, (b) be refused by the clearing latch at the registry, (c) leave the gate down for the remainder of the clear. A direct `sync_now`/`sync_wallet` going through `begin_pass` then observes `quiescing=false` and slips past the barrier, re-persisting into the store the clear is about to wipe — the exact race the new latch was added to close. Fix by checking the clearing latch BEFORE reopening the gate (e.g. early-return from `spawn_periodic_loop` if `registry.is_clearing(worker)`), or ordering the gate reopen so it only fires when the registry actually accepts the start (e.g. inside the closure passed to `start_thread`, or only after a non-rejecting outcome). The same hazard exists structurally for the identity-sync and platform-address coordinators if they ever gain a clearing latch.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-628: Swift clear flow wipes SwiftData rows after Rust reports the shielded store was left intact
`platform_wallet_manager_shielded_clear` maps `PlatformWalletError::ShieldedShutdownIncomplete` to `ErrorShutdownIncomplete` and returns BEFORE touching the Rust shielded store (rs-platform-wallet-ffi/src/shielded_sync.rs:446-461). The Swift error contract states this explicitly: `PlatformWalletResult.swift:188-193` says on `shutdownIncomplete` "the host ... must NOT commit its own persistence wipe — the Rust store was left intact so it can be retried once the pass settles." But `ShieldedService.clearLocalState` catches the thrown error, logs it via `SDKLogger.error`, and falls through into `modelContext.delete(model: PersistentShieldedNote/OutgoingNote/SyncState/Activity)` at lines 624-627. If Rust refused the clear because a coordinator didn't drain in time, SwiftData rows are deleted while the Rust commitment tree remains populated — the exact cross-language desync the new FFI result code was added to prevent. The legacy `// Best-effort — failure logs but doesn't abort the wipe` comment at lines 606-607 predates this contract and is now wrong. The catch must distinguish `.shutdownIncomplete` from other errors and return early without wiping host rows.
In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: ClearingGuard uses set-membership semantics — nested or overlapping guards release the latch on the first drop
`hold_clearing` inserts the key into a `BTreeSet<K>` and `ClearingGuard::drop` removes it unconditionally. With two live `ClearingGuard`s for the same key (nested call by future code, OR two concurrent FFI `platform_wallet_manager_shielded_clear` invocations on the same handle — `HandleStorage::with_item` only holds a read lock), the first guard to drop removes the set entry and the surviving guard's protection lapses silently. Today only `clear_shielded_inner` takes the latch and the test `hold_clearing_nested_guards_share_the_latch` (registry.rs:1857) documents the quirk via comment — but the public `pub fn hold_clearing` signature gives no hint to callers, and a concurrent host-thread clear is reachable through the FFI surface. Convert the backing store to `BTreeMap<K, NonZeroUsize>` so guards refcount/compose, or serialize `clear_shielded` per key at the manager boundary so re-entrant clears can't overlap.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta (97e373b) only adds nonisolated(unsafe) to two Swift handler properties for Swift 6 strict-concurrency compliance — that change is narrow and sound. All five prior findings remain unaddressed at HEAD, including two blocking issues: (1) spawn_periodic_loop unconditionally lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start, opening a window where a direct sync_now/sync_wallet can persist into the store being cleared, and (2) Swift ShieldedService.clearLocalState wipes SwiftData rows even when Rust returns ErrorShutdownIncomplete — the comment now explicitly codifies 'best-effort, failure logs but doesn't abort the wipe', which is the opposite of the documented FFI contract.
🔴 2 blocking | 🟡 1 suggestion(s)
1 additional finding(s) omitted (not in diff).
2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:131-161: [prior-1, carried forward] `spawn_periodic_loop` lowers the shared quiescing gate before the per-key clearing latch can refuse the (re)start
`spawn_periodic_loop` calls `self.reopen_quiescing_gate()` at line 137 — an unconditional `quiescing.store(false, Release)` on the `Arc<AtomicBool>` — BEFORE handing off to `registry.start_thread(...)` at line 142, where the per-key `hold_clearing(ShieldedSync)` latch refuses the start. During `clear_shielded_inner`, the clear flow continuously holds both the registry clearing latch and the quiescing gate (via `hold_quiescing_gate`, which is a non-restoring RAII guard). Race: (T1) clear holds latch + gate=true; (T2) host calls `shielded_sync_start()` → `spawn_periodic_loop` → `reopen_quiescing_gate` writes `false` → `start_thread` is refused by the latch, but the gate is already down; (T3) a direct `sync_now`/`sync_wallet` enters `begin_pass`, observes `quiescing == false`, takes the `is_syncing` slot, and runs to completion — persisting into the shielded store inside the clear-then-wipe window. The `quiesce_under_held_gate` debug_assert at line 219-222 catches the inconsistency only in debug builds; release builds proceed silently. Fix: only reopen the gate once `start_thread` confirms acceptance — either consult `lock_clearing` in `spawn_periodic_loop` and skip the reopen on refusal, move the reopen inside the registry's slot critical section after the latch check, or use the registry's accept signal to gate the store.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:196-203: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
`CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. Today the wallet's `shutdown` routes through `registry.shutdown()` rather than this method, but `ShieldedSyncManager::quiesce` remains `pub` and reachable via the `shielded_sync_arc` accessor — the trap is one accidental call away from re-opening the same lapse Clear was hardened to prevent. The structural fix is to give the gate a refcount/holder-aware re-raise (so concurrent owners compose), or to gate the public `quiesce` against the clearing latch and assert no other holder is active.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:606-628: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
`PlatformWalletFFIResultCode::ErrorShutdownIncomplete` is the explicit signal that the Rust shielded store was left intact: the FFI doc in `packages/rs-platform-wallet-ffi/src/error.rs` states 'the manager is NOT torn down and the store was left intact (Clear aborted before touching it). The host may retry the clear, and must not commit its own persistence wipe — doing so would desync the host's rows from the still-populated shared tree.' The current implementation at lines 608-616 catches the thrown error, logs it via `SDKLogger.error`, then falls through to step 2 (lines 623-628), unconditionally deleting `PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, and `PersistentShieldedActivity`. The comment at line 606-607 even codifies this: 'Best-effort — failure logs but doesn't abort the wipe.' That is the exact split-brain the FFI contract was designed to prevent: when Rust correctly preserves notes (orphan-alive backstop, drain timeout, non-clean drain status), Swift discards its mirror, and the next `bindShielded` + sync repopulates host rows from the surviving Rust state — the user-visible 'Clear' becomes a no-op on the Rust side while host data is lost. Minimum fix: inspect the thrown error code and short-circuit the SwiftData wipe (surfacing the error to the user / scheduler) on `errorShutdownIncomplete`.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried forward from prior review (still valid at bcf9388): prior-2 (blocking — Swift clearLocalState still wipes SwiftData on ErrorShutdownIncomplete), prior-3 (suggestion — public quiesce() drops AtomicFlagGuard on the shared atomic), prior-4 (suggestion — ClearingGuard uses set membership), prior-5 (nitpick — nested-guards test doesn't exercise the documented lapse). prior-1 is partially addressed: the new is_clearing short-circuit + SEC-002 regression test close the common race, but the check and the gate-write are still two distinct critical sections, leaving a narrow residual TOCTOU window — carried forward as a suggestion. No new defects identified in the latest delta (rs-platform-wallet-storage / secrets / keyless rehydration look careful: fail-closed topology check, transactional rollback, per-row skip taxonomy, structured FFI outcome).
🔴 1 blocking | 🟡 3 suggestion(s) | 💬 1 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
The Rust FFI contract for `ErrorShutdownIncomplete` on `platform_wallet_manager_shielded_clear` (rs-platform-wallet-ffi/src/error.rs around the `ShieldedShutdownIncomplete → ErrorShutdownIncomplete` mapping, plus shielded_sync.rs:454) is explicit: when the manager surfaces this code, the Rust shielded store was left intact because Clear aborted before touching it (drain non-clean / orphan-alive backstop). The host must retry the clear and must NOT commit its own persistence wipe; doing so desyncs host rows from the still-populated shared tree. ShieldedService.swift:608-616 catches the throw, logs via `SDKLogger.error`, then falls through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the exact split-brain the FFI contract was designed to prevent. The Swift surface already distinguishes this code: `PlatformWalletResultCode.errorShutdownIncomplete` exists (PlatformWalletResult.swift:47) and decodes into `PlatformWalletError.shutdownIncomplete(detail)` (line 240), so the host can branch on it. Minimum fix: on the typed `shutdownIncomplete` case (or any thrown error from `clearShielded()`, since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user instead of proceeding.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:137-148: [prior-1, partial fix landed] Residual TOCTOU between `is_clearing` check and `reopen_quiescing_gate`
The new `is_clearing(self.worker)` short-circuit at line 145 closes the common race (clear is already holding the latch when spawn arrives), and the SEC-002 regression test in mod.rs:1071-1108 covers that ordering. But the check (line 145) and the gate-lower (line 148) are not done under a single critical section: `is_clearing` takes `lock_clearing`, returns the boolean, releases the lock; then `reopen_quiescing_gate` writes `false` to the shared atomic. If `clear_shielded_inner` acquires the latch AND raises the quiescing gate between those two points, the reopen at 148 lowers a gate Clear believes it holds continuously across drain → liveness → wipe. In that window a direct `sync_now`/`sync_wallet` going through `begin_pass` (lines 306-341) observes `quiescing == false`, claims the `is_syncing` slot, runs the pass to completion, and persists into the store about to be wiped. `quiesce_under_held_gate`'s `debug_assert!` (line 230) catches the inconsistency only in debug builds; release silently proceeds with the drain. The window is microseconds-narrow, but the design invariant is continuous gate-hold. Structural fix: have the registry expose an atomic observe-and-act under one `lock_clearing()` scope (e.g. `with_clearing_check`) so the gate write is sequenced with the latch read, or move the gate write inside the registry slot critical section alongside the latch check.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:207-214: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
`CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. `ShieldedSyncManager::quiesce` remains `pub` and reachable through the manager's shielded accessor — one accidental call away from re-opening the same lapse Clear was hardened to prevent. Structural fix: refcount/owner-aware re-raise (so concurrent owners compose), or gate the public `quiesce` against the clearing latch and assert no other holder is active before its guard drop is allowed to fire.
In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: [prior-4, carried forward] `ClearingGuard` uses set-membership — overlapping guards release the latch on the first drop
`hold_clearing` does `lock_clearing().insert(key)` into a `BTreeSet<K>`, and `ClearingGuard::Drop` (lines 1099-1103) unconditionally removes the key. With two live guards for the same key (a nested or overlapping clear from a future caller, a test, or an accidental shielded re-clear before the first guard drops), the first drop removes the key while the second guard is still alive — silently lapsing the latch protection the still-live guard is supposed to provide. The new `is_clearing()` check (line 620) inherits the same accounting weakness because it reads the same set membership. The test at lines 1859-1883 acknowledges this in its comment but does not fix the type-level mismatch between the RAII guard's compositional appearance and its set-based semantics. Minimum fix: refcount per key (`BTreeMap<K, NonZeroUsize>`), or add a `debug_assert!` that the insert was novel so accidental nesting fails loudly in debug.
Note: GitHub refused to serve this PR's diff via gh pr diff because it exceeds the 20,000-line limit, so this exact-SHA review is posted body-only. Existing inline threads are reused rather than duplicated.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta resolves prior-4 and prior-5: ClearingGuard now refcounts via BTreeMap<K, NonZeroUsize> so nested/concurrent guards compose, and the new test hold_clearing_inner_drop_does_not_lapse_outer_protection is non-vacuous and asserts the lapse window. prior-2 (Swift ShieldedService wipe-on-error) remains unchanged and blocking. prior-3 is partially addressed by the new is_clearing bail in public quiesce(), but the same check-then-act TOCTOU pattern from prior-1 now exists at both spawn_periodic_loop (lines 145-148) and quiesce (lines 217-224); collapsed into a single structural finding.
🔴 1 blocking | 🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the FFI contract
Unchanged at 6b655aab. The Rust FFI contract for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain/orphan-alive backstop). The host MUST NOT commit its own persistence wipe in that case, or SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is designed to prevent.
Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it. Minimum fix: on any thrown error from `clearShielded()` (since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1 + residual prior-3] Check-then-act TOCTOU around the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce` call sites
Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.
1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).
2. `quiesce` (217-224): the new fix to bail when `is_clearing` is true closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose `Drop` (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.
Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.
Note: Posted body-only because normal inline poster failed: command failed 1: python3 scripts/review_poster.py dashpay/platform 3954 6b655aa --dry-run.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried-forward prior findings dominate this incremental review. prior-2 (Swift host wipes SwiftData rows on a failed Rust shielded clear) is STILL VALID and remains blocking — the catch at ShieldedService.swift:611-615 still logs-and-falls-through into the SwiftData delete block, contradicting the Rust FFI's fail-closed ErrorShutdownIncomplete contract. prior-1+prior-3 (TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop:145-148 and quiesce:217-224) is STILL VALID as a suggestion. The latest delta (e46249f) only adds a one-shot panic=abort startup warn on with_reap_backstop, doc back-references on AtomicFlagGuard/EpilogueGuard, and two cfg-pinned tests — verified correct, no new defects.
🔴 1 blocking | 🟡 1 suggestion(s)
Findings not posted inline (2)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract — Unchanged at e46249f. The Rust FFI forplatform_wallet_manager_shielded_clearsurfacesErrorShutdownIncompletespecifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe... - [SUGGESTION]
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU betweenis_clearingand the sharedquiescingatomic at bothspawn_periodic_loopandquiesce— Unchanged at e46249f. Same structural pattern at two call sites — neither performs theis_clearingcheck atomically with the subsequent write to the sharedquiescingatomic. 1.spawn_periodic_loop(145-148):is_clearing(self.worker)at 145 takeslock_clearing, returns the boolean, rele...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract
Unchanged at e46249fe. The Rust FFI for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe in that case, or the SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is engineered to prevent.
Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant for a fail-closed FFI. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it.
Minimum fix: on any thrown error from `clearShielded()`, short-circuit the SwiftData wipe, set `lastError` to a retry-actionable message, and return so the user can retry Clear once Rust quiesces.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between `is_clearing` and the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce`
Unchanged at e46249fe. Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.
1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).
2. `quiesce` (217-224): the new `is_clearing` early-return closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose Drop (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.
Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Both prior findings are FIXED at f424caf: the Swift fail-closed path now early-returns before the SwiftData wipe when clearShielded() throws (ShieldedService.swift:618), and the is_clearing↔quiescing TOCTOU is closed by switching to a refcounted Arc with composable RefcountedFlagGuard. One new blocking issue remains: the drain_hook's unconditional fetch_add is paired with a fetch_sub that only fires when registry.quiesce() returns a non-NotRunning status AND when cancel_join_and_drain runs to completion. Both agents independently flag this asymmetry; codex via the production-reachable timeout cancellation path that clear_shielded_inner wraps around quiesce_under_held_gate, claude via the registry's Superseded / inner-NotRunning return sites. Either path leaks a +1 into quiescing, leaving the shielded coordinator permanently gated so every future begin_pass bails.
🔴 1 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator —drain_hookunconditionally doesquiescing.fetch_add(+1)(line 128) and is paired with a single conditionalfetch_sub(-1)here, gated onstatus != WorkerStatus::NotRunning. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning sta...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator
`drain_hook` unconditionally does `quiescing.fetch_add(+1)` (line 128) and is paired with a single conditional `fetch_sub(-1)` here, gated on `status != WorkerStatus::NotRunning`. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning status, and the wallet-side future always runs to completion past line 285." Neither half of that invariant holds:
1. **Cancellation / timeout path (production-reachable).** `clear_shielded_inner` (manager/mod.rs:412-419) wraps `quiesce_under_held_gate()` in `tokio::time::timeout`. A wedged shielded background pass can keep the inner future suspended past line 702 of `registry.quiesce` (drain hook has already been awaited and `fetch_add(+1)` has run) while it polls the join loop. When the outer timeout elapses, the future is dropped, so the matching `fetch_sub` at line 286 never runs. The `_clearing_gate` in `clear_shielded_inner` then drops, releasing only its own contribution. Net result: `quiescing` stays at +1 for the lifetime of this `CoordinatorLifecycle`, so every subsequent `begin_pass` bails on the `quiescing.load() > 0` check at line 378. The shielded coordinator is silently and permanently quiesced — manual `sync_now`, `sync_wallet`, and the background loop all stop doing work — even though the host application has handled the timeout and moved on.
2. **Superseded / inner-NotRunning return paths inside `registry.quiesce`.** The early exit at registry.rs:693 (cancel and handle both absent → drain hook never fires) is safe under the current condition. The two NotRunning returns *inside* the join loop are not: `Step::Superseded` (registry.rs:760/777) and the inner `Step::NotRunning` (registry.rs:763/777) both happen *after* the hook has already fired at line 702, but return `WorkerStatus::NotRunning`, so the conditional `fetch_sub` is skipped — same leak. Today these branches are unreachable in the wallet because the `closing` / `hold_clearing` latches prevent same-key restart racing `quiesce`, but the comment at lines 277-284 documents an invariant stricter than what `registry.quiesce` actually guarantees. The PR description explicitly anticipates rs-dapi adopting `ThreadRegistry`; any future consumer that calls `quiesce()` without an external restart-prevention mutex inherits a silent gate leak.
The fundamental issue is the asymmetric ownership of the refcount: the registry's `drain_hook` raises the gate, but the wallet has to release it from outside the registry, with no signal for "did the hook actually fire on this call." The cleanest fixes are (a) remove the `drain_hook` contribution to `quiescing` entirely — the public `quiesce()` already raises the gate manually before calling `cancel_join_and_drain` (line 236), and `quiesce_under_held_gate` requires the caller to hold one, so the hook's raise is redundant; or (b) restructure so the hook's contribution is held by a cancellation-safe guard the registry hands back to the caller via `quiesce`'s return value, so Drop on any path (including future cancellation) releases the hook's ref. Either way, the conditional decrement here is unsafe and the comment overstates the registry's contract.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta correctly resolves prior-1 by removing the registry-side drain_hook (worker_config now returns drain: None) and relying on the lifecycle's cancellation-safe RefcountedFlagGuard. The new regression test (quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount) is non-vacuous. New issue introduced by the same delta: PlatformWalletManager::shutdown() goes through registry.shutdown() -> registry.quiesce(key), never through CoordinatorLifecycle::quiesce(), so the per-coordinator quiescing gate is no longer raised at all during manager shutdown. The pre-fix drain hook used to provide that barrier; both reviewers independently flag its loss. The mod.rs:455 docstring is now stale relative to the implementation, and the prior race window where a concurrent host-issued sync_now/sync_wallet can slip past begin_pass during destroy is reopened.
🔴 1 blocking
Carried-forward prior findings
- Prior f424caf blocker is FIXED: FIXED at commit 36aa3a3 via the recommended fix (a) from the prior review:
CoordinatorLifecycle::worker_config()now returnsdrain: None(coordinator_lifecycle.rs:111-117), so no registry-sidefetch_addis ever paired with a wallet-side conditionalfetch_sub. The gate is now raised exclusively by the lifecycle's ownRefcountedFlagGuard—quiesce()'s local guard at line 221 and the Clear flow'shold_quiescing_gate— both of whoseDropis cancellation-safe (runs on future-drop unwind). New non-vacuous regression testquiesce_cancellation_during_drain_does_not_leak_quiescing_refcount(lines 613-711) aborts aquiesce()future mid-drain while a background pass is parked instd::thread::sleepand asserts the refcount returns to 0 — it would time out against the prior drain-hook design. Both the cancellation/timeout leak path and the Superseded/inner-NotRunning paths from the prior finding are eliminated by removing the asymmetric ownership entirely. (Note: the removal also eliminates the gate during PlatformWalletManager::shutdown() — see the new blocking finding above.)
New findings in the latest delta
In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
Settingworker_config().drain = Nonecorrectly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by whichPlatformWalletManager::shutdown()raised the per-coordinatorquiescinggate.shutdown()(manager/mod.rs:487-489) callsself.registry.shutdown().await, which iterates workers and callsself.quiesce(key)on the registry directly (registry.rs:834-866). It never goes throughCoordinatorLifecycle::quiesce()(the only path that now raises the gate via the localRefcountedFlagGuardat line 221) orquiesce_under_held_gate()(the Clear flow's path). Withdrain: None,registry.quiesceruns no hook and the gate stays at 0 throughout the entire teardown.
Consequences:
-
Production race during FFI
destroy. The FFI bridge invokesPlatformWalletManager::shutdown(), which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issuedsync_now/sync_walleton another thread can passbegin_pass'squiescing > 0check (line 357), claimis_syncing, run its body after the registry has cancelled/joined the background loop, and callpersister.store(...)or fire a host completion callback afterdestroyreturned and the host freed the callback context. The pre-fix drain hook'sfetch_addran insideregistry.quiesceand held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place forlifecycle.quiesce()direct callers and the shielded Clear flow. -
Stale documented invariant. The mod.rs:455 docstring on
shutdown()still claims "Each worker's drain raises itsquiescinggate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.
Fix options: (a) have PlatformWalletManager::shutdown() route per-coordinator teardown through each coordinator's lifecycle.quiesce() instead of (or in addition to) registry.shutdown(), so the lifecycle-owned RefcountedFlagGuard is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the RefcountedFlagGuard to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
Setting `worker_config().drain = None` correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which `PlatformWalletManager::shutdown()` raised the per-coordinator `quiescing` gate. `shutdown()` (manager/mod.rs:487-489) calls `self.registry.shutdown().await`, which iterates workers and calls `self.quiesce(key)` on the registry directly (registry.rs:834-866). It never goes through `CoordinatorLifecycle::quiesce()` (the only path that now raises the gate via the local `RefcountedFlagGuard` at line 221) or `quiesce_under_held_gate()` (the Clear flow's path). With `drain: None`, `registry.quiesce` runs no hook and the gate stays at 0 throughout the entire teardown.
Consequences:
1. Production race during FFI `destroy`. The FFI bridge invokes `PlatformWalletManager::shutdown()`, which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued `sync_now` / `sync_wallet` on another thread can pass `begin_pass`'s `quiescing > 0` check (line 357), claim `is_syncing`, run its body after the registry has cancelled/joined the background loop, and call `persister.store(...)` or fire a host completion callback after `destroy` returned and the host freed the callback context. The pre-fix drain hook's `fetch_add` ran inside `registry.quiesce` and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for `lifecycle.quiesce()` direct callers and the shielded Clear flow.
2. Stale documented invariant. The mod.rs:455 docstring on `shutdown()` still claims "Each worker's drain raises its `quiescing` gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.
Fix options: (a) have `PlatformWalletManager::shutdown()` route per-coordinator teardown through each coordinator's `lifecycle.quiesce()` instead of (or in addition to) `registry.shutdown()`, so the lifecycle-owned `RefcountedFlagGuard` is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the `RefcountedFlagGuard` to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.
Note: Posted as a body-only review because GitHub refuses gh pr diff for this PR with PullRequest.diff too_large (>20,000 lines), so the normal inline mapper could not run.
…builds PR dashpay#3954 review thread dashpay#5 (comment 3481707091) asked for runtime visibility when the registry's reap path runs under `panic = "abort"`: an EpilogueGuard or AtomicFlagGuard panic during teardown aborts the process instead of unwinding, so Drop never releases the orphan-liveness gate and the registry's join-and-orphan-liveness invariant can stay held forever in the worst case. Stable Rust has NO runtime API to query the active panic strategy — only `#[cfg(panic = "...")]` at compile time. iOS release builds intentionally choose `panic = "abort"`, so a hard gate (compile-time error or refusal to construct) is wrong. A compile-time-gated, one-shot `tracing::warn!` from `ThreadRegistry::with_reap_backstop` is a strict improvement over the doc-caveat-only state today: operators get a single audit-trail line at startup pointing them at the registry.rs / atomic.rs caveats. The warn is gated by a process-wide `std::sync::Once` so repeated registry construction (tests, restart paths, multi-instance hosts) fires it once. The static is hoisted to module scope (also `#[cfg(panic = "abort")]`) so the in-crate regression test can probe `Once::is_completed()` directly without subscribing to tracing from a `#[test]`. Doc caveats on `EpilogueGuard` (registry.rs) and `AtomicFlagGuard` (atomic.rs) are extended with a one-liner pointing operators at the new constructor-emitted warn. Tests: - `with_reap_backstop_emits_panic_abort_warn_under_abort_builds` (only compiled under abort) — proves the Once trips on first call and stays one-shot across a second construction. Build-cfg-pinned: if the cfg block in `with_reap_backstop` is ever deleted, this test disappears with it and CI loses the signal. - `with_reap_backstop_no_warn_under_unwind` (only compiled under unwind) — sentinel proving the no-op cfg branch compiles and `with_reap_backstop` keeps behaving like a plain constructor on the default profile. Verification: - `cargo test -p dash-async --all-targets` — 36 passed (was 35). - `cargo clippy -p dash-async --all-targets -- -D warnings` — clean. - `cargo fmt --all -- --check` — clean. - `cargo check -p platform-wallet -p platform-wallet-ffi --all-targets` — clean. - `RUSTFLAGS="-C panic=abort" cargo build -p dash-async --lib` — clean. (`--tests` requires nightly `-Zpanic_abort_tests`; lib path covers the new cfg branch.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
|
Blocking finding addressed in On the asserted UAF (refuted at HEAD, but the point stands): the concurrent direct But the mechanical observations were correct and worth fixing: with
|
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
The prior shutdown-gate blocker is fixed in the current head because manager shutdown now routes coordinator teardown through each lifecycle quiesce(). However, the new implementation releases those lifecycle gates before the registry shutdown latch is set, leaving a narrow but real post-quiesce race where direct syncs or restarts can enter after the coordinator barrier but before the rest of shutdown is closed.
🔴 1 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed —quiesce_coordinators().awaitraises each lifecycle gate only for the duration of that coordinator'squiesce(). As soon as line 516 returns, those guards have dropped, butThreadRegistry::shutdown()has not yet set the registryclosinglatch until line 523 begins polling the registry futur...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed
`quiesce_coordinators().await` raises each lifecycle gate only for the duration of that coordinator's `quiesce()`. As soon as line 516 returns, those guards have dropped, but `ThreadRegistry::shutdown()` has not yet set the registry `closing` latch until line 523 begins polling the registry future. A concurrent Rust caller holding the manager or coordinator `Arc` can enter `sync_now()` / shielded `sync_wallet()` in that gap, see `quiescing == 0`, claim `is_syncing`, and continue persisting or firing callbacks while shutdown tears down the event adapter and returns. A concurrent `start()` can also install a fresh coordinator before `closing` is set; the registry may then correctly report that new worker's shutdown status, but the merge at lines 524-525 overwrites it with the stale status captured before the restart. Hold the coordinator teardown gates, or set an equivalent manager/registry closing barrier, across the entire registry shutdown phase before reporting that shutdown is clean.
|
Round-2 blocking finding fixed in Confirmed: round-1 held each coordinator's Fix: Restart/merge: also fixed. The per-worker merge now overwrites a coordinator's slot only when the registry classified it Test: new
|
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried-forward prior findings: prior-1 is FIXED. The current head raises every coordinator gate before coordinator quiesce and holds those guards through registry shutdown and status merging, and I found no new in-scope latest-delta findings. Targeted test execution could not complete because the build needed to download Tenderdash proto sources and the network escalation retry was rejected.
Note: Posted body-only because GitHub refused the PR diff (PullRequest.diff too_large, over 20,000 lines), so inline mapping could not run. The verified output contained no findings.
… join [#3954, independent on v3.1-dev] Introduce a shared `ThreadRegistry<WalletWorker>` lifecycle engine in `dash-async` (weight-ordered graceful shutdown: lower weights drain first, equal weights concurrently; reap-or-park orphan handling; cancellation-safe `RefcountedFlagGuard`). Migrate the three periodic sync coordinators (identity / platform-address / shielded) onto a shared `CoordinatorLifecycle` that drives them through the registry and gates passes via an `is_syncing` / `quiescing` handshake. `PlatformWalletManager::shutdown()` now holds every coordinator's quiescing gate across the whole teardown, quiesces the weight-0 coordinators concurrently, then drains the weight-10 wallet-event adapter and any parked orphans, returning a per-worker `ShutdownReport`. `clear_shielded` refuses (leaving the commitment-tree store intact) when the in-flight pass does not drain cleanly, surfaced as `ShieldedShutdownIncomplete` / `ErrorShutdownIncomplete = 19`. FFI `destroy` reports the incomplete-shutdown code so the Swift host defers freeing its callback context. Reconstructed independently on a clean v3.1-dev base: only the graceful- shutdown contribution is included; the drifted #3692 rehydration/storage work merged into the source branch is excluded. <sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
17e1c07 to
a89e0f9
Compare
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
No carried-forward prior findings from the 17e1c07 review; the earlier shutdown-gate issue remains fixed. I verified one new in-scope latest-delta regression: the SQLite same-path open guard was removed while buffers remain per-persister. The destroy finding was dropped because the PR explicitly documents destroy as terminal and the Swift wrapper implements that contract by retaining callback contexts on incomplete shutdown.
🔴 2 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers — The latest delta removed the process-wide canonical-path registry and theAlreadyOpenerror, soopen()now returns a freshSqlitePersisterwith its ownBuffer::new()for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot co...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet-storage/src/sqlite/persister.rs`:
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
|
Re: thepastaclaw's 2026-06-29 re-review ( This is a storage-crate concern that belongs to #3968, not this PR. #3954 ( The concern itself is real and already addressed in #3968: the process-wide canonical-path registry + Dispositioning as out of scope for #3954; tracked and fixed in #3968. 🤖 Co-authored by Claudius the Magnificent AI Agent |
Why this PR exists
ThreadRegistry(rs-dash-async) and aCoordinatorLifecyclethat own worker registration, weighted graceful-shutdown join, and quiescing gates; the identity / platform-address / shielded sync managers are re-plumbed onto it.v3.1-dev— references zero feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692 symbols; reworks the v3.1-dev sync managers directly. Combines with feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692/feat(platform-wallet): persistence readers + seedless load() wiring (split from #3692) #3968 on thedash-evo-toolintegration branch (where it also picks up a shutdown cancel-drain enhancement that needs feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692's changeset reshape to exist).What was done
rs-dash-async:ThreadRegistry, atomic flag guards,ShutdownReport/WorkerStatus.rs-platform-wallet:CoordinatorLifecycle, registry-driven event-adapter task,shutdown()with weighted join + detached-orphan reporting;ShieldedShutdownIncompleteerror + FFI code 19.Testing
cargo fmt --all --check;cargo clippy --workspace --all-targets -- -D warnings;cargo test -p platform-wallet -p dash-async(dash-async 38; platform-wallet 322 with--all-features; 47 shutdown/registry tests incl.shutdown_joins_all_workers,shutdown_waits_for_in_flight_pass_to_drain,restart_in_place). The 4 Swift files need a separate iOS build check.Breaking changes
!— adds an FFI error code (19) and a shutdown error variant.Checklist
🤖 Co-authored by Claudius the Magnificent AI Agent