Apply lengthy-operation timeout to state-dict checkpoint save/export #543

Merged: jlamypoirier merged 1 commit into main from jlp_checkpoint_export_timeout on Jun 17, 2026

@jlamypoirier

Collaborator

Authored by Claude Opus 4.8 (via Claude Code), reviewed by @jlamypoirier.

Problem

StateDictCheckpointHandler.save() (used for fast_llm- and HuggingFace-format checkpoint export via Trainer and the offline convert) runs its per-parameter weight-gather collectives under the default DistributedConfig.timeout (60 s) instead of the lengthy-operation training.timeout (default 3600 s).

The load path does not have this problem: load() wraps its collectives in SafeLoad(timeout=config.timeout). On the save side, _save_checkpoint already passes training.timeout to the surrounding barriers and to get_save_config(...), but nothing applied that timeout to the gather collectives inside save().

Saving a state-dict checkpoint is rank-0-serialized (rank 0 writes the model_*.safetensors files while the other ranks wait, with gathers interleaved between files). On a large model and/or slow/networked storage, a gather on the waiting ranks can exceed 60 s, at which point the NCCL collective watchdog fires and training aborts with:
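As a rough back-of-envelope check of the failure mode above, the sketch below estimates how long a waiting rank could block in a gather while rank 0 writes one file. The file size and storage bandwidth are hypothetical illustration values, not measurements from this PR, and the model assumes the wait is roughly equal to one file's write time.

```python
# Back-of-envelope estimate; all sizes and bandwidths are hypothetical.
GIB = 1024**3
MIB = 1024**2


def write_seconds(file_bytes: int, bytes_per_second: float) -> float:
    """Time for rank 0 to write one file at a sustained bandwidth."""
    return file_bytes / bytes_per_second


file_bytes = 8 * GIB   # hypothetical size of one model_*.safetensors file
bandwidth = 100 * MIB  # hypothetical sustained networked-storage bandwidth

# Assumption: a waiting rank blocks in its next gather for about this long.
wait = write_seconds(file_bytes, bandwidth)

default_timeout = 60.0    # DistributedConfig.timeout default, per the PR
lengthy_timeout = 3600.0  # training.timeout default, per the PR

exceeds_default = wait > default_timeout
exceeds_lengthy = wait > lengthy_timeout
print(f"estimated wait: {wait:.2f} s")
```

Under these assumed numbers the wait is past the 60 s default but far below the 3600 s lengthy-operation timeout.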

RuntimeError: Desync detected for barrier export <step> exit (...)

The distributed (distributed-format) checkpoint is unaffected because every rank writes its own shard in parallel, so no single collective stalls past 60 s.

Observed on a multi-node run (16×H100, 2 nodes) exporting a ~7B model to networked storage: the step-200 distributed checkpoint completed fine, while the fast_llm export at step 400 aborted ~60 s into the save.

Fix

Wrap the gather loop in save() with set_timeout(world_group, config.timeout), mirroring the load path. No config change is needed: training.timeout already defaults to 3600 s and flows into config.timeout.
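The shape of the fix can be sketched as a context manager that raises a group's timeout for the duration of the gather loop and restores it afterwards. This is a minimal illustration, not the actual Fast-LLM code: FakeGroup, its gather method, and this set_timeout body are hypothetical stand-ins, and the real set_timeout may behave differently.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FakeGroup:
    """Hypothetical stand-in for a process group; it only tracks a timeout."""

    timeout: float = 60.0  # seconds, mirroring the default quoted in the PR

    def gather(self, name: str) -> str:
        # A real collective would abort if it blocked longer than self.timeout.
        return f"{name} gathered under {self.timeout:.0f}s timeout"


@contextmanager
def set_timeout(group: FakeGroup, timeout: Optional[float]) -> Iterator[FakeGroup]:
    """Temporarily override the group's timeout (hypothetical implementation)."""
    previous = group.timeout
    if timeout is not None:
        group.timeout = timeout
    try:
        yield group
    finally:
        # Restore the short default even if a gather raises.
        group.timeout = previous


def save(group: FakeGroup, parameter_names: List[str], timeout: float) -> List[str]:
    """Run every per-parameter gather under the lengthy-operation timeout."""
    log = []
    with set_timeout(group, timeout):
        for name in parameter_names:
            log.append(group.gather(name))
    return log


world_group = FakeGroup()
log = save(world_group, ["layer_0.weight", "layer_1.weight"], timeout=3600.0)
```

Scoping the override to the loop keeps the short default in force for every other collective, so an unrelated hang is still detected quickly.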

Notes / scope

  • Verified by re-running the same multi-node export with the fix: the export at the previously-failing step completes and training continues.
  • iter_tensors() / iter_checkpoint() (the streaming weight-broadcast consumer) shares the same gather pattern but has different control flow (a generator driven by an external consumer) and its own timeout handling; left out of scope here. Worth a follow-up look.
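One plausible reason the generator case needs separate thought (an assumption, not something stated in this PR): a context manager opened inside a generator stays active while the generator is suspended, so the raised timeout would also cover whatever the consumer does between items. The sketch below uses hypothetical stand-ins (FakeGroup, set_timeout, iter_tensors_sketch) to show that Python behavior.

```python
from contextlib import contextmanager


class FakeGroup:
    """Hypothetical stand-in for a process group; it only tracks a timeout."""

    def __init__(self) -> None:
        self.timeout = 60.0


@contextmanager
def set_timeout(group, timeout):
    """Hypothetical override helper: raise the timeout, restore it on exit."""
    previous = group.timeout
    group.timeout = timeout
    try:
        yield
    finally:
        group.timeout = previous


def iter_tensors_sketch(group, names, timeout):
    """Generator that yields items from inside the timeout override."""
    with set_timeout(group, timeout):
        for name in names:
            # Control returns to the consumer here, with the override active.
            yield name


group = FakeGroup()
seen = []
for name in iter_tensors_sketch(group, ["a", "b"], 3600.0):
    # Consumer code runs while the generator is suspended inside the `with`.
    seen.append(group.timeout)

# The override ends only once the generator is exhausted or closed.
after = group.timeout
```

In save() the loop body is fully under the handler's control, which is why the same wrapper is straightforward there.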

StateDictCheckpointHandler.save() ran its weight-gather collectives
under the default DistributedConfig timeout (60s) rather than the
lengthy-operation training timeout, unlike load() which wraps its
collectives in SafeLoad(timeout=config.timeout). Saving is
rank-0-serialized, so on a large model or slow storage a gather on the
waiting ranks can exceed 60s and the NCCL watchdog aborts with a
barrier desync. Wrap the save gather loop in set_timeout(world_group,
config.timeout) to match the load path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jlamypoirier jlamypoirier merged commit bdfbb5a into main Jun 17, 2026
5 checks passed
@jlamypoirier jlamypoirier deleted the jlp_checkpoint_export_timeout branch June 17, 2026 17:24