Clarifying workflow ID consistency and HA by lukeknep · Pull Request #4794 · temporalio/documentation

lukeknep · 2026-06-29T22:51:55Z

What does this PR do?

For High Availability and Conflict Resolution, Workflow ID uniqueness behaves slightly differently. This clarifies it.

Notes to reviewers

Internal context: https://temporaltechnologies.slack.com/archives/C071L1W22UE/p1781556361935309

┆Attachments: EDU-6621 Clarifying workflow ID consistency and HA

vercel · 2026-06-29T22:52:01Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Jul 1, 2026 7:32pm |

github-actions · 2026-06-29T22:52:28Z

📖 Docs PR preview links

Cloud
- High Availability
  - Failovers
Workflow Execution
- Workflow Id and Run Id

brianmacdonald-temporal · 2026-06-30T13:09:15Z

+- **Operations that had not yet replicated at the moment of failover.** A Workflow start, Signal, Update, or other
+  operation that the active region accepted but had not yet replicated falls within the
+  [recovery point](/cloud/rpo-rto). If the original region recovers, these operations are reconciled into the active
+  Namespace and virtually nothing is lost. If the original region is _permanently_ lost — for example, an unrecoverable


I think we might want to be careful about "virtually" here. Seems like the sort of thing a lawyer would pounce on. I get it that we can't guarantee that absolutely nothing is lost, but what falls within the scope of "virtually"?

I think we can remove the "virtually nothing is lost" line and just apply the following link to explain "reconciled into the active Namespace" /cloud/high-availability/failovers#conflict-resolution

So we'd want that sentence to read:
If the original region recovers, those operations are reconciled into the active Namespace according to the Conflict Resolution process.
Is that correct?

brianmacdonald-temporal · 2026-06-30T13:11:42Z

 Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be
 reflected in the replica due to <ToolTipTerm term="replication lag" />, particularly during failovers. In the event of a
-non-graceful failover, replication lag may cause a temporary setback in Workflow progress.
+non-graceful failover, replication lag has two distinct effects, which are important to tell apart:


Why is it important to tell these two effects apart? I get that one is recoverable and the other may or may not be 100% recoverable, but what can the user do about it? Or is the purpose of adding this section to say "there may be a circumstance in which some operations can't be recovered"?

That's the motivating question that prompted the edit, but might not be the right final framing. A customer wanted to know more precise details of what happens to data after a regional outage that does not recover. There are two answers:

1. A Workflow with history in both clusters loses the progress that was recorded within the replication lag and re-drives those Activities.

2. StartWorkflow, SignalWorkflow, and UpdateWorkflow calls that happen within the replication lag are lost. This can create a scenario where Temporal Cloud responds with a 200 to a StartWorkflow, the outage happens within the next minute, and the record that the Workflow was started, signaled, or updated is lost.

I think for this specific part of this specific section, the right framing might be more like "Conflict Resolution can only recover data from a functioning Temporal cluster. If the active cluster completely fails and does not recover, Workflow API calls that fall within the 1 minute RPO may be permanently lost."
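To make the second case concrete, here is a minimal toy simulation of asynchronous replication. It is only a sketch: `Region`, `accept`, and `replicate` are hypothetical names invented for illustration, not Temporal APIs, and the model ignores everything except the ordering of acknowledgement and replication.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Toy region (hypothetical): an append-only log of accepted operations."""
    name: str
    log: list = field(default_factory=list)


def accept(active: Region, backlog: list, op: str) -> str:
    """Commit op in the active region and queue it for async replication."""
    active.log.append(op)
    backlog.append(op)
    return "OK"  # the caller sees success before the replica has the op


def replicate(backlog: list, replica: Region, count: int) -> None:
    """Ship up to `count` queued operations to the replica, oldest first."""
    for _ in range(min(count, len(backlog))):
        replica.log.append(backlog.pop(0))


active = Region("active")
replica = Region("replica")
backlog = []

ops = ("start wf-1", "signal wf-1", "start wf-2")
acks = [accept(active, backlog, op) for op in ops]
replicate(backlog, replica, count=2)  # "start wf-2" is still in flight

# Non-graceful failover: the replica takes over with whatever it has.
# If the original region never recovers, the backlog is never drained.
lost_if_unrecoverable = list(backlog)
print(acks, replica.log, lost_if_unrecoverable)
```

All three calls were acknowledged, but only the first two reached the replica. If the original region comes back, the backlog can still be reconciled; if it does not, the third operation is the part that falls inside the recovery point.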

I suggest we put the scary stuff inside a caution element. So how's this?

Suggested change

non-graceful failover, replication lag has two distinct effects, which are important to tell apart:

In the event of a non-graceful failover, replication lag has two distinct effects:

**A temporary setback in Workflow progress.** Operations that had already replicated remain durable in the replica. Operations still in the replication backlog are reconciled when the original region recovers, so progress is set back temporarily but not lost.

**Operations that had not yet replicated at the moment of failover.** A Workflow start, Signal, Update, or other operation that the active region accepted but had not yet replicated falls within the
[recovery point](/cloud/rpo-rto). If the original region recovers, those operations are reconciled into the active Namespace according to the Conflict Resolution process. If the original region is _permanently_ lost — for example, an unrecoverable cell outage — operations within the recovery point window may be lost entirely. Temporal Cloud targets a recovery point of under one minute for these outages.

:::caution

Conflict Resolution can only recover data from a functioning Temporal Server. If the active server completely fails and does not recover, Workflow API calls that fall within the 1 minute RPO may be permanently lost.

:::

Address reviewer comments on the Conflict resolution section:
- Remove "virtually nothing is lost" wording
- Reference the Conflict Resolution process instead
- Move permanent-loss detail into a :::caution admonition

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lover Expand the bare assertion that conflict resolution preserves Workflow Id uniqueness into a titled subsection that walks the mechanism: steady-state enforcement, divergence during a forced/split-brain failover, and the fork-not-merge reconciliation that leaves a single Open Execution per Workflow Id. Notes that uniqueness is distinct from durability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lukeknep · 2026-07-01T19:32:25Z

+`SignalWorkflowExecution` call that returns success is durably committed in the active region, and replicated
+asynchronously to the replica.
+
+### How Workflow Id uniqueness is preserved after a forced failover


@temporal-nick can you check for correctness?

lukeknep · 2026-07-01T19:32:47Z

 Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be
 reflected in the replica due to <ToolTipTerm term="replication lag" />, particularly during failovers. In the event of a
-non-graceful failover, replication lag may cause a temporary setback in Workflow progress.
+non-graceful failover, replication lag causes temporary setback in Workflow progress. At the moment of non-graceful


@brianmacdonald-temporal you had some comments here. Could you review my rewrite?

I'd still suggest putting the text starting with "Note that" in a Caution box. I think we want to be clear on the circumstances in which data cannot be fully recovered. Other than that, the text itself looks good.

temporal-nick · 2026-07-01T20:41:48Z

+
+- Operations that had already replicated remain durable in the replica.
+- Operations that had not yet replicated (i.e., that are still in the replication backlog) are reconciled when the
+  region recovers. according to the Conflict Resolution process. - Note that Conflict Resolution can only recover data


This is a period . and not a comma ,

temporal-nick · 2026-07-01T22:43:09Z

+3. **Fork instead of merge.** Temporal Cloud does not interleave the divergent histories. Events from the previously
+   active Namespace that arrive after the failover cannot be directly applied, so Temporal Cloud forks the Event History
+   and creates a new branch history, each branch identified by its own Run Id. Its
+   <ToolTipTerm term="conflict resolution" /> process keeps one branch as the Open Execution and supersedes the other,


One "execution", not one branch. The execution on the source side becomes a "zombie" Workflow Execution that is eventually terminated on the source, once the target replicates information about a competing Workflow Execution with the same Workflow Id.
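A rough way to picture this, as a toy model only: the sketch below assumes the new active (target) side wins and the recovered source side marks its competing execution as a zombie, then terminates it. `Execution`, `reconcile`, `terminate_zombies`, and the status strings are hypothetical names for illustration and do not reflect the server's actual data structures or algorithm.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    workflow_id: str
    run_id: str
    status: str = "Open"  # "Open", "Zombie", or "Terminated" in this toy model


def reconcile(source_runs: list, target_runs: list) -> None:
    """Mark source-side executions that compete with an Open target execution."""
    winners = {e.workflow_id: e.run_id for e in target_runs if e.status == "Open"}
    for e in source_runs:
        # Same Workflow Id, different Run Id: the source-side execution loses.
        if e.status == "Open" and winners.get(e.workflow_id, e.run_id) != e.run_id:
            e.status = "Zombie"


def terminate_zombies(runs: list) -> None:
    """Eventually close out every zombie execution."""
    for e in runs:
        if e.status == "Zombie":
            e.status = "Terminated"


def open_runs(workflow_id: str, *sides: list) -> list:
    """Run Ids that are still Open for a Workflow Id across all sides."""
    return [e.run_id for side in sides for e in side
            if e.workflow_id == workflow_id and e.status == "Open"]


# Divergence: each side started its own execution for the same Workflow Id.
source = [Execution("order-42", "run-a")]
target = [Execution("order-42", "run-b")]

reconcile(source, target)
status_after_reconcile = source[0].status
terminate_zombies(source)
print(status_after_reconcile, source[0].status, open_runs("order-42", source, target))
```

The point of the sketch is the end state: whatever happens in between, exactly one Open Execution remains for the Workflow Id.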

yux0 · 2026-07-02T05:16:47Z

-
-After failover, Temporal Cloud creates a new branch history for execution and begins its <ToolTipTerm term="conflict resolution" /> process. The Temporal Service ensures that Event Histories remain valid and are replayable by SDKs post-failover or after conflict resolution.
+The same durability boundary applies to Workflow starts and Signals: a `StartWorkflowExecution` or
+`SignalWorkflowExecution` call that returns success is durably committed in the active region, and replicated


also SignalWithStartWorkflowExecution?

yux0 · 2026-07-02T05:25:03Z

+
+The [Workflow Id uniqueness guarantee](/workflow-execution/workflowid-runid) — at most one Open Workflow Execution per
+Workflow Id — is always enforced within the active Namespace, and conflict resolution preserves it across a failover.
+The guarantee limits how many Executions are _Open_ at the same time, not how many


There are two policies on workflow ID:

- ID reuse policy: whether a Workflow Execution is allowed to spawn with a particular Workflow Id, if that Workflow Id has been used with a previous, and now Closed, Workflow Execution.
- ID conflict policy: determines how to resolve a conflict when spawning a new Workflow Execution with a particular Workflow Id used by an existing Open Workflow Execution.

I think you are trying to refer to the ID conflict policy here. It is a little confusing, because the ID reuse policy governs the Workflow Id based on a previous run.
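To illustrate how the two policies split the decision, here is a simplified sketch. The enums are local stand-ins modeled on the policy names, not imports from a Temporal SDK, and `decide_start` is a hypothetical helper; the exact server behavior for each value is an assumption here.

```python
from enum import Enum
from typing import Optional


class ReusePolicy(Enum):  # stand-in: consulted when the prior execution is Closed
    ALLOW_DUPLICATE = "allow_duplicate"
    ALLOW_DUPLICATE_FAILED_ONLY = "allow_duplicate_failed_only"
    REJECT_DUPLICATE = "reject_duplicate"


class ConflictPolicy(Enum):  # stand-in: consulted when an execution is Open
    FAIL = "fail"
    USE_EXISTING = "use_existing"
    TERMINATE_EXISTING = "terminate_existing"


def decide_start(existing: Optional[str], reuse: ReusePolicy,
                 conflict: ConflictPolicy) -> str:
    """Decide what a start request does for a given Workflow Id.

    `existing` is None (id never used), "Open", "Completed", or "Failed".
    """
    if existing is None:
        return "start"
    if existing == "Open":
        # Only the conflict policy matters while an execution is Open.
        return {
            ConflictPolicy.FAIL: "reject",
            ConflictPolicy.USE_EXISTING: "use existing",
            ConflictPolicy.TERMINATE_EXISTING: "terminate existing, then start",
        }[conflict]
    # The previous execution is Closed: only the reuse policy matters.
    if reuse is ReusePolicy.ALLOW_DUPLICATE:
        return "start"
    if reuse is ReusePolicy.ALLOW_DUPLICATE_FAILED_ONLY and existing == "Failed":
        return "start"
    return "reject"


print(decide_start("Open", ReusePolicy.ALLOW_DUPLICATE, ConflictPolicy.FAIL))
print(decide_start("Completed", ReusePolicy.REJECT_DUPLICATE, ConflictPolicy.USE_EXISTING))
```

In this model the "at most one Open Execution" question is answered entirely by the conflict policy branch, which is why attaching the guarantee to the reuse policy reads as confusing.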

Clarifying workflow ID consistency and HA

317bfd7

lukeknep requested a review from a team as a code owner June 29, 2026 22:51

vercel Bot deployed to Preview June 29, 2026 22:52 View deployment

brianmacdonald-temporal reviewed Jun 30, 2026

sync-by-unito Bot assigned brianmacdonald-temporal Jun 30, 2026

vercel Bot deployed to Preview July 1, 2026 19:00 View deployment

Simplify note on Workflow Id uniqueness

4aa3276

Removed redundant explanation about Workflow Id uniqueness during failover scenarios.

vercel Bot deployed to Preview July 1, 2026 19:05 View deployment

Refine manual failover explanation in documentation

6ff7dfe

Clarify the definition of manual failover and its scenarios.

vercel Bot deployed to Preview July 1, 2026 19:07 View deployment

Update caution note on data loss during failovers

1ca47e4

Clarify the conditions under which data may be permanently lost during a server failure.

vercel Bot deployed to Preview July 1, 2026 19:09 View deployment

Update failover documentation for clarity

111be20

Clarified the conditions under which operations may be lost during failover and emphasized the importance of a functioning Temporal Server for Conflict Resolution.

vercel Bot deployed to Preview July 1, 2026 19:10 View deployment

Update failover section for Workflow execution semantics

81fe9a8

Clarify the durability of Workflow starts and Signals during failovers.

vercel Bot deployed to Preview July 1, 2026 19:12 View deployment

Update conflict resolution section for clarity

25259e0

Clarify effects of non-graceful failover on Workflow progress and replication.

vercel Bot deployed to Preview July 1, 2026 19:15 View deployment

lukeknep and others added 3 commits July 1, 2026 12:27

Trim redundant clause from steady-state uniqueness note

f1785c1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Simplify divergence condition in uniqueness note

d84dd88

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot deployed to Preview July 1, 2026 19:30 View deployment

Remove durability-vs-uniqueness caveat sentence

f58c302

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot deployed to Preview July 1, 2026 19:32 View deployment

lukeknep commented Jul 1, 2026

temporal-nick reviewed Jul 1, 2026

yux0 reviewed Jul 2, 2026
