Clarifying workflow ID consistency and HA #4794

Open: lukeknep wants to merge 12 commits into main from workflowid-ha-clarification

@lukeknep lukeknep commented Jun 29, 2026

Contributor

What does this PR do?

Workflow ID uniqueness behaves slightly differently under High Availability and Conflict Resolution. This PR clarifies that behavior.

Notes to reviewers

Internal context: https://temporaltechnologies.slack.com/archives/C071L1W22UE/p1781556361935309

┆Attachments: EDU-6621 Clarifying workflow ID consistency and HA

@lukeknep lukeknep requested a review from a team as a code owner June 29, 2026 22:51
vercel Bot commented Jun 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Jul 1, 2026 7:32pm |

@github-actions

Contributor

📖 Docs PR preview links

- **Operations that had not yet replicated at the moment of failover.** A Workflow start, Signal, Update, or other
operation that the active region accepted but had not yet replicated falls within the
[recovery point](/cloud/rpo-rto). If the original region recovers, these operations are reconciled into the active
Namespace and virtually nothing is lost. If the original region is _permanently_ lost — for example, an unrecoverable

Contributor

I think we might want to be careful about "virtually" here. Seems like the sort of thing a lawyer would pounce on. I get it that we can't guarantee that absolutely nothing is lost, but what falls within the scope of "virtually"?

Contributor

I think we can remove the "virtually nothing is lost" line and just apply the following link to explain "reconciled into the active Namespace" /cloud/high-availability/failovers#conflict-resolution

Contributor

So we'd want that sentence to read:
If the original region recovers, those operations are reconciled into the active Namespace according to the Conflict Resolution process.
Is that correct?

  Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be
  reflected in the replica due to <ToolTipTerm term="replication lag" />, particularly during failovers. In the event of a
- non-graceful failover, replication lag may cause a temporary setback in Workflow progress.
+ non-graceful failover, replication lag has two distinct effects, which are important to tell apart:

Contributor

Why is it important to tell these two effects apart? I get that one is recoverable and the other may or may not be 100% recoverable, but what can the user do about it? Or is the purpose of adding this section to say "there may be a circumstance in which some operations can't be recovered"?

Contributor

That's the motivating question that prompted the edit, but might not be the right final framing. A customer wanted to know more precise details of what happens to data after a regional outage that does not recover. There are two answers:

  • A workflow with history in both clusters will lose the progress that was recorded in the replication lag and re-drive those activities.
  • StartWorkflow, SignalWorkflow, and UpdateWorkflow that happen within the replication lag will be lost. This can create a scenario where Temporal Cloud responds with a 200 to a StartWorkflow, the outage happens in the next 1 minute, and then we lose the data that the workflow started/signaled/updated.
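
To make the two cases above concrete, here is a minimal Python sketch of asynchronous replication with a backlog. It is a toy model, not Temporal's implementation: `Region`, `accept`, `replicate`, and `failover` are hypothetical names invented for illustration, and real reconciliation goes through Conflict Resolution rather than a simple append.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Toy view of one region's copy of a Namespace (hypothetical, for illustration)."""
    events: list = field(default_factory=list)


def accept(active: Region, backlog: list, op: str) -> str:
    """The active region commits the operation and acknowledges it right away.
    Replication to the replica happens later, from the backlog."""
    active.events.append(op)
    backlog.append(op)
    return "OK"


def replicate(backlog: list, replica: Region, count: int) -> None:
    """Ship the oldest `count` backlog entries to the replica."""
    for _ in range(min(count, len(backlog))):
        replica.events.append(backlog.pop(0))


def failover(backlog: list, replica: Region, original_recovers: bool) -> list:
    """Promote the replica and return the operations that are permanently lost."""
    if original_recovers:
        # The backlog survives on the recovered region, so it can be reconciled.
        replicate(backlog, replica, len(backlog))
        return []
    # The original region is gone, and its unreplicated backlog is gone with it.
    lost = list(backlog)
    backlog.clear()
    return lost


if __name__ == "__main__":
    active, replica, backlog = Region(), Region(), []
    accept(active, backlog, "StartWorkflow wf-1")
    replicate(backlog, replica, 1)                  # the start reaches the replica
    accept(active, backlog, "SignalWorkflow wf-1")  # acknowledged, still in the backlog
    print(failover(backlog, replica, original_recovers=False))
```

In this model the Signal was acknowledged to the caller but never left the backlog, so it is the only operation lost when the original region does not recover. With `original_recovers=True` the same backlog is drained into the replica and nothing is lost.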

Contributor

I think for this specific part of this specific section, the right framing might be more like "Conflict Resolution can only recover data from a functioning Temporal cluster. If the active cluster completely fails and does not recover, Workflow API calls that fall within the 1 minute RPO may be permanently lost."

Contributor

I suggest we put the scary stuff inside a caution element. So how's this?

Suggested change
non-graceful failover, replication lag has two distinct effects, which are important to tell apart:
In the event of a non-graceful failover, replication lag has two distinct effects:
**A temporary setback in Workflow progress.** Operations that had already replicated remain durable in the replica. Operations still in the replication backlog are reconciled when the original region recovers, so progress is set back temporarily but not lost.
**Operations that had not yet replicated at the moment of failover.** A Workflow start, Signal, Update, or other operation that the active region accepted but had not yet replicated falls within the
[recovery point](/cloud/rpo-rto). If the original region recovers, those operations are reconciled into the active Namespace according to the Conflict Resolution process. If the original region is _permanently_ lost — for example, an unrecoverable cell outage — operations within the recovery point window may be lost entirely. Temporal Cloud targets a recovery point of under one minute for these outages.
:::caution
Conflict Resolution can only recover data from a functioning Temporal Server. If the active server completely fails and does not recover, Workflow API calls that fall within the 1 minute RPO may be permanently lost.
:::

Address reviewer comments on the Conflict resolution section:
- Remove "virtually nothing is lost" wording
- Reference the Conflict Resolution process instead
- Move permanent-loss detail into a :::caution admonition

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removed redundant explanation about Workflow Id uniqueness during failover scenarios.
Clarify the definition of manual failover and its scenarios.
Clarify the conditions under which data may be permanently lost during a server failure.
Clarified the conditions under which operations may be lost during failover and emphasized the importance of a functioning Temporal Server for Conflict Resolution.
Clarify the durability of Workflow starts and Signals during failovers.
Clarify effects of non-graceful failover on Workflow progress and replication.
lukeknep and others added 3 commits July 1, 2026 12:27
…lover

Expand the bare assertion that conflict resolution preserves Workflow Id
uniqueness into a titled subsection that walks the mechanism: steady-state
enforcement, divergence during a forced/split-brain failover, and the
fork-not-merge reconciliation that leaves a single Open Execution per
Workflow Id. Notes that uniqueness is distinct from durability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`SignalWorkflowExecution` call that returns success is durably committed in the active region, and replicated
asynchronously to the replica.

### How Workflow Id uniqueness is preserved after a forced failover

Contributor Author

@temporal-nick can you check for correctness?

  Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be
  reflected in the replica due to <ToolTipTerm term="replication lag" />, particularly during failovers. In the event of a
- non-graceful failover, replication lag may cause a temporary setback in Workflow progress.
+ non-graceful failover, replication lag causes temporary setback in Workflow progress. At the moment of non-graceful

Contributor Author

@brianmacdonald-temporal you had some comments here. Could you review my rewrite?

Contributor

I'd still suggest putting the text starting with "Note that" in a Caution box. I think we want to be clear on the circumstances in which data cannot be fully recovered. Other than that, the text itself looks good.


- Operations that had already replicated remain durable in the replica.
- Operations that had not yet replicated (i.e., that are still in the replication backlog) are reconciled when the
region recovers. according to the Conflict Resolution process. - Note that Conflict Resolution can only recover data

Contributor

This is a period . and not a comma ,

3. **Fork instead of merge.** Temporal Cloud does not interleave the divergent histories. Events from the previously
active Namespace that arrive after the failover cannot be directly applied, so Temporal Cloud forks the Event History
and creates a new branch history, each branch identified by its own Run Id. Its
<ToolTipTerm term="conflict resolution" /> process keeps one branch as the Open Execution and supersedes the other,

Contributor

One "execution", not one branch. The execution from the source-side will become a "zombie" workflow execution that is eventually terminated on the source once the target replicates information about a competing workflow ID
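
A rough Python sketch of the behavior described in this comment, under stated assumptions: each side holds at most one Execution per Workflow Id, and the source terminates its own open Execution once it learns of a competing open Execution with a different Run Id on the target. `Execution` and `reconcile_source` are hypothetical names, and the intermediate "zombie" state is collapsed into the final termination for brevity.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    workflow_id: str
    run_id: str
    status: str = "open"  # "open" or "terminated" in this toy model


def reconcile_source(source: dict, target: dict) -> list:
    """Terminate source-side Executions that compete with an open target-side
    Execution for the same Workflow Id. Returns the terminated Executions."""
    terminated = []
    for workflow_id, competing in target.items():
        local = source.get(workflow_id)
        if local is None or local.status != "open" or competing.status != "open":
            continue
        if local.run_id != competing.run_id:
            # The source-side Execution loses; the target keeps the only open one.
            local.status = "terminated"
            terminated.append(local)
    return terminated


if __name__ == "__main__":
    source = {"order-42": Execution("order-42", "run-a")}
    target = {"order-42": Execution("order-42", "run-b")}
    reconcile_source(source, target)
    print(source["order-42"].status, target["order-42"].status)
```

After reconciliation only the target-side Execution remains open for that Workflow Id, which is the "one execution, not one branch" outcome described above.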


After failover, Temporal Cloud creates a new branch history for execution and begins its <ToolTipTerm term="conflict resolution" /> process. The Temporal Service ensures that Event Histories remain valid and are replayable by SDKs post-failover or after conflict resolution.
The same durability boundary applies to Workflow starts and Signals: a `StartWorkflowExecution` or
`SignalWorkflowExecution` call that returns success is durably committed in the active region, and replicated

Contributor

also SignalWithStartWorkflowExecution?


The [Workflow Id uniqueness guarantee](/workflow-execution/workflowid-runid) — at most one Open Workflow Execution per
Workflow Id — is always enforced within the active Namespace, and conflict resolution preserves it across a failover.
The guarantee limits how many Executions are _Open_ at the same time, not how many

Contributor

There are two policies on workflow ID:

  1. ID reuse policy: whether a Workflow Execution is allowed to spawn with a particular Workflow Id, if that Workflow Id has been used with a previous, and now Closed, Workflow Execution.
  2. ID conflict policy: determines how to resolve a conflict when spawning a new Workflow Execution with a particular Workflow Id used by an existing Open Workflow Execution.

I think you are trying to describe the conflict policy here. It is a little confusing, because the ID reuse policy governs the Workflow Id based on the previous run.
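
The split between the two policies can be sketched as a small decision function. This is a simplified model, not the Temporal SDK: the enum member names loosely mirror the policy values as I understand them, `decide_start` is a hypothetical helper, and the "failed only" case is reduced to a single `"failed"` status for brevity.

```python
from enum import Enum
from typing import Optional


class ReusePolicy(Enum):
    """Applies when the previous Execution with this Workflow Id is Closed."""
    ALLOW_DUPLICATE = "allow_duplicate"
    ALLOW_DUPLICATE_FAILED_ONLY = "allow_duplicate_failed_only"
    REJECT_DUPLICATE = "reject_duplicate"


class ConflictPolicy(Enum):
    """Applies when an Execution with this Workflow Id is still Open."""
    FAIL = "fail"
    USE_EXISTING = "use_existing"
    TERMINATE_EXISTING = "terminate_existing"


def decide_start(existing_status: Optional[str],
                 reuse: ReusePolicy,
                 conflict: ConflictPolicy) -> str:
    """existing_status is None, "open", "completed", or "failed" (toy statuses)."""
    if existing_status is None:
        return "start"  # the Workflow Id has never been used
    if existing_status == "open":
        # Only the conflict policy matters while an Execution is Open.
        return {
            ConflictPolicy.FAIL: "reject",
            ConflictPolicy.USE_EXISTING: "return_existing",
            ConflictPolicy.TERMINATE_EXISTING: "terminate_then_start",
        }[conflict]
    # The previous Execution is Closed, so only the reuse policy matters.
    if reuse is ReusePolicy.ALLOW_DUPLICATE:
        return "start"
    if reuse is ReusePolicy.ALLOW_DUPLICATE_FAILED_ONLY:
        return "start" if existing_status == "failed" else "reject"
    return "reject"
```

The point of the sketch is that the two policies never apply at the same time: which one is consulted depends only on whether the existing Execution is Open or Closed.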

4 participants