gcs: keep bridge alive across live-migration transport swap #2771

Merged: rawahars merged 2 commits into microsoft:main from rawahars:lm_gcs_bridge on Jun 19, 2026
@rawahars

Contributor

Add SetMigrating / ResumeOnConn on the bridge (plumbed through GuestConnection and Guest) so callers can park the recv/send loops during a UVM migration blackout and swap in the new hvsock without dropping in-flight RPCs. CreateConnection gains a coldStart bool so the migration destination skips the fresh-boot handshake.

Drive-bys: shim Stop honours caller ctx, Capabilities is nil-safe, ErrGuestConnectionUnavailable is exported, add session-id/action log fields.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
@rawahars rawahars requested a review from a team as a code owner June 11, 2026 19:56
Comment thread internal/gcs/bridge.go
// SetMigrating toggles tolerance of transport-level failures around a
// live-migration blackout. Explicit [bridge.Close] and the RPC timeout
// kill still tear the bridge down.
func (brdg *bridge) SetMigrating(migrating bool) {

Contributor

This seems fine, but I don't get why it's necessary. Why would we want transport-level tolerance only when migrating? I get that this is a local loopback connection, so in practice it likely never disconnects, but doesn't it seem reasonable to implement the bridge so that it auto-pauses on disconnect and continues on reconnect? No policy needed?

Contributor Author

We don't want to do that under normal circumstances. Our shim depends on the invariant that a bridge collapse is a fatal error: all the Waits are released, and the workflow then goes into teardown mode.

Only during migration do we suspend that behaviour, so that in case of a restore on rollback we can resume over a fresh socket connection.
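The policy described in this reply can be illustrated with a minimal sketch. It is not the code from this PR: `miniBridge`, `onTransportError`, `isDown` and `errTransport` are hypothetical names, and only `SetMigrating` mirrors an identifier from the diff. The assumption is that a transport failure is fatal by default (it releases waiters) and is tolerated only while the migrating flag is set.

```go
package main

import (
	"errors"
	"sync"
)

// errTransport is a hypothetical transport-level failure.
var errTransport = errors.New("transport failed")

// miniBridge is a hypothetical stand-in for the real bridge; it models only
// the state needed to show the migration policy.
type miniBridge struct {
	mu        sync.Mutex
	migrating bool
	closed    bool
	lastErr   error
	waitCh    chan struct{} // closed on teardown, releasing every waiter
}

func newMiniBridge() *miniBridge {
	return &miniBridge{waitCh: make(chan struct{})}
}

// SetMigrating toggles tolerance of transport-level failures.
func (b *miniBridge) SetMigrating(migrating bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.migrating = migrating
}

// onTransportError is what a recv/send loop would call when the transport
// fails. It reports whether the failure was tolerated.
func (b *miniBridge) onTransportError(err error) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return false
	}
	b.lastErr = err
	if b.migrating {
		// Blackout window: park and wait for a new connection.
		return true
	}
	// Normal operation: a transport failure is fatal.
	b.closed = true
	close(b.waitCh)
	return false
}

// isDown reports whether teardown has released the waiters.
func (b *miniBridge) isDown() bool {
	select {
	case <-b.waitCh:
		return true
	default:
		return false
	}
}
```

With the flag set, `onTransportError` returns true and `waitCh` stays open; once the flag is cleared, the next failure closes `waitCh`, which is the fatal path the shim relies on.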

Comment thread internal/gcs/bridge.go Outdated

// ResumeOnConn swaps the bridge transport onto conn and wakes the recv
// loop without dropping outstanding RPCs.
func (brdg *bridge) ResumeOnConn(newConn io.ReadWriteCloser) error {

Contributor

Can ResumeOnConn swap/close conn while sendLoop or recvLoop is using it (creating a data race on the interface field)?
If so, should we consider syncing all conn read/writes?

Contributor Author

Instead of syncing all the reads/writes, I have converted conn to an atomic value, which removes the race when the connection is swapped. Any sends during the migration window will fail as expected, and they are tolerated thanks to the conditional check in the send loop.
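As a rough illustration of the atomic-swap idea, here is a minimal sketch; the actual change in this PR is not shown in the thread and may differ. `swapBridge`, `connHolder` and `memConn` are hypothetical names. The sketch uses `atomic.Pointer` (Go 1.19+) around a small holder struct, because `atomic.Value` panics if values of different concrete types are stored, which can happen with an interface-typed connection.

```go
package main

import (
	"bytes"
	"io"
	"sync/atomic"
)

// connHolder wraps the interface so one pointer type is always stored,
// regardless of the concrete connection type.
type connHolder struct {
	rwc io.ReadWriteCloser
}

// swapBridge is a hypothetical bridge that keeps its transport behind an
// atomic pointer, so loops and the swap never race on the field itself.
type swapBridge struct {
	conn atomic.Pointer[connHolder]
}

func newSwapBridge(c io.ReadWriteCloser) *swapBridge {
	b := &swapBridge{}
	b.conn.Store(&connHolder{rwc: c})
	return b
}

// send loads the current connection once and writes to it. If the
// connection was just swapped out and closed, the write may fail; the
// caller decides whether that failure is tolerated.
func (b *swapBridge) send(p []byte) (int, error) {
	return b.conn.Load().rwc.Write(p)
}

// swap installs newConn and closes the previous connection.
func (b *swapBridge) swap(newConn io.ReadWriteCloser) error {
	old := b.conn.Swap(&connHolder{rwc: newConn})
	return old.rwc.Close()
}

// memConn is an in-memory stand-in for an hvsock connection.
type memConn struct {
	bytes.Buffer
	closed bool
}

func (m *memConn) Close() error {
	m.closed = true
	return nil
}
```

The atomic pointer only protects the field; it does not make a single connection safe for concurrent writers, so one sender per connection is still assumed.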

Comment thread internal/gcs/bridge.go
defer brdg.mu.Unlock()

if brdg.closed {
return ErrBridgeClosed

Contributor

Is there a risk of a socket/handle leak if the bridge closes between accept and swap, since we are not closing the passed connection before returning?

Contributor Author

Thanks for catching that. Fixed it in the last commit.

…ation

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>

@marma-dev marma-dev left a comment

Contributor

LGTM!

@rawahars rawahars merged commit 96a7412 into microsoft:main Jun 19, 2026
31 of 33 checks passed
@rawahars rawahars deleted the lm_gcs_bridge branch June 19, 2026 11:46
3 participants