fix timeout and audio sync by raghavm243512 · Pull Request #154 · ServiceNow/eva

raghavm243512 · 2026-06-17T18:11:48Z

Series of small/medium changes to fix timeout and keep audio in sync:

ElevenLabs' Conversational AI changed its end-of-utterance audio delivery: it
no longer stops the user-agent stream in a way the user simulator's partial-chunk
heuristic could detect. As a result the user sim never emitted end-of-turn /
trailing silence, and the user sim and the S2S assistant (ElevenLabs, OpenAI
realtime) deadlocked waiting on each other until the inactivity timeout fired.

- user sim: derive end-of-turn from ElevenLabs' agent_response event (arm) +
  audio-buffer drain (fire), independent of stream shape; emit trailing silence
  so the assistant VAD can close the turn
- elevenlabs bridge: forward µ-law 8kHz in 250ms chunks (was batching 1s chunks)
- elevenlabs server: wall-clock-anchored recording (both tracks, assistant at
  playback time) → correct mixed audio; model_response latency keyed off the
  user_transcript event
- bump elevenlabs SDK to 2.53.0
- update latency calculation to consider detection time/delay
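
For reference, the chunk sizes involved in the bridge change work out as below. This is only the arithmetic, not the bridge code: `chunk_bytes` and `split_chunks` are hypothetical helper names, and the 16-bit sample width for the old PCM assumption is my assumption (the description only says "16kHz PCM").

```python
def chunk_bytes(sample_rate_hz: int, bytes_per_sample: int, chunk_ms: int) -> int:
    """Number of bytes in one mono chunk of the given duration."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def split_chunks(buf: bytes, size: int) -> list[bytes]:
    """Split a buffer into fixed-size chunks; the last one may be shorter."""
    return [buf[i:i + size] for i in range(0, len(buf), size)]


# µ-law stores one byte per sample, so 250 ms at 8 kHz is 2000 bytes.
ULAW_8K_250MS = chunk_bytes(8000, 1, 250)

# Assumed 16-bit PCM (two bytes per sample): 1 s at 16 kHz is 32000 bytes.
PCM16_16K_1S = chunk_bytes(16000, 2, 1000)
```

The two sizes differ by a factor of 16, so a batch size chosen for one format does not carry over to the other.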

File specifics:

user_simulator/client.py — _on_user_speaks (ElevenLabs' agent_response callback) now signals turn-completion to the audio interface. This is an authoritative event ("the user agent finished its utterance") replacing the broken audio-gap inference.
user_simulator/audio_interface.py — new notify_user_utterance_complete() + a send-loop branch that fires end-of-turn once the audio buffer drains, even when it drains frame-aligned with no leftover partial chunk (the case ElevenLabs' new behavior produces, which the old partial-chunk heuristic missed). This is what causes trailing silence to be sent so the assistant's VAD can close the turn.
assistant/elevenlabs_audio_interface.py — bridge chunk-size fix: was assuming 16kHz PCM and batching ~1-second chunks; now forwards µ-law 8kHz in correct ~250ms chunks, matching the agent's ulaw_8000 input format.
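
A minimal sketch of the arm/fire idea described above, assuming a simple byte buffer drained one frame at a time. Only `notify_user_utterance_complete` is a name from this PR; the class, `push_audio`, and `next_frame` are hypothetical and do not mirror the real audio interface.

```python
class EndOfTurnDetector:
    """Arm on an explicit 'utterance complete' event, fire when the buffer drains."""

    def __init__(self, frame_bytes: int) -> None:
        self.frame_bytes = frame_bytes
        self._buffer = bytearray()
        self._armed = False

    def push_audio(self, data: bytes) -> None:
        """Queue audio received from the user agent."""
        self._buffer.extend(data)

    def notify_user_utterance_complete(self) -> None:
        """Arm: the user agent reported that its utterance is finished."""
        self._armed = True

    def next_frame(self) -> tuple[bytes, bool]:
        """Pop up to one frame and report whether the turn just ended.

        The end-of-turn check looks only at 'armed and buffer empty', so it
        also fires when the buffer drains exactly on a frame boundary with
        no leftover partial chunk.
        """
        frame = bytes(self._buffer[: self.frame_bytes])
        del self._buffer[: self.frame_bytes]
        end_of_turn = self._armed and not self._buffer
        if end_of_turn:
            self._armed = False  # fire once per armed turn
        return frame, end_of_turn
```

In a send loop, the caller would start emitting trailing silence as soon as `next_frame` returns `True` in the second position.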

For elevenlabs_server.py, the timing logic change is significant:

Stream alignment

The buffers for each channel (assistant vs user) must be time aligned in order for audio_mixed.wav to be constructed correctly.

ElevenLabs doesn't stream a response's audio as a steady real-time feed. It delivers it in bursts — often most of a multi-second response arrives in a fraction of a second, sometimes split into a couple of chunks with gaps between them. The old code appended those bytes to the assistant buffer the instant they arrived from ElevenLabs ("receive time"). So a 6-second spoken response could land in the buffer over ~1 second of arrival, putting it at the wrong position relative to when it was actually heard by the user sim — the two channels drift.

A single session start reference (_record_t0) is established once. Before appending audio to either channel, that channel is padded with silence up to now − t0 — the real elapsed time. This is "wall-clock anchoring": both channels are placed on the same real timeline, so genuine pauses are preserved and the two stay aligned.
The assistant track is now recorded at playback time, not receive time: specifically, at the point where the pacer (the loop that forwards audio to the user sim in real-time 20ms chunks) actually emits each chunk. That's the moment the audio is "heard" by the user sim.
The wall-clock padding is applied only at the start of each assistant turn, not per-chunk. Within a turn the chunks are appended back-to-back. This matters because of the burstiness: if we padded on every chunk, ElevenLabs' small intra-turn delivery gaps would get filled with silence inside a word, garbling the speech.
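
The padding rule can be sketched as follows. This is an illustration under assumptions, not the code in elevenlabs_server.py: `AnchoredTrack` and its methods are hypothetical names, the caller passes in the elapsed time (for example `time.monotonic() - record_t0`), and the silence byte depends on the encoding (0x00 for linear PCM, 0xFF for µ-law).

```python
class AnchoredTrack:
    """One recording channel placed on a shared wall-clock timeline."""

    def __init__(self, bytes_per_second: int, sample_width: int = 1,
                 silence_byte: bytes = b"\x00") -> None:
        self.bytes_per_second = bytes_per_second
        self.sample_width = sample_width
        self.silence_byte = silence_byte
        self.buffer = bytearray()

    def pad_to(self, elapsed_s: float) -> None:
        """Fill with silence until the buffer length matches elapsed time."""
        target = int(elapsed_s * self.bytes_per_second)
        target -= target % self.sample_width  # stay on a sample boundary
        missing = target - len(self.buffer)
        if missing > 0:
            self.buffer.extend(self.silence_byte * missing)

    def append(self, chunk: bytes, elapsed_s: float, start_of_turn: bool) -> None:
        """Pad only at the start of a turn; within a turn append back-to-back."""
        if start_of_turn:
            self.pad_to(elapsed_s)
        self.buffer.extend(chunk)
```

Because padding is skipped for chunks inside a turn, small delivery gaps between chunks never turn into silence in the middle of a word.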

Latency measurement

For latency measurement, the reference for "user finished" used to be the user sim's user_speech_stop event, but on the ElevenLabs server that event is derived from the audio stream and arrives late or not at all. So the latency was often unmeasured or wrong.

It now uses ElevenLabs' user_transcript callback (ElevenLabs' own signal that it has finished hearing and transcribing the user) as the reference point. That event is emitted reliably per turn, and it's the same class of signal the OpenAI realtime assistant uses.
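
A rough sketch of keying latency off the transcript event. The description above only fixes the start of the interval (the user_transcript callback); taking the first assistant audio after it as the end of the interval is my assumption, and `ResponseLatencyTracker` with its method names is hypothetical.

```python
from typing import Optional


class ResponseLatencyTracker:
    """Measure time from user_transcript to the next assistant audio."""

    def __init__(self) -> None:
        self._transcript_at: Optional[float] = None
        self.latencies: list[float] = []

    def on_user_transcript(self, now_s: float) -> None:
        """Start of the interval: the user's turn has been transcribed."""
        self._transcript_at = now_s

    def on_assistant_audio(self, now_s: float) -> None:
        """End of the interval (assumed): first assistant audio of the turn."""
        if self._transcript_at is None:
            return  # no pending user turn, or already measured
        self.latencies.append(now_s - self._transcript_at)
        self._transcript_at = None
```

Timestamps would come from a monotonic clock such as `time.monotonic()` so the difference is unaffected by system clock adjustments.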

For stream alignment, the old code aligned the two channels by byte count (padding whichever buffer was shorter to match the other's length), which has no relationship to real elapsed time.

Net result: no more timeouts, and alignment in audio_mixed.wav seems to be perfect with regard to the timeline calculation (top graph):

fix ElevenLabs timeouts

a50a3da

raghavm243512 force-pushed the pr/elevenlabs_fix branch from 3f38e45 to a50a3da June 17, 2026 22:44

raghavm243512 added 2 commits June 18, 2026 10:22

latency calculation fix

cd4038c

test fix

2dd5b85

raghavm243512 marked this pull request as ready for review June 18, 2026 17:33

raghavm243512 added 2 commits June 18, 2026 10:59

Merge branch 'main' into pr/elevenlabs_fix

85b724c

merge main

e39ad56

raghavm243512 wants to merge 5 commits into main from pr/elevenlabs_fix

raghavm243512 commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raghavm243512 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

File specifics:

for elevenlabs_server.py, timing logic change is significant:

Stream alignment

latency measurement

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

raghavm243512 commented Jun 17, 2026 •

edited

Loading