fix timeout and audio sync#154
Open
raghavm243512 wants to merge 5 commits into
Series of small/medium changes to fix timeout and keep audio in sync:
ElevenLabs' Conversational AI changed its end-of-utterance audio delivery: it
no longer stops the user-agent stream in a way the user simulator's partial-chunk
heuristic could detect, so the user sim never emitted end-of-turn / trailing
silence and S2S assistants (ElevenLabs, OpenAI realtime) deadlocked waiting on
each other → inactivity timeout.
audio-buffer drain (fire), independent of stream shape; emit trailing silence
so the assistant VAD can close the turn
playback time) → correct mixed audio; model_response latency keyed off the
user_transcript event
File specifics:
user_simulator/client.py — _on_user_speaks (ElevenLabs' agent_response callback) now signals turn-completion to the audio interface. This is an authoritative event ("the user agent finished its utterance") replacing the broken audio-gap inference.
user_simulator/audio_interface.py — new notify_user_utterance_complete() + a send-loop branch that fires end-of-turn once the audio buffer drains, even when it drains frame-aligned with no leftover partial chunk (the case ElevenLabs' new behavior produces, which the old partial-chunk heuristic missed). This is what causes trailing silence to be sent so the assistant's VAD can close the turn.
assistant/elevenlabs_audio_interface.py — bridge chunk-size fix: was assuming 16kHz PCM and batching ~1-second chunks; now forwards µ-law 8kHz in correct ~250ms chunks, matching the agent's ulaw_8000 input format.
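The drain-triggered end-of-turn described for user_simulator/audio_interface.py can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: only the name notify_user_utterance_complete() comes from the description above, while the class name, the step() method, the frame size, and the returned labels are hypothetical.

```python
class DrainEndOfTurn:
    """Hypothetical send-loop state: fire end-of-turn once the buffer drains."""

    def __init__(self, frame_bytes: int = 320):
        # frame_bytes is an assumed value; the real size depends on the audio format.
        self.frame_bytes = frame_bytes
        self.buffer = bytearray()
        self.utterance_complete = False

    def feed(self, audio: bytes) -> None:
        """Queue audio produced by the user agent."""
        self.buffer.extend(audio)

    def notify_user_utterance_complete(self) -> None:
        """Authoritative signal that the user agent finished its utterance."""
        self.utterance_complete = True

    def step(self) -> str:
        """Run one send-loop iteration and report what was emitted."""
        if len(self.buffer) >= self.frame_bytes:
            del self.buffer[: self.frame_bytes]
            return "frame"
        if self.utterance_complete:
            if self.buffer:
                # Short final frame; how it is padded is format-specific.
                self.buffer.clear()
                return "frame"
            # Buffer is empty: fire even if it drained exactly frame-aligned.
            self.utterance_complete = False
            return "end_of_turn"
        return "idle"
```

In this sketch the caller would start sending trailing silence when step() returns "end_of_turn". The point of the branch is that it does not depend on a leftover partial chunk being present.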
For elevenlabs_server.py, the timing logic change is significant:
Stream alignment
The buffers for each channel (assistant vs user) must be time aligned in order for audio_mixed.wav to be constructed correctly.
ElevenLabs doesn't stream a response's audio as a steady real-time feed. It delivers it in bursts — often most of a multi-second response arrives in a fraction of a second, sometimes split into a couple of chunks with gaps between them. The old code appended those bytes to the assistant buffer the instant they arrived from ElevenLabs ("receive time"). So a 6-second spoken response could land in the buffer over ~1 second of arrival, putting it at the wrong position relative to when it was actually heard by the user sim — the two channels drift.
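One way to place bursty audio by playback time rather than receive time is sketched below. It assumes playback of a chunk starts at its arrival time or when the previously queued audio finishes, whichever is later, and that one byte equals one sample. The class name, method name, sample rate, and silence byte are all hypothetical and not taken from elevenlabs_server.py.

```python
class PlaybackTimeline:
    """Hypothetical channel buffer indexed by playback position, not arrival order."""

    SILENCE = b"\x00"  # placeholder; the real silence value depends on the encoding

    def __init__(self, sample_rate: int = 8000):
        self.sample_rate = sample_rate  # assumed: one byte per sample
        self.buffer = bytearray()       # len(buffer) marks where queued playback ends

    def add_chunk(self, audio: bytes, arrival_s: float) -> float:
        """Append a chunk at its playback position; return the start time in seconds."""
        arrival_idx = round(arrival_s * self.sample_rate)
        # A chunk cannot start playing before the previous one has finished.
        start_idx = max(arrival_idx, len(self.buffer))
        # Fill any real gap between chunks with silence so later audio keeps its place.
        self.buffer.extend(self.SILENCE * (start_idx - len(self.buffer)))
        self.buffer.extend(audio)
        return start_idx / self.sample_rate
```

With this layout, a second chunk that arrives 0.2 s after the first but is heard only after 3 s of earlier audio is written at the later position, so the assistant channel stays on the same clock as the user channel.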
Latency measurement
For latency measurement, the reference for "user finished" used to be the user sim's user_speech_stop event, but on the ElevenLabs server that event is derived from the audio stream and arrives late or not at all, so the latency was often wrong or not measured.
It now uses ElevenLabs' user_transcript callback (ElevenLabs' own signal that it has finished hearing and transcribing the user) as the reference point. That callback is emitted reliably once per turn, and it is the same class of signal the OpenAI realtime assistant uses.
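As a rough illustration of keying latency off user_transcript, the sketch below timestamps that callback and measures to the first assistant audio that follows. The end point (first assistant audio) is an assumption, as are the class and method names; only the user_transcript event itself comes from the description above.

```python
import time
from typing import Callable, List, Optional


class ResponseLatencyTracker:
    """Hypothetical tracker: latency from user_transcript to the next assistant audio."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._user_done_at: Optional[float] = None
        self.latencies: List[float] = []

    def on_user_transcript(self) -> None:
        """Record the reference point: the service finished hearing the user."""
        self._user_done_at = self._clock()

    def on_assistant_audio(self) -> Optional[float]:
        """Return the latency for the first audio after a transcript, else None."""
        if self._user_done_at is None:
            return None  # later chunks of the same response are ignored
        latency = self._clock() - self._user_done_at
        self._user_done_at = None
        self.latencies.append(latency)
        return latency
```

Injecting the clock keeps the sketch testable, and resetting the reference after the first audio chunk means exactly one measurement is taken per turn.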
On stream alignment: the old code aligned the assistant and user buffers by byte count (padding whichever buffer was shorter to match the other's length), which has no relationship to real elapsed time.
Net result: no more timeouts, and the alignment in audio_mixed.wav appears to match the computed timeline (top graph):
