fix timeout and audio sync #154

Open
raghavm243512 wants to merge 5 commits into main from pr/elevenlabs_fix

@raghavm243512 raghavm243512 commented Jun 17, 2026

Collaborator

Series of small/medium changes to fix timeout and keep audio in sync:

ElevenLabs' Conversational AI changed its end-of-utterance audio delivery: it
no longer stops the user-agent stream in a way the user simulator's partial-chunk
heuristic could detect, so the user sim never emitted end-of-turn / trailing
silence and S2S assistants (ElevenLabs, OpenAI realtime) deadlocked waiting on
each other → inactivity timeout.

  • user sim: derive end-of-turn from ElevenLabs' agent_response event (arm) +
    audio-buffer drain (fire), independent of stream shape; emit trailing silence
    so the assistant VAD can close the turn
  • elevenlabs bridge: forward µ-law 8kHz in 250ms chunks (was batching 1s chunks)
  • elevenlabs server: wall-clock-anchored recording (both tracks, assistant at
    playback time) → correct mixed audio; model_response latency keyed off the
    user_transcript event
  • bump elevenlabs SDK to 2.53.0
  • Update latency calculation to consider detection time/delay
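
The arm/fire end-of-turn logic from the first bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `EndOfTurnDetector`, `push_audio`, and `next_frame` are hypothetical names, and the guard that requires at least one frame to have been sent before firing is an assumption. Only `notify_user_utterance_complete()` is a name taken from the PR, and its real body may differ.

```python
from collections import deque


class EndOfTurnDetector:
    """Hypothetical arm/fire detector, independent of how audio is chunked."""

    def __init__(self):
        self._buffer = deque()        # audio frames waiting to be sent
        self._armed = False           # True once the utterance is known complete
        self._sent_in_turn = False    # assumption: never fire on an empty turn

    def push_audio(self, frame: bytes) -> None:
        self._buffer.append(frame)

    def notify_user_utterance_complete(self) -> None:
        # Arm: called from the agent_response callback.
        self._armed = True

    def next_frame(self):
        """Return (frame, end_of_turn); frame is None when nothing is pending."""
        if self._buffer:
            self._sent_in_turn = True
            return self._buffer.popleft(), False
        if self._armed and self._sent_in_turn:
            # Fire: the buffer drained after arming, even if it drained
            # exactly on a frame boundary with no partial chunk left over.
            self._armed = False
            self._sent_in_turn = False
            return None, True
        return None, False
```

When `next_frame()` reports `end_of_turn`, the send loop would emit trailing silence so the assistant's VAD can close the turn.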

File specifics:

user_simulator/client.py — _on_user_speaks (ElevenLabs' agent_response callback) now signals turn-completion to the audio interface. This is an authoritative event ("the user agent finished its utterance") replacing the broken audio-gap inference.
user_simulator/audio_interface.py — new notify_user_utterance_complete() + a send-loop branch that fires end-of-turn once the audio buffer drains, even when it drains frame-aligned with no leftover partial chunk (the case ElevenLabs' new behavior produces, which the old partial-chunk heuristic missed). This is what causes trailing silence to be sent so the assistant's VAD can close the turn.
assistant/elevenlabs_audio_interface.py — bridge chunk-size fix: was assuming 16kHz PCM and batching ~1-second chunks; now forwards µ-law 8kHz in correct ~250ms chunks, matching the agent's ulaw_8000 input format.
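
The chunk-size arithmetic behind the bridge fix can be checked with a short sketch. It assumes the old code treated the stream as 16-bit mono PCM at 16kHz (the sample width is an assumption; the PR only says "16kHz PCM"), and the helper names are hypothetical. Under that assumption, a byte count sized for 250ms of PCM holds a full second of µ-law 8kHz audio, which is consistent with the ~1-second batching described above.

```python
def chunk_bytes(sample_rate_hz: int, bytes_per_sample: int, chunk_ms: int) -> int:
    """Bytes needed to hold chunk_ms of mono audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def duration_ms(num_bytes: int, sample_rate_hz: int, bytes_per_sample: int) -> int:
    """Playback duration of num_bytes of mono audio."""
    return num_bytes * 1000 // (sample_rate_hz * bytes_per_sample)


def split_chunks(data: bytes, size: int) -> list:
    """Split data into size-byte chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# µ-law is 1 byte per sample; 16-bit PCM is 2 bytes per sample.
ULAW_8K_250MS = chunk_bytes(8000, 1, 250)     # 2000 bytes
PCM16_16K_250MS = chunk_bytes(16000, 2, 250)  # 8000 bytes

# A chunk sized for 250ms of 16kHz PCM, filled with µ-law 8kHz data, lasts 1000ms.
MISMATCH_MS = duration_ms(PCM16_16K_250MS, 8000, 1)
```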

For elevenlabs_server.py, the timing logic change is significant:

Stream alignment

The buffers for the two channels (assistant and user) must be time-aligned for audio_mixed.wav to be constructed correctly.

ElevenLabs doesn't stream a response's audio as a steady real-time feed. It delivers it in bursts — often most of a multi-second response arrives in a fraction of a second, sometimes split into a couple of chunks with gaps between them. The old code appended those bytes to the assistant buffer the instant they arrived from ElevenLabs ("receive time"). So a 6-second spoken response could land in the buffer over ~1 second of arrival, putting it at the wrong position relative to when it was actually heard by the user sim — the two channels drift.

  • A single session start reference (_record_t0) is established once. Before appending audio to either channel, that channel is padded with silence up to now − t0 — the real elapsed time. This is "wall-clock anchoring": both channels are placed on the same real timeline, so genuine pauses are preserved and the two stay aligned.
  • The assistant track is now recorded at playback time, not receive time: specifically, at the point where the pacer (the loop that forwards audio to the user sim in real-time 20ms chunks) actually emits each chunk. That's the moment the audio is "heard."
  • The wall-clock padding is applied only at the start of each assistant turn, not per-chunk. Within a turn the chunks are appended back-to-back. This matters because of the burstiness: if we padded on every chunk, ElevenLabs' small intra-turn delivery gaps would get filled with silence inside a word, garbling the speech.
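
A minimal sketch of the wall-clock anchoring described in the bullets above. It is not the code in elevenlabs_server.py: `AnchoredTrack` and its parameters are hypothetical, the recording format is assumed to be 16-bit mono PCM (so silence is zero bytes), and timestamps are passed in explicitly so the behavior is easy to check.

```python
class AnchoredTrack:
    """Hypothetical recording track anchored to a shared session start t0."""

    def __init__(self, t0: float, sample_rate_hz: int = 16000, sample_width: int = 2):
        self._t0 = t0
        self._rate = sample_rate_hz
        self._width = sample_width
        self.buf = bytearray()

    def _pad_to(self, now: float) -> None:
        # Pad with silence up to (now - t0), rounded down to a whole sample.
        target = int((now - self._t0) * self._rate) * self._width
        missing = target - len(self.buf)
        if missing > 0:
            self.buf.extend(b"\x00" * missing)

    def append(self, chunk: bytes, now: float, start_of_turn: bool) -> None:
        # Pad only at the start of a turn; within a turn, chunks are appended
        # back-to-back so bursty delivery gaps do not land inside a word.
        if start_of_turn:
            self._pad_to(now)
        self.buf.extend(chunk)


# Usage: both tracks would share the same t0 (the role _record_t0 plays in the PR).
track = AnchoredTrack(t0=100.0)
speech = b"\x01\x01" * 1600                             # 100ms of non-silent samples
track.append(speech, now=101.0, start_of_turn=True)     # padded to the 1.0s mark
track.append(speech, now=101.5, start_of_turn=False)    # no padding mid-turn
track.append(speech, now=103.0, start_of_turn=True)     # padded to the 3.0s mark
```

For the assistant track, `now` would be the time the pacer emits the chunk rather than the time it arrived from ElevenLabs.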

Latency measurement

For latency measurement, the reference for "user finished" used to be the user sim's user_speech_stop event, but on the ElevenLabs server that event is derived from the audio stream and arrives late or not at all. As a result, the latency was often unmeasured or wrong.

It now uses ElevenLabs' user_transcript callback (ElevenLabs' own signal that it has finished hearing and transcribing the user) as the reference point. That event is emitted reliably per turn, and it's the same class of signal the OpenAI realtime assistant uses.
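
As a rough sketch of this measurement, assuming the latency interval ends at the first assistant audio chunk after the transcript (the PR text does not state the end point), the bookkeeping could look like this. `ResponseLatencyTracker` and its method names are hypothetical, and timestamps are passed in explicitly.

```python
class ResponseLatencyTracker:
    """Hypothetical tracker: user_transcript time -> first assistant audio."""

    def __init__(self):
        self._user_done_at = None   # set when user_transcript arrives
        self.latencies = []         # one entry per measured turn, in seconds

    def on_user_transcript(self, now: float) -> None:
        # Reference point: ElevenLabs reports it has finished hearing the user.
        self._user_done_at = now

    def on_assistant_audio(self, now: float):
        """Return the latency for the first chunk of a turn, else None."""
        if self._user_done_at is None:
            return None             # later chunks of the same turn are ignored
        latency = now - self._user_done_at
        self._user_done_at = None
        self.latencies.append(latency)
        return latency
```

Because the reference point is reset after the first chunk, each turn yields at most one sample, and a turn with no user_transcript yields none instead of a wrong value.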

For stream alignment, by contrast, the old code aligned the two channel buffers by byte count (padding whichever buffer was shorter to match the other's length), which has no relationship to real elapsed time.

Net result: no more timeouts, and the alignment in audio_mixed.wav appears to line up exactly with the timeline calculation (top graph):
[Image: Screenshot 2026-06-18 at 9 31 05 AM]

@raghavm243512 raghavm243512 marked this pull request as ready for review June 18, 2026 17:33