WebSocket

The Speech Engine upstream WebSocket protocol defines the interface your server must implement so that ElevenLabs can connect to it during a Speech Engine conversation. Unlike other ElevenLabs WebSocket channels, where your client connects to ElevenLabs, the Speech Engine reverses this relationship: **ElevenLabs is the WebSocket client and your server is the WebSocket server**.

This page shows the shape of the WebSocket API. We recommend using the provided server-side SDKs rather than implementing the protocol yourself: the SDKs include several helper methods and handle authentication for you automatically. You can find SDK installation instructions and guides in the [Speech Engine quickstart](/docs/eleven-api/guides/cookbooks/speech-engine).

Configure your server's publicly reachable WebSocket URL in the `wsUrl` field when creating or updating a Speech Engine via the REST API. When a user starts a conversation with that agent, ElevenLabs opens a WebSocket connection to your server and begins the message exchange described below.

## Connection flow

1. A user starts a conversation with a Speech Engine agent (via the ElevenLabs client SDK or API).
2. ElevenLabs opens a WebSocket connection to your `wsUrl`.
3. ElevenLabs sends an `init` message containing the conversation ID.
4. As the user speaks, ElevenLabs transcribes the audio and sends `user_transcript` messages with the full conversation history.
5. Your server calls an LLM and streams the response back as one or more `agent_response` messages.
6. ElevenLabs synthesizes the text to speech and streams the audio back to the user.
7. Periodic `ping` messages keep the connection alive; reply with `pong`.
8. When the conversation ends, ElevenLabs sends a `close` message.

## Authentication

Every connection from ElevenLabs includes an `X-Elevenlabs-Speech-Engine-Authorization` header containing a short-lived JWT. Verify this token before accepting the WebSocket upgrade to ensure the connection originates from ElevenLabs.

The JWT is signed with **HS256** using the SHA-256 hash of your ElevenLabs API key as the HMAC secret, and has:

- **Issuer** (`iss`): `https://api.elevenlabs.io/convai/speech-engine`
- **Subject** (`sub`): `convai_speech_engine_upstream`
- **Expiry** (`exp`): short-lived; a 60-second clock-skew leeway is applied

## Interruption handling

Each `user_transcript` message carries an `event_id`. If the user speaks again before your server finishes responding, a new `user_transcript` arrives with a higher `event_id`. Cancel your in-flight LLM call and begin responding to the new transcript. Any `agent_response` messages sent with an outdated `event_id` are silently discarded by ElevenLabs.

## Streaming responses

Send LLM output as a sequence of `agent_response` messages with `is_final: false` for each text chunk, followed by a final `agent_response` with `is_final: true` and an empty `content` string. ElevenLabs begins synthesizing audio as chunks arrive, minimising latency.
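
The streaming and interruption rules above can be sketched in a few lines of Python. This is a minimal illustration, not the SDK's implementation. The `TurnTracker` class and `agent_response_frames` function are hypothetical names, and the sketch assumes that messages are JSON objects discriminated by a `type` key and that `event_id` is an integer; only `event_id`, `content`, and `is_final` come from the description above.

```python
import json
from typing import Iterable, Iterator, Optional


class TurnTracker:
    """Remembers the newest user_transcript event_id seen on one connection."""

    def __init__(self) -> None:
        self.latest_event_id: Optional[int] = None

    def begin(self, event_id: int) -> None:
        # A higher event_id means the user spoke again; older turns become stale.
        if self.latest_event_id is None or event_id > self.latest_event_id:
            self.latest_event_id = event_id

    def is_current(self, event_id: int) -> bool:
        return event_id == self.latest_event_id


def _frame(event_id: int, content: str, is_final: bool) -> str:
    # ASSUMPTION: the "type" key and the JSON layout are illustrative only.
    return json.dumps(
        {
            "type": "agent_response",
            "event_id": event_id,
            "content": content,
            "is_final": is_final,
        }
    )


def agent_response_frames(
    tracker: TurnTracker, event_id: int, chunks: Iterable[str]
) -> Iterator[str]:
    """Yield one frame per text chunk, then the empty final frame.

    Stops early, without a final frame, if a newer transcript has arrived,
    since frames carrying an outdated event_id would be discarded anyway.
    """
    for chunk in chunks:
        if not tracker.is_current(event_id):
            return  # superseded by a newer user_transcript
        if chunk:  # skip empty chunks so only the terminator has empty content
            yield _frame(event_id, chunk, is_final=False)
    if tracker.is_current(event_id):
        yield _frame(event_id, "", is_final=True)
```

In a server, you would call `tracker.begin(event_id)` whenever a `user_transcript` arrives and pass each yielded frame to your WebSocket library's send method. Because the generator re-checks the tracker before every chunk, a lazily consumed LLM stream is abandoned as soon as the user interrupts.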