Realtime | ElevenLabs Documentation

Realtime speech-to-text transcription service. This WebSocket API enables streaming audio input and receiving transcription results.

Event Flow

Audio chunks are sent as input_audio_chunk messages
Transcription results are streamed back in various formats (partial, committed, with timestamps)
Supports manual commit or VAD-based automatic commit strategies

Authenticate either with a valid API key in the xi-api-key header or with a valid token in the token query parameter. Tokens are generated by the single use token endpoint (/docs/api-reference/tokens/create). Use a token when transcribing audio from the client side.
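The choice between the two authentication modes can be sketched as below. This is a minimal illustration, not an official client: the helper name `connection_target` and the model ID `example-model` are hypothetical, and the host is taken from the URL listed on this page.

```python
from urllib.parse import urlencode

# Endpoint as listed on this page.
REALTIME_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"


def connection_target(model_id, api_key=None, token=None):
    """Return (url, headers) for the WebSocket handshake (hypothetical helper)."""
    if not api_key and not token:
        raise ValueError("provide either api_key or token")
    query = {"model_id": model_id}
    headers = {}
    if token:
        # Client-side sessions: the single use token travels in the query
        # string, so the API key never has to reach the browser or device.
        query["token"] = token
    else:
        # Server-side sessions: the API key goes in the xi-api-key header.
        headers["xi-api-key"] = api_key
    return f"{REALTIME_URL}?{urlencode(query)}", headers
```

The returned URL and headers can then be handed to whichever WebSocket client you use.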

Handshake

WSS

/v1/speech-to-text/realtime

Headers

xi-api-key (string, optional)

Query parameters

model_id (string, required)

ID of the model to use for transcription.

token (string, optional)

Single use token for authentication. Only used when initiating a session from the client. If provided, xi-api-key is no longer required for authentication.

include_timestamps (boolean, optional, defaults to false)

Whether to receive the committed_transcript_with_timestamps event, which includes word-level timestamps.

include_language_detection (boolean, optional, defaults to false)

Whether to include the detected language code in the committed_transcript_with_timestamps event.

audio_format (enum, optional, defaults to pcm_16000)

Audio encoding format for speech-to-text.

language_code (string, optional)

Language code in ISO 639-1 or ISO 639-3 format.

commit_strategy (enum, optional, defaults to manual)

Strategy for committing transcriptions.

Allowed values:

keyterms (list of strings, optional)

List of keyterms to bias the model towards. Maximum 50 keyterms, each up to 20 characters. Adds a 20% premium to the base transcription cost.

no_verbatim (boolean, optional, defaults to false)

If true, removes filler words, false starts and disfluencies from the transcript.

vad_silence_threshold_secs (double, optional, range 0.3-3, defaults to 1.5)

Silence threshold in seconds for VAD.

vad_threshold (double, optional, range 0.1-0.9, defaults to 0.4)

Threshold for voice activity detection.

min_speech_duration_ms (integer, optional, range 50-2000, defaults to 100)

Minimum speech duration in milliseconds.

min_silence_duration_ms (integer, optional, range 50-2000, defaults to 100)

Minimum silence duration in milliseconds.

enable_logging (boolean, optional, defaults to true)

When enable_logging is set to false, zero retention mode is used for the request, so history features are unavailable for it. Zero retention mode may only be used by enterprise customers.
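As a rough sketch, the query parameters above can be validated client-side before the handshake. The helper `build_query` is hypothetical. The ranges and keyterm limits are copied from this page and assumed to be inclusive. The wire encoding of booleans (lowercase `true`/`false`) and of list values (repeated keys) is an assumption, not something this page specifies.

```python
from urllib.parse import urlencode

# (min, max) ranges copied from the parameter list; assumed inclusive.
RANGES = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def build_query(model_id, **options):
    """Validate options and return an encoded query string (hypothetical helper)."""
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    keyterms = options.get("keyterms", [])
    if len(keyterms) > 50 or any(len(term) > 20 for term in keyterms):
        raise ValueError("at most 50 keyterms, each up to 20 characters")

    params = [("model_id", model_id)]
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            params.append((name, "true" if value else "false"))
        elif isinstance(value, (list, tuple)):
            # Assumption: list values are sent as repeated keys.
            params.extend((name, item) for item in value)
        else:
            params.append((name, str(value)))
    return urlencode(params)
```

Rejecting out-of-range values locally gives a clearer error than a failed handshake, although the server remains the authority on what it accepts.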

Send

inputAudioChunk (object, required)

Audio data chunk sent from client to server for transcription.
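The sketch below shows one way to split raw audio into input_audio_chunk messages. This page does not list the message fields, so `message_type`, `audio_base_64`, and `commit` are assumed names that should be checked against the message schema. The 3200-byte default assumes pcm_16000 means 16-bit mono PCM at 16 kHz, which makes each chunk about 100 ms of audio.

```python
import base64
import json


def audio_chunk_messages(pcm, chunk_bytes=3200, commit_last=True):
    """Yield one JSON message per audio chunk (field names are assumptions)."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        is_last = start + chunk_bytes >= len(pcm)
        yield json.dumps({
            "message_type": "input_audio_chunk",  # assumed discriminator field
            "audio_base_64": base64.b64encode(chunk).decode("ascii"),  # assumed
            # With the manual commit strategy, request a commit on the final chunk.
            "commit": commit_last and is_last,  # assumed field
        })
```

Each yielded string would be sent as a single WebSocket text frame. With a VAD-based commit strategy, `commit_last=False` leaves the commit decision to the server.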

Receive

sessionStarted (object, required)

Sent when the transcription session is successfully started.

partialTranscript (object, required)

Interim transcription result that may change.

committedTranscript (object, required)

Committed transcription result that will not change.

committedTranscriptWithTimestamps (object, required)

Committed transcription result with word-level timestamps.

scribeError (object, required)

Error event during transcription.

scribeAuthError (object, required)

Authentication error during transcription session.

scribeQuotaExceededError (object, required)

Quota exceeded error during transcription session.

scribeThrottledError (object, required)

Throttled error during transcription session.

scribeUnacceptedTermsError (object, required)

Unaccepted terms error during transcription session.

scribeRateLimitedError (object, required)

Rate limited error during transcription session.

scribeQueueOverflowError (object, required)

Queue overflow error during transcription session.

scribeResourceExhaustedError (object, required)

Resource exhausted error during transcription session.

scribeSessionTimeLimitExceededError (object, required)

Session time limit exceeded error during transcription session.

scribeInputError (object, required)

Input error during transcription session.

scribeChunkSizeExceededError (object, required)

Chunk size exceeded error during transcription session.

scribeInsufficientAudioActivityError (object, required)

Insufficient audio activity error during transcription session.

scribeTranscriberError (object, required)

Transcriber error during transcription session.
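A client typically keeps the latest partial transcript separate from committed text, since partial results may change and committed ones will not. The sketch below illustrates that bookkeeping. The `message_type` and `text` field names and the values `session_started`, `partial_transcript`, and `committed_transcript` are assumptions not given on this page. Because the wire names of the error events are not listed here either, every unrecognized event is collected for the caller to inspect instead of being classified.

```python
import json


class TranscriptState:
    """Tracks committed and partial text (hypothetical helper, assumed schema)."""

    def __init__(self):
        self.committed = []   # finalized segments, never rewritten
        self.partial = ""     # latest interim result, replaced on each update
        self.unhandled = []   # errors and any other events, kept for the caller

    def handle(self, raw):
        event = json.loads(raw)
        kind = event.get("message_type", "")  # assumed discriminator field
        if kind == "session_started":
            return
        if kind == "partial_transcript":
            # Interim results supersede each other rather than accumulate.
            self.partial = event.get("text", "")
        elif kind == "committed_transcript":
            self.committed.append(event.get("text", ""))
            self.partial = ""
        else:
            self.unhandled.append(event)

    def text(self):
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Only `committed_transcript` is appended to the committed text, so the same segment is not counted twice if a `committed_transcript_with_timestamps` event also arrives for it. Whether both events are sent for one segment is not stated on this page.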

URL: wss://api.elevenlabs.io/v1/speech-to-text/realtime
Method: GET
Status: 101 Switching Protocols