Complete Guide to Using ElevenLabs Speech to Text

Transcribe speech to text with high accuracy, supporting timestamps and speaker detection.

Overview

ElevenLabs Speech to Text converts spoken audio into accurate written text. The model handles diverse accents, speaking speeds, and recording conditions, producing transcriptions that capture both the words and the structure of the speech, including punctuation, paragraph breaks, and speaker identification.

The transcription engine goes beyond simple word-for-word conversion. It understands context to resolve ambiguous words, properly capitalizes names and proper nouns, and formats numbers and dates appropriately. The output is clean, readable text rather than a raw stream of words, significantly reducing the post-processing work needed for most use cases.

Timestamp support makes this model particularly valuable for video production workflows. The time-coded transcript can be used to generate captions, create searchable video indexes, or synchronize text overlays in the video editor. Combined with ElevenLabs' audio isolation, you can even transcribe speech from noisy recordings by first cleaning the audio.
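As an illustration of the caption workflow, here is a minimal Python sketch that turns word-level timestamps into SRT caption text. The `Word` structure, the `words_to_srt` helper, and the seven-words-per-cue grouping are assumptions made for this example; this guide does not specify the actual shape of the transcript output, so adapt the field names to whatever the model returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word-level transcript entry (times in seconds)."""
    text: str
    start: float
    end: float


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = len(cues) + 1  # SRT cue numbers start at 1
        start = format_srt_time(chunk[0].start)
        end = format_srt_time(chunk[-1].end)
        text = " ".join(w.text for w in chunk)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line
    return "\n".join(cues)
```

Grouping by a fixed word count keeps the sketch short; a production caption tool would more likely split cues on punctuation or pauses between words.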

Capabilities

Transcribes speech with high accuracy across accents and speaking styles
Provides word-level and segment-level timestamps
Detects and labels multiple speakers in a conversation
Handles background noise and imperfect recording conditions
Produces properly punctuated and formatted text output
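
Speaker labels become most useful once consecutive words from the same speaker are merged into turns. The Python sketch below shows one way to do that. The input dictionaries with `speaker`, `text`, `start`, and `end` keys are a hypothetical format chosen for illustration, not the documented output of the model.

```python
from typing import Any, Dict, List


def group_by_speaker(words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge consecutive words spoken by the same speaker into turns."""
    turns: List[Dict[str, Any]] = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first word): start a new turn
            turns.append({
                "speaker": word["speaker"],
                "text": word["text"],
                "start": word["start"],
                "end": word["end"],
            })
    return turns
```

Each resulting turn carries the start time of its first word and the end time of its last, so it can be rendered as a line of interview transcript or meeting notes.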

Use Cases

Generating captions and subtitles for video content in the editor

Transcribing interviews and conversations for written content

Creating searchable text indexes from audio and video archives

Converting meeting recordings into structured notes and summaries

Extracting spoken dialogue from video for script documentation

Input Parameters

audio_url

file (required)

URL of the audio file to transcribe. Accepted formats: MP3, WAV, AAC, MP4, OGG (max 200MB). Cleaner audio with less background noise produces more accurate transcriptions.

language_code

select

Language code of the audio. Leave empty for auto-detect.

Options

Auto Detect, Afrikaans (af), Albanian (sq), Amharic (am), Arabic (ar), Armenian (hy), Azerbaijani (az), Bashkir (ba), Basque (eu), Belarusian (be), Bengali (bn), Bosnian (bs), Breton (br), Bulgarian (bg), Burmese (my), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Faroese (fo), Finnish (fi), French (fr), Galician (gl), Georgian (ka), German (de), Greek (el), Gujarati (gu), Haitian Creole (ht), Hausa (ha), Hawaiian (haw), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Javanese (jw), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Lao (lo), Latin (la), Latvian (lv), Lingala (ln), Lithuanian (lt), Luxembourgish (lb), Macedonian (mk), Malagasy (mg), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Norwegian (no), Occitan (oc), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Sanskrit (sa), Serbian (sr), Shona (sn), Sindhi (sd), Sinhala (si), Slovak (sk), Slovenian (sl), Somali (so), Spanish (es), Sundanese (su), Swahili (sw), Swedish (sv), Tagalog (tl), Tajik (tg), Tamil (ta), Tatar (tt), Telugu (te), Thai (th), Tibetan (bo), Turkish (tr), Turkmen (tk), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Yiddish (yi), Yoruba (yo)

Default: Auto Detect

tag_audio_events

toggle

Tag audio events like laughter, applause, etc.

Default: true

diarize

toggle

Whether to annotate who is speaking.

Default: true
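
To show how the parameters above fit together, the following Python sketch assembles and checks an input payload. The parameter names, defaults, and accepted formats come from this guide; the `build_transcription_input` helper, the extension check based on the URL path, and the choice to omit `language_code` when auto-detecting are assumptions for illustration. The sketch does not call any API, since the endpoint and client library are not described here.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Optional
from urllib.parse import urlparse

# Accepted formats as listed for audio_url in this guide
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".aac", ".mp4", ".ogg"}


def build_transcription_input(
    audio_url: str,
    language_code: Optional[str] = None,  # None means auto-detect
    tag_audio_events: bool = True,        # documented default: true
    diarize: bool = True,                 # documented default: true
) -> Dict[str, Any]:
    """Build a parameter dictionary (hypothetical helper, no network call)."""
    # Check the file extension in the URL path, ignoring any query string
    suffix = PurePosixPath(urlparse(audio_url).path).suffix.lower()
    if suffix not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {suffix or '(none)'}")

    payload: Dict[str, Any] = {
        "audio_url": audio_url,
        "tag_audio_events": tag_audio_events,
        "diarize": diarize,
    }
    # Assumption: leaving language_code out is how auto-detect is requested
    if language_code:
        payload["language_code"] = language_code
    return payload
```

For example, `build_transcription_input("https://example.com/interview.mp3", language_code="en")` yields a dictionary with all four parameters set, while omitting `language_code` leaves the language to auto-detection. The 200MB size limit cannot be verified from a URL alone, so the sketch leaves that check out.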

Tips & Best Practices

Clean audio first for noisy recordings

Use timestamps for video work

Review proper nouns

Related Models

ElevenLabs Audio Isolation

ElevenLabs TTS Turbo 2.5

ElevenLabs Multilingual V2
