Complete Guide to Using ElevenLabs Speech to Text

Transcribe speech to text with high accuracy, supporting timestamps and speaker detection.

Overview

ElevenLabs Speech to Text converts spoken audio into accurate written text. The model handles diverse accents, speaking speeds, and recording conditions, and its transcriptions capture both the words and the structure of the speech, including punctuation, paragraph breaks, and speaker identification.

The transcription engine goes beyond simple word-for-word conversion. It understands context to resolve ambiguous words, properly capitalizes names and proper nouns, and formats numbers and dates appropriately. The output is clean, readable text rather than a raw stream of words, significantly reducing the post-processing work needed for most use cases.

Timestamp support makes this model particularly valuable for video production workflows. The time-coded transcript can be used to generate captions, create searchable video indexes, or synchronize text overlays in the video editor. Combined with ElevenLabs' audio isolation, you can even transcribe speech from noisy recordings by first cleaning the audio.
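
As an illustration of the caption workflow described above, here is a minimal Python sketch that turns time-coded segments into SubRip (SRT) captions. The segment shape used here (a list of dicts with "start" and "end" in seconds plus "text") is an assumption made for the example; this guide does not document the model's actual response schema, so adapt the field names to the output you receive.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Build an SRT document from segments.

    Assumed (hypothetical) segment shape:
    {"start": float, "end": float, "text": str}
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n" if blocks else ""


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.5, "text": "Hello and welcome."},
        {"start": 1.5, "end": 4.25, "text": "Let's get started."},
    ]
    print(segments_to_srt(demo))
```

The resulting text can be saved with a .srt extension and loaded by most players and editors that accept SubRip captions.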

Capabilities

  • Transcribes speech with high accuracy across accents and speaking styles
  • Provides word-level and segment-level timestamps
  • Detects and labels multiple speakers in a conversation
  • Handles background noise and imperfect recording conditions
  • Produces properly punctuated and formatted text output
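
To show how word-level timestamps and speaker labels can be combined, the following Python sketch merges consecutive words from the same speaker into readable turns. The word shape (dicts with "speaker", "text", "start", and "end") is hypothetical and chosen only for illustration; the guide does not specify the real field names.

```python
from typing import Dict, List


def group_by_speaker(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words spoken by the same speaker into turns.

    Assumed (hypothetical) word shape:
    {"speaker": str, "text": str, "start": float, "end": float}
    """
    turns: List[Dict] = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns


if __name__ == "__main__":
    demo = [
        {"speaker": "speaker_0", "text": "Hi", "start": 0.0, "end": 0.3},
        {"speaker": "speaker_0", "text": "there.", "start": 0.3, "end": 0.7},
        {"speaker": "speaker_1", "text": "Hello!", "start": 0.9, "end": 1.4},
    ]
    for turn in group_by_speaker(demo):
        print(f"[{turn['start']:.1f}-{turn['end']:.1f}] {turn['speaker']}: {turn['text']}")
```

Each turn keeps the start time of its first word and the end time of its last, which is usually enough to build an interview transcript or meeting notes.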

Use Cases

1. Generating captions and subtitles for video content in the editor
2. Transcribing interviews and conversations for written content
3. Creating searchable text indexes from audio and video archives
4. Converting meeting recordings into structured notes and summaries
5. Extracting spoken dialogue from video for script documentation

Input Parameters

audio_url
file, required

URL of the audio file to transcribe. Common audio and video formats are supported: MP3, WAV, AAC, MP4, OGG (max 200 MB). Cleaner audio with less background noise produces more accurate transcriptions.

language_code
select

Language code of the audio. Leave empty for auto-detect.

Options: Auto Detect, Afrikaans (af), Albanian (sq), Amharic (am), Arabic (ar), Armenian (hy), Azerbaijani (az), Bashkir (ba), Basque (eu), Belarusian (be), Bengali (bn), Bosnian (bs), Breton (br), Bulgarian (bg), Burmese (my), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Faroese (fo), Finnish (fi), French (fr), Galician (gl), Georgian (ka), German (de), Greek (el), Gujarati (gu), Haitian Creole (ht), Hausa (ha), Hawaiian (haw), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Javanese (jw), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Lao (lo), Latin (la), Latvian (lv), Lingala (ln), Lithuanian (lt), Luxembourgish (lb), Macedonian (mk), Malagasy (mg), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Norwegian (no), Occitan (oc), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Sanskrit (sa), Serbian (sr), Shona (sn), Sindhi (sd), Sinhala (si), Slovak (sk), Slovenian (sl), Somali (so), Spanish (es), Sundanese (su), Swahili (sw), Swedish (sv), Tagalog (tl), Tajik (tg), Tamil (ta), Tatar (tt), Telugu (te), Thai (th), Tibetan (bo), Turkish (tr), Turkmen (tk), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Yiddish (yi), Yoruba (yo)
Default: empty (auto-detect)

tag_audio_events
toggle

Tag audio events like laughter, applause, etc.

Default: true

diarize
toggle

Whether to annotate who is speaking.

Default: true
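
The parameters above can be assembled into an input payload along these lines. This is a minimal Python sketch, not the platform's client code: the function name build_transcription_input is hypothetical, the extension check is a convenience assumption based on the accepted formats listed for audio_url (it would reject valid URLs that carry no file extension), and how the payload is submitted is outside the scope of this guide.

```python
import posixpath
from typing import Dict, Optional
from urllib.parse import urlparse

# Formats listed for audio_url in this guide.
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".aac", ".mp4", ".ogg"}


def build_transcription_input(
    audio_url: str,
    language_code: Optional[str] = None,
    tag_audio_events: bool = True,
    diarize: bool = True,
) -> Dict:
    """Assemble the model inputs (hypothetical helper, not an official API)."""
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audio_url must be an http(s) URL")

    # Assumption: the file type can be read from the URL path's extension.
    extension = posixpath.splitext(parsed.path.lower())[1]
    if extension not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {extension or '(none)'}")

    payload: Dict = {
        "audio_url": audio_url,
        "tag_audio_events": tag_audio_events,
        "diarize": diarize,
    }
    # Leaving language_code out corresponds to auto-detect.
    if language_code:
        payload["language_code"] = language_code
    return payload


if __name__ == "__main__":
    print(build_transcription_input("https://example.com/interview.mp3", language_code="en"))
```

Omitting language_code mirrors the "leave empty for auto-detect" behavior, and the two toggles default to true as documented above.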

Tips & Best Practices

  • Clean audio first for noisy recordings
  • Use timestamps for video work
  • Review proper nouns