ElevenLabs Text to Speech API: The Developer's Guide to TTS That Actually Works
Affiliate link — we may earn a small commission.
Try the ElevenLabs TTS API
Generate expressive speech in 70+ languages with streaming in under 500ms. Usage-based billing, no minimum commitment.
Most text to speech APIs produce output that sounds like a machine reading a script. The ElevenLabs TTS API produces output that sounds like a person. This guide covers how to integrate it, what the key technical decisions are, and when to use each feature.
The Core API: How It Works
The TTS API takes text and a voice ID and returns audio. The response is a stream — you can play it directly, pipe it to a file, or forward it to a client. Streaming means the first audio chunk arrives in under 500ms, so users do not wait for a full file to generate before they hear anything.
from elevenlabs.client import ElevenLabs
client = ElevenLabs(api_key="YOUR_API_KEY")
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Your text goes here.",
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
import { ElevenLabsClient } from "elevenlabs";
const client = new ElevenLabsClient({ apiKey: "YOUR_API_KEY" });
const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "Your text goes here.",
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
});
Choosing a Model
Two primary models for most use cases:
| Model | Best for | Latency |
|---|---|---|
| eleven_flash_v2_5 | Voice agents, real-time apps, conversational interfaces | Optimised for under 500ms first chunk |
| eleven_multilingual_v2 | Audiobooks, voiceovers, podcasts, pre-generated content | Higher quality; generation time is not the constraint |
Use Flash when response speed is the user experience. Use Multilingual v2 when quality is the priority and you are generating content offline.
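In code, the decision usually reduces to a single switch on the request path. A minimal sketch; the helper name is hypothetical, not part of the SDK:

# Hypothetical helper: pick the model based on whether latency is the constraint.
def pick_model(real_time: bool) -> str:
    # Flash trades some quality for sub-500ms first-chunk latency;
    # Multilingual v2 is the higher-quality option for offline generation.
    return "eleven_flash_v2_5" if real_time else "eleven_multilingual_v2"

model_id = pick_model(real_time=True)  # voice agent / conversational path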
Choosing a Voice
Voice Library — 10,000+ voices across languages, accents, ages, and styles. Access at elevenlabs.io/voice-library or via the API:
voices = client.voices.get_all()
for voice in voices.voices:
    print(voice.voice_id, voice.name)
Filter by language, use case, and style to find a shortlist, then preview before committing to one for production.
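As a rough sketch of client-side filtering, the returned Voice objects carry metadata you can match against; the specific label key and value below are assumptions, so inspect the fields on your own results:

voices = client.voices.get_all()

# Filter on voice metadata client-side; the "accent" label key is illustrative,
# not a guaranteed schema.
shortlist = [
    v for v in voices.voices
    if v.labels and v.labels.get("accent") == "british"
]
for v in shortlist[:5]:
    print(v.voice_id, v.name, v.labels)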
Voice Cloning — upload a short audio sample and the API returns a voice ID you can use in any TTS call:
voice = client.clone(
    name="My Voice",
    files=["sample.mp3"],
    description="Narrator voice for my podcast",
)

voice_id = voice.voice_id
Instant Cloning works from a short sample and is fast. Professional Cloning produces higher fidelity, better multilingual consistency, and more stable long-form output — use it for production deployments where the voice is a core product feature.
Voice Design — generate a new voice from a text description:
voice = client.voices.design(
    name="Custom Narrator",
    voice_description="Middle-aged British man, warm and clear, suited for documentary narration",
    text="This is a sample of the voice that will be generated.",
)
Get your API key and start generating
Streaming for Real-Time Applications
For voice agents, conversational tools, and any interface where users hear audio as it generates:
def stream_speech():
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        text="This is a real-time streaming response.",
        model_id="eleven_flash_v2_5",
    )
    for chunk in audio_stream:
        # send each chunk to your audio output immediately
        yield chunk
In a web application, forward chunks as they arrive to the browser rather than waiting for the full file. The difference in perceived responsiveness is significant for conversational use cases.
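A minimal server-side sketch, assuming FastAPI (not part of the ElevenLabs SDK); the endpoint path is arbitrary:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from elevenlabs.client import ElevenLabs

app = FastAPI()
client = ElevenLabs(api_key="YOUR_API_KEY")

@app.get("/speak")
def speak(text: str):
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        text=text,
        model_id="eleven_flash_v2_5",
    )
    # Forward chunks to the browser as they are generated instead of buffering the full file.
    return StreamingResponse(audio_stream, media_type="audio/mpeg")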
Voice Settings
Four parameters control how a voice performs:
| Parameter | Range | Effect |
|---|---|---|
| stability | 0–1 | Higher = more consistent, less variable. Lower = more natural variation across sentences. |
| similarity_boost | 0–1 | How closely output matches the source voice. Higher for cloned voices where identity matters. |
| style | 0–1 | Amplifies stylistic characteristics. Use sparingly — high values can distort output. |
| use_speaker_boost | bool | Improves speaker similarity for cloned voices. Adds latency. |
For narration: stability 0.5–0.7, similarity 0.7–0.85, style 0–0.3. For conversational voice agents: stability 0.3–0.5, similarity 0.7, style 0.
from elevenlabs import VoiceSettings

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Adjusted voice settings in action.",
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.6,
        similarity_boost=0.8,
        style=0.1,
        use_speaker_boost=False,
    ),
)
Pronunciation Dictionaries
When specific terms need to be pronounced a specific way — brand names, acronyms, technical vocabulary, proper nouns — pronunciation dictionaries give you deterministic control.
Upload a dictionary file once:
with open("pronunciations.pls", "rb") as f:
dictionary = client.pronunciation_dictionary.add_from_file(
file=f,
name="Product Vocabulary",
)
dictionary_id = dictionary.id
Reference it in TTS calls:
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="The API returns audio in the specified format.",
    model_id="eleven_multilingual_v2",
    pronunciation_dictionary_locators=[
        {"pronunciation_dictionary_id": dictionary_id, "version_id": dictionary.version_id}
    ],
)
Dictionaries use IPA (International Phonetic Alphabet) or CMU Arpabet notation. Most use cases only need a small dictionary for brand-specific vocabulary; 20–50 entries cover the common cases.
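For reference, a minimal .pls file can be written straight from Python. The XML structure follows the W3C Pronunciation Lexicon Specification; the single IPA entry below is illustrative:

# Write a minimal PLS dictionary. Add one <lexeme> per term that needs
# deterministic pronunciation; the phoneme shown is illustrative.
pls = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>nginx</grapheme>
    <phoneme>ˈɛndʒɪn ˈɛks</phoneme>
  </lexeme>
</lexicon>
"""

with open("pronunciations.pls", "w", encoding="utf-8") as f:
    f.write(pls)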
Multilingual Content
The Multilingual v2 model handles 70+ languages. For content that switches languages within the same audio:
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Hello. Bonjour. Hola. Ciao.",  # model detects language per phrase
    model_id="eleven_multilingual_v2",
)
The model adapts pronunciation and accent per language segment while maintaining the same voice identity. Quality varies by language — major languages (English, French, Spanish, German, Italian, Portuguese, Japanese, Korean, Chinese) perform strongest.
Output Formats
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Output format selection.",
    model_id="eleven_multilingual_v2",
    output_format="pcm_24000",  # raw PCM for real-time playback
)
Available formats include MP3 at various bitrates, PCM for raw audio, ULAW for telephony, and Opus for low-latency web streaming. Use PCM or Opus for real-time applications; MP3 for file storage and download.
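If you save PCM output to disk, note that it has no container, so wrap it in a WAV header before handing it to a standard player. A sketch, assuming pcm_24000 is 16-bit mono at 24 kHz:

import wave

pcm_bytes = b"".join(audio)  # raw PCM chunks from the convert() call above

with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(24000)  # matches pcm_24000
    wav.writeframes(pcm_bytes)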
Multi-Speaker Content
For content with multiple characters or speakers, generate separate requests with different voice IDs and stitch:
narrator_audio = client.text_to_speech.convert(
    voice_id=NARRATOR_VOICE_ID,
    text="The detective looked across the room.",
    model_id="eleven_multilingual_v2",
)

character_audio = client.text_to_speech.convert(
    voice_id=CHARACTER_VOICE_ID,
    text="I didn't do it.",
    model_id="eleven_multilingual_v2",
)
Concatenate the audio chunks in order. For audiobooks and podcasts with consistent casts, cache the voice IDs and generate per-segment rather than re-sending everything.
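A minimal stitching sketch; back-to-back MP3 concatenation plays correctly in most players, but re-encode with an audio tool if you need guaranteed gapless output:

with open("scene.mp3", "wb") as f:
    for segment in (narrator_audio, character_audio):
        for chunk in segment:
            f.write(chunk)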
Alternatively, the Text to Dialogue API generates natural multi-speaker conversation from a structured script in a single call.
Start building with the ElevenLabs TTS API
Common Integration Patterns
Podcast automation — generate episode audio from a transcript. Split by paragraph, generate each segment, concatenate, add intro/outro music from the Music API (see the sketch after this list).
Voice agents — stream Flash model output directly to the client as response segments arrive. Combine with Scribe STT for full duplex audio.
Audiobook production — use Multilingual v2, Professional Cloning for a consistent narrator voice, pronunciation dictionaries for proper nouns.
E-learning — generate narration per slide or segment. Cache outputs since the same script will be played multiple times by multiple users.
Accessibility — add TTS to web content or apps. Use the browser audio API to play streamed PCM output directly without file downloads.
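A sketch of the podcast pattern above, assuming a plain-text transcript split on blank lines and a simple on-disk cache keyed by content hash; the cache layout and helper name are assumptions, not SDK features:

import hashlib
import os

def generate_episode(transcript: str, voice_id: str, out_path: str = "episode.mp3"):
    os.makedirs("cache", exist_ok=True)
    with open(out_path, "wb") as out:
        for paragraph in transcript.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            key = hashlib.sha256(paragraph.encode("utf-8")).hexdigest()
            cached = os.path.join("cache", f"{key}.mp3")
            if not os.path.exists(cached):
                # Generate only segments we have not produced before
                audio = client.text_to_speech.convert(
                    voice_id=voice_id,
                    text=paragraph,
                    model_id="eleven_multilingual_v2",
                )
                with open(cached, "wb") as f:
                    for chunk in audio:
                        f.write(chunk)
            with open(cached, "rb") as f:
                out.write(f.read())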
Frequently Asked Questions
How fast is the TTS API? Streaming first chunk in under 500ms with the Flash model. Multilingual v2 is higher quality but slower to generate.
What is the difference between Flash and Multilingual v2? Flash for real-time applications where latency is the priority. Multilingual v2 for pre-generated content where quality is the priority.
Can I control pronunciation? Yes. Upload a pronunciation dictionary and reference it in API calls. Supports IPA and CMU Arpabet notation.
How do I choose a voice? Browse the Voice Library at elevenlabs.io/voice-library, filter by use case and language, preview, grab the voice ID. Or clone your own from a short audio sample.
Does it support streaming? Yes. Both Python and TypeScript SDKs expose streaming endpoints. Use PCM or Opus output formats for real-time playback.