ElevenLabs Text to Speech API: The Developer's Guide to TTS That Actually Works
Affiliate link — we may earn a small commission.
Try the ElevenLabs TTS API
Generate expressive speech in 70+ languages with streaming in under 500ms. Usage-based billing, no minimum commitment.
Most text to speech APIs produce output that sounds like a machine reading a script. The ElevenLabs TTS API produces output that sounds like a person. This guide covers how to integrate it, what the key technical decisions are, and when to use each feature.
The Core API: How It Works
The TTS API takes text and a voice ID and returns audio. The response is a stream — you can play it directly, pipe it to a file, or forward it to a client. Streaming means the first audio chunk arrives in under 500ms, so users do not wait for a full file to generate before they hear anything.
from elevenlabs.client import ElevenLabs
client = ElevenLabs(api_key="YOUR_API_KEY")
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Your text goes here.",
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
import { ElevenLabsClient } from "elevenlabs";
const client = new ElevenLabsClient({ apiKey: "YOUR_API_KEY" });
const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "Your text goes here.",
  model_id: "eleven_multilingual_v2",
  output_format: "mp3_44100_128",
});
Choosing a Model
Two primary models for most use cases:
| Model | Best for | Latency |
|---|---|---|
| eleven_flash_v2_5 | Voice agents, real-time apps, conversational interfaces | Optimised for under 500ms first chunk |
| eleven_multilingual_v2 | Audiobooks, voiceovers, podcasts, pre-generated content | Higher quality; generation time is not the constraint |
Use Flash when response speed is the user experience. Use Multilingual v2 when quality is the priority and you are generating content offline.
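In code, the decision usually reduces to a single switch on the request path. A minimal sketch; the helper name is hypothetical, not part of the SDK:

# Hypothetical helper: pick the model based on whether latency is the constraint.
def pick_model(real_time: bool) -> str:
    # Flash trades some quality for sub-500ms first-chunk latency;
    # Multilingual v2 is the higher-quality option for offline generation.
    return "eleven_flash_v2_5" if real_time else "eleven_multilingual_v2"

model_id = pick_model(real_time=True)  # voice agent / conversational path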
Choosing a Voice
Voice Library — 10,000+ voices across languages, accents, ages, and styles. Access at elevenlabs.io/voice-library or via the API:
voices = client.voices.get_all()
for voice in voices.voices:
    print(voice.voice_id, voice.name)
Filter by language, use case, and style to find a shortlist, then preview before committing to one for production.
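As a rough sketch of client-side filtering, the returned Voice objects carry metadata you can match against; the specific label key and value below are assumptions, so inspect the fields on your own results:

voices = client.voices.get_all()

# Filter on voice metadata client-side; the "accent" label key is illustrative,
# not a guaranteed schema.
shortlist = [
    v for v in voices.voices
    if v.labels and v.labels.get("accent") == "british"
]
for v in shortlist[:5]:
    print(v.voice_id, v.name, v.labels)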
Voice Cloning — upload a short audio sample and the API returns a voice ID you can use in any TTS call:
voice = client.clone(
    name="My Voice",
    files=["sample.mp3"],
    description="Narrator voice for my podcast",
)

voice_id = voice.voice_id
Instant Cloning works from a short sample and is fast. Professional Cloning produces higher fidelity, better multilingual consistency, and more stable long-form output — use it for production deployments where the voice is a core product feature.
Voice Design — generate a new voice from a text description:
voice = client.voices.design(
    name="Custom Narrator",
    voice_description="Middle-aged British man, warm and clear, suited for documentary narration",
    text="This is a sample of the voice that will be generated.",
)
Get your API key and start generating
Streaming for Real-Time Applications
For voice agents, conversational tools, and any interface where users hear audio as it generates:
def stream_speech():
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        text="This is a real-time streaming response.",
        model_id="eleven_flash_v2_5",
    )
    for chunk in audio_stream:
        # send each chunk to your audio output immediately
        yield chunk
In a web application, forward chunks as they arrive to the browser rather than waiting for the full file. The difference in perceived responsiveness is significant for conversational use cases.
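A minimal server-side sketch, assuming FastAPI (not part of the ElevenLabs SDK); the endpoint path is arbitrary:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from elevenlabs.client import ElevenLabs

app = FastAPI()
client = ElevenLabs(api_key="YOUR_API_KEY")

@app.get("/speak")
def speak(text: str):
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        text=text,
        model_id="eleven_flash_v2_5",
    )
    # Forward chunks to the browser as they are generated instead of buffering the full file.
    return StreamingResponse(audio_stream, media_type="audio/mpeg")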
Voice Settings
Four parameters control how a voice performs:
| Parameter | Range | Effect |
|---|---|---|
| stability | 0–1 | Higher = more consistent, less variable. Lower = more natural variation across sentences. |
| similarity_boost | 0–1 | How closely output matches the source voice. Higher for cloned voices where identity matters. |
| style | 0–1 | Amplifies stylistic characteristics. Use sparingly — high values can distort output. |
| use_speaker_boost | bool | Improves speaker similarity for cloned voices. Adds latency. |
For narration: stability 0.5–0.7, similarity 0.7–0.85, style 0–0.3. For conversational voice agents: stability 0.3–0.5, similarity 0.7, style 0.
from elevenlabs import VoiceSettings

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Adjusted voice settings in action.",
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.6,
        similarity_boost=0.8,
        style=0.1,
        use_speaker_boost=False,
    ),
)
Pronunciation Dictionaries
When specific terms need to be pronounced a specific way — brand names, acronyms, technical vocabulary, proper nouns — pronunciation dictionaries give you deterministic control.
Upload a dictionary file once:
with open("pronunciations.pls", "rb") as f:
dictionary = client.pronunciation_dictionary.add_from_file(
file=f,
name="Product Vocabulary",
)
dictionary_id = dictionary.id
Reference it in TTS calls:
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="The API returns audio in the specified format.",
    model_id="eleven_multilingual_v2",
    pronunciation_dictionary_locators=[
        {"pronunciation_dictionary_id": dictionary_id, "version_id": dictionary.version_id}
    ],
)
Dictionaries use IPA (International Phonetic Alphabet) or CMU Arpabet notation. Most use cases only need a small dictionary for brand-specific vocabulary; 20–50 entries cover the common cases.
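For reference, a minimal .pls file can be written straight from Python. The XML structure follows the W3C Pronunciation Lexicon Specification; the single IPA entry below is illustrative:

# Write a minimal PLS dictionary. Add one <lexeme> per term that needs
# deterministic pronunciation; the phoneme shown is illustrative.
pls = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>nginx</grapheme>
    <phoneme>ˈɛndʒɪn ˈɛks</phoneme>
  </lexeme>
</lexicon>
"""

with open("pronunciations.pls", "w", encoding="utf-8") as f:
    f.write(pls)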
Multilingual Content
The Multilingual v2 model handles 70+ languages. For content that switches languages within the same audio:
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Hello. Bonjour. Hola. Ciao.",  # model detects language per phrase
    model_id="eleven_multilingual_v2",
)
The model adapts pronunciation and accent per language segment while maintaining the same voice identity. Quality varies by language — major languages (English, French, Spanish, German, Italian, Portuguese, Japanese, Korean, Chinese) perform strongest.
Output Formats
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Output format selection.",
    model_id="eleven_multilingual_v2",
    output_format="pcm_24000",  # raw PCM for real-time playback
)
Available formats include MP3 at various bitrates, PCM for raw audio, ULAW for telephony, and Opus for low-latency web streaming. Use PCM or Opus for real-time applications; MP3 for file storage and download.
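If you save PCM output to disk, note that it has no container, so wrap it in a WAV header before handing it to a standard player. A sketch, assuming pcm_24000 is 16-bit mono at 24 kHz:

import wave

pcm_bytes = b"".join(audio)  # raw PCM chunks from the convert() call above

with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(24000)  # matches pcm_24000
    wav.writeframes(pcm_bytes)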
Multi-Speaker Content
For content with multiple characters or speakers, generate separate requests with different voice IDs and stitch:
narrator_audio = client.text_to_speech.convert(
    voice_id=NARRATOR_VOICE_ID,
    text="The detective looked across the room.",
    model_id="eleven_multilingual_v2",
)

character_audio = client.text_to_speech.convert(
    voice_id=CHARACTER_VOICE_ID,
    text="I didn't do it.",
    model_id="eleven_multilingual_v2",
)
Concatenate the audio chunks in order. For audiobooks and podcasts with consistent casts, cache the voice IDs and generate per-segment rather than re-sending everything.
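A minimal stitching sketch; back-to-back MP3 concatenation plays correctly in most players, but re-encode with an audio tool if you need guaranteed gapless output:

with open("scene.mp3", "wb") as f:
    for segment in (narrator_audio, character_audio):
        for chunk in segment:
            f.write(chunk)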
Alternatively, the Text to Dialogue API generates natural multi-speaker conversation from a structured script in a single call.
Start building with the ElevenLabs TTS API
Common Integration Patterns
Podcast automation — generate episode audio from a transcript. Split by paragraph, generate each segment, concatenate, add intro/outro music from the Music API (see the sketch after this list).
Voice agents — stream Flash model output directly to the client as response segments arrive. Combine with Scribe STT for full duplex audio.
Audiobook production — use Multilingual v2, Professional Cloning for a consistent narrator voice, pronunciation dictionaries for proper nouns.
E-learning — generate narration per slide or segment. Cache outputs since the same script will be played multiple times by multiple users.
Accessibility — add TTS to web content or apps. Use the browser audio API to play streamed PCM output directly without file downloads.
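A sketch of the podcast pattern above, assuming a plain-text transcript split on blank lines and a simple on-disk cache keyed by content hash; the cache layout and helper name are assumptions, not SDK features:

import hashlib
import os

def generate_episode(transcript: str, voice_id: str, out_path: str = "episode.mp3"):
    os.makedirs("cache", exist_ok=True)
    with open(out_path, "wb") as out:
        for paragraph in transcript.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            key = hashlib.sha256(paragraph.encode("utf-8")).hexdigest()
            cached = os.path.join("cache", f"{key}.mp3")
            if not os.path.exists(cached):
                # Generate only segments we have not produced before
                audio = client.text_to_speech.convert(
                    voice_id=voice_id,
                    text=paragraph,
                    model_id="eleven_multilingual_v2",
                )
                with open(cached, "wb") as f:
                    for chunk in audio:
                        f.write(chunk)
            with open(cached, "rb") as f:
                out.write(f.read())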
Frequently Asked Questions
How fast is the TTS API? Streaming first chunk in under 500ms with the Flash model. Multilingual v2 is higher quality but slower to generate.
What is the difference between Flash and Multilingual v2? Flash for real-time applications where latency is the priority. Multilingual v2 for pre-generated content where quality is the priority.
Can I control pronunciation? Yes. Upload a pronunciation dictionary and reference it in API calls. Supports IPA and CMU Arpabet notation.
How do I choose a voice? Browse the Voice Library at elevenlabs.io/voice-library, filter by use case and language, preview, grab the voice ID. Or clone your own from a short audio sample.
Does it support streaming? Yes. Both Python and TypeScript SDKs expose streaming endpoints. Use PCM or Opus output formats for real-time playback.