AI Voice Review
Guide · 8 min read

Getting Started with the ElevenLabs API: From API Key to Working Audio

By VoiceToolsReview Editorial Team

Last updated:

Affiliate link — we may earn a small commission.

Get your ElevenLabs API key

Usage-based billing with no minimum commitment. Start building in minutes.

The ElevenLabs API gives you programmatic access to voice generation, transcription, music, sound effects, voice cloning, and dubbing — all under a single API key. This guide covers the fastest path from zero to working audio in your project.

What You Get Access To

One API key covers everything:

  • Text to Speech — expressive speech in 70+ languages, streaming responses in under 500ms, and 10,000+ library voices or your own custom voice
  • Speech to Text — industry-leading transcription accuracy across 99 languages, processes at 20-50x real-time, streaming support for real-time applications
  • Music — studio-grade, commercially licensed AI music from a text prompt, with control over genre, mood, tempo, vocals, and structure
  • Sound Effects — generate realistic, loopable audio from text descriptions, four variations per request
  • Voice Cloning — clone a voice from a short audio sample or design a new one from a text description
  • Dubbing — translate and voice-over audio/video while preserving the original speaker's voice
  • Voice Changer, Voice Isolator, Forced Alignment, Text to Dialogue

Step 1: Create Your Account and Get an API Key

Sign up at elevenlabs.io. Navigate to developer settings and generate an API key. The key authenticates every request and can be scoped to specific workspaces.

Usage-based billing, no minimum commitment

You pay for what you generate. There is no upfront cost or minimum subscription to start using the API. Check elevenlabs.io/pricing for current per-character and per-minute rates for each product.

Step 2: Install the SDK

Python:

pip install elevenlabs

TypeScript / Node:

npm install elevenlabs

For mobile, SDKs are available for Flutter, Swift, and Kotlin (via the Agents platform). If you are using another language, the REST API accepts standard HTTP requests — any language that can make a POST request works.
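If you skip the SDK entirely, a raw POST against the text-to-speech endpoint works from the standard library alone. A minimal sketch, assuming the `v1/text-to-speech/{voice_id}` endpoint and `xi-api-key` header from the API reference, with the request only sent when a key is actually configured:

```python
import json
import os
import urllib.request

API_KEY = os.environ.get("ELEVENLABS_API_KEY", "")
VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

# Same fields the SDK sends under the hood
payload = json.dumps({
    "text": "Hello from a raw HTTP request.",
    "model_id": "eleven_multilingual_v2",
}).encode("utf-8")

request = urllib.request.Request(
    url,
    data=payload,
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
)

if API_KEY:  # only call the API when a real key is set
    with urllib.request.urlopen(request) as response, open("output.mp3", "wb") as f:
        f.write(response.read())
```

The response body is raw MP3 bytes, so writing it straight to disk is enough for a first test.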

Step 3: Make Your First TTS Call

Python:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # voice ID from the library
    text="Hello. This is your first ElevenLabs API call.",
    model_id="eleven_multilingual_v2",
)

# audio is a generator — write to file or stream directly
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

TypeScript:

import { ElevenLabsClient } from "elevenlabs";
import { createWriteStream } from "fs";

const client = new ElevenLabsClient({ apiKey: "YOUR_API_KEY" });

const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "Hello. This is your first ElevenLabs API call.",
  model_id: "eleven_multilingual_v2",
});

const writeStream = createWriteStream("output.mp3");
audio.pipe(writeStream);

The response is a stream. You can write it to a file, pipe it into an audio player, or send it directly to the client in a web application.
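Whichever destination you choose, the consuming pattern is the same: iterate the chunks as they arrive. A small sketch of buffering a chunked response into memory, with fake chunks standing in for the SDK's generator:

```python
import io

def collect_audio(chunks):
    """Buffer a chunked audio response into a single bytes object."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    return buf.getvalue()

# Simulated chunks standing in for the SDK's streaming response
fake_chunks = [b"ID3", b"\x04\x00", b"frame-data"]
audio_bytes = collect_audio(fake_chunks)
```

In a web app you would yield the same chunks into your framework's streaming response instead of buffering them.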

Get your API key — usage-based, no minimum commitment

Step 4: Pick a Voice

The Voice Library has 10,000+ voices filterable by language, accent, style, age, and use case. Each voice has a unique voice ID — the string you pass to the API.

Three options for choosing a voice:

Browse the library — find a voice that fits your use case at elevenlabs.io/voice-library, copy its ID.

Clone a voice — upload a short audio sample. The API returns a voice ID you can use in any TTS call. Instant Cloning works from a short sample; Professional Cloning produces higher fidelity results for production use.

Design a voice — send a text description ("middle-aged British man, warm and authoritative, clear pronunciation") and the API generates a new voice to that spec.
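If you fetch your account's voices through the Python SDK, a small helper can map display names to the IDs you pass to TTS calls. `client.voices.get_all()` matches recent Python SDK releases, but treat the method name and the example voice name as assumptions and check the SDK reference for your version:

```python
import os

def pick_voice(voices, name):
    """Return the voice_id whose display name matches, or None."""
    for voice in voices:
        if voice["name"] == name:
            return voice["voice_id"]
    return None

API_KEY = os.environ.get("ELEVENLABS_API_KEY")
if API_KEY:  # only hit the API when a real key is configured
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key=API_KEY)
    fetched = [
        {"name": v.name, "voice_id": v.voice_id}
        for v in client.voices.get_all().voices
    ]
    print(pick_voice(fetched, "George"))  # "George" is a hypothetical display name
```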

Step 5: Fine-Tune Output

For most use cases the default output is production-ready. When you need more control:

Pronunciation dictionaries — define exactly how specific terms are spoken. Useful for brand names, technical vocabulary, acronyms, or domain-specific words the model might mispronounce. Upload a dictionary file once and reference it in subsequent API calls.

SSML tags — control pacing, emphasis, pauses, and pitch at the markup level. Useful when precise timing matters, like matching narration to video cuts.

Emotion and style control — the stability and similarity_boost parameters adjust how consistent and expressive the output is. Lower stability produces more natural variation; higher stability produces consistent, measured delivery.
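A sketch of passing those parameters, assuming the Python SDK's `VoiceSettings` object and a `voice_settings` argument on `convert` (both present in recent SDK releases; verify against your version). The values themselves are illustrative:

```python
import os

# Illustrative values: lower stability for a more expressive, varied read
SETTINGS = {"stability": 0.35, "similarity_boost": 0.8}

API_KEY = os.environ.get("ELEVENLABS_API_KEY")
if API_KEY:  # only call the API when a real key is configured
    from elevenlabs import VoiceSettings
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key=API_KEY)
    audio = client.text_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        text="An expressive, varied read.",
        model_id="eleven_multilingual_v2",
        voice_settings=VoiceSettings(**SETTINGS),
    )
```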

Multi-speaker content — for dialogue or multi-character content, send separate requests with different voice IDs and stitch the audio outputs together.
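A sketch of that stitching approach, writing each speaker's MP3 chunks into one file in order (the second voice ID is a placeholder):

```python
import os

dialogue = [
    ("JBFqnCBsd6RMkjVDRZzb", "Did you hear the new release?"),
    ("SECOND_VOICE_ID", "I did. The narration sounds great."),  # placeholder ID
]

API_KEY = os.environ.get("ELEVENLABS_API_KEY")
if API_KEY:  # only call the API when a real key is configured
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key=API_KEY)
    with open("dialogue.mp3", "wb") as f:
        for voice_id, line in dialogue:
            audio = client.text_to_speech.convert(
                voice_id=voice_id,
                text=line,
                model_id="eleven_multilingual_v2",
            )
            for chunk in audio:  # append each speaker's audio in order
                f.write(chunk)
```

Concatenated MP3 segments generally play back-to-back; for sample-accurate joins or crossfades, decode and mix with an audio library instead.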

Step 6: Streaming for Real-Time Applications

For applications where latency matters — voice agents, conversational interfaces, live narration — use the streaming endpoint. Streaming delivers the first audio chunk in under 500ms, so users hear audio starting almost immediately rather than waiting for the full file to generate.

Python streaming:

audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Streaming audio starts playing immediately.",
    model_id="eleven_flash_v2_5",  # optimised for low latency
)

for chunk in audio_stream:
    # send chunk to audio player or client
    play(chunk)  # 'play' is a placeholder for your player or transport layer

The Flash model is optimised for latency-sensitive applications. The Multilingual v2 model produces the highest quality output for pre-generated content.

Step 7: Explore the Rest of the API

The same API key covers everything else. A few common next steps after TTS:

Speech to Text (Scribe):

with open("meeting.mp3", "rb") as f:
    result = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v1",
        diarize=True,            # speaker separation
        timestamps_granularity="word",  # word-level timing
    )
print(result.text)

Processes at 20-50x real-time. Useful for meeting transcription, podcast processing, call centre recordings, and anywhere you need accurate text from audio at volume.

Music generation:

result = client.text_to_music.convert(
    text="Upbeat acoustic guitar, warm and positive, 60 seconds, no vocals",
    make_instrumental=True,
)

Studio-grade output, commercially licensed for standard use. An additional license is required for advertising, film, TV, and games.

Sound effects:

result = client.text_to_sound_effects.convert(
    text="Heavy rain on a tin roof, distant thunder",
    duration_seconds=10,
)

Four variations generated per request. Output is loopable, which matters for game audio and ambient tracks.

Start building with the ElevenLabs API

Building Without Backend Code

If you are using vibe-coding tools — Lovable, Replit, v0, or Cursor — you can integrate ElevenLabs without writing backend code yourself. Describe the audio feature you want and these tools will handle the API integration. The ElevenLabs API is explicitly compatible with these workflows, with documentation written to support this use case.

Production Considerations

Authentication — API keys can be scoped to specific workspaces. Rotate keys and never expose them client-side in browser applications.
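A common pattern is loading the key from an environment variable at startup, so it never lands in source control or shipped client code. The variable name here is a convention, not an SDK requirement:

```python
import os

def load_api_key():
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get("ELEVENLABS_API_KEY")
    if not key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    return key
```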

Error handling — the SDKs surface typed errors for rate limits, invalid voice IDs, and model unavailability. Build retry logic for transient failures in production pipelines.
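A minimal retry wrapper with exponential backoff and jitter, as one way to handle transient failures; in production you would catch the SDK's specific rate-limit and availability error types rather than bare `Exception`:

```python
import random
import time

def with_retries(call, max_attempts=4, base_delay=0.5):
    """Call `call()`, retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # back off 0.5s, 1s, 2s, ... plus a little jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage would look like `audio = with_retries(lambda: client.text_to_speech.convert(...))`.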

Compliance — SOC 2, HIPAA, and GDPR compliant. EU and India data residency options available. Zero retention mode available for sensitive audio. Dedicated support and custom SLAs available at enterprise tier.

Monitoring — usage is tracked per API key in the developer dashboard. Monitor character counts and audio minutes against your billing expectations.

The API is trusted by teams at Meta, Stripe, Perplexity, Twilio, and Chess.com.

Frequently Asked Questions

What is the ElevenLabs API? Programmatic access to text to speech, speech to text, AI music, sound effects, voice cloning, dubbing, and more. Everything in the ElevenLabs platform, accessible via API.

What languages are supported? Python and TypeScript SDKs with streaming support. Flutter, Swift, and Kotlin for mobile. REST API for any other language.

How does pricing work? Usage-based billing with no minimum commitment. TTS is billed per character, STT per audio minute, music and SFX per generation, dubbing per source audio minute. Check elevenlabs.io/pricing for current rates.

How fast is the TTS API? Streaming responses in under 500ms. The Flash model is optimised for latency-sensitive applications.

Is output commercially licensed? Yes. Music requires an additional license for advertising, film, TV, games, and enterprise distribution.

Is it enterprise-compliant? SOC 2, HIPAA, and GDPR compliant. EU and India data residency options. Zero retention mode available.

