Getting Started with the ElevenLabs API: From API Key to Working Audio
Affiliate link — we may earn a small commission.
Get your ElevenLabs API key
Usage-based billing with no minimum commitment. Start building in minutes.
The ElevenLabs API gives you programmatic access to voice generation, transcription, music, sound effects, voice cloning, and dubbing — all under a single API key. This guide covers the fastest path from zero to working audio in your project.
What You Get Access To
One API key covers everything:
- Text to Speech — expressive speech in 70+ languages, streaming responses in under 500ms, 10,000+ voices or create your own
- Speech to Text — industry-leading transcription accuracy across 99 languages, processes at 20-50x real-time, streaming support for real-time applications
- Music — studio-grade, commercially licensed AI music from a text prompt, with control over genre, mood, tempo, vocals, and structure
- Sound Effects — generate realistic, loopable audio from text descriptions, four variations per request
- Voice Cloning — clone a voice from a short audio sample or design a new one from a text description
- Dubbing — translate and voice-over audio/video while preserving the original speaker's voice
- Also included: Voice Changer, Voice Isolator, Forced Alignment, and Text to Dialogue
Step 1: Create Your Account and Get an API Key
Sign up at elevenlabs.io. Navigate to developer settings and generate an API key. The key authenticates every request and can be scoped to specific workspaces.
You pay for what you generate: there is no upfront cost or minimum subscription to start using the API. Check elevenlabs.io/pricing for current per-character and per-minute rates for each API.
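Per-character billing is easy to sanity-check before you commit to a workload. A minimal sketch, using a hypothetical per-thousand-character rate (the real numbers are on the pricing page):

```python
# Back-of-envelope TTS cost estimator. The rate passed in is a placeholder;
# check elevenlabs.io/pricing for current per-character pricing.
def estimate_tts_cost(text: str, usd_per_1k_chars: float) -> float:
    """Estimate usage-based TTS cost for a piece of text."""
    return len(text) / 1000 * usd_per_1k_chars

script = "Hello. This is your first ElevenLabs API call. " * 200
print(f"{len(script)} characters, roughly ${estimate_tts_cost(script, 0.30):.2f}")
```

Swap in the current rate for the model you use; the same arithmetic applies to per-minute billing for STT and dubbing.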
Step 2: Install the SDK
Python:
pip install elevenlabs
TypeScript / Node:
npm install elevenlabs
For mobile, SDKs are available for Flutter, Swift, and Kotlin (via the Agents platform). If you are using another language, the REST API accepts standard HTTP requests — any language that can make a POST request works.
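To make the "any language works" point concrete, here is a standard-library-only sketch of the raw HTTP call. The `/v1/text-to-speech/{voice_id}` path and `xi-api-key` header are the public REST conventions; the voice ID is the example used later in this guide.

```python
# Raw-HTTP sketch for languages or environments without an official SDK.
import json
import urllib.request

def build_tts_request(api_key: str, voice_id: str, text: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build a POST request for the text-to-speech endpoint."""
    body = json.dumps({"text": text, "model_id": model_id}).encode()
    return urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("YOUR_API_KEY", "JBFqnCBsd6RMkjVDRZzb", "Hello from plain HTTP.")
# with urllib.request.urlopen(req) as resp:        # sends the request
#     open("output.mp3", "wb").write(resp.read())  # response body is the MP3 audio
```

The same shape translates directly to curl, Go's net/http, or any other HTTP client.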
Step 3: Make Your First TTS Call
Python:
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # voice ID from the library
    text="Hello. This is your first ElevenLabs API call.",
    model_id="eleven_multilingual_v2",
)

# audio is a generator — write to file or stream directly
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
TypeScript:
import { ElevenLabsClient } from "elevenlabs";
import { createWriteStream } from "fs";

const client = new ElevenLabsClient({ apiKey: "YOUR_API_KEY" });

const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  text: "Hello. This is your first ElevenLabs API call.",
  model_id: "eleven_multilingual_v2",
});

const writeStream = createWriteStream("output.mp3");
audio.pipe(writeStream);
The response is a stream. You can write it to a file, pipe it into an audio player, or send it directly to the client in a web application.
Step 4: Pick a Voice
The Voice Library has 10,000+ voices filterable by language, accent, style, age, and use case. Each voice has a unique voice ID — the string you pass to the API.
Three options for choosing a voice:
Browse the library — find a voice that fits your use case at elevenlabs.io/voice-library, copy its ID.
Clone a voice — upload an audio sample and the API returns a voice ID you can use in any TTS call. Instant Cloning works from just a short clip; Professional Cloning produces higher-fidelity results for production use.
Design a voice — send a text description ("middle-aged British man, warm and authoritative, clear pronunciation") and the API generates a new voice to that spec.
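Whichever route you take, your code ends up referencing raw voice IDs. A small pattern worth adopting early is to keep them behind readable names so call sites never hard-code ID strings. The second ID below is a placeholder:

```python
# Voice registry: map friendly names to voice IDs once, look them up everywhere.
VOICES = {
    "narrator": "JBFqnCBsd6RMkjVDRZzb",   # example voice used in this guide
    "assistant": "YOUR_SECOND_VOICE_ID",  # placeholder
}

def voice_id(name: str) -> str:
    """Look up a voice ID by friendly name, failing loudly on typos."""
    try:
        return VOICES[name]
    except KeyError:
        raise KeyError(f"Unknown voice '{name}'. Known: {sorted(VOICES)}") from None
```

This keeps a voice swap to a one-line change and makes misspelled names fail immediately instead of producing a confusing API error.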
Step 5: Fine-Tune Output
For most use cases the default output is production-ready. When you need more control:
Pronunciation dictionaries — define exactly how specific terms are spoken. Useful for brand names, technical vocabulary, acronyms, or domain-specific words the model might mispronounce. Upload a dictionary file once and reference it in subsequent API calls.
SSML tags — control pacing, emphasis, pauses, and pitch at the markup level. Useful when precise timing matters, like matching narration to video cuts.
Emotion and style control — the stability and similarity_boost parameters adjust how consistent and expressive the output is. Lower stability produces more natural variation; higher stability produces consistent, measured delivery.
Multi-speaker content — for dialogue or multi-character content, send separate requests with different voice IDs and stitch the audio outputs together.
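The multi-speaker approach above can be sketched as follows: render each line with its speaker's voice ID, then concatenate the audio byte streams in order. The second voice ID is a placeholder, and the commented-out loop assumes the Python SDK's convert call shown earlier.

```python
# Multi-speaker dialogue by stitching per-line TTS outputs together.
dialogue = [
    ("JBFqnCBsd6RMkjVDRZzb", "Did you catch the demo this morning?"),
    ("SECOND_VOICE_ID", "I did. The streaming latency is impressive."),  # placeholder ID
]

def stitch(clips: list[bytes]) -> bytes:
    """Concatenate rendered audio clips in dialogue order."""
    return b"".join(clips)

# clips = []
# for voice, line in dialogue:
#     audio = client.text_to_speech.convert(voice_id=voice, text=line,
#                                           model_id="eleven_multilingual_v2")
#     clips.append(b"".join(audio))  # drain the generator to bytes
# open("dialogue.mp3", "wb").write(stitch(clips))
```

Naively concatenated MP3 streams play in most players; re-encode with an audio tool such as ffmpeg if you need a clean container.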
Step 6: Streaming for Real-Time Applications
For applications where latency matters — voice agents, conversational interfaces, live narration — use the streaming endpoint. Streaming delivers the first audio chunk in under 500ms, so users hear audio starting almost immediately rather than waiting for the full file to generate.
Python streaming:
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="Streaming audio starts playing immediately.",
    model_id="eleven_flash_v2_5",  # optimised for low latency
)

for chunk in audio_stream:
    # play() is a placeholder — forward each chunk to your audio
    # player or client as it arrives
    play(chunk)
The Flash model is optimised for latency-sensitive applications. The Multilingual v2 model produces the highest quality output for pre-generated content.
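That trade-off can be captured in one line of routing logic, using the two model IDs that appear in the examples above:

```python
# Route between the two models named in this guide: Flash for low-latency
# interactive use, Multilingual v2 for highest-quality pre-generated audio.
def pick_model(realtime: bool) -> str:
    return "eleven_flash_v2_5" if realtime else "eleven_multilingual_v2"
```

A voice agent would pass realtime=True; a batch audiobook pipeline would not.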
Step 7: Explore the Rest of the API
The same API key covers everything else. A few common next steps after TTS:
Speech to Text (Scribe):
with open("meeting.mp3", "rb") as f:
    result = client.speech_to_text.convert(
        file=f,
        model_id="scribe_v1",
        diarize=True,                   # speaker separation
        timestamps_granularity="word",  # word-level timing
    )

print(result.text)
Processes at 20-50x real-time. Useful for meeting transcription, podcast processing, call centre recordings, and anywhere you need accurate text from audio at volume.
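With diarization and word-level timestamps enabled, a common next step is turning the word list into a speaker-labelled transcript. The word-object field names below (`text`, `speaker_id`) are assumptions about the response shape; check the actual Scribe response in your SDK version.

```python
# Fold a diarized word list into "speaker: utterance" lines.
def to_transcript(words: list[dict]) -> str:
    lines, current_speaker, buf = [], None, []
    for w in words:
        if w["speaker_id"] != current_speaker:
            if buf:
                lines.append(f"{current_speaker}: {' '.join(buf)}")
            current_speaker, buf = w["speaker_id"], []
        buf.append(w["text"])
    if buf:
        lines.append(f"{current_speaker}: {' '.join(buf)}")
    return "\n".join(lines)

sample = [
    {"text": "Hello", "speaker_id": "A"},
    {"text": "there.", "speaker_id": "A"},
    {"text": "Hi!", "speaker_id": "B"},
]
print(to_transcript(sample))
```

The same fold works for subtitle generation if you also carry the word timestamps through.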
Music generation:
result = client.text_to_music.convert(
    text="Upbeat acoustic guitar, warm and positive, 60 seconds, no vocals",
    make_instrumental=True,
)
Studio-grade output, commercially licensed for standard use. An additional license is required for advertising, film, TV, and games.
Sound effects:
result = client.text_to_sound_effects.convert(
    text="Heavy rain on a tin roof, distant thunder",
    duration_seconds=10,
)
Four variations generated per request. Output is loopable, which matters for game audio and ambient tracks.
Building Without Backend Code
If you are using vibe-coding tools — Lovable, Replit, v0, or Cursor — you can integrate ElevenLabs without writing backend code yourself. Describe the audio feature you want and these tools will handle the API integration. The ElevenLabs API is explicitly compatible with these workflows, with documentation written to support this use case.
Production Considerations
Authentication — API keys can be scoped to specific workspaces. Rotate keys and never expose them client-side in browser applications.
Error handling — the SDKs surface typed errors for rate limits, invalid voice IDs, and model unavailability. Build retry logic for transient failures in production pipelines.
Compliance — SOC 2, HIPAA, and GDPR compliant. EU and India data residency options available. Zero retention mode available for sensitive audio. Dedicated support and custom SLAs available at enterprise tier.
Monitoring — usage is tracked per API key in the developer dashboard. Monitor character counts and audio minutes against your billing expectations.
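The retry advice above can be sketched as a generic wrapper. Which exception types count as transient depends on your SDK version, so they are passed in rather than assumed:

```python
import random
import time

# Retry-with-backoff wrapper for transient failures (rate limits, network blips).
def with_retries(fn, *, retries=3, base_delay=1.0, retry_on=(Exception,)):
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            # exponential backoff with jitter: base, 2x, 4x, ...
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# audio = with_retries(lambda: client.text_to_speech.convert(...),
#                      retry_on=(SomeSdkRateLimitError,))  # hypothetical exception type
```

Jitter matters in production pipelines: without it, many workers that hit a rate limit together retry together and hit it again.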
The API is trusted by teams at Meta, Stripe, Perplexity, Twilio, and Chess.com.
Frequently Asked Questions
What is the ElevenLabs API? Programmatic access to text to speech, speech to text, AI music, sound effects, voice cloning, dubbing, and more. Everything in the ElevenLabs platform, accessible via API.
What languages are supported? Python and TypeScript SDKs with streaming support. Flutter, Swift, and Kotlin for mobile. REST API for any other language.
How does pricing work? Usage-based billing with no minimum commitment. TTS is billed per character, STT per audio minute, music and SFX per generation, dubbing per source audio minute. Check elevenlabs.io/pricing for current rates.
How fast is the TTS API? Streaming responses in under 500ms. The Flash model is optimised for latency-sensitive applications.
Is output commercially licensed? Yes. Music requires an additional license for advertising, film, TV, games, and enterprise distribution.
Is it enterprise-compliant? SOC 2, HIPAA, and GDPR compliant. EU and India data residency options. Zero retention mode available.