How to Add Voice to Your App Using the ElevenLabs API

By VoiceToolsReview Editorial Team

Affiliate link — we may earn a small commission.

Add voice to your app with ElevenLabs

Get an API key and make your first call in minutes. Usage-based billing, no minimum commitment.

Adding voice to a product used to mean either building TTS infrastructure yourself or accepting low-quality output from commodity APIs. The ElevenLabs API is a practical third option: high-quality text to speech, transcription, and sound-effect generation, all accessible through a well-documented API with SDKs for every common stack.

This guide covers the most common integration patterns — TTS in a web app, real-time transcription, and voice agents — with concrete code examples.

The Core Pattern: Never Expose the API Key Client-Side

Before anything else: the API key goes on the server, not in the browser. Any pattern where an API key is embedded in client-side JavaScript is a security risk.

The standard pattern:

  1. Client sends text (or audio) to your backend
  2. Backend calls the ElevenLabs API
  3. Backend streams or forwards the audio response to the client

This applies regardless of your stack.

Adding TTS to a Web App (Node + React)

Backend (Node/Express):

import express from "express";
import { ElevenLabsClient } from "elevenlabs";

const app = express();
app.use(express.json()); // parse JSON bodies; without this, req.body is undefined

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

app.post("/api/speak", async (req, res) => {
  const { text } = req.body;

  res.setHeader("Content-Type", "audio/mpeg");
  res.setHeader("Transfer-Encoding", "chunked");

  const audio = await client.textToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
    text,
    model_id: "eleven_flash_v2_5",
    output_format: "mp3_44100_128",
  });

  // the SDK returns a readable stream; pipe it straight to the response
  audio.pipe(res);
});

app.listen(3000);

Frontend (React):

async function speak(text: string) {
  const response = await fetch("/api/speak", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });

  const blob = await response.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.play();
}

For longer text, stream the audio in chunks and start playback before the full response arrives, using the MediaSource API or a Web Audio pipeline. This gives you the sub-500ms first-audio experience that makes voice feel responsive.
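
Here is a minimal sketch of the MediaSource approach, reusing the /api/speak endpoint and MIME type from the example above. It is a sketch, not production code: error handling is omitted, and playback must be triggered from a user gesture to satisfy autoplay policies.

async function speakStreaming(text: string) {
  const mediaSource = new MediaSource();
  const audio = new Audio(URL.createObjectURL(mediaSource));
  audio.play(); // call from a click handler so autoplay policies allow it

  mediaSource.addEventListener("sourceopen", async () => {
    // assumes the browser supports "audio/mpeg" in Media Source Extensions
    const sourceBuffer = mediaSource.addSourceBuffer("audio/mpeg");

    const response = await fetch("/api/speak", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });
    const reader = response.body!.getReader();

    const waitForIdle = () =>
      new Promise<void>((resolve) =>
        sourceBuffer.updating
          ? sourceBuffer.addEventListener("updateend", () => resolve(), { once: true })
          : resolve()
      );

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      await waitForIdle(); // appendBuffer throws if an append is still in flight
      sourceBuffer.appendBuffer(value);
    }

    await waitForIdle();
    mediaSource.endOfStream();
  });
}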

Get your ElevenLabs API key

Adding TTS to a Python Backend

import os

from elevenlabs.client import ElevenLabs
from flask import Flask, request, Response

app = Flask(__name__)
client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

@app.route("/speak", methods=["POST"])
def speak():
    text = request.json["text"]

    def generate():
        audio = client.text_to_speech.convert_as_stream(
            voice_id="JBFqnCBsd6RMkjVDRZzb",
            text=text,
            model_id="eleven_flash_v2_5",
        )
        for chunk in audio:
            yield chunk

    return Response(generate(), mimetype="audio/mpeg")

The convert_as_stream method yields audio chunks as they are generated. Forwarding them directly to the HTTP response means the client starts receiving audio without waiting for the full generation.

Adding Real-Time Transcription (STT)

Scribe processes audio at 20-50x real-time with industry-leading accuracy across 99 languages. The integration pattern for transcription:

Python:

def transcribe_file(filepath: str) -> str:
    with open(filepath, "rb") as f:
        result = client.speech_to_text.convert(
            file=f,
            model_id="scribe_v1",
            language_code="en",          # optional — auto-detects if omitted
            diarize=True,                # speaker identification
            timestamps_granularity="word",  # word-level timing
        )
    return result.text

For real-time microphone transcription, stream audio chunks from the browser to your backend and forward them to the Scribe streaming endpoint:

async def transcribe_stream(audio_chunks):
    async for chunk in client.speech_to_text.stream(
        audio=audio_chunks,
        model_id="scribe_v1",
    ):
        yield chunk.text  # partial transcript as it arrives
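
On the browser side, a minimal capture sketch in TypeScript might look like the following: MediaRecorder chunks the microphone input and a WebSocket carries each chunk to your backend, which forwards it to Scribe as above. The socket URL and message format are assumptions; use whatever route and protocol your backend exposes.

async function streamMicrophone() {
  // hypothetical endpoint: point this at your own backend route
  const socket = new WebSocket("wss://your-backend.example/transcribe");
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // audio/webm is widely supported; Safari may need audio/mp4 instead
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });

  // emit a chunk roughly every 250 ms for low-latency partial transcripts
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(event.data);
    }
  };
  recorder.start(250);

  // partial transcripts relayed back by the backend
  socket.onmessage = (event) => {
    console.log("partial:", event.data);
  };
}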

Use cases: meeting transcription, podcast processing, call centre recordings, live captioning, voice search input.

Building a Voice Agent

A voice agent takes spoken user input, processes it, and responds with spoken audio. The architecture:

User speaks → Microphone → Browser → Backend
Backend → Scribe STT → Text transcript
Text transcript → LLM → Response text
Response text → ElevenLabs TTS → Audio stream
Audio stream → Browser → Speaker

Latency matters here. Use the Flash model for TTS — first audio chunk in under 500ms. Stream the LLM response token-by-token into the TTS API rather than waiting for the full LLM response before generating audio.

from typing import AsyncGenerator

# `transcribe`, `llm`, and AGENT_VOICE_ID are placeholders for your own
# STT wrapper, LLM client, and chosen voice — not part of the SDK.
async def voice_agent_respond(user_audio: bytes) -> AsyncGenerator[bytes, None]:
    # Step 1: transcribe user speech
    transcript = await transcribe(user_audio)

    # Step 2: generate LLM response (streaming)
    llm_response = await llm.stream(prompt=transcript)

    # Step 3: stream response text into TTS as it arrives
    # (assumes the async client, so the stream can be consumed with `async for`)
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id=AGENT_VOICE_ID,
        text=llm_response,
        model_id="eleven_flash_v2_5",
    )

    async for audio_chunk in audio_stream:
        yield audio_chunk

Forward audio chunks to the browser as they arrive. The user starts hearing the response before it has fully generated — which is what makes a voice agent feel conversational rather than robotic.

ElevenLabs also has an Agents platform

For production voice agents, ElevenLabs offers a dedicated Agents platform with mobile SDKs (Flutter, Swift, Kotlin) and built-in conversation management. The API pattern above is right for custom implementations; the Agents platform is right for standardised conversational deployments.

Adding Voice to a Game

For game projects, the most common pattern is pre-generating audio assets and caching them, rather than real-time generation per line.

NPC dialogue:

import hashlib, os

def get_npc_line(text: str, voice_id: str) -> str:
    # cache by text + voice hash to avoid re-generating
    key = hashlib.md5(f"{voice_id}:{text}".encode()).hexdigest()
    cache_path = f"audio_cache/{key}.mp3"
    os.makedirs("audio_cache", exist_ok=True)  # make sure the cache dir exists

    if not os.path.exists(cache_path):
        audio = client.text_to_speech.convert(
            voice_id=voice_id,
            text=text,
            model_id="eleven_multilingual_v2",
        )
        with open(cache_path, "wb") as f:
            for chunk in audio:
                f.write(chunk)

    return cache_path

Sound effects on demand:

def generate_sfx(description: str) -> bytes:
    result = client.text_to_sound_effects.convert(
        text=description,        # "explosion in a concrete room"
        duration_seconds=2.5,
    )
    return b"".join(result)

For procedurally generated games where dialogue cannot be pre-cached, use the streaming API with the Flash model and buffer a few seconds ahead of playback.
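
The buffering can be a thin wrapper around the chunk stream. Here is a rough, engine-agnostic sketch in TypeScript; withPreBuffer is a hypothetical helper, and the default threshold assumes 128 kbps MP3 (about 16 kB per second of audio):

async function* withPreBuffer(
  chunks: AsyncIterable<Uint8Array>,
  headStartBytes = 64_000 // roughly 4 seconds of 128 kbps MP3
): AsyncGenerator<Uint8Array> {
  const buffered: Uint8Array[] = [];
  let bufferedBytes = 0;

  for await (const chunk of chunks) {
    if (bufferedBytes < headStartBytes) {
      // still filling the head start: hold chunks back
      buffered.push(chunk);
      bufferedBytes += chunk.length;
      if (bufferedBytes >= headStartBytes) yield* buffered;
    } else {
      // head start filled: pass chunks straight through
      yield chunk;
    }
  }

  // stream ended before the head start filled: flush whatever we have
  if (bufferedBytes < headStartBytes) yield* buffered;
}

Feed the wrapped stream into whatever your engine uses for streamed audio playback; the few seconds of head start absorb network jitter mid-line.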

Building Without Backend Code

If you are using Lovable, Replit, v0, or Cursor, you can describe the audio feature you want and these tools will handle the integration. The ElevenLabs API is designed to work with these workflows — you do not need deep engineering experience to add voice to a project.

Example prompt for Lovable or Cursor:

"Add a button that reads the current page content aloud. Use the ElevenLabs API with voice ID JBFqnCBsd6RMkjVDRZzb. Call through a backend endpoint so the API key is not exposed client-side."

The vibe-coding tool generates the frontend component and the backend route, with the API integration handled automatically.

Common Mistakes to Avoid

Exposing the API key client-side — always proxy through a backend endpoint.

Not streaming — waiting for the full audio file to generate before playing it makes voice feel slow. Stream from the first chunk.

Using Multilingual v2 for real-time applications — use Flash for latency-sensitive paths. Multilingual v2 is for quality-first, offline generation.

Not caching generated audio — if the same text will be spoken multiple times (e.g., UI prompts, canned responses), cache the generated audio and serve from cache. Re-generating identical content wastes credits and adds latency.

Not handling rate limits — build retry logic with exponential backoff. The SDK surfaces typed rate limit errors.
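
A minimal backoff wrapper in TypeScript might look like this. The 429 status check is an assumption about the error object's shape; match it to the typed error your SDK version actually throws.

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // assumes the error carries an HTTP status code; adjust to your SDK
      const isRateLimit = err?.statusCode === 429 || err?.status === 429;
      if (!isRateLimit || attempt >= maxAttempts - 1) throw err;

      // exponential backoff with jitter: 0.5s, 1s, 2s, 4s, ...
      const delayMs = 500 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

Wrap each call site, e.g. await withRetry(() => client.textToSpeech.convert(voiceId, options)).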

Build your first voice feature with ElevenLabs

Frequently Asked Questions

How do I add TTS to a React app? Call the ElevenLabs API from a backend endpoint, stream the audio response to the browser, and play it with the Web Audio API or an HTML audio element. Never expose your API key client-side.

How do I build a voice agent? Combine Scribe STT (user speech → text), an LLM (text → response), and ElevenLabs TTS (response → audio). Stream all three for low-latency conversational feel.

Can I build this without writing backend code? Yes. Vibe-coding tools like Lovable, Replit, v0, and Cursor handle the API integration — describe what you want to build.

Which model should I use for a voice agent? Flash (eleven_flash_v2_5) — optimised for latency-sensitive applications, first audio chunk in under 500ms.

How do I add AI voice to a game? Pre-generate and cache NPC dialogue audio assets. Use the streaming API for procedural content. Generate sound effects on demand from text descriptions.
