What Is a Voice Agent? How Conversational AI Works and How to Build One
Affiliate link — we may earn a small commission.
Build a voice agent with the ElevenLabs API
Add real-time speech input and output to any application. Usage-based billing, no minimum commitment.
A voice agent is an AI system that communicates through spoken conversation. You speak — it listens, understands, and responds out loud, in real time. No typing, no clicking. Just a conversation.
This is the interface that phone support systems, in-car assistants, and smart devices have used for years. What has changed is the underlying quality — early voice systems were rigid and brittle. Modern voice agents, built on large language models and neural TTS, handle free-form conversation with the naturalness of a real call.
This guide explains how voice agents work architecturally, and how to build one with the ElevenLabs API.
How a Voice Agent Works
Three components, in sequence, on every turn:
1. Speech to Text (STT) — the user's spoken input is captured and transcribed to text in real time. The STT model must handle natural speech: incomplete sentences, filler words, accents, background noise, and overlapping audio.
2. Language Model (LLM) — the transcribed text is passed to a language model that interprets the user's intent and generates a response. The LLM holds conversation context across turns — it knows what was said earlier in the call.
3. Text to Speech (TTS) — the LLM's text response is synthesised to audio and played back. For a voice agent to feel natural, TTS must be fast and expressive — robotic-sounding synthesis breaks the conversational illusion.
The latencies of the three steps add up. If STT takes 300ms, the LLM takes 400ms, and TTS takes 300ms, the user waits a full second before hearing a response. Optimising for low end-to-end latency across all three is the primary engineering challenge in voice agent development.
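As a schematic, each turn is a sequential pipeline. The sketch below uses hypothetical stt, llm, and tts callables (stand-ins for real model calls, not an actual API) to show how the per-stage delays sum into the pause the user hears:

import time

def run_turn(audio_in, stt, llm, tts):
    """One conversational turn with placeholder model calls."""
    start = time.monotonic()
    text = stt(audio_in)      # transcription, e.g. ~300ms
    reply = llm(text)         # response generation, e.g. ~400ms
    audio_out = tts(reply)    # speech synthesis, e.g. ~300ms
    print(f"User waited {time.monotonic() - start:.2f}s")
    return audio_out

Streaming implementations overlap these stages (synthesis starts on the LLM's first tokens rather than the full response), which is how production agents get the perceived wait well under a second.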
The ElevenLabs Conversational AI API
ElevenLabs provides a Conversational AI API that integrates all three components into a single real-time interface:
- Scribe for speech to text with low-latency transcription
- LLM integration with configurable model (OpenAI, Anthropic, or your own) for response generation
- ElevenLabs v3 for text to speech output in a configured agent voice
- WebSocket-based streaming for real-time bidirectional audio
The API manages turn-taking, interruption handling, and conversation state. You configure the agent's personality, voice, and system prompt — the platform handles the audio processing infrastructure.
Building a Basic Voice Agent
Prerequisites
pip install elevenlabs websockets pyaudio
Note that PyAudio depends on the PortAudio library, which may need installing separately (for example, brew install portaudio on macOS or apt install portaudio19-dev on Debian/Ubuntu).
WebSocket Connection
The Conversational AI API uses WebSocket streaming. Audio flows in both directions simultaneously:
import asyncio
import json
from base64 import b64decode, b64encode

import pyaudio
import websockets

ELEVENLABS_API_KEY = "your_api_key"
AGENT_ID = "your_agent_id"  # configured in the ElevenLabs dashboard

async def run_voice_agent():
    uri = f"wss://api.elevenlabs.io/v1/convai/conversation?agent_id={AGENT_ID}"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}
    # "extra_headers" is the parameter name in the legacy websockets client;
    # websockets 14+ renamed it to "additional_headers"
    async with websockets.connect(uri, extra_headers=headers) as ws:
        print("Connected to voice agent")
        # Start audio capture and playback concurrently
        await asyncio.gather(
            send_audio(ws),
            receive_audio(ws),
        )

if __name__ == "__main__":
    asyncio.run(run_voice_agent())
Get your ElevenLabs API key
Sending Audio Input
async def send_audio(ws):
    """Capture microphone audio and stream it to the agent."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,  # 16 kHz mono PCM input
        input=True,
        frames_per_buffer=1024,
    )
    loop = asyncio.get_running_loop()
    try:
        while True:
            # stream.read blocks, so run it in a worker thread to avoid
            # stalling the playback coroutine on the same event loop
            audio_chunk = await loop.run_in_executor(
                None, lambda: stream.read(1024, exception_on_overflow=False)
            )
            audio_b64 = b64encode(audio_chunk).decode("utf-8")
            await ws.send(json.dumps({"user_audio_chunk": audio_b64}))
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
Receiving and Playing Agent Responses
async def receive_audio(ws):
    """Receive agent audio chunks and play them as they arrive."""
    p = pyaudio.PyAudio()
    output_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,  # must match the agent_output_audio_format (pcm_16000)
        output=True,
    )
    try:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "audio":
                audio_b64 = data.get("audio_event", {}).get("audio_base_64", "")
                if audio_b64:
                    output_stream.write(b64decode(audio_b64))
            elif data.get("type") == "agent_response":
                transcript = data.get("agent_response_event", {}).get("agent_response", "")
                if transcript:
                    print(f"Agent: {transcript}")
            elif data.get("type") == "user_transcript":
                user_text = data.get("user_transcription_event", {}).get("user_transcript", "")
                if user_text:
                    print(f"User: {user_text}")
    finally:
        output_stream.stop_stream()
        output_stream.close()
        p.terminate()
Configuring Your Agent
Agent configuration — personality, voice, system prompt, and LLM settings — is managed in the ElevenLabs dashboard or via the agent configuration API:
import requests

def create_agent(
    name: str,
    voice_id: str,
    system_prompt: str,
    llm_model: str = "gpt-4o",
) -> dict:
    url = "https://api.elevenlabs.io/v1/convai/agents/create"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "name": name,
        "conversation_config": {
            "agent": {
                "prompt": {
                    "prompt": system_prompt,
                    "llm": llm_model,
                    "temperature": 0.7,
                },
                "first_message": "Hello, how can I help you today?",
                "language": "en",
            },
            "tts": {
                "voice_id": voice_id,
                "model_id": "eleven_turbo_v2_5",  # low-latency model for real-time use
                "agent_output_audio_format": "pcm_16000",
            },
            "stt": {
                "model_id": "scribe_v1"
            },
        },
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()
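A quick usage sketch. The voice_id here is a placeholder, and reading agent_id from the response assumes the create endpoint returns one, which is how the ElevenLabs API is documented to behave:

agent = create_agent(
    name="Support Agent",
    voice_id="your_voice_id",  # any voice from your ElevenLabs voice library
    system_prompt="You are a concise, friendly support agent.",
)
AGENT_ID = agent["agent_id"]  # assumes the response includes the new agent's ID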
The eleven_turbo_v2_5 TTS model is optimised for low latency — use it for voice agents rather than the standard v3 model. The quality trade-off is minimal for conversational speech.
Interruption Handling
Real conversations involve interruptions. A voice agent that keeps talking after the user starts speaking feels unnatural. The ElevenLabs Conversational AI API handles interruption detection automatically — when the user begins speaking, agent audio playback stops and the new input is processed.
In your implementation, handle the interruption event type alongside the other branches in receive_audio's message loop:
            elif data.get("type") == "interruption":
                # Agent was interrupted — clear any buffered audio
                print("User interrupted the agent")
                # Clear your audio output buffer here if buffering ahead
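The example above writes each chunk straight to the output stream, so there is little to clear. If you instead buffer agent audio in a queue before playback, drain it on interruption so stale audio is never played. A minimal sketch, assuming an asyncio.Queue named audio_queue that your playback task consumes:

import asyncio

def clear_audio_queue(audio_queue: asyncio.Queue) -> None:
    """Drop any agent audio buffered before the interruption."""
    while True:
        try:
            audio_queue.get_nowait()
        except asyncio.QueueEmpty:
            return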
Voice Agent Use Cases
Customer support — replace or augment call centre operations with a voice agent that handles common queries, collects information, and escalates complex issues. Available 24/7, consistent in tone and process.
Appointment scheduling — a voice interface for booking flows. The agent asks questions, checks availability, and confirms appointments through natural conversation rather than form inputs.
Product demos — let potential customers ask questions about your product in a natural voice conversation rather than navigating docs or waiting for a sales call.
Voice-enabled internal tools — query databases, generate reports, or trigger workflows through spoken commands rather than clicking through a UI.
Educational tutors — a voice tutor that answers questions, explains concepts, and gives feedback through spoken conversation — more engaging than text-based Q&A.
Latency Optimisation
For voice agents, latency is a UX problem. Each 100ms of additional lag makes the conversation feel less natural. Key optimisation points:
- Use the turbo TTS model (eleven_turbo_v2_5) rather than standard models in real-time contexts
- Stream audio immediately — begin playing TTS output as soon as the first chunks arrive, not after full generation completes
- Minimise LLM prompt length — longer system prompts increase first-token latency slightly
- Choose the LLM model appropriately — faster models (GPT-4o-mini, Claude Haiku) reduce latency; more capable models add a few hundred milliseconds
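To know whether these changes help, measure. Below is a rough probe, a hypothetical helper rather than anything in the ElevenLabs SDK: call on_user_transcript when a user_transcript event arrives and on_first_audio on the first audio event of the agent's reply, and the difference approximates the response latency the user experiences:

import time

class LatencyProbe:
    """Rough end-to-end latency: user's final transcript to first reply audio."""

    def __init__(self):
        self.turn_started = None

    def on_user_transcript(self):
        self.turn_started = time.monotonic()

    def on_first_audio(self):
        if self.turn_started is not None:
            print(f"Response latency: {time.monotonic() - self.turn_started:.2f}s")
            self.turn_started = None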
Frequently Asked Questions
What is a voice agent? An AI system that communicates through natural spoken conversation — listens to speech, processes it with an LLM, and responds in synthesised speech in real time.
What are the three components of a voice agent? Speech to text (transcription), language model (understanding and response generation), and text to speech (voice output).
How do I build a voice agent with ElevenLabs? Use the Conversational AI API with a WebSocket connection. Configure your agent via the dashboard or API with a system prompt, voice, and LLM settings.
What TTS model should I use for voice agents? eleven_turbo_v2_5 — it is optimised for low latency in real-time contexts.
How does interruption handling work? The ElevenLabs Conversational AI API detects when the user begins speaking and stops agent audio automatically. Your implementation receives an interruption event via the WebSocket.