What Is a Voice Agent? How Conversational AI Works and How to Build One
Affiliate link — we may earn a small commission.
Build a voice agent with the ElevenLabs API
Add real-time speech input and output to any application. Usage-based billing, no minimum commitment.
A voice agent is an AI system that communicates through spoken conversation. You speak — it listens, understands, and responds out loud, in real time. No typing, no clicking. Just a conversation.
This is the interface that phone support systems, in-car assistants, and smart devices have used for years. What has changed is the underlying quality — early voice systems were rigid and brittle. Modern voice agents, built on large language models and neural TTS, handle free-form conversation with the naturalness of a real call.
This guide explains how voice agents work architecturally, and how to build one with the ElevenLabs API.
How a Voice Agent Works
Three components, in sequence, on every turn:
1. Speech to Text (STT) — the user's spoken input is captured and transcribed to text in real time. The STT model must handle natural speech: incomplete sentences, filler words, accents, background noise, and overlapping audio.
2. Language Model (LLM) — the transcribed text is passed to a language model that interprets the user's intent and generates a response. The LLM holds conversation context across turns — it knows what was said earlier in the call.
3. Text to Speech (TTS) — the LLM's text response is synthesised to audio and played back. For a voice agent to feel natural, TTS must be fast and expressive — robotic-sounding synthesis breaks the conversational illusion.
The latencies of the three steps add up. If STT takes 300ms, the LLM takes 400ms, and TTS takes 300ms, the user waits a full second before hearing a response. Optimising for low end-to-end latency across all three is the primary engineering challenge in voice agent development.
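As a schematic, each turn is a sequential pipeline. The sketch below uses hypothetical stt, llm, and tts callables (stand-ins for real model calls, not an actual API) to show how the per-stage delays sum into the pause the user hears:

import time

def run_turn(audio_in, stt, llm, tts):
    """One conversational turn with placeholder model calls."""
    start = time.monotonic()
    text = stt(audio_in)      # transcription, e.g. ~300ms
    reply = llm(text)         # response generation, e.g. ~400ms
    audio_out = tts(reply)    # speech synthesis, e.g. ~300ms
    print(f"User waited {time.monotonic() - start:.2f}s")
    return audio_out

Streaming implementations overlap these stages (synthesis starts on the LLM's first tokens rather than the full response), which is how production agents get the perceived wait well under a second.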
The ElevenLabs Conversational AI API
ElevenLabs provides a Conversational AI API that integrates all three components into a single real-time interface:
- Scribe for speech to text with low-latency transcription
- LLM integration with configurable model (OpenAI, Anthropic, or your own) for response generation
- ElevenLabs v3 for text to speech output in a configured agent voice
- WebSocket-based streaming for real-time bidirectional audio
The API manages turn-taking, interruption handling, and conversation state. You configure the agent's personality, voice, and system prompt — the platform handles the audio processing infrastructure.
Building a Basic Voice Agent
Prerequisites
pip install elevenlabs websockets pyaudio
Note that PyAudio depends on the PortAudio library, which may need installing separately (for example, brew install portaudio on macOS or apt install portaudio19-dev on Debian/Ubuntu).
WebSocket Connection
The Conversational AI API uses WebSocket streaming. Audio flows in both directions simultaneously:
import asyncio
import json
from base64 import b64decode, b64encode

import pyaudio
import websockets

ELEVENLABS_API_KEY = "your_api_key"
AGENT_ID = "your_agent_id"  # configured in the ElevenLabs dashboard

async def run_voice_agent():
    uri = f"wss://api.elevenlabs.io/v1/convai/conversation?agent_id={AGENT_ID}"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}
    # "extra_headers" is the parameter name in the legacy websockets client;
    # websockets 14+ renamed it to "additional_headers"
    async with websockets.connect(uri, extra_headers=headers) as ws:
        print("Connected to voice agent")
        # Start audio capture and playback concurrently
        await asyncio.gather(
            send_audio(ws),
            receive_audio(ws),
        )

if __name__ == "__main__":
    asyncio.run(run_voice_agent())
Get your ElevenLabs API key
Sending Audio Input
async def send_audio(ws):
    """Capture microphone audio and stream it to the agent."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,  # 16 kHz mono PCM input
        input=True,
        frames_per_buffer=1024,
    )
    loop = asyncio.get_running_loop()
    try:
        while True:
            # stream.read blocks, so run it in a worker thread to avoid
            # stalling the playback coroutine on the same event loop
            audio_chunk = await loop.run_in_executor(
                None, lambda: stream.read(1024, exception_on_overflow=False)
            )
            audio_b64 = b64encode(audio_chunk).decode("utf-8")
            await ws.send(json.dumps({"user_audio_chunk": audio_b64}))
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
Receiving and Playing Agent Responses
async def receive_audio(ws):
    """Receive agent audio chunks and play them as they arrive."""
    p = pyaudio.PyAudio()
    output_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,  # must match the agent_output_audio_format (pcm_16000)
        output=True,
    )
    try:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "audio":
                audio_b64 = data.get("audio_event", {}).get("audio_base_64", "")
                if audio_b64:
                    output_stream.write(b64decode(audio_b64))
            elif data.get("type") == "agent_response":
                transcript = data.get("agent_response_event", {}).get("agent_response", "")
                if transcript:
                    print(f"Agent: {transcript}")
            elif data.get("type") == "user_transcript":
                user_text = data.get("user_transcription_event", {}).get("user_transcript", "")
                if user_text:
                    print(f"User: {user_text}")
    finally:
        output_stream.stop_stream()
        output_stream.close()
        p.terminate()
Configuring Your Agent
Agent configuration — personality, voice, system prompt, and LLM settings — is managed in the ElevenLabs dashboard or via the agent configuration API:
import requests

def create_agent(
    name: str,
    voice_id: str,
    system_prompt: str,
    llm_model: str = "gpt-4o",
) -> dict:
    url = "https://api.elevenlabs.io/v1/convai/agents/create"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "name": name,
        "conversation_config": {
            "agent": {
                "prompt": {
                    "prompt": system_prompt,
                    "llm": llm_model,
                    "temperature": 0.7,
                },
                "first_message": "Hello, how can I help you today?",
                "language": "en",
            },
            "tts": {
                "voice_id": voice_id,
                "model_id": "eleven_turbo_v2_5",  # low-latency model for real-time use
                "agent_output_audio_format": "pcm_16000",
            },
            "stt": {
                "model_id": "scribe_v1"
            },
        },
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()
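A quick usage sketch. The voice_id here is a placeholder, and reading agent_id from the response assumes the create endpoint returns one, which is how the ElevenLabs API is documented to behave:

agent = create_agent(
    name="Support Agent",
    voice_id="your_voice_id",  # any voice from your ElevenLabs voice library
    system_prompt="You are a concise, friendly support agent.",
)
AGENT_ID = agent["agent_id"]  # assumes the response includes the new agent's ID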
The eleven_turbo_v2_5 TTS model is optimised for low latency — use it for voice agents rather than the standard v3 model. The quality trade-off is minimal for conversational speech.
Interruption Handling
Real conversations involve interruptions. A voice agent that keeps talking after the user starts speaking feels unnatural. The ElevenLabs Conversational AI API handles interruption detection automatically — when the user begins speaking, agent audio playback stops and the new input is processed.
In your implementation, handle the interruption event type alongside the other branches in receive_audio's message loop:
            elif data.get("type") == "interruption":
                # Agent was interrupted — clear any buffered audio
                print("User interrupted the agent")
                # Clear your audio output buffer here if buffering ahead
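The example above writes each chunk straight to the output stream, so there is little to clear. If you instead buffer agent audio in a queue before playback, drain it on interruption so stale audio is never played. A minimal sketch, assuming an asyncio.Queue named audio_queue that your playback task consumes:

import asyncio

def clear_audio_queue(audio_queue: asyncio.Queue) -> None:
    """Drop any agent audio buffered before the interruption."""
    while True:
        try:
            audio_queue.get_nowait()
        except asyncio.QueueEmpty:
            return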
Voice Agent Use Cases
Customer support — replace or augment call centre operations with a voice agent that handles common queries, collects information, and escalates complex issues. Available 24/7, consistent in tone and process.
Appointment scheduling — a voice interface for booking flows. The agent asks questions, checks availability, and confirms appointments through natural conversation rather than form inputs.
Product demos — let potential customers ask questions about your product in a natural voice conversation rather than navigating docs or waiting for a sales call.
Voice-enabled internal tools — query databases, generate reports, or trigger workflows through spoken commands rather than clicking through a UI.
Educational tutors — a voice tutor that answers questions, explains concepts, and gives feedback through spoken conversation — more engaging than text-based Q&A.
Latency Optimisation
For voice agents, latency is a UX problem. Each 100ms of additional lag makes the conversation feel less natural. Key optimisation points:
- Use the turbo TTS model (eleven_turbo_v2_5) rather than standard models in real-time contexts
- Stream audio immediately — begin playing TTS output as soon as the first chunks arrive, not after full generation completes
- Minimise LLM prompt length — longer system prompts increase first-token latency slightly
- Choose the LLM model appropriately — faster models (GPT-4o-mini, Claude Haiku) reduce latency; more capable models add a few hundred milliseconds
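To know whether these changes help, measure. Below is a rough probe, a hypothetical helper rather than anything in the ElevenLabs SDK: call on_user_transcript when a user_transcript event arrives and on_first_audio on the first audio event of the agent's reply, and the difference approximates the response latency the user experiences:

import time

class LatencyProbe:
    """Rough end-to-end latency: user's final transcript to first reply audio."""

    def __init__(self):
        self.turn_started = None

    def on_user_transcript(self):
        self.turn_started = time.monotonic()

    def on_first_audio(self):
        if self.turn_started is not None:
            print(f"Response latency: {time.monotonic() - self.turn_started:.2f}s")
            self.turn_started = None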
Frequently Asked Questions
What is a voice agent? An AI system that communicates through natural spoken conversation — listens to speech, processes it with an LLM, and responds in synthesised speech in real time.
What are the three components of a voice agent? Speech to text (transcription), language model (understanding and response generation), and text to speech (voice output).
How do I build a voice agent with ElevenLabs? Use the Conversational AI API with a WebSocket connection. Configure your agent via the dashboard or API with a system prompt, voice, and LLM settings.
What TTS model should I use for voice agents? eleven_turbo_v2_5 — it is optimised for low latency in real-time contexts.
How does interruption handling work? The ElevenLabs Conversational AI API detects when the user begins speaking and stops agent audio automatically. Your implementation receives an interruption event via the WebSocket.