How to Build a Podcast with AI: Using the ElevenLabs API for Automated Audio Production
Build your podcast workflow with the ElevenLabs API
Automate narration, transcription, and audio production with one API. Usage-based billing, no minimum commitment.
Podcast production involves three repetitive audio tasks: narrating scripts, transcribing recordings, and producing consistent audio elements at scale. Each of these can be automated with an API.
The ElevenLabs API covers all three: text to speech for narration, Scribe for transcription, and sound effects generation for production elements. This guide covers how to build a podcast production workflow using the API — from script to audio to transcript.
What the ElevenLabs API Provides for Podcast Production
Text to Speech — convert scripts to narrated audio programmatically. ElevenLabs v3 produces human-like narration with natural pacing and emotional inflection. For solo narration formats — explainer podcasts, news summaries, educational content — the output quality is production-ready.
Scribe (Speech to Text) — transcribe episode recordings with word-level timestamps and speaker diarization. For interview formats, Scribe identifies individual speakers automatically and produces structured transcripts. Use these for show notes, SEO content, or searchable episode archives.
Sound Effects — generate sound effects from text descriptions programmatically. Useful for consistent production elements across episodes: transitions, ambient backgrounds, and idents.
Voice Cloning — clone a host voice from a sample recording. The cloned voice generates narration in the host's voice from text — useful for ad reads, episode intros, or producing content when the host is unavailable.
Building a Script-to-Audio Narration Pipeline
The simplest automated podcast workflow converts a written script to narrated audio:
```python
import requests

ELEVENLABS_API_KEY = "your_api_key"
VOICE_ID = "your_voice_id"  # narrator voice from Voice Library or cloned voice

def generate_narration(script: str, output_path: str) -> str:
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "text": script,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.65,
            "similarity_boost": 0.80,
            "style": 0.10,
            "use_speaker_boost": True,
        },
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)
    return output_path
```
For podcasts longer than a few minutes, use streaming to avoid buffering the full audio in memory before writing:
```python
def generate_narration_streaming(script: str, output_path: str) -> str:
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "text": script,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.65,
            "similarity_boost": 0.80,
        },
    }
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        with open(output_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return output_path
```
Transcribing Episodes with Scribe
For interview formats or any episode where audio is recorded rather than scripted, Scribe provides transcription with speaker diarization:
```python
def transcribe_episode(audio_path: str) -> dict:
    url = "https://api.elevenlabs.io/v1/speech-to-text"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}
    with open(audio_path, "rb") as audio_file:
        files = {"file": (audio_path, audio_file, "audio/mpeg")}
        data = {
            "model_id": "scribe_v1",
            "diarize": True,                   # identify individual speakers
            "timestamps_granularity": "word",  # word-level timestamps
        }
        response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()
    return response.json()

def format_show_notes(transcript: dict) -> str:
    """Convert a Scribe transcript into formatted show notes."""
    lines = []
    for utterance in transcript.get("utterances", []):
        speaker = utterance.get("speaker", "Speaker")
        text = utterance.get("text", "")
        start = utterance.get("start", 0)
        timestamp = f"[{int(start // 60):02d}:{int(start % 60):02d}]"
        lines.append(f"**{speaker}** {timestamp}: {text}")
    return "\n\n".join(lines)
```
The `diarize: True` parameter tells Scribe to label speech by speaker — output includes a `speaker` field per utterance, labelled `speaker_0`, `speaker_1`, etc. For a two-person interview, you get a structured transcript you can use directly in show notes or pass to an LLM for summarisation.
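Before summarising, it can help to check the balance of a conversation. A small sketch of that downstream step — assuming the same per-utterance shape (`speaker`, `text`) used in `format_show_notes` above:

```python
def speaker_stats(transcript: dict) -> dict:
    """Count words spoken per speaker in a diarized Scribe-style transcript."""
    stats: dict[str, int] = {}
    for utterance in transcript.get("utterances", []):
        speaker = utterance.get("speaker", "speaker_0")
        stats[speaker] = stats.get(speaker, 0) + len(utterance.get("text", "").split())
    return stats
```

A lopsided word count is a quick signal that diarization mislabelled a speaker, or that an interview needs tighter editing.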
Voice Cloning for Consistent Host Audio
If your podcast host records their voice, clone it once and use it for narration, ad reads, and intros without recording sessions:
```python
def clone_voice(sample_audio_path: str, name: str, description: str) -> str:
    """Clone a voice from an audio sample. Returns the voice_id."""
    url = "https://api.elevenlabs.io/v1/voices/add"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}
    with open(sample_audio_path, "rb") as f:
        files = {"files": (sample_audio_path, f, "audio/mpeg")}
        data = {
            "name": name,
            "description": description,
        }
        response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()
    return response.json()["voice_id"]
```
The returned `voice_id` is reusable across all future TTS calls. Store it — you will use it as the `VOICE_ID` in every narration request.
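One minimal way to persist it is a small JSON config next to your pipeline scripts (the filename here is an arbitrary choice, not anything the API requires):

```python
import json
from pathlib import Path

CONFIG_PATH = Path("podcast_config.json")  # hypothetical config file

def save_voice_id(voice_id: str) -> None:
    """Persist the cloned voice_id so later runs can reuse it."""
    CONFIG_PATH.write_text(json.dumps({"voice_id": voice_id}))

def load_voice_id() -> str:
    """Read the stored voice_id back for use as VOICE_ID."""
    return json.loads(CONFIG_PATH.read_text())["voice_id"]
```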
Full Episode Production Workflow
A complete automated pipeline for a narrated podcast:
```python
def produce_episode(script_path: str, episode_number: int) -> dict:
    """Full pipeline: script → narrated audio → show notes."""
    with open(script_path, "r") as f:
        script = f.read()

    # Generate narration
    output_audio = f"episode_{episode_number:03d}_narration.mp3"
    generate_narration_streaming(script, output_audio)

    # Transcribe the generated audio for show notes
    transcript = transcribe_episode(output_audio)
    show_notes = format_show_notes(transcript)

    # Save show notes
    show_notes_path = f"episode_{episode_number:03d}_show_notes.md"
    with open(show_notes_path, "w") as f:
        f.write(show_notes)

    return {
        "audio": output_audio,
        "show_notes": show_notes_path,
        "word_count": len(script.split()),
    }
```
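Very long scripts may exceed the API's per-request character limit, so it can be worth splitting a script on paragraph boundaries before narrating each piece. A sketch — the 4,000-character default is an assumption, not a documented limit; check the current API documentation for your model:

```python
def chunk_script(script: str, max_chars: int = 4000) -> list[str]:
    """Split a script into TTS-sized chunks, breaking only on paragraph boundaries.

    A single paragraph longer than max_chars is emitted as its own chunk, uncut.
    """
    chunks: list[str] = []
    current = ""
    for para in script.split("\n\n"):
        # Flush the running chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Narrate each chunk separately, then concatenate the audio files in order. Breaking on paragraphs rather than mid-sentence keeps the generated pacing natural at the joins.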
Voice Settings for Podcast Narration
Voice settings affect how the narration sounds over a full episode. Recommended starting points:
```python
PODCAST_VOICE_SETTINGS = {
    "stability": 0.65,          # consistent delivery, not robotic
    "similarity_boost": 0.80,   # preserve voice character across the episode
    "style": 0.10,              # slight expressiveness without over-emoting
    "use_speaker_boost": True,  # improved clarity for spoken audio
}

# For ad reads — slightly more energy
AD_READ_SETTINGS = {
    "stability": 0.55,
    "similarity_boost": 0.75,
    "style": 0.25,
    "use_speaker_boost": True,
}
```
Test your settings on a representative passage — a mix of calm narration, a question, and an emphatic point — before running the full episode.
Generating Sound Effects for Episode Elements
Produce consistent audio elements for transitions, intros, and outros:
```python
def generate_sfx(description: str, output_path: str, duration_seconds: float = 3.0):
    url = "https://api.elevenlabs.io/v1/sound-generation"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "text": description,
        "duration_seconds": duration_seconds,
        "prompt_influence": 0.3,
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)

# Episode elements
generate_sfx("soft podcast intro chime, warm and professional", "intro_chime.mp3", 3.0)
generate_sfx("smooth transition whoosh, subtle", "transition.mp3", 1.5)
generate_sfx("gentle outro music fade", "outro.mp3", 4.0)
```
Generate these once, cache them, and reuse across every episode for consistent branding.
Costs and Scaling
API usage is billed per character (TTS) and per audio minute (Scribe). For podcast production:
- A 20-minute scripted episode (~3,000 words) generates roughly 18,000 characters of TTS
- A 45-minute interview recording transcribes at the Scribe per-minute rate
- Sound effect generation is billed separately, per generation
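The 18,000-character figure above assumes roughly six characters per English word, spaces included — a useful back-of-envelope ratio for budgeting:

```python
def estimate_tts_characters(word_count: int, chars_per_word: float = 6.0) -> int:
    """Rough TTS billing estimate: English averages ~6 characters per word, spaces included."""
    return int(word_count * chars_per_word)
```

Multiply the result by the per-character rate on your plan to project the cost of an episode before generating it.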
At scale, cache all generated audio — if the same script segment runs unchanged, store the output file and serve it rather than regenerating. This is especially relevant for intro/outro segments and recurring ad reads.
Frequently Asked Questions
Can I use AI to build a podcast?
Yes. TTS generates narration from scripts; Scribe transcribes interview recordings; voice cloning lets you use a host voice programmatically.
How does Scribe handle multiple speakers?
Set diarize: True in the Scribe request. The API identifies speakers and labels each utterance with a speaker ID. For interviews, you get a structured transcript per speaker.
Can I clone my podcast voice?
Yes. Upload a clean sample of the host voice to the Voice Cloning endpoint. The cloned voice is then available as a voice_id for all future TTS requests.
What voice settings work best for podcast narration?
Stability 0.60–0.70, similarity boost 0.75–0.85. Test on representative content before committing to full episode generation.
How much does it cost?
Usage-based per character (TTS) and per minute (Scribe). No minimum commitment. Check elevenlabs.io/pricing for current rates.