AI Voice Review

How to Build a Podcast with AI: Using the ElevenLabs API for Automated Audio Production

By VoiceToolsReview Editorial Team

Last updated:

Affiliate link — we may earn a small commission.

Build your podcast workflow with the ElevenLabs API

Automate narration, transcription, and audio production with one API. Usage-based billing, no minimum commitment.

Podcast production involves three repetitive audio tasks: narrating scripts, transcribing recordings, and producing consistent audio assets at scale. Each of these can be automated with an API.

The ElevenLabs API covers all three: text to speech for narration, Scribe for transcription, and sound effects generation for production elements. This guide covers how to build a podcast production workflow using the API — from script to audio to transcript.

What the ElevenLabs API Provides for Podcast Production

Text to Speech — convert scripts to narrated audio programmatically. ElevenLabs v3 produces human-like narration with natural pacing and emotional inflection. For solo narration formats — explainer podcasts, news summaries, educational content — the output quality is production-ready.

Scribe (Speech to Text) — transcribe episode recordings with word-level timestamps and speaker diarization. For interview formats, Scribe identifies individual speakers automatically and produces structured transcripts. Use these for show notes, SEO content, or searchable episode archives.

Sound Effects — generate sound effects from text descriptions programmatically. Useful for consistent production elements across episodes: transitions, ambient backgrounds, and idents.

Voice Cloning — clone a host voice from a sample recording. The cloned voice generates narration in the host's voice from text — useful for ad reads, episode intros, or producing content when the host is unavailable.

Building a Script-to-Audio Narration Pipeline

The simplest automated podcast workflow converts a written script to narrated audio:

import requests

ELEVENLABS_API_KEY = "your_api_key"
VOICE_ID = "your_voice_id"  # narrator voice from Voice Library or cloned voice

def generate_narration(script: str, output_path: str):
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json"
    }
    payload = {
        "text": script,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.65,
            "similarity_boost": 0.80,
            "style": 0.10,
            "use_speaker_boost": True
        }
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)
    return output_path

For podcasts longer than a few minutes, use streaming to avoid buffering the full audio in memory before writing:

def generate_narration_streaming(script: str, output_path: str):
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json"
    }
    payload = {
        "text": script,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.65,
            "similarity_boost": 0.80
        }
    }
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        with open(output_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return output_path
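Very long scripts can exceed per-request character limits (the exact limit depends on model and plan — check the current API docs; the budget below is an assumption). A minimal sketch that splits a script on paragraph boundaries so each chunk can be sent to generate_narration_streaming separately:

```python
import re

MAX_CHUNK_CHARS = 2500  # assumed per-request budget; verify against current API limits

def chunk_script(script: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Split a script into paragraph-aligned chunks, each under max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", script) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The resulting MP3 chunks can be joined with a tool such as ffmpeg; naive byte concatenation of MP3 files happens to play in many players but is not robust.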
Get your ElevenLabs API key

Transcribing Episodes with Scribe

For interview formats or any episode where audio is recorded rather than scripted, Scribe provides transcription with speaker diarization:

def transcribe_episode(audio_path: str) -> dict:
    url = "https://api.elevenlabs.io/v1/speech-to-text"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}

    with open(audio_path, "rb") as audio_file:
        files = {"file": (audio_path, audio_file, "audio/mpeg")}
        data = {
            "model_id": "scribe_v1",
            "diarize": "true",                # identify individual speakers; multipart form values are sent as strings
            "timestamps_granularity": "word"  # word-level timestamps
        }
        response = requests.post(url, headers=headers, files=files, data=data)

    response.raise_for_status()
    return response.json()

def format_show_notes(transcript: dict) -> str:
    """Convert a Scribe transcript into formatted show notes."""
    lines = []
    for utterance in transcript.get("utterances", []):
        speaker = utterance.get("speaker", "Speaker")
        text = utterance.get("text", "")
        start = utterance.get("start", 0)
        timestamp = f"[{int(start // 60):02d}:{int(start % 60):02d}]"
        lines.append(f"**{speaker}** {timestamp}: {text}")
    return "\n\n".join(lines)

Enabling the diarize parameter tells Scribe to label speech by speaker — the output includes a speaker field per utterance, labelled speaker_0, speaker_1, and so on. For a two-person interview, you get a structured transcript you can use directly in show notes or pass to an LLM for summarisation.
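Before passing a transcript to an LLM, it can help to collapse consecutive utterances from the same speaker into single turns — the prompt gets shorter and the structure easier to read. A sketch, assuming the same utterances shape used in format_show_notes above:

```python
def merge_speaker_turns(transcript: dict) -> list[dict]:
    """Collapse consecutive utterances by the same speaker into single turns."""
    turns = []
    for utterance in transcript.get("utterances", []):
        speaker = utterance.get("speaker", "Speaker")
        text = utterance.get("text", "").strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous turn: append to it
            turns[-1]["text"] += " " + text
        else:
            turns.append({
                "speaker": speaker,
                "text": text,
                "start": utterance.get("start", 0),
            })
    return turns
```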

Voice Cloning for Consistent Host Audio

If your podcast host records their voice, clone it once and use it for narration, ad reads, and intros without recording sessions:

def clone_voice(sample_audio_path: str, name: str, description: str) -> str:
    """Clone a voice from an audio sample. Returns the voice_id."""
    url = "https://api.elevenlabs.io/v1/voices/add"
    headers = {"xi-api-key": ELEVENLABS_API_KEY}

    with open(sample_audio_path, "rb") as f:
        files = {"files": (sample_audio_path, f, "audio/mpeg")}
        data = {
            "name": name,
            "description": description
        }
        response = requests.post(url, headers=headers, files=files, data=data)

    response.raise_for_status()
    return response.json()["voice_id"]

The returned voice_id is reusable across all future TTS calls. Store it — you will use it as the VOICE_ID in every narration request.

Start building with the ElevenLabs API

Full Episode Production Workflow

A complete automated pipeline for a narrated podcast:

def produce_episode(script_path: str, episode_number: int) -> dict:
    """Full pipeline: script → narrated audio → transcript → show notes."""
    with open(script_path, "r") as f:
        script = f.read()

    # Generate narration
    output_audio = f"episode_{episode_number:03d}_narration.mp3"
    generate_narration_streaming(script, output_audio)

    # Transcribe the generated audio for show notes
    transcript = transcribe_episode(output_audio)
    show_notes = format_show_notes(transcript)

    # Save show notes
    show_notes_path = f"episode_{episode_number:03d}_show_notes.md"
    with open(show_notes_path, "w") as f:
        f.write(show_notes)

    return {
        "audio": output_audio,
        "show_notes": show_notes_path,
        "word_count": len(script.split())
    }

Voice Settings for Podcast Narration

Voice settings affect how the narration sounds over a full episode. Recommended starting points:

PODCAST_VOICE_SETTINGS = {
    "stability": 0.65,        # consistent delivery, not robotic
    "similarity_boost": 0.80, # preserve voice character across the episode
    "style": 0.10,            # slight expressiveness without over-emoting
    "use_speaker_boost": True # improved clarity for spoken audio
}

# For ad reads — slightly more energy
AD_READ_SETTINGS = {
    "stability": 0.55,
    "similarity_boost": 0.75,
    "style": 0.25,
    "use_speaker_boost": True
}

Test your settings on a representative passage — a mix of calm narration, a question, and an emphatic point — before running the full episode.
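One way to run that test is to render the same passage once per candidate settings profile and listen side by side. A sketch, assuming a generate function variant that accepts a voice_settings dict (the stock generate_narration above hard-codes its settings):

```python
# A passage mixing calm narration, a question, and an emphatic point
TEST_PASSAGE = (
    "Welcome back to the show. "
    "Have you ever wondered why this part works? "
    "It matters more than you might think."
)

SETTING_VARIANTS = {
    "stable": {"stability": 0.70, "similarity_boost": 0.80},
    "expressive": {"stability": 0.55, "similarity_boost": 0.75, "style": 0.25},
}

def render_variants(generate_fn, passage: str = TEST_PASSAGE) -> list[str]:
    """Render the same passage once per settings variant for side-by-side listening."""
    outputs = []
    for name, settings in SETTING_VARIANTS.items():
        path = f"settings_test_{name}.mp3"
        generate_fn(passage, path, settings)
        outputs.append(path)
    return outputs
```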

Generating Sound Effects for Episode Elements

Produce consistent audio elements for transitions, intros, and outros:

def generate_sfx(description: str, output_path: str, duration_seconds: float = 3.0):
    url = "https://api.elevenlabs.io/v1/sound-generation"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json"
    }
    payload = {
        "text": description,
        "duration_seconds": duration_seconds,
        "prompt_influence": 0.3
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)

# Episode elements
generate_sfx("soft podcast intro chime, warm and professional", "intro_chime.mp3", 3.0)
generate_sfx("smooth transition whoosh, subtle", "transition.mp3", 1.5)
generate_sfx("gentle outro music fade", "outro.mp3", 4.0)

Generate these once, cache them, and reuse across every episode for consistent branding.

Costs and Scaling

API usage is billed per character (TTS) and per audio minute (Scribe). For podcast production:

  • A 20-minute scripted episode (~3,000 words) is roughly 18,000 characters of TTS usage (about six characters per word, spaces included)
  • A 45-minute interview recording transcribes at the Scribe per-minute rate
  • Sound effects generation is billed separately, per generation
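A back-of-envelope estimator makes those line items concrete. The rates below are placeholders, not ElevenLabs prices — substitute current figures from elevenlabs.io/pricing:

```python
# Placeholder rates -- replace with current figures from elevenlabs.io/pricing
TTS_COST_PER_1K_CHARS = 0.30   # USD per 1,000 TTS characters (assumed)
SCRIBE_COST_PER_MINUTE = 0.40  # USD per transcribed audio minute (assumed)

def estimate_episode_cost(script_chars: int = 0, transcribe_minutes: float = 0.0) -> float:
    """Rough per-episode cost from character and minute counts."""
    tts = (script_chars / 1000) * TTS_COST_PER_1K_CHARS
    stt = transcribe_minutes * SCRIBE_COST_PER_MINUTE
    return round(tts + stt, 2)
```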

At scale, cache all generated audio — if the same script segment runs unchanged, store the output file and serve it rather than regenerating. This is especially relevant for intro/outro segments and recurring ad reads.
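A sketch of that cache, keyed on a hash of the script text so any edit forces regeneration (the cache directory name is arbitrary):

```python
import hashlib
import os

CACHE_DIR = "tts_cache"  # hypothetical cache location

def cached_narration(script: str, generate_fn) -> str:
    """Return cached audio for a script segment, generating only on a cache miss."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Content-addressed key: identical text always maps to the same file
    key = hashlib.sha256(script.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(CACHE_DIR, f"{key}.mp3")
    if not os.path.exists(path):
        generate_fn(script, path)
    return path
```

Recurring intros, outros, and ad reads then cost one generation each, ever, instead of one per episode.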

Frequently Asked Questions

Can I use AI to build a podcast? Yes. TTS generates narration from scripts; Scribe transcribes interview recordings; voice cloning lets you use a host voice programmatically.

How does Scribe handle multiple speakers? Enable diarization in the Scribe request. The API identifies speakers and labels each utterance with a speaker ID. For interviews, you get a structured per-speaker transcript.

Can I clone my podcast voice? Yes. Upload a clean sample of the host voice to the Voice Cloning endpoint. The cloned voice is then available as a voice_id for all future TTS requests.

What voice settings work best for podcast narration? Stability 0.60–0.70, similarity boost 0.75–0.85. Test on representative content before committing to full episode generation.

How much does it cost? Usage-based per character (TTS) and per minute (Scribe). No minimum commitment. Check elevenlabs.io/pricing for current rates.

Get your API key — usage-based billing, no minimum

Free: AI Voice Tool Comparison Guide

Which tool wins for your use case, ElevenLabs pricing decoded, and a quick-reference comparison table — sent straight to your inbox. No spam. Unsubscribe anytime.


