Agent Daily
tutorial • intermediate

Low latency voice assistant with ElevenLabs
Nov 2025 • Integrations

Build a low-latency voice assistant using ElevenLabs for speech-to-text and text-to-speech combined with Claude.


This cookbook demonstrates building a low-latency voice assistant by combining ElevenLabs (speech-to-text and text-to-speech) with Claude for intelligent responses. The guide covers installation, API setup, and crucially, latency optimization techniques including Claude's streaming API and sentence-by-sentence TTS synthesis. Performance measurements show streaming reduces perceived latency by ~31% compared to non-streaming approaches, with TTS first-chunk delivery in 0.39 seconds.

Key Points

  • Install ElevenLabs and Anthropic SDKs, configure API keys via .env file for both services
  • Use ElevenLabs speech-to-text (scribe_v1) to transcribe user audio input (~0.54s latency)
  • Send transcribed text to Claude (claude-haiku-4-5) for intelligent response generation
  • Implement Claude streaming API to reduce time-to-first-token by ~31% (0.71s vs 1.03s)
  • Stream TTS responses using ElevenLabs text_to_speech.stream() for first audio chunk in ~0.39s
  • Detect sentence boundaries in Claude's streamed output and synthesize audio sentence-by-sentence
  • Select appropriate ElevenLabs models: eleven_v3 for quality, eleven_turbo_v2_5 for speed
  • Measure latency at each stage (transcription, LLM response, TTS) to identify bottlenecks
  • Use claude-haiku-4-5 model for fast, cost-effective responses suitable for real-time interaction
  • Combine streaming at multiple levels (Claude output + TTS synthesis) for minimal end-to-end latency
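Measuring latency at each stage (transcription, LLM response, TTS) is easiest with a small reusable timer. This is a sketch, not part of the original cookbook; the `timed` helper and `results` dict are illustrative names:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of one pipeline stage under `label`."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

# Usage: wrap each stage of the pipeline in a `timed` block.
results = {}
with timed("transcription", results):
    time.sleep(0.01)  # stand-in for the speech-to-text call
print({k: round(v, 2) for k, v in results.items()})
```

Wrapping the transcription, Claude, and TTS calls this way makes the per-stage numbers directly comparable when hunting bottlenecks.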


Artifacts (5)

requirements.txt (config)
anthropic
elevenlabs
python-dotenv
IPython
.env.example (config)
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
voice_assistant_setup.py (python script)
import io
import os
import time
import anthropic
import elevenlabs
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

assert ELEVENLABS_API_KEY is not None, "ERROR: ELEVENLABS_API_KEY not found"
assert ANTHROPIC_API_KEY is not None, "ERROR: ANTHROPIC_API_KEY not found"

# Initialize clients
elevenlabs_client = elevenlabs.ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
    base_url="https://api.elevenlabs.io"
)
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

# List and select voice
voices = elevenlabs_client.voices.search().voices
selected_voice = voices[0]
VOICE_ID = selected_voice.voice_id
print(f"Selected voice: {selected_voice.name} with ID: {VOICE_ID}")
transcription_and_response.py (python script)
import time
import io

# Speech-to-text transcription
# (audio_data is a file-like object, e.g. io.BytesIO, holding the recorded audio)
audio_data.seek(0)
start_time = time.time()
transcription = elevenlabs_client.speech_to_text.convert(
    file=audio_data,
    model_id="scribe_v1"
)
end_time = time.time()
transcription_time = end_time - start_time
print(f"Transcribed text: {transcription.text}")
print(f"Transcription time: {transcription_time:.2f} seconds")

# Get Claude response with streaming
start_time = time.time()
first_token_time = None
claude_full_response = ""

with anthropic_client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=1000,
    temperature=0,
    messages=[{"role": "user", "content": transcription.text}],
) as stream:
    for text in stream.text_stream:
        claude_full_response += text
        print(text, end="", flush=True)
        if first_token_time is None:
            first_token_time = time.time()

streaming_time_to_first_token = first_token_time - start_time
print(f"\nStreaming time to first token: {streaming_time_to_first_token:.2f} seconds")
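The ~31% improvement quoted above follows directly from the two measured times (1.03s for a non-streaming response versus 0.71s to first streamed token); a quick arithmetic check:

```python
# Perceived-latency improvement from streaming, using the cookbook's measurements.
non_streaming_latency = 1.03  # seconds until the full non-streaming response arrives
streaming_ttft = 0.71         # seconds until the first streamed token arrives
improvement = (non_streaming_latency - streaming_ttft) / non_streaming_latency
print(f"Streaming reduces perceived latency by ~{improvement:.0%}")  # → ~31%
```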
streaming_tts.py (python script)
import re
import time
import io

# Stream Claude response and synthesize audio sentence-by-sentence
sentence_pattern = re.compile(r"[.!?]+")
sentence_buffer = ""
audio_chunks = []
start_time = time.time()
first_audio_time = None

with anthropic_client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=1000,
    temperature=0,
    messages=[{"role": "user", "content": transcription.text}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
        sentence_buffer += text
        
        # Check for sentence boundaries
        if sentence_pattern.search(text):
            # Send complete sentence to TTS
            audio_generator = elevenlabs_client.text_to_speech.stream(
                voice_id=VOICE_ID,
                output_format="mp3_44100_128",
                text=sentence_buffer,
                model_id="eleven_turbo_v2_5",
            )
            
            for chunk in audio_generator:
                if first_audio_time is None:
                    first_audio_time = time.time()
                audio_chunks.append(chunk)
            
            sentence_buffer = ""  # Reset for next sentence

if first_audio_time:
    streaming_tts_latency = first_audio_time - start_time
    print(f"\nStreaming TTS latency: {streaming_tts_latency:.2f} seconds")
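The sentence-buffering loop above can be exercised offline by replaying a fixed token sequence through the same boundary pattern. This is a sketch; `split_stream` and the sample `tokens` list are hypothetical stand-ins for Claude's `stream.text_stream`:

```python
import re

sentence_pattern = re.compile(r"[.!?]+")

def split_stream(tokens):
    """Replay the buffering loop: emit a complete sentence whenever a
    token contains a sentence boundary, then reset the buffer."""
    buffer, sentences = "", []
    for text in tokens:
        buffer += text
        if sentence_pattern.search(text):
            sentences.append(buffer.strip())
            buffer = ""
    return sentences

# Hypothetical token stream standing in for Claude's streamed output.
tokens = ["Hello", " there", ".", " How can", " I help?"]
print(split_stream(tokens))  # → ['Hello there.', 'How can I help?']
```

Each emitted sentence is what gets handed to `text_to_speech.stream()`, so synthesis can start as soon as the first boundary appears rather than after the full response.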