# Evaluating ElevenLabs voice AI

This guide walks you through testing ElevenLabs voice AI capabilities with Promptfoo, from basic text-to-speech quality checks to advanced conversational agent evaluation.
## Prerequisites
- Node.js 20+ installed
- ElevenLabs API key (create one at [elevenlabs.io](https://elevenlabs.io/))
- 15 minutes
## Part 1: Text-to-Speech Quality Testing
Let's start by comparing different voice models and measuring their quality.
### Step 1: Setup

Install Promptfoo and set your API key:

```bash
npm install -g promptfoo
export ELEVENLABS_API_KEY=your_api_key_here
```
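Verify the install before continuing:

```bash
promptfoo --version
```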
### Step 2: Create Your First Config

Create `promptfooconfig.yaml`:
```yaml
description: 'Compare ElevenLabs TTS models for customer service greetings'

prompts:
  - "Thank you for calling TechSupport Inc. My name is Alex, and I'll be assisting you today. How can I help?"

providers:
  - label: Flash Model (Fastest)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      outputFormat: mp3_44100_128
  - label: Turbo Model (Best Quality)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_turbo_v2_5
      outputFormat: mp3_44100_128

tests:
  - description: Both models complete within 3 seconds
    assert:
      - type: latency
        threshold: 3000
  - description: Cost is under $0.01 per greeting
    assert:
      - type: cost
        threshold: 0.01
```
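To widen the comparison, you can add more providers with the same config shape; for example, ElevenLabs' multilingual model:

```yaml
  - label: Multilingual Model
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_multilingual_v2
      outputFormat: mp3_44100_128
```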
### Step 3: Run Your First Eval

```bash
promptfoo eval
```
You'll see results comparing both models:

```
┌─────────────────────────┬──────────┬──────────┐
│ Prompt                  │ Flash    │ Turbo    │
├─────────────────────────┼──────────┼──────────┤
│ Thank you for calling...│ ✓ Pass   │ ✓ Pass   │
│ Latency: <3s            │ 847ms    │ 1,234ms  │
│ Cost: <$0.01            │ $0.003   │ $0.004   │
└─────────────────────────┴──────────┴──────────┘
```
### Step 4: View Results

Open the web UI to listen to the audio:

```bash
promptfoo view
```
## Part 2: Voice Customization

Now let's optimize voice settings for different use cases.

### Step 5: Add Voice Settings

Update your config:
```yaml
description: 'Test voice settings for different scenarios'

prompts:
  - 'Welcome to our automated system.' # Formal announcement
  - 'Hey there! Thanks for reaching out.' # Casual greeting
  - 'I understand your frustration. Let me help.' # Empathetic response

providers:
  - label: Professional Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.8 # Consistent tone
        similarity_boost: 0.85
        speed: 0.95
  - label: Friendly Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.4 # More variation
        similarity_boost: 0.75
        speed: 1.1 # Slightly faster
  - label: Empathetic Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.5
        similarity_boost: 0.7
        style: 0.8 # More expressive
        speed: 0.9 # Slower, calmer

tests:
  - vars:
      scenario: formal
      provider: Professional Voice
    assert:
      - type: javascript
        value: output.includes("Welcome") || output.includes("system")
  - vars:
      scenario: casual
      provider: Friendly Voice
    assert:
      - type: latency
        threshold: 2000
  - vars:
      scenario: empathy
      provider: Empathetic Voice
    assert:
      - type: cost
        threshold: 0.01
```
Run the eval:

```bash
promptfoo eval
promptfoo view # Compare the different voice styles
```
## Part 3: Speech-to-Text Accuracy

Test transcription accuracy by creating a TTS → STT pipeline.

### Step 6: Create Transcription Pipeline

Create `transcription-test.yaml`:
```yaml
description: 'Test TTS → STT accuracy pipeline'

prompts:
  - |
    The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM.
    Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet.
    Conference room B has been reserved for this meeting.

providers:
  # Step 1: Generate audio
  - label: tts-generator
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5

tests:
  - description: Generate audio and verify quality
    provider: tts-generator
    assert:
      - type: javascript
        value: |
          // Verify audio was generated
          const result = JSON.parse(output);
          return result.audio && result.audio.sizeBytes > 0;
```
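The STT config below reads the generated audio from disk. One way to get it there is to export eval results as JSON (`promptfoo eval -c transcription-test.yaml -o results.json`) and extract the audio payload; here is a minimal Node sketch. The exact results layout varies by promptfoo version, and the `audio.data` base64 field is an assumption about the provider's output shape, so adjust the property paths to match your results file:

```javascript
// save-audio.js — extract base64 audio from a promptfoo results file.
// ASSUMPTION: the TTS output JSON carries base64 audio under `audio.data`;
// inspect results.json and adjust the lookups for your promptfoo version.
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Grab the first result's provider output (a JSON string in this pipeline)
const output = JSON.parse(results.results[0].response.output);

fs.mkdirSync('audio', { recursive: true });
fs.writeFileSync('audio/generated-speech.mp3', Buffer.from(output.audio.data, 'base64'));
console.log('Wrote audio/generated-speech.mp3');
```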
Now add STT to verify accuracy. Create a second config, `stt-accuracy.yaml`:
```yaml
description: 'Test STT accuracy'

prompts:
  - file://audio/generated-speech.mp3 # Audio from previous eval

providers:
  - id: elevenlabs:stt
    config:
      modelId: eleven_speech_to_text_v1
      calculateWER: true

tests:
  - vars:
      referenceText: 'The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM. Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet. Conference room B has been reserved for this meeting.'
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          // Check Word Error Rate is under 5%
          if (result.wer_result) {
            console.log('WER:', result.wer_result.wer);
            return result.wer_result.wer < 0.05;
          }
          return false;
```
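Word Error Rate is the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the transcript into the reference, divided by the reference length in words (N): `WER = (S + D + I) / N`. A 5% threshold therefore allows roughly one error per twenty words. For intuition, here is a standalone reference implementation (classic word-level edit distance, independent of whatever promptfoo computes internally):

```javascript
// wer.js — word-level Levenshtein distance divided by reference length
function wer(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);

  // dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );

  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1), // substitution
      );
    }
  }

  return dp[ref.length][hyp.length] / ref.length;
}

console.log(wer('the cat sat', 'the cat sat down')); // 1 insertion / 3 words ≈ 0.33
```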
Run the STT eval:

```bash
promptfoo eval -c stt-accuracy.yaml
```
## Part 4: Conversational Agent Testing

Test a complete voice agent with evaluation criteria.

### Step 7: Create Agent Config

Create `agent-test.yaml`:
```yaml
description: 'Test customer support agent performance'

prompts:
  - |
    User: Hi, I'm having trouble with my account
    User: I can't log in with my password
    User: My email is [email protected]
    User: I already tried resetting it twice

providers:
  - id: elevenlabs:agents
    config:
      # Create an ephemeral agent for testing
      agentConfig:
        name: Support Agent
        prompt: |
          You are a helpful customer support agent for TechCorp.
          Your job is to:
          1. Greet customers warmly
          2. Understand their issue
          3. Collect necessary information (email, account number)
          4. Provide clear next steps
          5. Maintain a professional, empathetic tone
          Never make promises you can't keep. Always set clear expectations.
        voiceId: 21m00Tcm4TlvDq8ikWAM # Rachel
        llmModel: gpt-4o-mini

      # Define evaluation criteria
      evaluationCriteria:
        - name: greeting
          description: Agent greets the user warmly
          weight: 0.8
          passingThreshold: 0.8
        - name: information_gathering
          description: Agent asks for email or account details
          weight: 1.0
          passingThreshold: 0.9
        - name: empathy
          description: Agent acknowledges user frustration
          weight: 0.9
          passingThreshold: 0.7
        - name: next_steps
          description: Agent provides clear next steps
          weight: 1.0
          passingThreshold: 0.9
        - name: professionalism
          description: Agent maintains professional tone
          weight: 0.8
          passingThreshold: 0.8

      # Limit conversation for testing
      maxTurns: 8
      timeout: 60000

tests:
  - description: Agent passes all critical evaluation criteria
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          const criteria = result.analysis.evaluation_criteria_results;

          // Check that critical criteria passed
          const critical = ['information_gathering', 'next_steps', 'professionalism'];
          const criticalPassed = criteria
            .filter(c => critical.includes(c.name))
            .every(c => c.passed);

          console.log('Criteria Results:');
          criteria.forEach(c => {
            console.log(`  ${c.name}: ${c.passed ? '✓' : '✗'} (score: ${c.score.toFixed(2)})`);
          });

          return criticalPassed;
  - description: Agent conversation stays within turn limit
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          return result.transcript.length <= 8;
  - description: Agent responds within reasonable time
    assert:
      - type: latency
        threshold: 60000
```
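The `weight` fields express relative importance, and `passingThreshold` is the per-criterion pass bar. If you want a single headline number across criteria, a weighted average is the usual roll-up; the sketch below is illustrative, not necessarily how ElevenLabs aggregates internally:

```javascript
// Weighted average of per-criterion scores (each score in 0–1).
function overallScore(criteria) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  return criteria.reduce((sum, c) => sum + c.weight * c.score, 0) / totalWeight;
}

// Example with two of the criteria defined above (scores are made up):
const example = [
  { name: 'greeting', weight: 0.8, score: 0.9 },
  { name: 'information_gathering', weight: 1.0, score: 0.95 },
];
console.log(overallScore(example).toFixed(2)); // 0.93
```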
Run the agent eval:

```bash
promptfoo eval -c agent-test.yaml
```
### Step 8: Review Agent Performance

View detailed results:

```bash
promptfoo view
```

In the web UI, you'll see:
- Full conversation transcript
- Evaluation criteria scores
- Pass/fail for each criterion
- Conversation duration and cost
- Audio playback for each turn
## Part 5: Tool Mocking

### Step 9: Add Tool Mocking

Create `agent-with-tools.yaml`:
description: "Test agent with order lookup tool"
prompts:
- |
User: What's the status of my order?
User: Order number ORDER-12345
providers:
- id: elevenlabs:agents
config:
agentConfig:
name: Support Agent with Tools
prompt: You are a support agent. Use the order_lookup tool to check order status.
voiceId: 21m00Tcm4TlvDq8ikWAM
llmModel: gpt-4o
# Define available tools
tools:
- type: function
function:
name: order_lookup
description: Look up order status by order number
parameters:
type: object
properties:
order_number:
type: string
description: The order number (format: ORDER-XXXXX)
required:
- order_number
# Mock tool responses for testing
toolMockConfig:
order_lookup:
response:
order_number: "ORDER-12345"
status: "Shipped"
tracking_number: "1Z999AA10123456784"
expected_delivery: "2024-03-20"
evaluationCriteria:
- name: uses_tool
description: Agent uses the order_lookup tool
weight: 1.0
passingThreshold: 0.9
- name: provides_tracking
description: Agent provides tracking information
weight: 1.0
passingThreshold: 0.9
tests:
- description: Agent successfully looks up order
assert:
- type: javascript
value: |
const result = JSON.parse(output);
// Verify tool was called
const toolCalls = result.transcript.filter(t =>
t.role === 'tool_call'
);
return toolCalls.length > 0;
- type: contains
value: "1Z999AA10123456784" # Tracking number from mock
Run with tool mocking:

```bash
promptfoo eval -c agent-with-tools.yaml
```
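Mocks are also handy for failure paths. Reusing the same `toolMockConfig` shape, you could return an error payload (the `error` field name here is illustrative) and assert that the agent apologizes or offers escalation instead of inventing a status:

```yaml
      # Hypothetical variation: mock a failure instead of a shipped order
      toolMockConfig:
        order_lookup:
          response:
            error: 'Order not found'
```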
## Next Steps

You've learned to:
- ✅ Compare TTS models and voices
- ✅ Customize voice settings for different scenarios
- ✅ Test STT accuracy with WER calculation
- ✅ Evaluate conversational agents with criteria
- ✅ Mock tools for agent testing
### Explore More

- Audio processing: use ElevenLabs audio isolation to remove background noise
- Regression testing: track agent performance over time (a CI sketch follows below)
- Production monitoring: set up continuous testing on a schedule
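For regression testing, one option is running the eval on every push; a minimal GitHub Actions sketch (workflow name and trigger are illustrative):

```yaml
# .github/workflows/voice-eval.yml
name: voice-eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - run: promptfoo eval -c agent-test.yaml
        env:
          ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
```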
## Troubleshooting

### Common Issues
**Agent conversations time out:**

- Increase `maxTurns` and `timeout` in the config (see the snippet below)
- Simplify evaluation criteria
- Use faster LLM models
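For example, using the same fields from the agent config above:

```yaml
providers:
  - id: elevenlabs:agents
    config:
      maxTurns: 12
      timeout: 120000 # 2 minutes
```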
**High costs during testing:**

- Use `gpt-4o-mini` instead of `gpt-4o`
- Enable caching for repeated tests
- Implement LLM cascading
- Test with shorter prompts first
**Evaluation criteria always failing:**

- Start with simple, objective criteria
- Lower passing thresholds during development
- Review the agent transcript to understand its behavior
- Add more specific criteria descriptions
**Audio quality issues:**

- Try different `outputFormat` settings (examples below)
- Adjust voice settings (`stability`, `similarity_boost`)
- Test with different models
- Consider using Turbo over Flash for quality
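For instance, ElevenLabs exposes several output formats (higher bitrates may require a paid tier):

```yaml
    config:
      outputFormat: mp3_44100_192 # higher-bitrate MP3
      # outputFormat: pcm_16000   # raw PCM, useful for downstream processing
```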