# Evaluating ElevenLabs voice AI

This guide walks you through testing ElevenLabs voice AI capabilities with Promptfoo, from basic text-to-speech quality checks to advanced conversational agent evaluation.
## Prerequisites
- Node.js 20+ installed
- ElevenLabs API key (create one at [elevenlabs.io](https://elevenlabs.io/))
- 15 minutes
## Part 1: Text-to-Speech Quality Testing
Let's start by comparing different voice models and measuring their quality.
### Step 1: Setup

Install Promptfoo and set your API key:

```bash
npm install -g promptfoo
export ELEVENLABS_API_KEY=your_api_key_here
```
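Verify the install before continuing:

```bash
promptfoo --version
```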
### Step 2: Create Your First Config

Create `promptfooconfig.yaml`:
```yaml
description: 'Compare ElevenLabs TTS models for customer service greetings'

prompts:
  - "Thank you for calling TechSupport Inc. My name is Alex, and I'll be assisting you today. How can I help?"

providers:
  - label: Flash Model (Fastest)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      outputFormat: mp3_44100_128
  - label: Turbo Model (Best Quality)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_turbo_v2_5
      outputFormat: mp3_44100_128

tests:
  - description: Both models complete within 3 seconds
    assert:
      - type: latency
        threshold: 3000
  - description: Cost is under $0.01 per greeting
    assert:
      - type: cost
        threshold: 0.01
```
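To widen the comparison, you can add more providers with the same config shape; for example, ElevenLabs' multilingual model:

```yaml
  - label: Multilingual Model
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_multilingual_v2
      outputFormat: mp3_44100_128
```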
### Step 3: Run Your First Eval

```bash
promptfoo eval
```
You'll see results comparing both models:

```
┌─────────────────────────┬──────────┬──────────┐
│ Prompt                  │ Flash    │ Turbo    │
├─────────────────────────┼──────────┼──────────┤
│ Thank you for calling...│ ✓ Pass   │ ✓ Pass   │
│ Latency: <3s            │ 847ms    │ 1,234ms  │
│ Cost: <$0.01            │ $0.003   │ $0.004   │
└─────────────────────────┴──────────┴──────────┘
```
### Step 4: View Results

Open the web UI to listen to the audio:

```bash
promptfoo view
```
## Part 2: Voice Customization

Now let's optimize voice settings for different use cases.

### Step 5: Add Voice Settings

Update your config:
```yaml
description: 'Test voice settings for different scenarios'

prompts:
  - 'Welcome to our automated system.' # Formal announcement
  - 'Hey there! Thanks for reaching out.' # Casual greeting
  - 'I understand your frustration. Let me help.' # Empathetic response

providers:
  - label: Professional Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.8 # Consistent tone
        similarity_boost: 0.85
        speed: 0.95
  - label: Friendly Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.4 # More variation
        similarity_boost: 0.75
        speed: 1.1 # Slightly faster
  - label: Empathetic Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.5
        similarity_boost: 0.7
        style: 0.8 # More expressive
        speed: 0.9 # Slower, calmer

tests:
  - vars:
      scenario: formal
      provider: Professional Voice
    assert:
      - type: javascript
        value: output.includes("Welcome") || output.includes("system")
  - vars:
      scenario: casual
      provider: Friendly Voice
    assert:
      - type: latency
        threshold: 2000
  - vars:
      scenario: empathy
      provider: Empathetic Voice
    assert:
      - type: cost
        threshold: 0.01
```
Run the eval:

```bash
promptfoo eval
promptfoo view # Compare the different voice styles
```
## Part 3: Speech-to-Text Accuracy

Test transcription accuracy by creating a TTS → STT pipeline.

### Step 6: Create Transcription Pipeline

Create `transcription-test.yaml`:
```yaml
description: 'Test TTS → STT accuracy pipeline'

prompts:
  - |
    The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM.
    Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet.
    Conference room B has been reserved for this meeting.

providers:
  # Step 1: Generate audio
  - label: tts-generator
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5

tests:
  - description: Generate audio and verify quality
    provider: tts-generator
    assert:
      - type: javascript
        value: |
          // Verify audio was generated
          const result = JSON.parse(output);
          return result.audio && result.audio.sizeBytes > 0;
```
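The STT config below reads the generated audio from disk. One way to get it there is to export eval results as JSON (`promptfoo eval -c transcription-test.yaml -o results.json`) and extract the audio payload; here is a minimal Node sketch. The exact results layout varies by promptfoo version, and the `audio.data` base64 field is an assumption about the provider's output shape, so adjust the property paths to match your results file:

```javascript
// save-audio.js — extract base64 audio from a promptfoo results file.
// ASSUMPTION: the TTS output JSON carries base64 audio under `audio.data`;
// inspect results.json and adjust the lookups for your promptfoo version.
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Grab the first result's provider output (a JSON string in this pipeline)
const output = JSON.parse(results.results[0].response.output);

fs.mkdirSync('audio', { recursive: true });
fs.writeFileSync('audio/generated-speech.mp3', Buffer.from(output.audio.data, 'base64'));
console.log('Wrote audio/generated-speech.mp3');
```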
Now add STT to verify accuracy. Create a second config, `stt-accuracy.yaml`:
```yaml
description: 'Test STT accuracy'

prompts:
  - file://audio/generated-speech.mp3 # Audio from previous eval

providers:
  - id: elevenlabs:stt
    config:
      modelId: eleven_speech_to_text_v1
      calculateWER: true

tests:
  - vars:
      referenceText: 'The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM. Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet. Conference room B has been reserved for this meeting.'
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          // Check Word Error Rate is under 5%
          if (result.wer_result) {
            console.log('WER:', result.wer_result.wer);
            return result.wer_result.wer < 0.05;
          }
          return false;
```
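Word Error Rate is the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the transcript into the reference, divided by the reference length in words (N): `WER = (S + D + I) / N`. A 5% threshold therefore allows roughly one error per twenty words. For intuition, here is a standalone reference implementation (classic word-level edit distance, independent of whatever promptfoo computes internally):

```javascript
// wer.js — word-level Levenshtein distance divided by reference length
function wer(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);

  // dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );

  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1), // substitution
      );
    }
  }

  return dp[ref.length][hyp.length] / ref.length;
}

console.log(wer('the cat sat', 'the cat sat down')); // 1 insertion / 3 words ≈ 0.33
```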
Run the STT eval:

```bash
promptfoo eval -c stt-accuracy.yaml
```
## Part 4: Conversational Agent Testing

Test a complete voice agent with evaluation criteria.

### Step 7: Create Agent Config

Create `agent-test.yaml`:
```yaml
description: 'Test customer support agent performance'

prompts:
  - |
    User: Hi, I'm having trouble with my account
    User: I can't log in with my password
    User: My email is [email protected]
    User: I already tried resetting it twice

providers:
  - id: elevenlabs:agents
    config:
      # Create an ephemeral agent for testing
      agentConfig:
        name: Support Agent
        prompt: |
          You are a helpful customer support agent for TechCorp.
          Your job is to:
          1. Greet customers warmly
          2. Understand their issue
          3. Collect necessary information (email, account number)
          4. Provide clear next steps
          5. Maintain a professional, empathetic tone
          Never make promises you can't keep. Always set clear expectations.
        voiceId: 21m00Tcm4TlvDq8ikWAM # Rachel
        llmModel: gpt-4o-mini

      # Define evaluation criteria
      evaluationCriteria:
        - name: greeting
          description: Agent greets the user warmly
          weight: 0.8
          passingThreshold: 0.8
        - name: information_gathering
          description: Agent asks for email or account details
          weight: 1.0
          passingThreshold: 0.9
        - name: empathy
          description: Agent acknowledges user frustration
          weight: 0.9
          passingThreshold: 0.7
        - name: next_steps
          description: Agent provides clear next steps
          weight: 1.0
          passingThreshold: 0.9
        - name: professionalism
          description: Agent maintains professional tone
          weight: 0.8
          passingThreshold: 0.8

      # Limit conversation for testing
      maxTurns: 8
      timeout: 60000

tests:
  - description: Agent passes all critical evaluation criteria
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          const criteria = result.analysis.evaluation_criteria_results;

          // Check that critical criteria passed
          const critical = ['information_gathering', 'next_steps', 'professionalism'];
          const criticalPassed = criteria
            .filter(c => critical.includes(c.name))
            .every(c => c.passed);

          console.log('Criteria Results:');
          criteria.forEach(c => {
            console.log(`  ${c.name}: ${c.passed ? '✓' : '✗'} (score: ${c.score.toFixed(2)})`);
          });

          return criticalPassed;
  - description: Agent conversation stays within turn limit
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          return result.transcript.length <= 8;
  - description: Agent responds within reasonable time
    assert:
      - type: latency
        threshold: 60000
```
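The `weight` fields express relative importance, and `passingThreshold` is the per-criterion pass bar. If you want a single headline number across criteria, a weighted average is the usual roll-up; the sketch below is illustrative, not necessarily how ElevenLabs aggregates internally:

```javascript
// Weighted average of per-criterion scores (each score in 0–1).
function overallScore(criteria) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  return criteria.reduce((sum, c) => sum + c.weight * c.score, 0) / totalWeight;
}

// Example with two of the criteria defined above (scores are made up):
const example = [
  { name: 'greeting', weight: 0.8, score: 0.9 },
  { name: 'information_gathering', weight: 1.0, score: 0.95 },
];
console.log(overallScore(example).toFixed(2)); // 0.93
```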
Run the agent eval:

```bash
promptfoo eval -c agent-test.yaml
```
### Step 8: Review Agent Performance

View detailed results:

```bash
promptfoo view
```

In the web UI, you'll see:
- Full conversation transcript
- Evaluation criteria scores
- Pass/fail for each criterion
- Conversation duration and cost
- Audio playback for each turn
## Part 5: Tool Mocking

### Step 9: Add Tool Mocking

Create `agent-with-tools.yaml`:
description: "Test agent with order lookup tool"
prompts:
- |
User: What's the status of my order?
User: Order number ORDER-12345
providers:
- id: elevenlabs:agents
config:
agentConfig:
name: Support Agent with Tools
prompt: You are a support agent. Use the order_lookup tool to check order status.
voiceId: 21m00Tcm4TlvDq8ikWAM
llmModel: gpt-4o
# Define available tools
tools:
- type: function
function:
name: order_lookup
description: Look up order status by order number
parameters:
type: object
properties:
order_number:
type: string
description: The order number (format: ORDER-XXXXX)
required:
- order_number
# Mock tool responses for testing
toolMockConfig:
order_lookup:
response:
order_number: "ORDER-12345"
status: "Shipped"
tracking_number: "1Z999AA10123456784"
expected_delivery: "2024-03-20"
evaluationCriteria:
- name: uses_tool
description: Agent uses the order_lookup tool
weight: 1.0
passingThreshold: 0.9
- name: provides_tracking
description: Agent provides tracking information
weight: 1.0
passingThreshold: 0.9
tests:
- description: Agent successfully looks up order
assert:
- type: javascript
value: |
const result = JSON.parse(output);
// Verify tool was called
const toolCalls = result.transcript.filter(t =>
t.role === 'tool_call'
);
return toolCalls.length > 0;
- type: contains
value: "1Z999AA10123456784" # Tracking number from mock
Run with tool mocking:

```bash
promptfoo eval -c agent-with-tools.yaml
```
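Mocks are also handy for failure paths. Reusing the same `toolMockConfig` shape, you could return an error payload (the `error` field name here is illustrative) and assert that the agent apologizes or offers escalation instead of inventing a status:

```yaml
      # Hypothetical variation: mock a failure instead of a shipped order
      toolMockConfig:
        order_lookup:
          response:
            error: 'Order not found'
```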
## Next Steps

You've learned to:
- ✅ Compare TTS models and voices
- ✅ Customize voice settings for different scenarios
- ✅ Test STT accuracy with WER calculation
- ✅ Evaluate conversational agents with criteria
- ✅ Mock tools for agent testing
### Explore More

- Audio processing: use ElevenLabs audio isolation to remove background noise
- Regression testing: track agent performance over time (a CI sketch follows below)
- Production monitoring: set up continuous testing on a schedule
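For regression testing, one option is running the eval on every push; a minimal GitHub Actions sketch (workflow name and trigger are illustrative):

```yaml
# .github/workflows/voice-eval.yml
name: voice-eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - run: promptfoo eval -c agent-test.yaml
        env:
          ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
```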
## Troubleshooting

### Common Issues
**Agent conversations time out:**

- Increase `maxTurns` and `timeout` in the config (see the snippet below)
- Simplify evaluation criteria
- Use faster LLM models
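For example, using the same fields from the agent config above:

```yaml
providers:
  - id: elevenlabs:agents
    config:
      maxTurns: 12
      timeout: 120000 # 2 minutes
```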
**High costs during testing:**

- Use `gpt-4o-mini` instead of `gpt-4o`
- Enable caching for repeated tests
- Implement LLM cascading
- Test with shorter prompts first
**Evaluation criteria always failing:**

- Start with simple, objective criteria
- Lower passing thresholds during development
- Review the agent transcript to understand its behavior
- Add more specific criteria descriptions
**Audio quality issues:**

- Try different `outputFormat` settings (examples below)
- Adjust voice settings (`stability`, `similarity_boost`)
- Test with different models
- Consider using Turbo over Flash for quality
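For instance, ElevenLabs exposes several output formats (higher bitrates may require a paid tier):

```yaml
    config:
      outputFormat: mp3_44100_192 # higher-bitrate MP3
      # outputFormat: pcm_16000   # raw PCM, useful for downstream processing
```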