# Meta Llama API
The Llama API provider enables you to use Meta's hosted Llama models through their official API service. This includes access to the latest Llama 4 multimodal models and Llama 3.3 text models, as well as accelerated variants from partners like Cerebras and Groq.
## Setup
First, you'll need to get an API key from Meta:
- Visit [llama.developer.meta.com](https://llama.developer.meta.com)
- Sign up for an account and join the waitlist
- Create an API key in the dashboard
- Set the API key as an environment variable:
```sh
export LLAMA_API_KEY="your_api_key_here"
```
## Configuration
Use the `llamaapi:` prefix to specify Llama API models:

```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:chat:Llama-3.3-8B-Instruct # Explicit chat format
```
## Provider Options

```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.7 # Controls randomness (0.0-2.0)
      max_tokens: 1000 # Maximum response length
      top_p: 0.9 # Nucleus sampling parameter
      frequency_penalty: 0 # Reduces repetition (-2.0 to 2.0)
      presence_penalty: 0 # Encourages topic diversity (-2.0 to 2.0)
      stream: false # Enable streaming responses
```
## Available Models
### Meta-Hosted Models
#### Llama 4 (Multimodal)
- **Llama-4-Maverick-17B-128E-Instruct-FP8**: Industry-leading multimodal model with image and text understanding
- **Llama-4-Scout-17B-16E-Instruct-FP8**: Class-leading multimodal model with superior visual intelligence
Both Llama 4 models support:
- Input: Text and images
- Output: Text
- Context Window: 128k tokens
- Rate Limits: 3,000 RPM, 1M TPM
#### Llama 3.3 (Text-Only)
- **Llama-3.3-70B-Instruct**: Enhanced-performance text model
- **Llama-3.3-8B-Instruct**: Lightweight, ultra-fast variant
Both Llama 3.3 models support:
- Input: Text only
- Output: Text
- Context Window: 128k tokens
- Rate Limits: 3,000 RPM, 1M TPM
### Accelerated Variants (Preview)
For applications requiring ultra-low latency:
- **Cerebras-Llama-4-Maverick-17B-128E-Instruct** (32k context, 900 RPM, 300k TPM)
- **Cerebras-Llama-4-Scout-17B-16E-Instruct** (32k context, 600 RPM, 200k TPM)
- **Groq-Llama-4-Maverick-17B-128E-Instruct** (128k context, 1000 RPM, 600k TPM)
Note: Accelerated variants are text-only and don't support image inputs.
## Features
### Text Generation
Basic text generation works with all models:
```yaml
providers:
  - llamaapi:Llama-3.3-70B-Instruct

prompts:
  - 'Explain quantum computing in simple terms'

tests:
  - vars: {}
    assert:
      - type: contains
        value: 'quantum'
```
### Multimodal (Image + Text)
Llama 4 models can process images alongside text:
```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8

prompts:
  - role: user
    content:
      - type: text
        text: 'What do you see in this image?'
      - type: image_url
        image_url:
          url: 'https://example.com/image.jpg'

tests:
  - vars: {}
    assert:
      - type: llm-rubric
        value: 'Accurately describes the image content'
```
#### Image Requirements
- Supported formats: JPEG, PNG, GIF, ICO
- Maximum file size: 25MB per image
- Maximum images per request: 9
- Input methods: URL or base64 encoding (see the sketch below)
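For base64 input, a data URL is the usual convention in OpenAI-style message formats; treat the exact format expected here as an assumption and confirm it against the Llama API docs. A minimal Node sketch:

```typescript
import { readFileSync } from 'node:fs';

// Encode a local image as a data URL for the image_url field.
// The data: URL convention is an assumption borrowed from
// OpenAI-style APIs; confirm against the Llama API docs.
function imageToDataUrl(path: string, mimeType = 'image/jpeg'): string {
  const base64 = readFileSync(path).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Usage: { type: 'image_url', image_url: { url: imageToDataUrl('photo.jpg') } }
```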
### JSON Structured Output
Generate responses following a specific JSON schema:
```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.1
      response_format:
        type: json_schema
        json_schema:
          name: product_review
          schema:
            type: object
            properties:
              rating:
                type: number
                minimum: 1
                maximum: 5
              summary:
                type: string
              pros:
                type: array
                items:
                  type: string
              cons:
                type: array
                items:
                  type: string
            required: ['rating', 'summary']

prompts:
  - 'Review this product: {{product_description}}'

tests:
  - vars:
      product_description: 'Wireless headphones with great sound quality but short battery life'
    assert:
      - type: is-json
      - type: javascript
        value: 'JSON.parse(output).rating >= 1 && JSON.parse(output).rating <= 5'
```
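Outside of promptfoo, you'll typically want to validate structured output before trusting it downstream. A minimal sketch, assuming the model returns the JSON object as plain message text:

```typescript
// Mirrors the product_review schema declared above.
interface ProductReview {
  rating: number;
  summary: string;
  pros?: string[];
  cons?: string[];
}

// Parse the model's text output and check the required fields
// before using it in application code.
function parseReview(output: string): ProductReview {
  const data = JSON.parse(output) as Partial<ProductReview>;
  if (typeof data.rating !== 'number' || data.rating < 1 || data.rating > 5) {
    throw new Error(`rating out of range: ${data.rating}`);
  }
  if (typeof data.summary !== 'string') {
    throw new Error('missing summary');
  }
  return data as ProductReview;
}
```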
### Tool Calling
Enable models to call external functions:
```yaml
providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Get current weather for a location
            parameters:
              type: object
              properties:
                location:
                  type: string
                  description: City and state, e.g. San Francisco, CA
                unit:
                  type: string
                  enum: ['celsius', 'fahrenheit']
              required: ['location']

prompts:
  - "What's the weather like in {{city}}?"

tests:
  - vars:
      city: 'New York, NY'
    assert:
      - type: function-call
        value: get_weather
      - type: javascript
        value: "output.arguments.location.includes('New York')"
```
### Streaming
Enable real-time response streaming:
```yaml
providers:
  - id: llamaapi:Llama-3.3-8B-Instruct
    config:
      stream: true
      temperature: 0.7

prompts:
  - 'Write a short story about {{topic}}'

tests:
  - vars:
      topic: 'time travel'
    assert:
      - type: contains
        value: 'time'
```
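If you consume the stream directly rather than through promptfoo, the sketch below assumes an OpenAI-style SSE wire format with `data:` lines and a `[DONE]` sentinel; the endpoint URL and chunk shape are assumptions, so check the Llama API docs before relying on them.

```typescript
// Hypothetical endpoint; replace with the URL from the Llama API docs.
const ENDPOINT = 'https://api.llama.example/v1/chat/completions';

async function streamCompletion(prompt: string): Promise<void> {
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.LLAMA_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'Llama-3.3-8B-Instruct',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop()!; // keep any incomplete line for the next chunk
    for (const line of lines) {
      if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
      // Assumed chunk shape: OpenAI-style delta objects.
      const chunk = JSON.parse(line.slice('data: '.length));
      process.stdout.write(chunk.choices?.[0]?.delta?.content ?? '');
    }
  }
}
```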
## Rate Limits and Quotas
All rate limits are applied per team (across all API keys):
| Model Type      | Requests/min | Tokens/min      |
| --------------- | ------------ | --------------- |
| Standard Models | 3,000        | 1,000,000       |
| Cerebras Models | 600-900      | 200,000-300,000 |
| Groq Models     | 1,000        | 600,000         |
Rate limit information is available in response headers:
- `x-ratelimit-limit-tokens`: Total token limit
- `x-ratelimit-remaining-tokens`: Remaining tokens
- `x-ratelimit-limit-requests`: Total request limit
- `x-ratelimit-remaining-requests`: Remaining requests
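A minimal sketch of acting on these headers, using the header names listed above; the warning thresholds are illustrative:

```typescript
// Inspect rate-limit state on a response and warn before the
// remaining budget runs out.
function checkRateLimit(res: Response): void {
  const remainingTokens = Number(res.headers.get('x-ratelimit-remaining-tokens') ?? Infinity);
  const remainingRequests = Number(res.headers.get('x-ratelimit-remaining-requests') ?? Infinity);
  if (remainingTokens < 10_000 || remainingRequests < 10) {
    console.warn(`Approaching rate limit: ${remainingTokens} tokens, ${remainingRequests} requests left`);
  }
}
```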
## Model Selection Guide
### Choose Llama 4 Models When:
- You need multimodal capabilities (text + images)
- You want the most advanced reasoning and intelligence
- Quality is more important than speed
- You're building complex AI applications
### Choose Llama 3.3 Models When:
- You only need text processing
- You want a balance of quality and speed
- Cost efficiency is important
- You're building chatbots or content generation tools
### Choose Accelerated Variants When:
- Ultra-low latency is critical
- You're building real-time applications
- Text-only processing is sufficient
- You can work within reduced context windows (Cerebras models)
## Best Practices
### Multimodal Usage
- **Optimize image sizes**: Larger images consume more tokens
- **Use appropriate formats**: JPEG for photos, PNG for graphics
- **Batch multiple images**: Up to 9 images per request when possible
### Token Management
- **Monitor context windows**: 32k-128k tokens depending on model
- **Use `max_tokens` appropriately**: Control response length
- **Estimate image tokens**: ~145 tokens per 336x336 pixel tile (see the sketch below)
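A quick way to budget for images, using the ~145 tokens per tile figure above (treat the constant as an approximation):

```typescript
// Rough image-token estimate: count 336x336 tiles and multiply by
// the ~145 tokens-per-tile figure cited above.
function estimateImageTokens(widthPx: number, heightPx: number): number {
  const tiles = Math.ceil(widthPx / 336) * Math.ceil(heightPx / 336);
  return tiles * 145;
}

// e.g. a 1024x768 image => 4 tiles wide x 3 tiles tall = 12 tiles
// => ~1,740 tokens
```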
### Error Handling
- **Implement retry logic**: For rate limits and transient failures (see the sketch below)
- **Validate inputs**: Check image formats and sizes
- **Monitor rate limits**: Use response headers to avoid hitting limits
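A minimal retry sketch with exponential backoff for 429s and transient 5xx responses; the delays and retry count are illustrative:

```typescript
// Wrap an API call in exponential backoff. The request function is
// whatever you use to call the API (e.g. a fetch closure).
async function withRetry(fn: () => Promise<Response>, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fn();
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= maxRetries) return res;
    const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```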
### Performance Optimization
- **Choose the right model**: Balance quality vs. speed vs. cost
- **Use streaming**: For better user experience with long responses
- **Cache responses**: When appropriate for your use case (see the sketch below)
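A minimal in-memory cache sketch for repeated, low-temperature prompts where replaying an earlier answer is acceptable; the completion function is whatever you use to call the API:

```typescript
// Cache completions keyed by model + prompt. Only appropriate when
// requests are deterministic enough that a replay is acceptable.
const cache = new Map<string, string>();

async function cachedCompletion(
  model: string,
  prompt: string,
  complete: (model: string, prompt: string) => Promise<string>,
): Promise<string> {
  const key = `${model}\u0000${prompt}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const result = await complete(model, prompt);
  cache.set(key, result);
  return result;
}
```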