Meta Llama API

The Llama API provider enables you to use Meta's hosted Llama models through their official API service. This includes access to the latest Llama 4 multimodal models and Llama 3.3 text models, as well as accelerated variants from partners like Cerebras and Groq.

Setup

First, you'll need to get an API key from Meta:

  1. Visit llama.developer.meta.com
  2. Sign up for an account and join the waitlist
  3. Create an API key in the dashboard
  4. Set the API key as an environment variable:
export LLAMA_API_KEY="your_api_key_here"
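
To confirm the key is picked up, a minimal smoke-test config is enough; this is a sketch using one of the model IDs listed under Available Models below. Run it with `promptfoo eval`:

providers:
  - llamaapi:Llama-3.3-8B-Instruct

prompts:
  - 'Reply with the single word: ready'

tests:
  - assert:
      - type: contains
        value: 'ready'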

Configuration

Use the llamaapi: prefix to specify Llama API models:

providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:chat:Llama-3.3-8B-Instruct # Explicit chat format

Provider Options

providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.7 # Controls randomness (0.0-2.0)
      max_tokens: 1000 # Maximum response length
      top_p: 0.9 # Nucleus sampling parameter
      frequency_penalty: 0 # Reduces repetition (-2.0 to 2.0)
      presence_penalty: 0 # Encourages topic diversity (-2.0 to 2.0)
      stream: false # Set to true to enable streaming responses

Available Models

Meta-Hosted Models

Llama 4 (Multimodal)

  • Llama-4-Maverick-17B-128E-Instruct-FP8: Industry-leading multimodal model with image and text understanding
  • Llama-4-Scout-17B-16E-Instruct-FP8: Class-leading multimodal model with superior visual intelligence

Both Llama 4 models support:

  • Input: Text and images
  • Output: Text
  • Context Window: 128k tokens
  • Rate Limits: 3,000 RPM, 1M TPM

Llama 3.3 (Text-Only)

  • Llama-3.3-70B-Instruct: Enhanced performance text model
  • Llama-3.3-8B-Instruct: Lightweight, ultra-fast variant

Both Llama 3.3 models support:

  • Input: Text only
  • Output: Text
  • Context Window: 128k tokens
  • Rate Limits: 3,000 RPM, 1M TPM

Accelerated Variants (Preview)

For applications requiring ultra-low latency:

  • Cerebras-Llama-4-Maverick-17B-128E-Instruct (32k context, 900 RPM, 300k TPM)
  • Cerebras-Llama-4-Scout-17B-16E-Instruct (32k context, 600 RPM, 200k TPM)
  • Groq-Llama-4-Maverick-17B-128E-Instruct (128k context, 1000 RPM, 600k TPM)

Note: Accelerated variants are text-only and don't support image inputs.
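
To quantify the speed difference, one approach is to evaluate the same prompt against a Meta-hosted model and an accelerated variant side by side. The sketch below uses promptfoo's latency assertion (threshold in milliseconds; the 3000 ms budget is an arbitrary example):

providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Groq-Llama-4-Maverick-17B-128E-Instruct

prompts:
  - 'Summarize the plot of Hamlet in two sentences.'

tests:
  - assert:
      - type: latency
        threshold: 3000 # ms; arbitrary example budget

Run this with `promptfoo eval --no-cache` so the timing reflects real API calls rather than cached responses.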

Features

Text Generation

Basic text generation works with all models:

providers:
  - llamaapi:Llama-3.3-70B-Instruct

prompts:
  - 'Explain quantum computing in simple terms'

tests:
  - vars: {}
    assert:
      - type: contains
        value: 'quantum'

Multimodal (Image + Text)

Llama 4 models can process images alongside text:

providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8

prompts:
  - role: user
    content:
      - type: text
        text: 'What do you see in this image?'
      - type: image_url
        image_url:
          url: 'https://example.com/image.jpg'

tests:
  - vars: {}
    assert:
      - type: llm-rubric
        value: 'Accurately describes the image content'

Image Requirements

  • Supported formats: JPEG, PNG, GIF, ICO
  • Maximum file size: 25MB per image
  • Maximum images per request: 9
  • Input methods: URL or base64 encoding (see the sketch below)
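
For images that aren't publicly hosted, base64 encoding is the alternative. The sketch below assumes the API accepts OpenAI-style data URLs; `image_base64` is a hypothetical test variable holding the encoded file contents:

prompts:
  - role: user
    content:
      - type: text
        text: 'Describe this image'
      - type: image_url
        image_url:
          url: 'data:image/jpeg;base64,{{image_base64}}' # hypothetical variable with the encoded bytes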

JSON Structured Output

Generate responses following a specific JSON schema:

providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.1
      response_format:
        type: json_schema
        json_schema:
          name: product_review
          schema:
            type: object
            properties:
              rating:
                type: number
                minimum: 1
                maximum: 5
              summary:
                type: string
              pros:
                type: array
                items:
                  type: string
              cons:
                type: array
                items:
                  type: string
            required: ['rating', 'summary']

prompts:
  - 'Review this product: {{product_description}}'

tests:
  - vars:
      product_description: 'Wireless headphones with great sound quality but short battery life'
    assert:
      - type: is-json
      - type: javascript
        value: 'JSON.parse(output).rating >= 1 && JSON.parse(output).rating <= 5'

Tool Calling

Enable models to call external functions:

providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Get current weather for a location
            parameters:
              type: object
              properties:
                location:
                  type: string
                  description: City and state, e.g. San Francisco, CA
                unit:
                  type: string
                  enum: ['celsius', 'fahrenheit']
              required: ['location']

prompts:
  - "What's the weather like in {{city}}?"

tests:
  - vars:
      city: 'New York, NY'
    assert:
      - type: function-call
        value: get_weather
      - type: javascript
        value: "output.arguments.location.includes('New York')"

Streaming

Enable real-time response streaming:

providers:
  - id: llamaapi:Llama-3.3-8B-Instruct
    config:
      stream: true
      temperature: 0.7

prompts:
  - 'Write a short story about {{topic}}'

tests:
  - vars:
      topic: 'time travel'
    assert:
      - type: contains
        value: 'time'

Rate Limits and Quotas

All rate limits are applied per team (across all API keys):

Model Type         Requests/min    Tokens/min
Standard Models    3,000           1,000,000
Cerebras Models    600-900         200,000-300,000
Groq Models        1,000           600,000

Rate limit information is available in response headers:

  • x-ratelimit-limit-tokens: Total token limit
  • x-ratelimit-remaining-tokens: Remaining tokens
  • x-ratelimit-limit-requests: Total request limit
  • x-ratelimit-remaining-requests: Remaining requests

Model Selection Guide

Choose Llama 4 Models When:

  • You need multimodal capabilities (text + images)
  • You want the most advanced reasoning and intelligence
  • Quality is more important than speed
  • You're building complex AI applications

Choose Llama 3.3 Models When:

  • You only need text processing
  • You want a balance of quality and speed
  • Cost efficiency is important
  • You're building chatbots or content generation tools

Choose Accelerated Variants When:

  • Ultra-low latency is critical
  • You're building real-time applications
  • Text-only processing is sufficient
  • You can work within reduced context windows (Cerebras models)
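
One way to make this choice empirically is to run representative prompts against a candidate from each tier and let a grader pick the winner. This sketch uses promptfoo's select-best assertion, which compares all providers' outputs for the same test:

providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:Groq-Llama-4-Maverick-17B-128E-Instruct

prompts:
  - 'Draft a polite refund-request email about {{product}}'

tests:
  - vars:
      product: 'a defective keyboard'
    assert:
      - type: select-best
        value: 'Most helpful and clearly written response'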

Best Practices

Multimodal Usage

  1. Optimize image sizes: Larger images consume more tokens
  2. Use appropriate formats: JPEG for photos, PNG for graphics
  3. Batch multiple images: Up to 9 images per request when possible

Token Management

  1. Monitor context windows: 32k-128k depending on model
  2. Use max_tokens appropriately: Control response length
  3. Estimate image tokens: ~145 tokens per 336x336 pixel tile

Error Handling

  1. Implement retry logic: For rate limits and transient failures
  2. Validate inputs: Check image formats and sizes
  3. Monitor rate limits: Use response headers to stay under limits (see the throttling sketch below)
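
Within promptfoo itself, the simplest guard against 429s is to throttle the eval. A sketch using standard evaluateOptions (the values are arbitrary examples):

evaluateOptions:
  maxConcurrency: 1 # run requests sequentially
  delay: 500 # wait 500 ms between API calls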

Performance Optimization

  1. Choose the right model: Balance quality vs. speed vs. cost
  2. Use streaming: For better user experience with long responses
  3. Cache responses: When appropriate for your use case

Troubleshooting

Authentication Issues

Error: 401 Unauthorized
  • Verify your LLAMA_API_KEY environment variable is set
  • Check that your API key is valid at llama.developer.meta.com
  • Ensure you have access to the Llama API (currently in preview)

Rate Limiting

Error: 429 Too Many Requests
  • Check your current rate limit usage
  • Implement exponential backoff retry logic
  • Consider distributing load across different time periods

Model Errors

Error: Model not found
  • Verify the model name spelling
  • Check model availability in your region
  • Ensure you're using supported model IDs

Image Processing Issues

Error: Invalid image format
  • Check image format (JPEG, PNG, GIF, ICO only)
  • Verify image size is under 25MB
  • Ensure the image URL is publicly accessible

Data Privacy

Meta Llama API has strong data commitments:

  • No training on your data: Your inputs and outputs are not used for model training
  • Encryption: Data encrypted at rest and in transit
  • No ads: Data not used for advertising
  • Storage separation: Strict access controls and isolated storage
  • Compliance: Regular vulnerability management and compliance audits

Comparison with Other Providers

Feature           Llama API       OpenAI    Anthropic
Multimodal        ✅ (Llama 4)    ✅        ✅
Tool Calling      ✅              ✅        ✅
JSON Schema       ✅              ✅        ❌
Streaming         ✅              ✅        ✅
Context Window    32k-128k        128k      200k
Trains on Data    No              No        No

Examples

Check out the examples directory for:

  • Basic chat: Simple text generation
  • Multimodal: Image understanding tasks
  • Structured output: JSON schema validation
  • Tool calling: Function calling examples
  • Model comparison: Performance benchmarking

See Also

  • OpenAI - Similar API structure and capabilities
  • Anthropic - Alternative AI provider
  • Together AI - Hosts various open-source models, including Llama
  • OpenRouter - Provides access to multiple AI models, including Llama

For questions and support, visit the Llama API documentation or join the promptfoo Discord community.