# Rate Limits
Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or similar rate limit errors, requests are automatically retried with exponential backoff.
## Automatic Handling
Rate limit handling is built into the evaluator and requires no configuration:
- Automatic retry: Failed requests are retried up to 3 times with exponential backoff
- Header-aware delays: Respects `retry-after` headers from providers
- Adaptive concurrency: Reduces concurrent requests when rate limits are hit
- Per-provider isolation: Each provider and API key has separate rate limit tracking
## Supported Headers
Promptfoo parses rate limit headers from major providers:
| Provider | Headers |
|---|---|
| OpenAI | `x-ratelimit-remaining-requests`, `x-ratelimit-limit-requests`, `x-ratelimit-remaining-tokens`, `retry-after-ms` |
| Anthropic | `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-tokens-remaining`, `retry-after` |
| Azure OpenAI | `x-ratelimit-remaining-requests`, `retry-after-ms`, `retry-after` |
| Generic | `retry-after`, `ratelimit-remaining`, `ratelimit-reset` |
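Of these, `retry-after` is the trickiest to interpret, since the HTTP spec (RFC 9110) allows it to carry either a delay in seconds or an absolute HTTP-date. A minimal sketch of such a parser (the `parseRetryAfterMs` helper is illustrative, not promptfoo's actual implementation):

```js
// Sketch: normalize a retry-after header value to milliseconds.
// RFC 9110 allows either delta-seconds ("60") or an HTTP-date.
function parseRetryAfterMs(value, now = Date.now()) {
  const seconds = Number(value);
  if (Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  const date = Date.parse(value);
  if (!Number.isNaN(date)) {
    return Math.max(0, date - now); // the date may already be in the past
  }
  return null; // unparseable: caller falls back to exponential backoff
}

console.log(parseRetryAfterMs('60')); // 60000
```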
## Transient Error Handling
Promptfoo automatically retries requests that fail with transient server errors:
| Status Code | Description | Retry Condition |
|---|---|---|
| 502 | Bad Gateway | Status text contains "bad gateway" |
| 503 | Service Unavailable | Status text contains "service unavailable" |
| 504 | Gateway Timeout | Status text contains "gateway timeout" |
| 524 | A Timeout Occurred | Status text contains "timeout" (Cloudflare-specific) |
These errors are retried up to 3 times with exponential backoff (1s, 2s, 4s). The status text check ensures that permanent failures (like authentication errors that happen to use 502) are not retried.
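The retry decision just described can be sketched as a pair of small helpers; the names (`isTransient`, `backoffMs`) are illustrative, not promptfoo's internals:

```js
// Transient status codes and the substring that must appear in the
// status text (case-insensitive) for a retry to be attempted.
const TRANSIENT = {
  502: 'bad gateway',
  503: 'service unavailable',
  504: 'gateway timeout',
  524: 'timeout', // Cloudflare-specific
};

function isTransient(status, statusText) {
  const needle = TRANSIENT[status];
  return needle !== undefined && statusText.toLowerCase().includes(needle);
}

// Exponential backoff: attempt 0 -> 1000ms, 1 -> 2000ms, 2 -> 4000ms
function backoffMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}
```

Note how a 502 whose status text does not mention "bad gateway" is treated as permanent and not retried, matching the behavior described above.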
## How Adaptive Concurrency Works
The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput:
- When a rate limit is hit, concurrency is reduced by 50%
- After sustained successful requests, concurrency increases by 1
- When remaining quota drops below 10% (from headers), concurrency is proactively reduced
This allows you to set a higher `maxConcurrency` and let promptfoo find the optimal rate automatically.
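The AIMD rule can be expressed in a few lines. This is a simplified model with a hypothetical `nextConcurrency` step function, not promptfoo's actual scheduler:

```js
// Additive Increase, Multiplicative Decrease: halve on a rate-limit hit,
// add one after sustained successes, clamped to [min, max].
function nextConcurrency(current, event, { min = 1, max = 20 } = {}) {
  if (event === 'ratelimit') {
    return Math.max(min, Math.floor(current / 2)); // multiplicative decrease
  }
  if (event === 'sustained-success') {
    return Math.min(max, current + 1); // additive increase
  }
  return current; // e.g. a success streak that is not yet "sustained"
}
```

The asymmetry is deliberate: backing off sharply avoids hammering a provider that is already throttling, while recovering one slot at a time probes gently for the provider's true limit.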
## Configuration

### Concurrency
Control the maximum number of concurrent requests:
```yaml
evaluateOptions:
  maxConcurrency: 10
```
Or via CLI:
```sh
promptfoo eval --max-concurrency 10
```
The adaptive scheduler will reduce this if rate limits are encountered, but cannot exceed your configured maximum.
### Fixed Delay
Add a fixed delay between requests (in addition to any rate limit backoff):
```yaml
evaluateOptions:
  delay: 1000 # milliseconds
```
Or via CLI:
```sh
promptfoo eval --delay 1000
```
Or via environment variable:
```sh
PROMPTFOO_DELAY_MS=1000 promptfoo eval
```
### Backoff Configuration
Promptfoo has two retry layers:
- Provider-level retry (scheduler): Retries `callApi()` with 1-second base backoff, up to 3 times
- HTTP-level retry: Retries failed HTTP requests with configurable backoff
Environment variables for the scheduler:
| Environment Variable | Description | Default |
|---|---|---|
| `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER` | Disable adaptive concurrency (use fixed) | `false` |
| `PROMPTFOO_MIN_CONCURRENCY` | Minimum concurrency (floor for adaptive) | `1` |
| `PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS` | Timeout for queued requests (0 to disable) | `300000` (5 minutes) |
Environment variables for HTTP-level retry:
| Environment Variable | Description | Default |
|---|---|---|
| `PROMPTFOO_REQUEST_BACKOFF_MS` | Base delay for HTTP retry backoff | `5000` |
| `PROMPTFOO_RETRY_5XX` | Retry on HTTP 500 errors | `false` |
Example:
```sh
PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval
```
The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.
## Provider-Specific Notes

### OpenAI
OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:
```yaml
evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed
```
See OpenAI troubleshooting for additional options.
### Anthropic

Anthropic rate limits are typically per-minute. The scheduler respects `retry-after` headers from the API.
### Custom Providers
Custom providers trigger automatic retry when errors contain:
- "429"
- "rate limit"
- "too many requests"
To provide retry timing, include headers in your response metadata:
```js
return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
```
## Debugging
To see rate limit events, enable debug logging:
```sh
LOG_LEVEL=debug promptfoo eval -c config.yaml
```
Events logged:
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - Provider limits discovered from headers
- `ratelimit:warning` - Approaching rate limit threshold
- `concurrency:decreased` / `concurrency:increased` - Adaptive concurrency changes
- `request:retrying` - Retry in progress
## Best Practices
- Start with higher concurrency - Set `maxConcurrency` to your desired throughput; the scheduler will adapt down if needed
- Use caching - Enable caching to avoid re-running identical requests
- Monitor debug logs - If evaluations are slow, check for frequent `ratelimit:hit` events
- Consider provider tiers - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows
## Disabling Automatic Handling
The scheduler is always active but has minimal overhead. For fully deterministic behavior (e.g., in tests), use:
```yaml
evaluateOptions:
  maxConcurrency: 1
  delay: 1000
```
This ensures sequential execution with fixed delays between requests.