# Rate Limits
Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or similar rate limit errors, requests are automatically retried with exponential backoff.
## Automatic Handling
Rate limit handling is built into the evaluator and requires no configuration:
- Automatic retry: Failed requests are retried up to 3 times with exponential backoff
- Header-aware delays: Respects `retry-after` headers from providers
- Adaptive concurrency: Reduces concurrent requests when rate limits are hit
- Per-provider isolation: Each provider and API key has separate rate limit tracking
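Conceptually, the retry behavior resembles the sketch below. This is a simplified illustration, not promptfoo's actual implementation: the doubling schedule and the `isRateLimitError` helper are assumptions based on the behavior described above.

```ts
// Simplified sketch of retry-with-backoff; not promptfoo's actual code.
// A provider-supplied `retry-after` header would override the computed delay.
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let delayMs = 1000; // 1-second base backoff
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries || !isRateLimitError(err)) {
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // exponential backoff
    }
  }
}

function isRateLimitError(err: unknown): boolean {
  const msg = String(err).toLowerCase();
  return msg.includes('429') || msg.includes('rate limit') || msg.includes('too many requests');
}
```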
## Supported Headers
Promptfoo parses rate limit headers from major providers:
| Provider | Headers |
|---|---|
| OpenAI | `x-ratelimit-remaining-requests`, `x-ratelimit-limit-requests`, `x-ratelimit-remaining-tokens`, `retry-after-ms` |
| Anthropic | `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-tokens-remaining`, `retry-after` |
| Azure OpenAI | `x-ratelimit-remaining-requests`, `retry-after-ms`, `retry-after` |
| Generic | `retry-after`, `ratelimit-remaining`, `ratelimit-reset` |
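As an illustration, normalizing these vendor-specific headers might look like the sketch below. The `RateLimitInfo` shape is an assumption for illustration, not promptfoo's internal type.

```ts
// Illustrative only: normalize vendor-specific rate limit headers.
interface RateLimitInfo {
  remainingRequests?: number;
  retryAfterMs?: number;
}

function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  const remaining =
    headers.get('x-ratelimit-remaining-requests') ?? // OpenAI, Azure OpenAI
    headers.get('anthropic-ratelimit-requests-remaining') ?? // Anthropic
    headers.get('ratelimit-remaining'); // generic

  const retryAfterMs = headers.get('retry-after-ms'); // milliseconds
  const retryAfterSec = headers.get('retry-after'); // seconds

  return {
    remainingRequests: remaining !== null ? Number(remaining) : undefined,
    retryAfterMs:
      retryAfterMs !== null
        ? Number(retryAfterMs)
        : retryAfterSec !== null
          ? Number(retryAfterSec) * 1000
          : undefined,
  };
}
```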
## How Adaptive Concurrency Works
The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput:
- When a rate limit is hit, concurrency is reduced by 50%
- After sustained successful requests, concurrency increases by 1
- When remaining quota drops below 10% (from headers), concurrency is proactively reduced
This allows you to set a higher `maxConcurrency` and let promptfoo find the optimal rate automatically.
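A minimal sketch of the AIMD rule, using the numbers above. The success-streak threshold is an assumption (the doc does not quantify "sustained"), and the proactive quota-based reduction is omitted for brevity.

```ts
// Minimal AIMD sketch; thresholds other than the 50% cut and +1 step are assumptions.
class AimdConcurrency {
  private successStreak = 0;

  constructor(
    private current: number,
    private readonly max: number,
    private readonly min = 1, // cf. PROMPTFOO_MIN_CONCURRENCY
  ) {}

  onRateLimit(): void {
    // Multiplicative decrease: cut concurrency by 50%.
    this.current = Math.max(this.min, Math.floor(this.current / 2));
    this.successStreak = 0;
  }

  onSuccess(): void {
    // Additive increase: +1 after sustained successes (threshold assumed).
    if (++this.successStreak >= 10) {
      this.current = Math.min(this.max, this.current + 1);
      this.successStreak = 0;
    }
  }

  get value(): number {
    return this.current;
  }
}
```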
## Configuration
### Concurrency
Control the maximum number of concurrent requests:
```yaml
evaluateOptions:
  maxConcurrency: 10
```
Or via CLI:
```sh
promptfoo eval --max-concurrency 10
```
The adaptive scheduler will reduce this if rate limits are encountered, but cannot exceed your configured maximum.
### Fixed Delay
Add a fixed delay between requests (in addition to any rate limit backoff):
```yaml
evaluateOptions:
  delay: 1000 # milliseconds
```
Or via CLI:
```sh
promptfoo eval --delay 1000
```
Or via environment variable:
```sh
PROMPTFOO_DELAY_MS=1000 promptfoo eval
```
### Backoff Configuration
Promptfoo has two retry layers:
- Provider-level retry (scheduler): Retries `callApi()` with 1-second base backoff, up to 3 times
- HTTP-level retry: Retries failed HTTP requests with configurable backoff
Environment variables for the scheduler:
| Environment Variable | Description | Default |
|---|---|---|
| `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER` | Disable adaptive concurrency (use fixed) | `false` |
| `PROMPTFOO_MIN_CONCURRENCY` | Minimum concurrency (floor for adaptive) | `1` |
| `PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS` | Timeout for queued requests (0 to disable) | `300000` ms (5 minutes) |
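For example, to run at a fixed concurrency instead of letting the scheduler adapt:

```sh
PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true promptfoo eval
```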
Environment variables for HTTP-level retry:
| Environment Variable | Description | Default |
|---|---|---|
| `PROMPTFOO_REQUEST_BACKOFF_MS` | Base delay for HTTP retry backoff | `5000` ms |
| `PROMPTFOO_RETRY_5XX` | Retry on HTTP 5xx errors | `false` |
Example:
```sh
PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval
```
The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.
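To illustrate what the base delay controls, here is a sketch of how an exponential schedule could grow from `PROMPTFOO_REQUEST_BACKOFF_MS`. The doubling factor is an assumption; the exact schedule is internal to promptfoo.

```ts
// Hypothetical sketch: exponential backoff from a configurable base delay.
const BASE_MS = Number(process.env.PROMPTFOO_REQUEST_BACKOFF_MS ?? 5000);

function backoffDelayMs(attempt: number): number {
  // attempt 0 -> 5000 ms, attempt 1 -> 10000 ms, attempt 2 -> 20000 ms, ...
  return BASE_MS * 2 ** attempt;
}
```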
## Provider-Specific Notes
### OpenAI
OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:
```yaml
evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed
```
See OpenAI troubleshooting for additional options.
### Anthropic
Anthropic rate limits are typically per-minute. The scheduler respects `retry-after` headers from the API.
### Custom Providers
Custom providers trigger automatic retry when errors contain:
- "429"
- "rate limit"
- "too many requests"
To provide retry timing, include headers in your response metadata:
```js
return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
```
## Debugging
To see rate limit events, enable debug logging:
```sh
LOG_LEVEL=debug promptfoo eval -c config.yaml
```
Events logged:
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - Provider limits discovered from headers
- `ratelimit:warning` - Approaching rate limit threshold
- `concurrency:decreased` / `concurrency:increased` - Adaptive concurrency changes
- `request:retrying` - Retry in progress
## Best Practices
1. **Start with higher concurrency** - Set `maxConcurrency` to your desired throughput; the scheduler will adapt down if needed
2. **Use caching** - Enable caching to avoid re-running identical requests
3. **Monitor debug logs** - If evaluations are slow, check for frequent `ratelimit:hit` events
4. **Consider provider tiers** - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows
## Disabling Automatic Handling
The scheduler is always active but has minimal overhead. For fully deterministic behavior (e.g., in tests), use:
```yaml
evaluateOptions:
  maxConcurrency: 1
  delay: 1000
```
This ensures sequential execution with fixed delays between requests.