Rate Limits

Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or similar rate limit errors, requests are automatically retried with exponential backoff.

Automatic Handling

Rate limit handling is built into the evaluator and requires no configuration:

  • Automatic retry: Failed requests are retried up to 3 times with exponential backoff
  • Header-aware delays: Respects retry-after headers from providers
  • Adaptive concurrency: Reduces concurrent requests when rate limits are hit
  • Per-provider isolation: Each provider and API key has separate rate limit tracking
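
The retry loop described above can be sketched as follows. This is an illustrative sketch, not promptfoo's actual implementation; `withRetries`, `isRateLimitError`, and the `retryAfterMs` error property are assumed names.

```javascript
// Sketch: exponential backoff that prefers a provider-supplied retry-after hint.
async function withRetries(fn, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries || !isRateLimitError(err)) throw err;
      // Use the provider's retry-after hint if present; otherwise back off
      // exponentially: baseDelayMs, 2x, 4x, ...
      const delay = err.retryAfterMs ?? baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

function isRateLimitError(err) {
  return err.status === 429 || /rate limit|too many requests/i.test(String(err.message));
}
```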

Supported Headers

Promptfoo parses rate limit headers from major providers:

  • OpenAI: x-ratelimit-remaining-requests, x-ratelimit-limit-requests, x-ratelimit-remaining-tokens, retry-after-ms
  • Anthropic: anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-remaining, retry-after
  • Azure OpenAI: x-ratelimit-remaining-requests, retry-after-ms, retry-after
  • Generic: retry-after, ratelimit-remaining, ratelimit-reset
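
One way to normalize these headers is sketched below. The function name and return shape are assumptions for illustration; note that retry-after-ms is in milliseconds while retry-after is in seconds.

```javascript
// Sketch: collapse provider-specific rate limit headers into one shape.
// Assumes `headers` is a plain object with lowercase keys.
function parseRateLimitHeaders(headers) {
  const get = (name) => headers[name.toLowerCase()];
  const remaining =
    get('x-ratelimit-remaining-requests') ??
    get('anthropic-ratelimit-requests-remaining') ??
    get('ratelimit-remaining');
  // retry-after-ms is milliseconds; retry-after is seconds (the HTTP-date
  // variant of retry-after is ignored in this sketch).
  let retryAfterMs;
  if (get('retry-after-ms') !== undefined) {
    retryAfterMs = Number(get('retry-after-ms'));
  } else if (get('retry-after') !== undefined) {
    retryAfterMs = Number(get('retry-after')) * 1000;
  }
  return {
    remaining: remaining !== undefined ? Number(remaining) : undefined,
    retryAfterMs,
  };
}
```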

Transient Error Handling

Promptfoo automatically retries requests that fail with transient server errors:

  • 502 Bad Gateway: retried when the status text contains "bad gateway"
  • 503 Service Unavailable: retried when the status text contains "service unavailable"
  • 504 Gateway Timeout: retried when the status text contains "gateway timeout"
  • 524 A Timeout Occurred (Cloudflare-specific): retried when the status text contains "timeout"

These errors are retried up to 3 times with exponential backoff (1s, 2s, 4s). The status text check ensures that permanent failures (like authentication errors that happen to use 502) are not retried.
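
The two-part check (status code plus status text) can be sketched like this; the table and function names are illustrative, not promptfoo source:

```javascript
// Sketch: a response is retried only if BOTH the status code and the status
// text match a known transient error, so a permanent failure that reuses one
// of these codes is not retried.
const TRANSIENT_ERRORS = {
  502: 'bad gateway',
  503: 'service unavailable',
  504: 'gateway timeout',
  524: 'timeout', // Cloudflare-specific
};

function isTransientServerError(status, statusText) {
  const expected = TRANSIENT_ERRORS[status];
  return expected !== undefined && statusText.toLowerCase().includes(expected);
}
```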

How Adaptive Concurrency Works

The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput:

  1. When a rate limit is hit, concurrency is reduced by 50%
  2. After sustained successful requests, concurrency increases by 1
  3. When remaining quota drops below 10% (from headers), concurrency is proactively reduced

This allows you to set a higher maxConcurrency and let promptfoo find the optimal rate automatically.
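
The three AIMD rules above can be sketched as a small controller. The class and method names are assumptions; the halving, plus-one, and 10% threshold values come from the steps listed above.

```javascript
// Sketch: AIMD concurrency control bounded by the configured maxConcurrency.
class AimdController {
  constructor(maxConcurrency, minConcurrency = 1) {
    this.max = maxConcurrency;
    this.min = minConcurrency;
    this.current = maxConcurrency;
  }
  onRateLimit() {
    // Multiplicative decrease: cut concurrency by 50%, never below the floor.
    this.current = Math.max(this.min, Math.floor(this.current / 2));
  }
  onSustainedSuccess() {
    // Additive increase: grow by 1, never above the configured maximum.
    this.current = Math.min(this.max, this.current + 1);
  }
  onQuotaRemaining(fraction) {
    // Proactive decrease when headers report under 10% of quota remaining.
    if (fraction < 0.1) this.onRateLimit();
  }
}
```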

Configuration

Concurrency

Control the maximum number of concurrent requests:

evaluateOptions:
  maxConcurrency: 10

Or via CLI:

promptfoo eval --max-concurrency 10

The adaptive scheduler will reduce this if rate limits are encountered, but cannot exceed your configured maximum.

Fixed Delay

Add a fixed delay between requests (in addition to any rate limit backoff):

evaluateOptions:
  delay: 1000 # milliseconds

Or via CLI:

promptfoo eval --delay 1000

Or via environment variable:

PROMPTFOO_DELAY_MS=1000 promptfoo eval

Backoff Configuration

Promptfoo has two retry layers:

  1. Provider-level retry (scheduler): Retries callApi() with 1-second base backoff, up to 3 times
  2. HTTP-level retry: Retries failed HTTP requests with configurable backoff

Environment variables for the scheduler:

  • PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER: disable adaptive concurrency and use a fixed value (default: false)
  • PROMPTFOO_MIN_CONCURRENCY: minimum concurrency, the floor for adaptive reduction (default: 1)
  • PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS: timeout for queued requests, 0 to disable (default: 300000 ms, i.e. 5 minutes)

Environment variables for HTTP-level retry:

  • PROMPTFOO_REQUEST_BACKOFF_MS: base delay for HTTP retry backoff (default: 5000 ms)
  • PROMPTFOO_RETRY_5XX: retry on HTTP 500 errors (default: false)

Example:

PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval

The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.

Provider-Specific Notes

OpenAI

OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:

evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed

See OpenAI troubleshooting for additional options.

Anthropic

Anthropic rate limits are typically per-minute. The scheduler respects retry-after headers from the API.

Custom Providers

Custom providers trigger automatic retry when errors contain:

  • "429"
  • "rate limit"
  • "too many requests"
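
The substring match described above can be sketched as follows (an assumed helper for illustration, not promptfoo source):

```javascript
// Sketch: decide whether a custom provider error message looks like a rate
// limit, using the three substrings listed above (case-insensitive).
function looksLikeRateLimit(errorMessage) {
  const msg = String(errorMessage).toLowerCase();
  return (
    msg.includes('429') ||
    msg.includes('rate limit') ||
    msg.includes('too many requests')
  );
}
```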

To provide retry timing, include headers in your response metadata:

return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
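
In context, that return value sits inside a custom provider's callApi method. The sketch below is a hypothetical provider (the class, its id, and the simulated response are illustrative, not a real API):

```javascript
// Sketch: a minimal custom provider that surfaces retry timing to the
// scheduler via metadata.headers. The upstream call is simulated here.
class EchoProvider {
  id() {
    return 'echo-provider'; // hypothetical provider id
  }
  async callApi(prompt) {
    // A real provider would call its API here and copy the response's
    // rate limit headers into metadata.headers.
    return {
      output: `echo: ${prompt}`,
      metadata: {
        headers: {
          'retry-after': '60', // seconds
        },
      },
    };
  }
}
```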

Debugging

To see rate limit events, enable debug logging:

LOG_LEVEL=debug promptfoo eval -c config.yaml

Events logged:

  • ratelimit:hit - Rate limit encountered
  • ratelimit:learned - Provider limits discovered from headers
  • ratelimit:warning - Approaching rate limit threshold
  • concurrency:decreased / concurrency:increased - Adaptive concurrency changes
  • request:retrying - Retry in progress

Best Practices

  1. Start with higher concurrency - Set maxConcurrency to your desired throughput; the scheduler will adapt down if needed

  2. Use caching - Enable caching to avoid re-running identical requests

  3. Monitor debug logs - If evaluations are slow, check for frequent ratelimit:hit events

  4. Consider provider tiers - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows

Disabling Automatic Handling

The scheduler is always active but has minimal overhead. For fully deterministic behavior (e.g., in tests), use:

evaluateOptions:
  maxConcurrency: 1
  delay: 1000

This ensures sequential execution with fixed delays between requests.