Nscale

The Nscale provider enables you to use Nscale's Serverless Inference API models with promptfoo. Nscale offers cost-effective AI inference with up to 80% savings compared to other providers, zero rate limits, and no cold starts.

Setup

Set your Nscale service token as an environment variable:

export NSCALE_SERVICE_TOKEN=your_service_token_here

Alternatively, you can add it to your .env file:

NSCALE_SERVICE_TOKEN=your_service_token_here
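
You can also set the token directly on a provider. This is a sketch under an assumption: the Nscale provider follows promptfoo's other OpenAI-compatible providers, which accept an inline apiKey override in config.

providers:
  - id: nscale:openai/gpt-oss-120b
    config:
      # Assumption: inline token override, as with other OpenAI-compatible providers
      apiKey: your_service_token_here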

Obtaining Credentials

You can obtain service tokens by:

  1. Signing up at Nscale
  2. Navigating to your account settings
  3. Opening the "Service Tokens" section

Configuration

To use Nscale models in your promptfoo configuration, use the nscale: prefix followed by the model name:

providers:
  - nscale:openai/gpt-oss-120b
  - nscale:meta/llama-3.3-70b-instruct
  - nscale:qwen/qwen-3-235b-a22b-instruct

Model Types

Nscale supports different types of models through specific endpoint formats:

Chat Completion Models (Default)

For chat completion models, you can use either format:

providers:
  - nscale:chat:openai/gpt-oss-120b
  - nscale:openai/gpt-oss-120b # Defaults to chat

Completion Models

For text completion models:

providers:
  - nscale:completion:openai/gpt-oss-20b

Embedding Models

For embedding models:

providers:
  - nscale:embedding:qwen/qwen3-embedding-8b
  - nscale:embeddings:qwen/qwen3-embedding-8b # Alternative format
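
Embedding models are most useful for grading similar assertions. A minimal sketch, assuming promptfoo's standard embedding-provider override applies to Nscale as well:

defaultTest:
  options:
    provider:
      embedding:
        # Assumption: an Nscale embedding model can back the similar assertion
        id: nscale:embedding:qwen/qwen3-embedding-8b

tests:
  - vars:
      question: What is the capital of France?
    assert:
      - type: similar
        value: 'Paris is the capital of France.'
        threshold: 0.8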

Supported Models

Nscale offers a wide range of popular AI models:

Text Generation Models

| Model | Provider Format | Use Case |
| --- | --- | --- |
| GPT OSS 120B | nscale:openai/gpt-oss-120b | General-purpose reasoning and tasks |
| GPT OSS 20B | nscale:openai/gpt-oss-20b | Lightweight general-purpose model |
| Qwen 3 235B Instruct | nscale:qwen/qwen-3-235b-a22b-instruct | Large-scale language understanding |
| Qwen 3 235B Instruct 2507 | nscale:qwen/qwen-3-235b-a22b-instruct-2507 | Latest Qwen 3 235B variant |
| Qwen 3 4B Thinking 2507 | nscale:qwen/qwen-3-4b-thinking-2507 | Reasoning and thinking tasks |
| Qwen 3 8B | nscale:qwen/qwen-3-8b | Mid-size general-purpose model |
| Qwen 3 14B | nscale:qwen/qwen-3-14b | Enhanced reasoning capabilities |
| Qwen 3 32B | nscale:qwen/qwen-3-32b | Large-scale reasoning and analysis |
| Qwen 2.5 Coder 3B Instruct | nscale:qwen/qwen-2.5-coder-3b-instruct | Lightweight code generation |
| Qwen 2.5 Coder 7B Instruct | nscale:qwen/qwen-2.5-coder-7b-instruct | Code generation and programming |
| Qwen 2.5 Coder 32B Instruct | nscale:qwen/qwen-2.5-coder-32b-instruct | Advanced code generation |
| Qwen QwQ 32B | nscale:qwen/qwq-32b | Specialized reasoning model |
| Llama 3.3 70B Instruct | nscale:meta/llama-3.3-70b-instruct | High-quality instruction following |
| Llama 3.1 8B Instruct | nscale:meta/llama-3.1-8b-instruct | Efficient instruction following |
| Llama 4 Scout 17B | nscale:meta/llama-4-scout-17b-16e-instruct | Image-Text-to-Text capabilities |
| DeepSeek R1 Distill Llama 70B | nscale:deepseek/deepseek-r1-distill-llama-70b | Efficient reasoning model |
| DeepSeek R1 Distill Llama 8B | nscale:deepseek/deepseek-r1-distill-llama-8b | Lightweight reasoning model |
| DeepSeek R1 Distill Qwen 1.5B | nscale:deepseek/deepseek-r1-distill-qwen-1.5b | Ultra-lightweight reasoning |
| DeepSeek R1 Distill Qwen 7B | nscale:deepseek/deepseek-r1-distill-qwen-7b | Compact reasoning model |
| DeepSeek R1 Distill Qwen 14B | nscale:deepseek/deepseek-r1-distill-qwen-14b | Mid-size reasoning model |
| DeepSeek R1 Distill Qwen 32B | nscale:deepseek/deepseek-r1-distill-qwen-32b | Large reasoning model |
| Devstral Small 2505 | nscale:mistral/devstral-small-2505 | Code generation and development |
| Mixtral 8x22B Instruct | nscale:mistral/mixtral-8x22b-instruct-v0.1 | Large mixture-of-experts model |
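
For example, to compare several of these models side by side, list them together and run a single eval:

providers:
  - nscale:openai/gpt-oss-120b
  - nscale:meta/llama-3.3-70b-instruct
  - nscale:deepseek/deepseek-r1-distill-llama-70b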

Embedding Models

| Model | Provider Format | Use Case |
| --- | --- | --- |
| Qwen 3 Embedding 8B | nscale:embedding:Qwen/Qwen3-Embedding-8B | Text embeddings and similarity |

Text-to-Image Models

| Model | Provider Format | Use Case |
| --- | --- | --- |
| Flux.1 Schnell | nscale:image:BlackForestLabs/FLUX.1-schnell | Fast image generation |
| Stable Diffusion XL | nscale:image:stabilityai/stable-diffusion-xl-base-1.0 | High-quality image generation |
| SDXL Lightning 4-step | nscale:image:ByteDance/SDXL-Lightning-4step | Ultra-fast image generation |
| SDXL Lightning 8-step | nscale:image:ByteDance/SDXL-Lightning-8step | Balanced speed and quality |
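
A minimal image-generation eval sketch, assuming these image models take a plain text prompt:

providers:
  - nscale:image:stabilityai/stable-diffusion-xl-base-1.0

prompts:
  - 'A watercolor painting of {{subject}}'

tests:
  - vars:
      subject: a lighthouse at dawn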

Configuration Options

Nscale supports standard OpenAI-compatible parameters:

providers:
  - id: nscale:openai/gpt-oss-120b
    config:
      temperature: 0.7
      max_tokens: 1024
      top_p: 0.9
      frequency_penalty: 0.1
      presence_penalty: 0.2
      stop: ['END', 'STOP']
      stream: true

Supported Parameters

  • temperature: Controls randomness (0.0 to 2.0)
  • max_tokens: Maximum number of tokens to generate
  • top_p: Nucleus sampling parameter
  • frequency_penalty: Reduces repetition based on frequency
  • presence_penalty: Reduces repetition based on presence
  • stop: Stop sequences to halt generation
  • stream: Enable streaming responses
  • seed: Deterministic sampling seed (see the sketch below)
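
Because seed is supported, you can make regression runs more repeatable by pinning it alongside a low temperature. A small sketch, assuming the backing model honors the seed (OpenAI-compatible APIs generally treat it as best-effort):

providers:
  - id: nscale:openai/gpt-oss-120b
    config:
      # Best-effort determinism: fixed seed plus zero temperature
      temperature: 0
      seed: 42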

Example Configuration

Here's a complete example configuration:

providers:
  - id: nscale:openai/gpt-oss-120b
    label: nscale-gpt-oss
    config:
      temperature: 0.7
      max_tokens: 512
  - id: nscale:meta/llama-3.3-70b-instruct
    label: nscale-llama
    config:
      temperature: 0.5
      max_tokens: 1024

prompts:
  - 'Explain {{concept}} in simple terms'
  - 'What are the key benefits of {{concept}}?'

tests:
  - vars:
      concept: quantum computing
    assert:
      - type: contains
        value: 'quantum'
      - type: llm-rubric
        value: 'Explanation should be clear and accurate'
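
Run the example with:

npx promptfoo@latest eval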

Pricing

Nscale offers highly competitive pricing:

  • Text Generation: Starting from $0.01 input / $0.03 output per 1M tokens
  • Embeddings: $0.04 per 1M tokens
  • Image Generation: Starting from $0.0008 per mega-pixel
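
As a rough worked example at those starting text rates, an eval of 10,000 test cases averaging 500 input and 300 output tokens each consumes 5M input and 3M output tokens, costing about $0.05 + $0.09 = $0.14.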

For the most current pricing information, visit Nscale's pricing page.

Key Features

  • Cost-Effective: Up to 80% savings compared to other providers
  • Zero Rate Limits: No throttling or request limits
  • No Cold Starts: Instant response times
  • Serverless: No infrastructure management required
  • OpenAI Compatible: Standard API interface
  • Global Availability: Low-latency inference worldwide

Error Handling

The Nscale provider includes built-in error handling for common issues:

  • Network timeouts and retries
  • Rate-limit responses (rare, given Nscale's zero-rate-limit policy)
  • Invalid service token errors
  • Model availability issues

Support

For support with the Nscale provider: