📄️ Testing LLM chains
Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable outputs and catch failures across multi-step prompts
📄️ Evaluating factuality
How to evaluate the factual accuracy of LLM outputs against reference information using promptfoo's factuality assertion
📄️ Evaluating RAG pipelines
Benchmark RAG pipeline performance with a two-step analysis that evaluates document retrieval accuracy and LLM output quality using factuality and context-adherence metrics
📄️ HLE Benchmark
Run evaluations against Humanity's Last Exam, the most challenging AI benchmark with expert-crafted questions across 100+ subjects, using promptfoo.
📄️ OpenAI vs Azure benchmark
Compare OpenAI vs Azure OpenAI performance across speed, cost, and model updates with automated benchmarks to optimize your LLM infrastructure decisions
📄️ Red teaming a Chatbase Chatbot
Learn how to test and secure Chatbase RAG chatbots against multi-turn conversation attacks with automated red teaming techniques and security benchmarks
📄️ Choosing the best GPT model
Compare GPT-4o vs GPT-4o-mini performance on your custom data with automated benchmarks to evaluate reasoning capabilities, costs, and response latency metrics
📄️ Claude 3.7 vs GPT-4.1
Learn how to benchmark Claude 3.7 against GPT-4.1 using your own data with promptfoo. Discover which model performs best for your specific use case.
📄️ Cohere Command-R benchmarks
Compare Cohere Command-R vs GPT-4 vs Claude performance with automated benchmarks to evaluate model accuracy on your specific use cases and datasets
📄️ Llama vs GPT benchmark
Compare Llama 3.1 405B vs GPT-4 performance on custom datasets using automated benchmarks and side-by-side evaluations to identify the best model for your use case
📄️ DBRX benchmarks
Compare DBRX vs Mixtral vs GPT-3.5 performance with custom benchmarks to evaluate real-world task accuracy and identify the optimal model for your use case
📄️ DeepSeek benchmark
Compare DeepSeek MoE (671B params) vs GPT-4.1 vs Llama-3 performance with custom benchmarks to evaluate code tasks and choose the optimal model for your needs
📄️ Evaluating LLM safety with HarmBench
Assess LLM vulnerabilities against 400+ adversarial attacks using HarmBench benchmarks to identify and prevent harmful outputs across 6 risk categories
📄️ Red teaming a CrewAI Agent
Evaluate CrewAI agent security and performance with automated red team testing. Compare agent responses across 100+ test cases to identify vulnerabilities.
📄️ Evaluating JSON outputs
Validate and test LLM JSON outputs with automated schema checks and field assertions to ensure reliable, well-formed data structures in your AI applications
📄️ Evaluate LangGraph
Hands-on tutorial (July 2025) on evaluating and red-teaming LangGraph agents with Promptfoo; includes setup, YAML tests, and security scans.
📄️ Choosing the right temperature for your LLM
Compare LLM temperature settings from 0.1 to 1.0 to balance model creativity against consistency with automated benchmarks and randomness metrics
📄️ Evaluating OpenAI Assistants
Compare OpenAI Assistant configurations and measure performance across different prompts, models, and tools to optimize your AI application's accuracy and reliability
📄️ Evaluating Replicate Lifeboat
Compare GPT-3.5 vs Llama2-70b performance on real-world prompts using Replicate Lifeboat API to benchmark model accuracy and response quality for your specific use case
📄️ Gemini vs GPT
Compare Gemini Pro vs GPT-4 performance on your custom datasets using automated benchmarks and side-by-side analysis to identify the best model for your use case
📄️ Gemma vs Llama
Compare Google Gemma vs Meta Llama performance on custom datasets using automated benchmarks and side-by-side evaluations to select the best model for your use case
📄️ Gemma vs Mistral/Mixtral
Compare Gemma vs Mistral vs Mixtral performance on your custom datasets with automated benchmarks to identify the best model for your specific use case
📄️ GPT-3.5 vs GPT-4
Compare GPT-3.5 vs GPT-4 performance on your custom datasets using automated benchmarks to evaluate costs, latency and reasoning capabilities for your use case
📄️ GPT-4o vs GPT-4o-mini
Compare GPT-4o vs GPT-4o-mini performance on your custom datasets with automated benchmarks to evaluate cost, latency and reasoning capabilities
📄️ GPT-4.1 vs GPT-4o MMLU
Compare GPT-4.1 and GPT-4o performance on MMLU academic reasoning tasks using promptfoo with step-by-step setup and research-backed optimization techniques.
📄️ GPT-4.1 vs o1
Benchmark OpenAI o1 reasoning models against GPT-4.1 for cost, latency, and accuracy to optimize model selection decisions
📄️ Using LangChain PromptTemplate with Promptfoo
Learn how to test LangChain PromptTemplate outputs systematically with Promptfoo's evaluation tools to validate prompt formatting and variable injection
📄️ Uncensored Llama2 benchmark
Compare Llama2 Uncensored vs GPT-3.5 responses on sensitive topics using automated benchmarks to evaluate model safety and content filtering capabilities
📄️ How to red team LLM applications
Protect your LLM applications from prompt injection, jailbreaks, and data leaks with automated red teaming tests that identify 20+ vulnerability types and security risks
📄️ Magistral AIME2024 Benchmark
Reproduce Mistral's Magistral AIME2024 math benchmark result of 73.6% accuracy, with a detailed evaluation methodology and comparisons
📄️ Mistral vs Llama
Compare Mistral 7B, Mixtral 8x7B, and Llama 3.1 performance on custom benchmarks to optimize model selection for your specific LLM application needs
📄️ Mixtral vs GPT
Compare Mixtral vs GPT-4 performance on custom datasets using automated benchmarks and evaluation metrics to identify the optimal model for your use case
📄️ Multi-Modal Red Teaming
Red team multimodal AI systems using adversarial text, images, audio, and video inputs to identify cross-modal vulnerabilities
📄️ Phi vs Llama
Compare Phi 3 vs Llama 3.1 performance on your custom datasets using automated benchmarks and side-by-side evaluations to select the optimal model for your use case
📄️ Preventing hallucinations
Measure and reduce LLM hallucinations using perplexity metrics, RAG, and controlled decoding techniques to achieve 85%+ factual accuracy in AI outputs
📄️ Qwen vs Llama vs GPT
Compare Qwen-2-72B vs GPT-4o vs Llama-3-70B performance on customer support tasks with custom benchmarks to optimize your chatbot's response quality
📄️ Sandboxed Evaluations of LLM-Generated Code
Safely evaluate and benchmark LLM-generated code in isolated Docker containers to prevent security risks and catch errors before production deployment
📄️ Testing Guardrails
Learn how to test guardrails in your AI applications to prevent harmful content, detect PII, and block prompt injections
📄️ Evaluating LLM text-to-SQL performance
Compare text-to-SQL accuracy across GPT-3.5 and GPT-4 using automated test cases and schema validation to optimize database query generation performance
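Most of the guides above follow the same basic workflow: a `promptfooconfig.yaml` that declares the prompts to test, the providers (models) to compare, and assertions that grade each output. As a rough orientation, here is a minimal sketch of such a config; the prompt, provider choices, and test values are illustrative placeholders rather than examples taken from any specific guide above:

```yaml
# promptfooconfig.yaml (minimal illustrative sketch; prompt and test values are placeholders)
description: Side-by-side model comparison

prompts:
  - 'Answer the customer question concisely: {{question}}'

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini

tests:
  - vars:
      question: 'How do I reset my password?'
    assert:
      # Deterministic check on the output text
      - type: contains
        value: 'password'
      # Model-graded check against a plain-language rubric
      - type: llm-rubric
        value: 'Gives clear, step-by-step reset instructions'
```

Running `npx promptfoo@latest eval` against a config like this produces the side-by-side comparison the guides build on, and `npx promptfoo@latest view` opens the results in the web viewer.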