📄️ Testing LLM chains
Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable outputs and catch failures across multi-step prompts
📄️ Evaluating factuality
How to evaluate the factual accuracy of LLM outputs against reference information using promptfoo's factuality assertion
📄️ Evaluating RAG pipelines
Benchmark RAG pipeline performance with a two-step analysis that evaluates document retrieval accuracy and LLM output quality using factuality and context-adherence metrics
📄️ HLE Benchmark
Run evaluations against Humanity's Last Exam, the most challenging AI benchmark with expert-crafted questions across 100+ subjects, using promptfoo.
📄️ OpenAI vs Azure benchmark
Compare OpenAI vs Azure OpenAI performance across speed, cost, and model updates with automated benchmarks to optimize your LLM infrastructure decisions
📄️ Red teaming a Chatbase Chatbot
Learn how to test and secure Chatbase RAG chatbots against multi-turn conversation attacks with automated red teaming techniques and security benchmarks
📄️ Choosing the best GPT model
Compare GPT-4o vs GPT-4o-mini performance on your custom data with automated benchmarks to evaluate reasoning capabilities, costs, and response latency metrics
📄️ Claude 3.7 vs GPT-4.1
Learn how to benchmark Claude 3.7 against GPT-4.1 using your own data with promptfoo. Discover which model performs best for your specific use case.
📄️ Cohere Command-R benchmarks
Compare Cohere Command-R vs GPT-4 vs Claude performance with automated benchmarks to evaluate model accuracy on your specific use cases and datasets
📄️ Llama vs GPT benchmark
Compare Llama 3.1 405B vs GPT-4 performance on custom datasets using automated benchmarks and side-by-side evaluations to identify the best model for your use case
📄️ DBRX benchmarks
Compare DBRX vs Mixtral vs GPT-3.5 performance with custom benchmarks to evaluate real-world task accuracy and identify the optimal model for your use case
📄️ DeepSeek benchmark
Compare DeepSeek MoE (671B params) vs GPT-4.1 vs Llama-3 performance with custom benchmarks to evaluate code tasks and choose the optimal model for your needs
📄️ Evaluating LLM safety with HarmBench
Assess LLM vulnerabilities against 400+ adversarial attacks using HarmBench benchmarks to identify and prevent harmful outputs across 6 risk categories
📄️ Red teaming a CrewAI Agent
Evaluate CrewAI agent security and performance with automated red team testing. Compare agent responses across 100+ test cases to identify vulnerabilities.
📄️ Evaluating JSON outputs
Validate and test LLM JSON outputs with automated schema checks and field assertions to ensure reliable, well-formed data structures in your AI applications
📄️ Evaluate LangGraph
Hands-on tutorial (July 2025) on evaluating and red-teaming LangGraph agents with Promptfoo; includes setup, YAML tests, and security scans.
📄️ Choosing the right temperature for your LLM
Compare LLM temperature settings from 0.1 to 1.0 to balance model creativity against consistency with automated benchmarks and randomness metrics
📄️ Evaluating OpenAI Assistants
Compare OpenAI Assistant configurations and measure performance across different prompts, models, and tools to optimize your AI application's accuracy and reliability
📄️ Evaluating Replicate Lifeboat
Compare GPT-3.5 vs Llama2-70b performance on real-world prompts using Replicate Lifeboat API to benchmark model accuracy and response quality for your specific use case
📄️ Gemini vs GPT
Compare Gemini Pro vs GPT-4 performance on your custom datasets using automated benchmarks and side-by-side analysis to identify the best model for your use case
📄️ Gemma vs Llama
Compare Google Gemma vs Meta Llama performance on custom datasets using automated benchmarks and side-by-side evaluations to select the best model for your use case
📄️ Gemma vs Mistral/Mixtral
Compare Gemma vs Mistral vs Mixtral performance on your custom datasets with automated benchmarks to identify the best model for your specific use case
📄️ GPT-3.5 vs GPT-4
Compare GPT-3.5 vs GPT-4 performance on your custom datasets using automated benchmarks to evaluate costs, latency and reasoning capabilities for your use case
📄️ GPT-4o vs GPT-4o-mini
Compare GPT-4o vs GPT-4o-mini performance on your custom datasets with automated benchmarks to evaluate cost, latency and reasoning capabilities
📄️ GPT-4.1 vs GPT-4o MMLU
Compare GPT-4.1 and GPT-4o performance on MMLU academic reasoning tasks using promptfoo with step-by-step setup and research-backed optimization techniques.
📄️ GPT-4.1 vs o1
Benchmark OpenAI o1 reasoning models against GPT-4.1 for cost, latency, and accuracy to optimize model selection decisions
📄️ Using LangChain PromptTemplate with Promptfoo
Learn how to test LangChain PromptTemplate outputs systematically with Promptfoo's evaluation tools to validate prompt formatting and variable injection
📄️ Uncensored Llama2 benchmark
Compare Llama2 Uncensored vs GPT-3.5 responses on sensitive topics using automated benchmarks to evaluate model safety and content filtering capabilities
📄️ How to red team LLM applications
Protect your LLM applications from prompt injection, jailbreaks, and data leaks with automated red teaming tests that identify 20+ vulnerability types and security risks
📄️ Magistral AIME2024 Benchmark
Reproduce Mistral's Magistral AIME2024 math benchmark result of 73.6% accuracy, with a detailed evaluation methodology and comparisons
📄️ Mistral vs Llama
Compare Mistral 7B, Mixtral 8x7B, and Llama 3.1 performance on custom benchmarks to optimize model selection for your specific LLM application needs
📄️ Mixtral vs GPT
Compare Mixtral vs GPT-4 performance on custom datasets using automated benchmarks and evaluation metrics to identify the optimal model for your use case
📄️ Multi-Modal Red Teaming
Red team multimodal AI systems using adversarial text, images, audio, and video inputs to identify cross-modal vulnerabilities
📄️ Phi vs Llama
Compare Phi 3 vs Llama 3.1 performance on your custom datasets using automated benchmarks and side-by-side evaluations to select the optimal model for your use case
📄️ Preventing hallucinations
Measure and reduce LLM hallucinations using perplexity metrics, RAG, and controlled decoding techniques to achieve 85%+ factual accuracy in AI outputs
📄️ Qwen vs Llama vs GPT
Compare Qwen-2-72B vs GPT-4o vs Llama-3-70B performance on customer support tasks with custom benchmarks to optimize your chatbot's response quality
📄️ Sandboxed Evaluations of LLM-Generated Code
Safely evaluate and benchmark LLM-generated code in isolated Docker containers to prevent security risks and catch errors before production deployment
📄️ Testing Guardrails
Learn how to test guardrails in your AI applications to prevent harmful content, detect PII, and block prompt injections
📄️ Evaluating LLM text-to-SQL performance
Compare text-to-SQL accuracy across GPT-3.5 and GPT-4 using automated test cases and schema validation to optimize database query generation performance
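Most of the guides above follow the same basic workflow: a `promptfooconfig.yaml` that declares the prompts to test, the providers (models) to compare, and assertions that grade each output. As a rough orientation, here is a minimal sketch of such a config; the prompt, provider choices, and test values are illustrative placeholders rather than examples taken from any specific guide above:

```yaml
# promptfooconfig.yaml (minimal illustrative sketch; prompt and test values are placeholders)
description: Side-by-side model comparison

prompts:
  - 'Answer the customer question concisely: {{question}}'

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini

tests:
  - vars:
      question: 'How do I reset my password?'
    assert:
      # Deterministic check on the output text
      - type: contains
        value: 'password'
      # Model-graded check against a plain-language rubric
      - type: llm-rubric
        value: 'Gives clear, step-by-step reset instructions'
```

Running `npx promptfoo@latest eval` against a config like this produces the side-by-side comparison the guides build on, and `npx promptfoo@latest view` opens the results in the web viewer.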