📄️ Testing LLM chains
Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like LangChain, and OpenAI has released built-in support via OpenAI functions.
📄️ Evaluating factuality
promptfoo implements OpenAI's evaluation methodology for factuality, using the factuality assertion type.
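A minimal promptfoo config using the factuality assertion might look like the sketch below. The prompt, provider id, and test variables are illustrative, not taken from the docs:

```yaml
prompts:
  - 'What is the capital of {{state}}?'
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      state: California
    assert:
      - type: factuality
        value: The capital of California is Sacramento
```

The `value` field holds the reference answer that the model's output is graded against.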
📄️ Evaluating RAG pipelines
Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt is converted into an embedding, and matching documents are fetched from a vector store. Then the LLM is called with the matching documents as part of the prompt.
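The retrieval step described above can be sketched in a few lines of Python. This is a toy example: the embeddings are hand-written three-dimensional vectors rather than the output of a real embedding model, and the "vector store" is a plain dict.

```python
import math

# Toy corpus with hand-written embeddings; a real pipeline would use an
# embedding model and a vector store instead.
DOCS = {
    "Refunds are issued within 30 days of purchase.": [0.9, 0.1, 0.0],
    "Our office is located in Berlin.": [0.1, 0.9, 0.0],
    "Support is available 24/7 via chat.": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_embedding, k=1):
    """Return the k documents closest to the query embedding."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_embedding, DOCS[d]), reverse=True)
    return ranked[:k]

def build_prompt(question, query_embedding):
    """Assemble the final LLM prompt from the retrieved context."""
    context = "\n".join(retrieve(query_embedding))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Pretend the question embeds near the first document.
prompt = build_prompt("When do I get my refund?", [0.8, 0.2, 0.1])
print(prompt)
```

The string returned by `build_prompt` is what would be sent to the LLM; evaluating a RAG pipeline means testing both the retrieval step and the final generation.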
📄️ OpenAI vs Azure benchmark
Whether you use GPT through the OpenAI or Azure APIs, the results are pretty similar. But there are some key differences:
📄️ Llama 2 vs GPT benchmark
This guide describes how to compare three models - Llama v2 70B, GPT 3.5, and GPT 4 - using the promptfoo CLI.
📄️ Evaluating JSON outputs
Getting an LLM to output valid JSON can be a difficult task. There are a few failure modes:
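Two of the most common failure modes are the model wrapping its JSON in a markdown code fence and the model surrounding it with explanatory prose. A best-effort extraction helper (a sketch, not a library API) can recover from both before strict parsing:

```python
import json
import re

def extract_json(text):
    """Best-effort extraction of a JSON object from an LLM response.

    Handles a markdown ```json fence and surrounding prose; anything
    else still raises, so callers can retry or fail the test case.
    """
    # Strip a ```json ... ``` fence if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    return json.loads(text[start : end + 1])

messy = 'Sure! Here is the data:\n```json\n{"name": "Ada", "age": 36}\n```'
print(extract_json(messy))
```

Keeping the final `json.loads` strict means malformed output (trailing commas, single quotes) still surfaces as an error rather than being silently accepted.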
📄️ Choosing the right temperature for your LLM
The `temperature` setting in language models is like a dial that adjusts how predictable or surprising the responses from the model will be, helping application developers fine-tune the AI's creativity to suit different tasks.
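The effect of the dial can be illustrated with a temperature-scaled softmax over hypothetical next-token logits: low temperature sharpens the distribution toward the top token, high temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into a probability distribution, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.2 the top token takes nearly all the probability mass; at 2.0 the three tokens are much closer, which is why higher temperatures produce more varied (and riskier) completions.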
📄️ Evaluating OpenAI Assistants
OpenAI recently released an Assistants API that offers simplified handling for message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work of implementing a RAG architecture.
📄️ Evaluating Replicate Lifeboat
Replicate put together a "Lifeboat" OpenAI proxy that allows you to swap to their hosted Llama2-70b instances. They are generously providing this API for free for a week.
📄️ Gemini vs GPT
When comparing Gemini with GPT, you'll find plenty of evals and opinions online. Model capabilities set a ceiling on what you're able to accomplish, but in my experience most LLM apps are highly dependent on their prompting and use case.
📄️ Gemma vs Llama
Comparing Google's Gemma and Meta's Llama involves more than just looking at their specs and reading about generic benchmarks. The true measure of their usefulness comes down to how they perform on the specific tasks you need them for, in the context of your specific application.
📄️ Gemma vs Mistral/Mixtral
When comparing the performance of LLMs, it's best not to rely on generic benchmarks. This guide shows you how to set up a comprehensive benchmark that compares Gemma vs Mistral vs Mixtral.
📄️ GPT 3.5 vs GPT 4
This guide will walk you through how to compare OpenAI's GPT-3.5 and GPT-4 using promptfoo. This testing framework will give you the chance to test the models' reasoning capabilities, cost, and latency.
📄️ Uncensored Llama2 benchmark
Most LLMs go through fine-tuning that prevents them from answering questions like "How do you make Tylenol", "Who would win in a fist fight...", and "Write a recipe for dangerously spicy mayo."
📄️ Mistral vs Llama2
When Mistral was released, it was the "best 7B model to date" based on a number of evals. Mixtral, a mixture-of-experts model based on Mistral, was recently announced with even more impressive eval performance.
📄️ Mixtral vs GPT
In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-3.5, and GPT-4. We will use promptfoo, a command-line interface (CLI) tool, to run evaluations and compare the performance of these models based on a set of prompts and test cases.
📄️ Preventing hallucinations
LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known as hallucination. Factuality and LLM "grounding" is a key concern for developers building LLM applications.