LLM Testing Tools 2026: PromptFoo, Evals, and How to Stop Guessing
How to test your LLM prompts and applications systematically. PromptFoo and alternatives for developers who need reliable AI outputs.
Disclosure: This post may contain affiliate links. We earn a commission if you purchase — at no extra cost to you. Our opinions are always our own.
Most developers shipping LLM features are doing it wrong. They write a prompt, try it a few times in a playground, ship it, and hope for the best. When something breaks in production, they tweak the prompt manually and redeploy.
This works until it doesn't — and it usually stops working at the worst possible time.
Systematic How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals" class="internal-link">LLM testing is how you stop guessing and start shipping AI features with confidence. This guide covers PromptFoo (the best open-source tool for this), alternatives worth knowing, and how to wire it all into a CI/CD pipeline.
Why LLM Testing Is Hard
Testing deterministic code is well-understood: given input X, assert output Y. LLMs break this contract. The same prompt can produce different outputs on successive runs. Quality is often subjective. Regressions are subtle — a prompt change that improves responses for one class of inputs can silently degrade others.
The specific challenges:
Non-determinism: Even at temperature 0, some models aren't perfectly deterministic. Evaluation suites need to handle variance gracefully.
Vibes-based evaluation: "This response feels better" is not a test assertion. Formalizing what "better" means for your use case is hard and requires domain knowledge.
Ground truth is expensive: Creating labeled test datasets where you have authoritative answers requires human effort. For many domains, there's no cheap proxy.
Prompt sensitivity: Minor wording changes can produce major quality differences. A word swap that looks cosmetic can meaningfully shift response distribution.
Model drift: LLM providers update models silently. The review-2026" title="Claude Opus 4.6 Review 2026 — Is It Still the Best LLM for Serious Work?" class="internal-link">Claude or GPT-4 you called six months ago may behave differently today. Without evals, you won't know until users complain.
Coverage: LLM applications have enormous input spaces. Testing the happy path is easy; testing edge cases, adversarial inputs, and failure modes is where the real work is.
Despite all this, systematic testing is tractable. The key is accepting approximate, probabilistic assertions rather than demanding exact outputs.
Get the Weekly TrendHarvest Pick
One email. The best tool, deal, or guide we found this week. No spam.
What Is PromptFoo
PromptFoo is an open-source CLI and library for evaluating LLM prompts and applications. You define test cases, assertions, and the models you want to test, then run evaluations that produce structured results you can analyze, compare, and track over time.
It supports multi-model comparisons (run the same test suite against Claude Pro, ChatGPT Plus, and local models simultaneously), multiple assertion types (string matching, regex, semantic similarity, LLM-as-judge), and CI/CD integration.
Installation
# Install globally
npm install -g promptfoo
# Or use npx without installing
npx promptfoo@latest
Project Initialization
mkdir my-evals && cd my-evals
promptfoo init
This creates a promptfooconfig.yaml file. Here's a minimal config:
providers:
- openai:gpt-4
- anthropic:claude-3-5-sonnet-20241022
prompts:
- "Classify the sentiment of this review: {{review}}"
tests:
- vars:
review: "This product is absolutely terrible. Broke after one day."
assert:
- type: icontains
value: "negative"
- vars:
review: "Exceeded all my expectations. Would buy again."
assert:
- type: icontains
value: "positive"
Run the evaluation:
promptfoo eval
View results in the browser:
promptfoo view
This opens a side-by-side comparison of outputs across all providers with pass/fail status for each assertion.
Writing Test Cases
Test cases are the most important part of your eval suite. The quality of your tests determines whether your eval results mean anything.
Assertion Types
PromptFoo supports a range of assertion types, from simple to sophisticated:
String matching:
assert:
- type: icontains # case-insensitive contains
value: "negative"
- type: not-icontains # must NOT contain
value: "positive"
- type: equals # exact match
value: "NEGATIVE"
- type: regex
value: "(negative|bad|poor)"
Structural assertions:
assert:
- type: is-json # output must be valid JSON
- type: javascript # custom JS assertion
value: "output.length < 500"
- type: python # custom Python assertion
value: "len(output) < 500"
Semantic assertions:
assert:
- type: similar # cosine similarity to reference
value: "The sentiment is negative"
threshold: 0.8
LLM-as-judge:
assert:
- type: llm-rubric
value: "The response correctly identifies the sentiment and provides a brief explanation. It should not make up details not present in the review."
The LLM-as-judge assertion sends the output to another LLM (by default, GPT-4) with a rubric and asks it to pass/fail. This is expensive but often the most practical way to evaluate open-ended outputs.
Writing Good Test Cases
Cover the distribution of real inputs: Your test cases should reflect what users actually send, not just clean examples you made up. Pull from production logs if you have them.
Include adversarial cases: Prompt injections, edge cases, malformed inputs, off-topic requests. Your LLM application needs to handle these gracefully.
Test your failure modes explicitly: If your system prompt says "never discuss competitors," write a test case that tries to elicit competitor discussion and assert it fails.
Use multiple assertions per test: A single icontains assertion is weak. Layer a structural check, a content check, and a length check for important cases.
Keep test cases independent: Each test case should be self-contained. Don't rely on state from previous tests.
Example: Testing a Code Review Assistant
providers:
- anthropic:claude-3-5-sonnet-20241022
prompts:
- file://prompts/code-reviewer.txt
tests:
- description: "SQL injection vulnerability should be flagged"
vars:
code: |
def get_user(username):
query = f"SELECT * FROM users WHERE name = '{username}'"
return db.execute(query)
assert:
- type: icontains
value: "sql injection"
- type: icontains
value: "parameterized"
- type: llm-rubric
value: "The response identifies the SQL injection vulnerability and suggests a fix using parameterized queries or an ORM."
- description: "Clean code should not raise false positives"
vars:
code: |
def get_user(user_id: int):
return db.execute("SELECT * FROM users WHERE id = ?", (user_id,))
assert:
- type: not-icontains
value: "sql injection"
- type: llm-rubric
value: "The response either approves the code or provides constructive feedback that is not about SQL injection, which is not present."
Alternative Tools
PromptFoo is the best general-purpose option, but there are alternatives worth knowing depending on your context.
Braintrust
Braintrust is a hosted eval and observability platform. The developer experience is polished — you instrument your code with the Braintrust SDK, and traces, evals, and experiments appear in a web dashboard.
The advantage over PromptFoo: tighter integration with your application code, better dataset management, and a more collaborative interface for teams. The disadvantage: it's a paid SaaS product after the free tier, and your data goes to Braintrust's servers.
import braintrust
experiment = braintrust.init(project="my-project")
with experiment.start_span("classify"):
result = llm.classify(input_text)
experiment.log(
input=input_text,
output=result,
expected="positive",
scores={"correct": result == "positive"}
)
Ragas (for RAG pipelines)
Ragas is specifically designed for evaluating RAG pipelines. It measures:
- Faithfulness: Does the answer make claims supported by the retrieved context?
- Answer relevancy: Does the answer address the actual question?
- Context precision: Is the retrieved context relevant to the question?
- Context recall: Did retrieval find all the information needed to answer?
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
results = evaluate(
dataset=your_dataset,
metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())
Ragas is the best tool for diagnosing RAG quality problems specifically. If your RAG pipeline is giving bad answers, Ragas helps you isolate whether the problem is retrieval or generation.
HELM (Holistic Evaluation of Language Models)
HELM is Stanford's benchmark framework. It's comprehensive — covering accuracy, calibration, robustness, fairness, and efficiency across many scenarios. HELM is primarily a research tool for comparing models at a macro level, not for testing your application's specific prompts.
Use HELM if you're doing model selection research or academic work. Don't use it as your production eval framework — it's not designed for that.
OpenAI Evals
OpenAI's own eval framework is open source. It's tightly coupled to OpenAI's models and infrastructure, and the DX isn't as polished as PromptFoo for application-level testing. But it's worth knowing because many papers and blog posts reference it, and the dataset format is reusable.
CI/CD Integration
Running evals in CI prevents regressions from reaching production. Here's how to integrate PromptFoo into GitHub Actions.
GitHub Actions Workflow
# .github/workflows/llm-evals.yml
name: LLM Evals
on:
pull_request:
paths:
- 'prompts/**'
- 'promptfooconfig.yaml'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install PromptFoo
run: npm install -g promptfoo
- name: Run evals
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
promptfoo eval --ci --output results.json
- name: Check results
run: |
# Fail if pass rate drops below 90%
node -e "
const r = require('./results.json');
const passRate = r.results.stats.successes / r.results.stats.total;
if (passRate < 0.9) {
console.error('Pass rate ' + (passRate * 100).toFixed(1) + '% below threshold');
process.exit(1);
}
"
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: eval-results
path: results.json
What to Gate On
Not every CI failure should block a merge. Consider:
- Hard block: Pass rate drops below a threshold (e.g., 90%)
- Hard block: Any test case in the "safety" category fails
- Warning only: Performance on non-critical test cases degrades
- Informational: Model cost comparison changes significantly
Practical Workflow
Red-Teaming
Before shipping a prompt, try to break it. Write test cases that attempt:
- Prompt injection ("Ignore all previous instructions and...")
- Jailbreaking attempts
- Off-topic requests
- Ambiguous inputs where the right behavior is unclear
- Inputs in languages you didn't design for
Add these to your eval suite as explicit cases with assertions. When they start passing reliably, you have evidence your system handles adversarial inputs.
Regression Testing
Every time a prompt change ships to production, add test cases for the behaviors it was supposed to improve. When you later change the prompt again, those cases run and catch regressions. Over time, your eval suite becomes a specification of what your application should do.
A/B Prompt Comparison
PromptFoo is built for this. Define two prompt variants and run the same test suite against both:
prompts:
- id: "v1"
raw: "Classify this review: {{review}}"
- id: "v2"
raw: |
You are a sentiment analysis expert.
Classify the following review as positive, negative, or neutral.
Review: {{review}}
Classification:
The side-by-side view shows which version passes more tests and where each fails. This is much more reliable than intuition about which prompt "sounds better."
LLM Model Selection
Claude Pro and ChatGPT Plus perform differently on different tasks. Cursor Pro for coding workflows has its own characteristics. Before committing to a model for a specific application feature, run your eval suite against all candidates. The results often surprise — the "best" model for general use isn't always the best for your specific task.
This pairs naturally with our best AI coding assistants comparison and our Claude Code review for context on how models compare in practice.
Building an Evaluation Dataset
Start small and grow deliberately:
Seed with synthetic cases: Generate 20–30 test cases covering your core use cases. These won't be perfect but get you moving.
Sample from production: Once you have production traffic, sample real inputs (anonymizing as needed) and manually label outputs as pass/fail.
Capture regressions: Every bug report that involves an LLM output should become a test case.
Adversarial expansion: Periodically run red-team sessions and add cases from the findings.
A mature eval suite for a production LLM feature will have 200–500 test cases built up over several months. It's an investment that pays dividends every time you change a prompt or upgrade a model.
Tools We Recommend
- PromptFoo (free/open source) — Best LLM eval framework for prompt testing and model comparison (free, open source)
- Ragas (free/open source) — Best RAG-specific evaluation metrics library (free, open source)
- Claude Pro — Best AI assistant for writing test cases and debugging prompt regressions ($20/mo)
- GitHub Actions (free tier) — Best CI/CD platform for automating eval runs on pull requests (free tier available)
FAQ
Is PromptFoo free?
Yes, PromptFoo is open source (MIT license). The CLI and local evaluation are completely free. There's a hosted version with additional features, but everything needed for local and CI/CD eval is free.
How many test cases do I need?
For a prompt in active development, 20–30 cases covering core scenarios and obvious edge cases is a reasonable starting point. A production prompt handling sensitive or high-stakes tasks should have 100+, including adversarial cases. The right number depends on the risk profile of the feature.
What's the difference between evals and unit tests?
Unit tests assert exact outputs from deterministic functions. Evals use probabilistic assertions against non-deterministic systems. You run evals against a sample of inputs with pass/fail criteria that accept a range of valid outputs. You also run evals repeatedly and track trends — a feature that's passing 95% of the time is different from one that passes 70%.
Can I use PromptFoo with local models?
Yes. PromptFoo supports Ollama as a provider:
providers:
- ollama:llama3
- ollama:mistral
This is useful for comparing local model quality against hosted models before committing to API costs, or for teams that need to evaluate on-premises.
How do I handle non-deterministic outputs in assertions?
Use threshold-based or semantic assertions rather than exact-match. type: similar with a threshold accepts outputs that are semantically close to your reference. type: llm-rubric evaluates against a rubric rather than an exact answer. type: javascript lets you write custom logic that accepts any output matching your criteria.
Should I run evals on every commit or just on prompt changes?
At minimum, run evals when prompts change. Many teams also run a fast smoke-test suite (10–20 critical cases) on every commit, and the full suite on a schedule (nightly) or before releases. The fast suite catches obvious regressions early; the full suite catches subtle ones.
What about testing RAG pipelines specifically?
Use Ragas for RAG-specific evaluation metrics (faithfulness, context precision, etc.). Use PromptFoo for end-to-end evaluation of the full pipeline against expected outputs. Both are complementary — Ragas tells you where the quality problem is (retrieval vs generation), PromptFoo tells you whether the system as a whole meets your quality bar.
Tools Mentioned in This Article
Recommended Resources
Curated prompt packs and tools to help you take action on what you just read.
Related Articles
How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals
How to systematically evaluate LLM outputs in production. The developer's guide to evals: metrics, tools, LLM-as-judge, and CI integration.
AI Browser Agents 2026: Page Agent, Browse AI, Bardeen, and Browserbase Compared
Compare the best AI browser agent tools in 2026. From no-code scraping to fully autonomous web agents — which tool fits your workflow?
AI Memory and Context in 2026: RAG vs Fine-Tuning vs Long Context Windows Explained
RAG vs fine-tuning vs long context windows: when to use each approach for giving AI models memory and access to your data.