Not every CI failure should block a merge. Consider: - Hard block: Pass rate drops below a threshold (e.g., 90%) - Hard block: Any test case in the "safety" category fails - Warning only: Performance on non-critical test cases degrades - Informational: Model cost comparison changes significantly

Developer Guides

LLM Testing Tools 2026: PromptFoo, Evals, and How to Stop Guessing

Q: Can I use PromptFoo with local models?

Yes. PromptFoo supports Ollama as a provider: ``yaml providers: - ollama:llama3 - ollama:mistral `` This is useful for comparing local model quality against hosted models before committing to API costs, or for teams that need to evaluate on-premises.

How to test your LLM prompts and applications systematically. PromptFoo and alternatives for developers who need reliable AI outputs.

Alex Chen·March 13, 2026·12 min read·2,380 words

Disclosure: This post may contain affiliate links. We earn a commission if you purchase — at no extra cost to you. Our opinions are always our own.

LLM Testing Tools 2026: PromptFoo, Evals, and How to Stop Guessing

Quick links:Hands-On Large Language Models Book →AI Engineering Book by Chip Huyen →Software Engineering at Google Book →MacBook Pro M3 for AI Development →

Most developers shipping LLM features are doing it wrong. They write a prompt, try it a few times in a playground, ship it, and hope for the best. When something breaks in production, they tweak the prompt manually and redeploy.

This works until it doesn't — and it usually stops working at the worst possible time.

Systematic How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals" class="internal-link">LLM testing is how you stop guessing and start shipping AI features with confidence. This guide covers PromptFoo (the best open-source tool for this), alternatives worth knowing, and how to wire it all into a CI/CD pipeline.

Why LLM Testing Is Hard

Testing deterministic code is well-understood: given input X, assert output Y. LLMs break this contract. The same prompt can produce different outputs on successive runs. Quality is often subjective. Regressions are subtle — a prompt change that improves responses for one class of inputs can silently degrade others.

The specific challenges:

Non-determinism: Even at temperature 0, some models aren't perfectly deterministic. Evaluation suites need to handle variance gracefully.

Vibes-based evaluation: "This response feels better" is not a test assertion. Formalizing what "better" means for your use case is hard and requires domain knowledge.

Ground truth is expensive: Creating labeled test datasets where you have authoritative answers requires human effort. For many domains, there's no cheap proxy.

Prompt sensitivity: Minor wording changes can produce major quality differences. A word swap that looks cosmetic can meaningfully shift response distribution.

Model drift: LLM providers update models silently. The Claude or GPT-4 you called six months ago may behave differently today. Without evals, you won't know until users complain.

Coverage: LLM applications have enormous input spaces. Testing the happy path is easy; testing edge cases, adversarial inputs, and failure modes is where the real work is.

Despite all this, systematic testing is tractable. The key is accepting approximate, probabilistic assertions rather than demanding exact outputs.

Get the Weekly TrendHarvest Pick

One email. The best tool, deal, or guide we found this week. No spam.

What Is PromptFoo

PromptFoo is an open-source CLI and library for evaluating LLM prompts and applications. You define test cases, assertions, and the models you want to test, then run evaluations that produce structured results you can analyze, compare, and track over time.

It supports multi-model comparisons (run the same test suite against Claude Pro, ChatGPT Plus, and Local AI Models You Can Run on Your Laptop 2026" class="internal-link">local models simultaneously), multiple assertion types (string matching, regex, semantic similarity, LLM-as-judge), and CI/CD integration.

Installation

# Install globally
npm install -g promptfoo

# Or use npx without installing
npx promptfoo@latest

Project Initialization

mkdir my-evals && cd my-evals
promptfoo init

This creates a promptfooconfig.yaml file. Here's a minimal config:

providers:
  - openai:gpt-4
  - anthropic:claude-3-5-sonnet-20241022

prompts:
  - "Classify the sentiment of this review: {{review}}"

tests:
  - vars:
      review: "This product is absolutely terrible. Broke after one day."
    assert:
      - type: icontains
        value: "negative"

  - vars:
      review: "Exceeded all my expectations. Would buy again."
    assert:
      - type: icontains
        value: "positive"

Run the evaluation:

promptfoo eval

View results in the browser:

promptfoo view

This opens a side-by-side comparison of outputs across all providers with pass/fail status for each assertion.

Writing Test Cases

Test cases are the most important part of your eval suite. The quality of your tests determines whether your eval results mean anything.

Assertion Types

PromptFoo supports a range of assertion types, from simple to sophisticated:

String matching:

assert:
  - type: icontains          # case-insensitive contains
    value: "negative"
  - type: not-icontains      # must NOT contain
    value: "positive"
  - type: equals             # exact match
    value: "NEGATIVE"
  - type: regex
    value: "(negative|bad|poor)"

Structural assertions:

assert:
  - type: is-json            # output must be valid JSON
  - type: javascript         # custom JS assertion
    value: "output.length < 500"
  - type: python             # custom Python assertion
    value: "len(output) < 500"

Semantic assertions:

assert:
  - type: similar            # cosine similarity to reference
    value: "The sentiment is negative"
    threshold: 0.8

LLM-as-judge:

assert:
  - type: llm-rubric
    value: "The response correctly identifies the sentiment and provides a brief explanation. It should not make up details not present in the review."

The LLM-as-judge assertion sends the output to another LLM (by default, GPT-4) with a rubric and asks it to pass/fail. This is expensive but often the most practical way to evaluate open-ended outputs.

Writing Good Test Cases

Cover the distribution of real inputs: Your test cases should reflect what users actually send, not just clean examples you made up. Pull from production logs if you have them.

Include adversarial cases: Prompt injections, edge cases, malformed inputs, off-topic requests. Your LLM application needs to handle these gracefully.

Test your failure modes explicitly: If your system prompt says "never discuss competitors," write a test case that tries to elicit competitor discussion and assert it fails.

Use multiple assertions per test: A single icontains assertion is weak. Layer a structural check, a content check, and a length check for important cases.

Keep test cases independent: Each test case should be self-contained. Don't rely on state from previous tests.

Example: Testing a Code Review Assistant

providers:
  - anthropic:claude-3-5-sonnet-20241022

prompts:
  - file://prompts/code-reviewer.txt

tests:
  - description: "SQL injection vulnerability should be flagged"
    vars:
      code: |
        def get_user(username):
            query = f"SELECT * FROM users WHERE name = '{username}'"
            return db.execute(query)
    assert:
      - type: icontains
        value: "sql injection"
      - type: icontains
        value: "parameterized"
      - type: llm-rubric
        value: "The response identifies the SQL injection vulnerability and suggests a fix using parameterized queries or an ORM."

  - description: "Clean code should not raise false positives"
    vars:
      code: |
        def get_user(user_id: int):
            return db.execute("SELECT * FROM users WHERE id = ?", (user_id,))
    assert:
      - type: not-icontains
        value: "sql injection"
      - type: llm-rubric
        value: "The response either approves the code or provides constructive feedback that is not about SQL injection, which is not present."

Alternative Tools

PromptFoo is the best general-purpose option, but there are alternatives worth knowing depending on your context.

Braintrust

Braintrust is a hosted eval and observability platform. The developer experience is polished — you instrument your code with the Braintrust SDK, and traces, evals, and experiments appear in a web dashboard.

The advantage over PromptFoo: tighter integration with your application code, better dataset management, and a more collaborative interface for teams. The disadvantage: it's a paid SaaS product after the free tier, and your data goes to Braintrust's servers.

import braintrust

experiment = braintrust.init(project="my-project")

with experiment.start_span("classify"):
    result = llm.classify(input_text)
    experiment.log(
        input=input_text,
        output=result,
        expected="positive",
        scores={"correct": result == "positive"}
    )

Ragas (for RAG pipelines)

Ragas is specifically designed for evaluating RAG pipelines. It measures:

Faithfulness: Does the answer make claims supported by the retrieved context?
Answer relevancy: Does the answer address the actual question?
Context precision: Is the retrieved context relevant to the question?
Context recall: Did retrieval find all the information needed to answer?

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset=your_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())

Ragas is the best tool for diagnosing RAG quality problems specifically. If your RAG pipeline is giving bad answers, Ragas helps you isolate whether the problem is retrieval or generation.

HELM (Holistic Evaluation of Language Models)

HELM is Stanford's benchmark framework. It's comprehensive — covering accuracy, calibration, robustness, fairness, and efficiency across many scenarios. HELM is primarily a research tool for comparing models at a macro level, not for testing your application's specific prompts.

Use HELM if you're doing model selection research or academic work. Don't use it as your production eval framework — it's not designed for that.

OpenAI Evals

OpenAI's own eval framework is open source. It's tightly coupled to OpenAI's models and infrastructure, and the DX isn't as polished as PromptFoo for application-level testing. But it's worth knowing because many papers and review-2026" title="Pictory Review 2026: Can It Really Turn Blog Posts Into Videos?" class="internal-link">blog posts reference it, and the dataset format is reusable.

CI/CD Integration

Running evals in CI prevents regressions from reaching production. Here's how to integrate PromptFoo into GitHub Actions.

GitHub Actions Workflow

# .github/workflows/llm-evals.yml
name: LLM Evals

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install PromptFoo
        run: npm install -g promptfoo

      - name: Run evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          promptfoo eval --ci --output results.json

      - name: Check results
        run: |
          # Fail if pass rate drops below 90%
          node -e "
            const r = require('./results.json');
            const passRate = r.results.stats.successes / r.results.stats.total;
            if (passRate < 0.9) {
              console.error('Pass rate ' + (passRate * 100).toFixed(1) + '% below threshold');
              process.exit(1);
            }
          "

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json

What to Gate On

Not every CI failure should block a merge. Consider:

Hard block: Pass rate drops below a threshold (e.g., 90%)
Hard block: Any test case in the "safety" category fails
Warning only: Performance on non-critical test cases degrades
Informational: Model cost comparison changes significantly

Practical Workflow

Red-Teaming

Before shipping a prompt, try to break it. Write test cases that attempt:

Prompt injection ("Ignore all previous instructions and...")
Jailbreaking attempts
Off-topic requests
Ambiguous inputs where the right behavior is unclear
Inputs in languages you didn't design for

Add these to your eval suite as explicit cases with assertions. When they start passing reliably, you have evidence your system handles adversarial inputs.

Regression Testing

Every time a prompt change ships to production, add test cases for the behaviors it was supposed to improve. When you later change the prompt again, those cases run and catch regressions. Over time, your eval suite becomes a specification of what your application should do.

A/B Prompt Comparison

PromptFoo is built for this. Define two prompt variants and run the same test suite against both:

prompts:
  - id: "v1"
    raw: "Classify this review: {{review}}"
  - id: "v2"
    raw: |
      You are a sentiment analysis expert.
      Classify the following review as positive, negative, or neutral.
      Review: {{review}}
      Classification:

The side-by-side view shows which version passes more tests and where each fails. This is much more reliable than intuition about which prompt "sounds better."

LLM Model Selection

Claude Pro and ChatGPT Plus perform differently on different tasks. Cursor Pro for coding workflows has its own characteristics. Before committing to a model for a specific application feature, run your eval suite against all candidates. The results often surprise — the "best" model for general use isn't always the best for your specific task.

This pairs naturally with our best AI coding assistants comparison and our Claude Code review for context on how models compare in practice.

Building an Evaluation Dataset

Start small and grow deliberately:

Seed with synthetic cases: Generate 20–30 test cases covering your core use cases. These won't be perfect but get you moving.
Sample from production: Once you have production traffic, sample real inputs (anonymizing as needed) and manually label outputs as pass/fail.
Capture regressions: Every bug report that involves an LLM output should become a test case.
Adversarial expansion: Periodically run red-team sessions and add cases from the findings.

A mature eval suite for a production LLM feature will have 200–500 test cases built up over several months. It's an investment that pays dividends every time you change a prompt or upgrade a model.

PromptFoo (free/open source) — Best LLM eval framework for prompt testing and model comparison (free, open source)
Ragas (free/open source) — Best RAG-specific evaluation metrics library (free, open source)
Claude Pro — Best AI assistant for writing test cases and debugging prompt regressions ($20/mo)
GitHub Actions (free tier) — Best CI/CD platform for automating eval runs on pull requests (free tier available)

FAQ

Is PromptFoo free?

Yes, PromptFoo is open source (MIT license). The CLI and local evaluation are completely free. There's a hosted version with additional features, but everything needed for local and CI/CD eval is free.

How many test cases do I need?

For a prompt in active development, 20–30 cases covering core scenarios and obvious edge cases is a reasonable starting point. A production prompt handling sensitive or high-stakes tasks should have 100+, including adversarial cases. The right number depends on the risk profile of the feature.

What's the difference between evals and unit tests?

Unit tests assert exact outputs from deterministic functions. Evals use probabilistic assertions against non-deterministic systems. You run evals against a sample of inputs with pass/fail criteria that accept a range of valid outputs. You also run evals repeatedly and track trends — a feature that's passing 95% of the time is different from one that passes 70%.

Can I use PromptFoo with local models?

Yes. PromptFoo supports Ollama as a provider:

providers:
  - ollama:llama3
  - ollama:mistral

This is useful for comparing local model quality against hosted models before committing to API costs, or for teams that need to evaluate on-premises.

How do I handle non-deterministic outputs in assertions?

Use threshold-based or semantic assertions rather than exact-match. type: similar with a threshold accepts outputs that are semantically close to your reference. type: llm-rubric evaluates against a rubric rather than an exact answer. type: javascript lets you write custom logic that accepts any output matching your criteria.

Should I run evals on every commit or just on prompt changes?

At minimum, run evals when prompts change. Many teams also run a fast smoke-test suite (10–20 critical cases) on every commit, and the full suite on a schedule (nightly) or before releases. The fast suite catches obvious regressions early; the full suite catches subtle ones.

What about testing RAG pipelines specifically?

Use Ragas for RAG-specific evaluation metrics (faithfulness, context precision, etc.). Use PromptFoo for end-to-end evaluation of the full pipeline against expected outputs. Both are complementary — Ragas tells you where the quality problem is (retrieval vs generation), PromptFoo tells you whether the system as a whole meets your quality bar.

How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals

How to systematically evaluate LLM outputs in production. The developer's guide to evals: metrics, tools, LLM-as-judge, and CI integration.

March 13, 2026·15 min readLLM evalsAI evaluation

Developer Guides

AI Browser Agents 2026: Page Agent, Browse AI, Bardeen, and Browserbase Compared

Browse AI, Bardeen, Browserbase, or Page Agent — which AI browser agent is right for you? We compare pricing, skill level, and real-world performance so you pick the right tool and skip the expensive trial-and-error.

March 13, 2026·14 min readbrowser automationai agents

Developer Guides

AI Memory and Context in 2026: RAG vs Fine-Tuning vs Long Context Windows Explained

RAG vs fine-tuning vs long context windows: when to use each approach for giving AI models memory and access to your data.

March 13, 2026·14 min readRAGfine-tuning

LLM Testing Tools 2026: PromptFoo, Evals, and How to Stop Guessing

Why LLM Testing Is Hard

Get the Weekly TrendHarvest Pick

What Is PromptFoo

Installation

Project Initialization

Writing Test Cases

Assertion Types

Writing Good Test Cases

Example: Testing a Code Review Assistant

Alternative Tools

Braintrust

Ragas (for RAG pipelines)

HELM (Holistic Evaluation of Language Models)

OpenAI Evals

CI/CD Integration

GitHub Actions Workflow

What to Gate On

Practical Workflow

Red-Teaming

Regression Testing

A/B Prompt Comparison

LLM Model Selection

Building an Evaluation Dataset

FAQ

Is PromptFoo free?

How many test cases do I need?

What's the difference between evals and unit tests?

Can I use PromptFoo with local models?

How do I handle non-deterministic outputs in assertions?

Should I run evals on every commit or just on prompt changes?

What about testing RAG pipelines specifically?

Further Reading

Tools Mentioned in This Article

Recommended Resources

Related Articles

How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals

AI Browser Agents 2026: Page Agent, Browse AI, Bardeen, and Browserbase Compared

AI Memory and Context in 2026: RAG vs Fine-Tuning vs Long Context Windows Explained

Why LLM Testing Is Hard

Get the Weekly TrendHarvest Pick

What Is PromptFoo

Installation

Project Initialization

Writing Test Cases

Assertion Types

Writing Good Test Cases

Example: Testing a Code Review Assistant

Alternative Tools

Braintrust

Ragas (for RAG pipelines)

HELM (Holistic Evaluation of Language Models)

OpenAI Evals

CI/CD Integration

GitHub Actions Workflow

What to Gate On

Practical Workflow

Red-Teaming

Regression Testing

A/B Prompt Comparison

LLM Model Selection

Building an Evaluation Dataset

Tools We Recommend

FAQ

Is PromptFoo free?

How many test cases do I need?

What's the difference between evals and unit tests?

Can I use PromptFoo with local models?

How do I handle non-deterministic outputs in assertions?

Should I run evals on every commit or just on prompt changes?

What about testing RAG pipelines specifically?

Further Reading

Tools Mentioned in This Article

Recommended Resources

Enjoyed this? Get more picks weekly.

Related Articles

How to Evaluate LLM Outputs in 2026: The Developer's Guide to AI Evals

AI Browser Agents 2026: Page Agent, Browse AI, Bardeen, and Browserbase Compared

AI Memory and Context in 2026: RAG vs Fine-Tuning vs Long Context Windows Explained