
The End of Vibes: How to Unit Test Your AI

Rahul

AI/ML Delivery Head, GYSP.tech

1 November 2025 · 10 min read

What you'll take away

Why Standard Software Testing Doesn't Translate
The Evaluation Hierarchy: From Deterministic to Probabilistic
The Regression Suite: Catching Quality Degradation
Validated Outcomes
A/B Evaluation: Comparing Changes

The prompt engineering session went well. The team tried thirty variations, found one that seemed to work reliably for the five test cases they'd been using, and shipped it to production. Two weeks later, users started reporting that the system was giving wrong answers to a specific class of query the team hadn't included in their test set. The prompt that 'worked' in testing had a subtle failure mode that only showed up at scale.

This is what vibes-based AI evaluation looks like at production scale: you test on cases you thought of, ship on cases you didn't, and discover the failures through user complaints. The alternative — systematic AI evaluation with defined metrics, regression suites, and automated quality gates — requires significantly more engineering investment upfront, but it's the only approach that produces AI systems with predictable, measurable behaviour.

Why Standard Software Testing Doesn't Translate

Unit tests for deterministic software verify that specific inputs produce specific outputs. An AI system's outputs are not deterministic: the same input can produce different responses across invocations, across model versions, and across temperature settings. You can't usefully write a unit test that asserts 'this AI response exactly equals this expected string': that test would be brittle and maintenance-intensive, and it would reject responses that are semantically equivalent to the expected output but worded differently.

AI evaluation requires a different approach: tests that assess properties of the output rather than exact equality. Is the output factually accurate? Does it contain required elements? Is it appropriately formatted? Does it avoid prohibited content? Does it address the question asked? These are assertions about output quality that can be automated, even though they can't be expressed as simple equality checks.

The Evaluation Hierarchy: From Deterministic to Probabilistic

Deterministic Assertions (Most Reliable)

Some properties of AI output can be tested deterministically: does the output contain a required JSON structure? Does it include a required disclaimer when one is mandated? Does it avoid a list of prohibited words or phrases? Does it stay within a specified length limit? Is any cited source URL well-formed and, where a network check is acceptable, reachable? These assertions are fast, cheap, and should be the first layer of your evaluation suite.
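As a minimal sketch of this first layer, the checks above can be written as plain Python. The required keys, disclaimer text, prohibited patterns, and length limit below are hypothetical values chosen for illustration, not a recommended policy.

```python
import json
import re

# Hypothetical policy values, chosen for illustration only.
REQUIRED_KEYS = {"answer", "sources"}
REQUIRED_DISCLAIMER = "This is not financial advice."
PROHIBITED_PATTERNS = [r"\bguaranteed returns\b", r"\brisk-free\b"]
MAX_CHARS = 1200


def check_output(raw: str) -> list[str]:
    """Return the names of failed checks; an empty list means the output passes."""
    failures = []

    # Length limit applies to the raw model output.
    if len(raw) > MAX_CHARS:
        failures.append("too_long")

    # Structural check: the output must be a JSON object with the required keys.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["invalid_json"]
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload.keys()):
        return failures + ["missing_keys"]

    # Content checks on the answer text.
    answer = str(payload["answer"])
    if REQUIRED_DISCLAIMER not in answer:
        failures.append("missing_disclaimer")
    for pattern in PROHIBITED_PATTERNS:
        if re.search(pattern, answer, flags=re.IGNORECASE):
            failures.append(f"prohibited:{pattern}")
    return failures
```

Returning the list of failed checks, rather than a single boolean, makes a failed deployment gate easier to diagnose.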

Rule-Based Heuristics (Reliable for Specific Properties)

Some quality properties can be evaluated with rule-based heuristics: does the response include the key entities from the retrieved context (a proxy for faithfulness)? Does it answer in the language the query was asked in? Does it use the expected terminology for a given domain? These heuristics don't guarantee quality but catch a predictable class of failures cheaply.
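Two of these heuristics can be sketched in a few lines. The sketch assumes the key entities have already been extracted from the retrieved context by some upstream step, and that the domain glossary is a simple mapping from discouraged terms to preferred ones; both are illustrative assumptions, and substring matching is deliberately crude.

```python
def entity_coverage(response: str, key_entities: list[str]) -> float:
    """Fraction of key entities from the retrieved context that the response mentions."""
    if not key_entities:
        return 1.0  # nothing to cover
    text = response.casefold()
    hits = sum(1 for entity in key_entities if entity.casefold() in text)
    return hits / len(key_entities)


def discouraged_terms(response: str, glossary: dict[str, str]) -> list[str]:
    """Return discouraged terms found in the response, given {discouraged: preferred}."""
    text = response.casefold()
    return [term for term in glossary if term.casefold() in text]
```

A low coverage score does not prove the answer is unfaithful. It is a cheap signal that a case deserves a closer look from a more expensive evaluator.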

LLM-as-Judge (Flexible but Expensive)

Using a separate LLM to evaluate the output of your production LLM is the most flexible evaluation approach: you describe the quality criteria in natural language and let the evaluator model assess whether the output meets them. LLM-as-judge evaluation can assess properties that are hard to specify as rules — whether a response is helpful, whether it addresses all aspects of the question, whether it's appropriately cautious about uncertain information. The cost: LLM evaluation is slow and expensive compared to deterministic checks, and the evaluator model can itself be inconsistent.
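The sketch below shows one way to structure a judge call. It does not use any real provider SDK: `call_model` is a hypothetical callable you would supply that sends a prompt to your evaluator model and returns its text reply. The prompt wording and the majority vote over repeated calls (a simple guard against evaluator inconsistency) are assumptions of this example.

```python
import json
from collections import Counter
from typing import Callable

JUDGE_PROMPT = """You are grading an AI answer.
Criterion: {criterion}
Question: {question}
Answer: {answer}
Reply with JSON only: {{"verdict": "pass" or "fail", "reason": "<one sentence>"}}"""


def judge(
    question: str,
    answer: str,
    criterion: str,
    call_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    votes: int = 3,
) -> str:
    """Ask the evaluator model several times and return 'pass' only on a strict majority."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer)
    tally = Counter()
    for _ in range(votes):
        try:
            verdict = json.loads(call_model(prompt)).get("verdict")
        except (json.JSONDecodeError, AttributeError):
            verdict = None  # unparseable reply
        tally[verdict if verdict in ("pass", "fail") else "invalid"] += 1
    # Invalid replies count against the answer, which keeps the gate conservative.
    return "pass" if tally["pass"] * 2 > votes else "fail"
```

Because the model call is injected, the same function can be exercised in tests with a stub and pointed at a real evaluator in the scheduled run.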

Human Evaluation (Ground Truth but Slow)

Human evaluation — having domain experts rate AI outputs against quality criteria — provides the highest-quality signal but is slow and expensive to scale. It's most valuable for building the golden dataset that other evaluation methods are calibrated against, and for periodic quality audits of production output to verify that automated evaluation metrics correlate with actual user satisfaction.
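One way to check that calibration, sketched here under the assumption that both humans and the automated evaluator produce a pass/fail label for the same outputs, is to compute raw agreement and Cohen's kappa (agreement corrected for chance). The choice of kappa is this example's, not something the approach requires.

```python
def agreement_stats(human: list[bool], automated: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired pass/fail labels."""
    if len(human) != len(automated) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == a for h, a in zip(human, automated)) / n

    # Chance agreement, from each rater's marginal pass rate.
    p_human = sum(human) / n
    p_auto = sum(automated) / n
    expected = p_human * p_auto + (1 - p_human) * (1 - p_auto)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports 1.0 by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

A high raw agreement with a low kappa usually means one label dominates the sample, so the automated metric has not really been tested on the hard cases.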

A practical evaluation stack combines these layers: deterministic assertions as automated pre-deployment gates that must pass before any deployment; LLM-as-judge scoring for weekly quality snapshots against a representative query sample; and human evaluation quarterly, or whenever automated metrics diverge unexpectedly from user satisfaction signals.


The Regression Suite: Catching Quality Degradation

A regression suite for AI systems is analogous to a regression test suite for traditional software: a set of representative cases that should be handled correctly, run against every code or model change to detect regressions before they reach production. The key difference is that the oracle for AI regression tests is not 'exact match to expected output' but 'passes evaluation criteria that previously passed.'

Building a good regression suite requires: a diverse set of cases covering the distribution of real user queries (not just the easy ones); coverage of edge cases and failure modes you've encountered in production; evaluation criteria for each case that specify what 'correct' means; and a history of pass/fail results that lets you see when a change caused a regression.
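A minimal sketch of that structure follows: each case carries its own evaluation criterion, a run produces a pass/fail record per case, and a regression is any case that passed in the baseline run but fails now. The `Case` type and function names are hypothetical, and `system` stands in for whatever callable wraps your AI system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    case_id: str
    query: str
    passes: Callable[[str], bool]  # what 'correct' means for this case


def run_suite(cases: list[Case], system: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through the system and record pass/fail by case id."""
    return {case.case_id: case.passes(system(case.query)) for case in cases}


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case ids that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )
```

Storing each run's result dictionary with the commit or model version that produced it gives you the pass/fail history needed to trace a regression to a specific change.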

Validated Outcomes

Anthropic's internal Claude evaluation programme, partially documented in its published model cards, uses a combination of automated benchmark suites and human preference evaluation to track model capability and safety regressions across every training run. Anthropic has documented that, without systematic regression evaluation, capability improvements in one domain frequently degraded performance in adjacent domains. These regressions, which they call 'capability regression', would have been shipped to users had systematic evaluation not caught them. The structure of the evaluation framework, rather than the scale of compute, is cited as the primary mechanism for maintaining consistent quality across model updates.

GYSP's AI evaluation infrastructure includes a minimum viable evaluation stack that every client deploys at build time: a 100–200 query golden dataset with LLM-as-judge scoring, deterministic assertions for format and safety constraints, and a weekly automated regression run that alerts when quality metrics drop more than 5% week-over-week. This minimum stack costs less than one week of engineering time to build and has detected quality regressions in every client system that was previously operating without it — regressions that had been invisible to the team and silently degrading user experience.
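The weekly alert can be sketched as a comparison of two metric snapshots. This example assumes the 5% threshold is a relative drop from the previous week's value; an absolute drop in percentage points would be an equally valid reading, and the metric names are hypothetical.

```python
def weekly_alerts(
    previous: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,  # assumed to mean a 5% relative drop
) -> dict[str, float]:
    """Return {metric: relative drop} for metrics that fell by more than max_drop."""
    alerts = {}
    for name, before in previous.items():
        after = current.get(name)
        if after is None or before <= 0:
            continue  # no comparable value this week
        drop = (before - after) / before
        if drop > max_drop:
            alerts[name] = drop
    return alerts
```

With a golden dataset of 100 to 200 queries, a few flipped cases can move a pass rate by several points, so it may be worth confirming an alert with a second run before acting on it.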

A/B Evaluation: Comparing Changes

When you change a prompt, model version, retrieval strategy, or any other component of an AI system, you need to know whether the change improved or degraded quality overall. A/B evaluation — running the same query set against two system versions and comparing scores — provides this signal. The evaluation metrics determine what 'better' means, which is why they need to be defined carefully and validated against human judgement before they're used to make deployment decisions.
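As a sketch of the comparison step, assume each version has already been scored on the same queries, in the same order, with higher scores meaning better. The function below counts per-query wins and losses for version B and adds a two-sided sign test as a rough check that the difference is not noise. The sign test is this example's choice; other paired tests would also work.

```python
from math import comb


def compare_ab(scores_a: list[float], scores_b: list[float]) -> dict[str, float]:
    """Paired comparison of version B against version A on the same query set."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    wins = sum(b > a for a, b in pairs)
    losses = sum(b < a for a, b in pairs)
    ties = len(pairs) - wins - losses

    # Two-sided sign test: ties are dropped, and under the null hypothesis
    # each remaining query is equally likely to be a win or a loss.
    n = wins + losses
    k = min(wins, losses)
    p_value = 1.0 if n == 0 else min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n)

    mean_diff = sum(b - a for a, b in pairs) / len(pairs)
    return {"wins": wins, "losses": losses, "ties": ties,
            "mean_diff": mean_diff, "p_value": p_value}
```

Looking at wins and losses per query, not only the mean, also surfaces the case where a change helps most queries slightly but badly breaks a few.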

GYSP's AI & ML Development practice builds evaluation infrastructure as a first-class component of every AI system we develop. The projects where quality is most predictable and manageable in production are universally the ones that had systematic evaluation from the beginning — not those that shipped fast and added evaluation when users started complaining.

“You cannot improve what you cannot measure. AI teams that evaluate by vibes are not doing QA — they're doing post-incident review and calling it quality assurance.”
— Rahul, AI/ML Delivery Head — GYSP.tech

