The prompt engineering session went well. The team tried thirty variations, found one that seemed to work reliably for the five test cases they'd been using, and shipped it to production. Two weeks later, users started reporting that the system was giving wrong answers to a specific class of query the team hadn't included in their test set. The prompt that 'worked' in testing had a subtle failure mode that only showed up at scale.
This is what vibes-based AI evaluation looks like at production scale: you test on cases you thought of, ship on cases you didn't, and discover the failures through user complaints. The alternative — systematic AI evaluation with defined metrics, regression suites, and automated quality gates — requires significantly more engineering investment upfront, but it's the only approach that produces AI systems with predictable, measurable behaviour.
Why Standard Software Testing Doesn't Translate
Unit tests for deterministic software verify that specific inputs produce specific outputs. An AI system's outputs are not deterministic: the same input can produce different responses across invocations, across model versions, and across temperature settings. You can't usefully write a unit test that asserts 'this AI response exactly equals this expected string': that test would be brittle and maintenance-intensive, and it would reject responses that are semantically equivalent to the expected output but worded differently.
AI evaluation requires a different approach: tests that assess properties of the output rather than exact equality. Is the output factually accurate? Does it contain required elements? Is it appropriately formatted? Does it avoid prohibited content? Does it address the question asked? These are assertions about output quality that can be automated, even though they can't be expressed as simple equality checks.
The Evaluation Hierarchy: From Deterministic to Probabilistic
Deterministic Assertions (Most Reliable)
Some properties of AI output can be tested deterministically: does the output contain a required JSON structure? Does it include a required disclaimer when one is mandated? Does it avoid a list of prohibited words or phrases? Does it stay within a specified length limit? Is the cited source URL valid and accessible? These assertions are fast, cheap, and should be the first layer of your evaluation suite.
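As a rough illustration of this first layer, the sketch below runs a handful of deterministic checks on one raw model output. The JSON shape (an "answer" field), the disclaimer text, the prohibited phrases, and the length limit are all hypothetical values chosen for the example, not requirements from any particular system.

```python
import json

# All three constants are hypothetical, picked only for illustration.
REQUIRED_DISCLAIMER = "This is not financial advice."
PROHIBITED_PHRASES = ["guaranteed returns", "risk-free"]
MAX_CHARS = 600


def check_output(raw: str) -> list[str]:
    """Return the names of the checks that failed; an empty list means all passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    # Assumed structure: a JSON object with a string "answer" field.
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["has_answer_field"]

    answer = payload["answer"]
    failures = []
    if REQUIRED_DISCLAIMER not in answer:
        failures.append("has_disclaimer")
    lowered = answer.lower()
    if any(phrase in lowered for phrase in PROHIBITED_PHRASES):
        failures.append("no_prohibited_phrases")
    if len(answer) > MAX_CHARS:
        failures.append("within_length_limit")
    return failures
```

Returning the names of the failed checks, rather than a single boolean, makes it easier to see which gate blocked a deployment.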
Rule-Based Heuristics (Reliable for Specific Properties)
Some quality properties can be evaluated with rule-based heuristics: does the response include the key entities from the retrieved context (a proxy for faithfulness)? Does it answer in the language the query was asked in? Does it use the expected terminology for a given domain? These heuristics don't guarantee quality but catch a predictable class of failures cheaply.
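One way to approximate the entity-coverage heuristic is sketched below. It assumes a deliberately crude notion of "key entity" (capitalised words and numbers found with a regular expression) in place of a real named-entity recogniser, and the function names are hypothetical.

```python
import re


def key_entities(text: str) -> set[str]:
    """Crude entity proxy: capitalised words and numbers. Not a real NER model."""
    return set(re.findall(r"\b(?:[A-Z][a-z]+|\d+(?:\.\d+)?)\b", text))


def entity_coverage(context: str, response: str) -> float:
    """Fraction of the context's key entities that also appear in the response."""
    entities = key_entities(context)
    if not entities:
        return 1.0  # nothing to cover, so treat the check as passed
    found = {entity for entity in entities if entity in response}
    return len(found) / len(entities)


context = "Acme shipped 42 units to Berlin in March."
print(entity_coverage(context, "Acme sent 42 units to Berlin."))  # 0.75
print(entity_coverage(context, "The company sent some units."))   # 0.0
```

A low score does not prove the response is unfaithful, and a high score does not prove it is correct; the number is only a cheap signal for flagging responses that deserve a closer look.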
LLM-as-Judge (Flexible but Expensive)
Using a separate LLM to evaluate the output of your production LLM is the most flexible evaluation approach: you describe the quality criteria in natural language and let the evaluator model assess whether the output meets them. LLM-as-judge evaluation can assess properties that are hard to specify as rules — whether a response is helpful, whether it addresses all aspects of the question, whether it's appropriately cautious about uncertain information. The cost: LLM evaluation is slow and expensive compared to deterministic checks, and the evaluator model can itself be inconsistent.
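A minimal sketch of the LLM-as-judge pattern follows. It assumes the evaluator model is reachable through some function that takes a prompt string and returns text; that function (call_judge), the prompt template, and the 1-5 scale are hypothetical and stand in for whichever model client and rubric a team actually uses.

```python
import json
from typing import Callable

# Hypothetical rubric prompt; doubled braces are literal braces for str.format.
JUDGE_TEMPLATE = """You are grading an AI answer.
Criteria: {criteria}
Question: {question}
Answer: {answer}
Reply with JSON only: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""


def judge_answer(
    call_judge: Callable[[str], str],  # assumed wrapper around the evaluator model
    question: str,
    answer: str,
    criteria: str,
) -> dict:
    """Ask the evaluator model for a score and validate its reply."""
    prompt = JUDGE_TEMPLATE.format(criteria=criteria, question=question, answer=answer)
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # The judge can misbehave too, so treat malformed output as "no verdict".
        return {"score": None, "reason": "unparseable judge output"}
    if not 1 <= score <= 5:
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": str(verdict.get("reason", ""))}
```

Passing the model call in as a parameter keeps the sketch independent of any vendor SDK and lets the parsing logic be tested with a stub. Because the evaluator can itself be inconsistent, the code records a missing verdict instead of guessing a score.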
Human Evaluation (Ground Truth but Slow)
Human evaluation — having domain experts rate AI outputs against quality criteria — provides the highest-quality signal but is slow and expensive to scale. It's most valuable for building the golden dataset that other evaluation methods are calibrated against, and for periodic quality audits of production output to verify that automated evaluation metrics correlate with actual user satisfaction.
A practical evaluation stack layers these methods: deterministic assertions as automated pre-deployment gates (must pass before any deployment), LLM-as-judge for weekly quality snapshots against a representative query sample, and human evaluation quarterly or whenever automated metrics diverge unexpectedly from user satisfaction signals.
The Regression Suite: Catching Quality Degradation
A regression suite for AI systems is analogous to a regression test suite for traditional software: a set of representative cases that should be handled correctly, run against every code or model change to detect regressions before they reach production. The key difference is that the oracle for AI regression tests is not 'exact match to expected output' but 'passes evaluation criteria that previously passed.'
Building a good regression suite requires: a diverse set of cases covering the distribution of real user queries (not just the easy ones); coverage of edge cases and failure modes you've encountered in production; evaluation criteria for each case that specify what 'correct' means; and a history of pass/fail results that lets you see when a change caused a regression.
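The ingredients listed above can be sketched as a small data structure plus a comparison against the previous run. The Case fields, the example queries, and the stubbed systems are hypothetical; a real suite would call the actual AI system and persist results between runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    case_id: str
    query: str
    passes: Callable[[str], bool]  # what 'correct' means for this case


def run_suite(cases: list[Case], system: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through the system and record pass/fail per case id."""
    return {case.case_id: case.passes(system(case.query)) for case in cases}


def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the previous run and fail in the current one."""
    return sorted(cid for cid, ok in current.items() if previous.get(cid) and not ok)


cases = [
    Case("refund-window", "How long do refunds take?", lambda out: "5 business days" in out),
    Case("reply-in-spanish", "Hola", lambda out: out.startswith("Hola")),
]

# Stub systems standing in for two versions of the real pipeline.
old_answers = {"How long do refunds take?": "Refunds take 5 business days.", "Hola": "Hola, ¿en qué puedo ayudar?"}
new_answers = {"How long do refunds take?": "Refunds take 5 business days.", "Hola": "Hello, how can I help?"}

baseline = run_suite(cases, old_answers.get)
candidate = run_suite(cases, new_answers.get)
print(find_regressions(baseline, candidate))  # ['reply-in-spanish']
```

Keeping each run's results as a plain mapping from case id to pass/fail makes it straightforward to store a history and to attribute a regression to the change that introduced it.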
A/B Evaluation: Comparing Changes
When you change a prompt, model version, retrieval strategy, or any other component of an AI system, you need to know whether the change improved or degraded quality overall. A/B evaluation — running the same query set against two system versions and comparing scores — provides this signal. The evaluation metrics determine what 'better' means, which is why they need to be defined carefully and validated against human judgement before they're used to make deployment decisions.
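As a hedged sketch of the comparison step, the function below takes per-query scores for two system versions and summarises them on the queries both versions were scored on. The score scale and the query ids are hypothetical; the scores could come from any of the evaluation layers described earlier.

```python
from statistics import mean


def compare_versions(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict:
    """Paired comparison of version B against version A on their shared queries."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("the two versions have no queries in common")
    deltas = [scores_b[q] - scores_a[q] for q in shared]
    return {
        "n": len(shared),
        "mean_a": mean(scores_a[q] for q in shared),
        "mean_b": mean(scores_b[q] for q in shared),
        "wins_b": sum(d > 0 for d in deltas),    # queries where B scored higher
        "losses_b": sum(d < 0 for d in deltas),  # queries where B scored lower
        "ties": sum(d == 0 for d in deltas),
    }


version_a = {"q1": 3, "q2": 4, "q3": 5}
version_b = {"q1": 4, "q2": 4, "q3": 3}
summary = compare_versions(version_a, version_b)
print(summary["wins_b"], summary["losses_b"], summary["ties"])  # 1 1 1
```

Reporting wins and losses alongside the means is a deliberate choice: an average can stay flat or improve while individual queries get worse, and those per-query losses are often the ones users notice.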
GYSP's AI & ML Development practice builds evaluation infrastructure as a first-class component of every AI system we develop. The projects where quality is most predictable and manageable in production are universally the ones that had systematic evaluation from the beginning — not those that shipped fast and added evaluation when users started complaining.
“You cannot improve what you cannot measure. AI teams that evaluate by vibes are not doing QA — they're doing post-incident review and calling it quality assurance.”
— Rahul, AI/ML Delivery Head — GYSP.tech
