LLM Evaluation Pipeline
An internal QA tool for a company in a highly regulated industry — automatically scores every AI support chatbot conversation on three dimensions, routes uncertain verdicts to a human review queue, and surfaces trends in an analytics dashboard.
AI support chatbots in regulated industries can produce wrong answers, and the only way to know is to read every conversation yourself. There is no systematic way to measure quality degradation, catch a hallucination before it reaches a user, or give a QA team anything actionable. The chatbot ships; monitoring is hope.
Build a two-stage pipeline that separates upload speed from evaluation speed. Upload parses and queues immediately — 202 Accepted in under a second, no AI in the loop. Async cron handles evaluation: embed the question, retrieve the relevant knowledge base articles from the same RAG index the chatbot uses, and have Claude score the answer against actual source material — not general world knowledge. Verdicts feed a human review queue; the review queue feeds aggregate analytics. Nothing is ever dropped silently.
- CSV upload — PapaParse parses the chatbot export, MD5 content hashing deduplicates across overlapping re-exports, bulk insert into a queue table, 202 Accepted before any AI runs
- Async evaluation — cron fetches the oldest unevaluated rows, runs five concurrently via Promise.allSettled; upload latency and AI latency are fully decoupled
- RAG-grounded judge — the evaluator retrieves the same KB articles the chatbot was trained on before scoring, so judgment is grounded in actual documentation rather than world knowledge
- 3H scoring — Honest 40%, Helpful 35%, Harmless 25%; verdict formula lives in application code, not the prompt, so thresholds are auditable and adjustable without a prompt engineering session
- Harmless is an independent gate — a chatbot answer that scores well overall but poorly on harmless fails regardless; no weighted average overrides it
- Human review queue — uncertain verdicts surface for accept or reject with reviewer notes; evaluation errors surface here too rather than disappearing into the pipeline
- Analytics dashboard — Recharts; daily, weekly, and monthly volume; resolution rates by question category; peak hour; week-over-week trends
Upload to verdict in under 30 seconds per pair — the QA team gets a scored, reviewable record of every conversation instead of sampling from a queue and hoping the sample is representative.
Next.js 16 · App Router · Vercel
TypeScript strict mode — enforced types on all evaluation shapes
React 19 · Tailwind CSS v4 · shadcn/ui (card, table, tabs, badge)
Recharts 3.8 — volume trends, hourly distribution, resolution rates
Supabase — PostgreSQL, shared KB index with production chatbot
Claude Sonnet 4.5 — structured output via Vercel AI SDK v6 Output.object()
OpenAI text-embedding-3-small — 1536-dim, same model as KB index
PapaParse 5.5.3 — chatbot export format, malformed row handling
The distinction between “did the chatbot know this” and “is this generally true” is the whole point of RAG-grounded evaluation. An LLM judge without retrieved context applies world knowledge — which is the wrong standard for a chatbot that only covers one company's documentation. Retrieving the KB chunks the answer should be grounded in, and scoring against those, is what makes the evaluation meaningful rather than decorative. The judgment call that separates a useful eval tool from a confidence score nobody can act on is grounding the judge in the same source of truth as the chatbot.