LLM Evaluation Pipeline

An internal QA tool for a company in a highly regulated industry — automatically scores every AI support chatbot conversation on three dimensions, routes uncertain verdicts to a human review queue, and surfaces trends in an analytics dashboard.

AI support chatbots in regulated industries can produce wrong answers, and the only way to know is to read every conversation yourself. There is no systematic way to measure quality degradation, catch a hallucination before it reaches a user, or give a QA team anything actionable. The chatbot ships; monitoring is hope.

Build a two-stage pipeline that separates upload speed from evaluation speed. Upload parses and queues immediately — 202 Accepted in under a second, no AI in the loop. Async cron handles evaluation: embed the question, retrieve the relevant knowledge base articles from the same RAG index the chatbot uses, and have Claude score the answer against actual source material — not general world knowledge. Verdicts feed a human review queue; the review queue feeds aggregate analytics. Nothing is ever dropped silently.

CSV upload — PapaParse parses the chatbot export, MD5 content hashing deduplicates across overlapping re-exports, bulk insert into a queue table, 202 Accepted before any AI runs
Async evaluation — cron fetches the oldest unevaluated rows, runs five concurrently via Promise.allSettled; upload latency and AI latency are fully decoupled
RAG-grounded judge — the evaluator retrieves the same KB articles the chatbot was trained on before scoring, so judgment is grounded in actual documentation rather than world knowledge
3H scoring — Honest 40%, Helpful 35%, Harmless 25%; verdict formula lives in application code, not the prompt, so thresholds are auditable and adjustable without a prompt engineering session
Harmless is an independent gate — a chatbot answer that scores well overall but poorly on harmless fails regardless; no weighted average overrides it
Human review queue — uncertain verdicts surface for accept or reject with reviewer notes; evaluation errors surface here too rather than disappearing into the pipeline
Analytics dashboard — Recharts; daily, weekly, and monthly volume; resolution rates by question category; peak hour; week-over-week trends

Upload to verdict in under 30 seconds per pair — the QA team gets a scored, reviewable record of every conversation instead of sampling from a queue and hoping the sample is representative.

Framework

Next.js 16 · App Router · Vercel

Language

TypeScript strict mode — enforced types on all evaluation shapes

React 19 · Tailwind CSS v4 · shadcn/ui (card, table, tabs, badge)

Charts

Recharts 3.8 — volume trends, hourly distribution, resolution rates

Database

Supabase — PostgreSQL, shared KB index with production chatbot

AI — judge

Claude Sonnet 4.5 — structured output via Vercel AI SDK v6 Output.object()

AI — embeddings

OpenAI text-embedding-3-small — 1536-dim, same model as KB index

CSV parsing

PapaParse 5.5.3 — chatbot export format, malformed row handling

The distinction between “did the chatbot know this” and “is this generally true” is the whole point of RAG-grounded evaluation. An LLM judge without retrieved context applies world knowledge — which is the wrong standard for a chatbot that only covers one company's documentation. Retrieving the KB chunks the answer should be grounded in, and scoring against those, is what makes the evaluation meaningful rather than decorative. The judgment call that separates a useful eval tool from a confidence score nobody can act on is grounding the judge in the same source of truth as the chatbot.

§ Back / Filed

speakeasyprep.ai →

A working AI interview prep tool shipped to real users on a real stack — after the no-code prototype proved there was a there there.

← All entries