LLM Eval & Observability Harness

Treat prompts like code — test sets, traces, metrics, and a regression gate in CI.

A harness that wraps any LLM feature with a versioned test set, captures a full trace of every call, computes quality and cost/latency metrics, and fails CI when a change regresses them. It turns 'the output feels worse' into a diff you can see and block. This is the project that signals you can operate LLMs, not just call them — exactly what LLMOps and platform teams hire for.

Advanced · Python · pytest · OpenTelemetry · DuckDB · OpenAI
01 The problem

LLM features have no compiler and no natural test suite: a prompt tweak that helps one case silently breaks ten others, and 'it seems better' is the only feedback most teams have. Without traces you cannot see why a call was slow or wrong, and without a baseline you cannot tell improvement from regression. The problem is bringing software-engineering discipline — test sets, observability, and regression gates — to a non-deterministic system.

02 Architecture
  1. Test-set store

    Curated input and expected-behavior cases live in the repo as versioned data, grouped by feature and edge case, so the suite grows with every bug you find.

  2. Runner

    A pytest-driven runner executes each case against the current prompt and model, capturing output, latency, and token usage per call.

  3. Tracing

    Every model and tool call is instrumented with OpenTelemetry spans — prompt, parameters, response, timing — so a single request's full path is inspectable.

  4. Scorers

    Pluggable scorers grade outputs: exact-match and rule checks for deterministic cases, and model-graded rubrics (LLM-as-judge) for open-ended ones.

  5. Metrics store

    Results land in DuckDB so you can query pass rate, p95 latency, and cost per run, and compare any two runs.

  6. Baseline & regression gate

    Each run is diffed against a saved baseline, and a CI job fails if pass rate, faithfulness, latency, or cost crosses a configured threshold.

  7. Report

    A run produces a human-readable summary — new failures, score deltas, and cost/latency movement — attached to the pull request.
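The first steps of this pipeline (test-set store, runner, deterministic scorer) can be sketched in a few lines. This is a minimal illustration, not the harness itself: `Case`, `Result`, `run_case`, and `fake_model` are hypothetical names, the cases are inlined rather than loaded from versioned files, and the stub model stands in for a real provider call so the example runs offline.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One versioned test case: an input plus the expected behavior."""
    case_id: str
    feature: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class Result:
    """What the runner records for each call."""
    case_id: str
    output: str
    passed: bool
    latency_ms: float
    total_tokens: int


def exact_match(output: str, expected: str) -> bool:
    """Deterministic scorer: case-insensitive, whitespace-trimmed equality."""
    return output.strip().lower() == expected.strip().lower()


def fake_model(prompt: str) -> tuple[str, int]:
    """Stand-in for a provider call; returns (output, total_tokens).

    Token usage is approximated by word count purely for illustration.
    """
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    output = answers.get(prompt, "")
    return output, len(prompt.split()) + len(output.split())


def run_case(
    case: Case,
    model: Callable[[str], tuple[str, int]],
    scorer: Callable[[str, str], bool] = exact_match,
) -> Result:
    """Execute one case, capturing output, latency, and token usage."""
    start = time.perf_counter()
    output, total_tokens = model(case.prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return Result(case.case_id, output, scorer(output, case.expected),
                  latency_ms, total_tokens)


# In a real repo these would live in versioned data files, grouped by feature.
CASES = [
    Case("math-001", "arithmetic", "What is 2 + 2?", "4"),
    Case("geo-001", "geography", "Capital of France?", "Paris"),
]

results = [run_case(case, fake_model) for case in CASES]
pass_rate = sum(r.passed for r in results) / len(results)
```

Because the stub deliberately answers the geography case wrong, the run ends with one pass and one failure, which is the kind of row-level record the later stages aggregate and diff.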

03 Key trade-offs

Model-graded evals (LLM-as-judge) for open-ended outputs

Rubric grading scales to subjective quality where exact-match cannot, but the judge is itself non-deterministic — so you calibrate it against human labels and keep deterministic checks wherever possible.
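One way to make "calibrate it against human labels" concrete is to compute how often the judge agrees with a human-labeled sample. The sketch below assumes pass/fail labels and uses made-up data. It also computes Cohen's kappa, which corrects raw agreement for chance; kappa is an addition here, not something the harness description specifies.

```python
def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of cases where the judge label equals the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only; a real sample would come from human review.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

rate = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Tracking both numbers per judge prompt version gives you a signal for when the judge itself has drifted, separate from the feature under test.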

DuckDB over a hosted metrics platform

An embedded analytical database keeps the whole harness local, fast, and free to run in CI, at the cost of the dashboards a SaaS tool gives you — a deliberate own-the-core, add-vendors-later choice.
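The queries an embedded store has to answer are small. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs with no extra dependencies; the harness described here uses DuckDB, where similar aggregate SQL should apply and a quantile aggregate could replace the Python `p95` helper. The table layout, column names, and data are assumptions for illustration.

```python
import math
import sqlite3


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results ("
    "run_id TEXT, case_id TEXT, passed INTEGER, latency_ms REAL, cost_usd REAL)"
)
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        ("baseline", "c1", 1, 100.0, 0.002),
        ("baseline", "c2", 1, 200.0, 0.002),
        ("baseline", "c3", 1, 300.0, 0.002),
        ("baseline", "c4", 0, 400.0, 0.002),
        ("candidate", "c1", 1, 150.0, 0.003),
        ("candidate", "c2", 0, 250.0, 0.003),
        ("candidate", "c3", 1, 350.0, 0.003),
        ("candidate", "c4", 0, 900.0, 0.003),
    ],
)


def run_summary(con: sqlite3.Connection, run_id: str) -> dict:
    """Pass rate, p95 latency, and total cost for one run."""
    pass_rate, cost = con.execute(
        "SELECT AVG(passed), SUM(cost_usd) FROM results WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    latencies = [row[0] for row in con.execute(
        "SELECT latency_ms FROM results WHERE run_id = ?", (run_id,))]
    return {"pass_rate": pass_rate, "p95_latency_ms": p95(latencies),
            "cost_usd": cost}


def new_failures(con: sqlite3.Connection, base: str, cand: str) -> list[str]:
    """Cases that passed in the base run but fail in the candidate run."""
    rows = con.execute(
        "SELECT b.case_id FROM results b "
        "JOIN results c ON b.case_id = c.case_id "
        "WHERE b.run_id = ? AND c.run_id = ? AND b.passed = 1 AND c.passed = 0 "
        "ORDER BY b.case_id",
        (base, cand),
    ).fetchall()
    return [row[0] for row in rows]


baseline = run_summary(con, "baseline")
candidate = run_summary(con, "candidate")
regressed_cases = new_failures(con, "baseline", "candidate")
```

Keeping one row per case per run is what makes "compare any two runs" a plain self-join rather than a feature you wait for a vendor to ship.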

OpenTelemetry instead of ad-hoc logging

Standard spans mean the same traces flow to any OTel-compatible backend later; the upfront wiring costs more than print statements but avoids a rewrite when you add a real backend.

Blocking the build on regressions

A hard CI gate is noisy at first and needs threshold tuning, but it is the most reliable way to stop quality from silently eroding release over release.
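A gate of this kind reduces to comparing two metric summaries against configured limits. The sketch below is one possible shape under stated assumptions: the `Thresholds` fields, their default values, and the metric keys are hypothetical, with quality metrics gated on absolute drop and latency and cost gated on relative increase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits; real values need tuning per feature."""
    max_pass_rate_drop: float = 0.02        # absolute
    max_faithfulness_drop: float = 0.03     # absolute
    max_p95_latency_increase: float = 0.20  # relative to baseline
    max_cost_increase: float = 0.15         # relative to baseline


def check_regressions(baseline: dict, candidate: dict,
                      limits: Thresholds = Thresholds()) -> list[str]:
    """Return one human-readable message per threshold that was crossed."""
    violations = []
    for metric, limit in (("pass_rate", limits.max_pass_rate_drop),
                          ("faithfulness", limits.max_faithfulness_drop)):
        drop = baseline[metric] - candidate[metric]
        if drop > limit:
            violations.append(f"{metric} dropped by {drop:.3f} (limit {limit:.3f})")
    for metric, limit in (("p95_latency_ms", limits.max_p95_latency_increase),
                          ("cost_usd", limits.max_cost_increase)):
        if baseline[metric] <= 0:
            raise ValueError(f"baseline {metric} must be positive")
        increase = (candidate[metric] - baseline[metric]) / baseline[metric]
        if increase > limit:
            violations.append(f"{metric} rose by {increase:.1%} (limit {limit:.1%})")
    return violations


def gate_exit_code(violations: list[str]) -> int:
    """Non-zero exit code is what makes the CI job fail."""
    return 1 if violations else 0


baseline = {"pass_rate": 0.92, "faithfulness": 0.88,
            "p95_latency_ms": 1800.0, "cost_usd": 0.40}
candidate = {"pass_rate": 0.85, "faithfulness": 0.87,
             "p95_latency_ms": 1900.0, "cost_usd": 0.52}

violations = check_regressions(baseline, candidate)
exit_code = gate_exit_code(violations)
```

Returning messages rather than a bare boolean keeps the failure reviewable: the same list can be printed in the CI log and attached to the pull request.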

04 How you know it works
  • Golden-run tests: seeded inputs with known scores confirm the scorers and the regression gate fire correctly.
  • Judge calibration: the model-graded scorer is checked against a human-labeled sample and its agreement rate is tracked over time.
  • The gate itself is validated by intentionally regressing a prompt and asserting CI fails with a clear diff.
  • Trace completeness is asserted — every call in a run must emit a span with prompt, tokens, and latency.
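Two of the checks above, golden-run scoring and trace completeness, can be written as ordinary test functions. This is a sketch with hypothetical names and data: spans are modeled as plain dicts rather than real trace objects, and `contains_all` is an example rule-check scorer. The `test_` functions use only bare asserts, so pytest would collect them and they also run as plain function calls.

```python
REQUIRED_SPAN_FIELDS = ("prompt", "total_tokens", "latency_ms")


def missing_span_fields(spans: list[dict]) -> dict[str, list[str]]:
    """Map each incomplete call_id to the required fields it lacks."""
    problems = {}
    for span in spans:
        missing = [f for f in REQUIRED_SPAN_FIELDS if span.get(f) is None]
        if missing:
            problems[span["call_id"]] = missing
    return problems


def contains_all(output: str, required_phrases: list[str]) -> bool:
    """Rule-check scorer: every required phrase must appear in the output."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required_phrases)


# Golden cases: seeded outputs with the score the scorer must produce.
GOLDEN = [
    ("Refunds are issued within 14 days.", ["refund", "14 days"], True),
    ("Please contact support.", ["refund", "14 days"], False),
]


def test_golden_scores():
    for output, phrases, expected in GOLDEN:
        assert contains_all(output, phrases) is expected


def test_trace_completeness():
    spans = [
        {"call_id": "a", "prompt": "p1", "total_tokens": 12, "latency_ms": 80.0},
        {"call_id": "b", "prompt": "p2", "total_tokens": 9, "latency_ms": 65.5},
    ]
    assert missing_span_fields(spans) == {}


def test_incomplete_span_is_reported():
    spans = [{"call_id": "a", "prompt": "p1", "total_tokens": None,
              "latency_ms": 80.0}]
    assert missing_span_fields(spans) == {"a": ["total_tokens"]}
```

The third test matters as much as the second: a completeness check that can never fail tells you nothing, so it is exercised against a deliberately broken span.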
05 Deployment
  • Runs as a CI job on every pull request that touches prompts, models, or retrieval, and is runnable locally with a single command.
  • Provider keys come from CI secrets; test-set data and baselines are versioned in the repo for reproducibility.
  • Traces and metrics can ship to any OpenTelemetry backend or object storage for longer-term trend analysis.
  • A scheduled run against production-sampled inputs catches drift between releases, not just at merge time.
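Two small pieces of the deployment logic can be sketched directly: deciding whether a pull request touches anything the harness cares about, and drawing a reproducible sample of production inputs for the scheduled run. The path patterns and function names below are assumptions about repo layout, not part of the harness description; in practice the path filter often lives in the CI system's own configuration.

```python
import random
from fnmatch import fnmatch

# Hypothetical layout: adjust to wherever prompts, model config, and
# retrieval code actually live.
WATCHED_PATTERNS = ("prompts/*", "models/*.yaml", "retrieval/*")


def should_run_evals(changed_files: list[str],
                     patterns: tuple[str, ...] = WATCHED_PATTERNS) -> bool:
    """True if any changed path matches a watched pattern.

    Note that fnmatch's '*' also matches '/', so 'prompts/*' covers
    nested paths.
    """
    return any(fnmatch(path, pattern)
               for path in changed_files for pattern in patterns)


def sample_production_inputs(inputs: list[str], k: int, seed: int) -> list[str]:
    """Seeded sample so a scheduled run can be reproduced exactly."""
    rng = random.Random(seed)
    return rng.sample(inputs, min(k, len(inputs)))


pr_files = ["docs/readme.md", "prompts/support/triage.txt"]
run_needed = should_run_evals(pr_files)

logged_inputs = [f"request-{i}" for i in range(100)]
drift_sample = sample_production_inputs(logged_inputs, k=10, seed=7)
```

Recording the seed alongside the run means a drift alert can be replayed against the same inputs before anyone changes a prompt.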
06 Interview talking points
  1. Why you score retrieval, quality, latency, and cost as separate axes instead of one blended number.
  2. How LLM-as-judge works, why you calibrate it, and where you refuse to trust it.
  3. How the regression gate turns a subjective 'feels worse' into a blocking, reviewable diff.
  4. How the same traces you use to debug locally become your production observability.

Video walkthrough

Watch it built, end to end

A full video walkthrough — architecture, trade-offs, evals, and deployment — ships with the AI Engineer Interview & Portfolio Kit at launch (August 2026). There is no fake demo here: join the waitlist and you will get it the day it lands.
