Skip to content

Evals & Observability

Evals and observability are the production feedback layer. They answer a simple question: how do we know the AI system still behaves correctly after prompts, tools, retrieval, models, or workflows change?

Traditional tests prove deterministic code behavior. AI evals prove the probabilistic behavior of model output, prompts, retrieval, and tool paths still meets an acceptable bar over time.

What this layer owns

ConcernOutput
Tracingfull run timeline across model calls, retrieval, tools, prompts
Quality evalspass/fail or scored behavior on representative cases
Regression detectionwhether new changes made behavior worse
Prompt/version trackingwhich prompt produced which result
Dataset managementgolden questions, expected evidence, edge cases
Cost and latencytoken use, model route, end-to-end duration
Debuggingwhy an answer or action happened
Release gateswhether a change can ship

Traditional tests vs AI evals

Test typeBest forExample
Unit testdeterministic codeparser returns expected JSON
Integration testAPI/tool wiringCRM tool creates draft ticket
Contract testschema compatibilitytool args match OpenAPI schema
Retrieval evalcontext qualityexpected policy doc appears in top 5
Generation evalanswer qualityanswer is grounded and complete
Agent trajectory evalpath qualityagent asks approval before write action
Safety evalrefusal and policy behaviorprompt injection does not trigger unsafe tool

Golden dataset pattern

A golden dataset is a curated set of representative cases that define expected behavior. It should include normal cases, edge cases, negative cases, and high-risk cases.

Dataset fieldExample
Inputuser question or task
Expected evidencedocuments, code files, records, or facts that should be used
Expected behavioranswer, refusal, tool call, approval request
Risk tagsecurity, privacy, legal, operational, product
Evaluation methodexact match, rubric score, LLM judge, human review

Trace-first debugging

mermaid
flowchart LR
    A[User request] --> B[Agent/app run]
    B --> C[Trace]
    B --> D[Model output]
    B --> E[Tool calls]
    C --> F[Observability platform]
    D --> G[Eval scorer]
    E --> G
    G --> H[Regression report]
    H --> I[Prompt/spec/tool changes]

When an AI system fails, do not start by rewriting the prompt. Inspect the trace first:

  1. What instruction was active?
  2. What context was retrieved?
  3. Which model was used?
  4. Which tool was called?
  5. What arguments were passed?
  6. Which guardrail or approval gate fired?
  7. Which eval case should catch this next time?

Metrics that matter

MetricWhy it matters
Answer correctnessproduct value
Grounding/citation ratetrust and auditability
Retrieval precision/recallRAG quality
Tool success/failure rateoperational reliability
Unsafe action attemptssecurity signal
Human approval rateautonomy boundary health
Cost per successful taskbusiness scalability
Latency percentilesuser experience
Regression raterelease quality

Tooling map

Tool/categoryRole
LangSmithLangChain/LangGraph tracing and eval workflows
Langfuseopen-source LLM observability and prompt/eval tracking
Phoenixtracing, evals, and ML/LLM observability
OpenTelemetryvendor-neutral telemetry standard
CI eval gateprevents regressions from merging

Step-by-step adoption guide

  1. Instrument traces before optimizing prompts.
  2. Create 30-50 golden cases for your highest-value workflow.
  3. Add retrieval evals if the app uses RAG.
  4. Add tool trajectory evals if the agent can take actions.
  5. Run evals locally during development and in CI for critical changes.
  6. Track prompt version, model route, retrieval configuration, and tool version.
  7. Set failure thresholds by risk tier.
  8. Review eval failures weekly and convert incidents into new cases.

Failure modes

Failure modeSymptomBetter approach
No tracesfailures are debated from screenshotstrace every run
Only unit testscode passes but AI behavior regressesadd behavioral evals
Evals are too genericscores look good while users complainuse real tasks and edge cases
No dataset ownershipevals decay over timeassign owner and review cadence
LLM judge onlyflaky quality signalcombine deterministic checks, rubric, and human review
No CI gateregressions ship repeatedlyblock high-risk changes on eval threshold

References

Built as a static bilingual AI engineering stack guide.