Flagship Case Study

Agentic AI Production Harness

A runnable full-stack AI reference project for the layer most agent demos are missing: product UX, workflow control, retrieval grounding, typed tool execution, human approval, evaluation, traces, and operational readiness.

Problem

Agent demos often work on controlled paths but fail when real users introduce ambiguity, stale context, risky actions, partial inputs, or tool failures, or when production constraints such as latency and cost budgets apply. This harness shows how to make those risks visible and manageable.

Build status

The project now includes a dependency-free runnable Node demo, a golden workflow eval suite, a tool contract schema, trace artifacts, and a visual trace mock. Connecting the mock to generated traces is the next upgrade.

Run it locally

cd examples/agentic-ai-production-harness
npm run demo
npm run eval

Trace viewer mock

What the harness makes visible

Trace: trace_demo_001

  1. Intent router: tool_request_with_human_approval (confidence 0.82)
     Detected a state-changing support action that requires review.
  2. Retriever: 2 policy documents found (precision 0.75)
     Refund Policy and Account Cancellation Policy were used as grounding evidence.
  3. Tool contract validation: valid high-risk action (valid true)
     JSON schema validated the action type, customer id, reason, evidence ids, and approval requirement.
  4. Approval gate: waiting for reviewer (approval required)
     The workflow pauses before account cancellation or refund execution.
  5. Eval signals: pass with review ($0.012 / 1840 ms)
     Groundedness, tool success, approval routing, latency, and cost are captured in the trace.

Full-stack AI system map

The layers I design together

1

AI Product UX

Copilot screens, review queues, dashboards, feedback, and trust signals.

2

Frontend State + Events

Streaming responses, loading states, user corrections, and workflow progress.

3

Backend Workflow Layer

APIs, auth, queues, state machines, orchestration, and human approval paths.

4

RAG + Data Layer

Documents, metadata, embeddings, retrieval, citations, freshness, and permissions.

5

Model + Agent Layer

LLM calls, tool contracts, routing, fallback logic, and controlled agent actions.

6

Eval + Observability

Traces, prompt versions, golden workflows, cost, latency, quality checks, and release gates.

Inspectable artifacts

These repository artifacts make the case study inspectable without pretending the full UI is finished.

Architecture flow

User request
  -> Intent router
  -> Planner node
  -> Retriever + policy checker
  -> Tool execution gateway
  -> Human approval checkpoint when needed
  -> Response composer
  -> Evaluation runner
  -> Trace and metrics dashboard
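
The flow above can be sketched as an ordered list of stages that share one context object and append a trace entry per step. This is a minimal illustration, not the harness's actual code: `runPipeline`, the stage bodies, and fields such as `highRisk` and `paused` are hypothetical names chosen for the example.

```javascript
// Hypothetical stage runner. Stage names mirror the flow above; the stage
// bodies are placeholders, not the project's actual logic.
function runPipeline(stages, request) {
  const trace = [];
  let context = { request };
  for (const [name, stage] of stages) {
    const output = stage(context);
    context = { ...context, ...output };
    trace.push({ step: name, output });
    // Stop early when a stage pauses the workflow for a reviewer.
    if (context.paused) break;
  }
  return { context, trace };
}

const demoStages = [
  ["intent_router", () => ({ route: "tool_request_with_human_approval" })],
  ["planner", () => ({ plan: ["retrieve_policy", "propose_refund"] })],
  ["retriever", () => ({ evidenceIds: ["refund_policy"] })],
  ["tool_gateway", () => ({ proposedAction: "refund", highRisk: true })],
  ["approval_checkpoint", (ctx) => ({ paused: ctx.highRisk })],
  ["response_composer", () => ({ response: "Refund issued." })],
];

const pipelineResult = runPipeline(demoStages, "Please refund my last order");
console.log(pipelineResult.trace.map((entry) => entry.step).join(" -> "));
```

In this sketch the run stops at the approval checkpoint, so the response composer never executes until a reviewer acts.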

Request intake and intent router

Classifies whether the user needs retrieval, tool execution, human review, or a normal response path.
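
As a rough sketch of that classification step, the router below uses ordered keyword rules. The rule patterns and every route name except `tool_request_with_human_approval` (which appears in the demo trace) are assumptions made for illustration; a production router would more likely use a model call with a confidence score.

```javascript
// Hypothetical keyword rules, checked in order; the first match wins.
const ROUTE_RULES = [
  { route: "tool_request_with_human_approval", pattern: /\b(cancel my|refund my|delete my)\b/i },
  { route: "tool_request", pattern: /\b(update|reset|change)\b/i },
  { route: "retrieval", pattern: /\b(policy|how|what|when)\b/i },
];

function routeIntent(message) {
  for (const rule of ROUTE_RULES) {
    if (rule.pattern.test(message)) return rule.route;
  }
  // Nothing matched: treat it as a normal conversational reply.
  return "normal_response";
}

console.log(routeIntent("Please refund my last order"));
console.log(routeIntent("What is the refund policy?"));
```

Rule order matters here: state-changing requests are checked first so that a risky action is never answered as a plain policy question.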

Planner node

Breaks the task into controlled steps and records assumptions before any tool is called.
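
One way to make "assumptions are recorded before any tool is called" enforceable is to freeze the plan once it is built. The sketch below assumes a hypothetical plan shape (`request`, `steps`, `assumptions`) and helper names; the harness may model plans differently.

```javascript
// Hypothetical plan builder: the plan is frozen on creation, so steps and
// assumptions cannot be edited once tool calls begin.
function createPlan(request, steps, assumptions) {
  return Object.freeze({
    request,
    steps: Object.freeze(steps.map((step) => Object.freeze({ ...step }))),
    assumptions: Object.freeze([...assumptions]),
  });
}

// Returns only the steps that will reach the tool execution gateway.
function toolSteps(plan) {
  return plan.steps.filter((step) => step.kind === "tool");
}

const refundPlan = createPlan(
  "Please refund my last order",
  [
    { kind: "retrieve", name: "lookup_refund_policy" },
    { kind: "tool", name: "propose_refund" },
  ],
  ["The last order is the one the customer means"]
);

console.log(toolSteps(refundPlan).map((step) => step.name));
```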

Retriever and policy checker

Fetches grounded context, checks document freshness, validates access rules, and prepares citations.
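
A minimal sketch of the freshness and access checks, assuming each document carries an `updatedAt` timestamp and an `allowedRoles` list. The 180-day window, the field names, the document ids, and the "Internal Legal Memo" entry are illustrative assumptions, not values from the project.

```javascript
const MAX_AGE_DAYS = 180; // assumed freshness window
const DAY_MS = 24 * 60 * 60 * 1000;

// Drops documents the user may not read, then labels the rest with
// age, staleness, and a citation string.
function checkDocuments(docs, userRoles, nowMs) {
  return docs
    .filter((doc) => doc.allowedRoles.some((role) => userRoles.includes(role)))
    .map((doc) => {
      const ageDays = Math.floor((nowMs - Date.parse(doc.updatedAt)) / DAY_MS);
      return {
        citation: `[${doc.id}] ${doc.title}`,
        ageDays,
        stale: ageDays > MAX_AGE_DAYS,
      };
    });
}

const sampleDocs = [
  { id: "doc_refund", title: "Refund Policy", updatedAt: "2024-05-01T00:00:00Z", allowedRoles: ["support"] },
  { id: "doc_cancel", title: "Account Cancellation Policy", updatedAt: "2023-01-15T00:00:00Z", allowedRoles: ["support"] },
  { id: "doc_legal", title: "Internal Legal Memo", updatedAt: "2024-04-01T00:00:00Z", allowedRoles: ["legal"] },
];

const checkedDocs = checkDocuments(sampleDocs, ["support"], Date.parse("2024-06-01T00:00:00Z"));
console.log(checkedDocs);
```

Passing the current time in as an argument keeps the check deterministic, which makes it easy to reuse in a golden eval.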

Tool execution gateway

Runs only approved tools with typed inputs, output validation, audit logs, and fallback handling.
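
To illustrate typed inputs at the gateway, here is a dependency-free check built from plain predicate functions. It stands in for the JSON schema validation mentioned in the trace and is not a JSON Schema implementation; the field names (`actionType`, `customerId`, `reason`, `evidenceIds`, `requiresApproval`) and the allowed action list are assumptions based on the fields the trace describes.

```javascript
// Hypothetical contract: one predicate per required field.
const TOOL_CONTRACT = {
  actionType: (v) => ["refund", "cancel_account"].includes(v),
  customerId: (v) => typeof v === "string" && v.length > 0,
  reason: (v) => typeof v === "string" && v.length > 0,
  evidenceIds: (v) =>
    Array.isArray(v) && v.length > 0 && v.every((id) => typeof id === "string"),
  requiresApproval: (v) => typeof v === "boolean",
};

function validateToolRequest(request) {
  const errors = [];
  for (const [field, isValid] of Object.entries(TOOL_CONTRACT)) {
    if (!(field in request)) errors.push(`missing field: ${field}`);
    else if (!isValid(request[field])) errors.push(`invalid field: ${field}`);
  }
  // Reject fields the contract does not know about.
  for (const field of Object.keys(request)) {
    if (!(field in TOOL_CONTRACT)) errors.push(`unexpected field: ${field}`);
  }
  return { valid: errors.length === 0, errors };
}

const goodRequest = {
  actionType: "refund",
  customerId: "cust_123",
  reason: "Duplicate charge",
  evidenceIds: ["doc_refund"],
  requiresApproval: true,
};
console.log(validateToolRequest(goodRequest));
console.log(validateToolRequest({ actionType: "wire_transfer", customerId: "cust_123", evidenceIds: [], requiresApproval: true }));
```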

Human approval checkpoint

Routes risky or irreversible actions to a reviewer instead of allowing fully autonomous execution.
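
The checkpoint can be sketched as a gate that looks up a reviewer decision before a high-risk action runs. The set of high-risk actions, the status strings, and the `approvals` map keyed by action id are hypothetical; only the idea of pausing before refunds and account cancellations comes from the demo trace.

```javascript
// Assumed list of actions that always need a reviewer.
const HIGH_RISK_ACTIONS = new Set(["refund", "cancel_account"]);

// approvals: Map of action id -> { approved: boolean, reviewer: string }
function gateAction(action, approvals) {
  if (!HIGH_RISK_ACTIONS.has(action.actionType)) {
    return { status: "execute" };
  }
  const decision = approvals.get(action.id);
  if (decision === undefined) {
    // No decision yet: the workflow pauses here.
    return { status: "waiting_for_reviewer" };
  }
  return {
    status: decision.approved ? "execute" : "rejected",
    reviewer: decision.reviewer,
  };
}

const refundAction = { id: "act_1", actionType: "refund" };
console.log(gateAction(refundAction, new Map()));
console.log(gateAction(refundAction, new Map([["act_1", { approved: true, reviewer: "reviewer_a" }]])));
```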

Evaluation and trace layer

Scores groundedness, task completion, latency, cost, tool-call success, and failure categories.
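
A small aggregation sketch shows how those signals might roll up across traces. The trace record shape (`groundedness`, `completed`, `toolCalls`, `toolCallsOk`, `latencyMs`, `costUsd`, `failureCategory`) is an assumption for illustration, and the second sample record is invented; only the $0.012 and 1840 ms figures echo the demo trace.

```javascript
// Aggregates per-trace signals into one summary object.
function summarizeTraces(traces) {
  if (traces.length === 0) throw new Error("no traces to summarize");
  const sum = (pick) => traces.reduce((total, t) => total + pick(t), 0);
  const totalToolCalls = sum((t) => t.toolCalls);
  const failures = {};
  for (const t of traces) {
    if (t.failureCategory) {
      failures[t.failureCategory] = (failures[t.failureCategory] || 0) + 1;
    }
  }
  return {
    groundedness: sum((t) => t.groundedness) / traces.length,
    taskCompletionRate: sum((t) => (t.completed ? 1 : 0)) / traces.length,
    // Avoid dividing by zero when no trace called a tool.
    toolSuccessRate: totalToolCalls === 0 ? null : sum((t) => t.toolCallsOk) / totalToolCalls,
    maxLatencyMs: Math.max(...traces.map((t) => t.latencyMs)),
    totalCostUsd: sum((t) => t.costUsd),
    failures,
  };
}

const sampleTraces = [
  { groundedness: 1, completed: true, toolCalls: 2, toolCallsOk: 2, latencyMs: 1840, costUsd: 0.012, failureCategory: null },
  { groundedness: 0.5, completed: false, toolCalls: 2, toolCallsOk: 1, latencyMs: 950, costUsd: 0.02, failureCategory: "stale_document" },
];

const traceSummary = summarizeTraces(sampleTraces);
console.log(traceSummary);
```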

Mock product states

  • User request intake with task type and risk label
  • Retrieved evidence panel with source, freshness, and citation status
  • Tool approval queue for risky or irreversible actions
  • Trace timeline showing prompt version, retrieved context, tool calls, latency, and cost

Failure modes covered

  • Wrong intent route
  • Stale retrieved document
  • Risky tool request
  • Malformed tool output
  • Prompt regression
  • Latency or cost spike

Evaluation table

Scenario | Expected behavior | Signal
Policy question with fresh source | Answer with citation and freshness note | Groundedness + citation coverage
Request needs external tool action | Route through tool contract and approval rule | Tool-call success + approval rate
Stale retrieved document | Warn or fallback instead of confident answer | Freshness handling
Ambiguous user request | Ask clarification before planning actions | Workflow completion quality
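
The scenarios in the table can be expressed as golden cases and replayed against the system under test. The sketch below is only an illustration of that pattern: `decideBehavior` is a stand-in for the real workflow, and the input fields, behavior labels, and 180-day staleness threshold are all assumptions.

```javascript
// Hypothetical golden cases, one per scenario in the table.
const GOLDEN_CASES = [
  { name: "policy question with fresh source", input: { intent: "policy_question", docAgeDays: 10 }, expected: "answer_with_citation" },
  { name: "request needs external tool action", input: { intent: "tool_action", docAgeDays: 10 }, expected: "route_to_approval" },
  { name: "stale retrieved document", input: { intent: "policy_question", docAgeDays: 400 }, expected: "warn_stale_source" },
  { name: "ambiguous user request", input: { intent: "unknown", docAgeDays: 0 }, expected: "ask_clarification" },
];

// Stand-in for the workflow being evaluated.
function decideBehavior({ intent, docAgeDays }) {
  if (intent === "unknown") return "ask_clarification";
  if (intent === "tool_action") return "route_to_approval";
  if (docAgeDays > 180) return "warn_stale_source";
  return "answer_with_citation";
}

// Runs every case and reports which ones matched the expected behavior.
function runGoldenEval(cases, decide) {
  const results = cases.map((testCase) => {
    const actual = decide(testCase.input);
    return { name: testCase.name, actual, pass: actual === testCase.expected };
  });
  const passed = results.filter((result) => result.pass).length;
  return { passed, total: results.length, results };
}

const evalReport = runGoldenEval(GOLDEN_CASES, decideBehavior);
console.log(`${evalReport.passed}/${evalReport.total} golden cases passed`);
```

Keeping the cases as plain data means a prompt or routing change can be checked by rerunning the same list and comparing pass counts.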

Implementation roadmap

  1. Connect the visual trace viewer to generated demo output.
  2. Add a human approval UI mock for risky tool requests.
  3. Add a small retrieval corpus with document freshness and permission examples.
  4. Add a demo GIF and screenshots for the portfolio case study page.
  5. Publish companion build notes explaining the implementation decisions.