Production Readiness Checklist

Use this checklist before an AI solution goes live.

Ownership

[ ] Product owner named.
[ ] Engineering owner named.
[ ] Runtime/on-call owner named.
[ ] Data owner named.
[ ] Security/governance reviewer named.

Architecture Boundaries

[ ] User workflow is documented.
[ ] Agent/workflow/retrieval boundaries are clear.
[ ] Runtime boundary is clear.
[ ] Data plane boundary is clear.
[ ] Tool side effects are isolated and governed.
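One way to make "isolated and governed" concrete is to declare side effects on each tool and refuse to run side-effecting tools without an explicit approval. The sketch below assumes a hypothetical `Tool` record and `call_tool` wrapper; it is not any particular agent framework's API, and a real deployment would also write the approval decision to the audit log.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    """A tool the agent may call (hypothetical schema)."""
    name: str
    has_side_effects: bool  # True if the tool writes, sends, or changes external state
    run: Callable[[dict], str]


class ToolApprovalRequired(Exception):
    """Raised when a side-effecting tool is called without approval."""


def call_tool(tool: Tool, args: dict, approved: bool = False) -> str:
    # Read-only tools run directly; side-effecting tools need explicit approval.
    if tool.has_side_effects and not approved:
        raise ToolApprovalRequired(f"{tool.name} has side effects and needs approval")
    return tool.run(args)


lookup = Tool("lookup_order", False, lambda a: f"order {a['id']}: shipped")
refund = Tool("issue_refund", True, lambda a: f"refunded order {a['id']}")
print(call_tool(lookup, {"id": 42}))                 # order 42: shipped
print(call_tool(refund, {"id": 42}, approved=True))  # refunded order 42
```

Calling `call_tool(refund, {"id": 42})` without `approved=True` raises instead of acting, so the default path for a state-changing tool is to stop and ask.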

Reliability

[ ] p95/p99 latency targets are defined.
[ ] Capacity model includes prompt and completion token distribution.
[ ] Fallback behavior is defined.
[ ] Retry and timeout policy is tested.
[ ] Rollback path exists.
[ ] Incident runbook exists.
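The p95/p99 targets are only useful if they are checked against measured latencies. A minimal sketch, assuming end-to-end request latencies in milliseconds are already collected; the nearest-rank percentile definition and the `meets_latency_targets` helper are illustrative choices, and monitoring systems may interpolate percentiles differently.

```python
import math


def nearest_rank_percentile(samples_ms: list, pct: float) -> float:
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def meets_latency_targets(samples_ms: list, p95_target_ms: float, p99_target_ms: float) -> bool:
    """True only if both measured percentiles are within their targets."""
    return (nearest_rank_percentile(samples_ms, 95) <= p95_target_ms
            and nearest_rank_percentile(samples_ms, 99) <= p99_target_ms)


# Hypothetical load-test result: 100 requests taking 1..100 ms.
samples = [float(ms) for ms in range(1, 101)]
print(nearest_rank_percentile(samples, 95), nearest_rank_percentile(samples, 99))  # 95.0 99.0
print(meets_latency_targets(samples, p95_target_ms=90, p99_target_ms=120))        # False
```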

Data And Retrieval

[ ] Data sources are approved.
[ ] Chunking and embedding versions are tracked.
[ ] Access control is enforced during retrieval.
[ ] Deletion and update workflows are tested.
[ ] Retrieval evaluation passes.
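Enforcing access control during retrieval means unauthorized chunks are removed before ranking and truncation, not after an answer has been generated. The sketch assumes a hypothetical chunk schema with a tenant ID and allowed groups; many vector stores can apply equivalent metadata filters inside the query itself, which can also avoid losing recall when most top candidates would be filtered out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk (hypothetical schema, not a specific vector store)."""
    text: str
    tenant_id: str
    allowed_groups: frozenset
    embedding_version: str  # tracked so re-embedding and rollbacks can be audited
    score: float            # similarity score from the vector search


def authorized_top_k(candidates, tenant_id: str, user_groups: set, k: int) -> list:
    # Apply tenant and group checks BEFORE ranking and truncation, so chunks the
    # user may not see can neither reach the model nor displace visible chunks.
    visible = [c for c in candidates
               if c.tenant_id == tenant_id and c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Chunk("Leave policy", "acme", frozenset({"hr"}), "emb-v2", 0.91),
    Chunk("Other tenant's policy", "globex", frozenset({"hr"}), "emb-v2", 0.97),
    Chunk("Salary bands", "acme", frozenset({"finance"}), "emb-v2", 0.88),
    Chunk("Holiday calendar", "acme", frozenset({"hr", "all"}), "emb-v2", 0.75),
]
print([c.text for c in authorized_top_k(candidates, "acme", {"hr"}, k=2)])
# ['Leave policy', 'Holiday calendar']
```

The highest-scoring candidate belongs to another tenant and never appears, which is exactly the failure the "retrieval that bypasses tenant access control" blocker describes.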

Evaluation

[ ] Evaluation dataset exists.
[ ] Baseline and candidate results are compared.
[ ] Safety and policy evaluations are included.
[ ] Human review covers high-risk cases.
[ ] Promotion gate owner signs off.
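Comparing baseline and candidate results can be reduced to a per-metric regression check that the promotion gate owner then reviews. This sketch assumes higher-is-better scores and a hypothetical default tolerance of 0.01; metrics listed in `strict`, such as a safety pass rate, are not allowed to regress at all. The metric names are illustrative.

```python
def find_regressions(baseline: dict, candidate: dict,
                     tolerance: float = 0.01, strict: frozenset = frozenset()) -> dict:
    """Return metrics where the candidate is worse than the baseline beyond tolerance.

    Assumes every metric is a higher-is-better score. Metrics in `strict`
    (for example a safety pass rate) may not regress at all.
    """
    regressions = {}
    for metric, base in baseline.items():
        if metric not in candidate:
            regressions[metric] = "missing from candidate run"
            continue
        allowed_drop = 0.0 if metric in strict else tolerance
        if candidate[metric] < base - allowed_drop:
            regressions[metric] = f"{base:.3f} -> {candidate[metric]:.3f}"
    return regressions


# Hypothetical eval results for the current release and the proposed change.
baseline = {"answer_accuracy": 0.82, "groundedness": 0.90, "safety_pass_rate": 0.99}
candidate = {"answer_accuracy": 0.815, "groundedness": 0.93, "safety_pass_rate": 0.985}
regressions = find_regressions(baseline, candidate, strict=frozenset({"safety_pass_rate"}))
print("promote" if not regressions else f"hold: {regressions}")
```

Here the small accuracy dip is within tolerance, but the safety drop holds the release; treating a metric missing from the candidate run as a regression keeps an incomplete evaluation from passing silently.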

Security And Governance

[ ] Secrets are not exposed in prompts, traces, logs, or tools.
[ ] Tool execution uses least privilege.
[ ] Audit logs capture high-impact actions.
[ ] PII redaction/retention policy is defined.
[ ] Model artifact provenance is tracked.
[ ] External providers meet data policy requirements.
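Keeping secrets and PII out of prompts, traces, and logs usually means redacting text at every boundary where it is written. The patterns below are illustrative assumptions, not a complete rule set; a production system would more likely rely on a dedicated secret scanner and the key formats of the providers it actually uses.

```python
import re

# Illustrative patterns only; a real rule set would come from a secret scanner
# and cover the key formats of the providers actually in use.
REDACTION_PATTERNS = [
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"),                 # HTTP bearer tokens
    re.compile(r"(?i)\b(?:api[_-]?key|password|secret)\s*[:=]\s*\S+"),  # key=value credentials
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),                         # email addresses (PII)
]


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Mask secret- and PII-like substrings before text reaches a model, trace, or log."""
    for pattern in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("Authorization: Bearer abc.def-123"))            # Authorization: [REDACTED]
print(redact("api_key=sk-test-123 contact=ana@example.com"))  # [REDACTED] contact=[REDACTED]
```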

Observability

[ ] Traces include model calls, retrieval, tools, scores, and errors.
[ ] Dashboards cover latency, cost, errors, retrieval quality, and safety events.
[ ] Alerts are configured and routed to the on-call owner.
[ ] Post-incident review process exists.
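Dashboards are easier to trust when they are computed from the same traces used for debugging. A sketch that rolls hypothetical `Span` records up by kind (model, retrieval, tool) into counts, error rates, cost, and worst-case latency; the field names are assumptions rather than a specific tracing SDK's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One step of a request trace (hypothetical fields, not a tracing SDK schema)."""
    kind: str                    # "model", "retrieval", or "tool"
    latency_ms: float
    cost_usd: float = 0.0
    error: Optional[str] = None  # set when the step failed


def summarize(spans) -> dict:
    """Aggregate spans per kind into the numbers a dashboard would chart."""
    stats = defaultdict(lambda: {"count": 0, "errors": 0, "cost_usd": 0.0, "max_latency_ms": 0.0})
    for s in spans:
        row = stats[s.kind]
        row["count"] += 1
        row["errors"] += int(s.error is not None)
        row["cost_usd"] += s.cost_usd
        row["max_latency_ms"] = max(row["max_latency_ms"], s.latency_ms)
    return {kind: {**row, "error_rate": row["errors"] / row["count"]}
            for kind, row in stats.items()}


trace = [
    Span("retrieval", 40.0),
    Span("model", 900.0, cost_usd=0.004),
    Span("model", 1500.0, cost_usd=0.006, error="timeout"),
    Span("tool", 120.0),
]
for kind, row in summarize(trace).items():
    print(kind, row)
```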

Final Gate

flowchart TB
    Ready[Readiness review] --> Reliability{Reliability pass?}
    Reliability -->|No| Hold[Hold release]
    Reliability -->|Yes| Security{Security pass?}
    Security -->|No| Hold
    Security -->|Yes| Eval{Evaluation pass?}
    Eval -->|No| Hold
    Eval -->|Yes| Ops{Operational owner ready?}
    Ops -->|No| Hold
    Ops -->|Yes| Ship[Release]
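The gate sequence in the flowchart can be expressed as a short function, which helps when a release pipeline records the review outcome. This is a sketch with hypothetical gate names; it treats a missing result as a failure, on the assumption that a gate without evidence should hold the release.

```python
# Gates in the same order as the flowchart (names are illustrative).
GATES = ("reliability", "security", "evaluation", "operational_owner")


def final_gate(results: dict) -> tuple:
    """Return ("Release", None) or ("Hold release", first failing gate).

    A gate with no recorded result counts as a failure.
    """
    for gate in GATES:
        if not results.get(gate, False):
            return "Hold release", gate
    return "Release", None


print(final_gate({"reliability": True, "security": True,
                  "evaluation": False, "operational_owner": True}))
# ('Hold release', 'evaluation')
```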

Review Method

Run this checklist as a cross-functional review with product, engineering, data, security, and operations represented. The purpose is to decide whether the solution can be operated, not whether the prototype is impressive. Each unchecked item should have an owner, a due date, and a release impact. Some items can be accepted as known risks, but only when the decision is explicit and the mitigation is understood.
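The rule that every unchecked item needs an owner, a due date, and a release impact, and that an accepted risk needs an understood mitigation, can be checked mechanically before the review closes. The `OpenItem` fields and the literal `"accepted risk"` impact value are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class OpenItem:
    """An unchecked checklist item (hypothetical fields)."""
    item: str
    owner: Optional[str] = None
    due: Optional[date] = None
    release_impact: Optional[str] = None  # e.g. "blocks release" or "accepted risk"
    mitigation: Optional[str] = None


def review_problems(items) -> list:
    """Describe what each open item still lacks before the review can close."""
    problems = []
    for it in items:
        required = (("owner", it.owner), ("due date", it.due), ("release impact", it.release_impact))
        missing = [name for name, value in required if not value]
        if missing:
            problems.append(f"{it.item}: missing {', '.join(missing)}")
        if it.release_impact == "accepted risk" and not it.mitigation:
            problems.append(f"{it.item}: accepted risk without a mitigation")
    return problems


open_items = [
    OpenItem("Alert routing", owner="sre-lead", due=date(2025, 1, 31), release_impact="accepted risk"),
    OpenItem("Rollback path"),
]
for problem in review_problems(open_items):
    print(problem)
```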

The strongest readiness reviews use evidence from the other templates. The ADR explains why the architecture exists. The runtime matrix proves that serving constraints are understood. The RAG data contract proves that retrieval quality and data governance have owners. The LLMOps scorecard proves that prompt, model, retrieval, and tool changes have promotion gates. The security review proves that access, audit, and sensitive data controls are not afterthoughts.

Common Release Blockers

Common blockers include missing rollback paths, untested provider outages, no owner for evaluation datasets, unclear trace retention policy, secrets visible in prompts or logs, retrieval that bypasses tenant access control, tool calls without approval gates, and dashboards that only show infrastructure metrics while ignoring answer quality and cost. Treat these as architecture issues, not cleanup tasks, because they directly affect production trust.