AI Agent Postmortem Template: Review a Workflow Failure After Launch
This template helps operators turn a failed AI workflow into a short evidence record: timeline, trigger, missing safeguard, approval gap, blast radius, and corrective action. Use it after an incident when the team needs a practical postmortem template instead of another generic status update.
A useful postmortem is not a long apology. It is the shortest document that can reconstruct the failure, show which safeguard did not hold, and define what proof is required before the same pattern returns to production.
Run a full postmortem when the route changed live work, not only when it caused embarrassment
Use the full template when a workflow wrote to customer-facing systems, created duplicate work, missed a material approval boundary, or forced a manual cleanup that would matter again if repeated. Do not wait for a public outage. A private, expensive, or hard-to-reconstruct miss is enough.
Freeze the evidence bundle before debate starts
| Field | Why it belongs in the record | Weak shortcut to reject |
|---|---|---|
| Trigger event | Shows what started the incident. | Starting the story with the root cause guess. |
| Observed impact | Separates user or business effect from internal frustration. | Jumping straight to engineering pain. |
| Timeline | Keeps discussion anchored to timestamps and state transitions. | Reconstructing order from memory after the logs changed. |
| Control failure | Makes clear which guardrail did not hold. | Calling the incident ‘human error’ without naming the missing control. |
| Corrective action | Turns the review into an exit criterion for re-launch. | Listing ideas without owners or due dates. |
The evidence bundle should include execution IDs, request IDs, approval records, queue or database state, screenshots if needed, and any external API responses that affect replay or reconciliation. If one of those artifacts can disappear after the system recovers, capture it before the review starts.
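The capture-before-review rule can be sketched as a small record that refuses to freeze until every artifact kind is present. This is a minimal Python sketch under stated assumptions, not a prescribed implementation: `EvidenceBundle`, `REQUIRED_ARTIFACTS`, and the artifact key names are hypothetical, and screenshots are treated as optional because the text lists them as "if needed".

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

# Artifact kinds taken from the paragraph above; the key names are illustrative.
REQUIRED_ARTIFACTS = (
    "execution_ids",
    "request_ids",
    "approval_records",
    "queue_or_db_state",
    "external_api_responses",
)


@dataclass
class EvidenceBundle:
    incident_id: str
    artifacts: Dict[str, Any] = field(default_factory=dict)
    frozen_at: Optional[datetime] = None

    def add(self, kind: str, payload: Any) -> None:
        # Once frozen, the bundle is read-only so the review argues over fixed evidence.
        if self.frozen_at is not None:
            raise RuntimeError("bundle is frozen; record later findings separately")
        self.artifacts[kind] = payload

    def missing(self) -> List[str]:
        # A required artifact counts as missing if it is absent or empty.
        return [k for k in REQUIRED_ARTIFACTS if not self.artifacts.get(k)]

    def freeze(self) -> datetime:
        gaps = self.missing()
        if gaps:
            raise ValueError(f"cannot freeze; missing artifacts: {gaps}")
        self.frozen_at = datetime.now(timezone.utc)
        return self.frozen_at
```

Calling `freeze()` before the review meeting turns "did we capture everything?" into an explicit failure rather than a discovery made mid-discussion.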
Copyable postmortem template
1. What failed: one paragraph in plain language.
2. Impact window: start time, end time, affected surface, and what was actually harmed.
3. Timeline: timestamped observable events only.
4. Trigger: the event that began the incident.
5. Control failure: the safeguard that should have prevented or contained it.
6. Why the team believed the route was safe: this exposes hidden assumptions.
7. Corrective actions: owner, due date, and measurable exit criterion.
8. Re-launch gate: the proof required before the same pattern is allowed back into production.
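For teams that keep postmortems as structured records rather than free text, the eight sections might be modeled as follows. This is a sketch only: the class names, field names, and the two helper methods are assumptions made for illustration, not part of the template itself.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    at: datetime
    observation: str  # an observable event, not an interpretation


@dataclass
class Postmortem:
    # One field per template section, in the same order.
    what_failed: str = ""
    impact_window: str = ""
    timeline: List[TimelineEvent] = field(default_factory=list)
    trigger: str = ""
    control_failure: str = ""
    why_believed_safe: str = ""
    corrective_actions: List[str] = field(default_factory=list)
    relaunch_gate: str = ""

    def unfilled_sections(self) -> List[str]:
        # Any empty string or empty list is reported as an unfilled section.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def timeline_is_ordered(self) -> bool:
        # True when events are in non-decreasing timestamp order.
        stamps = [event.at for event in self.timeline]
        return stamps == sorted(stamps)
```

A review can then refuse to close while `unfilled_sections()` is non-empty or `timeline_is_ordered()` is false.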
Containment and prevention should be written separately
Containment is what reduces harm now. Prevention is what changes the route before it returns. Teams lose time when they mix those into one vague action list. A strong postmortem writes both and gives each a different owner if needed.
Write the counterfactual that would have stopped the incident earlier
| Counterfactual question | Useful answer | Weak answer to reject |
|---|---|---|
| What single control would have blocked the bad action? | A named guardrail at the exact failed boundary, such as idempotency, approval, or stale-state recheck. | A general statement that the team should be more careful. |
| What single signal would have shortened impact time? | One alert, log correlation, or queue metric that would have made the issue visible sooner. | The team would have noticed if it had been paying closer attention. |
| What single test should now be mandatory before re-launch? | A targeted replay or drill that reproduces the same failure path. | A promise to include it in broader QA someday. |
The counterfactual keeps the postmortem creative without getting theatrical. It forces the review to identify the missing design move that would have changed the result, rather than settling for a cleaner narrative after the fact.
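Idempotency is one of the guardrails the table names, so it makes a useful illustration of what "a named guardrail at the exact failed boundary" can look like. The sketch below assumes an in-memory store and a caller-supplied key; `IdempotentWriter` and `write_fn` are hypothetical names, and a production version would need durable storage and handling for concurrent callers.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotentWriter:
    """Wraps a side-effecting write so a repeated key does not repeat the effect."""

    def __init__(self, write_fn: Callable[[Any], Any]) -> None:
        self._write_fn = write_fn
        self._results: Dict[str, Any] = {}  # in-memory only; an assumption of this sketch

    def write(self, idempotency_key: str, payload: Any) -> Tuple[Any, bool]:
        # Returns (result, performed). performed is False when the key was seen before.
        if idempotency_key in self._results:
            return self._results[idempotency_key], False
        result = self._write_fn(payload)
        self._results[idempotency_key] = result
        return result, True
```

If the incident was a duplicate write, the counterfactual answer becomes specific: this wrapper, at that call site, keyed on that identifier.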
Score each corrective action before you reopen the route
| Action type | Good exit criterion | Weak version to reject |
|---|---|---|
| Detection fix | A new alert, query, or dashboard catches the same failure mode in test or staging. | The team promises to ‘watch it more closely’ next time. |
| Control fix | One guardrail is added or tightened at the exact failed boundary. | The action says people should be more careful. |
| Coordination fix | One owner, one handoff rule, and one incident contact path are named. | Responsibility stays spread across several teams. |
| Re-launch test | A concrete drill proves the miss is harder to repeat. | The route returns after a meeting but before a targeted test. |
This table keeps the review from ending on persuasive language alone. A route should come back only when the missing control is stronger, the detection is faster, and the re-launch proof targets the exact path that failed.
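The scoring can be partly mechanized. The following sketch flags actions that lack an owner, due date, or exit criterion, and catches the weak wording the table rejects. `CorrectiveAction`, `review_action`, and the `WEAK_PHRASES` list are illustrative assumptions; a phrase match is a prompt for a human reviewer, not a verdict.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Phrases drawn from the "weak version" column; the list is illustrative, not exhaustive.
WEAK_PHRASES = ("more careful", "watch it more closely", "closer attention")


@dataclass
class CorrectiveAction:
    action_type: str  # e.g. "detection", "control", "coordination", "relaunch_test"
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    exit_criterion: str = ""


def review_action(action: CorrectiveAction) -> List[str]:
    """Return the reasons this action is not ready; an empty list means it passes."""
    problems = []
    if not action.owner:
        problems.append("no owner")
    if action.due is None:
        problems.append("no due date")
    if not action.exit_criterion.strip():
        problems.append("no exit criterion")
    text = f"{action.description} {action.exit_criterion}".lower()
    for phrase in WEAK_PHRASES:
        if phrase in text:
            problems.append(f"weak wording: {phrase!r}")
    return problems
```

Running `review_action` over every item before the review closes gives each weak action a named reason for rejection.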
Primary sources
These links are the primary documents or official reference pages used to tighten the decision logic in this article.
- Google SRE postmortem culture – Blameless does not mean vague; it means evidence-first and system-focused.
- Google SRE incident management guide – The postmortem should improve detection, mitigation, coordination, and communication.
- Google SRE workbook: postmortem culture – Templates and tracked action items make the review useful after the meeting ends.
- GitLab incident management handbook – Official runbook example for declaring, coordinating, and communicating incidents.
Re-launch gate after the postmortem
- Do not re-launch if the incident timeline still depends on memory instead of captured evidence.
- Do not re-launch if the failed guardrail is unnamed or assigned only to ‘better judgment’.
- Do not re-launch if corrective actions have no owner, date, or exit criterion.
- Do not re-launch if containment exists but prevention is still vague.
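The four conditions above can be expressed as a single check that returns the reasons a re-launch is blocked. This is a sketch under assumptions: `GateInput`, its field names, and `relaunch_blockers` are hypothetical, and the boolean inputs stand in for judgments that the review itself has to make.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateInput:
    timeline_from_captured_evidence: bool  # False if any step relies on memory
    failed_guardrail: str                  # name of the control that did not hold
    actions_have_owner_date_exit: bool     # True only if every action has all three
    prevention_defined: bool               # True if prevention is specific, not only containment


def relaunch_blockers(gate: GateInput) -> List[str]:
    """Return one blocker per unmet condition; an empty list means the gate is open."""
    blockers = []
    if not gate.timeline_from_captured_evidence:
        blockers.append("timeline depends on memory")
    name = gate.failed_guardrail.strip().lower()
    if not name or name == "better judgment":
        blockers.append("failed guardrail is unnamed")
    if not gate.actions_have_owner_date_exit:
        blockers.append("corrective actions lack owner, date, or exit criterion")
    if not gate.prevention_defined:
        blockers.append("prevention is still vague")
    return blockers
```

Returning reasons rather than a single pass/fail keeps the gate decision auditable when the route is reopened later.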
Next document, not more filler
- AI Agent Production Checklist: 9 Checks Before a Workflow Goes Live – Use this to convert lessons into a release gate.
- How to Prevent Race Conditions in Multi-Agent Workflows – Use this when the postmortem shows contested ownership.
- Why State-Managed Interruptions Make AI Tools Production-Ready – Use this when the miss involved pause and resume logic.