Incident Response Template

When an AI agent wipes your production database, when your API keys leak and someone runs up $30,000 in charges in three hours, or when your S3 bucket exposes customer data, what do you do? Startups with a runbook tend to recover faster. Those without one scramble, miss steps, and often make things worse. This template gives you a structure to follow when things break.

Severity Levels

SEV-1 (Critical): Complete outage, data loss, security breach, or significant revenue impact. All hands. Fix immediately.

SEV-2 (High): Major degradation, partial outage, or elevated risk. Urgent. Fix within hours.

SEV-3 (Medium): Degraded performance, limited impact. Fix within 24 hours.

SEV-4 (Low): Minor issue, workaround available. Fix when possible.
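
If you want the severity levels in a form your tooling can read, the sketch below is one way to encode them. The signal names (for example `data_loss` or `partial_outage`) are hypothetical labels for the prose definitions above, not part of any standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the template; a lower number is more urgent."""
    SEV1 = 1  # Critical
    SEV2 = 2  # High
    SEV3 = 3  # Medium
    SEV4 = 4  # Low


# Response targets copied from the definitions above.
RESPONSE_TARGET = {
    Severity.SEV1: "All hands. Fix immediately.",
    Severity.SEV2: "Urgent. Fix within hours.",
    Severity.SEV3: "Fix within 24 hours.",
    Severity.SEV4: "Fix when possible.",
}

# Hypothetical signal names standing in for the prose criteria.
SEV1_SIGNALS = {"complete_outage", "data_loss", "security_breach", "revenue_impact"}
SEV2_SIGNALS = {"major_degradation", "partial_outage", "elevated_risk"}
SEV3_SIGNALS = {"degraded_performance", "limited_impact"}


def classify(signals):
    """Return the most urgent severity matched by the observed signals."""
    observed = set(signals)
    if observed & SEV1_SIGNALS:
        return Severity.SEV1
    if observed & SEV2_SIGNALS:
        return Severity.SEV2
    if observed & SEV3_SIGNALS:
        return Severity.SEV3
    # Nothing more serious matched: treat as a minor issue.
    return Severity.SEV4


if __name__ == "__main__":
    level = classify(["partial_outage", "degraded_performance"])
    print(level.name, "-", RESPONSE_TARGET[level])
```

When several signals are present, the most urgent one wins, which matches the habit of classifying high and downgrading later.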

Phase 1: Detect and Triage

When an alert fires or a user reports an issue:

1. Acknowledge: Confirm the incident. Don't ignore it. Assign an incident lead.

2. Assess: What's affected? How many users? Is it getting worse?

3. Severity: Classify as SEV-1 through SEV-4. Escalate accordingly.

4. War room: For SEV-1/2, create a dedicated channel (Slack, etc.). Invite responders. Mute non-essential chatter.
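
The four triage steps above can be captured in a small incident record. This is a minimal sketch: the `Incident` class, its field names, and the channel naming convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    title: str
    lead: str                    # step 1: every incident gets an incident lead
    severity: int                # step 3: 1 (SEV-1) through 4 (SEV-4)
    users_affected: int = 0      # step 2: how many users?
    getting_worse: bool = False  # step 2: is it getting worse?
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        if not self.lead:
            raise ValueError("assign an incident lead before continuing")
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be between 1 and 4")

    @property
    def needs_war_room(self) -> bool:
        """Step 4: SEV-1 and SEV-2 get a dedicated channel."""
        return self.severity <= 2

    def channel_name(self) -> str:
        """Hypothetical naming convention for the war room channel."""
        return f"inc-{self.opened_at:%Y%m%d}-sev{self.severity}"


if __name__ == "__main__":
    incident = Incident("Checkout errors", lead="on-call engineer", severity=2)
    if incident.needs_war_room:
        print("Create channel:", incident.channel_name())
```

Refusing to create a record without a lead enforces the first step at the point where it is easiest to skip.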

Phase 2: Mitigate

Goal: Stop the bleeding. Restore service or contain the damage.

1. Contain: If it's a breach, revoke credentials, block IPs, disable affected services. Prevent further damage.

2. Communicate internally: Update the war room. "We're seeing X. We're doing Y. ETA for fix: Z."

3. Rollback or fix: Revert the bad deploy. Restore from backup. Apply a hotfix. Document every change.

4. Verify: Confirm the fix. Monitor. Don't declare victory until metrics are green.
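
"Don't declare victory until metrics are green" is easier to follow when "green" is written down. The sketch below assumes one possible definition: the error rate stays at or below a threshold for several consecutive samples. The 1% threshold and the five-sample window are placeholder values to tune for your service.

```python
def is_recovered(error_rates, threshold=0.01, required_green=5):
    """Return True when the latest `required_green` samples are all green.

    error_rates: error-rate samples, oldest first (0.02 means 2%).
    threshold and required_green are assumed defaults, not recommendations.
    """
    if len(error_rates) < required_green:
        # Not enough data since the fix to call it stable.
        return False
    recent = error_rates[-required_green:]
    return all(rate <= threshold for rate in recent)


if __name__ == "__main__":
    samples_after_fix = [0.20, 0.004, 0.003, 0.002, 0.002, 0.001]
    print("Recovered:", is_recovered(samples_after_fix))
```

Requiring a run of green samples, rather than a single one, guards against announcing the all-clear during a brief dip.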

Phase 3: Communicate Externally

For SEV-1/2, users and customers need to know.

Status page update template:

We're currently investigating [brief description of issue]. Our team is actively working on a fix. We'll provide an update within [timeframe]. We apologize for any inconvenience.
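
Under pressure it is easy to publish a template with a bracketed placeholder still in it. As a rough sketch, the helper below fills the status page template and refuses to return text that still contains a `[...]` placeholder. The function name and the check are illustrative assumptions, not part of any status page product.

```python
import re

STATUS_TEMPLATE = (
    "We're currently investigating {issue}. "
    "Our team is actively working on a fix. "
    "We'll provide an update within {timeframe}. "
    "We apologize for any inconvenience."
)

# Matches leftover template placeholders such as "[timeframe]".
PLACEHOLDER = re.compile(r"\[[^\]]*\]")


def render_status_update(issue, timeframe):
    """Fill the template and reject text with unfilled placeholders."""
    text = STATUS_TEMPLATE.format(issue=issue, timeframe=timeframe)
    leftover = PLACEHOLDER.findall(text)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return text


if __name__ == "__main__":
    print(render_status_update("elevated API error rates", "30 minutes"))
```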

Customer email template (if applicable):

Subject: [Service Name] – Incident Update
We're writing to inform you of an incident affecting [description]. We've identified the cause and [are working on a fix / have implemented a fix]. Expected resolution: [timeframe].
We'll send a follow-up with a full post-incident report within 48 hours. If you have questions, reply to this email.
Thank you for your patience.

When to notify: SEV-1: Immediately. SEV-2: Within 1 hour. SEV-3: If customers are affected, within 4 hours.
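
The notification windows above can be turned into a concrete deadline. In this sketch the clock starts at detection time, which is an assumption, since the template does not say what the windows are measured from. SEV-4 has no entry because the template sets no notification rule for it.

```python
from datetime import datetime, timedelta

# Windows taken from the "When to notify" rule above.
NOTIFY_WITHIN = {
    1: timedelta(0),        # SEV-1: immediately
    2: timedelta(hours=1),  # SEV-2: within 1 hour
    3: timedelta(hours=4),  # SEV-3: within 4 hours, if customers are affected
}


def notify_deadline(severity, detected_at, customers_affected=True):
    """Return the latest time to notify customers, or None if not required."""
    if severity == 3 and not customers_affected:
        return None
    window = NOTIFY_WITHIN.get(severity)
    if window is None:
        return None
    return detected_at + window


if __name__ == "__main__":
    detected = datetime(2024, 1, 2, 14, 2)  # hypothetical detection time
    print("Notify by:", notify_deadline(2, detected))
```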

Phase 4: Resolve and Hand Off

When the incident is resolved:

1. All-clear: Announce in war room. Update status page. Send resolution notice to customers.

2. Hand off: If follow-up work is needed (monitoring, cleanup), assign owners. Don't leave loose ends.

3. Schedule post-mortem: For SEV-1/2, hold it within 48 hours of resolution. Details fade fast.

Phase 5: Post-Incident Review (Blameless)

Purpose: Learn. Improve. Don't blame. Focus on systems and processes.

Post-mortem template:

Summary

2–3 sentences. What happened? What was the impact?

Timeline

- HH:MM — Alert fired / First report

- HH:MM — Team assembled, severity assessed

- HH:MM — Root cause identified

- HH:MM — Mitigation applied

- HH:MM — Service restored
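
A timeline in this shape also gives you response metrics such as time to mitigate and time to restore. The sketch below computes them from (time, event) pairs. The times are made up for illustration, and the helper assumes every entry falls on the same day.

```python
from datetime import datetime


def minutes_between(timeline, start_event, end_event):
    """Minutes from start_event to end_event.

    timeline: list of ("HH:MM", event) pairs, all on the same day.
    """
    times = {
        event: datetime.strptime(stamp, "%H:%M") for stamp, event in timeline
    }
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


# Hypothetical timeline using the template's event names.
timeline = [
    ("14:02", "Alert fired"),
    ("14:10", "Team assembled, severity assessed"),
    ("14:35", "Root cause identified"),
    ("14:50", "Mitigation applied"),
    ("15:05", "Service restored"),
]

if __name__ == "__main__":
    print("Time to mitigate (min):",
          minutes_between(timeline, "Alert fired", "Mitigation applied"))
    print("Time to restore (min):",
          minutes_between(timeline, "Alert fired", "Service restored"))
```

For incidents that cross midnight, record full dates and times instead of `HH:MM` alone.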

Root Cause

Use the 5 Whys. Go deeper than "the deploy broke it." Why did the deploy break it? Why wasn't it caught? Why did that gap exist?

What Went Well

Specific wins. Fast detection? Clear communication? Effective rollback? Acknowledge them.

What Didn't Go Well

Process failures. Missing runbooks. Delayed escalation. Unclear ownership. Be specific.

Action Items

- Add alert for X — Owner: [Name], Due: [Date]

- Update runbook for Y — Owner: [Name], Due: [Date]

- Implement Z to prevent recurrence — Owner: [Name], Due: [Date]

Rule: Every action item has an owner and a due date. Track them. Close the loop.
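
The owner-and-due-date rule is simple to check automatically. The following is a minimal sketch, assuming action items are kept as structured records. The `ActionItem` class and the `find_problems` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def find_problems(items, today):
    """List items that break the rule or are past due and still open."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.description}: missing owner")
        if item.due is None:
            problems.append(f"{item.description}: missing due date")
        elif not item.done and item.due < today:
            problems.append(f"{item.description}: overdue")
    return problems


if __name__ == "__main__":
    items = [
        ActionItem("Add alert for X", owner="Sam", due=date(2024, 1, 10)),
        ActionItem("Update runbook for Y"),
    ]
    for problem in find_problems(items, today=date(2024, 1, 15)):
        print(problem)
```

Running a check like this on a schedule is one way to close the loop without relying on memory.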

Contributors

Who responded? Who helped? Credit the team.

When to Write a Post-Mortem

Always: SEV-1, SEV-2, any data loss, security incidents, incidents lasting >1 hour.

Consider: SEV-3 with systemic issues, near-misses, incidents that generated complaints.

Skip: SEV-4, incidents resolved in <5 minutes with no impact.
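
These three rules can be sketched as a single decision function. Two points are assumptions rather than part of the template: the rules are checked in the order Always, Consider, Skip, and an incident that matches none of them falls back to "consider". The parameter names are illustrative.

```python
def postmortem_decision(severity, duration_minutes, data_loss=False,
                        security_incident=False, systemic=False,
                        near_miss=False, complaints=False, had_impact=True):
    """Return "always", "consider", or "skip" for a finished incident."""
    # Always: SEV-1, SEV-2, data loss, security incidents, over 1 hour.
    if (severity <= 2 or data_loss or security_incident
            or duration_minutes > 60):
        return "always"
    # Consider: SEV-3 with systemic issues, near-misses, complaints.
    if (severity == 3 and systemic) or near_miss or complaints:
        return "consider"
    # Skip: SEV-4, or resolved in under 5 minutes with no impact.
    if severity == 4 or (duration_minutes < 5 and not had_impact):
        return "skip"
    # Not covered by the rules above; this fallback is an assumption.
    return "consider"


if __name__ == "__main__":
    print(postmortem_decision(severity=3, duration_minutes=90))
```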

Escalation Path

Define this before an incident:

- Primary on-call: [Name/Role] – First responder.

- Secondary: [Name/Role] – Backup if primary unavailable.

- Escalation: [Name/Role] – For SEV-1, or if primary needs support.

- Executive: [Name/Role] – For customer-facing SEV-1, or if external comms needed.
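
Writing the escalation path down as data makes it harder to improvise during an incident. This is one possible sketch: the tier names mirror the list above, the `[Name/Role]` placeholders are left for you to fill in, and the function and its parameters are hypothetical.

```python
# Fill in the placeholders before an incident, not during one.
ESCALATION_PATH = {
    "primary": "[Name/Role]",
    "secondary": "[Name/Role]",
    "escalation": "[Name/Role]",
    "executive": "[Name/Role]",
}


def who_to_page(severity, primary_available=True, needs_support=False,
                customer_facing=False, external_comms=False):
    """Return the tiers to page, in order, for the given situation."""
    # Secondary is the backup when the primary is unavailable.
    tiers = ["primary" if primary_available else "secondary"]
    # Escalation: for SEV-1, or if the first responder needs support.
    if severity == 1 or needs_support:
        tiers.append("escalation")
    # Executive: customer-facing SEV-1, or external comms needed.
    if (severity == 1 and customer_facing) or external_comms:
        tiers.append("executive")
    return tiers


if __name__ == "__main__":
    for tier in who_to_page(severity=1, customer_facing=True):
        print(tier, "->", ESCALATION_PATH[tier])
```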

Tools to Have Ready

- Status page (e.g., Statuspage, Better Uptime)

- Incident channel (Slack, etc.)

- Runbooks (Notion, Confluence, or docs in repo)

- Access to logs, metrics, and deploy history

- Contact list for vendors (AWS, etc.) if needed

---

When things break, structure beats chaos. Customize this template for your team. Run a practice incident. Update it after every real incident. The goal isn't to never have incidents—it's to recover fast and learn every time.

