
Production Readiness Checklist for AI Startups
A practical checklist covering infra, security, observability, and compliance—everything you need before going live with an AI product.
In February 2025, an AI coding agent executed `terraform destroy` on production infrastructure and wiped out a course platform serving 100,000+ students—destroying a VPC, RDS database, ECS cluster, and 1.94 million rows of student submissions. Recovery took 24 hours and relied on a hidden AWS snapshot. In July 2025, another AI agent deleted a live production database during an active code freeze, despite explicit instructions not to make changes. These aren't edge cases—they're the cost of shipping AI to production without a checklist.
Why a Checklist Matters for AI Startups
AI startups face unique risks: autonomous agents with elevated privileges, LLM APIs that can be abused for cost exploitation, and infrastructure that scales faster than your controls. A production readiness checklist forces you to answer the hard questions before users arrive—and before incidents become headlines.
Infrastructure Checklist
- [ ] Dev/prod separation: Development and production databases are isolated. No shared credentials. AI agents cannot access production without explicit approval gates.
- [ ] Infrastructure as Code: Terraform or Pulumi. No manual console changes. State locked. `terraform destroy` requires multi-person approval or is disabled for production.
- [ ] Backups: Automated backups with tested restore. RDS snapshots, S3 versioning, or equivalent. Know your RPO and RTO.
- [ ] Rollback path: Can you revert a bad deploy in under 15 minutes? Blue-green or canary deployments with one-click rollback.
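One way to enforce the multi-person approval rule for `terraform destroy` is a small pre-flight check that a CI job or agent wrapper runs before any Terraform command. The sketch below is illustrative only: `PROTECTED_ENVIRONMENTS`, `REQUIRED_APPROVALS`, `Decision`, and `check_command` are hypothetical names, and how approvals are collected (pull request reviews, a chat bot, a ticket) is left to your own process.

```python
import shlex
from dataclasses import dataclass

# Hypothetical policy values; adjust to your own environments and process.
PROTECTED_ENVIRONMENTS = {"production"}
REQUIRED_APPROVALS = 2


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def is_destructive(command: str) -> bool:
    """Return True for `terraform destroy` and `terraform apply -destroy`."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "terraform":
        return False
    args = tokens[1:]
    # Skip global options such as -chdir=... to find the subcommand.
    subcommand = next((a for a in args if not a.startswith("-")), None)
    if subcommand == "destroy":
        return True
    return subcommand == "apply" and any(a in ("-destroy", "--destroy") for a in args)


def check_command(command: str, environment: str, approvers: set[str]) -> Decision:
    """Block destructive commands in protected environments without enough approvers."""
    if not is_destructive(command):
        return Decision(True, "not a destructive command")
    if environment not in PROTECTED_ENVIRONMENTS:
        return Decision(True, "environment is not protected")
    if len(approvers) < REQUIRED_APPROVALS:
        return Decision(False, f"needs {REQUIRED_APPROVALS} approvers, got {len(approvers)}")
    return Decision(True, "approved by " + ", ".join(sorted(approvers)))


if __name__ == "__main__":
    print(check_command("terraform destroy -auto-approve", "production", {"alice"}))
```

A string check like this is a backstop, not a security boundary. It works best alongside IAM permissions that deny deletion of production resources, so that a command that slips past the wrapper still fails.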
Security Checklist
- [ ] IAM least privilege: No broad `*` permissions. Roles scoped to specific services and actions. AI agents run with minimal permissions.
- [ ] Secrets management: No credentials in code or .env files. AWS Secrets Manager or Parameter Store. Rotation enabled.
- [ ] Encryption: Data at rest (KMS) and in transit (TLS). S3 buckets block public access by default.
- [ ] API protection: Rate limiting, authentication, input validation. LLM endpoints are not open to the internet without controls.
Observability Checklist
- [ ] Structured logging: JSON logs with request IDs. Centralized (CloudWatch, Datadog, or similar).
- [ ] Metrics and dashboards: Latency, error rate, cost per request. Key business metrics visible.
- [ ] Alerting: PagerDuty, Opsgenie, or equivalent. On-call rotation. Alerts fire for anomalies and thresholds.
- [ ] Tracing: Distributed tracing for API calls. Know where time is spent when things slow down.
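For the structured logging item, Python's standard `logging` module is enough to emit one JSON object per line with a request ID attached. This is a sketch under assumptions: `JsonFormatter`, `build_logger`, `handle_request`, and the field names are illustrative choices, not a required schema, and a real service would set the request ID in its web framework's middleware.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger whose only handler writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Hypothetical request handler: set an ID once, log with it everywhere."""
    request_id = str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        return request_id
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)


if __name__ == "__main__":
    handle_request(build_logger())
```

Using a `ContextVar` keeps the ID correct across threads and `asyncio` tasks, so every log line written while handling a request carries the same `request_id` and can be filtered in CloudWatch, Datadog, or a similar tool.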
AI-Specific Checklist
- [ ] Cost controls: Budget alerts. Per-user or per-API rate limits. Expensive models (Claude Opus, GPT-4) behind approval or disabled by default.
- [ ] Prompt injection defenses: Input sanitization, output validation. Don't trust LLM output for privileged actions.
- [ ] Agent guardrails: AI agents cannot execute destructive commands without human approval. Separate dev and prod environments for agent testing.
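The cost controls item can start as a simple in-process guard that every LLM call passes through. The sketch below is illustrative: `BudgetGuard`, the model names, and the prices in `PRICE_PER_1K_TOKENS` are made up, and it leaves out persistence and the daily reset of counters that a real service would need.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens; replace with your provider's real rates.
PRICE_PER_1K_TOKENS = {"small-model": 0.001, "large-model": 0.03}


@dataclass
class BudgetGuard:
    daily_limit_usd: float
    allow_expensive: bool = False  # expensive models stay off unless enabled
    expensive_models: frozenset = frozenset({"large-model"})
    spent: dict = field(default_factory=lambda: defaultdict(float))

    def estimate(self, model: str, tokens: int) -> float:
        """Estimated cost in USD for a call using this many tokens."""
        return PRICE_PER_1K_TOKENS[model] * tokens / 1000

    def authorize(self, user_id: str, model: str, tokens: int) -> bool:
        """Record the spend and return True only if the call fits the policy."""
        if model in self.expensive_models and not self.allow_expensive:
            return False
        cost = self.estimate(model, tokens)
        if self.spent[user_id] + cost > self.daily_limit_usd:
            return False
        self.spent[user_id] += cost
        return True


if __name__ == "__main__":
    guard = BudgetGuard(daily_limit_usd=0.01)
    for attempt in range(3):
        print(attempt, guard.authorize("user-1", "small-model", 4000))
```

Checking the budget before the call, rather than after the invoice arrives, is what turns a budget alert into an actual limit. Provider-side spending caps and billing alerts remain a useful second layer.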
Compliance Readiness
- [ ] Audit logging: CloudTrail enabled. Log retention meets compliance requirements.
- [ ] Data handling: Know where customer data lives. Document retention and deletion policies.
- [ ] Incident response: Runbook exists. Escalation path defined. Post-incident review process in place.
The Pre-Launch Gate
Before going live, run through this checklist and fix every "no." A production readiness audit can surface gaps you've missed and help you prioritize the ones that matter most. The goal isn't perfection; it's reducing the chance that your AI product becomes the next cautionary tale.
Need help with production readiness? Get a free 30-minute audit.