2:14am: the alert fires and the clock starts.
ECS task error rate jumps to 47%. Permission denied errors cascading through logs. You have 3 hours to find the cause or customers wake up to downtime.
The Situation
PagerDuty wakes you at 2:14am. Your ECS cluster is failing. Error rate 47%. You pull up CloudWatch. Hundreds of "permission denied" errors in the task logs. But which permission? Which service? Which IAM change broke this?
You dig into CloudTrail. Hundreds of API calls. You check GitHub. Three commits in the last 4 hours. One of them touches an IAM role. You grep through logs by hand, trying to line up the timeline. By 5:30am you have pieced it together: someone accidentally removed a critical IAM permission at 2:09am. But you're exhausted, the facts are fuzzy, and the postmortem will be guesswork.
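For a sense of what the manual CloudTrail step involves, here is a minimal sketch. It assumes a boto3 CloudTrail client is passed in and that filtering on the `iam.amazonaws.com` event source is enough to surface the role change. The function name `recent_iam_changes` and the 4-hour lookback are illustrative choices, not part of any tool described here.

```python
from datetime import datetime, timedelta, timezone


def recent_iam_changes(cloudtrail, alert_time, lookback_hours=4):
    """List IAM management events in the window before an alert.

    `cloudtrail` is assumed to be a boto3 CloudTrail client, e.g.
    boto3.client("cloudtrail"). The name and defaults are hypothetical.
    """
    kwargs = {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": "iam.amazonaws.com"}
        ],
        "StartTime": alert_time - timedelta(hours=lookback_hours),
        "EndTime": alert_time,
    }
    changes = []
    while True:
        page = cloudtrail.lookup_events(**kwargs)
        for event in page.get("Events", []):
            changes.append(
                (event["EventTime"], event["EventName"], event.get("Username"))
            )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results
    # Oldest first, so the result reads as a timeline.
    return sorted(changes, key=lambda c: c[0])


# Usage (needs AWS credentials; the date is a placeholder):
#   import boto3
#   alert = datetime(2024, 1, 1, 2, 14, tzinfo=timezone.utc)
#   for when, name, user in recent_iam_changes(boto3.client("cloudtrail"), alert):
#       print(when, name, user)
```

Even with this in hand, you still have to match each event against the commits and the error logs yourself, which is where the hours go.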
3 hours to find the root cause. Every minute costs money and customer trust.
The Company
Series B fintech. Every minute of downtime is expensive. Multi-region deployment. High-velocity team. Multiple services sharing IAM roles.
You don't have time for 3 hours of manual investigation. You need the answer now.
With Escher
You ask: "I had an outage at 2:14am. ECS tasks failing with permission denied. What changed?"
- 0-1 min: Escher ingests CloudTrail logs, ECS metrics, CloudWatch errors, IAM role history, and Git commit history.
- 1-2 min: Correlates the 2:14am error spike to the 2:09am IAM permission removal. Identifies which role was changed and which permission was deleted.
- 2-3 min: Reports the root cause: commit abc123 removed the iam:PassRole permission, so the task execution role can no longer be passed to ECS tasks. Surfaces the exact change responsible.
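The correlation idea in the steps above can be illustrated with a small sketch. This is not Escher's implementation; it only shows the general technique of ranking recent changes by how closely they precede the error spike. The `Change` type, the 30-minute window, the placeholder date, and the two git entries in the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    when: datetime
    source: str   # e.g. "cloudtrail" or "git"
    summary: str


def suspects(changes, spike_start, window=timedelta(minutes=30)):
    """Return changes that landed within `window` before the spike, nearest first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= spike_start - c.when <= window
    ]
    return sorted(candidates, key=lambda c: spike_start - c.when)


day = datetime(2024, 1, 1)  # placeholder date
spike = day.replace(hour=2, minute=14)
changes = [
    Change(day.replace(hour=2, minute=9), "cloudtrail", "IAM permission removed from role"),
    Change(day.replace(hour=1, minute=5), "git", "unrelated commit"),
    Change(day.replace(hour=1, minute=50), "git", "config commit"),
]

ranked = suspects(changes, spike)
for c in ranked:
    print(c.when.time(), c.source, c.summary)
```

Time proximity alone only narrows the list. Confirming the cause still means checking that the removed permission matches the "permission denied" errors in the task logs.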
The Outcome
Before: 3+ hours of searching
With Escher: 3 minutes to the exact cause
Real impact: MTTR drops 60%. Postmortem has facts instead of guesses. Revert plan ready in seconds. Customer impact minimized. Your team gets back to bed.