Managing Millions of Data Services @ Heroku
Speaker: Gabe Enslein
AWS S3 failure
- February 28, 2017 – AWS S3 outage – PagerDuty failed to deliver the alert
- Down for about 6 hours
- Heroku recovered before everyone went to bed (10pm Eastern)
- Avoid failure by having failover strategies
- Would have taken 35 years to recover if all tasks had been done manually
- No Heroku customers lost any data
- Layers of abstraction simplify development
- Everything runs on hardware at some level down
- Abstractions can hide real problem
- Can be harder to reproduce problems
- Can model many tasks as state machines – both deterministic and non-deterministic models
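The state-machine point above can be sketched in a few lines. This is a minimal, hypothetical example of a deterministic machine for a server lifecycle; the state names and transitions are illustrative, not Heroku's actual model.

```python
# Deterministic state machine: (current state, event) -> next state.
# Any (state, event) pair not in the table is an illegal transition.
TRANSITIONS = {
    ("provisioning", "booted"): "running",
    ("running", "heartbeat_lost"): "suspect",
    ("suspect", "heartbeat"): "running",
    ("suspect", "timeout"): "failed",
    ("failed", "replaced"): "provisioning",
}

def step(state, event):
    """Advance the machine one step; reject unknown transitions."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state!r} on {event!r}")
```

Because the transition table is plain data, automation can walk it to decide the next remediation step instead of a human reasoning about it ad hoc.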
“Just” implies it is easy. Be skeptical: how easy is it to repeat? How often does “just” come up?
Automate yourself out of a job – both recurring and one-off work.
If you haven’t gotten a heartbeat in a while, you don’t know the system’s health.
Not all states used by all systems
Manual fixes can cause more problems than you started with. Immutable infrastructure enforces the “just”. Script the exceptions; don’t manually tinker. “Break glass” in-case-of-emergency procedures still help. Modeling emergency remedies lets the computer fix problems when it detects them, instead of waking someone up.
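“Script the exceptions” can be sketched as a lookup from known symptoms to automated remedies, with paging as the fallback. The symptom names and hooks here are hypothetical, not real Heroku tooling:

```python
# Known-symptom remediation table; each remedy is a scripted fix.
# The lambdas stand in for real remediation scripts.
KNOWN_REMEDIES = {
    "disk_full": lambda host: f"rotated logs on {host}",
    "stuck_process": lambda host: f"restarted service on {host}",
}

def handle_alert(symptom, host):
    """Try an automated remedy first; only page a human for the unknown."""
    remedy = KNOWN_REMEDIES.get(symptom)
    if remedy:
        return ("auto", remedy(host))
    return ("page", f"no known remedy for {symptom} on {host}")
```

Every time a human handles a “page” case, the fix can be added to the table, shrinking the set of incidents that wake someone up.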
Infrastructure is code, not a second-class citizen. Test it for functionality, performance, and regression.
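Treating infrastructure as code means its configuration gets unit tests like any other code. A hypothetical sketch, with invented config keys and thresholds, showing regression guards over a config dict:

```python
# Illustrative service config; keys and values are made up for the example.
CONFIG = {
    "min_instances": 2,
    "health_check_interval": 10,  # seconds
}

def test_redundancy():
    # Regression guard: never fall below two instances per service.
    assert CONFIG["min_instances"] >= 2

def test_health_check_interval():
    # Functional guard: interval must be positive and not too lazy.
    assert 0 < CONFIG["health_check_interval"] <= 30
```

Run under a test runner in CI, these catch a bad config change before it ships, the same way a failing unit test catches a bad application change.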
Then on March 15, 2017, a Linux denial-of-service and admin-escalation vulnerability was disclosed. Needed to verify that none of the images were affected. Fixing the base image means customers get the patched version when they start up.
- Automate yourself out of regular operations
- Have emergency automation in place – scripts, jobs, etc.
- Make routine failover strategies
- Treat infrastructure as full units
- Abstractions have their limits