Managing Millions of Data Services at Heroku – Live Blogging at QCon

Managing Millions of Data Services @ Heroku
Speaker: Gabe Enslein

AWS S3 failure

  • February 28, 2017 – AWS S3 outage – PagerDuty failed to deliver messages
  • Down for about 6 hours
  • Heroku recovered before everyone went to bed (10pm Eastern)
  • Avoid failure by having failover strategies
  • Recovery would have taken 35 years if every task had been done manually
  • No Heroku customers lost any data


  • Layers of abstraction simplify development
  • Everything runs on hardware at some level further down
  • Abstractions can hide the real problem
  • Problems can be harder to reproduce
  • Many tasks can be modeled as state machines – both deterministic and non-deterministic models

“Just” implies it is easy. Be skeptical. How easy is it to repeat? How often is “just”

Automate yourself out of a job – both recurring and one-off work

If you haven’t received a heartbeat in a while, you don’t know the service’s health.
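
A minimal sketch of the heartbeat idea, assuming two timeout thresholds. The threshold values and the function name are hypothetical and not from the talk. The point it shows is that silence maps to "uncertain" first rather than straight to "unavailable".

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the talk did not give concrete numbers.
UNCERTAIN_AFTER = timedelta(seconds=60)
UNAVAILABLE_AFTER = timedelta(minutes=5)


def health_from_heartbeat(last_heartbeat, now):
    """Classify a service by how long ago its last heartbeat arrived."""
    silence = now - last_heartbeat
    if silence < UNCERTAIN_AFTER:
        return "available"
    if silence < UNAVAILABLE_AFTER:
        # No recent signal: we cannot claim the service is up or down.
        return "uncertain"
    return "unavailable"
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test.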

Not all states are used by all systems

  • installing
  • available
  • uncertain
  • unavailable
  • retiring
  • retired
  • archived
  • terminated
  • restart
  • upgrading
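
The states above could be wired into a small state machine along these lines. The state names come from the list, but the transition table is an assumption made for illustration; the talk did not spell out which moves are legal.

```python
# Hypothetical transition table: the state names come from the list above,
# but which transitions are legal is an assumption made for illustration.
TRANSITIONS = {
    "installing": {"available", "unavailable"},
    "available": {"uncertain", "restart", "upgrading", "retiring"},
    "uncertain": {"available", "unavailable"},
    "unavailable": {"restart", "retiring"},
    "restart": {"available", "unavailable"},
    "upgrading": {"available", "unavailable"},
    "retiring": {"retired"},
    "retired": {"archived", "terminated"},
    "archived": {"terminated"},
    "terminated": set(),
}


class ServiceStateMachine:
    """Tracks one service and rejects transitions the table does not allow."""

    def __init__(self, state="installing"):
        if state not in TRANSITIONS:
            raise ValueError(f"unknown state: {state}")
        self.state = state
        self.history = [state]

    def can_move_to(self, new_state):
        return new_state in TRANSITIONS[self.state]

    def move_to(self, new_state):
        if not self.can_move_to(new_state):
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)
        return self.state
```

Rejecting illegal transitions up front is one way to keep manual tinkering from putting a service into a state nobody modeled.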

Check on

  • Backups
  • Replication
  • Security
  • Performance

Manual fixes can cause more problems than you started with. Immutable infrastructure enforces the “just”. Script the exceptions; don’t manually tinker. “Break glass in case of emergency” procedures still help. Modeling emergency remedies helps too, so the computer can apply the fix when it detects the problem instead of waking someone up.
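
One way to picture modeled emergency remedies is a lookup from a detected condition to a scripted fix, with paging a human as the fallback. This is only a sketch: the condition names, the remedy functions, and `handle_incident` are all hypothetical.

```python
# Hypothetical condition names and remedies, for illustration only.
def promote_standby(service):
    return f"promoted standby for {service}"


def restart_service(service):
    return f"restarted {service}"


REMEDIES = {
    "primary_unreachable": promote_standby,
    "process_hung": restart_service,
}


def handle_incident(condition, service, page_human):
    """Run the scripted remedy if one exists; otherwise page a person."""
    remedy = REMEDIES.get(condition)
    if remedy is None:
        # No scripted fix exists yet: this is the "break glass" path.
        page_human(f"{condition} on {service}: no automated remedy")
        return None
    return remedy(service)
```

Every page that reaches a human is then a candidate for a new entry in the table.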

Infrastructure is code, not a second-class citizen. Test it for functionality, performance and regressions.

Then, on March 15, 2017, there was a Linux denial-of-service and admin-escalation vulnerability. Heroku needed to confirm that none of the images were affected. The image can be fixed so that customers get the fix when they start up.
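
Checking an image inventory against a fixed version might look roughly like this. The inventory shape, image names and version numbers are invented for the example and do not describe Heroku's images or the actual vulnerability.

```python
# Hypothetical inventory and version numbers; none of these come from the talk.
def parse_version(text):
    """Turn '4.9.13' into (4, 9, 13) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def affected_images(images, fixed_version):
    """Return the names of images whose kernel is older than the fixed version."""
    fixed = parse_version(fixed_version)
    return sorted(
        name for name, kernel in images.items() if parse_version(kernel) < fixed
    )
```

Comparing parsed tuples rather than raw strings matters here, because as strings "4.10.2" sorts before "4.9.13".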

Key Takeaways

  • Automate yourself out of regular operations
  • Have emergency automation in place – scripts, jobs, etc.
  • Make routine failover strategies
  • Treat infrastructure as full units
  • Abstractions have their limits


