Managing Millions of Data Services at Heroku – Live Blogging at QCon

Managing Millions of Data Services @ Heroku
Speaker: Gabe Enslein


AWS S3 failure

  • February 28, 2017 – AWS S3 outage – PagerDuty failed to deliver the message
  • Down for about 6 hours
  • Heroku recovered before everyone went to bed (10pm Eastern)
  • Avoid failure by having failover strategies
  • It would have taken 35 years to recover if all tasks had to be done manually
  • No Heroku customers lost any data


  • Layers of abstraction simplify development
  • Everything runs on hardware at some level down
  • Abstractions can hide the real problem
  • Problems can be harder to reproduce
  • Many tasks can be modeled as state machines – both deterministic and non-deterministic models

“Just” implies that something is easy. Be skeptical. How easy is it to repeat? How often is “just”

Automate yourself out of a job – both recurring and one-off work

If you haven’t received a heartbeat in a while, you don’t know the service’s health.

Not all states are used by all systems:

  • installing
  • available
  • uncertain
  • unavailable
  • retiring
  • retired
  • archived
  • terminated
  • restart
  • upgrading
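
The heartbeat rule and a few of the states above can be sketched as a small state machine. This is only an illustration, not Heroku's implementation: the `Service` class, the `ALLOWED` transition table and the 60-second `HEARTBEAT_TIMEOUT` are all assumptions made for the example, and only a subset of the states is modeled.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INSTALLING = "installing"
    AVAILABLE = "available"
    UNCERTAIN = "uncertain"
    UNAVAILABLE = "unavailable"
    RETIRING = "retiring"
    RETIRED = "retired"


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    State.INSTALLING: {State.AVAILABLE, State.UNAVAILABLE},
    State.AVAILABLE: {State.UNCERTAIN, State.UNAVAILABLE, State.RETIRING},
    State.UNCERTAIN: {State.AVAILABLE, State.UNAVAILABLE},
    State.UNAVAILABLE: {State.AVAILABLE, State.RETIRING},
    State.RETIRING: {State.RETIRED},
    State.RETIRED: set(),
}

# Assumed value: seconds of silence before health counts as unknown.
HEARTBEAT_TIMEOUT = 60.0


@dataclass
class Service:
    name: str
    state: State = State.INSTALLING
    last_heartbeat: float = 0.0

    def transition(self, new_state: State) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat; an uncertain service becomes available again."""
        self.last_heartbeat = now
        if self.state is State.UNCERTAIN:
            self.transition(State.AVAILABLE)

    def check(self, now: float) -> State:
        """Mark an available service uncertain once heartbeats stop arriving."""
        if (self.state is State.AVAILABLE
                and now - self.last_heartbeat > HEARTBEAT_TIMEOUT):
            self.transition(State.UNCERTAIN)
        return self.state
```

Keeping the transitions in a table means an unexpected jump (for example, installing straight to retired) fails loudly instead of being applied silently.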

Check on

  • Backups
  • Replication
  • Security
  • Performance

Manual fixes can cause more problems than you started with. Immutable infrastructure enforces the “just”. Script the exceptions; don’t tinker manually. “Break glass” emergency procedures still help. Modeling emergency remedies also helps, so the computer can apply the fix when it detects a problem instead of waking someone up.
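
One way to picture "modeling emergency remedies" is a lookup from a detected condition to a scripted fix, with a human paged only when no script exists. The sketch below is purely hypothetical: the condition names, the remedy functions and the `handle` helper are invented for illustration and were not shown in the talk.

```python
from typing import Callable, Dict, List

# A remedy takes a service name and returns a description of what it did.
Remedy = Callable[[str], str]


def fail_over_to_replica(service: str) -> str:
    # Placeholder for a real, scripted failover procedure.
    return f"{service}: promoted replica"


def restart_service(service: str) -> str:
    # Placeholder for a real, scripted restart procedure.
    return f"{service}: restarted"


# Hypothetical mapping from detected condition to automated remedy.
REMEDIES: Dict[str, Remedy] = {
    "primary_unreachable": fail_over_to_replica,
    "process_hung": restart_service,
}


def handle(condition: str, service: str, pages: List[str]) -> str:
    """Apply the scripted remedy if one exists; otherwise page a human."""
    remedy = REMEDIES.get(condition)
    if remedy is None:
        pages.append(f"{service}: unhandled condition {condition!r}")
        return "escalated"
    return remedy(service)
```

Each time a page does go out for an unhandled condition, adding a new entry to the table is the "script the exceptions" step.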

Infrastructure is code, not a second class citizen. Test it for functionality, performance and regression.

Then, on March 15, 2017, there was a Linux denial-of-service and admin escalation vulnerability. Heroku needed to confirm that none of the images were affected. The image can be fixed so that customers get the fix when they start up.

Key Takeaways

  • Automate yourself out of regular operations
  • Have emergency automation in place – scripts, jobs, etc
  • Make routine failover strategies
  • Treat infrastructure as full units
  • Abstractions have their limits


WTF from the Past

The missing image

I was remembering a strange problem from a few years ago and thought I would share the issue.

We had a simple web page around the year 2002 which contained either a plus or a minus image, not unlike these: plus minus

The problem we had was that regardless of the code we used, we could only ever see the minus image. Checking the page source and viewing the image on its own told us that we should be seeing the plus image where appropriate, and yet on the page itself we only ever saw the minus image.

How does your browser scale images?

It turns out that the actual images were 11 pixels wide and 11 pixels high, but the image tag was telling the page to display them as 10 pixels wide. I’ll show you what that looks like in today’s browsers, but your mileage may vary: plus minus

Hopefully you saw something like this: plus minus

If you’re unlucky you’ll see something like this: plus minus

Depending on the scaling algorithm used by your browser, scaling from width 11 to width 10 may just remove the middle line, making a plus look like a minus. Problem solved. I didn’t pick it and I’m sad to say that it took too long to pick up the issue.
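
The effect can be reproduced with a small sketch. It assumes a plain nearest-neighbour scaler that samples each destination pixel at its centre; the browser involved may well have used a different algorithm, and `make_plus`, `scale_width_nearest` and `render` are names made up for this example.

```python
def make_plus(size=11):
    """Build a size x size bitmap of a plus sign (1 = ink, 0 = background)."""
    mid = size // 2
    return [[1 if (x == mid or y == mid) else 0 for x in range(size)]
            for y in range(size)]


def scale_width_nearest(pixels, new_width):
    """Resample each row to new_width columns using nearest-neighbour sampling.

    Each destination column x samples the source at the centre of the
    destination pixel: floor((x + 0.5) * old_width / new_width),
    computed here with integer arithmetic.
    """
    old_width = len(pixels[0])
    cols = [((2 * x + 1) * old_width) // (2 * new_width)
            for x in range(new_width)]
    return [[row[c] for c in cols] for row in pixels]


def render(pixels):
    """Render the bitmap as text, one line per row."""
    return "\n".join("".join("#" if p else "." for p in row) for row in pixels)


if __name__ == "__main__":
    plus = make_plus(11)
    print(render(plus))
    print()
    print(render(scale_width_nearest(plus, 10)))
```

With these assumptions, the ten destination columns sample source columns 0 to 4 and 6 to 10. Column 5, the only one holding the vertical stroke, is skipped, so the 10-pixel-wide result contains just the horizontal bar.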