This is part of my live blogging from QCon 2015. See my QCon table of contents for other posts.
Nori spent 10 years at Google and a month at healthcare.gov during the tech surge.
Not a problem when one machine goes down. Problem when lots of machines go down. Architecting for failure assumes one machine will go down. Ships have locking doors so one compartment floods but ship doesn’t sink.
Story 1: A city vanishes
Until recently, Atlanta was the biggest metro for Google. They updated a script to monitor new routers and lost connectivity for Atlanta. Got 200 pages in a short time. [Do you turn off the pager at that point?] It wasn’t just one cluster or one data center. Automation kicked in and routed traffic to working areas.
Root cause: the new vendor indexes CPUs starting at 1 rather than 0, and the new vendor’s router crashes if you ask for a CPU that doesn’t exist.
Good news: as soon as the SNMP querying stopped, Atlanta came back up.
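The failure mode is a classic off-by-one. A minimal sketch of the mismatch as I understand it from the talk (function names and the crash simulation are mine, not Google’s actual tooling):

```python
# Hypothetical sketch: the monitoring script assumed CPUs are indexed
# from 0, but the new vendor's routers index from 1 and crash when
# asked about a CPU that doesn't exist.

def monitored_cpu_ids(num_cpus):
    """What the monitoring script asks for (0-based assumption)."""
    return list(range(num_cpus))           # [0, 1, ..., n-1]

def router_cpu_ids(num_cpus):
    """What the new vendor's router actually has (1-based)."""
    return list(range(1, num_cpus + 1))    # [1, 2, ..., n]

def query_router(valid_ids, requested_ids):
    """Simulate the firmware defect: an unknown CPU id crashes the router."""
    for cpu in requested_ids:
        if cpu not in valid_ids:
            raise RuntimeError(f"router crashed: no such CPU {cpu}")
    return "healthy"

# A 4-CPU router has CPUs 1..4, but the script asks for 0..3, so the
# very first query (CPU 0) takes the router down.
```

Either bug alone would have been harmless; it took the 0-based assumption *and* the crash-on-unknown-id together to take out a metro.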
- Model ahead of time. Plan for spikes and unexpected events so you have spare capacity. Do load testing to know your limits, and do capacity planning.
- Regression analysis: how software and hardware change over time.
- Capacity planning is harder for the network than for web servers.
- Visualize in real time. Must be able to make decisions on the fly: what effect would X have? Humans can then decide if it *should*.
Story #2: The edge snaps
Satellite clusters sit at the edge of the network, much smaller than data centers. The goal is to have a rack of servers nearer to users for faster response. They decommission frequently, so they have a tool to erase disks. The decommission tool was not idempotent: when passed an empty set, it interpreted that as decommissioning all machines. They expect these machines to fail, so they detect failure and fall back to core clusters – just not all of them failing at once. It took days to reinstall everything. 500 pages that afternoon.
Root cause: Defect – empty set != all
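A sketch of the defect and one defensive fix (names and interface are hypothetical; the talk only described the behavior):

```python
# Hypothetical sketch of the decommission defect: the tool treated an
# empty machine list as "all machines". A safer interface refuses an
# empty selection and makes "everything" an explicit, separate request.

def machines_to_erase_buggy(inventory, targets):
    # Defect: an empty target set falls through to the whole inventory.
    if not targets:
        return list(inventory)
    return [m for m in inventory if m in targets]

def machines_to_erase_fixed(inventory, targets, erase_all=False):
    if erase_all:                       # "all" must be asked for explicitly
        return list(inventory)
    if not targets:
        raise ValueError("empty target set; refusing to treat it as 'all'")
    return [m for m in inventory if m in targets]
```

The design point: for destructive operations, ambiguous input should fail loudly rather than default to the most destructive interpretation.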
- Mitigate first.
- Fall back to an established protocol – “How to panic effectively”. ICS (incident command system) – firefighters have this too. Allows cross-agency collaboration. Centralizes response.
- When not in firefighting mode, it’s easy to think you don’t need it.
- Created a lighter-weight version of ICS. Changed terminology to make it sound less hierarchical.
- No penalty for invoking the protocol. Err on the side of caution, because you don’t know at first if it’s an early warning sign.
- Train people before they do prod support. Give them scenarios.
Story #3 – The call is coming from inside the house
At Google, “hope is not a strategy”. Unfortunately it was at healthcare.gov, and they had lots of anti-patterns. The tech surge started in October; they added monitoring at that point. Nori joined in December.
The login system went down. Due to spaghetti code, this caused a full site outage. Monitoring said it was a database problem. The Oracle DBA found a query running as a superuser that was clobbering all other traffic. It was coming from a person: a contractor’s manager wanted a report, and the contractor logged in as superuser since it was the only id he had. He heard about the problem since he was in the war room, but the toxic culture made him afraid to bring it up for fear of losing his job.
Root cause: Toxic culture
- Tech surge team did fix technical things, but it was mostly a cultural problem. Tech surge team brought a sense of urgency
- In five nines (about five minutes of downtime a year), the fifth 9 is people. healthcare.gov needed to get to the first 9, not the fifth.
- Need a culture of response. Must care that it is down: monitoring, reacting when it’s down, connecting engineers’ work to the bottom line.
- Need a culture of responsibility. Must do post-mortems or root cause analysis. Need to know how you will prevent it next time. Must be blameless.
- If people are afraid to speak up, you extend outages.
- Operational experience informs system design. Want to avoid getting paged about the same thing repeatedly.
- Practice disasters – DiRT (Disaster Recovery Testing). Know what week it will happen, but not what or when. Ex: California has been eaten by zombies. Not user-facing, but flushes out weaknesses in the system. If a test actually causes a user-facing outage, mark it as failed and bring the system back up immediately.
- Wheels of Misfortune – role playing within the team. A team member prepares a scenario, something that happened recently or might happen in the future, then they talk through it. Can ask any question that a computer could answer.
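A quick sanity check on the “nines” arithmetic mentioned above (my own calculation, not from the talk): each added nine cuts the allowed downtime by a factor of ten, and five nines works out to roughly five minutes a year.

```python
# Allowed downtime per year at N nines of availability:
# availability = 1 - 10**-N, so downtime fraction = 10**-N.
minutes_per_year = 365.25 * 24 * 60    # ~525,960 minutes

def downtime_minutes(nines):
    """Allowed downtime per year at the given number of nines."""
    return minutes_per_year * 10 ** -nines

for n in range(1, 6):
    print(f"{n} nines: {downtime_minutes(n):10.1f} min/yr")
# One nine (90%) allows over a month of downtime a year; five nines
# (99.999%) allows about 5.3 minutes.
```

Which makes the point in the bullet concrete: healthcare.gov’s problem was the first nine, a different order of magnitude entirely from where Google operates.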
Dickerson’s Hierarchy of Reliability (like Maslow’s hierarchy of needs): monitoring, incident response, post-mortems, testing/release, capacity planning, development, UX.
Even if you don’t have fewer outages, you want them to be unnoticed.
- How do you test disasters without creating a customer-facing outage? Could use a shadow system, or break it down and simulate components of the outage on different days.
- How do you communicate in a crisis? Internal tool to create an incident. Create an IRC channel dedicated to the emergency. Create a doc with status/brief summary. All are discretionary, depending on the outage. For example, don’t use IRC when the problem is that the corporate network is down.
- How do you know what’s important in hundreds of pages? They were laptop probes, not humans. You can see from the pager that the alert stream has patterns – useful info, but you don’t need to read them all. Recognize the cost of each page.