too big to fail – nori heikkinen – qcon

This is part of my live blogging from QCon 2015. See my QCon table of contents for other posts.

Nori has been at Google for 10 years and was at healthcare.gov for a month during the tech surge.

It's not a problem when one machine goes down; it's a problem when lots of machines go down. Architecting for failure assumes one machine will go down. Ships have doors that seal off compartments so one compartment can flood without the ship sinking.

Story 1: A city vanishes

Until recently, Atlanta was the biggest metro for Google. A script was updated to monitor new routers, and connectivity to Atlanta was lost. They got 200 pages in a short time. [do you turn off the pager at that point?] It wasn't just one cluster or one data center. Automation kicked in and routed traffic to working areas.

Root cause: the new vendor indexes CPUs starting at 1 rather than 0, and the new vendor's router crashes if asked for a CPU that doesn't exist.
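
A hypothetical sketch of that kind of indexing mismatch (the function and names are mine, not from the talk):

```scala
// A poller that assumes CPUs are numbered starting at `firstIndex`.
// If the monitoring script assumes 0-based indexing but the new vendor's
// router is 1-based (and crashes on an unknown index), the very first
// query asks for a CPU that doesn't exist.
def cpuIndicesToPoll(cpuCount: Int, firstIndex: Int): Seq[Int] =
  firstIndex until (firstIndex + cpuCount)

cpuIndicesToPoll(4, firstIndex = 0) // Seq(0, 1, 2, 3) -- what the old routers expect
cpuIndicesToPoll(4, firstIndex = 1) // Seq(1, 2, 3, 4) -- what the new routers expect
```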

Good news: as soon as the SNMP querying stopped, Atlanta came back up.

Lessons:

  • Model ahead of time. Plan for spikes and unexpected events so you have spare capacity. Do load testing to know your limits, and do capacity planning.
  • Regression analysis – how software and hardware change over time.
  • Capacity planning is harder for network capacity than for web server capacity.
  • Visualize in real time. Must be able to make decisions on the fly. What effect would X have? Humans can then decide if they *should*.

Story #2: The edge snaps

Satellite clusters sit at the edge of the network and are much smaller than data centers. The goal is to have a rack of servers nearer to users for faster responses. They are decommissioned frequently, so there is a tool to erase their disks. The decommissioning tool was not idempotent: when passed an empty set, it interpreted that as decommissioning all machines. These machines are expected to fail, so the system detects failures and falls back to core clusters – it just wasn't expecting them all to fail at once. It took days to reinstall everything. 500 pages that afternoon.

Root cause: Defect – empty set != all
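
A hypothetical reconstruction of the defect in Scala (the names are mine; the real tool was internal):

```scala
// Buggy behavior: an empty request silently expands to the whole fleet.
def machinesToWipeBuggy(requested: Set[String], fleet: Set[String]): Set[String] =
  if (requested.isEmpty) fleet              // empty set treated as "all" -- the defect
  else requested intersect fleet

// Safer behavior: an empty request wipes nothing; "all" must be asked for explicitly.
def machinesToWipe(requested: Set[String], fleet: Set[String]): Set[String] =
  requested intersect fleet
```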

Lessons:

  • Mitigate first.
  • Fall back to an established protocol – "How to panic effectively". ICS (incident command system) – firefighters have this too. It allows cross-agency collaboration. Centralize the response.
  • When not in firefighting mode, it's easy to think you don't need it.
  • Created a lighter-weight version of ICS. Changed the terminology to make it sound less hierarchical.
  • No penalty for invoking the protocol. Err on the side of caution because you don't know at first if something is an early warning sign.
  • Train people before they go on prod support. Give them scenarios.

Story #3 – The call is coming from inside the house

At Google, "hope is not a strategy." Unfortunately, it was at healthcare.gov, and they had lots of anti-patterns. The tech surge started in October; they added monitoring at that point. Nori joined in December.

The login system went down. Due to spaghetti code, this caused a full site outage. Monitoring said it was a database problem. The Oracle DBA found a query running as a superuser that was clobbering all other traffic. It was coming from a person: the contractor's manager wanted a report, and the contractor logged in as the superuser since it was the only id he had. He heard about the problem since he was in the war room, but the toxic culture made him afraid to bring it up for fear of losing his job.

Root cause: Toxic culture

Lessons:

  • The tech surge team did fix technical things, but it was mostly a cultural problem. The tech surge team brought a sense of urgency.
  • In five nines (about five minutes of downtime a year – quick math after this list), the fifth 9 is people. Healthcare.gov needed to get to the first 9, not the fifth.
  • Need a culture of response. Must care that the site is down. Monitoring. Reacting when it's down. Connect the engineers' work to the bottom line.
  • Need a culture of responsibility. Must do a post-mortem or root cause analysis. Need to know how to prevent it the next time. Must be blameless.
  • If people are afraid to speak up, you extend outages.
  • Operational experience informs system design. Want to avoid getting paged about the same thing repeatedly.
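
Quick downtime math for the "nines" (my own back-of-the-envelope, not a figure from the talk):

```scala
// Yearly downtime budget when availability is 1 - 10^-n ("n nines").
def downtimeMinutesPerYear(nines: Int): Double =
  365.25 * 24 * 60 * math.pow(10, -nines)

downtimeMinutesPerYear(3) // ~526 minutes, roughly 8.8 hours a year
downtimeMinutesPerYear(5) // ~5.3 minutes a year
```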

Prevention

  • Practice disasters – DiRT (Disaster Recovery Testing). Know what week it will happen, but not what or exactly when. ex: California has been eaten by zombies. Not user facing, but it flushes out weaknesses in the system. If it actually causes a user-facing outage, mark the test as failed and bring things back up immediately.
  • Wheels of Misfortune – role playing within the team. A team member prepares a scenario, something that happened recently or might happen in the future. Then talk it through. You can ask any question that a computer could answer.

Dickerson's Hierarchy of Reliability (like Maslow's hierarchy of needs) – monitoring, incident response, post-mortems, testing/release, capacity planning, development, UX.

Even if you don't have fewer outages, you want them to go unnoticed.

Q&A

  • How do you test disasters without creating a customer-facing outage? Could use a shadow system, or break it down to simulate components of the outage on different days.
  • How do you communicate in a crisis? An internal tool creates an incident. Create an IRC channel dedicated to the emergency. Create a doc with status/brief summary. All are discretionary, depending on the outage. For example, don't use IRC when the problem is that the corporate network is down.
  • How do you know what's important in hundreds of pages? They were laptop probes, not humans. You can see from the pages that the alert stream has patterns – useful info, but you don't need to read them all. Recognize the cost of each page.

akka streams – viktor klang – qcon

This is part of my live blogging from QCon 2015. See my QCon table of contents for other posts.

Streams are:

  • Ephemeral – time dependent
  • Possibly unbounded in length
  • Focus on the transformation and transfer of data

Programming involves moving things from A to B and changing the bytes along the way.

Akka

  • message oriented programming model for reactive apps
  • can be used from Java or Scala.

Actors

  • The unit of processing is called an Actor
  • A component with an address, mailbox, current behavior and storage
  • No CPU cost if an actor doesn't have something to do. They only use a few bytes each, so they scale well
  • Each actor has a parent, which handles its failures (minimal sketch below)
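
A minimal sketch of such an actor in Scala, assuming the classic (untyped) Akka API of that era; the class and messages are made up for illustration:

```scala
import akka.actor.{Actor, ActorSystem, Props}

// An actor = an address (its ActorRef), a mailbox, current behavior (receive)
// and private storage; its parent supervises it when it fails.
class Counter extends Actor {
  private var count = 0                     // state only touched while handling a message
  def receive: Receive = {
    case "inc" => count += 1
    case "get" => sender() ! count          // reply to whoever asked
  }
}

object ActorDemo extends App {
  val system  = ActorSystem("demo")
  val counter = system.actorOf(Props(new Counter), "counter")
  counter ! "inc"                           // idle actors cost no CPU between messages
}
```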

Akka Streams

  • Higher order abstraction for concurrency
  • Solving the 100% case is impossible and the 90% case is hard. Akka focuses on the 80% case – solving most problems
  • Immutable, reusable, composable, coordinated, async transformations.
  • Flows – one input and one output. Not connected to source or destination. Like a reusable pipe. Goal is to describe the transformation.

Types of transformations

  • Linear, time-agnostic – ex: common ones like map
  • Linear, time-sensitive – ex: dealing with infinite streams
  • Linear, rate-detached – ex: how to deal with buffering
  • Linear, async – ex: calling a service
  • Non-linear – ex: bidi flow (two inputs going to two outputs), custom stages (see the Flow sketches after this list)
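
Rough Scala sketches of the linear categories, using operators that exist in the Akka Streams DSL (the mapping of operator to category is my own reading of the talk):

```scala
import scala.concurrent.Future
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.Flow

object TransformationKinds {
  val timeAgnostic  = Flow[Int].map(_ * 2)                                  // linear, time-agnostic
  val timeSensitive = Flow[Int].takeWithin(5.seconds)                       // linear, time-sensitive
  val rateDetached  = Flow[Int].buffer(100, OverflowStrategy.backpressure)  // linear, rate-detached
  val asyncStage    = Flow[Int].mapAsync(parallelism = 4)(x => Future(x))   // linear, async (e.g. a service call)
}
```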

Sources – publisher, iterator, future, etc

Sinks – subscriber, foreach/fold/onComplete, ignore, head, etc

Fan in/fan out – merge, zip, etc

Output/Input – Http, tcp, InputStreamSource/OutputStreamSink blocking wrappers, etc

Materialization – taking the description of the transformation and making it run. Streams are a blueprint; the materializer makes it run. Cycles are allowed – look out for troublesome ones.
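
A minimal end-to-end sketch of blueprint vs. materialization in the Scala DSL (module and class names assumed from later akka-stream releases; the 2015-era release candidates named the materializer slightly differently):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamDemo extends App {
  implicit val system: ActorSystem = ActorSystem("streams")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Blueprint only -- nothing runs yet.
  val numbers = Source(1 to 10)
  val double  = Flow[Int].map(_ * 2)
  val printIt = Sink.foreach[Int](println)

  // Materialization turns the blueprint into running stream actors.
  numbers.via(double).runWith(printIt)
}
```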

If you can't solve a problem without programming, you won't be able to solve it with programming. Why would we be able to tell a computer how to do something if we don't understand it ourselves?

Push vs pull

  • Want to get data across an async boundary with non-blocking back-pressure.
  • When pushing data, you have to drop it if it arrives too fast.
  • “A slow solution is no solution”
  • Pull can be slow
  • Better: a two-way channel – only push when you know the receiver is ready. Know what is needed based on requests. Don't demand 100 ice creams unless you know what to do with them. Switch between push/pull, and batch requests when you know about related ones. Called "dynamic push-pull"

Also see the Reactive Streams Initiative.
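
That back-pressure protocol is what the Reactive Streams interfaces capture: a Subscriber signals demand with request(n), and the Publisher pushes at most that many elements. A bare-bones consumer-side sketch (the interfaces are from org.reactivestreams; the one-at-a-time policy is just my illustration):

```scala
import org.reactivestreams.{Subscriber, Subscription}

// A subscriber that requests one element at a time: the producer may only
// push when demand has been signalled, so it can never overwhelm us.
class OneAtATime[T] extends Subscriber[T] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = { subscription = s; s.request(1) }
  override def onNext(element: T): Unit           = { println(element); subscription.request(1) }
  override def onError(t: Throwable): Unit        = t.printStackTrace()
  override def onComplete(): Unit                 = println("done")
}
```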

Impressions: good talk; I learned a lot, both about Akka and conceptually. Glad I'm comfortable with the basics of functional programming so it wasn't overwhelming.

how did we end up here – todd montgomery and trisha gee – qcon

This is part of my live blogging from QCon 2015. See my QCon table of contents for other posts.

When new, we assume everyone knows what they are doing and everything is logical/clean/organized. Yet reality is messy. Sometimes you get to the point where you can't touch anything. Sometimes you get to the point where you're proud of over-engineered/terrible solutions.

Software process stats

  • agile and iterative have similar project success rate
  • Ad-hoc and waterfall are similar
  • Agile/iterative is only a little better
  • Projects with fewer than 10 people do significantly better than those with 20 or more
  • Different reports measure successful projects differently: 30%–60%.
  • More expensive projects are not better – "throwing more $ at projects doesn't make them better" [wouldn't this be because of the size of the project, not throwing $ at it?]

Enterprise architect == someone so good at the job that they no longer do development. Systems should be designed by practicing "surgeons", not by people who used to operate.

More organizations are looking at contributions to open source to prove can do development before start. No place to hide; mediocrity becomes visible. Open source is not a business plan, but can be a distribution model.

Agile

  • Minimal viable product – people don't want to buy an MVP. Why aim for it? [we don't aim for it; it's about doing that first so you're guaranteed to have a base]
  • Product owner – why can't we talk to the real customer? Why do we need a proxy? We want direct feedback. Actually pairing with a customer to see how they use the app brings up usability/process issues. Technology should be part of the business.
  • Poll: how many people doing agile release less than every 3 months? A few
  • Water-Scrum-fall – a tiny waterfall. Do scrum practices but ignore the retrospective; it's still waterfall. Should focus on learning, feedback cycles and outcomes
  • Projects often succeed because 1-2 people make it happen in spite of process. Find out by thinking “what would happen if you weren’t involved”

"Shared mutable state" should be scary. It should be reserved for systems programming, where the domain is smaller and you understand the hardware. Otherwise, it shouldn't be taken for granted. Embrace append-only, single-writer and shared-nothing designs.

Amdahl's law says you can only scale up to a certain # of CPUs before maxing out. The Universal Scalability Law says it gets worse well before that, because of the "coherence penalty".
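
For reference, sketches of the two formulas (the parameter names are the conventional ones; none of this code is from the talk):

```scala
// Amdahl's law: with parallel fraction p, n processors give at most this speedup.
def amdahlSpeedup(n: Int, p: Double): Double =
  1.0 / ((1 - p) + p / n)

// Universal Scalability Law: sigma models contention, kappa the coherence penalty
// that eventually makes throughput bend back down as n grows.
def uslSpeedup(n: Int, sigma: Double, kappa: Double): Double =
  n / (1 + sigma * (n - 1) + kappa * n.toDouble * (n - 1))
```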

Simplicity is better

Text encoding (JSON, XML) doesn’t need to be human readable. It’s computers talking to each other

Have to deal with problems and lack of response for both synchronous and async communication.

Errors need to be first-class messages. The name "exception" in Java implies they are unusual cases. We don't know what to do with them anyway.

Abstraction shouldn't mean generalization. It should be about creating a semantic level at which we can be more precise.

Issues: superiority complexes around particular technologies/techniques, where anyone who says otherwise is wrong. Religious wars. X is the solution to everything.

Functional programming is not the answer to multi-core. The math (Universal Scalability Law) still hits. [it still helps up to a point on the graph, though]

Think about the transformation and flow of data, not code.

Hardware

  • Mobile makes us think about hardware again – battery and bandwidth. The free lunch of throwing hardware at the problem is over
  • Hardware has been designed to make bets on locality and predictable access patterns. Pre-fetching and the like make an order of magnitude difference.
  • Bandwidth is increasing. Latency is staying the same.

Diversity

  • Testosterone Driven Development
  • The percentage of women went up along with other STEM fields. Then in CS it started declining in 1980, which is when PCs were introduced and marketed to boys.
  • You can catch up in a year even if you start coding in college.
  • Grace Hopper invented COBOL and the compiler
  • Margaret Hamilton invented async programming via her NASA work

Don’t be afraid to fail

Impressions: great substitute keynote. I wonder how long they had to practice together. Trisha said she's seen it given before (with Martin Thompson).