Choose Your Own Adventure: Chaos Engineering – Live Blogging from QCon

State of Chaos Engineering
Speaker: Nora Jones


Known ways for testing for availability

  • Unit test
  • Integration test
  • Regression test
  • Chaos Engineering – much less common. Doesn’t replace need for “traditional” tests

Chaos Engineering

  • Most people know what Chaos Monkey is; far fewer know about Chaos Engineering. The former is a tool; the latter is a strategy. The former is mature; the latter is emerging
  • “Chaos” means different things to different companies. Common themes: experimenting, distributed systems, making the system stronger through experiments
  • Goal is to run chaos all the time, not just on deployment

Why to start

  • Can’t keep blaming your cloud provider. Need to own failure
  • Failures will happen anyway. Why are we afraid of that?
  • Computers are complicated and they will break

“Chaos Carol”

Introducing chaos

  • Think about where you are now and what response you expect
  • How many people should know the chaos is intentional? It helps for people to know an experiment is running.
  • Define “normal” system and behavior
  • Relate chaos to automated tests, SLAs and customer experiments
  • Start in QA, not Prod. This establishes a baseline
  • Only run during the business day

Ways to create chaos

  • Start small – graceful restarts or degradation
  • Randomly turn things off
  • Recreate things that have already happened – good once you reach a steady state
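
As a rough illustration of the "randomly turn things off" idea, here is a minimal Python sketch. It is not from the talk: the names pick_victim and run_shutdown_experiment are hypothetical, and the stop callback stands in for whatever actually stops an instance in your environment.

```python
import random


def pick_victim(instances, rng=None):
    """Return one instance chosen uniformly at random, or None if there are none."""
    rng = rng or random.Random()
    return rng.choice(instances) if instances else None


def run_shutdown_experiment(instances, stop, rng=None):
    """Stop one randomly chosen instance and report which one was picked.

    `stop` is a caller-supplied callable (hypothetical here); the sketch
    makes no assumption about how an instance is really shut down.
    """
    victim = pick_victim(instances, rng)
    if victim is not None:
        stop(victim)
    return victim
```

Passing a seeded random.Random makes a run repeatable, which helps when recreating a failure that has already happened.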

Culture and implementation

  • People need to understand revealing problems is good (vs causing problems)
  • Start with opt in so people have control
  • Monitoring is important. Use dashboards to communicate
  • Automatically shut down the experiment if it goes too far astray
  • Have your incident/Jira/PagerDuty tickets gone down?
  • Don’t forget about your company’s customers. Focus on business goals and not causing customer pain
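
The automatic shutdown point can be sketched as a guard that checks a metric after every step. This is an assumption-laden example rather than anything shown in the talk: run_experiment, read_error_rate, baseline and tolerance are all hypothetical names, and the metric could just as well be latency or a business measure.

```python
def run_experiment(steps, read_error_rate, baseline, tolerance=0.05):
    """Run chaos steps in order and abort once the metric drifts too far.

    steps           -- list of zero-argument callables, each injecting one fault
    read_error_rate -- zero-argument callable returning the current error rate
    baseline        -- error rate measured during the normal steady state
    tolerance       -- how far above baseline the rate may rise (assumed value)

    Returns (number_of_steps_completed, aborted).
    """
    completed = 0
    for step in steps:
        step()
        completed += 1
        # Stop immediately if the system has strayed beyond the allowed margin.
        if read_error_rate() - baseline > tolerance:
            return completed, True
    return completed, False
```

In practice the same reading that drives the abort decision would also feed the dashboard, so people see why an experiment stopped.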

Cascading failure

  • Try later on
  • Start in QA
  • May fail in unexpected ways – the tool broke QA for a week
  • Problems lie dormant for a long time

Testing

  • FIT – Failure Injection Testing
  • F# library: https://github.com/norajones/FailureInjectionLibrary
  • Types of chaos failures – exceptions, latency
  • After FIT, focus on minimizing blast radius and concentrating failures
  • Targeted chaos – important to have a steady state before introducing it, so you know what was caused by the introduction
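
To make the two failure types concrete, here is a small Python sketch of injecting exceptions or latency into a function call. It does not reproduce the API of the F# library linked above; inject_failure, InjectedFailure and every parameter are hypothetical names chosen for illustration.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Marks a fault that was injected on purpose (hypothetical name)."""


def inject_failure(kind, probability, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Decorator that injects an exception or extra latency into some calls.

    kind          -- "exception" or "latency"
    probability   -- fraction of calls affected, from 0.0 to 1.0
    delay_seconds -- added delay when kind is "latency"
    rng, sleep    -- overridable so experiments and tests stay repeatable
    """
    if kind not in ("exception", "latency"):
        raise ValueError("kind must be 'exception' or 'latency'")
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                if kind == "exception":
                    raise InjectedFailure("injected into " + func.__name__)
                sleep(delay_seconds)  # latency: delay, then call through
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Keeping the probability low and applying the decorator to a single call path is one way to keep the blast radius small while still concentrating failures where you want to look.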

The choose-your-own-adventure format was a fun series of choices for thinking through viable options, or in some cases options that were not viable.
