choose your own adventure: chaos engineering – live blogging from qcon

State of Chaos Engineering
Speaker: Nora Jones

See the list of all blog posts from the conference

Known ways for testing for availability

  • Unit test
  • Integration test
  • Regression Test
  • Chaos Engineering – much less common. Doesn’t replace need for “traditional” tests

Chaos Engineering

  • Most people know what Chaos Monkey is; far fewer know about Chaos Engineering. The former is a tool; the latter is a strategy. The former is mature; the latter is emerging
  • “chaos” means different things to different companies. Common things: experimenting, distributed system, make system stronger through experiments
  • Goal is to run chaos all the time, not just on deployment

Why to start

  • Can’t keep blaming your cloud provider. Need to own failure
  • Failures will happen anyway. Why are we afraid of that?
  • Computers are complicated and they will break

“Chaos Carol”

Introducing chaos

  • think about where you are now and expected response
  • How many people should know the chaos is intentional? Helpful for them to know an experiment is running.
  • Define “normal” system and behavior
  • Relate chaos to automated tests, SLAs and customer experiments
  • Start in QA, not Prod. This establishes a baseline
  • Only run during business day

Ways to create chaos

  • Start small – graceful restarts or degradation
  • Randomly turn things off
  • Recreate things that have already happened – good once reach a steady state
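As a rough illustration of "randomly turn things off", here is a minimal sketch of choosing a target. The function name `pick_victim`, the 09:00-17:00 weekday window, and the opt-in set are all assumptions made for the example (the notes only say to run during the business day and, later, to start with opt in); this is not how Chaos Monkey itself selects instances.

```python
import random
from datetime import datetime


def pick_victim(instances, opted_in, now, rng=random):
    """Pick one instance to turn off.

    Returns None outside business hours or when nobody has opted in.
    """
    is_weekday = now.weekday() < 5   # Monday=0 .. Friday=4
    in_hours = 9 <= now.hour < 17    # assumed business day: 09:00-17:00
    if not (is_weekday and in_hours):
        return None
    # Only instances whose owners opted in are eligible.
    candidates = sorted(name for name in instances if name in opted_in)
    if not candidates:
        return None
    return rng.choice(candidates)
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when recreating an earlier experiment.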

Culture and implementation

  • People need to understand revealing problems is good (vs causing problems)
  • Start with opt in so people have control
  • Monitoring is important. Use dashboards to communicate
  • Automatically shut down experiment if goes too far astray
  • Have your incident/Jira/PagerDuty tickets gone down?
  • Don’t forget about your company’s customers. Focus on business goals and not causing customer pain
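The "automatically shut down" point above could look something like the sketch below. It assumes the monitoring system can supply a stream of error-rate samples and that `stop_experiment` is a hypothetical callback that ends the experiment; the baseline and tolerance values are illustrative, not numbers from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GuardResult:
    aborted: bool
    samples_seen: int
    worst_error_rate: float


def run_with_guard(
    error_rates: Iterable[float],
    baseline: float,
    tolerance: float,
    stop_experiment: Callable[[], None],
) -> GuardResult:
    """Watch error-rate samples; stop once one exceeds baseline + tolerance."""
    worst = 0.0
    seen = 0
    for rate in error_rates:
        seen += 1
        worst = max(worst, rate)
        if rate > baseline + tolerance:
            stop_experiment()  # hypothetical hook that shuts the experiment down
            return GuardResult(True, seen, worst)
    return GuardResult(False, seen, worst)
```

The baseline is the "normal" behavior defined before introducing chaos, so the guard only reacts to deviation the experiment could plausibly have caused.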

Cascading failure

  • Try later on
  • Start in QA
  • May fail in unexpected ways – the tool broke QA for a week
  • Problems lie dormant for a long time

Testing

  • FIT – Failure Injection Testing
  • F# library: https://github.com/norajones/FailureInjectionLibrary
  • Types of chaos failures – exceptions, latency
  • After FIT, focus on minimizing blast radius and concentrating failures
  • Targeted chaos – important to have a steady state before introduce so know what caused by introduction
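To make the two failure types named above (exceptions and latency) concrete, here is a minimal Python sketch of failure injection as a decorator. It is not the API of the F# library linked above; `inject_failure`, `InjectedFailure`, and every parameter name are invented for illustration.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised instead of calling the wrapped function when an exception is injected."""


def inject_failure(mode, rate, delay_seconds=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so a fraction `rate` of calls fails or is delayed."""
    if mode not in ("exception", "latency"):
        raise ValueError(f"unknown mode: {mode}")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                if mode == "exception":
                    raise InjectedFailure(f"injected failure in {func.__name__}")
                sleep(delay_seconds)  # latency mode: delay, then still make the real call
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Keeping `rate` small and applying the decorator to a single call path is one way to limit the blast radius; `rng` and `sleep` are injectable so the behavior can be tested without real randomness or waiting.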

The choose your own adventure was a fun series of choices to think about viable options. Or not viable in some cases.

unifying banks & blockchains at coinbase – live blogging at qcon

Unifying Banks & Blockchains @Coinbase
Speaker: Jim Posen

Coinbase converts blockchain currency with traditional/fiat currency. Started doing that in 2012. Then added support for European banks in 2014. Then in 2015, added an Exchange.

In 2016, the Rails app was becoming a problem and the Bitcoin logic started to degrade. Created first microservice at that point. Now support multiple currencies. Still maintain monolithic Rails app.

Bitcoin

  • definition – Bitcoin is a scarce digital asset and a protocol for transferring the asset over the internet. “Email” is overloaded in two ways as well.
  • Public transactions ledger
  • About 30 minutes for transactions to clear – regardless of hours and holidays
  • Irreversible payments

Coinbase architecture

  • Uploads batch file daily to the originating depository financial institution (ODFI). Clears through the ACH operator to the receiving depository financial institution (RDFI) and the receiver receives
  • The RDFI has 24 hours to return for insufficient funds
  • Then the receiver can challenge/return for up to 60 days – important consumer protection, but a challenge
  • Bitcoin uses gossip protocol where nodes talk to other nodes
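The timing mismatch in these notes can be worked through with simple date arithmetic. This sketch takes the figures as noted (about 30 minutes for Bitcoin to clear, 24 hours for an RDFI return, up to 60 days for a receiver challenge) and, as a simplifying assumption, measures every window in calendar time from one starting moment; real ACH rules may count banking days and start the clocks differently.

```python
from datetime import datetime, timedelta

# Figures as given in the notes; treated here as plain calendar durations.
RDFI_RETURN_WINDOW = timedelta(hours=24)
RECEIVER_CHALLENGE_WINDOW = timedelta(days=60)
BITCOIN_CLEARING_TIME = timedelta(minutes=30)


def ach_risk_windows(received_at):
    """Return when each ACH reversal window closes, measured from `received_at`."""
    return {
        "rdfi_return_until": received_at + RDFI_RETURN_WINDOW,
        "receiver_challenge_until": received_at + RECEIVER_CHALLENGE_WINDOW,
    }


def exposure_after_bitcoin_clears(received_at):
    """Time the ACH side stays reversible after the Bitcoin side has become final."""
    bitcoin_final_at = received_at + BITCOIN_CLEARING_TIME
    return ach_risk_windows(received_at)["receiver_challenge_until"] - bitcoin_final_at
```

Under these assumptions the irreversible Bitcoin payment is final almost 60 days before the bank payment funding it stops being reversible, which is why the consumer protection is also a challenge.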

More happened after I left. The session was good. I had a hard stop today and had to leave.

removing friction in the developer experience – live blogging from qcon

Removing Friction in the Developer Experience
Speaker: Adrian Trenaman

Started with a funny story explaining his talk was about removing red tape and bureaucracy. To TSA/immigration.

Goal: minimize the distance between hello world and prod. Need to be able to deploy quickly, safely and own in prod

Developer hierarchy of needs

  1. self actualize – get stuff done and have cool stories that impress your friends
  2. perks – foosball, beanbags, free food – we don’t work for treats. a bit like the breakfast buffet at a hotel; love at first, but then meaningless
  3. basics – laptop, wifi, VPN, eat, standing desk, screen, warmth, light

Good software org

  • Teams 3-7
  • Departments 16-24
  • Leaders not managers, leaders who code – 85% of time as lead, 60% of time as director, 15% of time beyond that
  • DevOps, ownership, open source

Work is hard – like pushing up a hill. Friction is a force that pushes back when try to do something

Friction: Staging/Testing environments

  • Too many such environments. Waste
  • In physical world, draw map of area and make one continuous line of what need to do in order to complete job. The resulting spaghetti diagram shows wasted effort.
  • Doing this on the environment shows number of people deploying and number of deployments. Helps highlight handoffs between groups of people – dev, qa, deployers.
  • Muda – waste in process – Intellect (building environments), Overprocessing (retest in multiple environments), rework (environments never match prod), inventory (commits held up), transportation (deliveries to prod), motion (commit/deploy cycles), waiting (held up on someone else) and overproduction (fewer big bang releases)
  • Instead deploy directly to prod – dark canary (see if working), canary (one of X servers has new code) release (all servers get new code), rollback (if needed)
  • Think of team as a startup providing services to other dev teams
  • Teams need secure, unfettered control to their infrastructure. Break down master account into subaccounts. Also helps with cost model because can see which teams use what. Some teams need everything locked down, but not all do.
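The direct-to-prod sequence in the list above (dark canary, canary, release, with rollback if needed) can be sketched as a small staged rollout. The `healthy_at` callback is a hypothetical stand-in for whatever health check a team uses; this is an illustration of the idea, not Gilt's actual deployment tooling.

```python
STAGES = ["dark canary", "canary", "release"]


def roll_out(healthy_at):
    """Walk through the stages in order; roll back at the first unhealthy stage.

    `healthy_at` is a hypothetical callback: stage name -> bool.
    Returns the list of steps taken.
    """
    steps = []
    for stage in STAGES:
        steps.append(stage)
        if not healthy_at(stage):
            steps.append("rollback")
            break
    return steps
```

Because each stage exposes the new code to more servers than the one before, a problem caught at the canary stage triggers rollback before all servers are affected.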

Friction: Forced technology choices

  • Voluntary adoption – let people choose technology. If successful, more will use it. If nobody is using it, that is a sign to stop using it
  • Looks like chaos, but creating an environment where people can create own choices
  • Standards and recommendations on github: https://github.com/gilt/standards
  • Continuum of adoption by role and voluntary adoption.
  • Eventually converges on a set of norms

Friction: Fear of breaking all the things

  • Knowing going to prod makes one cautious
  • Gilt is LOSA – lots of small apps – aka “micro-frontends”. Each page considered own app
  • Gives confidence that can’t break checkout by changing the product page

Friction: Forced team choices

  • Nothing worse than working with people you don’t like
  • Leader locks down product manager, tech lead, etc.
  • Pitch and let people sign up
  • Somehow this works and everyone wants to be on the team. Everyone picks in a room on a board so can see if too many people have same skill set or too many junior people. Ultimately the tech lead chooses. Can negotiate: will do unsexy work if can also work on X. If nobody wants to work on project, think about why can’t get people excited about it. If it is operational work, can spread across teams.
  • Teams stay together 12-18 months. Better to bring work to the teams than to self-select teams every few months

Friction: Distractions

  • Coding is the primary activity
  • Everyone likes being in flow
  • Red Hot Engineer – one person is in charge of problems/distractions for a few weeks. If quiet, they can read a book or whatever
  • Minimize meetings – they have 2.75-5 hours of meetings a week. Ask at end of recurring meeting if useful and if should meet again.

Measure how doing and compare over time – delivering value, fun, ease of release, health of codebase, whether learning, mission, are we players or pawns, speed, suitable process, support, teamwork