[QCOn 2019] The Trouble with Learning in Complex Systems

Jason Hand from Microsoft – @jasonhand

For other QCon blog posts, see QCon live blog table of contents

Definitions

  • We use terms where not everyone on same page as to meanings.
  • Ex: what does “complex” mean
  • Types of systems
    • Whether can determine cause and effect
    • Ordered vs unordered
    • Ordered – Obvious (Can take it apart/put it back together. Know how works. ex: bicycle), complicated (ex: motorcycle)
    • Unordered – obvious, complicated, complex (ex: people on road, human body), chaotic (ex: NYC)
  • Sociotechnical systems – th epeople part is hard

Complex system

  • Causality can only be examined/understood/determined in hindsight
  • Specialists, but lack broad understanding of system
  • Imperfect information
  • Constantly changing
  • Users good at surprising us with what system can/can’t do

Learning

  • Takes time
  • Takes success and failure. Need both
  • Learning opportunities not evenly distributed
  • Sample learning opportunities – code commits, config changes, feature releases and incident response. Commits occur much more often than instances
  • However, the cost to recovery is low for the more frequent opportunities
  • High opportunity – low stakes and high frequency. GIt push is muscle memory
  • Low opportunity – high stakes and low opportunity
  • Frequency is what creates the opportunity

Incident

  • Everyone would agree impacting the customer is an incident
  • If didn’t affect the customer, not always called an incident.
  • If not called an incident, no incident review.
  • Missed learning opportunity
  • We view incidents as bad.
  • Incidents are unplanned work.
  • Near misses save the day, but don’t get recognized or learned from
  • Systems are continuously changing; will never be able to remove all problems from system

Techniques to learn

  • Root cause analysis is insufficient. Like a post mortem, it is just about what went wrong.
  • Needs to be a learning review
  • Discuss language barriers, tools, confidence level, what people tried
  • Discuss what happened by time and the impact
  • ChatOps better than phone bridge because can capture what happened. Nobody is going to transcribe later. Having clean channel for communication helps.
  • However, incidents not linear.
  • Book: Overcomplicated
  • If someone just does one thing, the learning doesn’t transfer. Need operational knowledge and mental models

Learning Reviews

  • Set context – not looking for answers/fixes. Looking for ways to learn even if no action items
  • Set aside time/effort to be curious
  • Asking linear questions (ex: five whys), don’t get to reality system
  • Invite people who weren’t part of incident response. They should still learn and can provide info about system
  • Understand and reduce blind spots

My impression

Good talk. It’s definitely thought provoking. And suggests small things one can do to start making things better

Leave a Reply

Your email address will not be published. Required fields are marked *