[QCon 2019] Navigating Complexity – High-performance discovery teams

Conal Scanlon

For other QCon blog posts, see QCon live blog table of contents

Discovery vs Delivery

  • Delivery: completing lots of tickets, points completed, happy team
  • Discovery – different goals
  • Two types of work; two different ways of thinking
  • Dual track agile – discovery is for learning. Inputs into delivery track.
  • Discovery will happen whether you plan for it or not. ex: bug report
  • Can change the % of time spent on discovery vs delivery over the phases of a project


  • Cynefin – a Welsh word
  • Domains: Complex/Chaotic/Complicated/Obvious
  • Also a box in the middle (disorder) for when you don’t know which domain you’re in
  • Complicated – can use expert knowledge to see what to do

Key areas of discovery work

  • Maximize learning – accelerate discovery, MVPs
    • Want to encounter problems as quickly as possible
    • Learning is messy and doesn’t easily fit Scrum process
    • MVP goal – maximize learning while minimizing risk and investment
    • MVPs can be paper prototype or a single use case
  • Better ideas – idea flow, collective intelligence
    • Levels: Psychological safety, dependability, structure & clarity, meaning, impact
    • Validate with people outside team
    • Build closer relationships with specific customers so you can see their reactions as you progress
  • Alignment – OKRs, Briefs, Roadmap
    • OKR = objective and key results
    • Think about where want to go and how get there
    • Should understand the why, not just have an aspirational goal
    • Alignment and autonomy are orthogonal
    • Product brief – map from strategic level to feature going to build. It is not a requirements or architectural doc
    • Roadmap – show on larger ranges of time
  • Metrics – 3 levels, correct category (delivery vs discovery)
    • Business, product and engagement metrics.

My impression

I like that he provided an outline with the key points up front. The OKR section was detailed with examples. I like that there were book references/recommendations. And it was certainly interesting. I think I expected it to be about something else, but I’m glad I came. I would have liked more on examples of discovery projects specifically.

[QCon 2019] The Trouble with Learning in Complex Systems

Jason Hand from Microsoft – @jasonhand

For other QCon blog posts, see QCon live blog table of contents


  • We use terms where not everyone is on the same page as to their meanings.
  • Ex: what does “complex” mean
  • Types of systems
    • Whether can determine cause and effect
    • Ordered vs unordered
    • Ordered – Obvious (Can take it apart/put it back together. Know how works. ex: bicycle), complicated (ex: motorcycle)
    • Unordered – complex (ex: people on a road, the human body), chaotic (ex: NYC)
  • Sociotechnical systems – the people part is hard

Complex system

  • Causality can only be examined/understood/determined in hindsight
  • Specialists, but lack broad understanding of system
  • Imperfect information
  • Constantly changing
  • Users good at surprising us with what system can/can’t do


  • Takes time
  • Takes success and failure. Need both
  • Learning opportunities not evenly distributed
  • Sample learning opportunities – code commits, config changes, feature releases, and incident response. Commits occur much more often than incidents
  • However, the cost of recovery is low for the more frequent opportunities
  • High opportunity – low stakes and high frequency. Git push is muscle memory
  • Low opportunity – high stakes and low frequency
  • Frequency is what creates the opportunity


  • Everyone would agree impacting the customer is an incident
  • If didn’t affect the customer, not always called an incident.
  • If not called an incident, no incident review.
  • Missed learning opportunity
  • We view incidents as bad.
  • Incidents are unplanned work.
  • Near misses save the day, but don’t get recognized or learned from
  • Systems are continuously changing; will never be able to remove all problems from system

Techniques to learn

  • Root cause analysis is insufficient. Like a post mortem, it is just about what went wrong.
  • Needs to be a learning review
  • Discuss language barriers, tools, confidence level, what people tried
  • Discuss what happened by time and the impact
  • ChatOps better than phone bridge because can capture what happened. Nobody is going to transcribe later. Having clean channel for communication helps.
  • However, incidents not linear.
  • Book: Overcomplicated
  • If someone just does one thing, the learning doesn’t transfer. Need operational knowledge and mental models

Learning Reviews

  • Set context – not looking for answers/fixes. Looking for ways to learn even if no action items
  • Set aside time/effort to be curious
  • Asking linear questions (ex: five whys) doesn’t get to the reality of the system
  • Invite people who weren’t part of incident response. They should still learn and can provide info about system
  • Understand and reduce blind spots

My impression

Good talk. It’s definitely thought-provoking, and it suggests small things one can do to start making things better.

[QCon 2019] Low Latency in the Cloud, with OSS

Mark Price @epickrram

For other QCon blog posts, see QCon live blog table of contents


  • Trading app
  • Need microsecond (not millisecond) response time
  • Need data in memory vs database
  • Lock free programming
  • Redundancy
  • High volume
  • Predictable latency


  • System built on OSS
  • Opinionated framework to accelerate app dev
  • Clients communicate with stateless, scalable gateways
  • Persistors – manage data in memory.
  • Gateway – converts large text message to something smaller and more efficient
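
The gateway’s text-to-binary conversion can be sketched roughly as below. This is a toy illustration with a made-up field layout (the talk didn’t specify the actual encoding); systems in this space often use a binary codec such as SBE for the same purpose:

```python
import struct

# Hypothetical example: convert a verbose text order message into a compact
# fixed-width binary frame, the kind of translation a gateway does so that
# downstream nodes handle fewer, smaller bytes.

def encode_order(text_msg: str) -> bytes:
    # e.g. "side=BUY;symbol=ABCD;qty=100;price=101.25"
    fields = dict(pair.split("=") for pair in text_msg.split(";"))
    side = 0 if fields["side"] == "BUY" else 1
    symbol = fields["symbol"].encode("ascii")[:8].ljust(8, b"\0")
    # price as fixed-point micro-units to avoid floats on the wire;
    # network byte order: 1-byte side, 8-byte symbol, 4-byte qty, 8-byte price
    price_micros = round(float(fields["price"]) * 1_000_000)
    return struct.pack("!B8sIq", side, symbol, int(fields["qty"]), price_micros)

def decode_order(frame: bytes) -> dict:
    side, symbol, qty, price_micros = struct.unpack("!B8sIq", frame)
    return {
        "side": "BUY" if side == 0 else "SELL",
        "symbol": symbol.rstrip(b"\0").decode("ascii"),
        "qty": qty,
        "price": price_micros / 1_000_000,
    }
```

The 21-byte frame is roughly half the size of the text form, and fixed-width fields decode without any string parsing on the hot path.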

Design choices

  • Replay logs to reapply changes. Business logic must be fully deterministic. Bounded recovery times
  • Placement group in cloud – machines guaranteed to be near each other. Minimizes latency between nodes
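
The replay-log idea can be sketched as follows: if the business logic is a pure, deterministic function of (state, event), replaying the same log from empty state always rebuilds the same in-memory state. This is a toy illustration of the principle, not the framework from the talk:

```python
# Toy event-sourced state: a deterministic apply() means replaying the
# same log always reproduces the same in-memory state, which is what
# makes bounded, predictable recovery times possible.

def apply(state: dict, event: dict) -> dict:
    # pure function: no clocks, no randomness, no I/O
    account = event["account"]
    balance = state.get(account, 0) + event["delta"]
    return {**state, account: balance}

def replay(log: list) -> dict:
    state = {}
    for event in log:
        state = apply(state, event)
    return state

log = [
    {"account": "A", "delta": 100},
    {"account": "B", "delta": 50},
    {"account": "A", "delta": -30},
]
# replaying twice yields identical state
assert replay(log) == replay(log) == {"A": 70, "B": 50}
```

Anything non-deterministic (wall-clock reads, random IDs) has to be captured in the events themselves, or replay diverges.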

Testing latency

  • Do as part of CD pipeline
  • Can’t physically monitor with a fiber tap in the cloud
  • Capture latencies in a histogram to get a statistical view and calculate percentiles
  • Test under load
  • Fan out the locations you test from
  • Store percentiles in time series data
  • Can see jitter from garbage collection
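
A minimal sketch of the histogram-and-percentiles idea, using a plain sorted list (real systems typically use a purpose-built library such as HdrHistogram rather than storing raw samples):

```python
import bisect

# Record latency samples and read off nearest-rank percentiles,
# the values you would push into a time series store per test run.
class LatencyHistogram:
    def __init__(self):
        self.samples = []  # kept sorted

    def record(self, micros: int):
        bisect.insort(self.samples, micros)

    def percentile(self, p: float) -> int:
        idx = min(len(self.samples) - 1, int(p / 100 * len(self.samples)))
        return self.samples[idx]

h = LatencyHistogram()
for latency_micros in [12, 15, 14, 13, 500, 16, 13, 14, 15, 12]:
    h.record(latency_micros)

# a GC pause shows up as a long tail: the median stays low, p99 jumps
p50, p99 = h.percentile(50), h.percentile(99)
assert p50 == 14 and p99 == 500
```

This is why percentiles beat averages here: the mean of those samples (~62µs) describes no request that actually happened, while p50/p99 expose both the typical case and the jitter.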

Performance on shared box/cloud

  • Not in control of resources running on
  • Containers share L3 cache so can see higher rates of cache miss
  • CPU throttling effects
  • Hard to measure since can’t see what neighbors are doing
  • One option is to rent the largest box possible and compare its specs against the vendor website. If you have the max # of cores, you know you have the box to yourself. Expensive – was about five dollars a year. At that price, it might be worth just buying your own machine in a data center
  • Can pack non-latency-sensitive services onto shared machines

<missed some at the end. I got an email that distracted me>

My impressions

There was a lot of discussion about the histogram. I would have liked to see some examples rather than just talking about how it is calculated. They didn’t have to be real examples to be useful. There were some interesting facts and it was a good case study, so I’m glad I went. I was glad he addressed that non-cloud is a possible option for this scenario.