[QCon 2019] Using Bets, Boards and Missions to inspire Org Wide Agility

Speaker: John Cutler @johncutlefish

For other QCon blog posts, see QCon live blog table of contents

Goal: want to be a change agent and see what works. Want teams to see more impact in their work. Want to create nudges.

General

  • People seem interested and want to try. Then there is fear and nothing happens. Then people give up.
  • People confuse their own needs, continuous improvement, and specific ideas.
  • It’s hard for everyone.
  • Some companies are healthier than others. The time to get rid of a toxic employee ranges from 12 months to forever.
  • Companies think they want a magic tool or framework.
  • Common problems: structural, culture/alignment, strategy, decision making, revenue pressure, deal closing, feature factory, busyness/high utilization, constraints.
  • Angst is easy to trigger. People want certainty, impact and coherence.
  • Coherence != agreement. Coherence means understanding.
  • Want a flow of impactful stories.
  • How to know if you’re in a feature factory: https://hackernoon.com/12-signs-youre-working-in-a-feature-factory-44a5b938d6a2

Key ideas

  • Product development is a beautiful mess
  • Efforts to simplify/standardize often backfire. If you can reflect the mess back, you become a change agent. Mirrors are beautiful. Ad-libs for bets: https://dazzling-allen-f0bcd2.netlify.com

Hacks

  • In response to a thing a team is doing, say “oh, well that’s an interesting bet.” This starts a conversation about risk and more. Bets can be of any size and risk.
  • Work ranges in the 1-3 hour/day/week/month/quarter/year/decade range. Make the nesting of work more visible: put 1-3 year bets in PowerPoint so everyone sees them. Everything a month or less lives in Jira. Everything in between is not visible. See if a developer can map their tasks to the end result in less than 2 minutes.
  • Shift words.
    • Problems/Solutions -> Opportunities/Interventions
    • Projects -> Missions
    • Experiments -> Bets
    • “Done” -> Decision point or review and measurement
    • Dependency wrangling -> Playing Tetris
    • Debt -> Drag
  • Checklist of what you need to know. It’s OK to not know as long as you’re aware. Key is for the list to be a one-pager.
  • A letter to the future
  • Make a map of work in progress when it feels high. Show the impact and why it is terrible. A safe way to talk about anxiety.
  • Use a board to show what’s next, what you’re currently focusing on, and what is in review. Also include different levels (time frames) for bets. The board covers multiple teams.
  • Weekly learning users – people whose shared learning was consumed by others in the last week
  • Broadcasted learnings – charts/dashboards/notebooks consumed by at least two people in a week
  • Consumption of learnings – total reach of broadcasted learnings

Q&A

  • What if the org isn’t ready? See if you can do it on one project.
  • Social dynamics? For the first 5-10 minutes people think they will be measured and will fail based on this. Showing what other companies do helps.

My impression

I want a double green button to vote. This was great. It was interesting, relatable, actionable, easy to understand, and applicable at many levels. So even if an org isn’t ready for all of this, smaller parts can be done.

[QCon 2019] PID loops and the art of keeping systems stable

Colm MacCarthaigh @colmmacc from Amazon

For other QCon blog posts, see QCon live blog table of contents

Control Theory

  • PID loops are from control theory
  • Feedback loop – present, observe, feedback, react
  • A hundred-year-old field
  • Different fields claimed to have invented it, then realized they had the same equations and approaches

Furnace example

  • Classic example is a furnace. Want to get the tank to a desired temperature. Measure the water temperature and react by raising/lowering the heat.
  • Could just put it over the fire until done, but then it will cool off too fast.
  • Can overheat because of lag.
  • Focus on the error – the distance from the current state to the desired state.

PID

  • P = Proportional. Make a change proportional to the error.
  • A P-only controller is not stable because it oscillates a lot.
  • I = Integral. Oscillates far less.
  • D = Derivative. Prevents oscillation.
  • In the real world, PI controllers are often sufficient.
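
To make the furnace loop concrete, here is a minimal PID controller sketch in Python (my illustration, not from the talk; the gains and the one-line thermal model are made-up assumptions):

```python
# Minimal PID controller (illustrative; gains are arbitrary).
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measured, dt):
        error = setpoint - measured      # distance from the desired state
        self.integral += error * dt      # I term: accumulated past error
        derivative = 0.0
        if self.prev_error is not None:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # P reacts to the current error, I removes steady-state offset, D damps change.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy furnace: heat raises the water temperature, ambient losses lower it.
temp, setpoint, dt = 20.0, 70.0, 1.0
pid = PID(kp=0.5, ki=0.05, kd=0.1)
for _ in range(60):
    heat = max(0.0, pid.update(setpoint, temp, dt))  # a furnace can only heat
    temp += (heat - 0.1 * (temp - 20.0)) * dt        # crude thermal model
print(round(temp, 1))  # settles near the 70.0 setpoint
```

In this toy model, setting ki=0 makes the P-only weakness visible as a steady-state offset (it settles around 62, not 70, because a proportional term alone cannot hold the heat needed at equilibrium); with more aggressive gains or more lag it shows up as the oscillation described in the talk.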

Comes up in context of (but rarely applied)

  • Autoscaling and placement (instances, storage, network, etc.) – load follows daily or weekly patterns/cycles. ML can infer what will happen in the future. The I component forecasts what will happen next (see the sketch after this list).
  • Fairness algorithms (TCP, queues, throttling). Can scale elastically per region. Figure out the capacity of each site and its peak usage, and ensure no site is overwhelmed. CloudFront looks at how close each site is to capacity and treats that as the error.
  • Systems stability
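
The autoscaling case maps naturally onto a control loop: measured utilization is the process variable and a target utilization is the setpoint. A minimal proportional sketch (my example, not from the talk; the formula mirrors the proportional style of Kubernetes’ Horizontal Pod Autoscaler, and every name here is an assumption):

```python
import math

def desired_instances(current_instances, current_utilization, target_utilization=0.6):
    """Proportional scaling: resize the fleet so utilization approaches the target.

    current_utilization is a fleet-wide average in [0, 1]; its ratio to the
    target plays the role of the error signal in the control loop.
    """
    return max(1, math.ceil(current_instances * current_utilization / target_utilization))

print(desired_instances(10, 0.9))  # 15 -- scale out when running hot
print(desired_instances(10, 0.3))  # 5  -- scale in when underutilized
```

An “I” flavor of this would smooth utilization over a trailing window, or over a forecast of the daily cycle, instead of reacting to the instantaneous value.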

Anti-patterns

  • Open loops
    • Should check that the requested action occurs.
    • How would you know if something suddenly went wrong?
    • “Sometimes they do things, but they don’t know why. So they pressed another magic button and it fixes everything”.
    • Frequently occurs for infrequent actions or for historic reasons.
    • Close the loop by measuring everything you can think of.
    • Make infrequent operations more frequent (Ex: chaos engineering)
    • “If we have something that is happening once a year, it is doomed to failure.” Ex: 1-3 year certificate rotation; people forget or leave.
    • Declarative config is easier to formally verify
  • Power laws
    • Errors propagate.
    • Make the system more compartmentalized so a failure stays as small as possible.
    • Exponential backoff – effectively an integral. Limit retries.
    • Rate limiters – token buckets can be effective.
    • Working backpressure – AWS SDK retry strategy = token buckets + rate limiters + persistent state (a sketch follows this list).
    • Recommended an AWS article on this.
  • Liveness and lag
    • Operating on old info can be worse than operating on no info. The environment changes, so stale data can be worse than an average.
    • Temporary shocks such as a spike or a momentary outage can take time to recover from.
    • Want constant-time scaling as much as possible. Not always possible.
    • Short queues are safer.
    • LIFO is good for prioritizing recent data. Can do out-of-order backfill to catch up.
  • False functions
    • Want to move in a predictable way that you control.
    • The UNIX load metric is evil. System and network latency aren’t good either. Need to look at the underlying metrics. CPU is a good metric.
  • Edge triggering
    • Triggers only at the edge (when a threshold is crossed).
    • Good for alerting humans.
    • Bad for software, as it only kicks in at times of high stress.
    • How do you ensure “deliver exactly once”?
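
The “working backpressure” bullet under power laws is the one most directly expressible in code. Here is a minimal sketch of a retry token bucket combined with exponential backoff and jitter (my illustration of the pattern, not the actual AWS SDK implementation; all names and parameters are assumptions):

```python
import random
import time

class RetryTokenBucket:
    """Client-side backpressure: retries spend tokens, successes slowly refill them.

    Because the state persists across calls, a struggling dependency drains the
    bucket and retry storms stop instead of amplifying the outage.
    """
    def __init__(self, capacity=10.0, refill_per_success=0.5, retry_cost=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success
        self.retry_cost = retry_cost

    def try_acquire_retry(self):
        if self.tokens < self.retry_cost:
            return False  # bucket drained: shed load instead of retrying
        self.tokens -= self.retry_cost
        return True

    def record_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

def call_with_backoff(operation, bucket, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            result = operation()
            bucket.record_success()
            return result
        except Exception:
            if attempt == max_attempts - 1 or not bucket.try_acquire_retry():
                raise
            # Exponential backoff with full jitter spreads retries out in time.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Usage would share one bucket per downstream dependency, e.g. call_with_backoff(fetch_user, bucket), where fetch_user is any flaky zero-argument callable.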

My impression

This talk was great. I encountered PID in robotics, so seeing it applied to our field was cool. All the things AWS thinks about in the environment were fascinating as well. Makes you happy as a user 🙂

[QCon 2019] ML Panel

Hien Luu @LinkedIn, Brad Miro @Google, Jeff Smith @Facebook

For other QCon blog posts, see QCon live blog table of contents

Getting Started

  • People with other strong IT skills switched over
  • Can learn from books, Coursera, Udacity, grad school
  • Look for specific applications
  • Domain is very large
  • Learn libraries, existing datasets
  • Understand where the organization is at. Ex: wanting to “do ML” vs. having a specific problem
  • Focus on how it will deliver business value

General

  • Many problems repeat, so you can get ideas from others
  • Important to have organizational alignment
  • Make sure to train on realistic data
  • Deep learning is a very successful use case of ML
  • “AI is the new electricity”
  • Limits of Moore’s law. Physical limitations with quantum
  • Research on how to get algorithms to train themselves

Tools

  • PyTorch Hub
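
Since PyTorch Hub was the one tool named, here is roughly what “pull in a pretrained model via code” looks like with it (a sketch; the choice of ResNet-18 is mine, not the panel’s, and newer torchvision releases take a weights= argument instead of pretrained=True):

```python
import torch

# Load a pretrained ResNet-18 image classifier published on PyTorch Hub.
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
model.eval()  # inference mode: freezes dropout/batch-norm behavior

# Run a dummy batch through it (1 image, 3 channels, 224x224).
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000]) -- ImageNet class scores
```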

Learning resources

Q&A

  • How do you learn without a business case? How do you know what you don’t know? Many educational resources start generally; you can skip some core concepts and learn them later.
  • How do you pick good training data? Iterate on testing. Important to keep training with new data.
  • Data heuristics? How much data? How many labels?
  • How to be more agile? Use a pretrained model to start. These exist as a service or can be pulled in via code.
  • How do you know when it’s good enough? Sometimes you have to just try. Or look to those who solved similar problems.
  • Tech stack? Hardware acceleration. Libraries.
  • Fraud? Retrain with new data.

My impressions

This was a good panel with interesting responses. One panelist was missing, but it came out well.