[QCon 2019] PID loops and the art of keeping systems stable

Colm MacCarthaigh @colmmacc from Amazon

For other QCon blog posts, see QCon live blog table of contents

Control Theory

  • PID loops are from control theory
  • Feedback loop – present, observe, feedback, react
  • Hundred year old field
  • Different fields each claim to have invented it, then realized they were all using the same equations and approaches

Furnace example

  • Classic example is a furnace. Want to get tank to a desired temperature. Measure water temperature and react by raising/lowering heat.
  • Could just put it over a fire until done, but it will cool off too fast.
  • Can overheat because of lag between applying heat and seeing the temperature change
  • Focus on the error – the distance between the current state and the desired state.

PID

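  • P = Proportional. Make a change proportional to the error
  • A P controller alone is not stable because it oscillates a lot
  • I = Integral. React to the error accumulated over time. Oscillates far less
  • D = Derivative. React to how fast the error is changing, which damps the remaining oscillation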
  • In real world, PI controllers are often sufficient.
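
To make the P, I and D terms concrete, here is a minimal Python sketch of a textbook discrete PID controller driving a toy version of the water tank above. It is not from the talk: the gains, the heater limit of 10 units, and the heat-loss constant are made-up values for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller. Gains are illustrative assumptions."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured              # distance from desired state
        self.integral += error * dt              # I: accumulated error
        if self.prev_error is None:
            derivative = 0.0                     # no history on the first step
        else:
            derivative = (error - self.prev_error) / dt  # D: rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(controller: PID, setpoint: float = 60.0, ambient: float = 20.0,
             steps: int = 2000, dt: float = 0.1) -> float:
    """Toy tank model (hypothetical): heater power in, heat leaks to ambient."""
    temp = ambient
    for _ in range(steps):
        # Heater can only deliver between 0 and 10 units of power (assumed).
        power = max(0.0, min(10.0, controller.update(setpoint, temp, dt)))
        # Assumed loss constant of 0.1 per degree above ambient.
        temp += (power - 0.1 * (temp - ambient)) * dt
    return temp


if __name__ == "__main__":
    print("P only:", simulate(PID(kp=2.0, ki=0.0, kd=0.0)))
    print("PI:    ", simulate(PID(kp=2.0, ki=0.5, kd=0.0)))
```

With these assumed numbers, the P-only controller settles about two degrees below the 60 degree target, because it needs a nonzero error to keep the heater on. Adding the I term removes that offset, which fits the note that PI controllers are often sufficient.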

Comes up in context of (but rarely applied)

  • Autoscaling and placement (instances, storage, network, etc) – daily or weekly load pattern/cycle. ML can infer what will happen in the future. The I component forecasts what will happen next.
  • Fairness algorithms (TCP, queues, throttling). Can scale elastically by rest region. Figure out the capacity of each site and its peak usage. Ensure no site is overwhelmed. CloudFront looks at how close each site is to capacity and treats that as the error.
  • Systems stability
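
As a rough illustration of treating "how close to capacity" as the error, here is a small Python sketch of one proportional autoscaling step. This is my own example, not how any AWS service works: the function name, the gain of 0.5, and the fleet limits are all hypothetical.

```python
import math


def next_instance_count(current: int, observed_util: float, target_util: float,
                        gain: float = 0.5, min_count: int = 1,
                        max_count: int = 100) -> int:
    """One proportional control step for a fleet size (hypothetical sketch).

    observed_util and target_util are in the same unit, e.g. percent CPU.
    """
    # Error: how far utilization is from where we want it to be.
    error = observed_util - target_util
    # Proportional reaction, scaled by fleet size. A gain below 1 means we
    # only correct part of the error each step, which limits overshoot.
    change = gain * current * (error / target_util)
    # Round up so we lean toward having spare capacity.
    desired = math.ceil(current + change)
    return max(min_count, min(max_count, desired))


if __name__ == "__main__":
    # 10 instances running at 90% against a 60% target: scale out.
    print(next_instance_count(10, 90.0, 60.0))
    # 10 instances running at 30% against a 60% target: scale in.
    print(next_instance_count(10, 30.0, 60.0))
```

Calling this on every measurement interval closes the loop: each step re-measures utilization instead of assuming the previous scaling action worked.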

Anti-patterns

  • Open loops
    • Should check that the requested action occurs.
    • How would you know if something suddenly went wrong?
    • “Sometimes they do things, but they don’t know why. So they pressed another magic button and it fixes everything”.
    • Frequently occurs for infrequent actions or for historic reasons.
    • Close the loop by measuring everything you can think of
    • Make infrequent operations more frequent (Ex: chaos engineering)
    • “if we have something that is happening once a year, it is doomed to failure”. Ex: certificate rotation every 1-3 years. People forget or leave
    • Declarative config is easier to formally verify
  • Power laws
    • Errors propagate
    • Make system more compartmentalized so failure stays as small as possible
    • Exponential backoff – an integral. Limit retries
    • Rate limiters – token buckets can be effective
    • Working backpressure – AWS SDK retry strategy = Token buckets + rate limiters + persistent state
    • Recommend AWS article
  • Liveness and lag
    • Operating on old info can be worse than operating on no info. The environment changes, so stale data can be worse than just using an average.
    • Temporary shocks such as spike or momentary outage can take time to recover
    • Want constant time scaling as much as possible. Not always possible
    • Short queues are safer
    • LIFO is good for prioritizing recent data. Can do an out-of-order backfill to catch up
  • False functions
    • Want to move in a predictable way that you control.
    • UNIX load metric is evil. System and network latency aren’t good either. Need to look at underlying metrics. CPU is a good metric
  • Edge triggering
    • Triggers only at edge
    • Good for alerting humans.
    • Bad for software because it only kicks in at times of high stress.
    • How do you ensure “deliver exactly once”?
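
The retry-limiting ideas under "Power laws" above (capped exponential backoff plus a token bucket) can be sketched in a few lines of Python. This is a generic illustration and not the AWS SDK's actual retry code: the class, the function, and all default values are hypothetical. The clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Simple token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.last = now             # time of the last refill

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                # caller should give up, not retry


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Capped exponential backoff: base * 2**attempt, at most `cap` seconds."""
    return min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2.0)
    for attempt in range(4):
        allowed = bucket.try_acquire(now=0.0)
        print(attempt, allowed, backoff_delay(attempt))
```

The point of the bucket is that retries draw from a shared, bounded budget. When a dependency is failing, the bucket empties and clients stop retrying, so the failure does not multiply as it propagates.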

My impression

This talk was great. I encountered PID in robotics. Seeing it applied to our field was cool. All the things AWS thinks about in the environment were fascinating as well. Makes you happy as a user 🙂
