[QCon 2019] PID loops and the art of keeping systems stable

Colm MacCarthaigh @colmmacc from Amazon

For other QCon blog posts, see QCon live blog table of contents

Control Theory

  • PID loops are from control theory
  • Feedback loop – present, observe, feedback, react
  • Hundred year old field
  • Different fields each claim to have invented it, then realized they all had the same equations and approaches

Furnace example

  • Classic example is a furnace. Want to get tank to a desired temperature. Measure water temperature and react by raising/lowering heat.
  • Could just put it over a fire until done, but it will cool off too fast.
  • Can overheat because of lag
  • Focus on the error – the distance from the current state to the desired state.

PID

  • P = Proportional. Make the change proportional to the error
  • A P-only controller is not stable because it oscillates a lot
  • I = Integral. Adding it makes the system oscillate far less
  • D = Derivative. Reacts to how fast the error is changing, which damps the remaining oscillation
  • In the real world, PI controllers are often sufficient.
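
To make the three terms concrete, here is a minimal sketch of a textbook discrete PID controller driving a toy version of the water tank above. It is not from the talk: the gains, the heater limits, and the tank model in `simulate` are all made-up assumptions chosen only so the numbers are easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller. Gains are illustrative, not tuned."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured              # distance from the desired state
        self.integral += error * dt              # I: accumulated past error
        if self.prev_error is None:
            derivative = 0.0                     # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt  # D: how fast the error changes
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(controller: PID, target: float = 60.0, steps: int = 2000, dt: float = 0.1) -> float:
    """Hypothetical tank: the heater warms the water, which leaks heat to a 20-degree room."""
    temp = 20.0
    for _ in range(steps):
        # Assumed heater limits: power cannot be negative or exceed 100.
        power = max(0.0, min(100.0, controller.update(target, temp, dt)))
        temp += (0.5 * power - 0.1 * (temp - 20.0)) * dt
    return temp
```

With these toy numbers, `simulate(PID(2.0, 0.0, 0.0))` settles a few degrees short of the 60-degree target, while adding the integral term (`PID(2.0, 0.5, 0.0)`) closes that gap. The toy model has no sensor lag, so it shows the steady-state offset of a P-only controller rather than the oscillation mentioned above. It also leaves out anti-windup, which a real controller with a clamped output would likely need.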

Comes up in the context of (but is rarely applied to)

  • Autoscaling and placement (instances, storage, network, etc) – daily or weekly load pattern/cycle. ML can infer what will happen in future. The I component forecasts what will happen next.
  • Fairness algorithms (TCP, queues, throttling). Can scale elastically by rest region. Figure out the capacity of each site and its peak usage. Ensure no sites are overwhelmed. CloudFront looks at how close each site is to capacity and treats that as the error.
  • Systems stability

Anti-patterns

  • Open loops
    • Should check that the requested action occurs.
    • How would you know if something suddenly went wrong?
    • “Sometimes they do things, but they don’t know why. So they pressed another magic button and it fixes everything”.
    • Frequently occurs for infrequent actions or for historic reasons.
    • Close the loop by measuring everything you can think of
    • Make infrequent operations more frequent (Ex: chaos engineering)
    • “If we have something that is happening once a year, it is doomed to failure”. Ex: 1-3 year certificate rotation. People forget or leave
    • Declarative config is easier to formally verify
  • Power laws
    • Errors propagate
    • Make system more compartmentalized so failure stays as small as possible
    • Exponential backoff – an integral. Limit retries
    • Rate limiters – token buckets can be effective
    • Working backpressure – AWS SDK retry strategy = Token buckets + rate limiters + persistent state
    • Recommend AWS article
  • Liveness and lag
    • Operating on old info can be worse than operating on no info. The environment changes, so stale data can be worse than an average.
    • Temporary shocks such as a spike or momentary outage can take time to recover from
    • Want constant time scaling as much as possible. Not always possible
    • Short queues are safer
    • LIFO is good for prioritizing recent data. Can do an out-of-order backfill to catch up
  • False functions
    • Want to move in a predictable way that you control.
    • UNIX load metric is evil. System and network latency aren’t good either. Need to look at underlying metrics. CPU is a good metric
  • Edge triggering
    • Triggers only at edge
    • Good for alerting humans.
    • Bad for software as it only kicks in at times of high stress.
    • How do you ensure “deliver exactly once”?
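
The “exponential backoff + limit retries + token bucket” combination from the power laws bullets can be sketched roughly as below. This is my own illustration, not the AWS SDK's code: `TokenBucket`, `call_with_retries`, and every parameter value are hypothetical names and numbers picked for the example.

```python
import random
import time


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_take(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity, then try to spend.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def call_with_retries(op, retry_budget: TokenBucket, max_attempts: int = 4,
                      base_delay: float = 0.1, max_delay: float = 5.0, sleep=time.sleep):
    """Run `op`, retrying failures with capped exponential backoff and jitter.

    Every retry spends a token from a bucket shared by all callers, so when a
    dependency is failing broadly, retries dry up instead of multiplying load.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not retry_budget.try_take():
                raise  # give up and surface the original failure
            # Wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The point of sharing one `retry_budget` across calls is that the first failures get retried, but a sustained outage quickly empties the bucket and later calls fail fast, which keeps a small failure from turning into a retry storm.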

My impression

This talk was great. I encountered PID in robotics. Seeing it applied to our field was cool. All the things AWS thinks about in the environment were fascinating as well. Makes you happy as a user 🙂

[QCon 2019] ML Panel

Hein Lu @ LinkedIn, Brad Mitro @ Google, Jeff Smith @ Facebook

Getting Started

  • People with other strong IT skills switched over
  • Can learn from books, Coursera, Udacity, grad school
  • Look for specific applications
  • Domain is very large
  • Learn libraries, existing datasets
  • Understand where the organization is at. Ex: wants to do ML vs. has a specific problem
  • Focus on how you will deliver business value

General

  • Many problems repeat so can get ideas from others
  • Important to have organizational alignment
  • Make sure to train on realistic data
  • Deep learning is a very successful use case of ML
  • “AI is the new electricity”
  • Limits of Moore’s law. Physical limitations with Quantum
  • Research on how to get algorithms to train themselves

Tools

  • PyTorch Hub

Learning resources

  • Jeff’s book – Machine Learning Systems
  • Andrew Ng’s Coursera ML course
  • Coming out this year “AI is for everyone”

Q&A

  • How learn without business case? How know what don’t know? Many educational resources start generally. Can skip some core concepts and learn later.
  • How pick good training data? Iterate on testing. Important to keep training with new data
  • Data heuristics? How much data? How many labels?
  • How make more agile? Use a pretrained model to start. Exist as a service or pull in via code
  • How know when good enough? Sometimes you have to just try. Or look to those who solved similar problems
  • Tech stack? Hardware acceleration. Libraries
  • Fraud? Retrain data

My impressions

This was a good panel. Interesting responses. One panelist was missing, but it came out well

[QCon 2019] Not Sold Yet: GraphQL

Speaker: Garrett Heinlein from Netflix

Use case at Netflix

  • Takes big bets
  • Organize data as single entity graphs
  • Wanted to merge graphs so can cross query
  • Early on in GraphQL journey

Team dynamics

  • “Monoliths are great” – single code base, atomic changes, simple deployments. “It might be a big ball of mud, but you love that ball of mud”
  • Can’t have over a dozen people working on one system
  • Microservices reduce costs with smaller teams and lessen communication overhead
  • REST APIs require care when the spec changes

GraphQL

  • Express what is possible
  • Can get just the data you need
  • Schema is the source of truth
  • Can focus on product
  • Optimize exploration over documentation

Disadvantages

  • Rewriting code. But don’t have to change everything
  • Multiple entity graphs require managing release cycles

Consider

  • Designing graphs
  • Talk to others who have already done it
  • Whether focus is data or clients
  • UI or entity centric schema
  • Who owns the schema? ex: ivory tower committee, informed captain per entity
  • Distributed writes. Reading is far easier. For now Netflix is limiting updates to a single entity
  • Error handling

Q&A [he left a lot of time for Q&A which was good because lots of good questions!]

  • Performance? Can do rate limiting. Can whitelist allowed queries for production. Batching. Recommends the Apollo product.
  • Concurrency and parallelism? GraphQL planner (like explain) so can optimize query. Complicated. Will be open sourced [missed product name]
  • Lazy problems with different sources? If know something can fail, put that error state into the schema
  • Multiple teams using entity with different view? Each team owns subtype and one team owns main type.
  • Business logic errors? Evolving topic. Can “or” return type so can be User type *or* “404 type”. Then calling code has different logic based on return type.
  • How deal with breaking changes in Federated graph? Apollo helps with this. Can report on specifications and performance based on production usage
  • Team communication for design? Working groups for things like pagination. Now have a federated team to build entity graphs, and teams maintain their entities.
  • How avoid duplication? All using GraphQL. Not doing it long enough to have bloat.
  • Deprecation strategy? Netflix doesn’t tell people what to do. Developers choose. Thinks Facebook never deprecated anything internally. Hasn’t encountered yet at Netflix.
  • Need to write a resolver for each query. The bigger benefit is for clients
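
The “or” return type from the business logic errors answer maps to a union type in a GraphQL schema (something like `union UserResult = User | NotFound`). As a rough sketch of the calling-code side, here is the same idea in plain Python with no GraphQL library involved. The `User`, `NotFound`, `resolve_user`, and `render` names and the in-memory data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class User:
    id: str
    name: str


@dataclass
class NotFound:
    """Stands in for the "404 type": the failure is part of the schema."""
    message: str


# The result is a User *or* a NotFound, never an exception for the caller.
UserResult = Union[User, NotFound]

_USERS = {"1": User(id="1", name="Ada")}  # hypothetical in-memory data


def resolve_user(user_id: str) -> UserResult:
    user = _USERS.get(user_id)
    if user is None:
        return NotFound(message=f"no user with id {user_id}")
    return user


def render(result: UserResult) -> str:
    # Calling code branches on the returned type, as described in the answer.
    if isinstance(result, User):
        return f"Hello, {result.name}"
    return f"Error: {result.message}"
```

Here `render(resolve_user("1"))` takes the `User` branch and `render(resolve_user("2"))` takes the `NotFound` branch, so the known failure mode is visible in the types rather than hidden in an error payload.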

My impression

I’ve used GraphQL with GitHub as a consumer. It was interesting “formalizing” that knowledge by learning why another company started using it, along with some pros and cons. They are early in the journey. I’d be interested to see this talk in another year or two after they have more experience. I enjoyed the talk as is though. It provided good insights and things to think about.