[QCon 2019] PID loops and the art of keeping systems stable

Posted on June 24, 2019 by Jeanne Boyarsky

Colm MacCarthaigh @colmmacc from Amazon

For other QCon blog posts, see QCon live blog table of contents

Control Theory

PID loops are from control theory
Feedback loop – present, observe, feedback, react
Hundred-year-old field
Different fields each claim to have invented it. Then they realized they had the same equations and approaches

Furnace example

Classic example is a furnace. Want to get tank to a desired temperature. Measure water temperature and react by raising/lowering heat.
Could just put over fire until done. But will cool off too fast.
Can overheat because of lag
Focus on error – distance from the current state to the desired state.

PID

P = Proportional. Make change proportional to the error
A P-only controller is not stable because it oscillates a lot
I = Integral. Oscillates far less
D = Derivative. Damps the remaining oscillation
In real world, PI controllers are often sufficient.
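
To make the three terms concrete, here is a minimal sketch of a textbook discrete PID update applied to a toy water tank. The gains, the tank model, and the names `PIDController` and `simulate_tank` are assumptions made up for illustration; they are not from the talk.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller. All gains are illustrative assumptions."""
    kp: float          # proportional gain
    ki: float          # integral gain
    kd: float          # derivative gain
    setpoint: float    # desired temperature
    integral: float = 0.0
    previous_error: float = 0.0

    def update(self, measurement: float, dt: float) -> float:
        # Error is the distance from the current state to the desired state.
        error = self.setpoint - measurement
        # I term: accumulate the error over time.
        self.integral += error * dt
        # D term: rate of change of the error.
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_tank(controller, start_temp=20.0, ambient=20.0, steps=2000, dt=0.1):
    """Hypothetical tank: the heater warms it and heat leaks to the room."""
    temp = start_temp
    for _ in range(steps):
        heat = max(0.0, controller.update(temp, dt))  # a heater cannot cool
        temp += (heat - 0.1 * (temp - ambient)) * dt
    return temp


p_only = simulate_tank(PIDController(kp=1.0, ki=0.0, kd=0.0, setpoint=60.0))
pi = simulate_tank(PIDController(kp=1.0, ki=0.5, kd=0.0, setpoint=60.0))
```

This toy tank has no measurement lag, so it does not reproduce the oscillation described above. What it does show is that the P-only run settles a few degrees short of the target, while adding the I term closes that gap.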

Comes up in context of (but rarely applied)

Autoscaling and placement (instances, storage, network, etc) – daily or weekly load pattern/cycle. ML can infer what will happen in future. The I component forecasts what will happen next.
Fairness algorithms (TCP, queues, throttling). Can scale elastically by rest region. Figure out capacity of each site and peak usage. Ensure no sites are overwhelmed. CloudFront looks at how close each site is to capacity and treats that as the error.
Systems stability

Anti-patterns

Open loops
- Should check that the requested action occurs.
- How would you know if something suddenly went wrong?
- “Sometimes they do things, but they don’t know why. So they pressed another magic button and it fixes everything”.
- Frequently occurs for infrequent actions or for historic reasons.
- Close the loop by measuring everything you can think of
- Make infrequent operations more frequent (Ex: chaos engineering)
- “if we have something that is happening once a year, it is doomed to failure” ex: 1-3 year certificate rotation. People forget or leave
- Declarative config is easier to formally verify
Power laws
- Errors propagate
- Make system more compartmentalized so failure stays as small as possible
- Exponential backoff – an integral. Limit retries
- Rate limiters – token buckets can be effective
- Working backpressure – AWS SDK retry strategy = Token buckets + rate limiters + persistent state
- Recommend AWS article
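
As a rough illustration of two of the tools named in this list, the sketch below shows a token bucket and a capped exponential backoff with jitter. It is a generic version under assumed parameters, not the AWS SDK's actual retry strategy; `TokenBucket` and `backoff_delay` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
import random


class TokenBucket:
    """Generic token bucket: allows short bursts, limits the sustained rate."""

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity      # start full
        self.last_refill = 0.0      # timestamp of the last refill, in seconds

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, never above capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should shed or delay the request


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Capped exponential backoff with full jitter (assumed base and cap)."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


bucket = TokenBucket(capacity=2, refill_per_second=1.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(3)]   # third call is refused
later = bucket.try_acquire(now=1.0)                        # one token has refilled
```

A caller would typically combine the two: retry only while the bucket grants a token, and sleep for `backoff_delay(attempt)` between attempts, up to a fixed retry limit.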
Liveness and lag
- Operating on old info can be worse than operating on no info. Environment changes so can be worse than an average.
- Temporary shocks such as spike or momentary outage can take time to recover
- Want constant time scaling as much as possible. Not always possible
- Short queues are safer
- LIFO is good for prioritizing recent data. Can do out-of-order backfill to catch up
False functions
- Want to move in a predictable way that you control.
- UNIX load metric is evil. System and network latency aren’t good either. Need to look at underlying metrics. CPU is a good metric
Edge triggering
- Triggers only at edge
- Good for alerting humans.
- Bad for software as only kicks in at time of high stress.
- How do you ensure “deliver exactly once”?
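
The edge triggering point can be sketched by comparing it with level triggering, where the condition is re-evaluated on every reading. The notes do not name level triggering, so treat the contrast, the function names, and the sample readings as assumptions added for illustration.

```python
def edge_triggered(readings, threshold):
    """Return indexes where the reading crosses above the threshold."""
    events = []
    above = False
    for i, value in enumerate(readings):
        now_above = value > threshold
        if now_above and not above:   # fire only on the transition
            events.append(i)
        above = now_above
    return events


def level_triggered(readings, threshold):
    """Return every index where the reading is above the threshold."""
    return [i for i, value in enumerate(readings) if value > threshold]


readings = [1, 5, 6, 7, 2, 8]
edges = edge_triggered(readings, threshold=4)
levels = level_triggered(readings, threshold=4)
```

If the single edge event is lost, an edge-triggered consumer never learns that the system is still over the threshold, which is where the delivery question above comes from. A level-triggered consumer sees the condition again on the next reading.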

My impression

This talk was great. I encountered PID in robotics. Seeing it applied to our field was cool. All the things AWS thinks about in the environment were fascinating as well. Makes you happy as a user 🙂

[QCon 2019] ML Panel

Posted on June 24, 2019 by Jeanne Boyarsky

Hein Lu @LinkedIn, Brad Mitro @Google, Jeff Smith @Facebook

For other QCon blog posts, see QCon live blog table of contents

Getting Started

People with other strong IT skills switched over
Can learn from books, Coursera, Udacity, grad school
Look for specific applications
Domain is very large
Learn libraries, existing datasets
Understand where the organization is at. Ex: wanting to do ML vs solving a specific problem
Focus on how will deliver business value

General

Many problems repeat so can get ideas from others
Important to have organizational alignment
Make sure to train on realistic data
Deep learning is very successful use case of ML
“AI is the new electricity”
Limits of Moore’s law. Physical limitations with Quantum
Research on how to get algorithms to train themselves

Tools

PyTorch Hub

Learning resources

Jeff’s book – Machine Learning Systems
Andrew Ng’s Coursera ML course
Coming out this year “AI is for everyone”

Q&A

How learn without business case? How know what don’t know? Many educational resources start generally. Can skip some core concepts and learn later.
How pick good training data? Iterate on testing. Important to keep training with new data
Data heuristics? How much data? How many labels?
How make more agile? Use a pretrained model to start. Exist as a service or pull in via code
How know when good enough? Sometimes you have to just try. Or look to those who solved similar problems
Tech stack? Hardware acceleration. Libraries
Fraud? Retrain data

My impressions

This was a good panel. Interesting responses. One panelist was missing, but it came out well

[QCon 2019] Not Sold Yet: GraphQL

Posted on June 24, 2019 by Jeanne Boyarsky

Speaker: Garrett Heinlein from Netflix

For other QCon blog posts, see QCon live blog table of contents

Use case at Netflix

Takes big bets
Organize data as single entity graphs
Wanted to merge graphs so can cross query
Early on in GraphQL journey

Team dynamics

“Monoliths are great” – single code base, atomic changes, simple deployments. “It might be a big ball of mud, but you love that ball of mud”
Can’t have over a dozen people working on one system
Microservices reduce costs with smaller teams and lessen communication
REST APIs require care when making spec changes

GraphQL

Express what is possible
Can get just the data you need
Schema is the source of truth
Can focus on product
Optimize exploration over documentation

Disadvantages

Rewriting code. But don’t have to change everything
Multiple entity graphs require managing release cycles

Consider

Designing graphs
Talk to others who have already done it
Whether focus is data or clients
UI or entity centric schema
Who owns the schema? ex: ivory tower committee, informed captain per entity
Distributed writes. Reading is far easier. For now Netflix is limiting updates to a single entity
Error handling

Q&A [he left a lot of time for Q&A which was good because lots of good questions!]

Performance? Can do rate limiting. Can whitelist allowed queries for production. Batching. Recommends Apollo product.
Concurrency and parallelism? GraphQL planner (like explain) so can optimize query. Complicated. Will be open sourced [missed product name]
Lazy problems with different sources? If know something can fail, put that error state into the schema
Multiple teams using entity with different view? Each team owns subtype and one team owns main type.
Business logic errors? Evolving topic. Can “or” return type so can be User type *or* “404 type”. Then calling code has different logic based on return type.
How deal with breaking changes in Federated graph? Apollo helps with this. Can report on specifications and performance based on production usage
Team communication for design? Working groups for things like pagination. Now have federated team to build entity graphs and teams maintain their entities.
How avoid duplication? All using GraphQL. Not doing long enough to have bloat.
Deprecation strategy? Netflix doesn’t tell people what to do. Developers choose. Thinks Facebook never deprecated anything internally. Hasn’t encountered yet at Netflix.
Need to write resolver for each query. Bigger benefit is for clients
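
The business logic errors answer above (a return type that is a User *or* a “404 type”) can be sketched outside GraphQL as a plain union type. This is an analogy in Python, not Netflix's schema or any GraphQL library's API; `User`, `NotFound`, `resolve_user`, and `render` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class User:
    id: str
    name: str


@dataclass
class NotFound:
    """Stand-in for the "404 type": an expected failure, modeled as data."""
    message: str


# The resolver returns one type *or* the other, like a union in a schema.
UserResult = Union[User, NotFound]


def resolve_user(user_id: str, store: Dict[str, str]) -> UserResult:
    if user_id in store:
        return User(id=user_id, name=store[user_id])
    return NotFound(message=f"no user with id {user_id}")


def render(result: UserResult) -> str:
    # Calling code branches on the type that came back.
    if isinstance(result, User):
        return f"Hello, {result.name}"
    return f"Error: {result.message}"


store = {"1": "Ada"}
found = render(resolve_user("1", store))
missing = render(resolve_user("2", store))
```

The design choice is that a known failure mode becomes part of the declared return type, so clients have to handle it explicitly rather than discover it at runtime.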

My impression

I’ve used GraphQL with GitHub as a consumer. It was interesting “formalizing” that knowledge by learning why another company started using it. Along with some pros and cons. They are early in the journey. I’d be interested to see this talk in another year or two after they have more experience. I enjoyed the talk as is though. It provided good insights and things to think about.

Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Java/J2EE Software Development and Technology Discussion Blog