[uberconf 2023] Architect’s Guide to Site Reliability Engineering

Speaker: Nathaniel Schutta

@ntschutta

For more, see the table of contents


General

  • Agile – do more of what works
  • Conflicting incentives – ops – “don’t change it if it works”
  • Monoliths to service-oriented to microservices – solved some problems and created new ones
  • SRE (site reliability engineer) – new role
  • We are good at giving something a new name and pretending we’ve never done it before. Ex: cloud computing – a big pile of compute and you get a slice of what you need – ex: mainframe
  • Everything we do involves other people. Most problems are people problems, which we tend to ignore

History

  • Goes back to an Apollo program problem. The first SRE was Margaret Hamilton. She wanted to add error checking but could only update the docs
  • Phone autocorrects on map – recalculating….
  • Traditionally systems run by system admins.
  • Now have hundreds/thousands of services
  • CORBA – facilitated communication between disparate systems on diverse platforms. Also a good definition for microservices. EJBs too. Then SOA
  • APIs exploded because we all have smartphones/supercomputers in our pockets
  • Amazon had a policy of everything being an API

Challenges

  • Who gets paged if a request goes through 20 services? It’s clearly another team’s problem
  • How do we monitor?
  • How do we debug?
  • How do we even find services?
  • We argue about definitions of made-up words. Ex: microservice. Nate likes the definition that it can be written in two weeks. How many services can a team support? If they change a lot, 4-6. If stable, 15-20
  • How do we define an application?
  • Conflicting incentives – devs want to release often
  • The way we do things might be the first way we tried rather than a better way.

SRE

  • What if we asked software devs to define an ops team?
  • Software engineering applied to operations
  • We did that with testing too; made it more like dev
  • Replace manual tasks with automation
  • Many SREs former software engineers.
  • Helpful to understand Linux
  • Can’t rely on a quarterly “Review Board”. This is a very slow quality gate, yet most orgs don’t audit the gate to see if it’s useful.
  • Goal: move fast, safely
  • Doesn’t happen in spare cycles “when we have time”. Can’t be on call all the time or doing tickets/manual work all the time.
  • Humans can’t do the same thing exactly the same way twice. Ex: a golf swing

What does SRE do/consider

  • Availability
  • Stability
  • Latency
  • Performance
  • Monitoring
  • Capacity planning
  • Emergency response
  • Understand SLOs
  • Embrace/manage/mitigate risk. Risk is a continuum and a business decision
  • Short term vs long term thinking. Heroics work for a while, but aren’t sustainable. Often better to lower SLOs for a short time to come up with a better solution.
  • Focus on mean time to recovery. No such thing as 100% uptime.
  • A runbook is helpful. Not everyone is an expert on the system, and even if you are, your brain doesn’t work well in the middle of the night or under pressure. A playbook produces a 2x improvement in mean time to recovery
  • People fall to their level of training and react worse when stressed.
  • Alerting. Need to know what is important/critical and when it is important. Ex: you can ignore the car’s oil change message for a bit, but not for too long
  • Logging best practices. Logs tend to contain nothing useful or repeat the same thing 10x.
  • Four golden signals – latency, traffic level, error rate, saturation
  • Automate everything; manual toil drives people out of SRE
  • Which services are most important?
  • Establish an error budget. Can experiment while the service is more stable than the SLO requires; can’t deploy when the error budget is used up. Helps understand tradeoffs. (See the sketch after this list.)
  • Production readiness reviews.
  • Get everyone on the same page about what the service does – devs, architects, etc. Improves understanding and can surface bottlenecks
  • Checklists – quantifiable and measurable items.
  • Think about how it can fail and what happens if it does
  • Chaos engineering
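
A minimal sketch of how an error budget could be tracked, assuming a request-based availability SLO; the 99.9% target, request counts, and function name are illustrative, not from the talk:

```python
# Hypothetical error-budget tracker: an illustration of how an SLO turns
# into a budget of "allowed" failed requests for a given window.

def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return how much of the error budget a service has consumed."""
    allowed_failures = total_requests * (1 - slo)   # budget for this window
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "failed_requests": failed_requests,
        "budget_remaining": remaining,
        "can_deploy": remaining > 0,   # example policy: freeze feature deploys when exhausted
    }

# Example: a 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
print(error_budget(slo=0.999, total_requests=1_000_000, failed_requests=850))
```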

Outage impact

  • What do customers expect? Used to be 5×12 – 6am-6pm M-F. Most things are 24×7 now
  • What do competitors provide? Need to do same
  • Cost – more failover is super expensive
  • When cloud goes down, it is news
  • Depends on whether it needs a redundant backup. How much revenue would you lose while down versus the cost of the redundancy?

Postmortems

  • Don’t want to make the same mistakes. Learn from yours and others’. Avoid them becoming blame sessions
  • Outages will still happen; must learn from them so the same bad thing doesn’t happen again
  • Living documents – status as the outage is happening, impact to business, root causes
  • Tactical vs long term/strategic fix
  • Action items to avoid in future
  • Cultural issues
  • Wheel of misfortune – role play disaster, practice
  • Recognize people for participating
  • Need senior management to encourage
  • Provide a retro on postmortem to improve process
  • Education – if you already understood that, we’d give you something harder to do

SLO/SLA

  • 99% – 7.2 hours a month, 14.4 minutes a day
  • 99.9% – 8.76 hours a year, 1.44 minutes a day
  • 99.99% – 4.38 minutes a month; 8.64 seconds a day
  • 99.999% – 26.3 seconds a month; 864 millis a day (downtime math sketched after this list)
  • Google Kubernetes Engine availability is 99.5%. Can’t exceed your service provider
  • SLA (service level agreement) means there is a financial consequence for missing it. Otherwise, it is an SLO
  • More nines isn’t always better; can’t promise infinity
  • Can tighten later; hard to loosen
  • If a system exceeds its SLA, you can’t rely on that. It could stop at any time.
  • Might have internal SLO that is tighter than the advertised one.
  • Everyone wants five nines until they see the cost. “If you have to ask, you can’t afford it”
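
The downtime figures above are just arithmetic on the availability target. A quick sketch, assuming a 365.25-day year and an average-length month (the talk’s figures round slightly differently):

```python
# Convert an availability target ("number of nines") into allowed downtime.
# Assumes a 365.25-day year and an average 30.44-day month.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12
SECONDS_PER_DAY = 24 * 3600

def allowed_downtime(availability: float) -> dict:
    """Seconds of downtime allowed per year/month/day for a given availability."""
    unavailability = 1 - availability
    return {
        "per_year_s": SECONDS_PER_YEAR * unavailability,
        "per_month_s": SECONDS_PER_MONTH * unavailability,
        "per_day_s": SECONDS_PER_DAY * unavailability,
    }

for slo in (0.99, 0.999, 0.9999, 0.99999):
    d = allowed_downtime(slo)
    print(f"{slo:.3%}: {d['per_month_s'] / 60:.2f} min/month, {d['per_day_s']:.2f} s/day")
```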

Fitness functions

  • Tests to make sure the architecture still does what you want it to do
  • If you know when it breaks, you can tie it back to a code change and fix it. A sketch follows.
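
One way to express an architectural fitness function is as an automated test. A minimal sketch, assuming a hypothetical layered codebase under `src/` with `web` and `db` packages; the layer names and the rule being checked are illustrative, not from the talk:

```python
# Hypothetical architectural fitness function: verify that web-layer modules
# never import the database layer directly.
import ast
import pathlib

FORBIDDEN = {"web": {"db"}}  # web code must not import the db layer directly

def imported_top_level_packages(source: str) -> set[str]:
    """Collect the top-level package names imported by a module."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            packages.add(node.module.split(".")[0])
    return packages

def test_web_layer_does_not_import_db(root: str = "src") -> None:
    for path in pathlib.Path(root, "web").rglob("*.py"):
        banned = imported_top_level_packages(path.read_text()) & FORBIDDEN["web"]
        assert not banned, f"{path} imports forbidden layer(s): {banned}"

if __name__ == "__main__":
    test_web_layer_does_not_import_db()
    print("Architecture fitness function passed.")
```

Because it runs like any other test, a failure shows up in CI and can be tied back to the code change that broke the rule.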

Next steps

  • Build an SRE team if don’t have one
  • Applications changing rapidly
  • Need to enable environment to move fast and safely
  • Must work well together

My take

Nate’s style is a ton of slides with mostly a few words/sentences on each. It’s a fun style. It also means the font size is super large and I don’t need to wear my glasses for most slides. I had to step out for a few minutes for the restroom. [I’ve been doing an excellent job hydrating!] Hard to step out, but easy to catch up when I came back.
