[kcdc 2022] getting started with site reliability engineering

Speaker: Shradha Khard

For more, see the table of contents

Notes

Site Reliability Engineering
Operations is a software problem.
SRE is what you get when you treat ops as software and staff it with software engineers
Software dev: idea -> strategy -> dev (design, code, test)-> ops(build, deploy, support) -> deliver (real world)
Ops – maintenance, system upgrades and installs, security, compliance, cost, support help desk escalations, vendor contracts
Conflict – dev wants new features, ops wants to make sure nothing breaks

DevOps

SRE implements DevOps.
SRE is a substream
Ensures durable focus on engineering. Need to make sure the product is up and running. Spend 50% of time on automation to make sure that happens
ex: augment S3 bucket
See how fast you can make changes without violating the SLO
Error budget – metric for how unreliable a system is allowed to be
Monitoring is not just logging in the system. Need alerting and ticketing too
Change management
Demand forecasting/capacity planning
Provisioning
Efficiency and Performance
SRE doesn’t replace DevOps people who deploy to cloud
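
To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It was not shown in the talk: the 99.9% SLO, the request counts, and the function names are all assumptions picked for illustration. The budget is simply the share of requests the SLO allows to fail, and changes can keep shipping while some of it is left.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the measurement window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is violated."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("window too small to allow any failures at this SLO")
    return (budget - failed_requests) / budget


def can_ship(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical release gate: allow risky changes only while budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Assumed numbers: 99.9% SLO, one million requests in the window, 250 failures.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"budget left: {remaining:.0%}, ok to ship: {can_ship(0.999, 1_000_000, 250)}")
```

A real setup would pull the counts from monitoring rather than hard-coding them. The point is only that the budget turns "how reliable is reliable enough" into a number that dev and ops can both plan against.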

Enabling SRE/How to Start

Centralized SRE team (core platform, networking)
Embedded (full team members of project team, teach devs how to manage, work with core team)
Need same skillset as dev to be SRE

Metrics

MTTR – mean time to recovery – how long to get system healthy again. Emergency response helps with this
Lead time to release or rollback
Improve monitoring to catch and detect issues earlier
Establish error budget to have budget-based risk management
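
As a rough illustration of the MTTR metric above, it can be computed as the average gap between when an incident was detected and when the system was healthy again. This is a sketch under assumptions: the incident timestamps are made up and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered - detected) over a list of (detected, recovered) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((recovered - detected for detected, recovered in incidents), timedelta())
    return total / len(incidents)


# Made-up incident log: (detected, recovered)
incidents = [
    (datetime(2022, 8, 1, 9, 0), datetime(2022, 8, 1, 9, 30)),     # 30 minutes
    (datetime(2022, 8, 3, 14, 0), datetime(2022, 8, 3, 15, 30)),   # 90 minutes
    (datetime(2022, 8, 7, 22, 15), datetime(2022, 8, 7, 22, 45)),  # 30 minutes
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:50:00
```

Tracking this number over time is what shows whether better monitoring and emergency response are actually paying off.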

Service levels

SLA (service level agreement) – legal agreement. Often involves compensation if not met
SLO (service level objective) – target value the SLI should meet; falling short means improvement is needed
SLI (service level indicator) – metric over time. Quantitative measure – ex: throughput, latency, error rate, utilization
3 nines (99.9%) – 10 minutes per week, 8.8 hours per year
4 nines (99.99%) – 1 minute per week, 52 minutes per year
5 nines (99.999%) – 6 seconds per week, 5 minutes per year
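
The downtime figures for each number of nines come from multiplying the allowed unavailability by the length of the period. Here is a small sketch that reproduces them, assuming a 7-day week and a 365-day year. The function and constant names are mine, not from the talk.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Seconds of downtime permitted in the period at the given availability."""
    return (1 - availability_percent / 100) * period_seconds


for label, pct in [("3 nines", 99.9), ("4 nines", 99.99), ("5 nines", 99.999)]:
    weekly = allowed_downtime_seconds(pct, SECONDS_PER_WEEK)
    yearly = allowed_downtime_seconds(pct, SECONDS_PER_YEAR)
    print(f"{label} ({pct}%): {weekly:.1f} s/week, {yearly / 60:.1f} min/year")
```

The results line up with the rounded numbers in the list: roughly 605 seconds (about 10 minutes) per week for 3 nines, about 60 seconds for 4 nines, and about 6 seconds for 5 nines.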

Incident Management

Books

My take

There was a lot of info, but it was easy to follow. It was great to see a structured intro vs the random things I’ve read online

Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky