[kcdc 2022] getting started with site reliability engineering

Speaker: Shradha Khard

For more, see the table of contents

Notes

  • Site Reliability Engineering
  • Operations is a software problem.
  • SRE is what you get when you treat ops as software and staff it with software engineers
  • Software dev: idea -> strategy -> dev (design, code, test) -> ops (build, deploy, support) -> deliver (real world)
  • Ops – maintenance, system upgrades and installs, security, compliance, cost, support help desk escalations, vendor contracts
  • Conflict – dev wants new features, ops wants to make sure nothing breaks

DevOps

  • SRE implements DevOps.
  • SRE is a substream of DevOps
  • Ensures a durable focus on engineering. Need to make sure the product is up and running. Spend 50% of time on automation to make sure that happens
  • ex: augment S3 bucket
  • See how fast you can make changes without violating the SLO
  • Error budget – metric for how unreliable a system is allowed to be
  • Monitoring is not just logging in the system. Need alerting and ticketing too
  • Change management
  • Demand forecasting/capacity planning
  • Provisioning
  • Efficiency and Performance
  • SRE doesn’t replace DevOps people who deploy to cloud
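
To make the error budget idea concrete, here is a minimal sketch of how one might track it for a request-based SLO. It was not shown in the talk; the function name and the sample numbers are made up for illustration.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo is the target success ratio (e.g. 0.999). The budget is the number of
    requests allowed to fail in the window: (1 - slo) * total_requests.
    A negative result means the budget is overspent and the SLO is violated.
    """
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failures are allowed.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget is left")
```

While the remaining budget is positive, the team can keep shipping changes; once it goes negative, the usual response is to slow releases and focus on reliability.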

Enabling SRE/How to Start

  • Centralized SRE team (core platform, networking)
  • Embedded (full team members of project team, teach devs how to manage, work with core team)
  • Need same skillset as dev to be SRE

Metrics

  • MTTR – mean time to recovery – how long to get system healthy again. Emergency response helps with this
  • Lead time to release or rollback
  • Improve monitoring to catch and detect issues earlier
  • Establish an error budget to have budget-based risk management
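
As a rough sketch of how MTTR could be computed, assuming each incident is recorded as a (started, recovered) pair of timestamps. The record format and the sample incidents below are invented for illustration, not taken from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the time between an incident starting and the system being healthy again."""
    durations = [recovered - started for started, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: one 30 minute outage and one 90 minute outage.
incidents = [
    (datetime(2022, 8, 1, 9, 0), datetime(2022, 8, 1, 9, 30)),
    (datetime(2022, 8, 3, 14, 0), datetime(2022, 8, 3, 15, 30)),
]
print(mean_time_to_recovery(incidents))
```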

Service levels

  • SLA (service level agreement) – legal agreement. Often involves compensation if not met
  • SLO (service level objective) – target value the SLI should meet; if the SLI falls short, the service needs improvement
  • SLI (service level indicator) – metric over time. Quantitative measure – ex: throughput, latency, error rate, utilization
  • 3 nines (99.9%) – 10 minutes of downtime per week, 8.8 hours per year
  • 4 nines (99.99%) – 1 minute per week, 52 minutes per year
  • 5 nines (99.999%) – 6 seconds per week, 5 minutes per year
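
The downtime figures for each number of nines follow from simple arithmetic: allowed downtime is (1 - availability) times the length of the period. A small sketch, assuming a 365-day year (the names here are my own):

```python
WEEK_SECONDS = 7 * 24 * 3600
YEAR_SECONDS = 365 * 24 * 3600  # assumes a non-leap year


def allowed_downtime_seconds(availability: float, period_seconds: int) -> float:
    """Seconds of downtime permitted in a period at the given availability (e.g. 0.999)."""
    return (1 - availability) * period_seconds


for label, availability in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    per_week = allowed_downtime_seconds(availability, WEEK_SECONDS)
    per_year = allowed_downtime_seconds(availability, YEAR_SECONDS)
    print(f"{label}: {per_week / 60:.2f} min/week, {per_year / 3600:.2f} h/year")
```

The results agree with the rounded numbers in the list above, for example about 10 minutes per week and 8.76 hours per year for 3 nines.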

Incident Management

  • Goals: Restore service to normal and minimize business impact
  • Be able to get the people who can help solve it
  • Log of events so you can see when it started
  • Blameless post mortems

Books

  • Google book “Seeking SRE”
  • Google book “The Site Reliability Workbook”
  • Book: Implementing Service Level Objectives

My take

There was a lot of info, but it was easy to follow. It was great to see a structured intro vs. the random things I’ve read online.
