DevOps @ Scale – Oracle Code NYC live blog | Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Title: DevOps at Scale; Greek Tragedy in Three Acts
Speakers: Baruch Sadogursky (JFrog), Alena Prokharchyk (Rancher Inc)

Slides: https://jfrog.com/shownotes/

General notes

Reactive ops

Imaginary three-person company, eager to learn
Buzzwords – serverless, no ops. Both sound like don’t need to do anything. But…
Basic tools all in cloud – Jira, GitHub, Travis CI, Oracle Cloud
Challenge: many logs because of microservices
Challenge: time zone of cloud provider vs your time zone
Two years ago “the internet broke” – a dependency (left-pad) was pulled from the npm registry
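
A minimal sketch of one way to handle the time zone challenge: normalize every log timestamp to UTC before comparing entries across services. The service names, log lines, and helper functions (`to_utc`, `merge_logs`) are hypothetical and were not shown in the talk.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Refuse to guess: a timestamp without an offset is ambiguous.
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def merge_logs(*services):
    """Merge (timestamp, message) entries from several services, ordered by UTC time."""
    merged = [(to_utc(ts), msg) for entries in services for ts, msg in entries]
    return sorted(merged, key=lambda entry: entry[0])


# Hypothetical entries: one service logs with a -05:00 offset, the other in UTC.
orders = [("2018-03-08T09:15:00-05:00", "orders: payment timeout")]
billing = [("2018-03-08T14:14:30+00:00", "billing: connection pool exhausted")]

for when, message in merge_logs(orders, billing):
    print(when.isoformat(), message)
```

Read naively, the orders entry (09:15) looks five hours earlier than the billing entry (14:14). Once both are in UTC, the billing error turns out to come first, which matters when tracing a failure across microservices.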

Scale

Hire ops person
Grow staff
Add Scrum
Add exploratory testing [surprised hardly anyone knew what this was; it’s just trying stuff and testing]
Developer on call
More tools – Confluence, Artifactory, Sumo Logic (analyze logs; like Splunk), Pingdom (monitoring)

More maturity

Root cause analysis – includes symptoms, what happened leading up to the problem, and next steps to prevent it from happening again
Want to have new problems; not the same one happening over and over
Retrospective
Importance of disclosure – ex: GitLab lost 6 hours of data last year. Were forgiven because they were so transparent

Perfect storm

More scale – 5 ops people, 1 performance engineer, 74 developers, chief architect, customer success team (bridge between developers and customers; like developer on call, but a bunch of them who know the system)
SAFe – Scaled Agile Framework (agile practices at scale)
System testing
Ops team – two ways to do devops: 1) Hire the brightest engineers in the world (like Netflix) who know dev and ops perfectly. Rare to be able to do this. 2) Specializations exist. The ops people set up everything and then evangelize it so devops can happen. Often called the tools team.
Escalation path: SME and manager on call. The manager can work on relationships. Also makes fixing faster since know will be escalated to management.
SOC 2 – audit standard for service organizations. Requires separation of duties so people who write code cannot deploy to prod. Can’t write code and control the system that deploys it. Interesting. Ex: the tooling team doesn’t allow skipping integration tests. So the people writing apps don’t control the deployment pipeline
Problem: Need to find out if have any code that uses a certain license. Lots of work to do manually. [easy if you have IQServer! Or a JFrog product; I didn’t catch which one]
Problem: Guessing how much to spend on servers for next year. Guess. Nobody will shut down server if need more resources
Problem: Will it scale. Guess. False confidence. It didn’t. Greek tragedy; everyone dies.
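
To make the license problem concrete, here is a rough sketch of the question being asked: given an inventory of dependencies and their declared licenses, which ones use a certain license? The inventory, package names, and `find_by_license` helper are all hypothetical. A real tool would build the inventory by scanning artifacts and their transitive dependencies, which is the part that is painful to do manually.

```python
def find_by_license(inventory, license_id):
    """Return the names of dependencies whose declared license matches license_id."""
    wanted = license_id.strip().lower()
    return sorted(
        name
        for name, declared in inventory.items()
        if declared.strip().lower() == wanted
    )


# Hypothetical inventory: dependency name -> declared license identifier.
inventory = {
    "example-http-client": "Apache-2.0",
    "example-json-parser": "MIT",
    "example-pdf-writer": "GPL-3.0",
    "example-image-tools": "gpl-3.0",
}

print(find_by_license(inventory, "GPL-3.0"))
```

The lookup is the easy part. The sketch assumes licenses are already recorded with consistent identifiers, which is rarely true without tooling.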

To avoid problems

Performance/scalability testing
License and security management
Code in monitoring. Ex: docker. By extending base image, get things built in.
Tools support process – JFrog commercial :).
Showed pie chart of where time goes. Interesting way of looking at fragmentation. The pie chart had a lot of slices!
Can’t have a non-functional definition of done
Majority of industry is in fire alarm/reactive improvement mode. But still strive for proactive improvement. It is hard and expensive.
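
As an illustration of replacing the "will it scale" guess with a measurement, here is a minimal load test sketch. It calls a stand-in function at increasing concurrency and reports latency percentiles. `handle_request` is a hypothetical placeholder for the real call under test, and the concurrency levels are arbitrary. A real scalability test would target a production-like environment with realistic traffic.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median, quantiles


def handle_request(payload: int) -> int:
    """Hypothetical stand-in for the call under test."""
    time.sleep(0.001)
    return payload * 2


def timed_call(payload: int) -> float:
    """Return the wall-clock seconds one call takes."""
    start = time.perf_counter()
    handle_request(payload)
    return time.perf_counter() - start


def run_load(concurrency: int, requests: int) -> dict:
    """Issue `requests` calls using `concurrency` worker threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies, n=20)[-1]
    return {"requests": len(latencies), "median_s": median(latencies), "p95_s": p95}


for level in (1, 8, 32):
    print(level, run_load(concurrency=level, requests=64))
```

Watching how the median and 95th percentile move as concurrency rises gives data to argue from, in the spirit of "data is the key".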

Takeaways:

Must be responsible for what you build. “You build it; you own it”
Data is the key. Even if it is in Excel.
Pain is instructional. Results in improvements. Continuous improvement. If something hurts, do it more often.

My take

This was a fun start. It tells a great, entertaining story and includes information. Watch the video. Neither my notes nor the slides do it justice!
