[QCon 2019] The Trouble with Learning in Complex Systems

Jason Hand from Microsoft – @jasonhand

For other QCon blog posts, see QCon live blog table of contents

Definitions

  • We use terms without everyone being on the same page about what they mean.
  • Ex: what does “complex” mean?
  • Types of systems
    • Distinguished by whether you can determine cause and effect
    • Ordered vs unordered
    • Ordered – Obvious (Can take it apart/put it back together. Know how works. ex: bicycle), complicated (ex: motorcycle)
    • Unordered – complex (ex: people on road, human body), chaotic (ex: NYC)
  • Sociotechnical systems – the people part is hard

Complex system

  • Causality can only be examined/understood/determined in hindsight
  • People are specialists, but lack a broad understanding of the whole system
  • Imperfect information
  • Constantly changing
  • Users are good at surprising us with what the system can/can’t do

Learning

  • Takes time
  • Takes success and failure. Need both
  • Learning opportunities not evenly distributed
  • Sample learning opportunities – code commits, config changes, feature releases and incident response. Commits occur much more often than incidents
  • However, the cost of recovery is low for the more frequent opportunities
  • High opportunity – low stakes and high frequency. Git push is muscle memory
  • Low opportunity – high stakes and low frequency
  • Frequency is what creates the opportunity

Incident

  • Everyone would agree impacting the customer is an incident
  • If it didn’t affect the customer, it is not always called an incident.
  • If it is not called an incident, there is no incident review.
  • Missed learning opportunity
  • We view incidents as bad.
  • Incidents are unplanned work.
  • Near misses save the day, but don’t get recognized or learned from
  • Systems are continuously changing; will never be able to remove all problems from system

Techniques to learn

  • Root cause analysis is insufficient. Like a post mortem, it is just about what went wrong.
  • Needs to be a learning review
  • Discuss language barriers, tools, confidence level, what people tried
  • Discuss what happened by time and the impact
  • ChatOps better than phone bridge because can capture what happened. Nobody is going to transcribe later. Having clean channel for communication helps.
  • However, incidents are not linear.
  • Book: Overcomplicated
  • If someone just does one thing, the learning doesn’t transfer. Need operational knowledge and mental models

Learning Reviews

  • Set context – not looking for answers/fixes. Looking for ways to learn even if no action items
  • Set aside time/effort to be curious
  • Asking linear questions (ex: five whys) doesn’t get you to the reality of the system
  • Invite people who weren’t part of incident response. They should still learn and can provide info about system
  • Understand and reduce blind spots

My impression

Good talk. It’s definitely thought provoking, and it suggests small things one can do to start making things better.

[QCon 2019] Low Latency in the Cloud, with OSS

Mark Price @epickrram

Requirements

  • Trading app
  • Need microsecond (not millisecond) response time
  • Need data in memory vs database
  • Lock free programming
  • Redundancy
  • High volume
  • Predictable latency

Hydra

  • System built on OSS
  • Opinionated framework to accelerate app dev
  • Clients communicate with stateless, scalable gateways
  • Persistors – manage data in memory.
  • Gateway – converts large text message to something smaller and more efficient

Design choices

  • Replay logs to reapply changes. Business logic must be fully deterministic. Bounded recovery times
  • Placement group in cloud – machines guaranteed to be near each other. Minimizes latency between nodes
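The replay idea in the first bullet can be sketched roughly as below. This is not Hydra's code: the event shapes, the `apply` and `replay` functions, and the account names are all made up for illustration. The point is only that if the business logic is a pure function of state and event, reapplying the log always rebuilds the same state, and starting from a snapshot keeps recovery time bounded.

```javascript
// Hypothetical event log for a toy ledger; these event shapes are invented
// for this sketch and are not from the talk.
const log = [
  { type: 'deposit', account: 'a', amount: 100 },
  { type: 'deposit', account: 'b', amount: 50 },
  { type: 'transfer', from: 'a', to: 'b', amount: 30 },
];

// Deterministic business logic: the next state depends only on the current
// state and the event. No clocks, random numbers, or I/O in here.
function apply(state, event) {
  const next = { ...state };
  switch (event.type) {
    case 'deposit':
      next[event.account] = (next[event.account] || 0) + event.amount;
      break;
    case 'transfer':
      next[event.from] = (next[event.from] || 0) - event.amount;
      next[event.to] = (next[event.to] || 0) + event.amount;
      break;
    default:
      throw new Error(`unknown event type: ${event.type}`);
  }
  return next;
}

// Recovery: start from a snapshot and reapply every event logged after it.
function replay(snapshot, events) {
  return events.reduce((state, event) => apply(state, event), snapshot);
}

const recovered = replay({}, log);
console.log(recovered); // { a: 70, b: 80 }
```

Replaying from an empty state or from a snapshot taken after the first two events gives the same result, which is what makes the log safe to use for recovery.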

Testing latency

  • Do as part of CD pipeline
  • Can’t physically monitor with a fiber tap
  • Capture in a histogram to get a statistical view and calculate percentiles
  • Test under load
  • Fan out where you test from
  • Store percentiles as time series data
  • Can see jitter from garbage collection
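Here is a small made-up example of the histogram idea, since the talk only described it. It is a minimal sketch, not the speaker's tooling: the `LatencyHistogram` class, the 10 microsecond bucket width, and the sample values are all assumptions. Latencies are counted into fixed-width buckets, and a percentile is read off by walking the buckets until enough samples have been seen.

```javascript
// Fixed-width bucket histogram for latencies in microseconds.
// Bucket width and max value are illustrative, not from the talk.
class LatencyHistogram {
  constructor(bucketWidthMicros, maxMicros) {
    this.width = bucketWidthMicros;
    this.counts = new Array(Math.ceil(maxMicros / bucketWidthMicros) + 1).fill(0);
    this.total = 0;
  }

  record(micros) {
    // Values past the max land in the final overflow bucket.
    const i = Math.min(Math.floor(micros / this.width), this.counts.length - 1);
    this.counts[i] += 1;
    this.total += 1;
  }

  // Upper bound of the bucket that contains the requested percentile.
  percentile(p) {
    if (this.total === 0) return NaN;
    const target = Math.max(1, Math.ceil((p * this.total) / 100));
    let seen = 0;
    for (let i = 0; i < this.counts.length; i++) {
      seen += this.counts[i];
      if (seen >= target) return (i + 1) * this.width;
    }
    return this.counts.length * this.width;
  }
}

const h = new LatencyHistogram(10, 1000);
// 98 fast requests and 2 slow ones (for example, hit by a GC pause).
for (let i = 0; i < 98; i++) h.record(45);
h.record(800);
h.record(900);
console.log(h.percentile(50), h.percentile(99), h.percentile(100)); // 50 810 910
```

The median looks fine at 50 microseconds while the 99th percentile shows the slow outliers, which is why the percentiles are worth storing over time rather than an average.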

Performance on shared box/cloud

  • Not in control of resources running on
  • Containers share L3 cache so can see higher rates of cache miss
  • CPU throttling effects
  • Hard to measure since can’t see what neighbors are doing
  • One option is to rent the largest box available and compare it against the specs on the vendor’s website. If you have the max # of cores, you know you have the box to yourself. Expensive. Was about five dollars a year. At that price, it might be worth just buying your own machine in a data center
  • Can pack non-latency-sensitive services onto shared machines

<missed some at the end. I got an email that distracted me>

My impressions

There was a lot of discussion about the histogram. I would have liked to see some examples rather than just talking about how it is calculated. They didn’t have to be real examples to be useful. There were some interesting facts and it was a good case study so I’m glad I went. I was glad he addressed that non-cloud is a possible option for this scenario

[QCon 2019] Making npm install safe

Kate Sills @kate_sills

General

  • Building financial software in JavaScript
  • 97% of code in a modern web app comes from npm

Security issues

  • All packages are risky
  • Imports and global variables
  • Effects opaque
  • Can be from dependency many levels deep

Pattern

  • Event stream package (11/28/18)
  • Electron native notify package (6/4/19)
  • Can call Node built-in modules to read a file and send it over the network
  • Targeted cryptocurrency

Options for solution

  • Write everything yourself – not scalable
  • Pay open source maintainers so someone responsible for security – people make mistakes. Even people who are paid can compromise a system
  • Code audits – don’t see everything. Hard to find clever attacks

Other approach

  • Preventing attacks requires infallibility
  • Better to look for ways to limit damage
  • For example, would be better off if a package can’t import fs
  • JavaScript is good at code isolation. Clear separation between pure computation and connection to outside world

Realms – draft proposal

  • Want to be able to create realm without overhead of an iframe
  • Featherweight compartment – shares primordials/context
  • There is a realm shim now
  • Self/window not defined in the compartment

Attack – prototype poisoning

  • Save copy of original function
  • Do something bad first and then call original function so it looks right
  • SES (Secure ECMAScript) – realms + transitive freezing/hardening
  • Can’t change prototype behavior with SES
  • npm install ses
  • SES.makeSESRootRealm()
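The attack described above can be sketched in plain JavaScript. This is an illustration only: the `Wallet` class and `poison` function are hypothetical, and `Object.freeze` on a single prototype is used as a stand-in for what SES does transitively. It does not use the SES API.

```javascript
// A toy class standing in for something an app relies on (hypothetical).
function makeWalletClass() {
  return class Wallet {
    constructor(secret) { this.secret = secret; }
    sign(message) { return `${message}:signed`; }
  };
}

// What a malicious dependency could do: save the original method, do
// something bad first, then call the original so the result looks right.
function poison(WalletClass, leaked) {
  'use strict'; // makes writes to a frozen object throw instead of failing silently
  const original = WalletClass.prototype.sign;
  WalletClass.prototype.sign = function (message) {
    leaked.push(this.secret);            // the "something bad"
    return original.call(this, message); // output still looks normal
  };
}

// Unprotected prototype: the attack works and nobody notices.
const Open = makeWalletClass();
const leakedOpen = [];
poison(Open, leakedOpen);
const openResult = new Open('key-1').sign('tx');

// Frozen prototype: the replacement is rejected.
const Frozen = makeWalletClass();
Object.freeze(Frozen.prototype);
const leakedFrozen = [];
let blocked = false;
try {
  poison(Frozen, leakedFrozen);
} catch (err) {
  blocked = err instanceof TypeError;
}
const frozenResult = new Frozen('key-2').sign('tx');

console.log(openResult, leakedOpen);                  // tx:signed [ 'key-1' ]
console.log(frozenResult, leakedFrozen, { blocked }); // tx:signed [] { blocked: true }
```

Both calls return the same signed string, which is the point of the attack: only the leaked list shows that something went wrong in the unprotected case.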

POLA

  • Principle of least authority
  • Same as principle of least privilege
  • Reasonable to want to access the file system. Can attenuate (reduce the impact of) access by wrapping fs with a check for the correct file name. (Not clear how this prevents using the original fs.) The harden method protects the wrapper
  • The chalk package needs process/OS access to change color
  • But can kill process and change priority of process with that access
  • Want to limit access to just what needed
  • Chalk only needs OS to get the release. Can attenuate so just have that one function to return release string.
  • Object capabilities – http://habitatchronicles.com/2017/05/what-are-capabilities/
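The chalk example can be sketched like this. It is a rough illustration, not code from the talk: `makeAttenuatedOs` and `colorSupport` are hypothetical names, and `Object.freeze` stands in for SES's harden. The wrapper only helps if the package cannot reach the real `os` module on its own, which is presumably the job of the compartment it runs in.

```javascript
const os = require('os');

// Attenuated authority: expose only the one function the package needs.
// Object.freeze is a stand-in here for hardening the wrapper.
function makeAttenuatedOs(realOs) {
  return Object.freeze({
    release: () => realOs.release(),
  });
}

// Hypothetical package code that receives its authority as an argument
// instead of importing the os module itself.
function colorSupport(limitedOs) {
  const major = Number.parseInt(limitedOs.release().split('.')[0], 10);
  return {
    major,
    // With the real os module this would be true; with the wrapper it is false.
    canSetPriority: typeof limitedOs.setPriority === 'function',
  };
}

const limitedOs = makeAttenuatedOs(os);
console.log(colorSupport(limitedOs));
```

The package still gets the release string it needs, but it never holds a reference to anything that can kill a process or change its priority.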

Moddable XS

  • Only complete ECMAScript 2018 engine optimized for embedded devices
  • Contains SES
  • Safe for users to install JS apps
  • Can only do specific things
  • Can add own app to washing machine

MetaMask’s Sesify

  • Ethereum wallet
  • Can run Ethereum apps in browser without running full Ethereum node

Salesforce’s Locker Service

  • One of primary co-authors of Realms and SES
  • Plugin platform

Caveats for Realms

  • Work in progress
  • Have to stringify to use
  • Still in draft

Q&A

  • What if add something bad? https://ocapjs.org/t/tofu-trusted-on-first-use-tool/27 Putting something bad in wrapper would show up in diff/code review.
  • How is SES different from Object.freeze? Object.freeze only freezes that instance and doesn’t go up the prototype chain
  • How do you know what functions/authorities you need to provide to packages? Still developing patterns of use. For now it might be trial and error. Might need changes to the module.
  • Why don’t we hear about npm install attacks in other languages? Still have problems. Java can’t protect [I raised my hand and described how Sonatype helps protect Maven Central]. Worse on JavaScript because lots of tiny packages. Visibility will help in future.
  • Will this be bolted on to web frameworks? Hasn’t yet, but hope will happen.
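The Object.freeze answer is easier to see with a small example. This is a minimal sketch and not SES's implementation: `deepFreeze`, `makeObject`, and `tamper` are made-up helpers, and the sketch deliberately stops at the built-in prototypes so it does not freeze globals the way a real hardened environment would.

```javascript
// An object with a nested property and its own prototype (hypothetical).
function makeObject() {
  const proto = { greet() { return 'hello'; } };
  const obj = Object.create(proto);
  obj.settings = { mode: 'safe' };
  return obj;
}

// Built-in prototypes are left alone so this sketch does not freeze globals.
const STOP = new Set([Object.prototype, Function.prototype]);

// Freeze the value, everything reachable from its own properties,
// and everything up its prototype chain.
function deepFreeze(value, seen = new Set()) {
  const isObjectLike = value !== null &&
    (typeof value === 'object' || typeof value === 'function');
  if (!isObjectLike || STOP.has(value) || seen.has(value)) return value;
  seen.add(value);
  Object.freeze(value);
  for (const key of Reflect.ownKeys(value)) {
    const desc = Object.getOwnPropertyDescriptor(value, key);
    deepFreeze(desc.value, seen);
    deepFreeze(desc.get, seen);
    deepFreeze(desc.set, seen);
  }
  deepFreeze(Object.getPrototypeOf(value), seen);
  return value;
}

// Try two tampering steps and report what stuck.
function tamper(obj) {
  try { obj.settings.mode = 'unsafe'; } catch (err) { /* rejected in strict mode */ }
  try { Object.getPrototypeOf(obj).greet = () => 'poisoned'; } catch (err) { /* rejected */ }
  return { mode: obj.settings.mode, greeting: obj.greet() };
}

const shallow = tamper(Object.freeze(makeObject()));
const deep = tamper(deepFreeze(makeObject()));
console.log(shallow); // { mode: 'unsafe', greeting: 'poisoned' }
console.log(deep);    // { mode: 'safe', greeting: 'hello' }
```

With plain Object.freeze the object itself is locked, but both the nested settings and the prototype can still be changed underneath it.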

My impressions

While I was aware of the problem, the solution (or future solution) is really interesting! She left lots of time for Q&A which was nice after yesterday. [My track didn’t have much time for Q&A in most sessions]