[QCon 2019] Low Latency in the Cloud, with OSS

Mark Price @epickrram

For other QCon blog posts, see QCon live blog table of contents

Requirements

  • Trading app
  • Need microsecond (not millisecond) response time
  • Need data in memory vs database
  • Lock free programming
  • Redundancy
  • High volume
  • Predictable latency

Hydra

  • System built on OSS
  • Opinionated framework to accelerate app dev
  • Clients communicate with stateless, scalable gateways
  • Persistors – manage data in memory.
  • Gateway – converts large text message to something smaller and more efficient

Design choices

  • Replay logs to reapply changes. Business logic must be fully deterministic. Bounded recovery times
  • Placement group in cloud – machines guaranteed to be near each other. Minimizes latency between nodes
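
To make the "replay logs" point concrete, here is a minimal sketch of my own (not Hydra's actual code). The names ReplayDemo, BalanceChanged and Ledger are made up for illustration. The idea is that if applying an event depends only on the current state and the event itself (no clocks, random numbers or I/O), then replaying the same log always rebuilds the same in-memory state.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReplayDemo {
    /** Hypothetical log entry: a signed change to one account's balance. */
    static final class BalanceChanged {
        final String account;
        final long deltaCents;

        BalanceChanged(String account, long deltaCents) {
            this.account = account;
            this.deltaCents = deltaCents;
        }
    }

    /** In-memory state. apply() is deterministic: no clock, randomness or I/O. */
    static final class Ledger {
        private final Map<String, Long> balances = new TreeMap<>();

        void apply(BalanceChanged event) {
            balances.merge(event.account, event.deltaCents, Long::sum);
        }

        long balanceOf(String account) {
            return balances.getOrDefault(account, 0L);
        }
    }

    /** Rebuilds state from scratch by reapplying every logged event in order. */
    static Ledger replay(List<BalanceChanged> log) {
        Ledger ledger = new Ledger();
        for (BalanceChanged event : log) {
            ledger.apply(event);
        }
        return ledger;
    }
}
```

Recovery time is bounded by the length of the log that has to be replayed, which is presumably why bounded recovery and replay were mentioned together.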

Testing latency

  • Do as part of CD pipeline
  • Can’t physically monitor with a fiber tap
  • Capture in histogram to get statistical view and calculate data
  • Test under load
  • Fan out the locations you test from
  • Store percentiles as time series data
  • Can see jitter from garbage collection
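
Since the talk described the histogram without showing one, here is a rough sketch of my own of what recording latencies in a histogram and reading percentiles back could look like. It is not the speaker's implementation. LatencyHistogram and its one-microsecond buckets are assumptions; a real system would more likely use a dedicated histogram library.

```java
final class LatencyHistogram {
    // buckets[i] counts samples in [i, i+1) microseconds;
    // the last bucket collects everything at or above maxMicros.
    private final long[] buckets;
    private long total;

    LatencyHistogram(int maxMicros) {
        buckets = new long[maxMicros + 1];
    }

    /** Records one latency sample given in nanoseconds. */
    void recordNanos(long nanos) {
        int micros = (int) Math.min(Math.max(nanos, 0) / 1_000, buckets.length - 1);
        buckets[micros]++;
        total++;
    }

    /** Lower bound (in microseconds) of the bucket holding the given percentile, nearest-rank. */
    int percentileMicros(double pct) {
        if (total == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long rank = Math.max(1, (long) Math.ceil(pct * total / 100.0));
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= rank) {
                return i;
            }
        }
        return buckets.length - 1;
    }

    public static void main(String[] args) {
        LatencyHistogram histogram = new LatencyHistogram(10_000);
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            Math.sqrt(i); // stand-in for the operation under test
            histogram.recordNanos(System.nanoTime() - start);
        }
        System.out.println("p50 (us): " + histogram.percentileMicros(50));
        System.out.println("p99 (us): " + histogram.percentileMicros(99));
        System.out.println("max (us): " + histogram.percentileMicros(100));
    }
}
```

A single long pause (for example from garbage collection) barely moves the median but shows up in the top percentiles, which is why tracking percentiles over time is more useful than tracking an average.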

Performance on shared box/cloud

  • Not in control of resources running on
  • Containers share L3 cache so can see higher rates of cache miss
  • CPU throttling effects
  • Hard to measure since can’t see what neighbors are doing
  • One option is to rent the largest box possible and compare to vendor website for specs. If have max # cores, know have box to self. Expensive. Was about five dollars a year. At that price, might be worth just buying own machine in data center
  • Can pack non latency services onto shared machines

<missed some at the end. I got an email that distracted me>

My impressions

There was a lot of discussion about the histogram. I would have liked to see some examples rather than just talking about how it is calculated. They didn’t have to be real examples to be useful. There were some interesting facts and it was a good case study so I’m glad I went. I was glad he addressed that non-cloud is a possible option for this scenario.

[2018 oracle code one] Bulletproof Java Enterprise Applications

Building Bulletproof Java Enterprise Applications
Speaker: Sebastian Daschner

For more blog posts, see The Oracle Code One table of contents


 

Being resilient

  • Don’t crash
  • Prevent failures from cascading
  • Don’t allow actions that are doomed to fail

Timeouts

  • Avoid deadlocks
  • Kill at some point so the overall system can continue
  • Especially http and database timeouts.
  • Some libraries default to no timeouts
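
As an illustration of setting timeouts explicitly, here is a small sketch using the JDK's built-in java.net.http client (Java 11+). This is my own example, not one from the talk. The class name TimeoutConfig and the 2 and 5 second values are arbitrary assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutConfig {
    /** Client that gives up if a connection cannot be established within 2 seconds. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    /** GET request that fails if no response arrives within 5 seconds. */
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }
}
```

With these set, a call like client.send(request, HttpResponse.BodyHandlers.ofString()) should throw an HttpTimeoutException instead of hanging when the server doesn't answer in time. If neither timeout is set, both accessors return an empty Optional, which matches the point about defaults having no timeout.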

Retries

  • Immediately retry to avoid temporary failure – but be careful not to put more load on a failing server
  • Avoid unnecessary error codes
  • Decide how often and how many times to retry
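
Here is a minimal sketch of a bounded retry, written by me in plain Java rather than taken from the talk's slides. The helper name Retry.withBackoff and its parameters are made up. It retries a limited number of times and doubles the wait between attempts so a struggling server isn't hammered.

```java
import java.util.concurrent.Callable;

final class Retry {
    /**
     * Runs the action up to maxAttempts times, sleeping between failures.
     * The delay doubles after each failed attempt (exponential backoff).
     * The last failure is rethrown once the attempts are used up.
     */
    static <T> T withBackoff(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the failure to the caller
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

For example, Retry.withBackoff(() -> callRemoteService(), 3, 100) would try at most three times, waiting 100 ms and then 200 ms between attempts (callRemoteService is a placeholder). This version retries on any exception. In practice you would probably only retry on errors that are likely to be temporary.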

Java EE Extensions

My take: I like that there are actual code examples. I don’t like that the text based slides are in vi (or a screenshot of vi). Such a small font and tons of wasted whitespace.

performance tuning selenium – firefox vs chrome vs headless

I’m the co-volunteer coordinator for NYC FIRST. Every year we are faced with a problem: we want to export the volunteer data including preferences for offseason events. The system provides an export feature but does not include a few fields we want. A few years ago, my friend Norm said “if only we could export those fields.” I’m a programmer; of course we can!

So I wrote him a program to do just this. It’s export-vol-data on GitHub. And fittingly, he “paid” me with free candy from the NYC FIRST office. Once a year we meet, Norm gives his credentials to the program and we wait. And wait. And wait. This year NYC FIRST had more events than ever before so it took a really long time. I wanted to tune it.

Getting test data

The problems with tuning have been:

  1. I have no control over when people volunteer for the event. It’s hard to performance test when the data set keeps changing.
  2. The time period when I have access to the event is not the time period that I have the most free time.

Norm solved these problems by creating a test event for me. I started over the summer, but then got accepted to speak at JavaOne and was really busy getting ready for that. Then I went back to it and someone deleted my test event. Norm solved that problem by creating a new event called “TEST EVENT FOR SOFTWARE DEVELOPMENT – DO NOT ENROLL OR DELETE, please. – FLL”. And one person did volunteer for that. But not a lot so it helped.

Performance tuning

I tried the following performance improvements based on our experience exporting in April 2017.

  1. SUCCESS: Run the program on the largest events first. (It’s feasible to manually export the data for small events. Plus those have largely people who also volunteered at a larger event.) This allows us to run for the events with the most business value first. It also allows us to abort the program at any time.
  2. SUCCESS: Skip events and roles with zero volunteers. For some reason, it takes a lot longer to load a page with no volunteers. So skipping this makes the program MUCH faster.
  3. SKIP: Add parallelization. I wound up not doing this because the program is so fast now.
  4. FAILED: Switch from Firefox driver to PhantomJS. I knew the site didn’t function with HtmlUnitDriver. I thought maybe it would work with PhantomJS – an in memory driver with better JavaScript support. Alas it didn’t.
  5. FAILED: Try to go directly to URLs with data. FIRST prevents this from working. You can’t simply simulate the REST calls externally.
  6. SUCCESS: Switch from Firefox driver to Chrome driver. This made a huge difference in both performance and stability. The program would crash periodically in Firefox. I was never able to figure out why. I have retry/resume logic, but having to manually click “continue” makes it slower.
  7. UNKNOWN: I added support for Headless Chrome in the program. It doesn’t seem noticeably faster though. And it is fun for Norm and me to watch the program “click” through the site. So I left it as an option, but not the default.
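
The first two improvements boil down to a filter and a sort. Here is a simplified sketch of the idea. It is not the actual code from export-vol-data. The Event class and planExport method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ExportOrder {
    /** Hypothetical stand-in for an event and how many volunteers signed up for it. */
    static final class Event {
        final String name;
        final int volunteerCount;

        Event(String name, int volunteerCount) {
            this.name = name;
            this.volunteerCount = volunteerCount;
        }
    }

    /** Drops events with zero volunteers, then orders the rest largest first. */
    static List<Event> planExport(List<Event> events) {
        List<Event> plan = new ArrayList<>();
        for (Event event : events) {
            if (event.volunteerCount > 0) {
                plan.add(event); // skip empty events: nothing to export and slow to load
            }
        }
        plan.sort(Comparator.comparingInt((Event e) -> e.volunteerCount).reversed());
        return plan;
    }
}
```

Processing in this order means that if the run is aborted partway through, the events with the most volunteers have already been exported.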

Results

Like any good programming exercise, some things worked and some didn’t. The program is an order of magnitude faster now than at the start though, so I declare this a success!