Mark Price @epickrram
For other QCon blog posts, see QCon live blog table of contents
- Trading app
- Need microsecond (not millisecond) response time
- Need data in memory vs database
- Lock free programming
- High volume
- Predictable latency
- System built on OSS
- Opinionated framework to accelerate app dev
- Clients communicate with stateless, scalable gateways
- Persistors – manage data in memory.
- Gateway – converts large text message to something smaller and more efficient
- Replay logs to reapply changes. Business logic must be fully deterministic. Bounded recovery times
- Placement group in cloud – machines guaranteed to be near each other. Minimizes latency between nodes
- Do as part of CD pipeline
- Can’t physically monitor with a fiber tap
- Capture in histogram to get statistical view and calculate data
- Test under load
- Fan out the locations the tests run from
- Store percentiles as time series data
- Can see jitter from garbage collection
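The "replay logs" bullet is the heart of this architecture, so here is a toy sketch of the idea (my own illustration, not the speaker's code): if every state change flows through an event log and the business logic is deterministic, replaying the log rebuilds the exact same state, which is what bounds recovery time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of deterministic replay: state is only ever changed
// by applying logged events, so replaying the log reproduces the state.
public class ReplayableLedger {
    private long balance;
    private final List<Long> log = new ArrayList<>();

    // Normal operation: record the event, then apply it.
    public void apply(long delta) {
        log.add(delta);
        balance += delta;
    }

    public long balance() { return balance; }

    public List<Long> log() { return log; }

    // Recovery: feed the same events, in order, to a fresh instance.
    // Determinism guarantees the rebuilt state matches the original.
    public static ReplayableLedger replay(List<Long> events) {
        ReplayableLedger fresh = new ReplayableLedger();
        for (long delta : events) fresh.apply(delta);
        return fresh;
    }
}
```

Anything non-deterministic (wall-clock reads, random numbers, iteration over unordered maps) breaks this property, which is why the speaker stressed that the business logic must be fully deterministic.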
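The histogram bullets above can be sketched too. This is my own toy illustration (low-latency shops typically use a real library like HdrHistogram): record each latency sample, then pull out percentiles to store as time series data.

```java
import java.util.Arrays;

// Minimal, hypothetical latency histogram: records microsecond samples
// and reports percentiles. Not suitable for production use.
public class LatencyHistogram {
    private final long[] samples;
    private int count;

    public LatencyHistogram(int capacity) {
        samples = new long[capacity];
    }

    public void record(long micros) {
        samples[count++] = micros;
    }

    // Value at the given percentile, e.g. 99.0 for the p99 latency.
    public long percentile(double pct) {
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * count) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        LatencyHistogram h = new LatencyHistogram(1000);
        for (int i = 1; i <= 100; i++) h.record(i * 10L); // 10us .. 1000us
        System.out.println("p50=" + h.percentile(50) + "us p99="
                + h.percentile(99) + "us"); // prints p50=500us p99=990us
    }
}
```

Percentiles are the right summary here because a mean hides exactly the GC-induced jitter the talk warned about; p99/p99.9 make it visible.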
Performance on shared box/cloud
- Not in control of resources running on
- Containers share L3 cache so can see higher rates of cache miss
- CPU throttling effects
- Hard to measure since can’t see what neighbors are doing
- One option is to rent the largest box available and compare its specs to the vendor’s website. If you have the maximum number of cores, you know you have the box to yourself. Expensive – it was about five dollars a year. At that price, it might be worth just buying your own machine in a data center
- Can pack non-latency-sensitive services onto shared machines
<missed some at the end. I got an email that distracted me>
There was a lot of discussion about the histogram. I would have liked to see some examples rather than just talking about how it is calculated. They didn’t have to be real examples to be useful. There were some interesting facts and it was a good case study so I’m glad I went. I was glad he addressed that non-cloud is a possible option for this scenario
Building Bulletproof Java Enterprise Applications
Speaker: Sebastian Daschner
For more blog posts, see The Oracle Code One table of contents
- Don’t crash
- Prevent failures from cascading
- Don’t allow actions that are doomed to fail
- Avoid deadlocks
- Kill at some point so the overall system can continue
- Especially HTTP and database timeouts
- Some libraries default to no timeouts
- Retry to get past temporary failures – but be careful not to put more load on an already failing server
- Avoid unnecessary error codes
- Decide how often and how many times to retry
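The retry bullets can be sketched like this (my own illustration; the helper name `callWithRetry` is made up, not from the talk): cap the number of attempts and back off between them, so a struggling server is not hammered and failures cannot cascade forever.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical bounded-retry helper: a fixed attempt budget with
// exponential backoff between attempts.
public class BoundedRetry {

    // For the timeout bullets: java.net.http's HttpClient supports
    // HttpClient.newBuilder().connectTimeout(...) plus a per-request
    // .timeout(...), so no call waits forever.
    public static <T> T callWithRetry(Supplier<T> task, int maxAttempts,
                                      Duration initialDelay) {
        Duration delay = initialDelay;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay.toMillis()); // back off first
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                    delay = delay.multipliedBy(2); // exponential backoff
                }
            }
        }
        throw last; // give up after the budget so the caller can fail fast
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String r = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, Duration.ofMillis(10));
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```

Note this retries every `RuntimeException`; per the "doomed to fail" bullet, a real version should retry only errors that could plausibly be transient.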
Java EE Extensions
My take: I like that there are actual code examples. I don’t like that the text-based slides are in vi (or a screenshot of vi). Such a small font and tons of wasted whitespace.
I’m the co-volunteer coordinator for NYC FIRST. Every year we are faced with a problem: we want to export the volunteer data including preferences for offseason events. The system provides an export feature but does not include a few fields we want. A few years ago, my friend Norm said “if only we could export those fields.” I’m a programmer; of course we can!
So I wrote him a program to do just this. It’s export-vol-data on GitHub. And fittingly, he “paid” me with free candy from the NYC FIRST office. Once a year we meet, Norm gives his credentials to the program, and we wait. And wait. And wait. This year NYC FIRST had more events than ever before, so it took a really long time. I wanted to tune it.
Getting test data
The problems with tuning have been:
- I have no control over when people volunteer for the event. It’s hard to performance test when the data set keeps changing.
- The time period when I have access to the event is not the time period that I have the most free time.
Norm solved these problems by creating a test event for me. I started over the summer, but then got accepted to speak at JavaOne and was really busy getting ready for that. When I went back to it, someone had deleted my test event. Norm solved that problem by creating a new event called “TEST EVENT FOR SOFTWARE DEVELOPMENT – DO NOT ENROLL OR DELETE, please. – FLL”. One person did volunteer for that anyway – but only one, so it still helped.
I tried the following performance improvements based on our experience exporting in April 2017.
- SUCCESS: Run the program on the largest events first. (It’s feasible to manually export the data for small events. Plus those have largely people who also volunteered at a larger event.) This allows us to run for the events with the most business value first. It also allows us to abort the program at any time.
- SUCCESS: Skip events and roles with zero volunteers. For some reason, it takes a lot longer to load a page with no volunteers. So skipping this makes the program MUCH faster.
- SKIP: Add parallelization. I wound up not doing this because the program is so fast now.
- FAILED: Try to go directly to URLs with data. FIRST prevents this from working. You can’t simply simulate the REST calls externally.
- SUCCESS: Switch from Firefox driver to Chrome driver. This made a huge difference in both performance and stability. The program would crash periodically in Firefox. I was never able to figure out why. I have retry/resume logic, but having to manually click “continue” makes it slower.
- UNKNOWN: I added support for headless Chrome to the program. It doesn’t seem noticeably faster, though. And it is fun for Norm and me to watch the program “click” through the site. So I left it as an option, but not the default.
Like any good programming exercise, some things worked and some didn’t. The program is an order of magnitude faster now than at the start, though, so I declare this a success!