[javaone2026] Running GPU-Accelerated AI Inference from Java at Uber Scale

Speakers: Baojun Liu & Anshuman Mishra

See the live blog table of contents


Michelangelo

  • Uber’s unified ML platform
  • 20K models trained/month
  • 5.3K models in prod

Java

  • Online prediction is all Java
  • High concurrency orchestration
  • Business logic management
  • Production ecosystem integration

Spark ML pipeline

  • C++ provides the ML runtime and math libraries
  • Java manages threads, coordinates execution, handles back pressure
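The division of labor above (Java orchestrates, C++ computes) can be sketched with a simple back-pressure gate: Java bounds the number of requests in flight to the native runtime with a semaphore. `nativeInfer` is a hypothetical stand-in for the real JNI/FFI entry point, not Uber's actual API.

```java
import java.util.concurrent.Semaphore;

// Back-pressure sketch: callers block once maxInFlight requests are
// already inside the C++ runtime, instead of piling up and causing
// memory pressure or timeouts downstream.
public class InferenceGate {
    private final Semaphore permits;

    public InferenceGate(int maxInFlight) {
        permits = new Semaphore(maxInFlight);
    }

    public float[] infer(float[] features) {
        permits.acquireUninterruptibly(); // back pressure: wait for a free slot
        try {
            return nativeInfer(features);
        } finally {
            permits.release();
        }
    }

    // Placeholder; real code would declare this `native` and call into C++.
    private float[] nativeInfer(float[] features) {
        return features.clone();
    }
}
```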

Scaling

  • Model sizes have increased from 100K to 800G
  • Traffic growth about 30% per year

GPU/CPU

  • GPU cluster – 20x cost reduction and an order of magnitude fewer instances
  • NVIDIA Triton inference server (C++)
  • GPU can handle 10x-100x more requests
  • Dynamic batching – trades latency for throughput
  • Scale instance count down for large models to save memory, and up for high queries per second
  • One GPU paired with multiple CPUs; CPU bottlenecks show up as cold starts, error spikes, and memory pressure
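The latency-for-throughput trade mentioned above is what Triton's dynamic batcher does: it briefly queues requests so they can be merged into larger GPU batches. A minimal sketch of the relevant stanza in a model's `config.pbtxt` (the batch sizes and delay here are illustrative, not the values from the talk):

```
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 500
}
```

A larger `max_queue_delay_microseconds` raises per-request latency but lets the GPU run fuller batches, which is where the 10x-100x throughput gain comes from.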

Techniques

  • Profiling
  • Send dummy data at startup to avoid cold starts
  • Tune GC for memory footprint and pause time
  • Frequent, efficient collections
  • Panama (Foreign Function & Memory API) for native interop
  • Virtual threads
  • Upgrade to a newer Java version to reduce CPU use
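Two of the techniques above, dummy-data warmup and virtual threads, combine naturally: fire a burst of throwaway requests at startup so JIT compilation and caches are warm before real traffic arrives. A sketch assuming Java 21+; `predict` is a hypothetical placeholder for the real inference call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Warmup sketch: submit 1000 dummy requests on virtual threads, which are
// cheap enough that the fan-out costs almost nothing.
public class Warmup {
    static final AtomicInteger completed = new AtomicInteger();

    static void predict(float[] dummy) {
        completed.incrementAndGet(); // real code would call the model here
    }

    public static void main(String[] args) {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1000; i++) {
                pool.submit(() -> predict(new float[]{0f}));
            }
        } // close() waits for all submitted tasks to finish
        System.out.println("warmup requests completed: " + completed.get());
    }
}
```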

Other notes

  • Java is not the bottleneck, bad architecture is

My take

The beginning was the same slide as the keynote, which was fine as a review. It got to new material quickly, and it was nice to see what Uber is doing. Good questions from the audience.
