[javaone2026] Running GPU-Accelerated AI Inference from Java at Uber Scale

Speakers: Baojun Liu & Anshuman Mishra

See the live blog table of contents


Michelangelo

  • Uber’s unified ML platform
  • 20K models trained/month
  • 5.3K models in prod

Java

  • Online prediction is all Java
  • High concurrency orchestration
  • Business logic management
  • Production ecosystem integration

Spark ML pipeline

  • C++ provides the ML runtime and math libraries
  • Java manages threads, coordinates execution, handles back pressure
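The division of labor above (Java orchestrates, C++ computes) can be sketched with a simple back-pressure gate: Java bounds the number of requests in flight to the native runtime with a semaphore. `nativeInfer` is a hypothetical stand-in for the real JNI/FFI entry point, not Uber's actual API.

```java
import java.util.concurrent.Semaphore;

// Back-pressure sketch: callers block once maxInFlight requests are
// already inside the C++ runtime, instead of piling up and causing
// memory pressure or timeouts downstream.
public class InferenceGate {
    private final Semaphore permits;

    public InferenceGate(int maxInFlight) {
        permits = new Semaphore(maxInFlight);
    }

    public float[] infer(float[] features) {
        permits.acquireUninterruptibly(); // back pressure: wait for a free slot
        try {
            return nativeInfer(features);
        } finally {
            permits.release();
        }
    }

    // Placeholder; real code would declare this `native` and call into C++.
    private float[] nativeInfer(float[] features) {
        return features.clone();
    }
}
```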

Scaling

  • Model sizes have increased from 100K to 800G
  • Traffic growth about 30% per year

GPU/CPU

  • GPU cluster – 20x cost reduction and an order of magnitude fewer instances
  • NVIDIA Triton inference server (C++)
  • GPU can handle 10x-100x more requests
  • Dynamic batching – trades latency for throughput
  • Scale instance count down for large models to save memory, and up for high queries per second
  • One GPU paired with multiple CPUs; CPU bottlenecks show up as cold starts, error spikes, and memory pressure
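The latency-for-throughput trade mentioned above is what Triton's dynamic batcher does: it briefly queues requests so they can be merged into larger GPU batches. A minimal sketch of the relevant stanza in a model's `config.pbtxt` (the batch sizes and delay here are illustrative, not the values from the talk):

```
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 500
}
```

A larger `max_queue_delay_microseconds` raises per-request latency but lets the GPU run fuller batches, which is where the 10x-100x throughput gain comes from.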

Techniques

  • Profiling
  • Send dummy data at startup to avoid cold starts
  • Tune GC for memory footprint and pause time
  • Frequent, efficient collections
  • Panama (Foreign Function & Memory API) for native interop
  • Virtual threads
  • Upgrade to a newer Java version to reduce CPU use
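Two of the techniques above, dummy-data warmup and virtual threads, combine naturally: fire a burst of throwaway requests at startup so JIT compilation and caches are warm before real traffic arrives. A sketch assuming Java 21+; `predict` is a hypothetical placeholder for the real inference call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Warmup sketch: submit 1000 dummy requests on virtual threads, which are
// cheap enough that the fan-out costs almost nothing.
public class Warmup {
    static final AtomicInteger completed = new AtomicInteger();

    static void predict(float[] dummy) {
        completed.incrementAndGet(); // real code would call the model here
    }

    public static void main(String[] args) {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1000; i++) {
                pool.submit(() -> predict(new float[]{0f}));
            }
        } // close() waits for all submitted tasks to finish
        System.out.println("warmup requests completed: " + completed.get());
    }
}
```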

Other notes

  • Java is not the bottleneck, bad architecture is

My take

The beginning was the same slide as the keynote, which was fine as a review. It got to new material quickly, and it was nice to see what Uber is doing. Good questions from the audience.
