Speakers: Baojun Liu & Anshuman Mishra
See the live blog table of contents
Michelangelo
- Uber’s unified ML platform
- 20K models trained/month
- 5K models in prod
Java
- Online prediction is all Java
- High concurrency orchestration
- Business logic management
- Production ecosystem integration
Spark ML pipeline
- C++ provides the ML runtime and math libraries
- Java manages threads, coordinates execution, and handles back pressure
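A minimal sketch of that division of labor, assuming a hypothetical `nativePredict` stand-in for the call into the C++ runtime: Java owns the thread pool and applies back pressure with a bounded queue, shedding load instead of letting latency balloon.

```java
import java.util.concurrent.*;

// Illustrative sketch, not Uber's code: Java orchestrates concurrency and
// back pressure; the real inference would happen in the C++ runtime.
public class InferenceOrchestrator {
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // bounded queue: back pressure
            new ThreadPoolExecutor.AbortPolicy());  // shed load when full

    // Hypothetical stand-in for a JNI/Panama call into the C++ ML runtime.
    static float[] nativePredict(float[] features) {
        float sum = 0f;
        for (float f : features) sum += f;
        return new float[] { sum };
    }

    public Future<float[]> submit(float[] features) {
        try {
            return pool.submit(() -> nativePredict(features));
        } catch (RejectedExecutionException e) {
            // Queue full: fail fast rather than queueing unbounded work.
            return CompletableFuture.failedFuture(e);
        }
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) throws Exception {
        InferenceOrchestrator orch = new InferenceOrchestrator();
        float[] score = orch.submit(new float[] {1f, 2f, 3f}).get();
        System.out.println("score=" + score[0]);
        orch.shutdown();
    }
}
```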
Scaling
- Model sizes have increased from ~100KB to 800GB
- Traffic growth about 30% per year
GPU/CPU
- GPU cluster – 20x cost reduction and an order of magnitude fewer instances
- NVIDIA Triton inference server (C++)
- GPU can handle 10x-100x more requests
- Dynamic batching – trades latency for throughput
- Scale down instance count for large models to save memory; scale up for high queries per second
- One GPU paired with multiple CPUs. CPU bottlenecks – cold start, error spikes, memory pressure
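The batching trade-off above can be sketched in a few lines (illustrative names, not Triton's actual API): hold individual requests briefly so the GPU can process them as one batch, accepting a small latency hit for much higher throughput.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the dynamic-batching idea, assuming a single consumer
// that drains requests for the GPU. Not production code.
public class DynamicBatcher {
    private final BlockingQueue<float[]> queue = new LinkedBlockingQueue<>();
    private final int maxBatch;
    private final long maxWaitMillis;

    public DynamicBatcher(int maxBatch, long maxWaitMillis) {
        this.maxBatch = maxBatch;
        this.maxWaitMillis = maxWaitMillis;
    }

    public void enqueue(float[] request) { queue.add(request); }

    // Drain up to maxBatch requests, waiting at most maxWaitMillis for more.
    public List<float[]> nextBatch() throws InterruptedException {
        List<float[]> batch = new ArrayList<>();
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000;
        while (batch.size() < maxBatch) {
            long remaining = deadline - System.nanoTime();
            float[] req = queue.poll(Math.max(remaining, 0), TimeUnit.NANOSECONDS);
            if (req == null) break;  // timed out: ship a partial batch
            batch.add(req);
        }
        return batch;
    }

    public static void main(String[] args) throws Exception {
        DynamicBatcher batcher = new DynamicBatcher(4, 5);
        batcher.enqueue(new float[] {1f});
        batcher.enqueue(new float[] {2f});
        System.out.println("batch size: " + batcher.nextBatch().size());
    }
}
```

Tuning `maxBatch` and `maxWaitMillis` is exactly the latency-for-throughput dial: a longer wait fills bigger batches.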
Techniques
- Profiling
- Send dummy data to avoid cold start
- Tune GC memory and pause time
- Frequent efficient collections
- Project Panama (Java-to-native interop)
- Virtual threads
- Upgrade to newer Java versions to reduce CPU use
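Two of the techniques above combine naturally (illustrative sketch, not Uber's code): fire dummy warm-up requests before an instance takes real traffic, and run them on virtual threads (Java 21+) so high concurrency stays cheap.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: warm-up traffic on virtual threads. The predict() stand-in is
// hypothetical; in practice it would exercise the real inference path so the
// JIT and caches are warm before the instance joins the serving pool.
public class WarmUpServer {
    static final AtomicInteger handled = new AtomicInteger();

    static int predict(int input) {
        handled.incrementAndGet();
        return input * 2;  // placeholder for real inference
    }

    public static void main(String[] args) {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            // Warm-up phase: dummy requests to avoid cold-start latency spikes.
            for (int i = 0; i < 1_000; i++) {
                final int input = i;
                exec.submit(() -> predict(input));
            }
        } // close() waits for all submitted tasks to finish
        System.out.println("warm-up requests served: " + handled.get());
    }
}
```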
Other notes
- Java is not the bottleneck, bad architecture is
My take
The beginning used the same slides as the keynote, but that was fine as a review. It got to new material quickly, and it was nice to see what Uber is doing. Good questions from the audience.