[javaone 2026] Look Inside a Large Language Model to Become a Better Java Developer

Speaker: Barry Burd

See the table of contents


Opener

  • Imagine having to find the lowest point on a line when you can only see a few steps ahead of or behind you. A one-dimensional problem.
  • On a mountain, finding the lowest point is the same problem in two dimensions.
  • An LLM does that but with many more dimensions – ex: billions
  • Uses a lot of tricks, not just extending the one-dimensional approach

Problem

  • Training GPT-3 required 10K NVIDIA GPUs
  • PyTorch is highly optimized – built-in libraries, deep integration with GPU hardware (NVIDIA CUDA)
  • Apple has its own GPU stack
  • Want to do this in Java

Solution

  • HAT (Heterogeneous Acceleration Toolkit)
  • Work in progress
  • Part of project Babylon
  • Code models/reflection
  • Barry’s goal: algorithms to run on these

Deeplearning4j (ND4j)

  • CUDA support
  • No MPS (Apple Metal) support
  • Arrays stored off-heap (outside JVM)
  • Several arrays can be views onto subarrays of the same underlying data.

What LLM does

  • After analyzing a possibly incomplete string, the LLM decides what the string’s next token should be.
  • Whole words: too many of them to predict among
  • Characters: too granular because they carry no meaning.

Tokens

  • “I’ve grokked Heinlein’s works” as tokens: I | 've | gro | k | ked | Hein | lein | 's | work | s
  • A token id is the number that identifies a token in the vocabulary
  • A token is a sequence of characters that occurs together frequently enough, determined using byte-pair encoding.
  • Suppose the string is “a b r a c a d a b r a”; the initial tokens are a, b, c, d, r. Then observe that the token pair “ab” appears frequently, so “ab” is also a token. Now we have “ab r a c a d ab r a” with “ab” added to the vocabulary. Repeat: “ab” and “r” appear next to each other, giving “abr a c a d abr a” with “abr” added to the token list. Then “abra”
  • Python library tiktoken
  • Java library JTokkit
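The byte-pair merging described above can be sketched in a few lines of Java. This is a toy version of my own (real tokenizers like tiktoken and JTokkit operate on bytes and use a pre-trained merge table):

```java
import java.util.*;

// Toy byte-pair encoding: repeatedly merge the most frequent adjacent
// token pair into a new, longer token.
public class BpeSketch {
    static List<String> merge(List<String> tokens, int rounds) {
        for (int r = 0; r < rounds; r++) {
            // Count how often each adjacent pair occurs
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < tokens.size() - 1; i++) {
                counts.merge(tokens.get(i) + tokens.get(i + 1), 1, Integer::sum);
            }
            if (counts.isEmpty()) break;
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (counts.get(best) < 2) break;  // no pair repeats; stop merging
            // Replace every occurrence of the best pair with the merged token
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (i < tokens.size() - 1
                        && (tokens.get(i) + tokens.get(i + 1)).equals(best)) {
                    merged.add(best);
                    i++;  // skip the second half of the merged pair
                } else {
                    merged.add(tokens.get(i));
                }
            }
            tokens = merged;
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> t = new ArrayList<>(
                List.of("a", "b", "r", "a", "c", "a", "d", "a", "b", "r", "a"));
        System.out.println(merge(t, 3));  // prints [abra, c, a, d, abra]
    }
}
```

Three merge rounds on “abracadabra” produce the “abra” token from the talk’s example.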

Brain

  • 86 billion neurons in human brain
  • Dendrite – input from another cell
  • Soma – cell body
  • Axon – output to other cells
  • Oversimplified: imagine the cell body multiplies each input by a certain weight (different per cell) and adds them up. That’s like multiplying a vector by a matrix

Math terms

  • Vector – array/list of numbers. Can represent a point in n-dimensional space. Usually visualized as an arrow from the origin to that point. Ex: 1,526 dimensions means 1,526 numbers in the vector
  • Matrix – rectangular array of numbers (a stack of vectors). Multiplying by a matrix turns one vector into another
  • Tensor – stack of matrices; an array of arrays of matrices. Not important here.
  • Dot product of two vectors – multiply elements in same spot in each vector and add them up.
  • Matrix multiplication – had nice animation
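These operations are short to write in plain Java (a from-scratch sketch of mine; libraries like ND4j do the same thing much faster). Note that each output element of the matrix-vector product is one “neuron” from the brain analogy: a row of weights dotted with the inputs.

```java
import java.util.Arrays;

// Dot product and matrix-vector multiplication from scratch.
public class VectorMath {
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // "A matrix turns one vector into another": each output element is the
    // dot product of one matrix row with the input vector.
    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int row = 0; row < m.length; row++) out[row] = dot(m[row], v);
        return out;
    }

    public static void main(String[] args) {
        double[] cyan = {0, 255, 255}, red = {255, 0, 0};
        System.out.println(dot(cyan, red));  // prints 0.0 -- nothing in common

        // A made-up matrix that swaps a vector's first two elements
        double[][] swap = {{0, 1, 0}, {1, 0, 0}, {0, 0, 1}};
        System.out.println(Arrays.toString(matVec(swap, red)));  // [0.0, 255.0, 0.0]
    }
}
```

The cyan/red dot product of 0 is the example from the talk’s “Vector meaning” section; the swap matrix is my own illustration.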

ND4j

  • N-Dimensional arrays for Java.
  • Knows how to do vector/matrix math

Embedding

  • Each token gets assigned to an arbitrary vector at first. This is the token embedding
  • Picture adjusting a bunny-ears antenna and how it only works while you’re touching it. Walking away breaks the reception.
  • Each number in the initial arbitrary vector is like a dial that needs to be tuned

Gradient Descent

  • Normally there are millions of minimum points, and it’s easy to get stuck in a local minimum – a point that looks like the lowest in all directions, even though there is a lower one elsewhere. LLM training is meant to avoid that pitfall
  • Eclipse DeepLearning4J – can configure neural network and make a model
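The opener’s “find the lowest point while only seeing a few steps around you” can be sketched in one dimension. This toy of mine has a single minimum, and the loss function and learning rate are made up for illustration:

```java
// Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
// Real LLM losses have billions of dimensions and many local minima.
public class Descent {
    static double minimize(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * (x - 3);  // derivative of (x - 3)^2
            x -= learningRate * gradient;   // step "downhill"
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(minimize(10.0, 0.1, 200));  // converges to ~3.0
    }
}
```

The gradient only uses local information (the slope right where you stand), which is exactly the “can only see a few steps ahead” constraint.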

Vector meaning

  • Similar vectors have similar semantics
  • Applying related vectors to others should have consistent semantic meaning
  • RGB for colors are vectors with three points representing the colors.
  • Dot product of cyan (0, 255, 255) and red (255, 0, 0) is 0 because they have nothing in common
  • Add a positional embedding to the token embedding so the model knows where the token is in the sentence. (Added to each element at a different scale so the model knows which part goes with which.) Combined, these form the input embedding
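A sketch of combining a token embedding with a positional embedding. I’m using the sinusoidal scheme from the Transformer paper as the “different scale per element” mentioned above; the token vector values are made up:

```java
import java.util.Arrays;

// Input embedding = token embedding + sinusoidal positional embedding.
public class InputEmbedding {
    static double[] positional(int position, int dims) {
        double[] pe = new double[dims];
        for (int i = 0; i < dims; i++) {
            // Each pair of elements uses a different wavelength, so the sum
            // still lets the model tell positions (and elements) apart.
            double angle = position / Math.pow(10000, (2.0 * (i / 2)) / dims);
            pe[i] = (i % 2 == 0) ? Math.sin(angle) : Math.cos(angle);
        }
        return pe;
    }

    static double[] inputEmbedding(double[] tokenEmbedding, int position) {
        double[] pe = positional(position, tokenEmbedding.length);
        double[] out = new double[tokenEmbedding.length];
        for (int i = 0; i < out.length; i++) out[i] = tokenEmbedding[i] + pe[i];
        return out;
    }

    public static void main(String[] args) {
        double[] catEmbedding = {0.2, -0.1, 0.7, 0.4};  // made-up token vector
        // Same token at position 5 gets a different input embedding than at 0
        System.out.println(Arrays.toString(inputEmbedding(catEmbedding, 0)));
        System.out.println(Arrays.toString(inputEmbedding(catEmbedding, 5)));
    }
}
```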

Attention

  • Attention is all you need https://arxiv.org/abs/1706.03762
  • Attention examples: grammatical structure, meaning, word order
  • Long range dependency – like a pronoun that refers to something many words away
  • Attention helps focus on the important parts. Ex: “The cat sat on the mat” – at “mat,” the model has to know what’s on the mat (the cat).
  • Apply a key matrix to “cat” and a query matrix to “mat.” The key matrix offers info; the query matrix is what you want to know.
  • Start with random values. Multiply the matrices by the tokens, then take the dot product to get a number that predicts something about the next word. See how far the prediction is from what the word actually is, then adjust the values in the direction that makes the error smaller. Repeat a very large number of times
  • A lot of this can be done in parallel
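A toy version of the key/query step in Java. The matrices here are identity stand-ins for the learned ones, and the token vectors are made up; the point is just the mechanics (key × token, query × token, dot product, softmax):

```java
import java.util.Arrays;

// Toy single-head attention scoring.
public class ToyAttention {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int r = 0; r < m.length; r++) out[r] = dot(m[r], v);
        return out;
    }

    // Attention weights of the current token over the context tokens:
    // softmax of (key * token) . (query * current).
    static double[] attend(double[][] keyM, double[][] queryM,
                           double[][] tokens, double[] current) {
        double[] q = matVec(queryM, current);
        double[] weights = new double[tokens.length];
        double sum = 0;
        for (int t = 0; t < tokens.length; t++) {
            weights[t] = Math.exp(dot(matVec(keyM, tokens[t]), q));
            sum += weights[t];
        }
        for (int t = 0; t < tokens.length; t++) weights[t] /= sum;  // softmax
        return weights;
    }

    public static void main(String[] args) {
        double[][] identity = {{1, 0}, {0, 1}};  // stand-ins for learned matrices
        double[][] context = {{1, 0}, {0, 1}, {1, 0}};  // three context tokens
        double[] current = {0, 1};                      // the current token
        System.out.println(Arrays.toString(attend(identity, identity, context, current)));
        // The middle token, most similar to the query, gets the largest weight
    }
}
```

Each context token’s score is independent of the others until the softmax, which is why so much of this parallelizes.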

Feed forward

  • Linear – straight – ex: 2x + 3y (2 and 3 are the knobs to tune)
  • Non-linear – wavy – more dimensions
  • Language isn’t linear by nature
  • “great, terrific meal” – the word meanings can just add up
  • “not good” – “not” flips the meaning of the sentence, so a purely linear model can’t predict the next word

Universal approximation theorem

  • Imagine a wavy line as a series of bumps from the ground to that line
  • The more granular the bump, the more accurate the result.
  • Each bump can be represented by a linear formula
  • Apply the GELU (non-linear) function to make the bumps
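GELU itself is nearly a one-liner. This is the commonly used tanh approximation, not necessarily the exact formula shown in the talk:

```java
// GELU, the non-linear "bump-maker": near-linear for large positive x,
// near-zero for large negative x, and smooth in between (unlike the hard
// corner of ReLU).
public class Gelu {
    static double gelu(double x) {
        return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI)
                * (x + 0.044715 * Math.pow(x, 3))));
    }

    public static void main(String[] args) {
        for (double x : new double[]{-3, -1, 0, 1, 3}) {
            System.out.printf("gelu(%4.1f) = %7.4f%n", x, gelu(x));
        }
    }
}
```

A weighted sum of many shifted copies of a non-linear unit like this can trace out an arbitrary wavy curve, which is the universal approximation idea in a nutshell.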

Experiment

  • Translated Karpathy’s 253-line LLM into Java, trained on a list of baby names
  • Took about a minute to train.
  • Generated new names. A couple looked legit but most looked random

My take

This was great. Tokens and other critical concepts get used without being defined, and we take them for granted. So being able to think about what they actually are was informative and helpful. I like that Barry showed the math but said it was ok not to understand it. [It was understandable.]
