[javaone 2026] Look Inside a Large Language Model to Become a Better Java Developer

Speaker: Barry Burd

See the table of contents


Opener

  • Imagine having to find the lowest point on a line when you can only see a few steps ahead of or behind you. A one-dimensional problem.
  • On a mountain, finding the lowest point is the same problem in two dimensions.
  • An LLM does that but with many more dimensions – ex: billions
  • Uses a lot of tricks, not just extending the one-dimensional approach

Problem

  • Training GPT-3 required 10K NVIDIA GPUs
  • PyTorch is highly optimized – built-in libraries, deep integration with GPU hardware (NVIDIA CUDA)
  • Apple has its own GPU stack
  • Want to do this in Java

Solution

  • HAT (Heterogeneous Acceleration Toolkit)
  • Work in progress
  • Part of project Babylon
  • Code models/reflection
  • Barry’s goal: algorithms to run on these

Deeplearning4j (ND4j)

  • CUDA support
  • No MPS (Apple Metal) support
  • Arrays stored off-heap (outside JVM)
  • Several arrays can be views onto subarrays of the same underlying data.

What LLM does

  • After analyzing a possibly incomplete string, the LLM decides what the string’s next token should be.
  • Whole words: too many of them to predict among
  • Characters: too granular because they carry no meaning.

Tokens

  • “I’ve grokked Heinlein’s works” as tokens: I | 've | gro | k | ked | Hein | lein | 's | work | s
  • A token id is the number that identifies a token in the vocabulary
  • A token is a sequence of characters that occurs together frequently enough, determined using byte-pair encoding.
  • Suppose the string is “a b r a c a d a b r a”; the initial tokens are a, b, c, d, r. Then observe that the token pair “ab” appears frequently, so “ab” is also a token. Now we have “ab r a c a d ab r a” with “ab” added to the vocabulary. Repeat: “ab” and “r” appear next to each other, giving “abr a c a d abr a” with “abr” added to the token list. Then “abra”
  • Python library tiktoken
  • Java library JTokkit
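The byte-pair merging described above can be sketched in a few lines of Java. This is a toy version of my own (real tokenizers like tiktoken and JTokkit operate on bytes and use a pre-trained merge table):

```java
import java.util.*;

// Toy byte-pair encoding: repeatedly merge the most frequent adjacent
// token pair into a new, longer token.
public class BpeSketch {
    static List<String> merge(List<String> tokens, int rounds) {
        for (int r = 0; r < rounds; r++) {
            // Count how often each adjacent pair occurs
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < tokens.size() - 1; i++) {
                counts.merge(tokens.get(i) + tokens.get(i + 1), 1, Integer::sum);
            }
            if (counts.isEmpty()) break;
            String best = Collections.max(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            if (counts.get(best) < 2) break;  // no pair repeats; stop merging
            // Replace every occurrence of the best pair with the merged token
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (i < tokens.size() - 1
                        && (tokens.get(i) + tokens.get(i + 1)).equals(best)) {
                    merged.add(best);
                    i++;  // skip the second half of the merged pair
                } else {
                    merged.add(tokens.get(i));
                }
            }
            tokens = merged;
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> t = new ArrayList<>(
                List.of("a", "b", "r", "a", "c", "a", "d", "a", "b", "r", "a"));
        System.out.println(merge(t, 3));  // prints [abra, c, a, d, abra]
    }
}
```

Three merge rounds on “abracadabra” produce the “abra” token from the talk’s example.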

Brain

  • 86 billion neurons in human brain
  • Dendrite – input from another cell
  • Soma – cell body
  • Axon – output to other cells
  • Oversimplified: imagine the cell body multiplies each input by a certain weight (different per cell) and adds them up. That’s like multiplying a vector by a matrix

Math terms

  • Vector – array/list of numbers. Can represent a point in n-dimensional space. Usually visualized as an arrow from the origin to that point. Ex: 1,526 dimensions means 1,526 numbers in the vector
  • Matrix – rectangular array of numbers (a stack of vectors). Multiplying by a matrix turns one vector into another
  • Tensor – stack of matrices; an array of arrays of matrices. Not important here.
  • Dot product of two vectors – multiply elements in same spot in each vector and add them up.
  • Matrix multiplication – had nice animation
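These operations are short to write in plain Java (a from-scratch sketch of mine; libraries like ND4j do the same thing much faster). Note that each output element of the matrix-vector product is one “neuron” from the brain analogy: a row of weights dotted with the inputs.

```java
import java.util.Arrays;

// Dot product and matrix-vector multiplication from scratch.
public class VectorMath {
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // "A matrix turns one vector into another": each output element is the
    // dot product of one matrix row with the input vector.
    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int row = 0; row < m.length; row++) out[row] = dot(m[row], v);
        return out;
    }

    public static void main(String[] args) {
        double[] cyan = {0, 255, 255}, red = {255, 0, 0};
        System.out.println(dot(cyan, red));  // prints 0.0 -- nothing in common

        // A made-up matrix that swaps a vector's first two elements
        double[][] swap = {{0, 1, 0}, {1, 0, 0}, {0, 0, 1}};
        System.out.println(Arrays.toString(matVec(swap, red)));  // [0.0, 255.0, 0.0]
    }
}
```

The cyan/red dot product of 0 is the example from the talk’s “Vector meaning” section; the swap matrix is my own illustration.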

ND4j

  • N-Dimensional arrays for Java.
  • Knows how to do vector/matrix math

Embedding

  • Each token gets assigned to an arbitrary vector at first. This is the token embedding
  • Picture adjusting a bunny-ears antenna and how it only works while you’re touching it. Walking away breaks the reception.
  • Each number in the initial arbitrary vector is like a dial that needs to be tuned

Gradient Descent

  • Normally there are millions of minimum points, and it’s easy to get stuck in a local minimum – a point that looks like the lowest in all directions, even though there is a lower one elsewhere. LLM training is meant to avoid that pitfall
  • Eclipse DeepLearning4J – can configure neural network and make a model
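The opener’s “find the lowest point while only seeing a few steps around you” can be sketched in one dimension. This toy of mine has a single minimum, and the loss function and learning rate are made up for illustration:

```java
// Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
// Real LLM losses have billions of dimensions and many local minima.
public class Descent {
    static double minimize(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * (x - 3);  // derivative of (x - 3)^2
            x -= learningRate * gradient;   // step "downhill"
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(minimize(10.0, 0.1, 200));  // converges to ~3.0
    }
}
```

The gradient only uses local information (the slope right where you stand), which is exactly the “can only see a few steps ahead” constraint.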

Vector meaning

  • Similar vectors have similar semantics
  • Applying related vectors to others should have consistent semantic meaning
  • RGB for colors are vectors with three points representing the colors.
  • Dot product of cyan (0, 255, 255) and red (255, 0, 0) is 0 because they have nothing in common
  • Add a positional embedding to the token embedding so the model knows where the token is in the sentence. (Added to each element at a different scale so the model knows which part goes with which.) Combined, these form the input embedding
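A sketch of combining a token embedding with a positional embedding. I’m using the sinusoidal scheme from the Transformer paper as the “different scale per element” mentioned above; the token vector values are made up:

```java
import java.util.Arrays;

// Input embedding = token embedding + sinusoidal positional embedding.
public class InputEmbedding {
    static double[] positional(int position, int dims) {
        double[] pe = new double[dims];
        for (int i = 0; i < dims; i++) {
            // Each pair of elements uses a different wavelength, so the sum
            // still lets the model tell positions (and elements) apart.
            double angle = position / Math.pow(10000, (2.0 * (i / 2)) / dims);
            pe[i] = (i % 2 == 0) ? Math.sin(angle) : Math.cos(angle);
        }
        return pe;
    }

    static double[] inputEmbedding(double[] tokenEmbedding, int position) {
        double[] pe = positional(position, tokenEmbedding.length);
        double[] out = new double[tokenEmbedding.length];
        for (int i = 0; i < out.length; i++) out[i] = tokenEmbedding[i] + pe[i];
        return out;
    }

    public static void main(String[] args) {
        double[] catEmbedding = {0.2, -0.1, 0.7, 0.4};  // made-up token vector
        // Same token at position 5 gets a different input embedding than at 0
        System.out.println(Arrays.toString(inputEmbedding(catEmbedding, 0)));
        System.out.println(Arrays.toString(inputEmbedding(catEmbedding, 5)));
    }
}
```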

Attention

  • Attention is all you need https://arxiv.org/abs/1706.03762
  • Attention examples: grammatical structure, meaning, word order
  • Long range dependency – like a pronoun that refers to something many words away
  • Attention helps focus on the important parts. Ex: “The cat sat on the mat” – at “mat,” the model has to know what’s on the mat (the cat).
  • Apply a key matrix to “cat” and a query matrix to “mat.” The key matrix offers info; the query matrix is what you want to know.
  • Start with random values. Multiply the matrices by the tokens, then take the dot product to get a number that predicts something about the next word. See how far the prediction is from what the word actually is, then adjust the values in the direction that makes the error smaller. Repeat a very large number of times
  • A lot of this can be done in parallel
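A toy version of the key/query step in Java. The matrices here are identity stand-ins for the learned ones, and the token vectors are made up; the point is just the mechanics (key × token, query × token, dot product, softmax):

```java
import java.util.Arrays;

// Toy single-head attention scoring.
public class ToyAttention {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int r = 0; r < m.length; r++) out[r] = dot(m[r], v);
        return out;
    }

    // Attention weights of the current token over the context tokens:
    // softmax of (key * token) . (query * current).
    static double[] attend(double[][] keyM, double[][] queryM,
                           double[][] tokens, double[] current) {
        double[] q = matVec(queryM, current);
        double[] weights = new double[tokens.length];
        double sum = 0;
        for (int t = 0; t < tokens.length; t++) {
            weights[t] = Math.exp(dot(matVec(keyM, tokens[t]), q));
            sum += weights[t];
        }
        for (int t = 0; t < tokens.length; t++) weights[t] /= sum;  // softmax
        return weights;
    }

    public static void main(String[] args) {
        double[][] identity = {{1, 0}, {0, 1}};  // stand-ins for learned matrices
        double[][] context = {{1, 0}, {0, 1}, {1, 0}};  // three context tokens
        double[] current = {0, 1};                      // the current token
        System.out.println(Arrays.toString(attend(identity, identity, context, current)));
        // The middle token, most similar to the query, gets the largest weight
    }
}
```

Each context token’s score is independent of the others until the softmax, which is why so much of this parallelizes.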

Feed forward

  • Linear – straight – ex: 2x + 3y (2 and 3 are the knobs to tune)
  • Non-linear – wavy – more dimensions
  • Language isn’t linear by nature
  • “great, terrific meal” – the word meanings can just add up
  • “not good” – “not” flips the meaning of the sentence, so a purely linear model can’t predict the next word

Universal approximation theorem

  • Imagine a wavy line as a series of bumps from the ground to that line
  • The more granular the bump, the more accurate the result.
  • Each bump can be represented by a linear formula
  • Apply the GELU (non-linear) function to make the bumps
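GELU itself is nearly a one-liner. This is the commonly used tanh approximation, not necessarily the exact formula shown in the talk:

```java
// GELU, the non-linear "bump-maker": near-linear for large positive x,
// near-zero for large negative x, and smooth in between (unlike the hard
// corner of ReLU).
public class Gelu {
    static double gelu(double x) {
        return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI)
                * (x + 0.044715 * Math.pow(x, 3))));
    }

    public static void main(String[] args) {
        for (double x : new double[]{-3, -1, 0, 1, 3}) {
            System.out.printf("gelu(%4.1f) = %7.4f%n", x, gelu(x));
        }
    }
}
```

A weighted sum of many shifted copies of a non-linear unit like this can trace out an arbitrary wavy curve, which is the universal approximation idea in a nutshell.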

Experiment

  • Translated Karpathy’s 253-line LLM into Java, trained on a list of baby names
  • Took about a minute to train.
  • Generated new names. A couple looked legit but most looked random

My take

This was great. Tokens and other critical concepts get used without being defined, and we take them for granted. So being able to think about what they actually are was informative and helpful. I like that Barry showed the math but said it was ok not to understand it. [It was understandable.]
