ADD / XOR / ROL: July 2025

Friday, July 11, 2025

Understand Neural Nets better, post 5 of N -- Code Assistant shootout

In a series of previous blogposts [1, 2, 3, 4] I ran some experiments drawing the boundaries of the polytopes generated by a fully-connected leaky ReLU network while it was getting trained on reproducing an input image.

As I tried to scale the experiments to larger networks, I noticed a dramatic slowdown in the code, caused by the calculation of a hash of the activation pattern happening on CPU -- so each training step would be fast, but then everything would grind to a halt for the visualisation, and for each pixel the code would forward-evaluate the NN (all in all 1024*1024 times), and whenever the prediction was calculated, it'd transfer the activation pattern to CPU and then perform the hashing. This was very slow, and very non-parallel.

I had contemplated writing some custom CUDA code to speed things up - there's no reason to store the activation pattern or transfer it, the "right" way to solve the problem is computing a hash on the fly, ideally a hash with a commutative update function so the order in which the different ReLU neurons update the hash doesn't matter.

Then again, this is a hobby project, and I don't have the time to do anything overly smart for the moment. So I decided to - before doing anything sophisticated - I'll see if I can have one of the two existing coding assistant that I use regularly solve the problem for me.

So I created two different directories, checked out the same base repo into both, created branches in both, and then queried both Gemini CLI and Claude Code perform the task, using the following prompt:

The Python script in this directory trains a fully connected leaky ReLU network on an input image and tries 
to reproduce it. It also draws pictures illustrating the boundaries of the polytopes generated by the creases
that the ReLU creates in input space. Unfortunately, the code to generate the polytope visualisation is slow,
because it involves 1024*1024 evaluations of the NN forward, and then it needs to hash the activation pattern
into a hash to uniquely identify what polytope the pixel resides on.

I would like to speed up this computation, by - instead of calculating a hash of the activation pattern at the 
end - somehow embedding the calculation of a hash into the forward pass on-GPU. This might be doable with 
PyTorch hooks, but I don't know precisely. 

What I do know is that if I run 
```
python3 ./draw-poly-while-training.py  --input ./centered_ring.png --shape [100]*20 --epochs 30 --seed 12345678 --points 5050 --save-interval 10
``` 

the output looks something like this: 
```
(...)
Input size (MB): 0.01
Forward/backward pass size (MB): 16.39
Params size (MB): 0.77
Estimated Total Size (MB): 17.17
==========================================================================================
2025-07-08 15:15:25,811 - polytope_nn - INFO - Epoch 1/2000000 - Train Loss: 3.315190, Val Loss: 0.329414
2025-07-08 15:15:25,857 - polytope_nn - INFO - Epoch 2/2000000 - Train Loss: 1.045730, Val Loss: 0.065818
2025-07-08 15:15:25,901 - polytope_nn - INFO - Epoch 3/2000000 - Train Loss: 1.414065, Val Loss: 0.488735
2025-07-08 15:15:25,948 - polytope_nn - INFO - Epoch 4/2000000 - Train Loss: 0.201550, Val Loss: 0.102159
2025-07-08 15:15:26,100 - polytope_nn - INFO - Epoch 5/2000000 - Train Loss: 0.198983, Val Loss: 0.050712
2025-07-08 15:15:26,145 - polytope_nn - INFO - Epoch 6/2000000 - Train Loss: 0.255710, Val Loss: 0.060731
2025-07-08 15:15:26,189 - polytope_nn - INFO - Epoch 7/2000000 - Train Loss: 0.122960, Val Loss: 0.091274
2025-07-08 15:15:26,232 - polytope_nn - INFO - Epoch 8/2000000 - Train Loss: 0.180629, Val Loss: 0.053913
2025-07-08 15:15:26,276 - polytope_nn - INFO - Epoch 9/2000000 - Train Loss: 0.826762, Val Loss: 0.156673
2025-07-08 15:15:26,320 - polytope_nn - INFO - Epoch 10/2000000 - Train Loss: 0.211313, Val Loss: 0.117810
2025-07-08 15:16:27,853 - polytope_nn - INFO - Visualization @ epoch 10: 61.53s
2025-07-08 15:16:27,899 - polytope_nn - INFO - Epoch 11/2000000 - Train Loss: 0.174978, Val Loss: 0.053103
2025-07-08 15:16:27,943 - polytope_nn - INFO - Epoch 12/2000000 - Train Loss: 0.332561, Val Loss: 0.095801
2025-07-08 15:16:27,987 - polytope_nn - INFO - Epoch 13/2000000 - Train Loss: 0.192859, Val Loss: 0.064341
2025-07-08 15:16:28,031 - polytope_nn - INFO - Epoch 14/2000000 - Train Loss: 0.115424, Val Loss: 0.051763
2025-07-08 15:16:28,076 - polytope_nn - INFO - Epoch 15/2000000 - Train Loss: 0.362009, Val Loss: 0.128609
2025-07-08 15:16:28,122 - polytope_nn - INFO - Epoch 16/2000000 - Train Loss: 0.117143, Val Loss: 0.058641
2025-07-08 15:16:28,165 - polytope_nn - INFO - Epoch 17/2000000 - Train Loss: 0.335812, Val Loss: 0.082517
2025-07-08 15:16:28,211 - polytope_nn - INFO - Epoch 18/2000000 - Train Loss: 0.079342, Val Loss: 0.060753
2025-07-08 15:16:28,257 - polytope_nn - INFO - Epoch 19/2000000 - Train Loss: 0.104123, Val Loss: 0.047914
2025-07-08 15:16:28,304 - polytope_nn - INFO - Epoch 20/2000000 - Train Loss: 0.097466, Val Loss: 0.050452
2025-07-08 15:17:31,553 - polytope_nn - INFO - Visualization @ epoch 20: 63.25s
```

From this we can see that a single visualisation step takes more than a minute for a network of this size, and 
profiling shows that most of this time is spent in hashing things on the CPU, not the GPU.
I would like you to find a way to do the calculation of the hash during the forward pass on the GPU, ideally 
without storing the activation vector in memory, and instead having a hash function that can be updated
commutatively so each ReLU unit can update the final hash while it calculates the forward pass.

I want you to:

1) Create a plausible plan for improving and speeding up the code.
2) Implement that plan.
3) Re-run the script with the specified command line, and observe if a speedup indeed took place -- e.g. check
that (a) the visualisation was sped up and (b) the sum of 10 training steps and the visualisation together was
sped up.

It is frightfully easy to speed up the visualisation step but slow down the training steps so much that 10
training steps and 1 visualisation step get *slower*.

Please also verify that the image output is the same between the pre-change and post-change version, to ensure
that the changes do not break anything.

I then allowed both models to churn for a while. Both models provided changes, but Gemini failed to actually verify that the results are the same. Claude one-shotted the problem; Gemini needed the following additional prompt:

I have run your example code, and checked the output. The output images are not identical between the
pre-change and post-change version, and even the training loss changed. FWIW, none of the polytopes
are visible in your version. Could you re-check your work, and this time make sure you check whether
the outputs are the same?

With that extra prodding / prompting, the solution provided by the model worked flawlessly, and was even a tiny bit faster than the Claude version.

Let's look at the code that both models generated: The Gemini branch and the Claude branch. Reading the changes, a few things become clear:

Gemini shot itself in the foot on the RNG by generating a bunch of random hash coefficients, and that messed up the state of the RNG, so the training runs were no longer comparable pre/post change.
Gemini is using torch.matmul for the hash computation, whereas Claude is computing the hash as torch.sum( A * B ).
Claude has broken up the code in more smaller functions, whereas Gemini didn't. Claude's code is mildly more readable, Gemini's is the more minimal change.

Interesting stuff. Neither solution is quite what I had in mind, but they are good enough for the moment, and provide a pretty significant speedup over the (also vibe-coded) stuff that I started out with. This is the first time for me that a coding assistant helped me optimize code in a nontrivial manner, and that's ... certainly something.

Anyhow, with these optimizations I can now run my data visualisation movie generation on slightly larger NNs with millions of parameters, so more studying ahead. I now need to figure out how to upload YouTube videos programmatically, but in the meantime, here is a video of training a 100-neuron, 10 layer deep network on the "circle drawing" task from my previous posts. Vibe coding randomly changed the color of my lines, but hey, that's ok.

As per usual, there are more questions than answers in this video. The thing that puzzles me most is the relative "instability" of the training in later epochs. This is visible in "flickers" where seemingly randomly the SGD step hits on a vastly higher loss, with parts of the screen turning black and loss spiking, and then the training needs to recover. Interestingly, the geometry of the polytopes doesn't change a lot in these situations, but the linear function on many of them changes at once, in a way that is very detrimental to overall performance. Once programmatic uploading works, I'll upload many more videos, because one of the intriguing observations I have is the following:

When training diverges (for larger and deeper nets), the divergence starts by first messing up the linear functions, and only after they are gloriously messed up, the geometry of the polytopes starts to go haywire, too.

Until then!

Sunday, July 06, 2025

A non-anthropomorphized view of LLMs

In many discussions where questions of "alignment" or "AI safety" crop up, I am baffled by seriously intelligent people imbuing almost magical human-like powers to something that - in my mind - is just MatMul with interspersed nonlinearities.

In one of these discussions, somebody correctly called me out on the simplistic nature of this argument - "a brain is just some proteins and currents". I felt like I should explain my argument a bit more, because it feels less simplistic to me:

The space of words

The tokenization and embedding step maps individual words (or tokens) to some \(\mathbb{R}^n\) vectors. So let us imagine for a second that we have \(\mathbb{R}^n\) in front of us. A piece of text is then a path through this space - going from word to word to word, tracing a (possibly convoluted) line.

Imagine now that you label each of the "words" that form the path with a number: The last word with 1, counting forward until you hit the first word or the maximum context length \(c\). If you've ever played the game "Snake", picture something similar, but played in very high-dimensional space - you're moving forward through space with the tail getting truncated off.

The LLM takes your previous path into account, calculates probabilities for the next point to go to, and then makes a random pick into the next point according to these probabilities. An LLM instantiated with a fixed random seed is a mapping of the form \((\mathbb{R}^n)^c \mapsto (\mathbb{R}^n)^c\).

In my mind, the paths generated by these mappings look a lot like strange attractors in dynamical systems - complicated, convoluted paths that are structured-ish.

Learning the mapping

We obtain this mapping by training it to mimic human text. For this, we use approximately all human writing we can obtain, plus corpora written by human experts on a particular topic, plus some automatically generated pieces of text in domains where we can automatically generate and validate them.

Paths to avoid

There are certain language sequences we wish to avoid - because the sequences these models generate try to mimic human speech in all it's empirical structure, but we feel that some of the things that humans have empirically written are very undesirable to be generated. We also feel that a variety of other paths should ideally not be generated, if - when interpreted by either humans or other computer systems - undesirable results arise.

We can't specify strictly in a mathematical sense which paths we would prefer not to generate, but we can provide examples and counterexamples, and we try to hence nudge the complicated learnt distribution away from them.

"Alignment" for LLMs

Alignment and safety for LLMs mean that we should be able to quantify and bound the probability with which certain undesirable sequences are generated. The trouble is that we largely fail at describing "undesirable" except by example, which makes calculating bounds difficult.

For a given LLM (without random seed) and sequence, it is trivial to calculate the probability of the sequence to be generated. So if we had a way of somehow summing or integrating over these probabilities, we could say with certainty "this model will generate an undesirable sequence once every N model evaluations". We can't, currently, and that sucks, but at the heart, this is the mathematical and computational problem we'd need to solve.

The surprising utility of LLMs

LLMs solve a large number of problems that could previously not be solved algorithmically. NLP (as the field was a few years ago) has largely been solved.

I can write a request in plain English to summarize a document for me and put some key datapoints from the document in a structured JSON format, and modern models will just do that. I can ask a model to generate a children's book story involving raceboats and generate illustrations, and the model will generate something that is passable. And much more, all of which would have seemed like absolute science fiction 5-6 years ago.

We're on a pretty steep improvement curve, so I expect the number of currently-intractable problems that these models can solve to keep increasing for a while.

Where anthropomorphization loses me

The moment that people ascribe properties such as "consciousness" or "ethics" or "values" or "morals" to these learnt mappings is where I tend to get lost. We are speaking about a big recurrence equation that produces a new word, and that stops producing words if we don't crank the shaft.

To me, wondering if this contraption will "wake up" is similarly bewildering as if I was to ask a computational meteorologist if he isn't afraid of his meteorological numerical calculation will "wake up".

I am baffled that the AI discussions seem to never move away from treating a function to generate sequences of words as something that resembles a human. Statements such as "an AI agent could become an insider threat so it needs monitoring" are simultaneously unsurprising (you have a randomized sequence generator fed into your shell, literally anything can happen!) and baffling (you talk as if you believe the dice you play with had a mind of their own and could decide to conspire against you).

Instead of saying "we cannot ensure that no harmful sequences will be generated by our function, partially because we don't know how to specify and enumerate harmful sequences", we talk about "behaviors", "ethical constraints", and "harmful actions in pursuit of their goals". All of these are anthropocentric concepts that - in my mind - do not apply to functions or other mathematical objects. And using them muddles the discussion, and our thinking about what we're doing when we create, analyze, deploy and monitor LLMs.

This muddles the public discussion. We have many historical examples of humanity ascribing bad random events to "the wrath of god(s)" (earthquakes, famines, etc.), "evil spirits" and so forth. The fact that intelligent highly educated researchers talk about these mathematical objects in anthropomorphic terms makes the technology seem mysterious, scary, and magical.

We should think in terms of "this is a function to generate sequences" and "by providing prefixes we can steer the sequence generation around in the space of words and change the probabilities for output sequences". And for every possible undesirable output sequence of a length smaller than \(c\), we can pick a context that maximizes the probability of this undesirable output sequence.

A much clearer formulation, which helps more clearly articulate the problems to solve.

Why many AI luminaries tend to anthropomorphize

Perhaps I am fighting windmills, or rather a self-selection bias: A fair number of current AI luminaries have self-selected by their belief that they might be the ones getting to AGI - "creating a god" so to speak, the creation of something like life, as good as or better than humans. You are more likely to choose this career path if you believe that it is feasible, and that current approaches might get you there. Possibly I am asking people to "please let go of the belief that you based your life around" when I am asking for an end to anthropomorphization of LLMs, which won't fly.

Why I think human consciousness isn't comparable to an LLM

The following is uncomfortably philosophical, but: In my worldview, humans are dramatically different things than a function \((\mathbb{R}^n)^c \mapsto (\mathbb{R}^n)^c\). For hundreds of millions of years, nature generated new versions, and only a small number of these versions survived. Human thought is a poorly-understood process, involving enormously many neurons, extremely high-bandwidth input, an extremely complicated cocktail of hormones, constant monitoring of energy levels, and millions of years of harsh selection pressure.

We understand essentially nothing about it. In contrast to an LLM, given a human and a sequence of words, I cannot begin putting a probability on "will this human generate this sequence".

To repeat myself: To me, considering that any human concept such as ethics, will to survive, or fear, apply to an LLM appears similarly strange as if we were discussing the feelings of a numerical meteorology simulation.

The real issues

The function class represented by modern LLMs are very useful. Even if we never get anywhere close to AGI and just deploy the current state of technology everywhere where it might be useful, we will get a dramatically different world. LLMs might end up being similarly impactful as electrification.

My grandfather lived from 1904 to 1981, a period which encompassed moving from gas lamps to electric, the replacement of horse carriages by cars, nuclear power, transistors, all the way to computers. It also spanned two world wars, the rise of Communism and Stalinism, almost the entire lifetime of the USSR and GDR etc. The world on his birth looked nothing like the world when he died.

Navigating the dramatic changes of the next few decades while trying to avoid world wars and murderous ideologies is difficult enough without muddying our thinking.