Thursday, April 10, 2025

Some experiments to help me understand Neural Nets better, post 3 of N

What is this? After my first post on the topic, 9 months elapsed before I posted again, and now I am posting within days of the last post?

Anyhow, after my last post I could not resist and started running some experiments trying to see whether I could induce "overfitting" in the neural networks I had been training - trying to get a heavily overparametrized neural network to just "memorize" the training points so it generalizes poorly.

In the experiments I ran in previous posts, one of the key advantages is that I know the "true distribution" from which we are drawing our training data -- the input image. An overfit network would hence find ways to color the points in the training data correctly, but somehow not do so by drawing a black ring on white background (so it would be correct on the training data but fail to generalize).

So the experiment I kicked off was the following: start with a network that has many times more parameters than we have training points. Since we start with 5000 training points, I picked 30 layers of 30 neurons, for a total of approximately 27000 parameters. If von Neumann said he could draw an elephant with 4 parameters and make it wiggle its trunk with 5, he'd certainly manage to fit 5000 training points with 27000 parameters?
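For concreteness, here is a minimal sketch of how I think of such a network being put together - assuming PyTorch, a 2-dimensional input (the pixel coordinates), a 1-dimensional output (the pixel value), and leaky ReLU throughout; the helper make_mlp and the exact details are illustrative, not the actual code behind these runs:

```python
import torch.nn as nn

def make_mlp(widths, negative_slope=0.01):
    """Build a leaky-ReLU MLP from a list of layer widths, e.g. [2] + [30]*30 + [1]."""
    layers = []
    for i in range(len(widths) - 1):
        layers.append(nn.Linear(widths[i], widths[i + 1]))
        if i < len(widths) - 2:               # no nonlinearity after the output layer
            layers.append(nn.LeakyReLU(negative_slope))
    return nn.Sequential(*layers)

# 30 hidden layers of 30 neurons each: 2-D input (x, y), 1-D output (pixel value).
net = make_mlp([2] + [30] * 30 + [1])
print(sum(p.numel() for p in net.parameters()))   # roughly 27000 parameters
```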

Anyhow, to my great surprise, there was no hint of overfitting:


The network very clearly learns to draw a circle instead of fitting individual points. That is somewhat surprising, but perhaps it is just an artifact of our training points being relatively "dense" in the space: 5000 training points out of 1024*1024 is still about 0.4% of the total space, which is a good chunk.

As a next step, I trained the same network, but with ever-reduced quantities of training data: 2500 points, 1250 points, 625 points, and 312 points. Surely training on 312 data points using 27000 parameters should generate clear signs of overfitting?
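The subsets are produced by repeatedly halving, which is where the slightly odd 312 comes from: 5000 halved four times is roughly 312. A small sketch of what I mean, using numpy and a hypothetical index array - whether one subsamples the original 5000 points or samples fresh points shouldn't matter much for the argument:

```python
import numpy as np

rng = np.random.default_rng(0)

def halving_subsets(indices, n_halvings=4):
    """Repeatedly halve a set of training indices: 5000 -> 2500 -> ... -> 312."""
    subsets = [indices]
    for _ in range(n_halvings):
        prev = subsets[-1]
        keep = rng.choice(len(prev), size=len(prev) // 2, replace=False)
        subsets.append(prev[keep])
    return subsets

print([len(s) for s in halving_subsets(np.arange(5000))])   # [5000, 2500, 1250, 625, 312]
```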

At 2500 points, while there is a noticeable slowdown in the training process, the underlying concept seems to be learnt just fine:
As we drop much lower, to 625 points, we can see how the network is struggling much more to learn the concept, but ... it still seems to have a strong bias toward creating a geometric shape resembling the ring instead of overfitting on individual points?

It appears that the learning process is slowed down - by epoch 6000 the network hasn't managed to reproduce the entire circle yet - and training seems to be less stable, but it looks as if the network is moving in the right direction. What happens if we halve the training points once more?

It's a bit of a mystery - I would have expected that by now we're clearly in a regime where the network should try to fit individual points; we gave it just about 0.03% of the points in the space. The network is clearly struggling to learn, and by epoch 6000 it is far from "ready" -- but it's certainly working towards a ring shape.

These experiments raise a number of questions for me:

1. It seems clear to me that the networks have some form of baked-in tendency to form contiguous areas - perhaps even a geometric shape - and the data needs to become very very sparse in order for true overfitting to occur. It's really unclear to me why we see the emergence of shapes here -- it would certainly be easy for the network to just pick the 312 polytopes in which the training points reside, and their immediate neighbors, and then have a steep linear function with big parameters to color just the individual dots black. But that's not what is happening here; there's some mechanism or process that leads to the emergence of a shape.
2. It almost seems like there is a trade-off -- if you have less data, you need to train longer, perhaps much longer. And it's really not clear to me that we wouldn't eventually arrive at comparably good approximations even with just 312 data points.

As a next step, I am re-running these experiments with 20000 epochs instead of 6000, to see if the network trained on very sparse training data catches up with the networks that have more data over time.

Saturday, April 05, 2025

Some experiments to help me understand Neural Nets better, post 2 of N

In this post, I will explain my current thinking about neural networks. In a previous post I explained the intuition behind my "origami view of NNs" (also called the "polytope lens" in some circles). In this post, I will go a little bit into the mathematical details of this.

The standard textbook explanation of a layer of a neural network looks something like this: 

$$\sigma(Wx + b)$$

where $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinearity (either the sigmoid or the ReLU or something like it), $W$ is the matrix of weights attached to the edges coming into the neurons, and $b$ is the vector of "biases". Personally, I find this notation somewhat cumbersome, and I prefer to pull the bias vector into the weight matrices, so that I can think of an NN as "matrix multiplications alternating with applying a nonlinearity".

I really don't like to think about NNs with nonlinearities other than ReLU and leaky ReLU - perhaps over time I will have to accept that these are a thing, but for now all NNs that I think about are either ReLU or leaky ReLU. For the moment, we also assume that the network outputs a real vector in the end, so it is not (yet) a classifier.
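For reference, these are the two nonlinearities I mean; I use the common 0.01 slope for leaky ReLU, which is where the 0.01 that appears further down comes from:
$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{LeakyReLU}(x) = \begin{cases} x & x \ge 0 \\ 0.01\,x & x < 0 \end{cases}$$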

Assume we have a network with $k$ layers, and the numbers of neurons in the layers are $n_1, \ldots, n_k$. The network maps between real vector spaces (or an approximation thereof) of dimension $i$ and $o$:
$$NN: \mathbb{R}^i \to \mathbb{R}^o$$
I would like to begin by pulling the bias vector into the matrix multiplications, because it greatly simplifies notation. So the input vector $x$ gets augmented by appending a 1, and the bias vector $b$ gets appended to $W$:
$$\overline{W} = \begin{bmatrix} W & b \end{bmatrix}, \qquad \overline{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}$$
Instead of $\sigma(Wx+b)$ we can write $\sigma(\overline{W}\,\overline{x})$.
In our case, $\sigma$ is always ReLU or leaky ReLU, so a "1" will be mapped to a "1" again. For reasons of being able to compose things nicely later, I would also like the output of $\sigma(\overline{W}\,\overline{x})$ to have a 1 as its last component, like our input vector $\overline{x}$. To achieve this, I need to append to $\overline{W}$ a row of all zeroes terminated in a 1. Finally we have:
$$\overline{W} = \begin{bmatrix} W & b \\ 0 \cdots 0 & 1 \end{bmatrix}, \qquad \overline{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}$$
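A quick numpy sanity check of this bookkeeping, with arbitrary shapes just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # weights of a layer with 3 inputs and 4 neurons
b = rng.normal(size=(4,))      # biases of that layer
x = rng.normal(size=(3,))      # an input vector

relu = lambda z: np.maximum(z, 0.0)

# Augmented matrix: append b as a column, then a row of zeroes ending in 1.
W_bar = np.block([[W, b[:, None]], [np.zeros((1, 3)), np.ones((1, 1))]])
x_bar = np.append(x, 1.0)

# sigma(W x + b), with a trailing 1 appended, equals sigma(W_bar x_bar).
assert np.allclose(np.append(relu(W @ x + b), 1.0), relu(W_bar @ x_bar))
```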
The previous post explained why the NN divides the input space into polytopes on which the approximated function will be entirely linear. Consider a data point $x_1$. If you evaluate the NN on $x_1$, a few of the ReLUs will light up (because their incoming data sums to more than 0) and a few will not. For a given $x_1$, there will be $k$ boolean vectors representing the activation (or non-activation) of each ReLU in the NN. This means we have a function which, for a given input vector, layer, and neuron number in the layer, returns either 0 or 1 in the ReLU case, or 0.01 or 1 in the leaky ReLU case.

We call this function $a$. We could make it a function with three arguments (layer, neuron index, input vector), but I prefer to move the layer and the neuron index into indices, so we have:
$$a_{l,n}: \mathbb{R}^i \to \{0, 1\} \text{ for ReLU}$$
and
$$a_{l,n}: \mathbb{R}^i \to \{0.01, 1\} \text{ for leaky ReLU.}$$
This gives us a very linear-algebra-ish expression for the entire network:
$$NN(x) = W_1 A_1 \cdots W_k A_k x = \prod_{i=1}^{k} (W_i A_i)\, x$$
where the $A_k$ are of the form
$$A_k = \begin{pmatrix} a_{k,1}(x) & & & 0 \\ & \ddots & & \\ & & a_{k,n_k}(x) & \\ 0 & & & 1 \end{pmatrix}$$
So we can see now very clearly that the moment the activation pattern is determined, the entire function becomes linear: just a series of matrix multiplications in which every second matrix is a diagonal matrix with the values of the activation pattern on the diagonal.

This representation shows us that the function remains identical (and linear) provided the activation pattern does not change - points on the same polytope will have an identical activation pattern, and we can hence use the activation pattern as a "polytope identifier": for any input point $x$ I can run it through the network, and if a second point $x'$ has the same pattern, I know it lives on the same polytope.
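To make this concrete, here is a small numpy sketch of what I mean by using the activation pattern as a polytope identifier, and of how fixing the pattern collapses the network into a single affine map. I keep $W$ and $b$ separate here rather than using the augmented matrices, and I assume the last layer is a plain linear readout; the helper names are made up for illustration:

```python
import numpy as np

def forward_with_pattern(weights, biases, x, slope=0.01):
    """Run a leaky-ReLU MLP and record the activation pattern of every hidden layer."""
    pattern, a = [], x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        p = np.where(z > 0, 1.0, slope)       # the a_{l,n}(x) values for this layer
        pattern.append(p)
        a = p * z                             # leaky ReLU as element-wise scaling
    return weights[-1] @ a + biases[-1], pattern   # final layer stays linear

def same_polytope(pattern1, pattern2):
    """Two inputs lie on the same polytope iff their patterns agree everywhere."""
    return all(np.array_equal(p, q) for p, q in zip(pattern1, pattern2))

def collapse_to_affine(weights, biases, pattern):
    """For a fixed activation pattern, the whole network is one affine map W_eff x + b_eff."""
    W_eff = np.diag(pattern[0]) @ weights[0]
    b_eff = pattern[0] * biases[0]
    for W, b, p in zip(weights[1:-1], biases[1:-1], pattern[1:]):
        W_eff = np.diag(p) @ W @ W_eff
        b_eff = p * (W @ b_eff + b)
    return weights[-1] @ W_eff, weights[-1] @ b_eff + biases[-1]
```

For any input $x$, the collapsed map built from its own pattern reproduces the network output exactly (up to floating point), which is just the product-of-matrices statement above with the biases kept separate.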

So from this I can take the sort of movies for single-layer NNs that were created in part 1 - where we can take an arbitrary 2-dimensional image as the unknown distribution that we wish to learn and then visualize the training dynamics: Show how the input space is cut up into different polytopes on which the function is then linearly approximated, and show how this partition and approximation evolves through the training process for differently-shaped networks.

We take input images of size 1024x1024, so one megabyte of byte-sized values, and sample 5000 data points from them - a small fraction, about 0.4% of the overall points in the image. We specify a shape for the MLP, and train it for 6000 steps, visualizing progress.
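Roughly, the data generation looks like this - assuming the image is a 1024x1024 numpy array of grayscale byte values; the function name and the normalization to [0, 1] are my illustrative choices, not necessarily what the actual training code does:

```python
import numpy as np

def sample_training_points(image, n_points=5000, seed=0):
    """Sample pixel coordinates and their grayscale values from the image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    ys = rng.integers(0, h, size=n_points)
    xs = rng.integers(0, w, size=n_points)
    inputs = np.stack([xs / w, ys / h], axis=1).astype(np.float32)    # (x, y) scaled to [0, 1)
    targets = (image[ys, xs] / 255.0).astype(np.float32)[:, None]     # intensity scaled to [0, 1]
    return inputs, targets
```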

For simplicity, we try to learn a black ring on a white background, with sharply-delineated edges - first with a network that has 14 neurons per layer and is 6 layers deep.

On the left-hand side, we see the evaluated NN with the boundaries of the polytopes that it has generated to split the input space. In the center, we see only the output of the NN - what the NN has "learnt" to reproduce so far. And on the right-hand side we see the original image, with the tiny, barely perceptible red dots marking the 5000 training points and the blue dots marking a validation set of 1000 points.

Here is a movie of the dynamics of the training run:

This is pretty neat - how about a differently-shaped NN? What happens if we force the NN through a 2-neuron bottleneck during the training process?
This last network has 10 layers of 10 neurons, then one layer of 2 neurons, then another 3 layers of 10 neurons. By number of parameters it is vaguely comparable to the other network, but it exhibits noticeably different training dynamics.

What happens if we dramatically overparametrize a network? Will it overfit our underlying data, and find a way to carve up the input space to reduce the error on the training set without reproducing a circle?

Let's try - how about a network with 20 neurons per layer, 40 layers deep? That should use something like 20k floating-point parameters in order to learn 5000 data points, so perhaps it will overfit?
Turns out this example doesn't, but it offers particularly rich dynamics as we watch it: Around epoch 1000 we can see how the network seems to have the general shape of the circle figured out, and most polytope boundaries seem to migrate to this circle. The network wobbles a bit but seems to make headway. By epoch 2000 we think we have seen it all, and the network will just consolidate around the circle. Between epoch 3000 and 4000 something breaks, loss skyrockets, and it seems like the network is disintegrating and training is diverging. By epoch 4000 it has re-stabilized, but in a very different configuration for the input space partition. This video ends around epoch 5500.

This is quite fascinating. There is no sign of overfitting, but we can see that, as the network gets deeper, training gets less stable: the circle seems to wobble much more, and we have these strange catastrophic-seeming phase changes after which the network has to re-stabilize. It also appears as if the network accurately captures the "circle" shape in spite of having only relatively few data points and more than enough capacity to overfit on them.

I will keep digging into this whenever time permits; I hope this was entertaining and/or informative. My next quest will be building a tool that - for a given point in input space - extracts the system of linear inequalities describing the polytope that this point lives on. Please do not hesitate to reach out if you ever wish to discuss any of this!
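For what it's worth, here is a rough numpy sketch of how I currently imagine that extraction working: once the activation pattern is fixed, every hidden neuron's pre-activation is an affine function of the input, and its sign at the given point tells us on which side of the corresponding hyperplane the polytope lies. This is a sketch of the idea under those assumptions, not the tool itself:

```python
import numpy as np

def polytope_inequalities(weights, biases, x, slope=0.01):
    """Collect one half-space constraint per hidden neuron for the polytope containing x:
    sign * (c . x + d) >= 0, with sign = +1 if the neuron is active at x, else -1."""
    ineqs = []
    W_eff, b_eff = np.eye(len(x)), np.zeros(len(x))   # affine map from the input to the current layer's input
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        C = W @ W_eff                  # pre-activations written as affine functions of the input
        d = W @ b_eff + b
        for c_row, d_val, z_val in zip(C, d, z):
            ineqs.append((c_row, d_val, 1 if z_val > 0 else -1))
        p = np.where(z > 0, 1.0, slope)
        a = p * z
        W_eff, b_eff = np.diag(p) @ C, p * d          # fold this layer into the affine map
    return ineqs
```

The intersection of these half-spaces is (up to boundary cases where a pre-activation is exactly zero) the polytope that $x$ lives on.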