Thursday, May 22, 2025
Some experiments to help me understand Neural Nets better, post 4 of N
After the previous blog posts here, here, and here, a friend of mine pointed me to some literature to read, and I will do so now :-).
The papers on my reading list are:
1. https://proceedings.mlr.press/v80/balestriero18b.html - Randall Balestriero's paper on DNNs as splines (2018).
2. https://arxiv.org/abs/1906.00904 - ReLU networks have surprisingly few activation patterns (2019)
3. https://arxiv.org/abs/2305.09145 - Deep ReLU networks have surprisingly simple polytopes (2023)
4. https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2023.1274831/full
I'll blog more once I get around to reading them all.
Thursday, April 10, 2025
Some experiments to help me understand Neural Nets better, post 3 of N
What is this? After my first post on the topic, 9 months elapsed before I posted again, and now I am posting within days of the last post?
Anyhow, after my last post I could not resist and started running some experiments trying to see whether I could induce "overfitting" in the neural networks I had been training - trying to get a heavily overparametrized neural network to just "memorize" the training points so it generalizes poorly.
In the experiments I ran in previous posts, one of the key advantages is that I know the "true distribution" from which we are drawing our training data -- the input image. An overfit network would hence find ways to color the points in the training data correctly, but somehow not do so by drawing a black ring on white background (so it would be correct on the training data but fail to generalize).
So the experiment I kicked off was the following: Start with a network that has many times more parameters than we have training points: Since we start with 5000 training points, I picked 30 layers of 30 neurons for a total parameter count of approximately 27000 parameters. If von Neumann said he could draw an elephant with 4 parameters and make it wriggle its trunk with 5, he'd certainly manage to fit 5000 training points with 27000 parameters?
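As a sanity check on that parameter count, here's a tiny helper (mine, not part of the experiment code) that counts weights and biases for a fully-connected net:

```python
def mlp_param_count(input_dim, hidden_width, hidden_layers, output_dim):
    """Count the weights and biases of a fully-connected MLP."""
    dims = [input_dim] + [hidden_width] * hidden_layers + [output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# 2 inputs (x, y), 30 hidden layers of 30 neurons each, 1 output value:
print(mlp_param_count(2, 30, 30, 1))  # -> 27091, i.e. roughly 27000 parameters
```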
Anyhow, to my great surprise, there was no hint of overfitting:
The network very clearly learns to draw a circle instead of fitting individual points. That is somewhat surprising, but perhaps this is just an artifact of our training points being relatively "dense" in the space: 5000 training points out of 1024*1024 is still roughly 0.5%, a good chunk of the total space.
At 2500 points, while there is a noticeable slowdown in the training process, the underlying concept seems to be learnt just fine:
As we drop much lower, to 625 points, we can see how the network is struggling much more to learn the concept, but ... it still seems to have a strong bias toward creating a geometric shape resembling the ring instead of overfitting on individual points?
It's a bit of a mystery - I would have expected that by now we're clearly in a regime where the network should try to fit individual points; we gave it just about 0.06% of the points in the space. The network is clearly struggling to learn, and by epoch 6000 it is far from "ready" -- but it's certainly working towards a ring shape.
These experiments raise a number of questions for me:
1. It seems clear to me that the networks have some form of baked-in tendency to form contiguous areas - perhaps even a geometric shape - and the data needs to become very very sparse in order for true overfitting to occur. It's really unclear to me why we see the emergence of shapes here -- it would certainly be easy for the network to just pick the 312 polytopes in which the training points reside, and their immediate neighbors, and then have a steep linear function with big parameters to color just the individual dots black. But that's not what is happening here; there's some mechanism or process that leads to the emergence of a shape.
As a next step, I am re-running these experiments with 20000 epochs instead of 6000, to see if the network trained on very sparse training data catches up with the networks that have more data over time.
Saturday, April 05, 2025
Some experiments to help me understand Neural Nets better, post 2 of N
In this post, I will explain my current thinking about neural networks. In a previous post I explained the intuition behind my "origami view of NNs" (also called the "polytope lens" in some circles). In this post, I will go a little bit into the mathematical details of this.
The standard textbook explanation of a layer of a neural network looks something like this:
\[ \sigma( \overline{W}x + b )\]
where \(\sigma : \mathbb{R} \to \mathbb{R}\) is a nonlinearity (either the sigmoid or the ReLU or something like it), \(\overline{W}\) is the matrix of weights attached to the edges coming into the neurons, and \(b\) is the vector of "biases". Personally, I find this notation somewhat cumbersome, and I prefer to pull the bias vector into the weight matrices, so that I can think of an NN as "matrix multiplications alternating with applying a nonlinearity". The entire network is then a function
\[
NN : \mathbb{R}^i \to \mathbb{R}^o
\]
I would like to begin by pulling the bias vector into the matrix multiplications, because it greatly simplifies notation. So the input vector \(\overline{x}\) gets augmented by appending a 1, and the bias vector \(b\) gets appended to \(\overline{W}\):
\[
W' = [\,\overline{W} \mid b\,], \quad x = \left[\begin{array}{c}\overline{x}\\1\end{array}\right]
\]
In our case, \(\sigma\) is always ReLU or leaky ReLU, so a "1" will be mapped to a "1" again. For reasons of being able to compose things nicely later, I would also like the output of \(\sigma(W'x)\) to have a 1 as last component, like our input vector \(x\). To achieve this, I need to append a row of all zeroes terminated in a 1 to \(W'\). Finally we have:
\[
W = \left[\begin{array}{cc}\overline{W} & b\\ 0 \cdots 0 & 1\end{array}\right], \quad x = \left[\begin{array}{c}\overline{x}\\1\end{array}\right]
\]
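A quick numpy sketch of this augmentation (the helper name is mine, purely for illustration):

```python
import numpy as np

def augment(W_bar, b):
    """Fold the bias vector into the weight matrix:
    returns W such that W @ [x, 1] == [W_bar @ x + b, 1]."""
    rows, cols = W_bar.shape
    W = np.zeros((rows + 1, cols + 1))
    W[:rows, :cols] = W_bar
    W[:rows, cols] = b
    W[rows, cols] = 1.0   # the extra row (0, ..., 0, 1) preserves the trailing 1
    return W

W_bar = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])
x_bar = np.array([0.5, -1.0])
x = np.append(x_bar, 1.0)
assert np.allclose(augment(W_bar, b) @ x, np.append(W_bar @ x_bar + b, 1.0))
```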
For a given input, each ReLU neuron is either "active" (its pre-activation is positive and passes through with slope 1) or "inactive" (slope 0 for ReLU, slope 0.01 for leaky ReLU). We can capture this in a function that maps an input point to the slope a given neuron applies to it. We call this function \(a\). We could make it a function with three arguments (layer, neuron index, input vector), but I prefer to move the layer and the neuron index into indices, so we have:
\[
a_{l, n} : \mathbb{R}^i \to \{ 0, 1 \} \textnormal{ for ReLU}
\]
and
\[
a_{l, n} : \mathbb{R}^i \to \{ 0.01, 1 \} \textnormal{ for leaky ReLU}
\]
This gives us a very linear-algebra-ish expression for the entire network:
\[
NN(x) = W_1 A_1 \dots W_k A_k x = \left( \prod_{i=1}^{k} W_i A_i \right) x
\]
where \(A_k\) is the diagonal matrix containing the activation values \(a_{k, n}(x)\) of the neurons in layer \(k\):
\[
A_k = \left( \begin{array}{cccc} a_{k, 1}(x) & \dots & 0 & 0\\ \vdots & \ddots & \vdots & \vdots \\ 0 & \dots & a_{k, n_k}(x) & 0\\ 0 & \dots & 0 & 1\end{array}\right)
\]
This representation shows us that the function remains identical (and linear) provided the activation pattern does not change - points on the same polytope will have an identical activation pattern, and we can hence use the activation pattern as a "polytope identifier" -- for any input point \(x\) I can run it through the network, and if a second point \(x'\) has the same pattern, I know it lives on the same polytope.
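As a minimal illustration of using the activation pattern as a polytope identifier - the small model and helper below are stand-ins (assuming PyTorch), not my actual experiment code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small leaky-ReLU MLP: 2 inputs, three hidden layers of 14 neurons, 1 output.
model = nn.Sequential(
    nn.Linear(2, 14), nn.LeakyReLU(0.01),
    nn.Linear(14, 14), nn.LeakyReLU(0.01),
    nn.Linear(14, 14), nn.LeakyReLU(0.01),
    nn.Linear(14, 1),
)

def activation_pattern(model, x):
    """Return a tuple of booleans, one per hidden neuron: True if the neuron's
    pre-activation is positive for input x (slope 1), False otherwise
    (slope 0, or 0.01 for leaky ReLU)."""
    pattern = []
    h = x
    with torch.no_grad():
        for layer in model:
            h = layer(h)
            if isinstance(layer, nn.LeakyReLU):
                # h is the post-activation here; leaky ReLU preserves the sign,
                # so h > 0 iff the pre-activation was > 0.
                pattern.extend((h > 0).flatten().tolist())
    return tuple(pattern)

# Two nearby points: if their activation patterns match, they lie on the same
# polytope, and the network restricted to it is a single affine function.
p1 = torch.tensor([0.30, 0.40])
p2 = torch.tensor([0.31, 0.40])
print(activation_pattern(model, p1) == activation_pattern(model, p2))
```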
So from this I can create the same sort of movies that were made for single-layer NNs in part 1: take an arbitrary 2-dimensional image as the unknown distribution that we wish to learn, and then visualize the training dynamics - show how the input space is cut up into different polytopes on which the function is then linearly approximated, and how this partition and approximation evolve through the training process for differently-shaped networks.
We take input images of size 1024x1024, so one megabyte of byte-sized values, and sample 5000 data points from them - a small fraction, roughly 0.5% of the overall points in the image. We specify a shape for the MLP, and train it for 6000 steps, visualizing progress.
For simplicity, we try to learn a black ring on a white background, with sharply-delineated edges - first with a network that has 14 neurons per layer and is 6 layers deep.
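For readers who want to reproduce something similar, here is a stripped-down sketch of the setup (PyTorch; the hyperparameters and the synthetic ring image are placeholders - my actual code samples from a real image file and also renders the polytope boundaries):

```python
import numpy as np
import torch
import torch.nn as nn

# Ground truth: a 1024x1024 grayscale image with values in [0, 1].
# Here we just synthesize a black ring on a white background directly.
yy, xx = np.mgrid[0:1024, 0:1024] / 1024.0
radius = np.sqrt((xx - 0.5) ** 2 + (yy - 0.5) ** 2)
image = np.where((radius > 0.3) & (radius < 0.4), 0.0, 1.0)

# Sample 5000 training points: coordinates in [0,1]^2, target = pixel value.
rng = np.random.default_rng(0)
idx = rng.integers(0, 1024, size=(5000, 2))
X = torch.tensor(idx / 1024.0, dtype=torch.float32)
y = torch.tensor(image[idx[:, 0], idx[:, 1]], dtype=torch.float32).unsqueeze(1)

# MLP: 6 hidden layers of 14 leaky-ReLU neurons, one linear output.
layers, dims = [], [2] + [14] * 6
for d_in, d_out in zip(dims[:-1], dims[1:]):
    layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.01)]
model = nn.Sequential(*layers, nn.Linear(dims[-1], 1))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder learning rate
for step in range(6000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
    # every few hundred steps: evaluate on the full 1024x1024 grid and
    # render one frame of the training movie
```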
On the left-hand side, we see the evaluated NN with the boundaries of the polytopes that it has generated to split the input space. In the center, we see only the output of the NN - what the NN has "learnt" to reproduce so far. And on the right-hand side we see the original image, with tiny, barely perceptible red dots marking the 5000 training points and blue dots marking a validation set of 1000 points.
Here is a movie of the dynamics of the training run:
This last network has 10 layers of 10 neurons, then one layer of 2 neurons, then another 3 layers of 10 neurons. By number of parameters it is vaguely comparable to the other network, but it exhibits noticeably different training dynamics.
What happens if we dramatically overparametrize a network? Will it overfit our underlying data, and find a way to carve up the input space to reduce the error on the training set without reproducing a circle?
Let's try - how about a network with 20 neurons, 40 layers deep? That should use something like 20k floating point parameters in order to learn 5000 data points, so perhaps it will overfit?
Turns out this example doesn't, but it offers particularly rich dynamics as we watch it: Around epoch 1000 we can see how the network seems to have the general shape of the circle figured out, and most polytope boundaries seem to migrate to this circle. The network wobbles a bit but seems to make headway. By epoch 2000 we think we have seen it all, and the network will just consolidate around the circle. Between epoch 3000 and 4000 something breaks, loss skyrockets, and it seems like the network is disintegrating and training is diverging. By epoch 4000 it has re-stabilized, but in a very different configuration for the input space partition. This video ends around epoch 5500.
This is quite fascinating. There is no sign of overfitting, but we can see how, as the network gets deeper, training gets less stable: The circle seems to wobble much more, and we have these strange catastrophic-seeming phase changes after which the network has to re-stabilize. It also appears as if the network accurately captures the "circle" shape in spite of having only relatively few data points and more than enough capacity to overfit on them.
My next quest will be building a tool that - for a given point in input space - extracts the system of linear inequalities describing the polytope that this point lives on; a rough sketch of the core idea follows below. I will keep digging into this whenever time permits. I hope this was entertaining and/or informative - please do not hesitate to reach out if you ever wish to discuss any of this!
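The core idea: for each hidden neuron, express its pre-activation as an affine function of the network input by composing the layers below it; the sign of that pre-activation at the chosen point then yields one linear inequality, and the collection of all of them describes the polytope. An illustrative sketch (stand-in model and function names, not the finished tool):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(2, 14), nn.LeakyReLU(0.01),
    nn.Linear(14, 14), nn.LeakyReLU(0.01),
    nn.Linear(14, 1),
)

def polytope_inequalities(model, x):
    """For the polytope containing x, return (A, c) such that the polytope is
    { z : A @ z <= c }. Each hidden neuron contributes one inequality: its
    pre-activation, written as an affine function of the *input*, keeps the
    sign it has at x everywhere on the polytope."""
    A_rows, c_vals = [], []
    # Track the affine map input -> current pre-activation: h(z) = M @ z + v.
    M = torch.eye(2)
    v = torch.zeros(2)
    with torch.no_grad():
        for layer in model:
            if isinstance(layer, nn.Linear):
                M = layer.weight @ M
                v = layer.weight @ v + layer.bias
            elif isinstance(layer, nn.LeakyReLU):
                pre = M @ x + v                     # pre-activations at x
                for i in range(pre.shape[0]):
                    if pre[i] >= 0:
                        # active: M_i @ z + v_i >= 0  <=>  -M_i @ z <= v_i
                        A_rows.append(-M[i]); c_vals.append(v[i])
                    else:
                        # inactive: M_i @ z + v_i <= 0  <=>  M_i @ z <= -v_i
                        A_rows.append(M[i].clone()); c_vals.append(-v[i])
                # Apply the locally linear activation to the affine map.
                slopes = torch.where(pre >= 0, torch.ones_like(pre),
                                     torch.full_like(pre, 0.01))
                M = slopes.unsqueeze(1) * M
                v = slopes * v
    return torch.stack(A_rows), torch.stack(c_vals)

x = torch.tensor([0.25, 0.75])
A, c = polytope_inequalities(model, x)
assert torch.all(A @ x <= c + 1e-6)   # x satisfies its own polytope inequalities
```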
Sunday, March 02, 2025
The German debt brake is stupid!
Welcome to one of my political posts. This blog post should rightfully be titled "the German debt brake is stupid, and if you support it, so are you (at least in the domain of economics)". Given that a nontrivial number of Germans agree with the debt brake, and given that there is a limit on the sensible number of characters in the title, I chose a shorter title - for brevity and to reduce offense. I nonetheless think that support for the debt brake, and supporters of the debt brake, are stupid.
In the following, I will list the reasons why I think the debt brake is stupid, and talk about a few arguments I have heard in favor of the debt brake, and why I don't buy any of them.
Reason 1: The debt brake is uniquely German, and I think the odds that Germany has somehow uncovered a deeper economic truth than anyone else are not high.
If you engage with economists a bit, you'll hear non-German economists make statements such as "there is economics, and there is German economics, and they have little in common" or "the problem with German economics is that it's really a branch of moral philosophy and not an empirical science". Pretty much the entire world stares in bewilderment at the debt brake law, and I have yet to find a non-German economist of any repute that says the German debt brake is a sensible construct.
The Wikipedia page is pretty blatant in showing that pretty much the only group supporting the debt brake is ... 48% of a sample of 187 German university professors of economics, in a poll conducted by an economic research think tank historically associated with the debt brake.
Now, I am not generally someone that blindly advocates for going with the mainstream majority opinion, but if the path you have chosen is described by pretty much the entire world as bizarre, unempirical, and based on moral vs. scientific judgement, one should possibly interrogate one's beliefs carefully.
If the German debt brake is a sensible construct, then pretty much every other country in the world is wrong by not having it, and the German government has enacted something unique that should convey a tangible advantage. It should also lead to other countries looking at these advantages and thinking about enacting their own, similar, legislation.
The closest equivalent to the German debt brake is the Swiss debt brake - but Switzerland has a lot of direct-democratic institutions that allow a democratic majority to change the constitution; in particular, a simple double majority - a majority of voters in a majority of cantons - is sufficient to remove the debt brake again. Switzerland can still act in times of crisis, provided most voters in most cantons want to.
Germany, with the 2/3rds parliamentary majority required for a constitutional change, cannot. As such, the German debt brake is the most stringent and least flexible such rule in the world.
I don't see any evidence that the debt brake is providing any benefits to either Germans or the world. I see no other country itching to implement a similarly harsh law. Do we really believe that Germany has uncovered a deeper economic truth nobody else can see?
Reason 2: The debt brake is anti-market, and prevents a mutually beneficial market activity
While I am politically center-left, I am fiercely pro-market. I think markets are splendid allocation instruments, decentralized decision-making systems, and information processors, and by and large the primary reason why the West out-competed the USSR when it came to producing goods. Markets allow the many actors in the economy to find ways to obtain mutual advantage by trading with each other, and interfering with markets should be done carefully, usually to correct some form of severe, well-documented market failure (natural monopolies, tragedy of the commons, market for lemons, etc.).
The market for government debt is a market like any other. Investors who believe that the government provides the best risk-adjusted return compared to all other investment opportunities wish to lend the government money to invest. The government pays interest to these investors, based on the risk-free rate plus a risk premium.
Capital markets exist in order to facilitate decentralized resource allocation. If investors think that the best risk-adjusted returns are to be had by loaning the government money to invest in infrastructure or spend on other things, they should be allowed to lend at lower and lower risk premia.
The debt brake interferes in this market by artificially constraining the government demand for debt. Even if investors were willing to pay the German government money to please please invest it in the broader economy, the German government wouldn't be allowed to do it.
In some sense, this is a deep intervention in the natural signaling of debt markets, and the flow of goods. It is unclear what market failure is being addressed here.
Reason 3: The debt brake prevents investments with positive expected value
Suppose an opportunity arises where the government can invest sensibly in basic research or other infrastructure with strongly positive expected value for GDP growth and hence government revenue. Why should an arbitrary debt brake prohibit investments that are going to be a net good for the whole of society?
Reason 4: The debt brake is partially responsible for the poor handling of the migration spike in 2015
Former Chancellor Merkel is often criticised for her "Wir schaffen das" ("We can do it") during the 2015 migration crisis. My main criticism, even back then, was this: a sudden influx of young refugees has the potential to provide a demographic dividend, *provided* one manages to integrate the refugees into society, the work force, and the greater economy rapidly. This necessitates investment, though: German language lessons, housing in economically non-deprived areas, German culture lessons, and much more. Sticking to the debt brake in an exceptional situation such as the 2015 migrant crisis was therefore a terrible idea, because a sudden influx of refugees can have a destabilizing and economically harmful effect if the integration is botched. Successfully integrated people pay taxes and strengthen society; failure of integration leads to unemployment, potentially crime, and social disorder.
My view is that Merkel dropped the entire weight of the integration work on German civil society (which performed as well as it could, and admirably) because she was entirely committed to a stupid and arbitrary rule. I also ascribe some of the strength of Germany's far right to the disappointment that came from this mishandling of a crisis that was also an opportunity.
Reason 5: The debt brake is based on numbers that economists agree are near-impossible to estimate correctly
Reason 6: The debt brake is fundamentally based on a fear that politicians act too much in their own interest - but does not provide a democratic remedy
We can discuss the extent to which this is true, but in the end a democracy should adhere to the sovereign, which is the voters. If we are afraid of a political caste abusing their position as representatives to pilfer the public's coffers, we should give the public more direct voting rights in budgetary matters, not artificially constrain what may be legitimate and good investments.
There is a deep anti-democratic undercurrent in the debt brake discussion: Either that the politicians cannot be trusted to behave in a fiscally responsible manner, or that the voters cannot be trusted to behave in a fiscally responsible manner, or that the views of politicians, voters, and markets about what constitutes fiscal responsibility are somehow incorrect.
Reason 7: A German-style debt brake would be a terrible policy for any business, so why is it a good idea for a country?
Reason 8: A lot of debt-brake advocacy is based on the theory of "starving the beast"
Debt-brake advocates are often simultaneous advocates of lower taxes. The theory is that by lowering taxes (and hence revenues) while creating a hard fiscal wall (the debt brake) one can force the government to cut popular programs to shrink the government - in other situations, cutting popular programs would be difficult as voters would not support it.
This idea was called "starving the beast" among US conservatives in the past. There's plenty of criticism of the approach, and all empirical evidence points to it being a terrible idea. It's undemocratic, too, as one is trying to create a situation of crisis to achieve a goal that would - absent a crisis, under normal democratic conditions - not be achievable.
Reason 9: Germany has let its infrastructure decay to the point where the association of German industry is begging for infrastructure investments
The empirical evidence seems to be: "when presented with a debt brake, politicians don't make necessary investments, and instead prefer to hollow out existing infrastructure".
Reason 10: Europe needs rearmament now, which requires long-term commitments to defense spending, but also investment in R&D etc.
The post-1945 rules-based order has been dying - first slowly during the GWOT, then it convulsed with the first Trump term; it looked like it might survive when Biden got elected, but with the second Trump term it is clear that it is dead. Europeans have ignored for 20 years that this was coming, in spite of the fact that everybody who made regular trips to Washington DC saw it. The debt brake now risks paralyzing the biggest Eurozone economy by handing control over increased defense spending to radical fringe parties that are financed and supported by hostile adversaries.
Imagine a German parliament where the AfD and BSW jointly hold 1/3rd of the seats, and a war breaks out. Do we really want an adversary to be able to decide how much debt we can issue for national defense?
But the debt brake reassures investors and hence drives down Germany's interest rate payments!
Now, this is probably the only argument I have heard in favor of the debt brake that may merit some deeper discussion or investigation. There is an argument to be made that if investors perceive the risk of a default or the risk of inflation to be lower, they will demand a smaller coupon on the debt they provide. And I'm willing to entertain that thought. Something either I or someone who reads this should do is:
1. Calculate the risk premium that Germany had to pay over the risk-free rate in the past.
2. Observe to what extent the introduction of the debt brake, or the introduction of the COVID spending bills etc. impacted the spread between the risk-free rate and the yield on German government debt.
There are some complications with this (some people argue that the yield on Bunds *is* the risk-free rate, or at least the closest approximation thereof), and one would still have to quantify what GDP shortfall was caused by excessive austerity, so the outcome of this would be a pretty broad spectrum of estimates. But I will concede that this is worth thinking about and investigating.
At the same time, we are in a very special situation: The world order we all grew up in is largely over. The 1990s belief that we will all just trade, that big countries don't get to invade & pillage small countries, and that Europe can just disarm because the world is peaceful now is dead, and only a fool would cling to it.
I know that people would like to see a more efficient administration, and a leaner budget. These are good goals, and should be pursued - but not by hemming in your own government so that it is unable to react to crises, can be captured by an aggressive minority, and offers less democratic choice.
Apologies for this rant, but given the fact that Europe has squandered the last 20 years, and that I perceive the German approach to debt and austerity to be a huge factor in this, it is hard for me to not show some of my frustration.
Thursday, December 05, 2024
What I want for Christmas for the EU startup ecosystem
Hey all,
I have written about the various drags on the European tech industry in the past, and have recently been involved in discussions on both X and BlueSky about what Europe needs.
In this post, I will not make a wishlist of concrete policy reforms, but rather start "product centric" - i.e. what "user experience" would I want as a founder? Once it is clear what experience you want as a founder, it becomes easier to reverse-engineer what policy changes will be needed.
What would Europe need to make starting a company smoother, easier, and better?
Let's jointly imagine a bit what the world could look like.
Imagine a website where the following tasks can be performed:
- Incorporation of a limited liability company with shares. The website offers a number of standardized company bylaws that cover the basics, and allows the incorporation of a limited liability company on-line (after identity verification etc.).
- Management of simple early-stage funding rounds on-line: Standardized SAFE-like instruments, or even a standardized Series A agreement, and the ability to sign these instruments on-line, and verify receipt of funds.
- Management of the cap table (at least up to and including the Series A).
- Ability to employ anyone in the Eurozone, and run their payroll, social security contributions, and employer-side healthcare payments. Possibly integrated with online payment.
- Ability to grant employee shares and manage the share grants integrated with the above, with the share grants taxed in a reasonable way (e.g. only tax them on liquidity event, accept the shares themselves as tax while they are illiquid, or something similar to the US where you can have a lightweight 409a valuation to assign a value to the shares).
- Integration with a basic accounting workflow that can be managed either personally or by an external accountant, with the ability to file simplified basic taxes provided overall revenue is below a certain threshold.
- Ways of dealing with all the other paperwork involved in running a company on-line.
Ideally, I could sign up to the site, verify my identity, incorporate a basic company with standardized bylaws, raise seed funding, employ people, run their payroll, and file basic taxes and paperwork.
In the above dream, what am I missing?
My suspicion is that building and running such a website would actually not be difficult (if the political will in Europe existed), and would have a measurable impact on company formation and GDP. If we want economic growth like the US, Europe needs to become a place where building and growing a business is easier and has less friction than in the US.
So assuming the gaps that I am missing are filled in, the next step is asking: What policy reforms are necessary to reach this ideal?
Wednesday, July 10, 2024
Someone is wrong on the internet (AGI Doom edition)
The last few years have seen a wave of hysteria about LLMs becoming conscious and then suddenly attempting to kill humanity. This hysteria, often expressed in scientific-sounding pseudo-bayesian language typical of the „lesswrong“ forums, has seeped into the media and from there into politics, where it has influenced legislation.
This hysteria arises from the claim that there is an existential risk to humanity posed by the sudden emergence of an AGI that then proceeds to wipe out humanity through a rapid series of steps that cannot be prevented.
Much of it is entirely wrong, and I will try to collect my views on the topic in this article - focusing on the „fast takeoff scenario“.
I had encountered strange forms of seemingly irrational views about AI progress before, and I made some critical tweets about the messianic tech-pseudo-religion I dubbed "Kurzweilianism" in 2014, 2016 and 2017 - my objection at the time was that believing in an exponential speed-up of all forms of technological progress looked too much like a traditional messianic religion, e.g. "the end days are coming, if we are good and sacrifice the right things, God will bring us to paradise, if not He will destroy us", dressed in techno-garb. I could never quite understand why people chose to believe Kurzweil, who, in my view, has largely had an abysmal track record predicting the future.
Apparently, the Kurzweilian ideas have mutated over time, and seem to have taken root in a group of folks associated with a forum called "LessWrong", a more high-brow version of 4chan where mostly young men try to impress each other by their command of mathematical vocabulary (not of actual math). One of the founders of this forum, Eliezer Yudkowsky, has become one of the most outspoken proponents of the hypothesis that "the end is nigh".
I have heard a lot of secondary reporting about the claims that are advocated, and none of them ever made any sense to me - but I am also a proponent of reading original sources to form an opinion. This blog post is like a blog-post version of a (nonexistent) YouTube reaction video of me reading the original sources and commenting on them.
I will begin with the interview published at https://intelligence.org/2023/03/14/yudkowsky-on-agi-risk-on-the-bankless-podcast/.
The proposed sequence of events that would lead to humanity being killed by an AGI is approximately the following:
- Assume that humanity manages to build an AGI, which is a computational system that for any decision "outperforms" the best decision of humans. The examples used are all zero-sum games with fixed rule sets (chess etc.).
- After managing this, humanity sets this AGI to work on improving itself, e.g. writing a better AGI.
- This is somehow successful and the AGI obtains an "immense technological advantage".
- The AGI also decides that it is in conflict with humanity.
- The AGI then coaxes a bunch of humans into carrying out physical actions that enable it to build something that kills all of humanity - in the case of this interview, a "diamondoid bacteria that replicates using carbon, hydrogen, oxygen, nitrogen, and sunlight".
Incorrectness and incompleteness of human writing
Human writing is full of lies that are difficult to disprove theoretically
Practical world-knowledge is rarely put in writing
No progress without experiments
No superintelligence can reason itself to progress without doing basic science
- To this day, CFD simulations of the air resistance that a train is exposed to when hit by wind at an angle need to be experimentally validated - simulations have the tendency to get important details wrong.
- It is safe to assume that the state-supported hackers of the PRC's intelligence services have stolen every last document that was ever put into a computer at all the major chipmakers. Having all this knowledge, and the ability to direct a lot of manpower at analyzing these documents, has not yielded the knowledge necessary to make cutting-edge chips. What is missing is process knowledge, e.g. the details of how to actually make the chips.
- Producing ballpoint pen tips is hard. There are few nations that can reliably produce cheap, high-quality ballpoint pen tips. China famously celebrated in 2017 that they reached that level of manufacturing excellence.
- Simulating reality accurately and cheaply is not a thing. We cannot simulate even simple parts of reality to a high degree of accuracy (think of a water faucet with turbulent flow splashing into a sink).
- The rules for reality are not known in advance. Humanity has created some good approximations of many rules, but both humanity and a superintelligence still need to create new approximations of the rules by careful experimentation and step-wise refinement.
- The rules for adversarial and competitive games (such as a conflict with humanity) are not stable in time.
- Evaluating any experiment in reality has significant cost, particularly to an AI.
Superintelligence will also be bound by fundamental information-theoretic limits
Next-token prediction cannot handle Kuhnian paradigm shifts
Enough for today. Touch some grass, build some stuff
Thursday, July 04, 2024
Some experiments to help me understand Neural Nets better, post 1 of N
I am a big proponent of "bottom-up" mathematics: Playing with a large number of examples to inform conjectures to be dealt with later. I tend to run through many experiments to build intuition; partly because I have crippling weaknesses when operating purely formally, partly because most of my mathematics is somewhat "geometric intuition" based -- e.g. I rely a lot on my geometric intuition for understanding problems and statements.
As a result, earlier this year, I finally found time to take pen, paper, and a wastebasket and began thinking a bit about what happens when you send data through a neural network consisting of ReLU units. Why only ReLUs? Well, my conjecture is that ReLUs are as good as anything, and they are both reasonably easy to understand and actually used in practical ML applications. They are also among the "simplest examples" to work with, and I am a big fan of trying the simple examples first.
This blog post shares some of my experiments and insights; I called it the "paper plane or origami perspective to deep learning". I subsequently found out that there are a few people that have written about these concepts under the name "the polytope lens", although this seems to be a fringe notion in the wider interpretability community (which I find strange, because - unsurprisingly - I am pretty convinced this is the right way to think about NNs).
Working with low-resolution samples of a 2-dimensional image as the training data has a few advantages:
1. We can intuitively understand what the NN is learning.
2. We can simulate training error and generalisation errors by taking very high-resolution images and training on low-resolution samples.
3. We stay within the realm of low-dimensional geometry for now, which is something most of us have an intuitive understanding of. High dimensions will create all sorts of complications soon enough.
Let's begin by understanding a 2-dimensional ReLU neuron - essentially the function f(x, y) = max( ax + by + c, 0) for various values of a, b, and c.
This will look a bit like a sheet of paper with a crease in it:
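If you want to reproduce the picture yourself, here is a minimal numpy/matplotlib sketch; the values for a, b, and c are arbitrary examples:

```python
import numpy as np
import matplotlib.pyplot as plt

a, b, c = 1.5, -0.8, 0.2                     # arbitrary example parameters
xx, yy = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))
zz = np.maximum(a * xx + b * yy + c, 0.0)    # the 2D ReLU "creased sheet"

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(xx, yy, zz)
plt.show()
```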
As a next step, let's imagine a single-layer ReLU network that takes the (x, y) coordinates of the plane, feeds them into 10 different ReLU neurons, and then combines the results by summing them using individual weights.
The resulting network will have 3 parameters to learn for each neuron: a, b, and c. Each "neuron" will represent a separate copy of the plane that will then be combined (linearly, additively, with a weight) into the output function. The training process will move the "creases" in the paper around until the result approximates the desired output well.
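A minimal PyTorch sketch of such a network (the shape matches the description above; everything else is illustrative):

```python
import torch
import torch.nn as nn

# 2 inputs (x, y) -> 10 ReLU "creases" -> weighted sum to one output value.
model = nn.Sequential(
    nn.Linear(2, 10),   # each neuron computes a*x + b*y + c
    nn.ReLU(),          # fold the plane along the line a*x + b*y + c = 0
    nn.Linear(10, 1),   # combine the 10 creased sheets with learned weights
)

# Evaluate on a grid to see the piecewise-linear surface the network encodes.
xs = torch.linspace(-1, 1, 5)
grid = torch.cartesian_prod(xs, xs)        # shape (25, 2)
print(model(grid).shape)                   # -> torch.Size([25, 1])
```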
Let's draw that process when trying to learn the picture of a circle: The original is here:
Let's do another movie, this time with a higher number of first-layer neurons - 500. And let's see how well we will end up approximating the circle.
- I don't understand enough about Adam as an optimizer to understand where the very visible "pulse" in the optimization process is coming from. What's going on here?
- I am pretty surprised by the fact that so many creases end up being extremely similar -- what would cause them to bundle up into groups the way they do? The circle is completely rotation-invariant, but visually the creases seem to bunch into groups much more than a random distribution would suggest. Why?
- It's somewhat surprising how difficult it appears to be to learn a "sharp" edge; the edge between white and black in the above diagram is surprisingly soft. I had expected that it would be easy for the network to learn a narrow polytope with very large a/b constants to create a sharp edge - somehow this is difficult? Is this regularization preventing the emergence of sharp edges (by keeping weights bounded)?
Anyhow, this was the first installment. I'll write more about this stuff as I play and understand more.
Steps I'll explain in the near future:
- What happens as you deepen your network structure?
- What happens if you train a network on categorical data and cross-entropy instead of a continuous output with MSE?
- What can we learn about generalization, overfitting, and overparametrization from these experiments?