After a number of Twitter discussions, and repeating myself a lot in these discussions, it is time to write a short note on the economics of advancing LLM capabilities through RL, on the principles of propaganda and coining new words, and on my stubborn refusal to use the term "distillation" except in a specific narrow sense.
How do models advance when human-curated data has run out?
Generating your own training data through RL rollouts is very elegant in a way, because you are kinda pulling yourself up by your own bootstraps. The cost is computational - if your current LLM, given its current training data, has a 1% chance of finding a solution, you need hundreds or thousands of rollouts to get a reasonable variety of useful solutions.
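To make the arithmetic concrete, here is a minimal sketch. It assumes every rollout succeeds independently with the same fixed probability, which is a simplification of real RL training, and the function names are made up for illustration.

```python
import math


def expected_rollouts(p_success: float, solutions_wanted: int) -> float:
    """Expected number of rollouts needed to collect `solutions_wanted` successes."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return solutions_wanted / p_success


def rollouts_for_confidence(p_success: float, confidence: float) -> int:
    """Smallest n such that P(at least one success in n rollouts) >= confidence."""
    if not 0.0 < p_success < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_success and confidence must be strictly between 0 and 1")
    # P(no success in n rollouts) = (1 - p)^n, so solve (1 - p)^n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_success))


if __name__ == "__main__":
    p = 0.01  # the 1% success chance from the text
    print("expected rollouts for 1 solution:  ", expected_rollouts(p, 1))
    print("expected rollouts for 10 solutions:", expected_rollouts(p, 10))
    print("rollouts for a 95% chance of >= 1: ", rollouts_for_confidence(p, 0.95))
```

Under these assumptions a single solution costs about a hundred rollouts on average, being 95% sure of at least one takes roughly three hundred, and a varied set of ten takes about a thousand.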
Once you have a model that can generate a good solution for this problem with high probability, and you make that model available to others, you also provide a much cheaper way of producing the better training data: Third parties can now just ask your model to generate good solutions for them.
So for the second mover that gets to use your model, improving their model from your model's outputs is cheaper: they can skip the more-or-less random search through a high-dimensional solution space and be guided much more directly.
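A rough way to see the size of the gap, as a sketch only: the 1% figure comes from the text above, while the 90% success rate for the already-trained model and the equal per-rollout cost are hypothetical numbers chosen for illustration.

```python
def cost_to_collect(solutions_wanted: int, p_success: float, cost_per_rollout: float) -> float:
    """Expected spend to collect `solutions_wanted` good solutions by sampling."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return solutions_wanted * cost_per_rollout / p_success


# First mover: searches with a model that only succeeds 1% of the time.
first_mover = cost_to_collect(1000, p_success=0.01, cost_per_rollout=1.0)

# Second mover: queries the improved model (hypothetical 90% success rate,
# and, also hypothetically, the same cost per rollout).
second_mover = cost_to_collect(1000, p_success=0.9, cost_per_rollout=1.0)

print(f"first mover:  {first_mover:,.0f} rollout-units")
print(f"second mover: {second_mover:,.0f} rollout-units")
print(f"ratio:        {first_mover / second_mover:.0f}x")
```

The ratio is just the ratio of the two success probabilities, scaled by whatever the second mover pays per query relative to the first mover's compute cost per rollout.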
Terms of service, copyright law, crimes vs. contract disputes
The frontier labs have all argued that training on public data does not require them to obtain licenses from the copyright holders (a self-serving and somewhat dubious claim). The Llama release further muddied the waters by attaching a license to the redistribution of model weights - by law, the output of an algorithm (such as model weights) is not a copyrightable object, and Meta just pretended the weights were. Other model labs followed suit, in the hope of establishing a practical precedent that can then be used to shape legislation in the future.
But a priori, model weights are not copyrightable.
There is an argument, though, that the prompts and the resulting model output are copyrightable by the person submitting the prompts. Certainly not by the model provider: running an algorithm on somebody else's copyrightable work without human input does not make you the owner of the work. There is no human creative input, which is the minimum threshold for establishing copyright in our current legal system.
Terms of service are very different from copyright law - they are essentially private-law contracts about the exchange of services between entities. So if a model provider says "you may not use this service to generate training data for your competing LLM", they can say so, and they have the right to terminate your account if they catch you doing so.
That said - let's say I were to run a benchmarking service that tests the progress of LLMs against my favorite programming problems, and all I do is (a) run rollouts against these services, (b) score the results, (c) archive the results, (d) sell access to the results to third parties so they can evaluate the progress of models and the quality of their reasoning, and (e) publish the positive results for free after a few months.
This is not a violation of the terms of service - I am just measuring the capabilities of the models and having them solve problems for me. Publishing the data isn't a violation of the terms of service either.
Yet publishing the positive results on the wider internet makes them part of the training corpus, so the improvement in capability that the model provider achieved will flow into other models. There is no way around this in our current legal system.
Reframing an inconvenient issue with your business model in moral terms
It will be easy to convince yourself that the flaw in your business model that gives your competitors a way to catch up with less investment is a moral outrage - it is so unjust! - and then complain about the fact that others have the right to do what they are doing.
"Distillation" is great, because it evokes bootlegging and 1920s Prohibition-era intrigue. And "attack" is great because only bad people attack. So you leverage the fact that "distillation" was already the name of a technique for teaching a smaller model from a larger one when you have access to the larger model's internals, you tack on the word "attack" to make it sound more nefarious, and you start screaming from the rooftops that evil distillation attackers are killing your morally superior business (which started with actual copyright violations, justified only ex post by your success).
If the Chinese models are distilled, so is the Cursor fine-tune of Kimi, and so is any model trained on the output of other models - and most human output is now model-assisted.