New Kimi report on Muon scalability. Btw, I have no idea what Muon is
Models are trained in batches. For example, during pre-training, many chunks of text are fed in one step and the model has to predict the next token for all of them. The predicted tokens are then compared to the actual ground-truth tokens, and the differences are turned into what are called gradients, which essentially define in which direction (increase or decrease) each parameter of the model (each weight) should change to reduce the error, i.e. to better predict the actual tokens next time.
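To make that concrete, here is a toy version of one such step in PyTorch (a minimal sketch only; the model, sizes, and data below are made up for illustration and have nothing to do with the Kimi report):

```python
import torch
import torch.nn as nn

# Toy next-token prediction step; all names and sizes are illustrative.
vocab, dim, batch, seq = 100, 32, 4, 8
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))

tokens = torch.randint(0, vocab, (batch, seq + 1))  # fake "sentences"
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict the next token

logits = model(inputs)                              # predictions for every position
loss = nn.functional.cross_entropy(                 # how wrong the predictions are
    logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()  # fills p.grad for every weight: the "directions" described above
```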
Once you have the gradients (i.e. the directions) you need to change the actual weights accordingly. But how? This is where the optimizer comes in, and Muon is one of them.
The most straightforward way is to simply change each weight by the scaled-down gradient (you don't apply it fully, otherwise training becomes unstable). But there are more clever ways. The current standard, AdamW, applies not only the current batch's gradients but also a scaled-down average of previous batches' gradients, so as to keep some "momentum" and smooth out statistical differences across batches.
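To make the difference concrete, here is a sketch of both update rules side by side (illustrative numbers throughout, with random tensors standing in for real weights and gradients; real AdamW additionally rescales each parameter's step by a running estimate of gradient magnitude and applies decoupled weight decay, which is omitted here):

```python
import torch

# Fake weights and a velocity buffer for the momentum-style update.
lr, beta = 1e-3, 0.9
weights = [torch.randn(16, 16) for _ in range(3)]
velocity = [torch.zeros_like(w) for w in weights]

for step in range(10):
    grads = [torch.randn_like(w) for w in weights]  # stand-in for real gradients
    for w, v, g in zip(weights, velocity, grads):
        # Straightforward way: step by the scaled-down gradient alone.
        #   w -= lr * g
        # Momentum way: blend each batch's gradient into a decaying
        # running average, then step along that smoother direction.
        v.mul_(beta).add_(g, alpha=1 - beta)
        w -= lr * v
```

AdamW builds on this same momentum idea; Muon is yet another way of deciding the step to take from the accumulated gradients.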
The authors claim Muon is better than AdamW, roughly 2 times better: they can reach the same results using half the training tokens. For instance, a model trained on 10T tokens with Muon should yield the same performance as one trained on 20T tokens with AdamW. This is a big claim and has to be confirmed by independent tests. Training a big model on 10T tokens can take many months on a full datacenter with thousands of GPUs, so the savings could be very big if confirmed.
What assets? All of the source code is open and shared with everyone. There are no patents, trademarks, copyrights, or trade secrets that DeepSeek can use to earn a revenue stream to fund R&D and keep itself at the forefront of AI development. So the Chinese government has to keep giving money and subsidizing DeepSeek to keep it competitive, while it shares all of its knowledge freely with everyone? Foreign AI companies can monetize DeepSeek, comb through the source code and improve it, all while not sharing or contributing anything back to DeepSeek. I am all for open source and sharing DeepSeek for the betterment of humanity, but I hope DeepSeek can come up with a way to sustain itself without future money/subsidies from the government.
I was thinking DeepSeek's business model is like Google's Android, but now I'm starting to think it resembles Linux more. They aim to be the Linux of AI models.
Is it a business model that makes sense? Your arguments could have been applied verbatim to Linux 30 years ago... and history has already proved them wrong.