If you have an EPYC/Xeon box with around 700 GB of RAM (671 GB for the FP8 model plus room for context), or a similar amount of VRAM across GPUs, then yeah, you can run R1 locally. The same rig can also run LLaMA 3.1 405B locally, but on CPU it will be painfully slow compared to R1, because 405B is a dense model while R1 is MoE with only 37B parameters activated per token. However, for personal usage it makes no sense to spend that much money when the DeepSeek API is dirt cheap. Plus, you get stuff like web search through their Web UI.

$850M USD capex though. While R1 can run on a desktop apparently.
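A rough sketch of the dense-vs-MoE point from the first post above, in case the numbers help. It assumes CPU decoding is purely memory-bandwidth bound (each generated token streams the active weights from RAM once), that both models are held at FP8 (1 byte per parameter), and that the box has 400 GB/s of memory bandwidth. That bandwidth figure and the helper names are illustrative assumptions, not measurements.

```python
# Back-of-envelope only: assumes decode speed is limited by how fast the
# active weights can be read from RAM, and ignores KV cache, compute, etc.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 params * bytes per param = GB)."""
    return params_billion * bytes_per_param


def max_tokens_per_s(active_params_billion: float,
                     bytes_per_param: float,
                     mem_bw_gb_s: float) -> float:
    """Upper bound on tokens/s if weight reads are the only cost."""
    return mem_bw_gb_s / weight_gb(active_params_billion, bytes_per_param)


FP8_BYTES = 1.0            # 1 byte per parameter
ASSUMED_BW_GB_S = 400.0    # hypothetical server memory bandwidth

r1_total_gb = weight_gb(671, FP8_BYTES)                     # all experts must fit in RAM
r1_tps = max_tokens_per_s(37, FP8_BYTES, ASSUMED_BW_GB_S)   # MoE: 37B active per token
llama_tps = max_tokens_per_s(405, FP8_BYTES, ASSUMED_BW_GB_S)  # dense: all 405B active

print(f"R1 weights at FP8: {r1_total_gb:.0f} GB")
print(f"R1 upper bound: {r1_tps:.1f} tok/s")
print(f"LLaMA 3.1 405B upper bound: {llama_tps:.1f} tok/s")
print(f"R1 advantage: {r1_tps / llama_tps:.1f}x")
```

Whatever bandwidth you plug in, the ratio stays at 405/37, so roughly 11x in favour of R1 under these assumptions. Real-world speeds will be lower than either bound.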
The $850M capex is for the datacentre. DeepSeek also was not trained on a desktop, lol; they had 2048x H800 iirc. It is still impressive, because models of that class typically need at least as many cards to train as the big LLaMA did, i.e. 16k H100s, so DeepSeek pulling this off with only 2k H800s, which are gimped versions of the H100, is huge.
Distilled versions are not really comparable to the full R1 though, not even in the same league. Most are not even the best in their parameter size category, mainly because they did not go through the same training process as R1, with RL and all of that; they were only fine-tuned on R1 outputs.

I've seen people run distilled versions locally on Raspberry Pis and mobile phones. It's what makes the "CCP is lying about hardware specs" cope all the more hilarious.
I wonder what the Qwen team will deliver this time; their Qwen 2.5 models are still the best, or close to it, in their respective parameter sizes. For example, Qwen2.5-Coder-32B is the best coding model at this size, no contest.