tbh I consider these kinds of tests meaningless. So much time has passed that it's almost certain the model has already been trained on these questions and their answers.
I tested it on Hyperbolic. It is indeed much smarter, and only 40 cents per million tokens, now with a context length of 130k+.
I don't doubt that these new models are an improvement though. It's just that it's getting increasingly difficult to prove it with just a few questions.
Great model, good value, long context, open source. Great stuff as usual from Zuck
Let's see if Alibaba and DeepSeek release something in the coming days.