Artificial Intelligence thread

Coalescence

Senior Member
Registered Member
Currently even the latest and greatest LLMs (looking at you, ClosedAI o1) sometimes crash and burn even with simple standalone React components. They are completely and utterly useless at writing code for my legacy backend codebase. Unless I can fit the entire codebase into the context window, no LLM can contribute without me spoonfeeding it all the relevant context... which is like 90% of the work.
I agree with this very much, especially after testing the o1-preview model and finally hitting the token limit. Before laying out the problems I still have with the newest model: the codebase I'm working with is written in Vue.js 2 and uses the Element UI and iView component libraries. The problems I have with the model are:
1. It keeps using syntax and methods that only exist in Vue.js 3, even when I've already specified, and reminded it, that the code is in Vue.js 2. I had to point it to the correct Vue 2 methods and functions before the code would work (see the sketch after this list).
2. After modifying the code myself and then asking it for further changes, I would give it the modified code to generate from. Sometimes it reverts my changes in the new code; other times it just iterates on the old code.
3. As I mentioned before, it has a tendency to change parts of the code that don't relate to the request. This is still a problem in o1-preview, and it's very annoying to figure out what went wrong and to copy-paste the working portions of the old code back in.
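To make point 1 concrete, here is a hypothetical counter component (made up for illustration, not from my codebase): first the way the model keeps writing it (Vue 3 Composition API), then the way a Vue 2 codebase actually needs it (Options API).

```typescript
// What the model keeps emitting: Vue 3 Composition API (ref/onMounted only exist in Vue 3 and Vue 2.7+).
import { ref, onMounted } from 'vue';

export const CounterVue3 = {
  setup() {
    const count = ref(0);                               // reactive state, Vue 3 style
    onMounted(() => console.log('counter mounted'));    // Composition API lifecycle hook
    return { count };
  },
};

// What a plain Vue 2 + Element UI codebase actually expects: the Options API.
export const CounterVue2 = {
  data() {
    return { count: 0 };                                // reactive state, Vue 2 style
  },
  mounted() {
    console.log('counter mounted');                     // Options API lifecycle hook
  },
};
```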

The newest model is definitely smarter than before, but since it has the same problems as the previous model listed above, now with worse speed, I would rather stick with the old model and iterate on its solutions manually. Also, have you guys noticed you can't provide file attachments to o1-preview and o1-mini?
 

9dashline

Captain
Registered Member
Running large language models like Meta's Llama 3.1 (405 billion parameters) already requires serious hardware, like an 8-way NVIDIA H100 cluster. But it’s not just about throwing more GPUs at the problem—it’s about how these models are utilized during inference. Techniques like Chain of Thought (CoT) don’t necessarily need more GPUs; they need more time.
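Rough memory arithmetic (weights only, my own back-of-the-envelope numbers) behind that hardware requirement; real deployments also need room for KV cache and activations on top of this:

```typescript
// Back-of-the-envelope memory math for a 405B-parameter model (weights only).
const params = 405e9;
const h100MemGB = 80;                  // HBM per H100
const nodeMemGB = 8 * h100MemGB;       // 640 GB on an 8-way node

const fp16GB = (params * 2) / 1e9;     // ~810 GB at 16-bit weights: does not fit on one 8-GPU node
const fp8GB = (params * 1) / 1e9;      // ~405 GB at 8-bit weights: fits, with some headroom left

console.log({ nodeMemGB, fp16GB, fp8GB });
```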

Think of it like AlphaGo’s Monte Carlo simulations: once AlphaGo’s neural network made its move predictions, its Elo rating improved by running multiple playouts to refine those predictions further. CoT is similar—during inference, the model "questions itself" and iterates on its reasoning, not just outputting a single-pass response but evaluating and refining multiple chains internally. So while the hardware footprint stays roughly the same, inference time increases due to these deeper, more sophisticated reasoning steps.
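A toy sketch of the general idea (self-consistency style sampling), not a description of o1's actual internals: sample several independent reasoning chains for the same prompt and vote over the final answers, so capability scales with the number of inference passes rather than with parameter count. The `generateChain` parameter is a stand-in for whatever model call you would use.

```typescript
// Toy sketch of inference-time scaling via multiple reasoning chains (self-consistency style).
type Chain = { reasoning: string; answer: string };

async function answerWithSelfConsistency(
  // Stand-in for one sampled chain-of-thought completion from the model being called.
  generateChain: (prompt: string) => Promise<Chain>,
  prompt: string,
  samples = 16,
): Promise<string> {
  // Each sample is a full inference pass: more samples means more time on the same hardware.
  const chains = await Promise.all(
    Array.from({ length: samples }, () => generateChain(prompt)),
  );

  // Majority vote over final answers; mistakes in individual chains tend to wash out.
  const votes = new Map<string, number>();
  for (const { answer } of chains) {
    votes.set(answer, (votes.get(answer) ?? 0) + 1);
  }
  return [...votes.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```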

But this shift towards CoT is also a subtle admission that brute-force scaling of models—just piling on more and more parameters—is starting to show its limitations. We’re moving away from simply making models bigger to making them “think” more effectively. It’s not unlike the early days of CPUs: at first, it was all about more gigahertz until that ran into thermal constraints. Then it became about multi-core designs and optimizing software to take advantage of parallelism. Eventually, GPUs became the new frontier. Now, with LLMs, instead of just scaling up parameter counts endlessly, techniques like CoT represent a new way to push the boundaries without hitting the same brick walls of diminishing returns.

Looking ahead, as rumors suggest models will soon hit the 3-5 trillion parameter mark, we’re likely going to see even more emphasis on complex inference processes rather than just parameter inflation. It’s not just about raw power anymore but smarter utilization, requiring better memory management and model partitioning. CoT is just one example of optimizing inference to continue scaling, not in size, but in capability and sophistication.
 

tphuang

Lieutenant General
Staff member
Super Moderator
VIP Professional
Registered Member
9dashline said:
Techniques like Chain of Thought (CoT) don’t necessarily need more GPUs; they need more time. [...] So while the hardware footprint stays roughly the same, inference time increases due to these deeper, more sophisticated reasoning steps.
If you need more time per request then that is time those GPUs can't be used to run inference on anything else.

As such, you need more GPUs to handle more requests.
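Rough numbers (made up purely for illustration) to show the scaling: fleet size grows in proportion to how long each request occupies a GPU.

```typescript
// Illustrative-only sizing: if each request holds a GPU for longer, you need proportionally
// more GPUs to serve the same traffic (Little's law, with a utilization fudge factor).
function gpusNeeded(requestsPerSecond: number, gpuSecondsPerRequest: number, utilization = 0.7): number {
  return Math.ceil((requestsPerSecond * gpuSecondsPerRequest) / utilization);
}

console.log(gpusNeeded(100, 2));   // ~2 GPU-seconds per single-pass answer -> 286 GPUs
console.log(gpusNeeded(100, 20));  // ~20 GPU-seconds per long-CoT answer   -> 2858 GPUs
```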
 

diadact

New Member
Registered Member
tphuang said:
If you need more time per request then that is time those GPUs can't be used to run inference on anything else.

As such, you need more GPUs to handle more requests.
Right now o1 just outputs text; once models start outputting videos and images, the requirement for compute will go through the roof.
 