Given the size of the decrease in the benchmark score after the correction, I don't think you can assume they didn't check a single output. The model is clearly still very capable, and the cheating didn't affect most of the benchmark.
I don't get the anti-LLM sentiment, because plenty of trends continue to show steady progress with LLMs over time. Sure, you can poke at some dumb things LLMs do as evidence of some fundamental issue, but the frontier capabilities continue to amaze people. I suspect the anti-LLM sentiment comes from people who haven't given themselves a serious chance to see what these models are capable of. I used to be skeptical, but I've changed my mind quite a bit over the past year, and many others have changed their stance on LLMs as well.
Or, people who've actually trained and used models in domains where "stuff on the internet" has no relevance to what they're actually doing realize the profound limitations of what these LLMs can do. They are amazing, don't get me wrong, but not so amazing in many specific contexts.
It'll steadily continue the same way Moore's law has continued for a while. I don't think people question the general trend of Moore's law, aside from it nearing the limits of physics. It's a lot harder to support the universal claim that LLMs don't work, whereas claiming something is possible for LLMs only needs a single piece of evidence.
LeCun has already been proven wrong countless times over the years regarding his predictions of what LLMs can or cannot do. While LLMs continue to improve, he has yet to produce anything of practical value from his own research. The salt is palpable, and he's memed for a reason.
I think the most neutral solution right now is having multiple competing models acting as different perspectives. We already see this effect in social media, where the algorithms amplify certain biases and perspectives depending on the platform.
I don’t think the two kinds of vibe coding are entirely separate. There’s a spectrum of how much context you care to understand yourself, and it’s feasible to ask a lot of questions to gain more understanding or let loose and give more discretion to the LLM.
I’ve had a hard time parsing what exactly the paper is trying to explain. So far I’ve understood that the comparison seems to be between models within the same family with the same weight tensor dimensions, so they aren’t showing a common subspace where there isn’t a 1:1 match between weight tensors, say between a ViT and GPT-2. The plots showing the distribution of principal component values presumably do this for every weight tensor, but that seems like an expected result: the principal component values show a decaying, roughly log-like curve, where only the first few principal components are meaningful.
What I don’t get is what is meant by a universal shared subspace, because there is some invariance in the specific values of the weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication), and all that would do is swap the corresponding two entries in the resulting product. Whatever consumes that output could undo the effect of the swap, so the whole model behaves identically, yet you’ve changed the direction of the principal components. Because of that, fully independently trained models can’t share the exact subspace directions for analogous weight tensors.
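To make the swap argument concrete, here’s a minimal sketch (my own toy example, not code from the paper) with a tiny two-layer MLP: permuting two hidden units and undoing the permutation downstream leaves the network’s outputs unchanged, while the principal directions of the first weight matrix in hidden-unit space get their coordinates permuted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4          # arbitrary toy sizes

W1 = rng.normal(size=(d_hidden, d_in))    # first layer:  h = relu(W1 @ x)
W2 = rng.normal(size=(d_out, d_hidden))   # second layer: y = W2 @ h

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Swap two hidden units: permute rows of W1 and undo it on the columns of W2.
P = np.eye(d_hidden)
P[[0, 1]] = P[[1, 0]]
W1p, W2p = P @ W1, W2 @ P.T

x = rng.normal(size=d_in)
print(np.allclose(forward(W1, W2, x), forward(W1p, W2p, x)))   # True: identical behaviour

# Yet the principal directions of W1 in hidden-unit space (its left singular
# vectors) have had their coordinates permuted, so they point differently.
U, _, _ = np.linalg.svd(W1)
Up, _, _ = np.linalg.svd(W1p)
print(np.allclose(np.abs(U), np.abs(Up)))                      # typically False
```

So any "shared subspace" across independently trained models could only hold up to symmetries like this.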
The hardest part about making a new architecture is that even if it is better than transformers in every way, it’s very difficult both to prove a significant improvement at scale and to gain traction. Until Google puts a lot of resources into training a scaled-up version of this architecture, there will be plenty of low-hanging fruit in improving existing architectures, so I believe it’ll always take a back seat.
Do you think there might be an approval process to navigate when an experiment's cost might run to seven or eight digits and months of reserved resources?
While they do have lots of money and many people, they don't have infinite money, and they only have so much hot infrastructure to spread around. You'd expect them to have to gradually build up the case that a large-scale experiment is likely enough to yield a big enough advantage over whatever is already claiming those resources.
I would imagine they do not want their researchers unnecessarily wasting time fighting for resources - within reason. And at Google, "within reason" can be pretty big.
But it's companies like Google that built tools like JAX and TPUs on the pitch that you can throw together models with cheap, easy scaling. The math in their paper is probably harder to put together than an alpha-level prototype, which they need anyway.
So I think they could default to doing it for small demonstrators.
At the same time, there is now a ton of data for training models to act as useful assistants, and there are benchmarks to compare different assistant models. The wide availability and ease of obtaining new RLHF training data will make it more feasible to build models on new architectures, I think.
There generally aren't new techniques when optimizing something ubiquitous. Instead, there are a lot of ways to apply existing techniques to create new and better results. Most ideas are built on top of the same foundational principles.
Yes. And there are still lots of places where you can get significant speed-ups simply by applying those old techniques in a new domain or in a novel way. The difference between a naive implementation of an algorithm and an optimised one is often many orders of magnitude. Look at automerge, which went from taking 30 seconds on a simple example to tens of milliseconds.
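As a toy sketch of that kind of gap (my own example, nothing to do with automerge): deduplicating a list while preserving order, once with a linear scan and once with a hash set. Same result, roughly three orders of magnitude apart at this size.

```python
import time

def dedup_naive(items):
    out = []
    for x in items:
        if x not in out:          # linear scan of the output so far -> O(n^2) overall
            out.append(x)
    return out

def dedup_fast(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:         # constant-time hash lookup -> O(n) overall
            seen.add(x)
            out.append(x)
    return out

items = list(range(20_000))       # arbitrary size, all unique (worst case for the naive version)

for fn in (dedup_naive, dedup_fast):
    t0 = time.perf_counter()
    fn(items)
    print(fn.__name__, f"{time.perf_counter() - t0:.3f}s")
```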
I think about this regularly when I compile C++ or Rust using LLVM. It’s an excellent compiler backend. It produces really good code. But it is incredibly slow, and for no good technical reason. Plenty of other comparable compilers run circles around it.
Imagine an LLVM rewrite by the people who made V8, or Chrome, or the Unreal Engine. Or the guy who made LuaJIT, or the Go compiler team. I’d be shocked if we didn’t see an order-of-magnitude speed-up overnight. They’d need some leeway to redesign LLVM IR, of course. And it would take years to port all of LLVM’s existing optimisations. But my computer can retire billions of operations per second. And render Cyberpunk at 60fps. It shouldn’t take seconds of CPU time to compile a small program.
It's generally true, isn't it? Otherwise we'd have groundbreaking discoveries every day about some new, fastest way to do X.
The way I see it, mathematicians have been trying (and somewhat succeeding, every ~5 years) to prove faster ways to do matrix multiplication since the 1970s. But this is only in theory.
If you want to implement the theory, you suddenly have many variables to take care of, such as memory speed, CPU instructions, bit precision, etc. So in practice, an actual implementation of some theory likely has more room to improve. It is also likely that LLMs can help figure out how to write a more optimal implementation.
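Here's a rough sketch of that theory-vs-practice gap (a toy example with arbitrary sizes): the same O(n^3) matrix multiplication written as naive loops versus routed through NumPy's BLAS backend, which actually exploits caches, SIMD, and threads. The constant-factor difference is exactly the kind of headroom that swamps asymptotic improvements at practical sizes.

```python
import time
import numpy as np

n = 128                                   # arbitrary size
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def matmul_naive(A, B):
    """Textbook triple loop: same O(n^3) arithmetic, no attention to memory or SIMD."""
    rows, inner, cols = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            for k in range(inner):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter(); C1 = matmul_naive(A, B); t1 = time.perf_counter()
C2 = A @ B                                # BLAS-backed
t2 = time.perf_counter()

print(f"naive loops: {t1 - t0:.3f}s   BLAS: {t2 - t1:.5f}s")
print("same result:", np.allclose(C1, C2))
```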
The chart confused me because I expected to see performance numbers for CUDA-L2 compared to the others, but instead it shows the speedup percentage of CUDA-L2 over each of them. In a sense, the bar chart inverts the relative performance of torch.matmul and cuBLAS: the slower the baseline, the taller its bar, and 0% would only mean equal performance.
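A quick worked example with made-up numbers (not taken from the chart) shows why the bars read that way:

```python
# Hypothetical timings purely for illustration (not numbers from the paper).
times_ms = {"CUDA-L2": 0.8, "cuBLAS": 1.0, "torch.matmul": 1.2}

for baseline in ("cuBLAS", "torch.matmul"):
    speedup_pct = (times_ms[baseline] / times_ms["CUDA-L2"] - 1) * 100
    print(f"speedup over {baseline}: {speedup_pct:.0f}%")

# speedup over cuBLAS: 25%        <- faster baseline, shorter bar
# speedup over torch.matmul: 50%  <- slower baseline, taller bar
# A 0% bar would mean the baseline exactly matches CUDA-L2.
```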
They have a moat in being well known in the AI industry, so they have credibility, and it wouldn't be hard for anything they make to gain traction. An unknown player who replicated it, even if it were just as good as what SSI does, would struggle a lot more to gain attention.
Agreed. But it can be a significant growth boost. Senior partners at high-profile VCs will meet with them. Early key hires they are trying to recruit will be favorably influenced by their reputation. The media will probably cover whatever they launch, accelerating early user adoption. Of course, the product still has to generate meaningful value, but all these 'buffs' make several early startup challenges significantly easier to overcome. (Source: someone who did multiple tech startups without those buffs and ultimately reached success. Spending 50% of founder time for six months to raise first funding, working through junior partners and early skepticism, is a significant burden vs 20% of founder time for three weeks.)