Given the size of the decrease in the benchmark score after the correction, I don't think you can assume they didn't check a single output. The model is clearly still very capable, and the cheating didn't affect most of the benchmark.
I don't get the anti-LLM sentiment, because plenty of trends continue to show steady progress with LLMs over time. Sure, you can poke at some dumb things LLMs do as evidence of some fundamental issue, but the frontier capabilities continue to amaze people. I suspect the anti-LLM sentiment comes from people who haven't given themselves a serious chance to see what these models are capable of. I used to be skeptical, but I've changed my mind quite a bit over the past year, and many others have changed their stance on LLMs as well.
Or, people who've actually trained and used models in domains where "stuff on the internet" has no relevance to what they're actually doing realize the profound limitations of what these LLMs can do. They are amazing, don't get me wrong, but not so amazing in many specific contexts.
It'll steadily continue the same way Moore's law has continued for a while. I don't think people question the general trend of Moore's law, aside from it nearing the limits of physics. It's a lot harder to support the universal claim that LLMs don't work, whereas claiming something is possible for LLMs only needs a single piece of evidence.
LeCun has already been proven wrong countless times over the years regarding his predictions of what LLMs can or cannot do. While LLMs continue to improve, he has yet to produce anything of practical value from his own research. The salt is palpable, and he's memed for a reason.
I think the most neutral solution right now is having multiple competing models acting as different perspectives. We already see this effect in social media, where the algorithms amplify certain biases and perspectives depending on the platform.
I don’t think the two kinds of vibe coding are entirely separate. There’s a spectrum of how much context you care to understand yourself, and it’s feasible to ask a lot of questions to gain more understanding or let loose and give more discretion to the LLM.
I’ve had a hard time parsing what exactly the paper is trying to explain. So far I’ve understood that the comparison seems to be between models within the same family with the same weight tensor dimensions, so they aren’t showing a common subspace where there isn’t a 1:1 match between weight tensors, say between a ViT and GPT-2. The plots showing the distribution of principal component values presumably do this for every weight tensor, but that seems like an expected result: the principal component values show a decaying, roughly log-like curve, where only the first few principal components are meaningful.
What I don’t get is what is meant by a universal shared subspace, because there is some invariance in the specific values of the weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication), and all that would do is swap the corresponding two entries in the resulting product. Whatever consumes that output could undo the effect of the swap, so the whole model behaves identically, yet you’ve changed the direction of the principal components. Because of that, fully independently trained models can’t share the exact subspace directions for analogous weight tensors.
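To make the swap argument concrete, here’s a minimal sketch (my own toy example, not code from the paper) with a tiny two-layer MLP: permuting two hidden units and undoing the permutation downstream leaves the network’s outputs unchanged, while the principal directions of the first weight matrix in hidden-unit space get their coordinates permuted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4          # arbitrary toy sizes

W1 = rng.normal(size=(d_hidden, d_in))    # first layer:  h = relu(W1 @ x)
W2 = rng.normal(size=(d_out, d_hidden))   # second layer: y = W2 @ h

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Swap two hidden units: permute rows of W1 and undo it on the columns of W2.
P = np.eye(d_hidden)
P[[0, 1]] = P[[1, 0]]
W1p, W2p = P @ W1, W2 @ P.T

x = rng.normal(size=d_in)
print(np.allclose(forward(W1, W2, x), forward(W1p, W2p, x)))   # True: identical behaviour

# Yet the principal directions of W1 in hidden-unit space (its left singular
# vectors) have had their coordinates permuted, so they point differently.
U, _, _ = np.linalg.svd(W1)
Up, _, _ = np.linalg.svd(W1p)
print(np.allclose(np.abs(U), np.abs(Up)))                      # typically False
```

So any "shared subspace" across independently trained models could only hold up to symmetries like this.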
The hardest part about making a new architecture is that even if it is better than transformers in every way, it’s very difficult both to prove a significant improvement at scale and to gain traction. Until Google puts a lot of resources into training a scaled-up version of this architecture, there will be plenty of low-hanging fruit in improving existing architectures, so I believe it’ll always take a back seat.
Do you think there might be an approval process to navigate when an experiment's cost might run to seven or eight digits and months of reserved resources?
While they do have lots of money and many people, they don't have infinite money, and they only have so much hot infrastructure to spread around. You'd expect them to have to gradually build up the case that a large-scale experiment is likely enough to yield a big enough advantage over whatever is already claiming those resources.
I would imagine they do not want their researchers unnecessarily wasting time fighting for resources - within reason. And at Google, "within reason" can be pretty big.
But it's companies like Google that built tools like JAX and TPUs on the pitch that you can throw together models with cheap, easy scaling. The math in their paper is probably harder to put together than an alpha-level prototype, which they need anyway.
So I think they could default to doing it for small demonstrators.
At the same time, there is now a ton of data for training models to act as useful assistants, and there are benchmarks to compare different assistant models. The wide availability and ease of obtaining new RLHF training data will make it more feasible to build models on new architectures, I think.
There generally aren't new techniques when optimizing something ubiquitous. Instead, there are a lot of ways to apply existing techniques to create new and better results. Most ideas are built on top of the same foundational principles.
Yes. And there are still lots of places where you can get significant speed-ups simply by applying those old techniques in a new domain or in a novel way. The difference between a naive implementation of an algorithm and an optimised one is often many orders of magnitude. Look at automerge, which went from taking 30 seconds on a simple example to tens of milliseconds.
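As a toy sketch of that kind of gap (my own example, nothing to do with automerge): deduplicating a list while preserving order, once with a linear scan and once with a hash set. Same result, roughly three orders of magnitude apart at this size.

```python
import time

def dedup_naive(items):
    out = []
    for x in items:
        if x not in out:          # linear scan of the output so far -> O(n^2) overall
            out.append(x)
    return out

def dedup_fast(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:         # constant-time hash lookup -> O(n) overall
            seen.add(x)
            out.append(x)
    return out

items = list(range(20_000))       # arbitrary size, all unique (worst case for the naive version)

for fn in (dedup_naive, dedup_fast):
    t0 = time.perf_counter()
    fn(items)
    print(fn.__name__, f"{time.perf_counter() - t0:.3f}s")
```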
I think about this regularly when I compile C++ or Rust using LLVM. It’s an excellent compiler backend. It produces really good code. But it is incredibly slow, and for no good technical reason. Plenty of other comparable compilers run circles around it.
Imagine an LLVM rewrite by the people who made V8, or Chrome, or the Unreal Engine. Or the guy who made LuaJIT, or the Go compiler team. I’d be shocked if we didn’t see an order-of-magnitude speed-up overnight. They’d need some leeway to redesign LLVM IR, of course. And it would take years to port all of LLVM’s existing optimisations. But my computer can retire billions of operations per second. And render Cyberpunk at 60fps. It shouldn’t take seconds of CPU time to compile a small program.
It's generally true, isn't it? Otherwise we'd have groundbreaking discoveries every day about some new, fastest way to do X.
The way I see it, mathematicians have been trying (and somewhat succeeding, every ~5 years) to prove faster ways to do matrix multiplication since the 1970s. But this is only in theory.
If you want to implement the theory, you suddenly have many variables to take care of, such as memory speed, CPU instructions, bit precision, etc. So in practice, an actual implementation of some theory likely has more room to improve. It is also likely that LLMs can help figure out how to write a more optimal implementation.
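Here's a rough sketch of that theory-vs-practice gap (a toy example with arbitrary sizes): the same O(n^3) matrix multiplication written as naive loops versus routed through NumPy's BLAS backend, which actually exploits caches, SIMD, and threads. The constant-factor difference is exactly the kind of headroom that swamps asymptotic improvements at practical sizes.

```python
import time
import numpy as np

n = 128                                   # arbitrary size
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def matmul_naive(A, B):
    """Textbook triple loop: same O(n^3) arithmetic, no attention to memory or SIMD."""
    rows, inner, cols = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            for k in range(inner):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter(); C1 = matmul_naive(A, B); t1 = time.perf_counter()
C2 = A @ B                                # BLAS-backed
t2 = time.perf_counter()

print(f"naive loops: {t1 - t0:.3f}s   BLAS: {t2 - t1:.5f}s")
print("same result:", np.allclose(C1, C2))
```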
The chart confused me because I expected to see performance numbers for CUDA-L2 compared to the others, but instead it shows the speedup percentage of CUDA-L2 over each of them. In a sense, the bar chart inverts the relative performance of torch.matmul and cuBLAS: the slower the baseline, the taller its bar, and 0% would only mean equal performance.
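A quick worked example with made-up numbers (not taken from the chart) shows why the bars read that way:

```python
# Hypothetical timings purely for illustration (not numbers from the paper).
times_ms = {"CUDA-L2": 0.8, "cuBLAS": 1.0, "torch.matmul": 1.2}

for baseline in ("cuBLAS", "torch.matmul"):
    speedup_pct = (times_ms[baseline] / times_ms["CUDA-L2"] - 1) * 100
    print(f"speedup over {baseline}: {speedup_pct:.0f}%")

# speedup over cuBLAS: 25%        <- faster baseline, shorter bar
# speedup over torch.matmul: 50%  <- slower baseline, taller bar
# A 0% bar would mean the baseline exactly matches CUDA-L2.
```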
They have a moat in being well known in the AI industry, so they have credibility, and it wouldn't be hard for anything they make to gain traction. An unknown player who replicated it, even if it were just as good as what SSI does, would struggle a lot more to gain attention.
Agreed. But it can be a significant growth boost. Senior partners at high-profile VCs will meet with them. Early key hires they are trying to recruit will be favorably influenced by their reputation. The media will probably cover whatever they launch, accelerating early user adoption. Of course, the product still has to generate meaningful value, but all these 'buffs' make several early startup challenges significantly easier to overcome. (Source: someone who did multiple tech startups without those buffs and ultimately reached success. Spending 50% of founder time for six months to raise first funding, working through junior partners and early skepticism, is a significant burden vs 20% of founder time for three weeks.)