Hacker News

Because what you cite is about:

> in-context learning

LLMs have no concept of the semantic meaning of what they do; they are just doing next-token prediction.

"in-context learning" is the problem, not the solution to general programming tasks.

Memoryless, ergodic, sub-Turing-complete problems are a very tiny class.

Thinking about how the Entscheidungsproblem relates to halting, or to the frame problem and the specification problem, may be a path.

But that paper isn't solving the problem at hand.



My main concern with the simplification of memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that the decoder transformer somehow came up with a better decision-tree fitting algorithm for low-data cases than any of the conventional or boosted-tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ..., with y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
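The training setup described there can be sketched roughly like this (a paraphrase of the general in-context-regression recipe; the dimensions, sparsity, and serialization below are illustrative, not the paper's exact values):

```python
import numpy as np

def make_icl_sequence(n_points=40, dim=20, sparsity=3, rng=None):
    """One training sequence for in-context regression: a fresh random
    sparse linear function per sequence, serialized as x1, y1, x2, y2, ...
    (illustrative parameters, not the paper's exact setup)."""
    rng = rng or np.random.default_rng()
    w = np.zeros(dim)
    support = rng.choice(dim, size=sparsity, replace=False)  # sparse weights
    w[support] = rng.normal(size=sparsity)
    xs = rng.normal(size=(n_points, dim))
    ys = xs @ w
    # interleave into the stream the decoder is trained to continue
    return [v for x, y in zip(xs, ys) for v in (x, y)]

seq = make_icl_sequence(rng=np.random.default_rng(0))
```

Since every sequence comes from a brand-new random function, memorizing any single function is useless; the model has to infer the function from the in-context examples.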


Littlestone and Warmuth made the connection to compression in 1986, which was later shown to be equivalent to finite VC dimension / PAC learnability.

Look into DBSCAN and OPTICS for a far closer lens on how clustering works in modern commercial ML; KNN is not the only form of clustering.
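For concreteness, density-based clustering draws boundaries quite differently from nearest-neighbor voting. A minimal scikit-learn sketch (the blob/noise data here is made up for illustration; OPTICS lives in the same module and relaxes DBSCAN's fixed radius):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus sparse uniform noise; KNN-style voting has no
# notion of "noise", while density-based methods label outliers as -1
blobs = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                        rng.normal(5.0, 0.1, (50, 2))])
noise = rng.uniform(-2.0, 7.0, (10, 2))
X = np.concatenate([blobs, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

The two dense regions come out as clusters without specifying k, and isolated points are rejected as noise rather than forced into the nearest cluster.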

But it is still in-context: additional compression that depends on a decider function, or equivalently a composition of linearized set-shattering parts.


I am very familiar with these and other clustering methods in modern ML, and have been involved in inventing and publishing some such methods myself in various scientific contexts. The paper I cited above only used 3 nearest neighbors as one baseline IIRC; that is why I mentioned KNN. However, even boosted trees failed to reduce the loss as much as the algorithm learned from the data by the decoder transformer.


Here is a fairly good lecture series on graduate-level complexity theory that will help with understanding parts of this: at least why multiple iterations help, but also why they aren't the answer to superhuman results.

https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOzYNs...


Thanks for the tip, though I’m not sure how complexity theory will explain the impossibility of superhuman results. The main advantage ML methods have over humans is that they train much faster. Just like humans, they get better with more training. When they are good enough, they can be used to generate synthetic data, especially for cases like software optimization, where it is possible to verify the ground truth. A system need only be correct once in a thousand tries to be useful for generating training data, as long as we can reliably eliminate all failures. Modern LLMs can already beat that minimal requirement for coding, and o1/o3 can probably handle complicated cases. There are differences between coding and games (where ML is already superhuman in most instances), but they start to blur once the model has a baseline command of language, a reasonable model of the world, and the ability to follow desired specs.
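That rejection-sampling idea can be sketched in a few lines: a generator that is right roughly once in a thousand tries, filtered by a reliable verifier. The task and both functions below are stand-ins invented for illustration, not a real pipeline:

```python
import random

def weak_generator(task, rng):
    """Stand-in for a model that is right only ~1 time in 1000."""
    return task * 2 if rng.random() < 0.001 else rng.randrange(1000)

def verifier(task, answer):
    """Reliable ground-truth check -- e.g. running the test suite on
    optimized code. Here the toy 'task' is doubling a number."""
    return answer == task * 2

def collect_training_data(n_tasks, attempts_per_task, seed=0):
    rng = random.Random(seed)
    data = []
    for task in range(n_tasks):
        for _ in range(attempts_per_task):
            ans = weak_generator(task, rng)
            if verifier(task, ans):  # keep only verified outputs
                data.append((task, ans))
                break
    return data

clean = collect_training_data(n_tasks=50, attempts_per_task=5000)
```

Every kept pair is correct by construction, so the weak generator plus a strong verifier still yields clean training data.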


ML is better than biological neurons at some tasks, but those are different contexts.

Almost all of the performance on, say, college tests comes purely from pre-training: pattern finding and detection.

Transformers are limited to DLOGTIME-uniform TC0; they can't even do the Boolean circuit value problem.
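A toy illustration of why the circuit value problem resists constant-depth parallelism: evaluating a circuit means walking the gates in topological order, and a long chain of gates forces depth proportional to size. This evaluator is a hypothetical sketch, not tied to any particular paper:

```python
def eval_circuit(gates, inputs):
    """Circuit Value Problem: evaluate gates in topological order.
    CVP is P-complete, i.e. believed inherently sequential -- each gate
    can depend on earlier results, so a chain of gates forces depth
    proportional to size, unlike a fixed-depth TC0 computation."""
    vals = dict(inputs)
    for name, op, args in gates:  # gates listed in topological order
        a = [vals[x] for x in args]
        vals[name] = (not a[0]) if op == "NOT" else {"AND": all, "OR": any}[op](a)
    return vals

# evaluates x AND (NOT y)
out = eval_circuit(
    gates=[("g1", "NOT", ["y"]), ("g2", "AND", ["x", "g1"])],
    inputs=[("x", True), ("y", False)],
)
```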

The ability to use the properties of BPP does help.

Understanding the power of, and limitations of iteration and improving approximations requires descriptive complexity theory IMHO.


I read a book on recursively enumerable degrees once, which IIRC was a sort of introduction to complexity classes of various computable functions, but I never imagined it having practical use; so this post is eye-opening. I've been nattering about how the models are largely finding separating hyperplanes after non-linear transformations have been done, but this approach, where the AI's solving ability can't exceed what its complexity class allows, is an interesting one.


The discussion cannot go deeper than the current level, unfortunately. One thing not to forget when thinking about decoder transformer models is that there is no limitation on having parts of the output/input stream be calculated by other circuits if it helps the cause. E.g., send a token to use a calculator, compute and fill in the answer; send a token to compile and run some code and fill the stream with the results. The complexity class of the main circuit might not need to be much more complicated than the 200-level-deep typical architectures of today, as long as they have access to memory and tools. You can call this system something else if you prefer (decoder-transformer-plus-computer), but that is what people interact with in ChatGPT, so I'm not sure I agree that complexity theory limits superhuman ability. Humans are not good with complexity.
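That decoder-plus-tools loop can be sketched with made-up token conventions (the `CALC(...)` marker and the splice-back protocol are hypothetical, not any real API):

```python
import ast, operator

def calculator(expr):
    """External 'circuit': evaluate arithmetic outside the model."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def run_stream(tokens):
    """When the 'model' emits a CALC(...) token, the external tool
    computes the value and the result is spliced back into the stream
    the model keeps conditioning on."""
    out = []
    for tok in tokens:
        if tok.startswith("CALC(") and tok.endswith(")"):
            out.append(str(calculator(tok[5:-1])))
        else:
            out.append(tok)
    return " ".join(out)

demo = run_stream(["the", "answer", "is", "CALC(37*12+1)"])
```

The main circuit never has to carry out the multiplication itself; the composed system is more powerful than the transformer alone.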


I recall early, incomplete speculation about transformers not solving Boolean circuit value problems; what did you think of this work? https://arxiv.org/abs/2402.12875v3


> However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(log n) embedding size can solve any problem solvable by boolean circuits of size T

There is a difference between being equivalent to a circuit and predicting the output of the Boolean circuit value problem.

That is what I was suggesting learning descriptive complexity theory would help with.


Why does the limit on computational complexity of single decoder transformers matter for obtaining superhuman coding ability? Is there a theory of what level of complexity is needed for the task of coding according to a spec? Or the complexity for translation/optimization of a code? Even if there were, and one could show that a plain decoder transformer is insufficient, you probably only need to add a tool in the middle of the stream processing. Unless you have some specific citation that strongly argues otherwise, I will stick with my speculative/optimistic view on the upcoming technology explosion. To be fair, I always thought coding was at best modest complexity, not super hard compared to other human activities, so I will not make claims of generic superintelligences anytime soon; I hope they happen in the near term, but I'd be happy simply to see them in a decade, and I don’t feel partial to any architecture. I just think that attention was a cool idea even before the transformers, and decoder transformers took it to the limit. It may be enough for a lot of superhuman achievements. Probably not for all. We will see.


Rice's theorem means you can't decide whether a program is correct in general; you have to choose an error direction and accept the epsilon.
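One way to picture "choose an error direction and accept the epsilon": a verifier with a step budget, where running out of budget counts as failure. The small-step program encoding below is invented for illustration:

```python
def budgeted_run(step_fn, state, max_steps):
    """Small-step execution: step_fn(state) -> (done, next_state_or_result).
    Exceeding the budget is treated as non-termination."""
    for _ in range(max_steps):
        done, state = step_fn(state)
        if done:
            return True, state
    return False, None

def looks_correct(step_fn, init, expected, max_steps=1000):
    """One-sided verifier: a run that halts within budget with the
    expected answer is accepted; anything else (wrong answer, or out of
    budget) is rejected. Correct-but-slow programs land in the rejected
    epsilon; a wrong output is never accepted."""
    halted, result = budgeted_run(step_fn, init, max_steps)
    return halted and result == expected

countdown = lambda n: (True, 42) if n <= 0 else (False, n - 1)  # halts
loop = lambda n: (False, n)                                     # never halts
```

The error is deliberately one-sided: the verifier may wrongly reject a correct program, but it never certifies an incorrect one, which is the direction you want for filtering training data.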

The Curry–Howard–Lambek correspondence is possibly a good tool to think about it.

The reason I suggested graduate-level complexity theory is that the undergrad curriculum is flawed: it makes it seem that you can just use brute force with a TM to simulate an NTM for NP.

It is usually taught that NP is the set of decision problems that can be solved by an NTM in polynomial time.

But you can completely drop the NTM and say it is the set of decision problems that are verifiable by a DTM in poly time.

Those are equivalent.
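The two definitions can be seen side by side in a toy SAT example: the certificate check is polynomial, while dropping the NTM and brute-forcing certificates on a DTM costs 2^n (the clause encoding and helper names here are made up):

```python
from itertools import product

def verify(clauses, assignment):
    """The DTM side: check a certificate in time polynomial in the
    input (literal i means variable i is true, -i its negation)."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

def brute_force_sat(clauses, n_vars):
    """The 'simulate the NTM' picture: try all 2^n certificates on a
    deterministic machine -- exponentially many calls to a poly-time check."""
    for bits in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(bits, start=1))
        if verify(clauses, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
model = brute_force_sat(cnf, 3)
```

The verifier is cheap; only the search over certificates is expensive, which is exactly why the NTM can be dropped from the definition.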

Consider the Approximate Shortest Vector Problem (GapSVP), which is NP-hard and equivalent to predicting the output of a 2-layer NN (IIRC).

Being NP-hard, it is no longer a decision problem.

Note that for big O, you still have your scalar term. Repeated operations are typically dropped.

If you are in contemporary-scale ML, parallelism is critical to problems being solvable, even with FAANG-level budgets.

If you are limited to DLOGTIME-uniform TC0, you can't solve NC1-complete problems, and surely can't do P-complete problems.

But that is still at the syntactic level; software in itself isn't worth anything, it is the value it provides to users that is important.

Basically what you are claiming is that feed-forward NNs solve the halting problem, in a generalized way.

Training an LLM to make safe JWT refresh code is very different from generalized programming, mainly because most of their ability to do so comes from pre-training.

Inference time is far more limited, especially for transformers, and this is well established.

https://arxiv.org/abs/2309.06926


> they just are dealing with next token prediction.

And nuclear power plants are just heating water.



