It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs per token, so such models are large in one sense but small in another.
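Roughly, in PyTorch (a toy sketch of top-k routing, not any particular model's implementation): each token only runs through k of the num_experts experts, so compute per token stays flat while total parameters grow with num_experts.

    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        # num_experts controls total parameters; k controls FLOPs per token.
        def __init__(self, dim, num_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                              nn.Linear(4 * dim, dim))
                for _ in range(num_experts))
            self.k = k

        def forward(self, x):  # x: (tokens, dim)
            weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for t, (w, e) in enumerate(zip(weights, idx)):
                # Only k experts run per token, however many exist in total.
                for wi, ei in zip(w, e):
                    out[t] += wi * self.experts[int(ei)](x[t])
            return out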

And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
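E.g. a retrieval loop along these lines (model.generate and the LOOKUP convention are made up for illustration, not a real API):

    def answer_with_tools(model, question, lookup, max_calls=10):
        # Small model + tool calls: facts come from lookups injected
        # into the context, not from knowledge baked into the weights.
        context = [question]
        for _ in range(max_calls):
            step = model.generate("\n".join(context))
            if step.startswith("LOOKUP:"):
                context.append(lookup(step.removeprefix("LOOKUP:").strip()))
            else:
                return step  # final answer assembled from retrieved facts
        return None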


The US lost in the gambling case because their restrictions on foreign websites were stricter than those on domestic ones. The GATS doesn't prohibit countries from regulating trade; it only requires them to do so in a non-discriminatory manner. Spain isn't blocking foreign websites for copyright infringement that would be legal domestically, so they're in compliance with their obligations.

It's called checked luggage for a reason. They already check every bag, so checking for one more thing is hardly a problem.

The "attention is all you need" paper did not invent attention mechanisms. It showed that existing models that were already using attention could have their non-attention parts removed and still worked. So those other parts were unnecessary and only attention was needed.

Most libraries probably don't stock many books like that. They'd just waste shelf space until they eventually get discarded.

It is a compiler. It is not a compiler for Python, because there are valid Python programs it can't compile and isn't intended to compile.

The memory is model-readable but not model-writable, so you still need to train via backprop to get the memory to store useful data.
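Something like this toy PyTorch sketch (my illustration, not any particular paper's code): the memory is a parameter the model attends over, so the forward pass can only read it, and its contents only change when the optimizer applies gradients.

    import torch
    import torch.nn as nn

    class ReadOnlyMemory(nn.Module):
        def __init__(self, slots, dim):
            super().__init__()
            # Readable in the forward pass, writable only by backprop.
            self.memory = nn.Parameter(torch.randn(slots, dim))

        def forward(self, query):  # query: (batch, dim)
            weights = (query @ self.memory.T).softmax(dim=-1)
            return weights @ self.memory  # read, never write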

That test page doesn't seem to use any features current Chrome doesn't support. Or do you just mean that the appearance isn't identical to the TeX rendering even if you use a font like Latin Modern?

It improved a little bit from what I remembered (on Chrome it had problems displaying multi-line brackets), but it still has some inaccuracies:

https://imgur.com/a/83lSuYn


One problem with testing one change at a time is that if each experiment requires many GPU hours, you can only run a small number of experiments, and therefore only test a small number of changes. If you can come up with and implement new changes much more easily than you can test them, it's more efficient to test multiple changes at a time and use some form of Bayesian optimization to find the best combination of changes with as few experiments as possible.
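E.g. with scikit-optimize, treating each training run as one expensive black-box evaluation (run_experiment is a stand-in; here it returns a synthetic loss so the sketch runs end to end):

    from skopt import gp_minimize
    from skopt.space import Categorical, Real

    def run_experiment(params):
        # Stand-in for a full training run (many GPU hours each).
        optimizer, lr, use_new_init = params
        return (lr - 3e-3) ** 2 + 0.01 * (optimizer == "lion") - 0.02 * use_new_init

    space = [Categorical(["adamw", "lion"]),          # change 1
             Real(1e-4, 1e-2, prior="log-uniform"),   # change 2
             Categorical([False, True])]              # change 3

    # The surrogate model picks which combination of changes to try
    # next, so a handful of runs covers many candidate changes.
    result = gp_minimize(run_experiment, space, n_calls=20, random_state=0)
    print(result.x, result.fun)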


Agreed. One-at-a-time testing (OAT) has been outdated for almost a century at this point. Factorial and fractional factorial experiments have been around for that long and give detailed insight into the effect of not just single changes but the interactions between changes, which means you can learn far more per run, since many variables in DL do in fact interact.
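For instance, a full 2^3 factorial design plus main-effect estimates is only a few lines (toy sketch; run_experiment stands in for a real training run):

    from itertools import product

    factors = ["moe", "high_lr", "new_init"]

    def run_experiment(cfg):
        # Stand-in metric; in practice, one training run per row.
        return sum((i + 1) * v for i, v in enumerate(cfg.values()))

    # Full factorial: every on/off combination of the three changes.
    runs = [dict(zip(factors, levels)) for levels in product([0, 1], repeat=3)]
    results = {tuple(cfg.values()): run_experiment(cfg) for cfg in runs}

    # Main effect of a factor: mean(on runs) - mean(off runs),
    # averaged over all settings of the other factors.
    for i, f in enumerate(factors):
        on = [r for k, r in results.items() if k[i] == 1]
        off = [r for k, r in results.items() if k[i] == 0]
        print(f, sum(on) / len(on) - sum(off) / len(off))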

Or use more modern Bayesian methods if you're more interested in getting the best result out of a given hyperparameter sweep.

However, that is not to detract from the excellent effort made here and the great science being investigated. Write-ups like this offer so much gold to the community.


The number of runs you can afford is not enough to perform Bayesian optimization. Count how many different options they explored in the text and take a guess at how many samples you'd need to start modeling the hyperparameter space.


Sybil attacks are a problem when you care about global properties of permissionless networks. If you only care about local properties in a subnetwork where you hand-pick the nodes, the problem goes away. I.e. you can't use such a scheme to find the best paper in the whole world, but you can use it to rank papers in a small subdiscipline where you personally recognize most of the important authors.
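Concretely, that's more or less personalized PageRank seeded from authors you trust (a networkx sketch on a toy citation graph; the Sybil cluster gets essentially no score because no trusted node links into it):

    import networkx as nx

    # Toy citation graph: edge u -> v means paper u cites paper v.
    G = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"),
                    ("sybil1", "sybil2"), ("sybil2", "sybil1")])

    # Hand-picked seed: papers by authors you personally recognize.
    trusted = {"a": 1.0}

    # Random walks restart at the trusted seeds, so a Sybil cluster,
    # however large, receives almost no weight.
    scores = nx.pagerank(G, alpha=0.85, personalization=trusted)
    print(sorted(scores.items(), key=lambda kv: -kv[1]))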

