I think you're close enough that the differences probably aren't too important. But if you want a bit more nuance, read on. For disclosure, I'm in the second camp here. But I'll also say that I have a lot of strong evidence to support this position, and that I come at it from the perspective of a researcher.
There are a few big problems with making any definite claim for either side. First, we need to know what data the machine processes during training. I think we all understand that if the test data is in the training set, then testing is not measuring a model's ability to generalize but its ability to recall. Second, we need to recognize the amount of duplication in the data, both exact and semantic.
1) We have no idea, because these datasets are proprietary. While LLaMA is more open than GPT, we don't know all the data that went into it (last I checked). Thus you can't say "this isn't in the data."[0] We do know some things that are in the data, though we don't know exactly what was filtered out. We're all pretty online people here, and I'm sure many have seen some of the depths of places like Reddit, Medium, or even Hacker News. These are all in the (unfiltered) training data! There are also large numbers of arXiv papers, books, publications, and much more. So you have to ask yourself: "Are we confident that what we're asking the model to do is not in the data we trained on?" Almost certainly it is, so the question becomes: "Are we confident that what we're asking the model to do was adequately filtered out before training, so that we have a fair test?" Whatever your position, I think you can see how important that question is and how easy it would be to mess up. And it only gets easier to mess up as we train on more data, since that much data is so incredibly hard to process.[1] I think you can see some concerning issues with the filtering method described there and how it can produce a large number of false negatives. Notably, they explicitly ignore answers, which matters for the second point. IIRC the GPT-3 paper also used an n-gram check for duplicates. But the most concerning line to me was this one:
> As can be seen in tables 9 and 10, contamination overall has very little effect on the reported results.
There is a concerning way to read this that serves as a valid explanation for the results: the data is so contaminated that the filtering process does not meaningfully remove the contamination, and thus removing what it flags does not significantly change the results. If the presence of contamination in your data does not change your results, then either you have a model that has learned the underlying function of the data VERY well and generalizes extremely impressively, OR your data is contaminated in ways you aren't aware of (there are other explanations too, btw). One of those answers is clearly simpler.
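To make the false-negative worry concrete, here is a minimal sketch of the kind of n-gram overlap check mentioned above. The training string, the paraphrase, and the n=3 setting are all made up for illustration (the real checks run longer n-grams over the actual corpus), but the shape of the problem is the same:

    # Hypothetical illustration: flag a test example as contaminated only if it
    # shares at least one token n-gram with some training document.
    def ngrams(text, n=3):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(test_example, train_corpus, n=3):
        test_grams = ngrams(test_example, n)
        return any(test_grams & ngrams(doc, n) for doc in train_corpus)

    train = ["there are some concerning issues left to be resolved that "
             "bring into question the potential for information leakage"]

    print(is_contaminated(train[0], train))  # True: the exact copy is caught
    print(is_contaminated("i suspect there is data spoilage in this benchmark",
                          train))            # False: the paraphrase slips through

Any check of this shape can only catch near-verbatim copies, which is exactly why the second problem matters.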
Second is semantic information and contamination.[2] This is when data has the same effective meaning but expresses it differently. "This is a cat" and "este es un gato" are semantically the same but share no words. The same goes for "I think there's data spoilage" and "There are some concerning issues left to be resolved that bring into question the potential for information leakage." These will not be caught by substring or n-gram checks. Yet training on one is no different from training on the other once we consider RLHF. The thing is, in high dimensions data is very confusing and does not behave the way you might expect from working in 2D and 3D. A mean between two values may or may not be representative depending on the type of distribution (uniform and Gaussian, respectively), and we don't have a clue what that distribution is (it is intractable!). The curse of dimensionality is about how hard it becomes to distinguish the nearest neighboring point from the farthest one, because our notion of a metric degrades as dimensionality increases (much as we lose algebraic structure going from C (complex) -> H (quaternions) -> O (octonions): first commutativity, then associativity).[3] Some of this may be uninteresting in a strict mathematical sense, but some of it matters. Because of this, we need to rethink our earlier questions carefully. Now we need to ask: "Are we confident that we have filtered out data that is not sufficiently meaningfully different from the test data?" Given the complexity of semantic similarity and the fact that "sufficiently" is not well defined, this should make anybody uneasy. If you are absolutely confident the answer is "yes, we have filtered it," I would think you a fool. It is so incredibly easy to fool ourselves that any good researcher needs a constant amount of doubt (though confidence is needed too!). None of this means our lack of a definite answer should stop progress, but it should make us more careful about what claims we make. And we need to be clear about this, or else conmen have an easy time convincing others.
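As a toy illustration of the distance-concentration point (purely synthetic: uniform random vectors standing in for whatever distribution real embeddings actually follow, which we don't know):

    # Toy demo: the relative gap between the nearest and farthest neighbor
    # shrinks as dimensionality grows, so "nearest" carries less information.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        query = rng.uniform(size=d)
        points = rng.uniform(size=(10_000, d))
        dists = np.linalg.norm(points - query, axis=1)
        contrast = (dists.max() - dists.min()) / dists.min()
        print(f"dim={d:>4}  relative contrast={contrast:.2f}")

The contrast collapses by orders of magnitude as the dimension grows, which is one reason "this test example is far from everything in training" is a much weaker statement than it sounds.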
To me, this common line of research is on shaky ground. Until we know the data, and many people have processed it looking for contamination, results like these are not meaningful. They rest on a shaky foundation, and they often look for evidence that the model reasons rather than seriously considering that it might not.
But for me, the conversations around a lot of this are quite strange. Does it matter that LLMs can't reason? In some sense yes, but the lack of this property does not make them any less powerful a tool. If all they are is a lossy compression of the majority of human knowledge with a built-in human interface, that is still an incredible achievement and a very useful tool. Even Google is fuzzy! But it also tells us what the tool is and isn't good for, and it puts bounds on what we should rely on it for and what we can trust it to do with and without human intervention. I think some are afraid that if LLMs aren't reasoning, then we won't get AGI. But if they don't reason, then we need to find out why, and how to make machines reason, if we are to get there. Ignoring potential pitfalls hinders that progress. I'm not suggesting that we stop using or studying LLMs (we should continue to), but rather that we need to stop putting alternatives down. We need to stop making one-to-one comparisons between models that took millions of dollars for a single training run and have been studied by thousands of people for several years, and things scrambled together by small labs on a shoestring budget. We'll never be able to advance if the goalpost is that you can't take incremental steps along the way. Otherwise, how do you? You have to create something new without testing it, convince someone to give you millions of dollars to train it, and then millions more to fix the mistakes and apply what you've learned along the way? Very inefficient. We can take small steps. I think this goalpost leads to obfuscation: because the bar is set so high, strong claims have to be made for these works to get published. So we have to ask ourselves the deeper question: "Why are we doing this?"[4]
[0] This might seem backwards, but the creation of the model implicitly claims that the test data and training data are segregated. "Show me this isn't in training" is a request for validation.
[1] https://arxiv.org/abs/2303.08774
[2] If you're interested, Meta put out work on semantic deduplication last year. They mostly focused on vision, but it still shows the importance of what's being argued here. It is probably easier to verify that images are semantically similar than that sentences are, since language is more abstract. Pixels can be wildly different while the result is visually identical; how does that concept translate to language? https://arxiv.org/abs/2303.09540
[3] https://math.stackexchange.com/questions/641809/what-specifi...
[4] I think if our answer is just "to make money" (or anything semantically similar, like "increase shareholder value"), then we are doomed to mediocrity and will stagnate. But if we're doing these things to better human lives, to understand the world and how things work (I'd argue building AI is that, even if a bit abstract), or to make useful and meaningful things, then the money will follow. I think many of us, and many leading teams and businesses, have lost focus on the journey that led to the profits and are too focused on the end result. And I do not think this is isolated to CEOs; the same short-sighted thinking repeats all the way down the corporate ladder, from the manager focusing on what their bosses explicitly ask for (rather than the intent) to the employee who knows something is not the right thing to do but does it anyway (often because they know the manager will be unhappy, and this repeats all the way up). All life, business, technology, and creation have immense amounts of complexity, which we obviously want to simplify as much as possible. But when we hyper-focus on any set of rules, no matter how complex, we are doomed to fail, because the environment is always changing and we can never adapt instantly (this is the nature of chaos, where small perturbations have large effects on the outcome). That doesn't mean we shouldn't try to make rules; rather, it means rules are made to be broken, and it's a matter of knowing when. In the end, this is an example of what it means to be able to reason. So we should be careful to ensure that we create AGI by making machines able to reason and think (making them "more human") rather than by making humans into unthinking machines. I worry that the latter looks more likely, given that it is a much easier task to accomplish.
Does "CTO" mean you are the tech lead of a small (single team) engineering organization? Then everything written for staff engineers applies. E.g I've heard good things about "Staff engineer's path" by Tanya Reilly.
Does "CTO" mean you are leading an org that is too large to be hands-on with tech, and need to build an effective structure and culture? Then I second the recommendation for "an elegant puzzle" by Will Larson.
Or does "CTO" mean that you switched from being an engineer to managing a team of engineers? Then everything for new managers applies, for starters I'd recommend "Becoming an effective software engineering manager" by James Stanier, or "Engineering management for the rest of us" by Sarah Drasner.
For some good general material, I'd also recommend the resources Gergely Orosz makes available to subscribers of his "Pragmatic Engineer" newsletter. They are templates for the kinds of documents and processes you will most likely need. If you're new to the role, you will not go too wrong by using them, and if you want to create your own, they are excellent starting points.