The word "attention" has been stretched pretty far to explain what is happening inside a transformer.
What's actually happening is that every token embedding interacts with every other token embedding before it, and as the product of this interaction (dot product + softmax) it takes a fraction of each of those embeddings and adds them to itself. Technically, it's different transforms/functions of the embeddings (the queries, keys, and values) that interact.
You can view it as every token embedding mixing information from other embeddings into itself, done ~100 times in parallel ("attention heads") and ~100 times in sequence (layers). Those are roughly the numbers for the GPT-3 model (175B parameters: 96 heads, 96 layers).
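For the curious, here is a minimal sketch of that mixing step for a single attention head in NumPy. Toy sizes and random weights, purely for illustration; real models use learned projection matrices and repeat this per head, per layer:

    import numpy as np

    l, d = 4, 8                      # sequence length, embedding size (toy values)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(l, d))      # token embeddings

    Wq = rng.normal(size=(d, d))     # learned projections in a real model; random here
    Wk = rng.normal(size=(d, d))
    Wv = rng.normal(size=(d, d))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    scores = Q @ K.T / np.sqrt(d)                 # every token's dot product with every other token
    causal = np.triu(np.ones((l, l)), k=1) == 1   # tokens may only look at positions before them
    scores[causal] = -np.inf

    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row is a set of fractions summing to 1

    mixed = weights @ V   # each token adds those fractions of (transformed) earlier tokens to itself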
I'm pretty sure that in GPT-3.5+ models this concept of attention no longer holds true. In [1] the author suggests they are using "intra-tensor sparsity" from a 2021 Google paper [2].
Details aside, the math suggests they _must_ be using some sparse attention method: the memory used by attention is O(l^2 * d) (here the O notation is overkill: it's exactly that many bytes in 8-bit quantization, or twice that in 16-bit floats). With l = 32k and d probably on the order of a few k (even the "old" BERT model had it at 768, and it has only gone up since), that comes out to a few TB. And that is for (a piece of) _a single_ layer; they probably have dozens, if not hundreds, of them. The largest GPUs on the market have 80GB of memory; there's no way they are really using that many GPUs for every single layer (even leaving aside the fact that the runtime would probably be absolutely horrible).
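Back-of-the-envelope version of that arithmetic, with d and the layer count as guesses rather than published numbers, and using the comment's l^2 * d figure:

    l = 32_000      # context length
    d = 4_096       # hidden size: a guess in the "few k" range assumed above
    n_layers = 96   # another guess (GPT-3 had 96)

    bytes_per_value = 1                              # 8-bit quantization; double for 16-bit floats
    per_layer = l * l * d * bytes_per_value          # the l^2 * d estimate from the comment
    print(per_layer / 1e12, "TB per layer")          # ~4.2 TB
    print(per_layer * n_layers / 1e12, "TB total")   # ~400 TB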
Implicitly or explicitly, the transformer must learn to have both a "summarized" version of attention (that tells it where to focus) and more "detailed" ones. This doesn't need to be specifically coded: with sparse attention, the model might, for example, end up with different levels of attention at different layers on its own, but they have probably coded something a bit smarter than that.
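Purely as an illustration of the idea (not a claim about what OpenAI actually runs): a sliding-window mask plus a handful of "global" tokens is one common sparse pattern. The global tokens act as the "summarized" view, the local window as the "detailed" one, and the score matrix shrinks from l^2 entries to roughly l * window:

    import numpy as np

    def sparse_mask(l, window=256, n_global=4):
        """Which (query, key) pairs are allowed: a local window plus a few global tokens."""
        idx = np.arange(l)
        local = np.abs(idx[:, None] - idx[None, :]) <= window                 # "detailed": nearby tokens
        global_tok = (idx[None, :] < n_global) | (idx[:, None] < n_global)    # "summarized": tokens everyone sees
        return local | global_tok      # (causal restriction omitted for brevity)

    m = sparse_mask(4_096)
    print(m.mean())   # fraction of the full score matrix actually computed, ~0.12 here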
> every token embedding interacts with every other token embedding before it
> it takes a fraction of every other token embedding and adds it to itself.
> every token embedding mixing information from other embeddings into itself
Note the use of the word "every". Phrased this way, calling it "attention" hardly makes sense, since attention is typically focused on something specific at any given time, not spread across the entire thing.
While all other tokens are considered, the attention mechanism puts an individual weight on each one, in a sense "paying more attention" to some than to others.
Softmax takes a vector of arbitrary values and converts it into probabilities such that all the elements of the vector add to 1. Transformers use this to decide on the fractions.
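Concretely, with made-up scores (in a real model they would come from the query-key dot products):

    import numpy as np

    scores = np.array([3.2, 0.1, -1.0, 2.5])          # raw scores for 4 earlier tokens
    fractions = np.exp(scores) / np.exp(scores).sum()
    print(fractions)    # roughly [0.64, 0.03, 0.01, 0.32], and it sums to 1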
This could be seen as "soft attention" where, in theory, there would be a few winners for every "attention head".
It's also possible that the only purpose the softmax actually serves (or at least a major component of it) is normalization. Without it, the variance in the internal network dynamics between training samples would be fairly large (some training samples may have tokens that interact heavily, while others don't), making optimization problematic.
Most LLM architectures use “soft attention” where some fractional amount of attention is put on every token. “Hard attention” is the term for what you describe.
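The difference in two lines, reusing the same toy scores as above:

    import numpy as np

    scores = np.array([3.2, 0.1, -1.0, 2.5])
    soft = np.exp(scores) / np.exp(scores).sum()   # soft: a fractional weight on every token
    hard = np.eye(len(scores))[scores.argmax()]    # hard: all weight on one token -> [1, 0, 0, 0]

Soft attention is what gets used in practice, partly because it stays differentiable end to end; hard attention generally needs sampling tricks to train.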
Naming is a perpetual problem. My issue is with "hallucination", which everyone takes to be the "problem" of GPT-style networks making things up. Never mind that transformers are just trying to predict the most likely next token, NOT the truth, PLUS that they're trained on the internet. As everyone knows, the internet is not known for correctness and truth. If you want any neural network to figure out the truth independently, it'll obviously need the ability to go out into the real world and, for most things, even needs to be allowed to experiment.
Hallucination used to mean the following. A basic neural network is:
f(x) = y = repeat(nonlinearity(ax[0] + bx[1] + ...))
And then you adjust a, b, c, ... until y is reasonable, according to the cost function. But look! The very same backpropagation can adjust x[0], x[1] ... with the same cost function and only a small change in the code.
This allows you to reverse the question neural networks answer. Which can be an incredibly powerful way to answer questions.
And that used to be called hallucination in neural networks. Instead of "change these network weights to transform x into y, keeping x constant", you ask "change x so that the network transforms it into y, keeping the network weights constant".
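A minimal sketch of that reversal in PyTorch, with a made-up toy network; the point is just that the gradient flows to the input instead of the weights:

    import torch

    net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
    for p in net.parameters():
        p.requires_grad_(False)                    # keep the network weights constant

    x = torch.zeros(1, 4, requires_grad=True)      # instead, optimize the input
    target = torch.tensor([[1.0, 0.0]])
    opt = torch.optim.Adam([x], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(net(x), target)   # same cost function as usual
        loss.backward()                                        # same backpropagation, now w.r.t. x
        opt.step()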
Now it's impossible to find half the papers on this topic. AARGH!
> My issue is with "hallucination", which everyone takes to be the "problem" with GPT style networks making things up.
It's most famous as a problem with ChatGPT specifically, which is presented as a chat interface where an agent answers your question. In that context, it makes sense to think of confident and detailed wrong answers as hallucinations.
You could say that it's still an LLM underneath, but then you'd be talking about a layer of abstraction beneath what most people interact with. Given that they tap into the mental model of "chat with an agent" heavily with their interface and interactions, having small disclaimers and saying "I'm just an LLM" from time to time aren't sufficient to counter people's intuitions and expectations.
I believe the naming is perfect: the word "attention" clearly matches the intention of the model structure. It's still clearly lacking in efficiency, but we had to have ChatGPT / GPT-4 working and in use to make the further research directions clear (increasing context length and decreasing hallucinations).