The word "attention" has been stretched pretty far to explain what is happening inside a transformer.
What's actually happening is that every token embedding interacts with every other token embedding before it, and as the product of this interaction (dot product + softmax) it takes a fraction of each of those embeddings and adds them to itself. Technically, it's different transforms/functions of the embeddings (the queries, keys, and values) that interact.
You can view it as every token embedding mixing information from other embeddings into itself, done ~100 times in parallel ("attention heads") and ~100 times in sequence (layers). Those are roughly the numbers for the GPT-3 model (175B parameters: 96 heads, 96 layers).
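For the curious, here is a minimal sketch of that mixing step for a single attention head in NumPy. Toy sizes and random weights, purely for illustration; real models use learned projection matrices and repeat this per head, per layer:

    import numpy as np

    l, d = 4, 8                      # sequence length, embedding size (toy values)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(l, d))      # token embeddings

    Wq = rng.normal(size=(d, d))     # learned projections in a real model; random here
    Wk = rng.normal(size=(d, d))
    Wv = rng.normal(size=(d, d))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    scores = Q @ K.T / np.sqrt(d)                 # every token's dot product with every other token
    causal = np.triu(np.ones((l, l)), k=1) == 1   # tokens may only look at positions before them
    scores[causal] = -np.inf

    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row is a set of fractions summing to 1

    mixed = weights @ V   # each token adds those fractions of (transformed) earlier tokens to itself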
I'm pretty sure that in GPT-3.5+ models this concept of attention no longer holds true. In [1] the author suggests they are using "intra-tensor sparsity" from a 2021 Google paper [2].
Details aside, the math suggests they _must_ be using some sparse attention method: the memory used by attention is O(l^2 * d) (here the O notation is overkill: it's exactly that many bytes in 8-bit quantization, or twice that in 16-bit floats). With l = 32k and d probably on the order of a few k (even the "old" BERT model had it at 768, and it has only gone up since), that comes out to a few TB. And that is for (a piece of) _a single_ layer; they probably have dozens, if not hundreds, of them. The largest GPUs on the market have 80GB of memory; there's no way they are really using that many GPUs for every single layer (even leaving aside the fact that the runtime would probably be absolutely horrible).
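Back-of-the-envelope version of that arithmetic, with d and the layer count as guesses rather than published numbers, and using the comment's l^2 * d figure:

    l = 32_000      # context length
    d = 4_096       # hidden size: a guess in the "few k" range assumed above
    n_layers = 96   # another guess (GPT-3 had 96)

    bytes_per_value = 1                              # 8-bit quantization; double for 16-bit floats
    per_layer = l * l * d * bytes_per_value          # the l^2 * d estimate from the comment
    print(per_layer / 1e12, "TB per layer")          # ~4.2 TB
    print(per_layer * n_layers / 1e12, "TB total")   # ~400 TB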
Implicitly or explicitly, the transformer must learn to have both a "summarized" version of attention (that tells it where to focus) and more "detailed" ones. This doesn't need to be specifically coded: with sparse attention, the model might, for example, end up with different levels of attention at different layers on its own, but they have probably coded something a bit smarter than that.
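Purely as an illustration of the idea (not a claim about what OpenAI actually runs): a sliding-window mask plus a handful of "global" tokens is one common sparse pattern. The global tokens act as the "summarized" view, the local window as the "detailed" one, and the score matrix shrinks from l^2 entries to roughly l * window:

    import numpy as np

    def sparse_mask(l, window=256, n_global=4):
        """Which (query, key) pairs are allowed: a local window plus a few global tokens."""
        idx = np.arange(l)
        local = np.abs(idx[:, None] - idx[None, :]) <= window                 # "detailed": nearby tokens
        global_tok = (idx[None, :] < n_global) | (idx[:, None] < n_global)    # "summarized": tokens everyone sees
        return local | global_tok      # (causal restriction omitted for brevity)

    m = sparse_mask(4_096)
    print(m.mean())   # fraction of the full score matrix actually computed, ~0.12 here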
> every token embedding interacts with every other token embedding before it
> it takes a fraction of every other token embedding and adds it to itself.
> every token embedding mixing information from other embeddings into itself
Note the use of the word "every". Phrased this way, calling it "attention" hardly makes sense, since attention is typically focused on something specific at any given time, not spread across the entire thing.
While all other tokens are considered, the attention mechanism puts an individual weight on each one, in a sense "paying more attention" to some than to others.
Softmax takes a vector of arbitrary values and converts it into probabilities such that all the elements of the vector add to 1. Transformers use this to decide on the fractions.
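Concretely, with made-up scores (in a real model they would come from the query-key dot products):

    import numpy as np

    scores = np.array([3.2, 0.1, -1.0, 2.5])          # raw scores for 4 earlier tokens
    fractions = np.exp(scores) / np.exp(scores).sum()
    print(fractions)    # roughly [0.64, 0.03, 0.01, 0.32], and it sums to 1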
This could be seen as "soft attention" where, in theory, there would be a few winners for every "attention head".
It's also possible that the only purpose the softmax actually serves (or at least a major component of it) is normalization. Without it, the variance in the internal network dynamics between training samples would be fairly large (some training samples may have tokens that interact heavily, while others don't), making optimization problematic.
Most LLM architectures use “soft attention” where some fractional amount of attention is put on every token. “Hard attention” is the term for what you describe.
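The difference in two lines, reusing the same toy scores as above:

    import numpy as np

    scores = np.array([3.2, 0.1, -1.0, 2.5])
    soft = np.exp(scores) / np.exp(scores).sum()   # soft: a fractional weight on every token
    hard = np.eye(len(scores))[scores.argmax()]    # hard: all weight on one token -> [1, 0, 0, 0]

Soft attention is what gets used in practice, partly because it stays differentiable end to end; hard attention generally needs sampling tricks to train.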
Naming is a perpetual problem. My issue is with "hallucination", which everyone takes to be the "problem" of GPT-style networks making things up. Never mind that transformers are just trying to predict the most likely next token, NOT the truth, PLUS that they're trained on the internet. As everyone knows, the internet is not known for correctness and truth. If you want any neural network to figure out the truth independently, it'll obviously need the ability to go out into the real world and, for most things, even needs to be allowed to experiment.
Hallucination used to mean the following. A basic neural network is:
f(x) = y = repeat(nonlinearity(ax[0] + bx[1] + ...))
And then you adjust a, b, c, ... until y is reasonable, according to the cost function. But look! The very same backpropagation can adjust x[0], x[1] ... with the same cost function and only a small change in the code.
This allows you to reverse the question neural networks answer. Which can be an incredibly powerful way to answer questions.
And that used to be called hallucination in neural networks. Instead of "change these network weights to transform x into y, keeping x constant", you ask "change x so that the network transforms it into y, keeping the network weights constant".
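A minimal sketch of that reversal in PyTorch, with a made-up toy network; the point is just that the gradient flows to the input instead of the weights:

    import torch

    net = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
    for p in net.parameters():
        p.requires_grad_(False)                    # keep the network weights constant

    x = torch.zeros(1, 4, requires_grad=True)      # instead, optimize the input
    target = torch.tensor([[1.0, 0.0]])
    opt = torch.optim.Adam([x], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(net(x), target)   # same cost function as usual
        loss.backward()                                        # same backpropagation, now w.r.t. x
        opt.step()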
Now it's impossible to find half the papers on this topic. AARGH!
> My issue is with "hallucination", which everyone takes to be the "problem" with GPT style networks making things up.
It's most famous as a problem with ChatGPT specifically, which is presented as a chat interface where an agent answers your question. In that context, it makes sense to think of confident and detailed wrong answers as hallucinations.
You could say that it's still an LLM underneath, but then you'd be talking about a layer of abstraction beneath what most people interact with. Given that they tap into the mental model of "chat with an agent" heavily with their interface and interactions, having small disclaimers and saying "I'm just an LLM" from time to time aren't sufficient to counter people's intuitions and expectations.
I believe the naming is perfect: the word "attention" clearly matches the intention of the model structure. It's still clearly lacking in efficiency, but we had to have ChatGPT / GPT-4 working and in use to make the further research directions clear (increasing context length and decreasing hallucinations).