> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
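To put the "constant number of links per device" point in concrete terms, here's a simplified comparison (it ignores NVLink switches and other real-world details, and just treats the GPU node as direct point-to-point links):

    # Simplified comparison: links per device in a 3D torus vs. a fully
    # connected group. Numbers are illustrative, not hardware specs.
    def torus_links_per_device(n_dims=3):
        # Two neighbors per axis, no matter how large the torus grows.
        return 2 * n_dims

    def all_to_all_links_per_device(group_size):
        # Every device pairs directly with every other device in the group.
        return group_size - 1

    for n in (8, 64, 256, 8192):
        print(f"{n:>5} devices: torus {torus_links_per_device()} links/device, "
              f"all-to-all {all_to_all_links_per_device(n)} links/device")

The torus stays at 6 links per device as the cluster grows, while direct all-to-all wiring scales with the group size, which is why the latter tops out at node scale.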
Memory grows linearly, compute grows quadratically (but with a small constant; until ~100k tokens, inference will still be dominated by the non-quadratic factors).
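A rough sketch of the scaling (layer count, hidden size, and dtype below are made-up illustrative numbers, and where the quadratic term actually starts to dominate depends heavily on the model):

    # Illustrative scaling only -- dimensions are assumptions, not a real model.
    n_layers = 32
    d_model  = 4096
    bytes_per_value = 2  # bf16

    def kv_cache_bytes(context_len):
        # One key and one value vector per token per layer: linear in context.
        return 2 * n_layers * d_model * bytes_per_value * context_len

    def attention_score_flops(context_len):
        # Every token attends to every earlier token: quadratic in context.
        return 4 * n_layers * d_model * context_len ** 2

    for ctx in (8_000, 32_000, 128_000, 1_000_000):
        print(f"{ctx:>9} tokens: KV cache ~{kv_cache_bytes(ctx) / 1e9:7.1f} GB, "
              f"attention ~{attention_score_flops(ctx) / 1e12:10.0f} TFLOPs")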
Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10,000x improvement in bandwidth if the model is trained for it.
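If "different queries" here means different query heads (as in multi-query / grouped-query attention), a toy sizing comparison looks like this; head counts and dims are made-up, and the much larger 1000x+ numbers presumably come from stacking this with batching and prefix reuse:

    # Toy KV-cache sizing when KV heads are shared across query heads.
    # All numbers are assumptions, not any specific model.
    def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
        return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

    mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)  # one KV head per query head
    gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8,  d_head=128)  # 8 KV heads shared by 32 query heads
    mqa = kv_bytes_per_token(n_layers=32, n_kv_heads=1,  d_head=128)  # a single shared KV head

    print(f"MHA: {mha} B/token, GQA: {gqa} B/token ({mha/gqa:.0f}x smaller), "
          f"MQA: {mqa} B/token ({mha/mqa:.0f}x smaller)")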
Just to clarify: a simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large prefix shared across queries.
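A minimal sketch of the idea (framework-agnostic toy code, not any particular library's API): the server keys cached KV blocks by the token prefix, so requests that share a long system prompt skip recomputing it.

    # Toy prefix KV cache: maps a token-id prefix to its (stand-in) KV blocks.
    # Real frameworks do this per block/page; this is just the shape of the idea.
    class PrefixKVCache:
        def __init__(self):
            self._cache = {}

        def lookup(self, tokens):
            # Find the longest cached prefix of this request's tokens.
            for end in range(len(tokens), 0, -1):
                prefix = tuple(tokens[:end])
                if prefix in self._cache:
                    return prefix, self._cache[prefix]
            return (), None

        def store(self, tokens, kv_blocks):
            self._cache[tuple(tokens)] = kv_blocks

    cache = PrefixKVCache()
    system_prompt = list(range(2_000))      # pretend 2k-token shared prefix
    cache.store(system_prompt, kv_blocks="<KV for system prompt>")

    request = system_prompt + [9001, 9002]  # user turn appended to shared prefix
    hit, kv = cache.lookup(request)
    print(f"reused {len(hit)} of {len(request)} tokens from cache")  # only 2 tokens left to prefill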
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts." After the 4th try, I said "context, spelled c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
I think Google is digging a hole for themselves by making their lightweight models the most used ones. Regardless of what their heavyweight models can do, people will naturally associate the brand with the search or assistant model.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text-based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
I thought the memory requirement grows exponentially with context size?