> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
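To put the "constant number of links per device" point in concrete terms, here's a simplified comparison (it ignores NVLink switches and other real-world details, and just treats the GPU node as direct point-to-point links):

    # Simplified comparison: links per device in a 3D torus vs. a fully
    # connected group. Numbers are illustrative, not hardware specs.
    def torus_links_per_device(n_dims=3):
        # Two neighbors per axis, no matter how large the torus grows.
        return 2 * n_dims

    def all_to_all_links_per_device(group_size):
        # Every device pairs directly with every other device in the group.
        return group_size - 1

    for n in (8, 64, 256, 8192):
        print(f"{n:>5} devices: torus {torus_links_per_device()} links/device, "
              f"all-to-all {all_to_all_links_per_device(n)} links/device")

The torus stays at 6 links per device as the cluster grows, while direct all-to-all wiring scales with the group size, which is why the latter tops out at node scale.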
Memory grows linearly, compute grows quadratically (but with a small constant; until ~100k tokens, inference will still be dominated by the non-quadratic factors).
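A rough sketch of the scaling (layer count, hidden size, and dtype below are made-up illustrative numbers, and where the quadratic term actually starts to dominate depends heavily on the model):

    # Illustrative scaling only -- dimensions are assumptions, not a real model.
    n_layers = 32
    d_model  = 4096
    bytes_per_value = 2  # bf16

    def kv_cache_bytes(context_len):
        # One key and one value vector per token per layer: linear in context.
        return 2 * n_layers * d_model * bytes_per_value * context_len

    def attention_score_flops(context_len):
        # Every token attends to every earlier token: quadratic in context.
        return 4 * n_layers * d_model * context_len ** 2

    for ctx in (8_000, 32_000, 128_000, 1_000_000):
        print(f"{ctx:>9} tokens: KV cache ~{kv_cache_bytes(ctx) / 1e9:7.1f} GB, "
              f"attention ~{attention_score_flops(ctx) / 1e12:10.0f} TFLOPs")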
Also, reusing keys/values for different queries can compress the KV cache; it can be a 1000x or 10,000x improvement in bandwidth if the model is trained for it.
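If "different queries" here means different query heads (as in multi-query / grouped-query attention), a toy sizing comparison looks like this; head counts and dims are made-up, and the much larger 1000x+ numbers presumably come from stacking this with batching and prefix reuse:

    # Toy KV-cache sizing when KV heads are shared across query heads.
    # All numbers are assumptions, not any specific model.
    def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
        return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

    mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)  # one KV head per query head
    gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8,  d_head=128)  # 8 KV heads shared by 32 query heads
    mqa = kv_bytes_per_token(n_layers=32, n_kv_heads=1,  d_head=128)  # a single shared KV head

    print(f"MHA: {mha} B/token, GQA: {gqa} B/token ({mha/gqa:.0f}x smaller), "
          f"MQA: {mqa} B/token ({mha/mqa:.0f}x smaller)")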
Just to clarify: a simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large prefix shared across queries.
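A minimal sketch of the idea (framework-agnostic toy code, not any particular library's API): the server keys cached KV blocks by the token prefix, so requests that share a long system prompt skip recomputing it.

    # Toy prefix KV cache: maps a token-id prefix to its (stand-in) KV blocks.
    # Real frameworks do this per block/page; this is just the shape of the idea.
    class PrefixKVCache:
        def __init__(self):
            self._cache = {}

        def lookup(self, tokens):
            # Find the longest cached prefix of this request's tokens.
            for end in range(len(tokens), 0, -1):
                prefix = tuple(tokens[:end])
                if prefix in self._cache:
                    return prefix, self._cache[prefix]
            return (), None

        def store(self, tokens, kv_blocks):
            self._cache[tuple(tokens)] = kv_blocks

    cache = PrefixKVCache()
    system_prompt = list(range(2_000))      # pretend 2k-token shared prefix
    cache.store(system_prompt, kv_blocks="<KV for system prompt>")

    request = system_prompt + [9001, 9002]  # user turn appended to shared prefix
    hit, kv = cache.lookup(request)
    print(f"reused {len(hit)} of {len(request)} tokens from cache")  # only 2 tokens left to prefill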
Funnyish story: the other night I asked my Pixel 9 to generate an image via Gemini, then I asked it to make a change. It didn't consider the previous context, so I asked it "Are you capable of keeping context?" No matter how clearly I enunciated "context", it always interpreted what I was saying as "contacts." After the 4th try, I said "context, spelled c-o-n-t-e-x-t" and it replied with "Ah, you meant context! Yes..."
I think Google is digging a hole for themselves by making their lightweight models the most used ones. Regardless of what their heavyweight models can do, people will naturally associate the brand with the search or assistant model.
I noticed Gemini Flash 2.0 making a lot of phonetic typos like that, yeah. Like instead of Basal Ganglia it said Basil Ganglia.
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
I was using the conversational feature of Gemini on my phone the other night and was trying to get it to read a blog post to me. The AI proceeded to tell me (out loud, via voice mode/speech synthesis) that it was a text-based model and couldn't read text out loud.
For as amazing as these things are, AGI they are not.
I thought the memory requirement grows exponentially with context size?