A text token generally represents a fragment of a single word, while a vision token represents a region of the page, which may span multiple words. This is where the "compression factor" comes from.

The number of bits needed to represent a text token or a vision token is the same, since both are embeddings with a fixed number of dimensions set by the Transformer (maybe a few thousand for a large SOTA model).
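
As a back-of-the-envelope sketch of both points (every number below is an illustrative assumption, not a figure from the DeepSeek paper):

    # Assumed page statistics and encoder output -- illustration only.
    words_on_page        = 500
    text_tokens_per_word = 1.3           # rough BPE average for English prose
    text_tokens          = int(words_on_page * text_tokens_per_word)   # 650

    vision_tokens = 64                   # assume the encoder emits 64 tokens per page

    compression_factor = text_tokens / vision_tokens   # ~10 text tokens per vision token

    # Either way, each token ends up as one fixed-size embedding, so the raw
    # "bits per token" are identical for text and vision tokens:
    embed_dim      = 4096                # assumed model width
    bits_per_token = embed_dim * 16      # e.g. fp16 activations

    print(compression_factor, bits_per_token)   # 10.15625 65536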

Whether a vision token actually contains enough information to accurately extract (OCR) all the text from its region of the image depends on how many pixels that vision token represents and how many words were present in that area. It's just like looking at images of the same page of text at different resolutions - a 1024x1024 image vs a 64x64 one, etc. As the resolution decreases, so does OCR accuracy; at some point the resolution is insufficient, the words become a blurry mess, and accuracy collapses.
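
To make the resolution point concrete, here's the pixel budget per vision token at a few page resolutions, assuming (hypothetically) a fixed budget of 64 vision tokens per page:

    vision_tokens_per_page = 64          # assumed fixed token budget

    for side in (1024, 256, 64):
        pixels_per_token = (side * side) / vision_tokens_per_page
        print(f"{side}x{side} page -> {pixels_per_token:,.0f} px per vision token")

    # 1024x1024 -> 16,384 px per token: several words stay legible
    #   64x64   ->     64 px per token: the text is a blur, OCR falls apart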

This is what DeepSeek are reporting - OCR accuracy when a single vision token has to represent, say, 10 text tokens versus 20. The vision token may carry enough resolution to represent 10 tokens well, but not enough for 20.
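
A crude capacity check along those lines (again, every number here is my own assumption, not theirs): does one vision token's pixel budget legibly hold 10 vs 20 text tokens' worth of text?

    px_per_vision_token  = 32 * 32       # assume one token covers a 32x32 pixel region
    px_needed_per_word   = 80            # assumed minimum for legible small print
    text_tokens_per_word = 1.3

    for text_tokens in (10, 20):
        words     = text_tokens / text_tokens_per_word
        px_needed = words * px_needed_per_word
        verdict   = "fits" if px_needed <= px_per_vision_token else "too blurry"
        print(f"{text_tokens} text tokens per vision token: "
              f"need ~{px_needed:.0f} px, have {px_per_vision_token} px -> {verdict}")

    # 10 text tokens per vision token: need ~615 px, have 1024 px -> fits
    # 20 text tokens per vision token: need ~1231 px, have 1024 px -> too blurry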
