I never really understood why Google keeps pushing Brotli. LZ4 and ZStd both predate it by a couple of years, and seem to offer superior performance overall. Is it a Google-NIH thing?
They were designed more or less independently of each other, and your claim that Zstandard predates Brotli is also wrong (both appeared around 2015). Both are designed to be much more efficient than DEFLATE on both the time and size axes, but otherwise they have little in common. Zstandard is notable for its use of ANS, while Brotli is probably the first mainstream format to make prominent use of a 2nd-order context model, something previously thought to be too slow. Brotli was also designed with web content in mind, which makes it excellent for small inputs with no prior. Zstandard has no such treatment but performs better when the I/O can keep up, which makes it excellent for server jobs.
I've been using LZ4 in commercial software since 2011, so unless I am some kind of luminary, Google's compression experts must have been aware of both LZ4 and ZHuff when they started working on Brotli.
By that reckoning I can claim that the origin of Brotli also dates back to 2012 or earlier. :-) Indeed, the Brotli authors were also responsible for the VP8 Lossless codec (2012), Pik (2017) and eventually JPEG XL (2019, in part). The entropy coding scheme is substantially similar among those codecs for this reason.
But really, the sheer number of differences between the two formats suggests that they were designed with different uses in mind, which I believe comes down to the maximum possible I/O. Formats designed for the web don't need an extremely fast decoder because I/O is the bottleneck: 100 MB/s should be enough, and the time saved can be spent on a higher compression ratio, even at the cost of decoder performance.
The performance depends on the data. I have found that for English text and source code Brotli outperforms Zstd. Not sure if this is because of the static dictionary or some other tweaks, though. However, none of this is fixed, as both technologies are actively developed.
Better decoding-time performance, I should have said. The use of a 2nd-order context model indeed makes Brotli more efficient in compression ratio for many kinds of data.
Not even close; zlib had one. (Search for "FDICT" in RFC 1950.) What Zstandard added was a CLI tool to automatically generate a good-enough dictionary from sample inputs.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present
immediately after the FLG byte. The dictionary is a sequence of
bytes which are initially fed to the compressor without
producing any compressed output. DICT is the Adler-32 checksum
of this sequence of bytes (see the definition of ADLER32
below). The decompressor can use this identifier to determine
which dictionary has been used by the compressor.
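To make FDICT concrete, here's a rough sketch using Python's standard zlib module, whose zdict parameter exposes exactly this preset-dictionary mechanism (the payload and dictionary below are made up):

    import zlib

    # A preset dictionary: byte strings we expect to recur in the payload.
    preset = b'{"name": "", "email": "", "address": ""}'
    payload = b'{"name": "Alice", "email": "alice@example.com", "address": "Wonderland"}'

    # Compressing with zdict sets FDICT and writes the dictionary's Adler-32 ID.
    co = zlib.compressobj(level=9, zdict=preset)
    with_dict = co.compress(payload) + co.flush()

    # The decompressor must be primed with the same dictionary.
    do = zlib.decompressobj(zdict=preset)
    assert do.decompress(with_dict) == payload

    print(len(with_dict), "bytes with dictionary,", len(zlib.compress(payload, 9)), "without")

The catch is visible in the RFC text: only the Adler-32 of the dictionary is transmitted, so both sides have to agree on the dictionary itself out of band.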
Well, wow. I have to wonder why this wasn't more utilized, then. There are a ton of contexts (columnar data in databases, for example) where shared-dictionary-based compression might have helped a ton before now.
Zstd is already used in the Linux kernel, and I would assume that any patents covering Lempel-Ziv and Huffman coding would have expired long ago. The situation around ANS also seems to have been resolved, see https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Tab... .
Brotli comes with a dictionary by default, which is optimized for plaintext use cases. Once you give ZSTD a similar dictionary, its ratios are going to be similar/better.
I won't provide random links, because those can't replace testing on your own dataset.
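For what it's worth, here's a rough sketch of that train-a-dictionary workflow with the python-zstandard bindings (the training samples are synthetic, and the sizes may need tuning on real data):

    import zstandard as zstd

    # Synthetic stand-ins for many small, similar payloads (e.g. API responses).
    samples = [
        b'{"id": %d, "name": "user%d", "tags": ["alpha", "beta"], "score": %d}'
        % (i, i * 7 % 997, i * 13 % 101)
        for i in range(2000)
    ]

    # Train a small dictionary from the samples (CLI equivalent: `zstd --train`).
    dictionary = zstd.train_dictionary(8192, samples)

    msg = b'{"id": 42, "name": "user294", "tags": ["alpha", "beta"], "score": 41}'
    with_dict = zstd.ZstdCompressor(level=19, dict_data=dictionary).compress(msg)
    without = zstd.ZstdCompressor(level=19).compress(msg)
    assert zstd.ZstdDecompressor(dict_data=dictionary).decompress(with_dict) == msg

    print(len(with_dict), "bytes with dictionary,", len(without), "without")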
Brotli has two major advantages. First, it encodes to the same well-known gzip format, so rolling it out didn't require new clients. Second, it didn't require dictionary support.
Whereas if you look at Mozilla's standards-positions for zstd, they deferred because it requires a compression dictionary to encode and Moz didn't want to have to "standardize" one specific dictionary. https://github.com/mozilla/standards-positions/issues/105
> Moz didn't want to have to "standardize" one specific dictionary
You could have multiple sources of dictionary identified by something like a hash, and cached in the browser. For example, you could have a general-purpose HTML dictionary, a general-purpose CSS dictionary and a general-purpose JS dictionary, have canonical/official versions of these hosted somewhere (say, on GitHub) and distributed wherever, and update them on a semi-regular basis (at the cost of re-compressing the associated data). In turn, these could be overridden on a site-by-site basis with site-specific versions. You could basically associate an overridable but defaulted dictionary with each MIME type (or none at all in the case of already-compressed data).
Every file would have to be sent with the hash of the dictionary that was used to start compressing it, which would then be retrieved and cached somehow.
I could see this resulting in some fairly significant byte savings, especially for retrieval of many smaller files.
I bet this would also be potentially useful for email clients/servers.
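A toy sketch of what that could look like, using zlib's preset dictionary as a stand-in and a made-up X-Dictionary-Hash header (the dictionary contents here are placeholders, not real tuned dictionaries):

    import hashlib
    import zlib

    # Hypothetical registry of general-purpose dictionaries keyed by MIME type.
    DICTS = {
        "text/html": b'<!doctype html><html><head><meta charset="utf-8">',
        "text/css": b"margin:0;padding:0;display:flex;position:absolute;",
        "application/javascript": b"function(){return};const document.querySelector(",
    }

    def compress_with_shared_dict(body: bytes, mime: str):
        """Compress with the MIME type's shared dictionary and report which one was used."""
        d = DICTS[mime]
        co = zlib.compressobj(level=9, zdict=d)
        data = co.compress(body) + co.flush()
        # Hypothetical header telling the client which cached dictionary to fetch/use.
        return data, {"X-Dictionary-Hash": hashlib.sha256(d).hexdigest()}

    body = b'<!doctype html><html><head><meta charset="utf-8"></head><body>hi</body></html>'
    data, headers = compress_with_shared_dict(body, "text/html")
    print(len(body), "->", len(data), headers["X-Dictionary-Hash"][:16])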
> You could have multiple sources of dictionary identified by something like a hash, and cached in the browser.
The browser actually has a very advanced implementation of something like this! They're called URLs. The work here is to make it possible to identify dictionaries, and to retrieve and cache them in the browser.
Mozilla, I think, smartly avoided creating a spec that requires some variety of centralized governance over which dictionaries to use, like you describe. Whatever the intent of such a community, it creates biases and will have a hard time figuring out which sets of dictionaries are worth creating, for which intents or code bases, and adapting (or not) as it discovers underserved code bases.
Makes me wonder if rsync-style differential download would be more generally useful than "shared dictionary for shared Brotli". The implementation lets you reference a previously downloaded artifact using a hash value, which is the same starting point.
I think it solves a different problem. One of Brotli's strengths was a built-in dictionary tuned for web content, which makes it especially suitable for short data with no known priors. Compression dictionary transport is the same idea, but tuned per website. By comparison, an rsync-like algorithm would only make sense when the data is long enough (say, more than 1 MB), in which case a custom dictionary wouldn't help much because the compressor already has plenty of patterns to work with. Ideally both could be implemented, with HTTP Range requests updated to support rsync-like algorithms.
Yes, but exposed to everyone. This proposal settles on headers that get passed back and forth, which surely some servers would adopt: headers that establish a "previously downloaded thing" with a hash. It feels like the process could be exactly the same at the start, but end instead with some data sent to the client that it could use to make HTTP Range requests. This proposal feels like it could be extended in a fairly easy way to support differential download.
Interesting! Privacy concerns aside for the moment, I would guess that there is some "optimal" compression dictionary over some transferred data set, and that Google/Chrome of all people would be in a position to calculate it and distribute it along with Chrome.
Then Google could ship that dictionary to all (Chrome) browsers and update it once a month, and ship the same dictionary to anybody who wants to use it on the server side. Browsers can indicate which shared dictionaries they have preloaded, and if there's overlap with the dictionaries the server has, the server can just choose one and use it to compress the stream (which seems kind of like ciphersuite negotiation). Compression should be faster and use way less memory if the sender can assume that the receiver already has the same dictionary, and if there's any problem, both sides can just fall back to on-the-fly compression with an inlined dictionary like we've always done.
There are almost certainly different optimal dictionaries for different locales / countries, each of which could be calculated and distributed in the same way. Even if the 'wrong' dictionary gets used for a given request, it's (probably) not catastrophic.
I guess it might be possible for an attacker to indicate their client only has one dictionary and request pages that they know are disadvantageous for the server to have to compress with that dictionary. Even then, server-side heuristics can account for a lower-than-expected compression ratio and, again, fall back to on-the-fly dictionary calculation.
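Roughly, the negotiation sketched above could look like this (dictionary IDs, contents and the fallback policy are all made up, and zlib's preset dictionary stands in for whatever codec would actually be used):

    import zlib

    # Dictionaries the server ships, newest first.
    SERVER_DICTS = {
        "web-2024-05": b'<!doctype html><div class="container">',
        "web-2024-04": b'<!doctype html><div class="row">',
    }

    def respond(body: bytes, client_dicts: list[str]):
        """Pick a dictionary both sides have; otherwise fall back to plain compression."""
        for dict_id in client_dicts:                # client advertises preloaded dictionaries
            if dict_id in SERVER_DICTS:
                co = zlib.compressobj(level=9, zdict=SERVER_DICTS[dict_id])
                return dict_id, co.compress(body) + co.flush()
        return None, zlib.compress(body, 9)         # no overlap: compress without a dictionary

    body = b'<!doctype html><div class="container">hello</div>'
    print(respond(body, ["web-2024-05", "web-2023-12"])[0])  # -> web-2024-05
    print(respond(body, ["web-2021-01"])[0])                 # -> None (fallback)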
It would be a dud. Rsync-style download needs to prepare and upload a significant amount of data describing what the client already has. Hashes don't compress, and you want granular ones to reuse as much as possible.
The web is more sensitive to latency than bandwidth, and upload is more restricted than download. So rsync hits all the worst spots for the least gain. The data that diffs nicely, like HTML, is already tiny and compresses very well. Large files, like videos, don't diff well at all.
There is a way to make rsync-style differential downloads really quick to request: the client and server need to agree on diffing only against a specific shared dataset… like a dictionary! And you can replace the hashes with smaller variable-length integers.
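To put rough numbers on that upload cost, assuming an rsync-like signature of a 4-byte rolling checksum plus a 16-byte strong hash per 4 kB block (not the exact rsync defaults):

    # Back-of-the-envelope size of the block-hash signature the client must send.
    def signature_bytes(file_size: int, block_size: int = 4096) -> int:
        blocks = -(-file_size // block_size)      # ceiling division
        return blocks * (4 + 16)                  # rolling checksum + strong hash per block

    for size in (50_000, 1_000_000, 100_000_000):
        print(f"{size:>11,}-byte file -> ~{signature_bytes(size):,} bytes of hashes uploaded")
    # ~260 bytes for a 50 kB page, ~5 kB for 1 MB, ~490 kB for a 100 MB file,
    # all of which goes out on the slow upload path before the download can start.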
Not rsync-style in the sense of whole directory trees; differential range-request download of the subsequent version of some specific file, because with this scheme the client and server already know a prior version via the hash.
Maybe no more possibilities than regular cookies, but more stealthy for sure. Cookies are (somewhat) easily reviewable by the user through the web developer UI of most browsers; I doubt there will be any browser UI for managing compression dictionaries.
It should be W3C dictionaries only, and the number of dictionaries and versions should be heavily limited. Something like js, css, html/xml, json and common-web, with a new version every 5 years, would I think be good enough for everyone.