I never really understood why Google keeps pushing Brotli. LZ4 and ZStd both predate it by a couple of years, and seem to offer superior performance overall. Is it a Google-NIH thing?
They were designed more or less independently of each other, and your claim that Zstandard predates Brotli is also wrong (both appeared around 2015). Both are designed to be much more efficient than DEFLATE on both the time and size axes, but otherwise they have little in common. Zstandard is notable for its use of ANS, while Brotli is probably the first mainstream format to make prominent use of a 2nd-order context model, something previously thought to be too slow. Brotli was also designed with web content in mind, which makes it excellent for small inputs with no prior. Zstandard has no such treatment but performs better when the I/O can keep up, which makes it excellent for server jobs.
I've been using LZ4 in commercial software since 2011, so unless I am some kind of luminary, Google's compression experts must have been aware of both LZ4 and ZHuff when they started working on Brotli.
By that reckoning I can claim that the origin of Brotli also dates back to 2012 or earlier. :-) Indeed, the Brotli authors were also responsible for the VP8 Lossless codec (2012), Pik (2017) and eventually JPEG XL (2019, in part). The entropy coding scheme is substantially similar among those codecs for this reason.
But really, the sheer number of differences between the two formats suggests that they were designed with different uses in mind, which I believe comes down to the maximum possible I/O. Formats designed for the web don't need an extremely fast decoder because I/O is the bottleneck: 100 MB/s should be enough, and the time saved can be spent on a higher compression ratio, even at the cost of decoder performance.
The performance depends on the data. I have found that for English text and source code Brotli outperforms Zstd. Not sure if this is because of the static dictionary or some other tweaks, though. However, none of this is fixed, as both technologies are actively developed.
Better decoding-time performance, I should have said. The use of a 2nd-order context model indeed makes Brotli more efficient in compression ratio for many kinds of data.
Not even close; zlib had one. (Search for "FDICT" in RFC 1950.) What Zstandard added was a CLI tool to automatically generate a good-enough dictionary from sample inputs.
FDICT (Preset dictionary)
If FDICT is set, a DICT dictionary identifier is present
immediately after the FLG byte. The dictionary is a sequence of
bytes which are initially fed to the compressor without
producing any compressed output. DICT is the Adler-32 checksum
of this sequence of bytes (see the definition of ADLER32
below). The decompressor can use this identifier to determine
which dictionary has been used by the compressor.
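To make FDICT concrete, here's a rough sketch using Python's standard zlib module, whose zdict parameter exposes exactly this preset-dictionary mechanism (the payload and dictionary below are made up):

    import zlib

    # A preset dictionary: byte strings we expect to recur in the payload.
    preset = b'{"name": "", "email": "", "address": ""}'
    payload = b'{"name": "Alice", "email": "alice@example.com", "address": "Wonderland"}'

    # Compressing with zdict sets FDICT and writes the dictionary's Adler-32 ID.
    co = zlib.compressobj(level=9, zdict=preset)
    with_dict = co.compress(payload) + co.flush()

    # The decompressor must be primed with the same dictionary.
    do = zlib.decompressobj(zdict=preset)
    assert do.decompress(with_dict) == payload

    print(len(with_dict), "bytes with dictionary,", len(zlib.compress(payload, 9)), "without")

The catch is visible in the RFC text: only the Adler-32 of the dictionary is transmitted, so both sides have to agree on the dictionary itself out of band.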
Well, wow. I have to wonder why this wasn't more utilized, then. There are a ton of contexts (columnar data in databases, for example) where shared-dictionary-based compression might have helped a ton before now.
Zstd is already used in the Linux kernel, and I would assume that any patents covering Lempel-Ziv and Huffman coding would have expired long ago. The situation around ANS also seems to have been resolved, see https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Tab... .
Brotli comes with a dictionary by default, which is optimized for plaintext use cases. Once you give ZSTD a similar dictionary, its ratios are going to be similar/better.
I won't provide random links, because those can't replace testing on your own dataset.
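For what it's worth, here's a rough sketch of that train-a-dictionary workflow with the python-zstandard bindings (the training samples are synthetic, and the sizes may need tuning on real data):

    import zstandard as zstd

    # Synthetic stand-ins for many small, similar payloads (e.g. API responses).
    samples = [
        b'{"id": %d, "name": "user%d", "tags": ["alpha", "beta"], "score": %d}'
        % (i, i * 7 % 997, i * 13 % 101)
        for i in range(2000)
    ]

    # Train a small dictionary from the samples (CLI equivalent: `zstd --train`).
    dictionary = zstd.train_dictionary(8192, samples)

    msg = b'{"id": 42, "name": "user294", "tags": ["alpha", "beta"], "score": 41}'
    with_dict = zstd.ZstdCompressor(level=19, dict_data=dictionary).compress(msg)
    without = zstd.ZstdCompressor(level=19).compress(msg)
    assert zstd.ZstdDecompressor(dict_data=dictionary).decompress(with_dict) == msg

    print(len(with_dict), "bytes with dictionary,", len(without), "without")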
Brotli has two major advantages. First, it encodes to the same well-known gzip format, so rolling it out didn't require new clients. Second, it didn't require dictionary support.
Whereas if you look at Mozilla's standards-positions for zstd, they deferred because it requires a compression dictionary to encode and Moz didn't want to have to "standardize" one specific dictionary. https://github.com/mozilla/standards-positions/issues/105
> Moz didn't want to have to "standardize" one specific dictionary
You could have multiple sources of dictionary identified by something like a hash, and cached in the browser. For example, you could have a general-purpose HTML dictionary, a general-purpose CSS dictionary and a general-purpose JS dictionary, have canonical/official versions of these hosted somewhere (say, on GitHub) and distributed wherever, and update them on a semi-regular basis (at the cost of re-compressing the associated data). In turn, these could be overridden on a site-by-site basis with site-specific versions. You could basically associate an overridable but defaulted dictionary with each MIME type (or none at all in the case of already-compressed data).
Every file would have to be sent with the hash of the dictionary that was used to start compressing it, which would then be retrieved and cached somehow.
I could see this resulting in some fairly significant byte savings, especially for retrieval of many smaller files.
I bet this would also be potentially useful for email clients/servers.
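A toy sketch of what that could look like, using zlib's preset dictionary as a stand-in and a made-up X-Dictionary-Hash header (the dictionary contents here are placeholders, not real tuned dictionaries):

    import hashlib
    import zlib

    # Hypothetical registry of general-purpose dictionaries keyed by MIME type.
    DICTS = {
        "text/html": b'<!doctype html><html><head><meta charset="utf-8">',
        "text/css": b"margin:0;padding:0;display:flex;position:absolute;",
        "application/javascript": b"function(){return};const document.querySelector(",
    }

    def compress_with_shared_dict(body: bytes, mime: str):
        """Compress with the MIME type's shared dictionary and report which one was used."""
        d = DICTS[mime]
        co = zlib.compressobj(level=9, zdict=d)
        data = co.compress(body) + co.flush()
        # Hypothetical header telling the client which cached dictionary to fetch/use.
        return data, {"X-Dictionary-Hash": hashlib.sha256(d).hexdigest()}

    body = b'<!doctype html><html><head><meta charset="utf-8"></head><body>hi</body></html>'
    data, headers = compress_with_shared_dict(body, "text/html")
    print(len(body), "->", len(data), headers["X-Dictionary-Hash"][:16])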
> You could have multiple sources of dictionary identified by something like a hash, and cached in the browser.
The browser actually has a very advanced implementation of something like this! They're called URLs. The work here is to make it possible to identify dictionaries, and to retrieve and cache them in the browser.
Mozilla, I think, smartly avoided creating a spec that requires some variety of centralized governance over which dictionaries to use, like you describe. Whatever the intent of such a community, it creates biases and will have a hard time figuring out which sets of dictionaries are worth creating, for which intents or code bases, and adapting (or not) as it discovers underserved code bases.
Makes me wonder if rsync-style differential download would be more generally useful than "shared dictionary for shared Brotli". The implementation lets you reference a previously downloaded artifact using a hash value, which is the same starting point.
I think it solves a different problem. One of Brotli's strengths was a built-in dictionary tuned for web content, which makes it especially suitable for short data with no known priors. Compression dictionary transport is the same idea, but tuned per website. By comparison, an rsync-like algorithm would only make sense when the data is long enough (say, more than 1 MB), in which case a custom dictionary wouldn't help much because the compressor already has plenty of patterns to work with. Ideally both could be implemented, with HTTP Range requests updated to support rsync-like algorithms.
Yes, but exposed to everyone. This proposal settles on headers that get passed back and forth, which surely some servers would adopt: headers that establish a "previously downloaded thing" with a hash. It feels like the process could be exactly the same at the start, but end instead with some data sent to the client that it could use to make HTTP Range requests. This proposal feels like it could be extended in a fairly easy way to support differential download.
Interesting! Privacy concerns aside for the moment, I would guess that there is some "optimal" compression dictionary over some transferred data set, and that Google/Chrome of all people would be in a position to calculate it and distribute it along with Chrome.
Then Google could ship that dictionary to all (Chrome) browsers and update it once a month, and ship the same dictionary to anybody who wants to use it on the server side. Browsers can indicate which shared dictionaries they have preloaded, and if there's overlap with the dictionaries the server has, the server can just choose one and use it to compress the stream (which seems kind of like ciphersuite negotiation). Compression should be faster and use way less memory if the sender can assume that the receiver already has the same dictionary, and if there's any problem, both sides can just fall back to on-the-fly compression with an inlined dictionary like we've always done.
There are almost certainly different optimal dictionaries for different locales / countries, each of which could be calculated and distributed in the same way. Even if the 'wrong' dictionary gets used for a given request, it's (probably) not catastrophic.
I guess it might be possible for an attacker to indicate their client only has one dictionary and request pages that they know are disadvantageous for the server to have to compress with that dictionary. Even then, server-side heuristics can account for a lower-than-expected compression ratio and, again, fall back to on-the-fly dictionary calculation.
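Roughly, the negotiation sketched above could look like this (dictionary IDs, contents and the fallback policy are all made up, and zlib's preset dictionary stands in for whatever codec would actually be used):

    import zlib

    # Dictionaries the server ships, newest first.
    SERVER_DICTS = {
        "web-2024-05": b'<!doctype html><div class="container">',
        "web-2024-04": b'<!doctype html><div class="row">',
    }

    def respond(body: bytes, client_dicts: list[str]):
        """Pick a dictionary both sides have; otherwise fall back to plain compression."""
        for dict_id in client_dicts:                # client advertises preloaded dictionaries
            if dict_id in SERVER_DICTS:
                co = zlib.compressobj(level=9, zdict=SERVER_DICTS[dict_id])
                return dict_id, co.compress(body) + co.flush()
        return None, zlib.compress(body, 9)         # no overlap: compress without a dictionary

    body = b'<!doctype html><div class="container">hello</div>'
    print(respond(body, ["web-2024-05", "web-2023-12"])[0])  # -> web-2024-05
    print(respond(body, ["web-2021-01"])[0])                 # -> None (fallback)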
It would be a dud. Rsync-style download needs to prepare and upload a significant amount of data describing what the client already has. Hashes don't compress, and you want granular ones to reuse as much as possible.
The web is more sensitive to latency than bandwidth, and upload is more restricted than download. So rsync hits all the worst spots for the least gain. The data that diffs nicely, like HTML, is already tiny and compresses very well. Large files, like videos, don't diff well at all.
There is a way to make rsync-style differential downloads really quick to request: the client and server need to agree on diffing only against a specific shared dataset… like a dictionary! And you can replace the hashes with smaller variable-length integers.
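To put rough numbers on that upload cost, assuming an rsync-like signature of a 4-byte rolling checksum plus a 16-byte strong hash per 4 kB block (not the exact rsync defaults):

    # Back-of-the-envelope size of the block-hash signature the client must send.
    def signature_bytes(file_size: int, block_size: int = 4096) -> int:
        blocks = -(-file_size // block_size)      # ceiling division
        return blocks * (4 + 16)                  # rolling checksum + strong hash per block

    for size in (50_000, 1_000_000, 100_000_000):
        print(f"{size:>11,}-byte file -> ~{signature_bytes(size):,} bytes of hashes uploaded")
    # ~260 bytes for a 50 kB page, ~5 kB for 1 MB, ~490 kB for a 100 MB file,
    # all of which goes out on the slow upload path before the download can start.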
Not rsync-style in the sense of whole directory trees; differential range-request download of the subsequent version of some specific file, because with this scheme the client and server already know a prior version via the hash.
Maybe no more possibilities than regular cookies, but more stealthy for sure. Cookies are (somewhat) easily reviewable by the user through the web developer UI of most browsers; I doubt there will be any browser UI for managing compression dictionaries.
It should be W3C dictionaries only, and the number of dictionaries and versions should be heavily limited. Something like js, css, html/xml, json and common-web, with a new version every 5 years, would I think be good enough for everyone.