
Amazing writeup - I needed this a few months ago :)

My impression after my own, shallower dive is that trainable dictionaries are an underappreciated part of the (or at least my) system design toolkit.

For example, say you're serving Wikipedia - a bunch of pages that are kind of static. In order to minimize disk space, you'll be tempted to compress the content. Compressing the whole corpus gets a good compression ratio, but it means that, to read an arbitrary item, you need to decompress everything (or 50% of everything, on average, I guess).

So to get random access, you compress each page individually. That's OK, but you get a worse compression ratio because the compressor starts from scratch on every page.

But with Zstandard and a trainable dictionary, you could train a dictionary on a sample of pages, then use that dictionary to compress and decompress arbitrary items independently.

As far as I can tell, that's probably the best of both worlds - close to the compression ratio of compressing the whole corpus with gzip, but the random access of compressing each item individually.
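
A rough sketch of what I mean, using the python-zstandard bindings (I'm assuming its train_dictionary / ZstdCompressor API, and the toy pages here are made up):

    import random
    import zstandard as zstd

    random.seed(0)
    WORDS = ["emperor", "treaty", "province", "dynasty", "battle", "archive"]

    def fake_page(i):
        body = " ".join(random.choice(WORDS) for _ in range(300))
        return (f"<html><head><title>Article {i}</title></head>"
                f"<body>{body}</body></html>").encode()

    pages = {f"page-{i}": fake_page(i) for i in range(2000)}

    # Train a shared dictionary on the corpus (real training wants many,
    # varied samples; these toy pages just show the API).
    dictionary = zstd.train_dictionary(16 * 1024, list(pages.values()))

    # Compress every page independently, but against the shared dictionary.
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    stored = {name: cctx.compress(body) for name, body in pages.items()}

    # Random access: decompress only the page you want.
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)
    assert dctx.decompress(stored["page-42"]) == pages["page-42"]

In practice you'd train on a representative sample of pages and store the dictionary alongside the compressed items, since you need that exact dictionary again to decompress.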

This seems really generalizable - e.g. maybe Facebook has to store a zillion photos that are very rarely accessed, but 10% of them are selfies. If we use a vector search to find clusters of similar items, we can compress those items with a single dictionary.

In fact, taking another step back, it seems like databases ought to offer this out of the box. Just like the concept of an index, it's not always a win and there are a lot of knobs that you might want to tune, but the benefits seem clear.

Maybe all of this already exists, or there's something I'm missing, but I really appreciate articles like OP's that break things down so clearly.



What you are describing is sometimes called a shared dictionary and it's a great trick for task-specific compression, where you know what data you're going to be compressing ahead of time.

The Brotli algorithm is typical LZ plus a shared dictionary aimed at common web documents and markup, and it does work well and fast for HTML. A common criticism is that it's basically targeted at compressing Wikipedia: the dictionary is loaded with a bunch of junk, and now every browser needs a copy of that 120 kB of junk, some of which will very rarely be used unless you're compressing Wikipedia. (Both "II, Holy Roman" and "Holy Roman Emperor" are tokens in the Brotli dictionary, for example. Whole dictionary here for the curious: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37... )


In fact, there is a new feature Chrome is championing (and just shipped) called "Compression Dictionary Transport" - https://datatracker.ietf.org/doc/draft-ietf-httpbis-compress... / https://chromestatus.com/feature/5124977788977152 - that allows any HTTP resource to specify the dictionary it wants to use (including "use me as the dictionary for future requests"), which allows a website to use a dictionary specialized to _its_ contents instead of the contents of something completely different.


Another state machine in the browser, what could go wrong?


Heh, and if dictionaries can be shared between sites, that's another potential security leak.


If anyone is interested in an example of how ZSTD's dictionary compression performs against standard gzip, a number of years ago I put together an example using some Common Crawl data.

"I was able to achive a random WARC file compression size of 793,764,785 bytes vs Gzip's compressed size of 959,016,011" [0]

In hindsight, I could have written that up and tested it better, but it's at least something.

[0] https://github.com/benwills/proposal-warc-to-zstandard?tab=r...
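
If anyone wants to rerun that sort of comparison on their own records, here's a rough sketch with python-zstandard and the stdlib gzip module (the records/ directory and the dictionary size are placeholders, not what I actually used):

    import glob
    import gzip
    import zstandard as zstd

    # One record per file, compressed independently (as you'd want for
    # random access); records/ is a placeholder for your extracted data.
    records = [open(p, "rb").read() for p in sorted(glob.glob("records/*"))]

    gzip_total = sum(len(gzip.compress(r)) for r in records)

    dictionary = zstd.train_dictionary(110 * 1024, records)
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    zstd_total = sum(len(cctx.compress(r)) for r in records)

    print(f"gzip: {gzip_total:,} bytes, zstd+dict: {zstd_total:,} bytes")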


RocksDB has support for Zstd and preset dictionaries, which makes sense since, being a fork of LevelDB, it has the same kind of leveled span-file chunking.

Entries are kept in memory and logged (instead of being put into a B-tree like classic DBs), then periodically written out to span files that are "linear" for faster search. Since these span files are built in bulk, it makes more sense to compress blocks of them, because a lot of data is handled at once: even though the file is linear, it's still stored as blocks, and a read just decompresses the variable-size block containing the entry.


I remember tools that worked with the Wikipedia dumps, in bzip2, and built indexes to allow decent random access. Once you know where the compressed blocks are, and which Wikipedia entries they contain, you can start from a given block (something like 900k of input each) rather than from the beginning of the file. Compressing roughly a megabyte at a time, rather than a page at a time, is a pretty solid win for compressibility.
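
That trick is easy to reproduce: compress the dump in fixed-size chunks, remember where each compressed chunk starts (and which entries it holds), and random access becomes seek-plus-decompress-one-chunk. A toy sketch with the stdlib bz2 module (the chunk size and index format here are arbitrary):

    import bz2

    CHUNK = 900 * 1024  # roughly one bzip2 block's worth of input

    def write_chunked(src_path, dst_path):
        """Compress src in ~900k chunks; return [(offset, length), ...]."""
        index = []
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(CHUNK):
                blob = bz2.compress(chunk)
                index.append((dst.tell(), len(blob)))
                dst.write(blob)
        return index

    def read_chunk(dst_path, index, n):
        """Decompress only chunk n, without touching the rest of the file."""
        offset, length = index[n]
        with open(dst_path, "rb") as f:
            f.seek(offset)
            return bz2.decompress(f.read(length))

A real index would also map entry names to chunk numbers, which is just another small lookup table.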


Good thoughts, I'm going to keep this in mind. I've been working on custom UDP netcode for a while. I experimented with LZMAing / RLEing the binary snapshot diffs I send down, and neither felt great, but RLE has beaten LZMA 100% of the time for what I'm doing so far. Some kind of trained dictionary does sound better.


In general, it's often worth combining transforms like RLE with general-purpose compression. General compressors don't know the details of your data and typically have a maximum window size, so if your data is dominated by long runs (say, of zeros), LZMA (or most other compressors) will spend most of its window looking at those runs and won't be able to see nearly as much of the actual data. Running RLE first squashes those giant runs down, so the general compressor can fit more non-trivially compressible data within its window and look for improvements more usefully.
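
A minimal illustration of that ordering, with a toy byte-level RLE followed by the stdlib lzma module (the "snapshot diff" is made up, and at this tiny size LZMA alone may already do fine; the window-size argument only really bites on much larger data):

    import lzma

    def rle_encode(data: bytes) -> bytes:
        """Toy RLE: (run length, byte value) pairs, runs capped at 255."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while (i + run < len(data) and data[i + run] == data[i]
                   and run < 255):
                run += 1
            out += bytes((run, data[i]))
            i += run
        return bytes(out)

    # A "snapshot diff" that is mostly unchanged (zero) bytes.
    diff = bytearray(64 * 1024)
    for pos in range(0, len(diff), 1000):
        diff[pos] = 7  # a few scattered changed bytes
    diff = bytes(diff)

    print(len(lzma.compress(diff)))              # LZMA alone
    print(len(lzma.compress(rle_encode(diff))))  # RLE first, then LZMA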



