
Amazing writeup - I needed this a few months ago :)

My impression after my own, shallower dive is that trainable dictionaries are an underappreciated part of the (or at least my) system design toolkit.

For example, say you're serving Wikipedia - a bunch of pages that are kind of static. In order to minimize disk space, you'll be tempted to compress the content. Compressing the whole corpus gets a good compression ratio, but it means that, to read an arbitrary item, you need to decompress everything (or 50% of everything, on average, I guess).

So to get random access, you compress each page individually. That's OK, but you get a worse compression ratio because the compressor starts from scratch on every page.

But with Zstandard and a trainable dictionary, you could train a dictionary on a sample of pages, then use that dictionary to compress and decompress arbitrary items independently.

As far as I can tell, that's probably the best of both worlds - close to the compression ratio of compressing the whole corpus with gzip, but the random access of compressing each item individually.
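
A rough sketch of what I mean, using the python-zstandard bindings (I'm assuming its train_dictionary / ZstdCompressor API, and the toy pages here are made up):

    import random
    import zstandard as zstd

    random.seed(0)
    WORDS = ["emperor", "treaty", "province", "dynasty", "battle", "archive"]

    def fake_page(i):
        body = " ".join(random.choice(WORDS) for _ in range(300))
        return (f"<html><head><title>Article {i}</title></head>"
                f"<body>{body}</body></html>").encode()

    pages = {f"page-{i}": fake_page(i) for i in range(2000)}

    # Train a shared dictionary on the corpus (real training wants many,
    # varied samples; these toy pages just show the API).
    dictionary = zstd.train_dictionary(16 * 1024, list(pages.values()))

    # Compress every page independently, but against the shared dictionary.
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    stored = {name: cctx.compress(body) for name, body in pages.items()}

    # Random access: decompress only the page you want.
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)
    assert dctx.decompress(stored["page-42"]) == pages["page-42"]

In practice you'd train on a representative sample of pages and store the dictionary alongside the compressed items, since you need that exact dictionary again to decompress.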

This seems really generalizable - e.g. maybe Facebook has to store a zillion photos that are very rarely accessed, but 10% of them are selfies. If we use a vector search to find clusters of similar items, we can compress those items with a single dictionary.

In fact, taking another step back, it seems like databases ought to offer this out of the box. Just like the concept of an index, it's not always a win and there are a lot of knobs that you might want to tune, but the benefits seem clear.

Maybe all of this already exists, or there's something I'm missing, but I really appreciate articles like OP's that break things down so clearly.



What you are describing is sometimes called a shared dictionary and it's a great trick for task-specific compression, where you know what data you're going to be compressing ahead of time.

The Brotli algorithm is typical LZ plus a shared dictionary aimed at common web documents and markup, and it does work well and fast for HTML. A common criticism is that it's basically targeted at compressing Wikipedia: the dictionary is loaded with a bunch of junk, and now every browser needs a copy of that 120 kB of junk, some of which will very rarely be used unless you're compressing Wikipedia. (Both "II, Holy Roman" and "Holy Roman Emperor" are tokens in the Brotli dictionary, for example. Whole dictionary here for the curious: https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37... )


In fact, there is a new feature Chrome is championing (and just shipped) called "Compression Dictionary Transport" - https://datatracker.ietf.org/doc/draft-ietf-httpbis-compress... / https://chromestatus.com/feature/5124977788977152 - that allows any HTTP resource to specify the dictionary it wants to use (including "use me as the dictionary for future requests"), which allows a website to use a dictionary specialized to _its_ contents instead of the contents of something completely different.


Another state machine in the browser, what could go wrong?


Heh, and if dictionaries can be shared between sites, that's another potential security leak.


If anyone is interested in an example of how ZSTD's dictionary compression performs against standard gzip, a number of years ago I put together an example using some Common Crawl data.

"I was able to achive a random WARC file compression size of 793,764,785 bytes vs Gzip's compressed size of 959,016,011" [0]

In hindsight, I could have written that up and tested it better, but it's at least something.

[0] https://github.com/benwills/proposal-warc-to-zstandard?tab=r...
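
If anyone wants to rerun that sort of comparison on their own records, here's a rough sketch with python-zstandard and the stdlib gzip module (the records/ directory and the dictionary size are placeholders, not what I actually used):

    import glob
    import gzip
    import zstandard as zstd

    # One record per file, compressed independently (as you'd want for
    # random access); records/ is a placeholder for your extracted data.
    records = [open(p, "rb").read() for p in sorted(glob.glob("records/*"))]

    gzip_total = sum(len(gzip.compress(r)) for r in records)

    dictionary = zstd.train_dictionary(110 * 1024, records)
    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    zstd_total = sum(len(cctx.compress(r)) for r in records)

    print(f"gzip: {gzip_total:,} bytes, zstd+dict: {zstd_total:,} bytes")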


RocksDB has support for Zstd and preset dictionaries, which makes sense since, being a fork of LevelDB, it has the same kind of leveled span-file chunking.

Entries are kept in memory and logged (instead of being put into a B-tree like classic DBs), then periodically written out to span files that are "linear" for faster search. Since these span files are built in bulk, it makes more sense to compress blocks of them, because a lot of data is handled at once: even though the file is linear, it's still stored as blocks, and a read just decompresses the variable-size block containing the entry.


I remember tools that worked with the Wikipedia dumps, in bzip2, and built indexes to allow decent random access. Once you know where the compressed blocks are, and which Wikipedia entries they contain, you can start from a given block (something like 900k of input each) rather than from the beginning of the file. Compressing roughly a megabyte at a time, rather than a page at a time, is a pretty solid win for compressibility.
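
That trick is easy to reproduce: compress the dump in fixed-size chunks, remember where each compressed chunk starts (and which entries it holds), and random access becomes seek-plus-decompress-one-chunk. A toy sketch with the stdlib bz2 module (the chunk size and index format here are arbitrary):

    import bz2

    CHUNK = 900 * 1024  # roughly one bzip2 block's worth of input

    def write_chunked(src_path, dst_path):
        """Compress src in ~900k chunks; return [(offset, length), ...]."""
        index = []
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(CHUNK):
                blob = bz2.compress(chunk)
                index.append((dst.tell(), len(blob)))
                dst.write(blob)
        return index

    def read_chunk(dst_path, index, n):
        """Decompress only chunk n, without touching the rest of the file."""
        offset, length = index[n]
        with open(dst_path, "rb") as f:
            f.seek(offset)
            return bz2.decompress(f.read(length))

A real index would also map entry names to chunk numbers, which is just another small lookup table.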


Good thoughts, I'm going to keep this in mind. I've been working on custom UDP netcode for a while. I experimented with LZMAing / RLEing the binary snapshot diffs I send down, and neither felt great, but RLE has beaten LZMA 100% of the time for what I'm doing so far. Some kind of trained dictionary does sound better.


In general, it's often worth combining transforms like RLE with general-purpose compression. General compressors don't know the details of your data and typically have a maximum window size, so if your data is dominated by long runs (say, of zeros), LZMA (or most other compressors) will spend most of its window looking at those runs and won't be able to see nearly as much of the actual data. Running RLE first squashes those giant runs down, so the general compressor can fit more non-trivially compressible data within its window and look for improvements more usefully.
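
A minimal illustration of that ordering, with a toy byte-level RLE followed by the stdlib lzma module (the "snapshot diff" is made up, and at this tiny size LZMA alone may already do fine; the window-size argument only really bites on much larger data):

    import lzma

    def rle_encode(data: bytes) -> bytes:
        """Toy RLE: (run length, byte value) pairs, runs capped at 255."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while (i + run < len(data) and data[i + run] == data[i]
                   and run < 255):
                run += 1
            out += bytes((run, data[i]))
            i += run
        return bytes(out)

    # A "snapshot diff" that is mostly unchanged (zero) bytes.
    diff = bytearray(64 * 1024)
    for pos in range(0, len(diff), 1000):
        diff[pos] = 7  # a few scattered changed bytes
    diff = bytes(diff)

    print(len(lzma.compress(diff)))              # LZMA alone
    print(len(lzma.compress(rle_encode(diff))))  # RLE first, then LZMA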



