> To make things harder, zfs send is an all or nothing operation: if interrupted for any reason, e.g. network errors, one would have to start over from scratch.
ZFS absolutely handles resuming transfers [0].
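For reference, a resumable transfer looks roughly like this (pool and dataset names are made up; this is a sketch, not the article's setup):

```shell
# Receiver uses -s so an interrupted receive saves its state
zfs send tank/db@snap1 | ssh backup zfs receive -s backup/db

# After an interruption, fetch the resume token from the receiver...
token=$(ssh backup zfs get -H -o value receive_resume_token backup/db)

# ...and restart the stream from where it left off
zfs send -t "$token" | ssh backup zfs receive -s backup/db
```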
Honestly, articles like this make me doubt companies’ ability to handle what they’re doing. If you’re going to run a DB on ZFS, you’d damn well better know both inside and out. mbuffer is well-known to anyone who has used ZFS for a simple NAS. Also, you can’t use df to accurately measure a ZFS filesystem. df has no idea about child file systems, quotas, compression, file metadata…
It’s also unclear to me why they didn’t just ship the filesystems through nc. Assuming they’re encrypted (which, I mean, I would hope so…) it wouldn’t be any more risky than unencrypted via SSH.
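Something like this, with mbuffer on both ends to smooth out bursts (hostname and port are hypothetical):

```shell
# Receiver: listen, buffer, and receive into a backup dataset
nc -l 9000 | mbuffer -s 128k -m 1G | zfs receive -s backup/db

# Sender: stream the snapshot over plain TCP
zfs send tank/db@snap1 | mbuffer -s 128k -m 1G | nc backup.example 9000
```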
I wonder what would have happened if they created a ZFS snapshot, transferred as "tar over ssh" to the remote host, then created hourly snapshots thereafter and synced those across? It seems they were not aware of this method.
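The usual way to sync those hourly snapshots is an incremental `zfs send -i`, which only ships the delta since the previous snapshot. A minimal sketch (names made up):

```shell
# One-time full copy of a base snapshot
zfs snapshot tank/db@base
zfs send tank/db@base | ssh backup zfs receive backup/db

# Hourly: snapshot, then send only what changed since the last snapshot
zfs snapshot tank/db@hourly1
zfs send -i tank/db@base tank/db@hourly1 | ssh backup zfs receive backup/db
```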
A really cool story. But I have to say, everyone will notice that rewriting a file transfer tool in Rust without first understanding the cause of the slowness is a poor use of engineering time. It's almost like a blind, cult-like trust in Rust.
> So just like with everything else, we decided to write our own, in Rust. After days of digging through Tokio documentation and networking theory blog posts to understand how to move bytes as fast as possible between the filesystem and an HTTP endpoint, we had a pretty basic application that could chunk a byte stream, send it to an object storage service as separate files, download those files as they are being created in real time, re-assemble and pipe them into a ZFS snapshot.
I mean this sounds like a fun engineering project and I suspect I would enjoy writing it very much. While this might bring me joy personally, as an organization this is still a failure.
Most managers and organizations miss the fact that engineers are often motivated by solving puzzles in ways like this, because it's fun. If you want to accomplish big challenges, quickly, make it fun for the engineer. Working long hours on projects like this doesn't burn engineers out, it sharpens their skills, broadens their knowledge and grows the organization's capabilities long term.
This sort of cultural difference of exploration and letting work be fun is one of the big things that accounts for the differences in velocity between big co's and little co's. Does your work give you energy to the point where not only do you love doing it, you want to tell everyone else about it?
This. And also "shiny new thing" vs. "don't touch it if it works" -- oftentimes it's fun to rewrite some part of a project even if it works ok. As a manager, I just need to limit the blast radius and make sure the refactoring reaches its end rather than being left half-baked.
If s3 is the goal, there is gof3r, which can be piped into, i.e. you can skip storing mybackup.zfs.bz2 locally.
Overall, I didn't see if they've identified the bottleneck.
My guess is pbzip2 is the slowest, ssh second. For compression bandwidth I'd check zstd. For ssh there are various cipher/compression options. Or perhaps skip it altogether and use WireGuard.
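A pipeline along these lines, swapping pbzip2 for multithreaded zstd and picking a fast AES-GCM cipher for ssh (hostname and dataset names are made up):

```shell
# -T0: use all cores; -3: light compression favoring throughput
# aes128-gcm is usually among the fastest ciphers on AES-NI hardware
zfs send tank/db@snap1 \
  | zstd -T0 -3 \
  | ssh -c aes128-gcm@openssh.com backup "zstd -d | zfs receive backup/db"
```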
Classic fallacy seen on engineering blogs all the time: "we cut down {speed, cost} by writing something in {go, rust}" without realizing that a rewrite is also a refactoring from the most explicit set of requirements. None of the tech debt or earlier technical decisions are factored in.
> As of this writing, we could not find any existing tools to send a ZFS file system to S3 and download it from Cloud Storage, in real time. Most tools like z3 are used for backup purposes, but we needed to transfer filesystem chunks as quickly as possible.
Blog says 100Gbit NICs and 200MB/s achieved. Big gap!
Since both endpoints are controlled by you, you should be able to tune the TCP buffers. Either way, measuring RTT and running iperf3 with dozens of concurrent TCP connections would be the first step to establish a baseline for what could be expected.
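Roughly this (hostname hypothetical; buffer ceilings are illustrative values, not a recommendation):

```shell
# Baseline: measure RTT, then aggregate throughput with parallel streams
ping -c 5 backup.example
iperf3 -c backup.example -P 32 -t 30

# Raise the per-socket TCP buffer ceilings (values are "min default max", bytes)
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 131072 67108864"
```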
Author if you are here that data would be very interesting to know.
That remains a mystery for now. We measured upload speeds to S3 to be upwards of 700MB/second (still not 12,800MB/second you'd hope for, but S3 has to write to disk(s)). Download speeds from S3 were low - very low in fact, somewhere around 3-4MB/second when multiple files are downloaded concurrently, and a maximum of again ~30MB/second.
I think you're right and high latency (~40ms) between endpoints triggered some bad TCP behavior. Next time we do something like this, I'll definitely look up how to tune those settings.
> I think you're right and high latency (~40ms) between endpoints triggered some bad TCP behavior.
40 ms, 200MB/s. The BDP limit for those numbers is 7.6 MB (if I'm holding it right), which is very close to the tcp_wmem/tcp_rmem max on Debian-like distros, so that sounds about right. Linux can be quite stingy with buffers by default. Easy to increase, though!
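The arithmetic checks out: the bandwidth-delay product is throughput times round-trip time, i.e. the number of bytes that must be in flight to keep the pipe full.

```python
# Bandwidth-delay product for the numbers in the comment above
rtt_s = 0.040      # 40 ms round trip
rate_bps = 200e6   # 200 MB/s observed throughput

bdp_bytes = rate_bps * rtt_s         # 8,000,000 bytes
print(f"BDP: {bdp_bytes / 2**20:.1f} MiB")  # ~7.6 MiB
```

If the kernel caps the socket buffer below that, throughput is limited to buffer / RTT regardless of link speed.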
It’s mostly just seething at devs nonstop for creating garbage schema, and then blaming the DB for their shitty query latency. Also, endlessly getting dragged into incidents because “the database is at fault.”
I love watching a systems problem being solved, we’re forced to learn so much during outages and problems like this.
When I read that you were starting to write a rust tool to integrate zfs with s3 because you thought aws was limiting throughput, I nearly yelled out loud! Guess you learned the same lesson I did, once :)
[0]: https://openzfs.github.io/openzfs-docs/man/master/8/zfs-send...