> To make things harder, zfs send is an all or nothing operation: if interrupted for any reason, e.g. network errors, one would have to start over from scratch.
ZFS absolutely handles resuming transfers [0].
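For reference, a resumable transfer looks roughly like this (pool and dataset names are made up; this is a sketch, not the article's setup):

```shell
# Receiver uses -s so an interrupted receive saves its state
zfs send tank/db@snap1 | ssh backup zfs receive -s backup/db

# After an interruption, fetch the resume token from the receiver...
token=$(ssh backup zfs get -H -o value receive_resume_token backup/db)

# ...and restart the stream from where it left off
zfs send -t "$token" | ssh backup zfs receive -s backup/db
```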
Honestly, articles like this make me doubt companies’ ability to handle what they’re doing. If you’re going to run a DB on ZFS, you’d damn well better know both inside and out. mbuffer is well-known to anyone who has used ZFS for a simple NAS. Also, you can’t use df to accurately measure a ZFS filesystem. df has no idea about child file systems, quotas, compression, file metadata…
It’s also unclear to me why they didn’t just ship the filesystems through nc. Assuming they’re encrypted (which, I mean, I would hope so…) it wouldn’t be any more risky than unencrypted via SSH.
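Something like this, with mbuffer on both ends to smooth out bursts (hostname and port are hypothetical):

```shell
# Receiver: listen, buffer, and receive into a backup dataset
nc -l 9000 | mbuffer -s 128k -m 1G | zfs receive -s backup/db

# Sender: stream the snapshot over plain TCP
zfs send tank/db@snap1 | mbuffer -s 128k -m 1G | nc backup.example 9000
```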
I wonder what would have happened if they created a ZFS snapshot, transferred as "tar over ssh" to the remote host, then created hourly snapshots thereafter and synced those across? It seems they were not aware of this method.
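The usual way to sync those hourly snapshots is an incremental `zfs send -i`, which only ships the delta since the previous snapshot. A minimal sketch (names made up):

```shell
# One-time full copy of a base snapshot
zfs snapshot tank/db@base
zfs send tank/db@base | ssh backup zfs receive backup/db

# Hourly: snapshot, then send only what changed since the last snapshot
zfs snapshot tank/db@hourly1
zfs send -i tank/db@base tank/db@hourly1 | ssh backup zfs receive backup/db
```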
A really cool story. But I have to say, everyone will notice that rewriting a file transfer tool in Rust without first understanding the cause of the slowness is a poor use of engineering time. It's almost like a blind, cult-like trust in Rust.
> So just like with everything else, we decided to write our own, in Rust. After days of digging through Tokio documentation and networking theory blog posts to understand how to move bytes as fast as possible between the filesystem and an HTTP endpoint, we had a pretty basic application that could chunk a byte stream, send it to an object storage service as separate files, download those files as they are being created in real time, re-assemble and pipe them into a ZFS snapshot.
I mean this sounds like a fun engineering project and I suspect I would enjoy writing it very much. While this might bring me joy personally, as an organization this is still a failure.
Most managers and organizations miss the fact that engineers are often motivated by solving puzzles in ways like this, because it's fun. If you want to accomplish big challenges, quickly, make it fun for the engineer. Working long hours on projects like this doesn't burn engineers out, it sharpens their skills, broadens their knowledge and grows the organization's capabilities long term.
This sort of cultural difference of exploration and letting work be fun is one of the big things that accounts for the differences in velocity between big co's and little co's. Does your work give you energy to the point where not only do you love doing it, you want to tell everyone else about it?
This. And also "shiny new thing" vs. "don't touch it if it works" -- oftentimes it's fun to rewrite some part of a project even if it works ok. As a manager, I just need to limit the blast radius and make sure the refactoring reaches its end rather than being left half-baked.
If s3 is the goal, there is gof3r, which can be piped into, i.e. you can skip storing mybackup.zfs.bz2 locally.
Overall, I didn't see if they've identified the bottleneck.
My guess is pbzip2 is the slowest, ssh second. For compression bandwidth I'd check zstd. For ssh there are various cipher/compression options. Or perhaps skip it altogether and use WireGuard.
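A pipeline along these lines, swapping pbzip2 for multithreaded zstd and picking a fast AES-GCM cipher for ssh (hostname and dataset names are made up):

```shell
# -T0: use all cores; -3: light compression favoring throughput
# aes128-gcm is usually among the fastest ciphers on AES-NI hardware
zfs send tank/db@snap1 \
  | zstd -T0 -3 \
  | ssh -c aes128-gcm@openssh.com backup "zstd -d | zfs receive backup/db"
```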
Classic fallacy seen on engineering blogs all the time: "we cut down {speed, cost} by writing something in {go, rust}" without realizing that a rewrite is also a refactoring from the most explicit set of requirements. None of the tech debt or earlier technical decisions are factored in.
> As of this writing, we could not find any existing tools to send a ZFS file system to S3 and download it from Cloud Storage, in real time. Most tools like z3 are used for backup purposes, but we needed to transfer filesystem chunks as quickly as possible.
Blog says 100Gbit NICs and 200MB/s achieved. Big gap!
Since both endpoints are controlled by you, you should be able to tune the TCP buffers. Either way, measuring RTT and running iperf3 with dozens of concurrent TCP connections would be the first step to establish a baseline for what could be expected.
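Roughly this (hostname hypothetical; buffer ceilings are illustrative values, not a recommendation):

```shell
# Baseline: measure RTT, then aggregate throughput with parallel streams
ping -c 5 backup.example
iperf3 -c backup.example -P 32 -t 30

# Raise the per-socket TCP buffer ceilings (values are "min default max", bytes)
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 131072 67108864"
```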
Author if you are here that data would be very interesting to know.
That remains a mystery for now. We measured upload speeds to S3 to be upwards of 700MB/second (still not 12,800MB/second you'd hope for, but S3 has to write to disk(s)). Download speeds from S3 were low - very low in fact, somewhere around 3-4MB/second when multiple files are downloaded concurrently, and a maximum of again ~30MB/second.
I think you're right and high latency (~40ms) between endpoints triggered some bad TCP behavior. Next time we do something like this, I'll definitely look up how to tune those settings.
> I think you're right and high latency (~40ms) between endpoints triggered some bad TCP behavior.
40 ms, 200MB/s. The BDP limit for those numbers is 7.6 MB (if I'm holding it right), which is very close to the tcp_wmem/tcp_rmem max on Debian-like distros, so that sounds about right. Linux can be quite stingy with buffers by default. Easy to increase, though!
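The arithmetic checks out: the bandwidth-delay product is throughput times round-trip time, i.e. the number of bytes that must be in flight to keep the pipe full.

```python
# Bandwidth-delay product for the numbers in the comment above
rtt_s = 0.040      # 40 ms round trip
rate_bps = 200e6   # 200 MB/s observed throughput

bdp_bytes = rate_bps * rtt_s         # 8,000,000 bytes
print(f"BDP: {bdp_bytes / 2**20:.1f} MiB")  # ~7.6 MiB
```

If the kernel caps the socket buffer below that, throughput is limited to buffer / RTT regardless of link speed.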
It’s mostly just seething at devs nonstop for creating garbage schema, and then blaming the DB for their shitty query latency. Also, endlessly getting dragged into incidents because “the database is at fault.”
I love watching a systems problem being solved, we’re forced to learn so much during outages and problems like this.
When I read that you were starting to write a rust tool to integrate zfs with s3 because you thought aws was limiting throughput, I nearly yelled out loud! Guess you learned the same lesson I did, once :)
[0]: https://openzfs.github.io/openzfs-docs/man/master/8/zfs-send...