
In the author's scenario, there are zero benefits in using NVMe/TCP, as he just ends up doing a serial block copy using dd(1) so he's not leveraging concurrent I/O. All the complex commands can be replaced by a simple netcat.

On the destination laptop:

  $ nc -l -p 1234 | dd of=/dev/nvme0nX bs=1M
On the source laptop:

  $ nc x.x.x.x 1234 </dev/nvme0nX
The dd on the destination is just to buffer writes so they are faster/more efficient. Add a gzip/gunzip pair on the source/destination and the whole operation is a lot faster if your disk isn't full, i.e. if you have many zero blocks. This is by far my favorite way to image a PC over the network; I have done this many times. Be sure to pass "--fast" to gzip, as the compression is typically the bottleneck on GigE. Or better: replace gzip/gunzip with lz4/unlz4, which is even faster. Last time I did this was to image a brand new Windows laptop with a 1TB NVMe. It took 20 min (IIRC?) over GigE and the resulting image was 20GB, as the empty disk space compresses to practically nothing. I typically keep that lz4 image as a backup, and years later when I donate the laptop I restore the image with unlz4 | dd. Super convenient.
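
For reference, the lz4 variant would look something like this (device name and address are placeholders). On the destination:

  $ nc -l -p 1234 | unlz4 | dd of=/dev/nvme0nX bs=1M
On the source:

  $ lz4 -c </dev/nvme0nX | nc x.x.x.x 1234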

That said I didn't know about that Linux kernel module nvme-tcp. We learn new things every day :) I see that its utility is more for mounting a filesystem over a remote NVMe, rather than accessing it raw with dd.

Edit: on Linux the maximum pipe buffer size is 64kB so the dd bs=X argument doesn't technically need to be larger than that. But bs=1M doesn't hurt (it buffers the 64kB reads until 1MB has been received) and it's future-proof if the pipe size is ever increased :) Some versions of netcat have options to control the input and output block size, which would alleviate the need for dd bs=X, but on rescue discs the netcat binary is usually a version without these options.



> on Linux the maximum pipe buffer size is 64kB

Note that you can increase the pipe buffer; I think the default maximum size is usually around 1MB. It's a bit tricky to do from the command line, one possible implementation being https://unix.stackexchange.com/a/328364
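
For reference, the system-wide ceiling is exposed in procfs, and 1 MB is the usual default:

  $ cat /proc/sys/fs/pipe-max-size
  1048576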


It's a little grimy, but if you use `pv` instead of `dd` on both ends you don't have to worry about specifying a sensible block size, and it'll give you a nice progress display too.
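
Something like this, reusing the placeholder address and device from above. On the destination:

  $ nc -l -p 1234 | pv > /dev/nvme0nX
On the source:

  $ pv /dev/nvme0nX | nc x.x.x.x 1234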


About 9 years ago, I consulted for a company that had a bad internal hack - a disgruntled cofounder. Basically, a dead man's switch was left behind that copied out the first 20 MB of every disk to some bucket and then zeroed it out. To recover the data we had to use TestDisk to rebuild the partition tables… but before doing that we didn't want to touch the corrupted disks, so we ended up copying out about 40 TB using rescue flash disks, netcat and drives (some of the servers had a physical RAID with all slots occupied, so you couldn't use any free HDD slots). Something along the lines of dd if=/dev/sdc bs=xxx | gzip | nc -l -p 8888 and the reverse on the other side. It actually worked surprisingly well. One thing of note: try combinations of dd bs= to match the sector size - proper sizing had a large impact on dd throughput.
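
Roughly the shape of that pipeline (block size, paths and address are placeholders), with the source side listening:

  $ dd if=/dev/sdc bs=1M | gzip | nc -l -p 8888
and on the receiving side:

  $ nc source_ip 8888 | gunzip | dd of=/path/to/disk.img bs=1M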


This use of dd may cause corruption! You need iflag=fullblock to ensure it doesn't truncate any blocks, and (at the risk of cargo-culting) conv=sync doesn't hurt as well. I prefer to just nc -l -p 1234 > /dev/nvme0nX.


Partial reads won't corrupt the data. dd will issue further read()s until 1MB of data is buffered. The iflag=fullblock is only useful when counting or skipping bytes or doing direct I/O. See line 1647: https://github.com/coreutils/coreutils/blob/master/src/dd.c#...


According to the documentation of dd, "iflag=fullblock" is required only when dd is used with the "count=" option.

Otherwise, i.e. when dd has to read the entire input file because there is no "count=" option, "iflag=fullblock" does not have any documented effect.

From "info dd":

"If short reads occur, as could be the case when reading from a pipe for example, ‘iflag=fullblock’ ensures that ‘count=’ counts complete input blocks rather than input read operations."


Thank you for the correction -- it is likely that I did use count= when I ran into this some 10 years ago (and been paranoid about ever since). I thought a chunk of data was missing in the middle of the output file, causing everything after that to be shifted over, but I'm probably misremembering.


Thank you for bringing it up! I wasn't even aware of this potential problem, and I've used bs=, count= and skip=/seek= (sk"i"p means "input") through pipes across the net aaaallll the time for decades.

It pretty much seems iflag=fullblock is a requirement if you want the counts to work, even though failures might be rare.


Isn't `nc -l -p 1234 > /dev/nvme0nX` working by accident (relying on that netcat is buffering its output in multiples of disk block size)?


No — the kernel buffers non-O_DIRECT writes to block devices to ensure correctness.

Larger writes will be more efficient, however, if only due to reduced system call overhead.

While not necessary when writing an image with the correct block size for the target device, even partial block overwrites work fine:

  # yes | head -c 512 > foo
  # losetup /dev/loop0 foo
  # echo 'Ham and jam and Spam a lot.' | dd bs=5 of=/dev/loop0
  5+1 records in
  5+1 records out
  28 bytes copied, 0.000481667 s, 58.1 kB/s
  # hexdump -C /dev/loop0
  00000000  48 61 6d 20 61 6e 64 20  6a 61 6d 20 61 6e 64 20  |Ham and jam and |
  00000010  53 70 61 6d 20 61 20 6c  6f 74 2e 0a 79 0a 79 0a  |Spam a lot..y.y.|
  00000020  79 0a 79 0a 79 0a 79 0a  79 0a 79 0a 79 0a 79 0a  |y.y.y.y.y.y.y.y.|
  *
  00000200
Partial block overwrites may (= will, unless the block to be overwritten is in the kernel's buffer cache) require a read/modify/write operation, but this is transparent to the application.

Finally, note that this applies to most block devices, but tape devices work differently: partial overwrites are not supported, and, in variable block mode, the size of individual write calls determines the resulting tape block sizes.


Somehow I had thought even in buffered mode the kernel would only accept block-aligned and sized I/O. TIL.


> # yes | head -c 512 > foo

How about `truncate -s 512 foo`?


Your exact command works reliably, but it is inefficient. And it works by design, not by accident. For starters, the default block size in most netcat implementations is tiny, like 4 kB or less, so there is higher CPU and I/O overhead. And if netcat does a partial or small read of less than 4 kB, then when it writes that partial block to the NVMe disk, the kernel takes care of reading the full 4 kB block from the disk, updating it with the partial data, and rewriting the full 4 kB block to the disk, which is what makes it work, albeit inefficiently.


I would include bs=1M and oflag=direct for some extra speed.


> there are zero benefits in using NVMe/TCP, as he just ends up doing a serial block copy using dd(1) so he's not leveraging concurrent I/O

I guess most people don't have a local network faster than their SSD can transfer.

I wonder though, for those people who do, does a concurrent I/O block device replicator tool exist?

Btw, you might also want to use pv in the pipeline to see an ETA, although it might have a small impact on performance.


I doubt it makes a difference. SSDs are an awful lot better at sequential writes than random writes, and concurrent IO would mainly speed up random access.

Besides, I don't think anyone really has a local network which is faster than their SSD. Even a 4-year-old consumer Samsung 970 Pro can sustain full-disk writes at 2,000 MByte/s, easily saturating a 10Gbit connection.

If we're looking at state-of-the-art consumer tech, the fastest you're getting is a USB4 40Gbit machine-to-machine transfer - but at that point you probably have something like the Crucial T700, which has a sequential write speed of 11,800 MByte/s.

The enterprise world probably doesn't look too different. You'd need a 100Gbit NIC to saturate even a single modern SSD, but any machine with such a NIC is more likely to have closer to half a dozen SSDs. At that point you're starting to be more worried about things like memory bandwidth instead. [0]

[0]: http://nabstreamingsummit.com/wp-content/uploads/2022/05/202...


> Besides, I don't think anyone really has a local network which is faster than their SSD. Even a 4-year-old consumer Samsung 970 Pro can sustain full-disk writes at 2,000 MByte/s, easily saturating a 10Gbit connection.

You might be surprised if you take a look at how cheap high speed NICs are on the used market. 25G and 40G can be had for around $50, and 100G around $100. If you need switches things start to get expensive, but for the "home lab" crowd, since most of these cards are dual port, a three-node mesh can be had for just a few hundred bucks. I've had a 40G link to my home server for a few years now, mostly just because I could do it for less than the cost of a single hard drive.


Depending on your sensitivity to power/noise, a 40Gb switch can be had somewhat inexpensively too - something like the Brocade ICX6610 costs less than $200 on eBay.


This is exactly what I'm looking for. Would you mind sharing what specific 40G card(s) you're using?


I'm using Mellanox ConnectX-3 cards, IIRC they're HP branded. They shipped in Infiniband mode and required a small amount of command line fiddling to put them in ethernet mode but it was pretty close to trivial.

They're PCIe 3.0 x8 cards so they can't max out both ports, but realistically no one who's considering cheap high speed NICs cares about maxing out more than one port.
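
For anyone searching later: as far as I remember, the switch is done with mlxconfig from the Mellanox firmware tools (run mst start first; the device path below is just an example and will vary, and 2 means Ethernet):

  # mlxconfig -d /dev/mst/mt4099_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2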


In EC2, most of the "storage optimized" instances (which have the largest/fastest SSDs) generally have more advertised network throughput than SSD throughput, by a factor usually in the range of 1 to 2 (though it depends on exactly how you count it, e.g., how you normalize for the full-duplex nature of network speeds and same for SSD).


Can't find corroboration for the assertion 'SSDs are an awful lot better at sequential writes than random writes'.

Doesn't make sense at first glance. There's no head to move, as in an old-style hard drive. What else could make random write take longer on an SSD?


The main problem is that random writes tend to be smaller than the NAND flash erase block size, which is in the range of several MB.

You can check literally any SSD benchmark that tests both random and sequential IO. They're both vastly better than a mechanical hard drive, but sequential IO is still faster than random IO.


Seems to be the case, to some degree. My SSD is 8% slower doing random writes. I guess your mileage may vary.


The key aspect is that such memory generally works on a "block" level so making any smaller-than-block write on a SSD requires reading a whole block (which can be quite large), erasing that whole block and then writing back the whole modified block; as you physically can't toggle a bit without erasing the whole block first.

So if large sequential writes mean that you only write full whole blocks, that can be done much faster than writing the same data in random order.


In practice, flash based SSDs basically never do a full read-modify-write cycle to do an in-place update of an entire erase block. They just write the new data elsewhere and keep track of the fragmentation (consequently, sequential reads of data that wasn't written sequentially may not be as fast as sequential reads of data that was written sequentially).

RMW cycles (though not in-place) are common for writes smaller than a NAND page (eg. 16kB) and basically unavoidable for writes smaller than the FTL's 4kB granularity.


Long story short: changing a bit from 1 to 0 is really easy, but changing it from 0 to 1 is quite difficult and requires an expensive erase operation. The erase works on a full 4k block, so writing a random byte means reading 4k to a buffer, changing one byte, erasing the page, and writing back 4k. Sequential writing means erasing the page once, and writing the bytes in known-pre-erased sections.

Any modern (last few decades) SSD has a lot of sauce on top of it to reduce this penalty, but random writes are still a fair bit slower - especially once the buffer of pre-prepared replacement pages runs out. Sequential access is also just a lot easier to predict, so the SSD can do a lot of speculative work to speed it up even more.


dd has status=progress to show bytes read/written now; I just use that.


Seems awesome. Can you please tell us how to use gzip or lz4 to do the imaging?


If you search for “dd gzip” or “dd lz4” you can find several ways to do this. In general, interpose a gzip compression command between the sending dd and netcat, and a corresponding decompression command between the receiving netcat and dd.

For example: https://unix.stackexchange.com/questions/632267
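
Concretely, something along these lines (device name and address are placeholders). On the source:

  $ dd if=/dev/nvme0nX bs=1M | gzip --fast | nc x.x.x.x 1234
On the destination:

  $ nc -l -p 1234 | gunzip | dd of=/dev/nvme0nX bs=1M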


Agree, but I'd suggest zstd instead of gzip (or lz4 is fine).


> That said I didn't know about that Linux kernel module nvme-tcp. We learn new things every day :) I see that its utility is more for mounting a filesystem over a remote NVMe, rather than accessing it raw with dd.

Aside, I guess nvme-tcp would result in fewer writes, as you only copy files instead of writing over the whole disk?


Not if you use it with dd, which will copy the blank space too


Yep, I've done this and it works in a pinch. 1Gb/s is also a reasonable fraction of SATA speeds.


Would be much better to hook this up to dump and restore. It'll only copy used data and you can do it while the source system is online.

For compression the rule is that you don't do it if the CPU can't compress faster than the network.
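
For an ext4 filesystem, that would look roughly like this with the classic dump/restore tools (host, mount point and device are placeholders; the target filesystem should be freshly created and mounted). Run on the destination, inside the mounted target filesystem:

  $ cd /mnt/target && ssh root@source 'dump -0 -f - /dev/nvme0nXpY' | restore -rf -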


This is exactly what I usually do; it works like a charm.


As a sysadmin, I'd rather use NVMe/TCP or Clonezilla and do a slow write than try to go 5% faster with more moving parts and a chance to corrupt my drive in the process.

Plus, it'd be a well-deserved coffee break.

Considering I'd be going at GigE speeds at best, I'd add "oflag=direct" to bypass caching on the target. A bog standard NVMe can write >300MBps unhindered, so trying to cache is moot.

Lastly, parted can do partition resizing, but given the user is not a power user to begin with, it's just me nitpicking. Nice post otherwise.


NVMe/TCP or Clonezilla are vastly more moving parts and chances to mess up the options, compared to dd. In fact, the author's solution exposes his NVMe to unauthenticated remote write access by any number of clients(!) By comparison, the dd on the source is read-only, and the dd on the destination only accepts the first connection (yours) and no one else on the network can write to the disk.

I strongly recommend against oflag=direct as in this specific use case it will always degrade performance. Read the O_DIRECT section in open(2). Or try it. Basically, using oflag=direct locks the buffer, so dd has to wait for the block to be written to disk by the kernel before it can start reading data again to fill the buffer with the next block, thereby reducing performance.


> the author's solution exposes his NVMe to unauthenticated remote write access by any number of clients(!)

I won't be bothered in a home network.

> Clonezilla are vastly more moving parts

...and one of these moving parts is image and write integrity verification, allowing byte-by-byte verification during imaging and after the write.

> I strongly recommend against oflag=direct as in this... [snipped for brevity]

Unless you're getting a bottom of the barrel NVMe, all of them have DRAM caches and do their own write caching independent of O_DIRECT, which only bypasses OS caches. Unless the pipe you have has higher throughput than your drive, caching in the storage device's controller ensures optimal write speeds.

I can hit the theoretical maximum write speeds of all my SSDs (internal or external) with O_DIRECT. When the pipe is fatter or the device can't sustain those speeds, things go south, but this is why we have knobs.

When you don't use O_DIRECT in these cases, you may see an initial speed surge, but the total time doesn't go down.

TL;DR: When you're getting your data at 100MBps at most, using O_DIRECT on an SSD with 1GBps write speeds doesn't affect anything. You're not saturating anything on the pipe.

Just did a small test:

    dd if=/dev/zero of=test.file bs=1024kB count=3072 oflag=direct status=progress 
    2821120000 bytes (2.8 GB, 2.6 GiB) copied, 7 s, 403 MB/s
    3072+0 records in
    3072+0 records out
    3145728000 bytes (3.1 GB, 2.9 GiB) copied, 7.79274 s, 404 MB/s
Target is a Samsung T7 Shield 2TB, with 1050MB/sec sustained write speed. Bus is USB 3.0 with 500MBps top speed (so I can go at 50% of the drive's speed). Result is 404MBps, which is fair for the bus.

If the drive didn't have its own cache, caching on the OS side would have a more profound effect, since I could queue more writes to the device and pool them in RAM.


Your example proves me right. Your drive should be capable of 1000 MB/s but O_DIRECT reduces performance to 400 MB/s.

This matters in the specific use case of "netcat | gunzip | dd" as the compressed data rate on GigE will indeed be around 120 MB/s, but when gunzip is decompressing unused parts of the filesystem (which compress very well), it will attempt to write 1 GB/s or more to the pipe to dd, and dd would not be able to keep up with O_DIRECT.

Another thing you are doing wrong: benchmarking with /dev/zero. Many NVMe do transparent compression so writing zeroes is faster than writing random data and thus not a realistic benchmark.

PS: to clarify, I am very well aware that not using O_DIRECT gives the impression initial writes are faster, as they just fill the buffer cache. I am talking about sustained I/O performance over minutes as measured with, for example, iostat. You are talking to someone who has been doing Linux sysadmin and perf optimizations for 25 years :)

PPS: verifying data integrity is easy with the dd solution. I usually run "sha1sum /dev/nvme0nX" on both source and destination.

PPPS: I don't think Clonezilla is even capable of doing something similar (copying a remote disk to local disk without storing an intermediate disk image).


> Your example proves me right. Your drive should be capable of 1000 MB/s but O_DIRECT reduces performance to 400 MB/s.

I noted that the bus I connected the device to has a theoretical bandwidth of 500MBps, no?

To cite myself:

> Target is a Samsung T7 Shield 2TB, with 1050MB/sec sustained write speed. Bus is USB 3.0 with 500MBps top speed (so I can go at 50% of the drive's speed). Result is 404MBps, which is fair for the bus.


Yes, USB 3.0 is 500 MB/s, but are you sure your bus is 3.0? It would imply your machine is 10+ years old. Most likely it's 3.1 or newer, which is 1000 MB/s. And again, benchmarking /dev/zero is invalid anyway, as I explained (transparent compression).


No, it wouldn't imply the machine is 10+ years old. Even a state-of-the-art motherboard like the Gigabyte Z790 D AX (which became available in my country today) has more USB 3 gen1 (5Gbps) ports than gen2 (10Gbps).

The 5Gbps ports are just marketed as "USB 3.1" instead of "USB 3.0" these days, because USB naming is confusing and the important part is the "gen x".


To be clear for everyone:

USB 3.0, USB 3.1 gen 1, and USB 3.2 gen 1x1 are all names for the same thing, the 5Gbps speed.

USB 3.1 gen 2 and USB 3.2 gen 2x1 are both names for the same thing, the 10Gbps speed.

USB 3.2 gen 2x2 is the 20Gbps speed.

The 3.0 / 3.1 / 3.2 are the version number of the USB specification. The 3.0 version only defined the 5Gbps speed. The 3.1 version added a 10Gbps speed, called it gen 2, and renamed the previous 5Gbps speed to gen 1. The 3.2 version added a new 20Gbps speed, called it gen 2x2, and renamed the previous 5Gbps speed to gen 1x1 and the previous 10Gbps speed to gen 2x1.

There's also a 3.2 gen 1x2 10Gbps speed but I've never seen it used. The 3.2 gen 1x1 is so ubiquitous that it's also referred to as just "3.2 gen 1".

And none of this is to be confused with type A vs type C ports. 3.2 gen 1x1 and 3.2 gen 2x1 can be carried by type A ports, but not 3.2 gen 2x2. 3.2 gen 1x1 and 3.2 gen 2x1 and 3.2 gen 2x2 can all be carried by type C ports.

Lastly, because 3.0 and 3.1 spec versions only introduced one new speed each and because 3.2 gen 2x2 is type C-only, it's possible that a port labeled "3.1" is 3.2 gen 1x1, a type A port labeled "3.2" is 3.2 gen2x1, and a type C port labeled "3.2" is 3.2 gen 2x2. But you will have to check the manual / actual negotiation at runtime to be sure.


> There's also a 3.2 gen 1x2 10Gbps speed but I've never seen it used.

It's not intended to be used by-design. Basically, it's a fallback for when a gen2x2 link fails to operate at 20Gbps speeds.


I didn't mean 5 Gbps USB ports have disappeared, but rather: most machines in the last ~10 years (~8-9 years?) have some 10 Gbps ports. Therefore if he is plugging a fast SSD in a slow 5 Gbps port, my assumption was that he has no 10 Gbps port.


TIL they have been sneaking versions of USB in while I haven't been paying attention. Even on hardware I own. Thanks for that.


I wonder how using tee to compute the hash in parallel would affect the overall performance.


On GigE or even 2.5G it shouldn't slow things down, as "sha1sum" on my 4-year-old CPU can process at ~400 MB/s (~3.2 Gbit/s). But I don't bother to use tee to compute the hash in parallel because after the disk image has been written to the destination machine, I like to re-read from the destination disk to verify the data was written with integrity. So after the copy I will run sha1sum /dev/XXX on the destination machine. And while I wait for this command to complete I might as well run the same command on the source machine, in parallel. Both commands complete in about the same time so you would not be saving wall clock time.

Fun fact: "openssl sha1" on a typical x86-64 machine is actually about twice as fast as "sha1sum" because its code is more optimized.
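
A quick way to see it on your own machine (1 GiB of zeroes is enough to show the difference; absolute numbers will vary):

  $ time sh -c 'head -c 1G /dev/zero | sha1sum'
  $ time sh -c 'head -c 1G /dev/zero | openssl sha1'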

Another reason I don't bother to use tee to compute the hash in parallel is that it writes with a pretty small block size by default (8 kB), so for best performance you don't want to pass /dev/nvme0nX as the argument to tee; instead you would want to use the fancy >(...) shell syntax to pass tee a file descriptor that is sha1sum's stdin, then pipe the data to dd to give it the opportunity to buffer writes in 1MB blocks to the NVMe disk:

  $ nc -l -p 1234 | tee >(sha1sum >s.txt) | dd bs=1M of=/dev/XXX
But rescue disks sometimes have a basic shell that doesn't support fancy >(...) syntax. So in the spirit of keeping things simple I don't use tee.


It's over 10 years ago that I had to do such operations regularly with rather unreliable networks to Southeast Asia and/or SD cards, so calculating the checksum every time on the fly was important.

Instead of the "fancy" syntax I used

   mkfifo /tmp/cksum
   sha1sum /tmp/cksum &
   some_reader | tee /tmp/cksum | some_writer
Of course under the conditions mentioned throughputs were moderate compared to what was discussed above. So I don't know how it would perform with a more performant source and target. But the important thing is that you need to pass the data through the slow endpoint only once.

Disclaimer: from memory and untested now. Not at the keyboard.


> ...and one of these moving parts is image and write integrity verification, allowing byte-by-byte verification during imaging and after the write.

dd followed by sha1sum on each end is still very few moving parts and should still be quite fast.


Yes, in the laptop and one-off case, that's true.

In a data center it's not (this is when I use clonezilla 99.9% of the time, tbf).


I don't see how you can consider the nvme over tcp version less moving parts.

dd is installed on every system, and if you don't have nc you can still use ssh and sacrifice a bit of performance.

  dd if=/dev/foo | ssh dest@bar "cat > /dev/moo"


NVMe over TCP encapsulates and shows me the remote device as is. Just a block device.

I just copy that block device with "dd", that's all. It's just a dumb pipe encapsulated with TCP, which is already battle tested enough.

Moreover, if I have a fatter pipe, I can tune dd for better performance with a single command.


netcat encapsulates data just the same (although in a different manner), and it's even more battle-tested. NVMe over TCP's use case is to actually use the remote disk over the network as if it were local. If you just need to dump a whole disk like in the article, dd+netcat (or even just netcat, as someone pointed out) will work just the same.


NVMe over TCP encapsulates the entire NVMe protocol in TCP, which is way more complex than just sending the raw data. It's the opposite of "a dumb pipe encapsulated in TCP", which is what the netcat approach would be. Heck, if you insist on representing the drive as a block device on the remote side, you could just as well use NBD, which is just about as many moving parts as NVMe over TCP but still a simpler protocol.


Just cat the blockdev to a bash socket
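
Presumably something like this, using bash's /dev/tcp redirection (address, port and device are placeholders); you still need something listening on the other end, e.g. nc -l -p 1234 > /dev/nvme0nX:

  $ cat /dev/nvme0nX > /dev/tcp/x.x.x.x/1234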


I came here to suggest similar. I usually go with

    dd if=/dev/device | mbuffer to Ethernet to mbuffer dd of=/dev/device
(with some switches to select better block size and tell mbuffer to send/receive from a TCP socket)

If it's on a system with a fast enough processor I can save considerable time by compressing the stream over the network connection. This is particularly true when sending a relatively fresh installation where lots of the space on the source is zeroes.
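
Spelled out, that looks roughly like the following (port, device names, block and buffer sizes are placeholders; check mbuffer(1) for the exact switches). On the destination:

  $ mbuffer -I 9090 -s 1M -m 1G | dd of=/dev/nvme0nX bs=1M
On the source:

  $ dd if=/dev/nvme0nX bs=1M | mbuffer -s 1M -m 1G -O x.x.x.x:9090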


Sir, I don't upvote much, but your post deserves a double upvote, at least.


I am not often speechless, but this hit the spot. Well done Sir!

Where does one learn this black art?




