
There is an interesting related problem - how do you efficiently test if a buffer contains only zeroes? We use this for automatically sparsifying disk images. There's no standard C function for this. My colleague came up with the following nice trick. It reuses the (presumably already maximally optimized) memcmp function from libc:

https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...

  static inline bool __attribute__((__nonnull__ (1)))
  is_zero (const char *buffer, size_t size)
  {
    size_t i;
    const size_t limit = size < 16 ? size : 16;

    /* Check the first 16 bytes (or all of a shorter buffer) by hand. */
    for (i = 0; i < limit; ++i)
      if (buffer[i])
        return false;
    /* The first 16 bytes are now known to be zero.  Comparing the
       buffer against itself shifted by 16 forces each later byte to
       equal an earlier (zero) byte, so equality implies all zeroes. */
    if (size != limit)
      return ! memcmp (buffer, buffer + 16, size - 16);

    return true;
  }
Example usage for sparsifying while copying disk images: https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
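
To give the flavour of it, here is a minimal sketch of the copy loop shape (hypothetical: BLOCK, the pread/pwrite I/O, and pre-truncating the output to full size are my illustration, not the libnbd code; error handling omitted):

  #include <unistd.h>
  #include <stddef.h>

  #define BLOCK 65536

  static void
  copy_sparse (int in_fd, int out_fd, size_t size)
  {
    char buf[BLOCK];
    size_t offset;

    for (offset = 0; offset < size; offset += BLOCK) {
      size_t n = size - offset < BLOCK ? size - offset : BLOCK;
      pread (in_fd, buf, n, offset);
      if (is_zero (buf, n))
        continue;                  /* skip the write, leaving a hole */
      pwrite (out_fd, buf, n, offset);
    }
  }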


Assuming that vector instructions are available, shouldn't it be much faster to actually compare the buffer contents against a vector register initialized to all-zeros rather than comparing against some other memory? Or would memcmp automatically optimize that away because of the precondition that the first 16 bytes are already known to be 0?
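
For reference, a minimal SSE2 sketch of the compare-against-a-zero-register approach (my own illustration, not the nbdkit code; assumes size is a multiple of 16):

  #include <emmintrin.h>   /* SSE2 */
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  is_zero_sse2 (const char *buffer, size_t size)
  {
    const __m128i zero = _mm_setzero_si128 ();
    size_t i;

    for (i = 0; i < size; i += 16) {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (buffer + i));
      __m128i eq = _mm_cmpeq_epi8 (v, zero);   /* 0xFF where byte == 0 */
      if (_mm_movemask_epi8 (eq) != 0xFFFF)    /* some byte was nonzero */
        return false;
    }
    return true;
  }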


It is probably faster, yes (half the number of reads) – but the point of this trick is that you can re-use the (hopefully) vectorized memcmp on every platform with portable code rather than getting on the SIMD ISA treadmill yourself.


The qemu implementation does indeed do it the hard way. It's a lot of code: https://gitlab.com/qemu-project/qemu/-/blob/master/util/buff...


Interestingly, we used to have a special-case aarch64 version, but we dropped it because the C version was faster: https://gitlab.com/qemu-project/qemu/-/commit/2250d3a293d36e...

(Might or might not still be true on more modern aarch64 hardware...)


Does your memcmp version beat QEMU's plain-old-C fallback version?


Is the choice of 16 as the "limit" value based on benchmarking? As opposed to just doing something like "!buffer[0] && !memcmp(buffer, buffer + 1, size - 1)" which uses the same principle.


Not the OP, but 16 has the benefit of keeping both pointers in the comparison 16-byte aligned if the buffer was initially aligned.

This would eliminate split loads and provide a decent speedup.


This, and loop unrolling, are two common misconceptions about uarch optimization.

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

Memory alignment is mostly innocuous (other than it often compromises code legibility).

Loop unrolling, on the other hand, can slow down the code, especially in small loops.

See Agner Fog's microarchitecture PDF.


I wonder if it would be meaningfully faster if you checked the first 16 bytes as uint64_t or uint128_t instead of byte by byte. It would save you 14 or 15 comparisons per function call.
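
Something like this, perhaps (a sketch assuming size >= 16; memcpy sidesteps alignment and strict-aliasing problems and compiles to two plain loads):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  static bool
  first16_is_zero (const char *buffer)
  {
    uint64_t a, b;
    memcpy (&a, buffer, 8);
    memcpy (&b, buffer + 8, 8);
    return (a | b) == 0;          /* one test instead of 16 */
  }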


GCC (-O3) actually unrolls the loop completely into 16 x (cmpb + jne), which I find slightly surprising.

We can't easily use a larger size because we mustn't read beyond the end of the buffer if it's shorter than 16 bytes and not a multiple of 2, 4, etc.


Yeah, dealing with the smaller buffers is annoying. You could put buffers < 16 bytes in a separate codepath, but that's trading elegance for pretty minor gains.
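
For what it's worth, one common shape for that separate codepath is overlapping word loads (a sketch for 8 <= size <= 16; the two loads together cover every byte without reading past the end):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  static bool
  short_is_zero (const char *buffer, size_t size)  /* 8 <= size <= 16 */
  {
    uint64_t a, b;
    memcpy (&a, buffer, 8);               /* bytes 0..7 */
    memcpy (&b, buffer + size - 8, 8);    /* bytes size-8..size-1 */
    return (a | b) == 0;
  }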



You want this:

https://rusty.ozlabs.org/?p=560

Hope that helps!

(And yes, the CCAN memeqzero routine is the same as yours above in form).


just make sure it's not "overly optimized" - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 (in this case it was gcc's builtin memcmp that was broken, not glibc's) :)


I have the same issue. I want to use SIMD to do ANDs between large buffers and also be able to detect if a buffer is empty (all zeroes) after an AND. It doesn't seem possible to do this without iterating over the entire buffer again because vpand() doesn't affect the eflags register.
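
One single-pass option (a sketch assuming AVX2 and a length that's a multiple of 32; the function name is mine): OR every AND result into an accumulator as you go, then do a single VPTEST at the end instead of re-scanning the buffer.

  #include <immintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  and_buffers_all_zero (unsigned char *dst, const unsigned char *a,
                        const unsigned char *b, size_t n)
  {
    __m256i acc = _mm256_setzero_si256 ();
    size_t i;

    for (i = 0; i < n; i += 32) {
      __m256i r = _mm256_and_si256 (
        _mm256_loadu_si256 ((const __m256i *) (a + i)),
        _mm256_loadu_si256 ((const __m256i *) (b + i)));
      _mm256_storeu_si256 ((__m256i *) (dst + i), r);
      acc = _mm256_or_si256 (acc, r);     /* no flags needed per chunk */
    }
    return _mm256_testz_si256 (acc, acc); /* ZF set iff acc == 0 */
  }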


Clever, I love it. Maybe I'm just dumb, but it took me a lil bit to convince myself it's correct.


Use the popcnt instruction or the popcnt intrinsic function.

It counts how many bits are set to 1; you’re looking for 0.

You can also cast it into a uint64_t integer and do an equality test. There might be a way to use fused multiply add.

Also, are you mmapping the file so you can just read it directly as a single buffer? You should be able to madvise to free pages after they’ve been checked.

PSHUFB can also be used. http://0x80.pl/articles/sse-popcount.html

Essentially you want to vectorize this, though memcmp may already be vectorized and do the CPU detection for you.

Edit: also… You should be able to load 15 x 256 bits and then test them. Try VPTEST https://www.intel.com/content/www/us/en/develop/documentatio...
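
A sketch of the VPTEST idea via its intrinsic (my illustration; assumes AVX and size a multiple of 32): _mm256_testz_si256 (v, v) sets ZF exactly when v is all zero.

  #include <immintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  is_zero_avx (const char *buffer, size_t size)
  {
    size_t i;

    for (i = 0; i < size; i += 32) {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (buffer + i));
      if (!_mm256_testz_si256 (v, v))   /* ZF clear: some bit was set */
        return false;
    }
    return true;
  }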


I'd be curious about this in practice. Would it make sense to probe at a few scattered offsets first, since 0s may be spatially correlated?


It would be interesting if there was a way to measure the voltage difference between two memory addresses and if it was equal, the bits would be all one or zero and then you just need to read one byte to see which it is. I don't know how practical that is, but it would be a constant time check.


C++ has a bool none() function for bitset, along with any(), all(), ...

I haven't looked at the implementation, but you could test it against yours.


There are a number of standard functions that can achieve this, namely in string.h. Performance is a question of course.


Which functions in particular?


    strchr(str, 0) == NULL
    memchr(str, 0, len) == NULL


Those find the first zero -- they don't tell you whether the buffer contains only zeros.


Oops, you're right. I was going to reach for `strcspn` but then realized it doesn't work for null bytes.

Huh. OP is right, there is no good function for this in the standard library.


Wouldn’t Duff’s device be significantly faster here?


> There's no standard C function for this. My colleague came up with the following nice trick.

One of the big things about C is that there is no standard library function for anything remotely nontrivial. So successfully coding in C relies on tricks, snippets, and lore that have been passed down over the years.

Rust, meanwhile, has a check_for_all_zeroes crate or something.



