
There is an interesting related problem - how do you efficiently test if a buffer contains only zeroes? We use this for automatically sparsifying disk images. There's no standard C function for this. My colleague came up with the following nice trick. It reuses the (presumably already maximally optimized) memcmp function from libc:

https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...

  static inline bool __attribute__((__nonnull__ (1)))
  is_zero (const char *buffer, size_t size)
  {
    size_t i;
    const size_t limit = size < 16 ? size : 16;

    /* Check the first 16 bytes (or all of a shorter buffer) by hand. */
    for (i = 0; i < limit; ++i)
      if (buffer[i])
        return false;
    /* The first 16 bytes are now known to be zero.  Comparing the
       buffer against itself shifted by 16 forces each later byte to
       equal an earlier (zero) byte, so equality implies all zeroes. */
    if (size != limit)
      return ! memcmp (buffer, buffer + 16, size - 16);

    return true;
  }
Example usage for sparsifying while copying disk images: https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
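
To give the flavour of it, here is a minimal sketch of the copy loop shape (hypothetical: BLOCK, the pread/pwrite I/O, and pre-truncating the output to full size are my illustration, not the libnbd code; error handling omitted):

  #include <unistd.h>
  #include <stddef.h>

  #define BLOCK 65536

  static void
  copy_sparse (int in_fd, int out_fd, size_t size)
  {
    char buf[BLOCK];
    size_t offset;

    for (offset = 0; offset < size; offset += BLOCK) {
      size_t n = size - offset < BLOCK ? size - offset : BLOCK;
      pread (in_fd, buf, n, offset);
      if (is_zero (buf, n))
        continue;                  /* skip the write, leaving a hole */
      pwrite (out_fd, buf, n, offset);
    }
  }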


Assuming that vector instructions are available, shouldn't it be much faster to actually compare the buffer contents against a vector register initialized to all-zeros rather than comparing against some other memory? Or would memcmp automatically optimize that away because of the precondition that the first 16 bytes are already known to be 0?
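
For reference, a minimal SSE2 sketch of the compare-against-a-zero-register approach (my own illustration, not the nbdkit code; assumes size is a multiple of 16):

  #include <emmintrin.h>   /* SSE2 */
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  is_zero_sse2 (const char *buffer, size_t size)
  {
    const __m128i zero = _mm_setzero_si128 ();
    size_t i;

    for (i = 0; i < size; i += 16) {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (buffer + i));
      __m128i eq = _mm_cmpeq_epi8 (v, zero);   /* 0xFF where byte == 0 */
      if (_mm_movemask_epi8 (eq) != 0xFFFF)    /* some byte was nonzero */
        return false;
    }
    return true;
  }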


It is probably faster, yes (half the number of reads) – but the point of this trick is that you can re-use the (hopefully) vectorized memcmp on every platform with portable code rather than getting on the SIMD ISA treadmill yourself.


The qemu implementation does indeed do it the hard way. It's a lot of code: https://gitlab.com/qemu-project/qemu/-/blob/master/util/buff...


Interestingly, we used to have a special-case aarch64 version, but we dropped it because the C version was faster: https://gitlab.com/qemu-project/qemu/-/commit/2250d3a293d36e...

(Might or might not still be true on more modern aarch64 hardware...)


Does your memcmp version beat QEMU's plain-old-C fallback version?


Is the choice of 16 as the "limit" value based on benchmarking? As opposed to just doing something like "!buffer[0] && !memcmp(buffer, buffer + 1, size - 1)" which uses the same principle.


Not the OP, but 16 has the benefit of keeping both pointers in the comparison 16-byte aligned if the buffer was initially aligned.

This would eliminate split loads and provide a decent speedup.


This, and loop unrolling, are two common misconceptions about uarch optimization.

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

Memory alignment is mostly innocuous (other than it often compromises code legibility).

Loop unrolling, on the other hand, can slow down the code, especially in small loops.

See Agner Fog's microarchitecture PDF.


I wonder if it would be meaningfully faster if you checked the first 16 bytes as uint64_t or uint128_t instead of byte by byte. It would save you 14 or 15 comparisons per function call.
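
Something like this, perhaps (a sketch assuming size >= 16; memcpy sidesteps alignment and strict-aliasing problems and compiles to two plain loads):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  static bool
  first16_is_zero (const char *buffer)
  {
    uint64_t a, b;
    memcpy (&a, buffer, 8);
    memcpy (&b, buffer + 8, 8);
    return (a | b) == 0;          /* one test instead of 16 */
  }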


GCC (-O3) actually unrolls the loop completely into 16 x (cmpb + jne), which I find slightly surprising.

We can't easily use a larger size because we mustn't read beyond the end of the buffer if it's shorter than 16 bytes and not a multiple of 2, 4, etc.


Yeah, dealing with the smaller buffers is annoying. You could put buffers < 16 bytes in a separate codepath, but that's trading elegance for pretty minor gains.
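
For what it's worth, one common shape for that separate codepath is overlapping word loads (a sketch for 8 <= size <= 16; the two loads together cover every byte without reading past the end):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  static bool
  short_is_zero (const char *buffer, size_t size)  /* 8 <= size <= 16 */
  {
    uint64_t a, b;
    memcpy (&a, buffer, 8);               /* bytes 0..7 */
    memcpy (&b, buffer + size - 8, 8);    /* bytes size-8..size-1 */
    return (a | b) == 0;
  }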



You want this:

https://rusty.ozlabs.org/?p=560

Hope that helps!

(And yes, the CCAN memeqzero routine is the same as yours above in form).


just make sure it's not "overly optimized" - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 (in this case it was gcc's builtin memcmp that was broken, not glibc's) :)


I have the same issue. I want to use SIMD to do ANDs between large buffers and also be able to detect if a buffer is empty (all zeroes) after an AND. It doesn't seem possible to do this without iterating over the entire buffer again because vpand() doesn't affect the eflags register.
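
One single-pass option (a sketch assuming AVX2 and a length that's a multiple of 32; the function name is mine): OR every AND result into an accumulator as you go, then do a single VPTEST at the end instead of re-scanning the buffer.

  #include <immintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  and_buffers_all_zero (unsigned char *dst, const unsigned char *a,
                        const unsigned char *b, size_t n)
  {
    __m256i acc = _mm256_setzero_si256 ();
    size_t i;

    for (i = 0; i < n; i += 32) {
      __m256i r = _mm256_and_si256 (
        _mm256_loadu_si256 ((const __m256i *) (a + i)),
        _mm256_loadu_si256 ((const __m256i *) (b + i)));
      _mm256_storeu_si256 ((__m256i *) (dst + i), r);
      acc = _mm256_or_si256 (acc, r);     /* no flags needed per chunk */
    }
    return _mm256_testz_si256 (acc, acc); /* ZF set iff acc == 0 */
  }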


Clever, I love it. Maybe I'm just dumb, but it took me a lil bit to convince myself it's correct.


Use the popcnt instruction or the popcnt intrinsic function.

It counts how many bits are set to 1; you’re looking for 0.

You can also cast it into a uint64_t integer and do an equality test. There might be a way to use fused multiply add.

Also, are you mmapping the file so you can just read it directly as a single buffer? You should be able to madvise to free pages after they’ve been checked.

PSHUFB can also be used. http://0x80.pl/articles/sse-popcount.html

Essentially you want to vectorize this, though memcmp may already be vectorized and do the CPU detection for you.

Edit: also… You should be able to load 15 x 256 bits and then test them. Try VPTEST https://www.intel.com/content/www/us/en/develop/documentatio...
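
A sketch of the VPTEST idea via its intrinsic (my illustration; assumes AVX and size a multiple of 32): _mm256_testz_si256 (v, v) sets ZF exactly when v is all zero.

  #include <immintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  is_zero_avx (const char *buffer, size_t size)
  {
    size_t i;

    for (i = 0; i < size; i += 32) {
      __m256i v = _mm256_loadu_si256 ((const __m256i *) (buffer + i));
      if (!_mm256_testz_si256 (v, v))   /* ZF clear: some bit was set */
        return false;
    }
    return true;
  }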


I'd be curious about this in practice. Would it make sense to probe at a few scattered offsets first, since 0s may be spatially correlated?


It would be interesting if there was a way to measure the voltage difference between two memory addresses and if it was equal, the bits would be all one or zero and then you just need to read one byte to see which it is. I don't know how practical that is, but it would be a constant time check.


C++ has a bool none() function for bitset, along with any(), all(), ...

I haven't looked at the implementation, but you could test it against yours.


There are a number of standard functions that can achieve this, namely in string.h. Performance is a question of course.


Which functions in particular?


    strchr(str, 0) == NULL
    memchr(str, 0, len) == NULL


Those find the first zero -- they don't tell you whether the buffer contains only zeros.


Oops, you're right. I was going to reach for `strcspn` but then realized it doesn't work for null bytes.

Huh. OP is right, there is no good function for this in the standard library.


Wouldn’t Duff’s device be significantly faster here?


> There's no standard C function for this. My colleague came up with the following nice trick.

One of the big things about C is that there is no standard library function for anything remotely nontrivial. So successfully coding in C relies on tricks, snippets, and lore that have been passed down over the years.

Rust, meanwhile, has a check_for_all_zeroes crate or something.



