There is an interesting related problem - how do you efficiently test if a buffer contains only zeroes? We use this for automatically sparsifying disk images. There's no standard C function for this. My colleague came up with the following nice trick. It reuses the (presumably already maximally optimized) memcmp function from libc:
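In sketch form, the idea looks like this (my reconstruction from the description here and the replies below — the real version is in the nbdkit source linked at the bottom of the thread): check the first 16 bytes by hand, then let memcmp compare the buffer against itself shifted by 16 bytes.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: true if the buffer contains only zero bytes.  Check the
     * first 16 bytes by hand; after that, buffer[0..size-17] can only
     * equal buffer[16..size-1] if every byte is zero, because the
     * known-zero prefix propagates through the overlapping compare. */
    static bool
    is_all_zeroes (const char *buffer, size_t size)
    {
      size_t i;

      if (size >= 16) {
        for (i = 0; i < 16; ++i)
          if (buffer[i] != 0)
            return false;
        return memcmp (buffer, buffer + 16, size - 16) == 0;
      }
      for (i = 0; i < size; ++i)
        if (buffer[i] != 0)
          return false;
      return true;
    }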
Assuming that vector instructions are available, shouldn't it be much faster to actually compare the buffer contents against a vector register initialized to all-zeros rather than comparing against some other memory? Or would memcmp automatically optimize that away because of the precondition that the first 16 bytes are already known to be 0?
It is probably faster, yes (half the number of reads) – but the point of this trick is that you can re-use the (hopefully) vectorized memcmp on every platform with portable code rather than getting on the SIMD ISA treadmill yourself.
Is the choice of 16 as the "limit" value based on benchmarking? As opposed to just doing something like "!buffer[0] && !memcmp(buffer, buffer + 1, size - 1)" which uses the same principle.
I wonder if it would be meaningfully faster if you checked the first 16 bytes as uint64_t or uint128_t instead of byte by byte. It would save you 14 or 15 comparisons per function call.
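Something like this, presumably (a hedged sketch; going through memcpy sidesteps alignment and aliasing problems, and compilers lower it to plain loads):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: check the 16-byte prefix as two 8-byte loads instead of
     * 16 byte compares.  memcpy avoids unaligned-access UB; compilers
     * turn it into two ordinary loads. */
    static int
    prefix16_is_zero (const char *buffer)
    {
      uint64_t a, b;
      memcpy (&a, buffer, 8);
      memcpy (&b, buffer + 8, 8);
      return (a | b) == 0;
    }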
Yeah, dealing with the smaller buffers is annoying. You could put buffers < 16 bytes in a separate codepath, but that's trading elegance for pretty minor gains.
I have the same issue. I want to use SIMD to do ANDs between large buffers and also be able to detect if a buffer is empty (all zeroes) after an AND. It doesn't seem possible to do this without iterating over the entire buffer again because vpand() doesn't affect the eflags register.
use the popcnt instruction or the popcnt intrinsic function.
It counts how many bits are set to 1; you're looking for 0.
You can also cast it into a uint64_t integer and do an equality test. There might be a way to use fused multiply add.
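One pattern that avoids a second pass entirely (a sketch, assuming AVX2 and a size that's a multiple of 32): OR every AND result into an accumulator as you go, then do a single zero test at the end with vptest, which, unlike vpand, does set ZF.

    #include <immintrin.h>
    #include <stddef.h>

    /* Sketch, assuming AVX2 and size % 32 == 0: AND two buffers while
     * OR-ing each result into an accumulator, then do one zero test at
     * the end via _mm256_testz_si256 (vptest). */
    static int
    and_buffers_all_zero (unsigned char *dst, const unsigned char *a,
                          const unsigned char *b, size_t size)
    {
      __m256i acc = _mm256_setzero_si256 ();
      for (size_t i = 0; i < size; i += 32) {
        __m256i va = _mm256_loadu_si256 ((const __m256i *) (a + i));
        __m256i vb = _mm256_loadu_si256 ((const __m256i *) (b + i));
        __m256i r  = _mm256_and_si256 (va, vb);
        _mm256_storeu_si256 ((__m256i *) (dst + i), r);
        acc = _mm256_or_si256 (acc, r);
      }
      return _mm256_testz_si256 (acc, acc);  /* 1 iff result all zero */
    }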
Also are you mmaping the file so you can just read it directly as a single buffer? You should be able to madvise to free pages after they’ve been checked.
It would be interesting if there was a way to measure the voltage difference between two memory addresses and if it was equal, the bits would be all one or zero and then you just need to read one byte to see which it is. I don't know how practical that is, but it would be a constant time check.
> There's no standard C function for this. My colleague came up with the following nice trick.
One of the big things about C is that there is no standard library function for anything remotely nontrivial. So successfully coding in C relies on "tricks", snippets, and lore that have been passed down over the years.
Rust, meanwhile, has a check_for_all_zeroes crate or something.
A long time ago, as I was working with the Nintendo SDK for the DS console I wondered if the provided memcpy implementation was optimal.
Turned out it was quite slow.
I replaced it with an Intel hand-optimized version written for the StrongARM, and replaced the prefetch opcode with a simple load because that opcode was not supported by this console's CPU architecture.
It was 50% faster, which is quite significant for such a low-level, already-optimized routine used extensively in many stages of a game engine.
I think we should never assume that standard implementations are optimal: trust, but verify.
Modern compilers have quite a deep understanding of memcpy, and they will recognize the pattern and emit optimal assembly (on x86, probably "rep movsb" or whatever), even if you don't literally call memcpy. This is why the GCC implementation of memcpy is, like, trivial: [1]. The compiler will recognize that this is a memcpy and sub in the better implementation.
I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?
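To illustrate the pattern-recognition point: a loop like the following is typically turned into a memcpy call by GCC and Clang at their higher optimization levels (the exact flags vary by compiler and version).

    #include <stddef.h>

    /* Usually recognized as the memcpy idiom and compiled into a call
     * to (or inline expansion of) memcpy, not a byte-by-byte loop. */
    void copy_bytes (char *dst, const char *src, size_t n)
    {
      for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    }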
Only on recent x86, and with a long list of caveats. Look up discussion about ERMS (enhanced rep movsb) online.
> I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?
Memory bandwidth is often the limiting factor, but not always. But your simple byte-by-byte loop is not going to get anywhere near saturating it; you'll need to unroll and use vector instructions, which might dispatch slower but move well over an order of magnitude more data per instruction.
If you think that a specific routine or algorithm is memory bound, you should always do a quick benchmark to check this assumption.
In practice everything is memory bound, because of course the CPU is faster than memory, but you'd be surprised by how difficult it can be to reach full CPU capacity.
"Memory bound" or "Network bound" are way too frequently used as poor excuses by lazy coders.
Sadly I don't have a link, but as far as I remember rep movsb was always hilariously slow. So memcpy implementations tried to optimize copies using half a page of vector instructions with size and alignment tests, which of course killed the CPU's instruction cache.
Yes, a compiler would at least add a combo of rep movsq + rep movsd + rep movsw to the mix before finishing the final remainder with rep movsb. Vector instructions might help tremendously too.
From my experience, good prefetching and pre-alignment change a lot of things.
Compiler-optimized memcpy is good for small copies that will be inlined, but copying big chunks is another story, and I've seen non-marginal differences depending on the implementation.
The most difficult problem is that each implementation is usually tuned for a specific CPU and might be sub-optimal with a different brand or revision...
memset is something the JEDEC SDRAM standard should have implemented at the hardware level back in 1993. Why even bother writing to RAM byte by byte when we could have had a dedicated command to fill up to a whole row (8-16 kbit per chip, 8-32 KB per DIMM) at a time with a _single command_. Safe zero-fill memory allocation would be free and standard.
For background: https://faculty-web.msoe.edu/johnsontimoj/EE4980/files4980/m... Since 1993 ram chips have integrated state machines receiving and interpreting higher level commands. They also have wide sense amplifier banks being loaded/stored all at once.
Modern microcontrollers can have DMA units that you can program to, among other things, do a memset or even a memcpy when the memory bus happens to be idle, and they’ll interrupt you when they’re done. The design point is different (a microcontroller application can be limited by processor cycles but rarely by memory bus bandwidth), but I still wonder why PCs don’t have anything like that.
Programming Microcontrollers was such an interesting and different experience, designing code to be asynchronous in regards to memory operations was a whole 'nother level of arranging code.
Likewise for copies from external RAM to internal SRAM: external access was slow enough compared to the 1-cycle latency of SRAM, and CPU cycles were precious enough, that code copying lots of memory from external memory was designed to stop execution, let other code run, and resume once the copy was finished.
We were able to get some serious speed out of the 96 MHz CPU because we optimized everything around our memory bus.
On a related note, Windows has a dedicated kernel thread solely for zeroing out freed memory, so, a new page allocation won't worry about zeroing the memory itself.
Just implement a driver for your memory controller and update all software to syscall into the kernel (about 10k total instructions per syscall), which will perform the memset or memcpy, then measure the performance improvement and tell us about it.
This reminds me of an interesting problem. In oral speech, I frequently say "wouldn't've" and "couldn't've", but in text form both look completely asinine and aren't generally even recognized by spellcheckers.
Language is pretty fun. My favored multi-contraction is y'all'r'nt, similar in flavor in that it feels natural/fun to say and use, but looks really ridiculous written out (not even sure I've done it right...)
I feel like there's also something in this topic that relates to things like "going to" getting reduced to "gonna".
What's the use of filling your RAM with zeros when the data needs to be in L1, L2 or L3? Unless you are memsetting hundreds of MBs of memory, memset/memcpy in practice need to be handled by the CPU or something very close to it.
Zen has CLZERO which can clear a cacheline in one go, but not sure how good it is.
This would be a CPU command that works with the RAM controller rather than something you control yourself (kernels to my knowledge don’t talk directly to the controller beyond maybe some basic power management, if that).
There is a definite need to do hundreds of MB - the Linux kernel has a background thread that does nothing but zero out pages. What do you think happens to the GBs of RAM freed by closing Chrome? Once it's made available in one spot, there's no reason others couldn't use it (e.g. a hardened malloc implementation, etc.).
I'm not sure about Linus's objections, but I've found that DMA accelerators for time sharing systems with general workloads haven't reaped benefits, as the overhead of multiplexing and synchronizing with them kills most of their benefits. At that point it's easier to blit memory yourself.
Such a RAM capability would result in implementing hybrid compressed caches. Why waste a whole cache line storing zeroes when you can have a dedicated compressed representation?
PPC has an instruction to 'load' a line while ignoring its previous contents (it just sets up the cache state). Useful in any case when you know you're going to overwrite the whole thing.
I wonder, if someone was zeroing enough memory and the memory is a private anonymous mapping, whether they might use madvise() with MADV_DONTNEED, which on Linux will effectively zero the pages.
It returns the memory to the OS, and will pagefault on later accesses remapping them as zero-filled. It works in pages. Sizes smaller than a page result in whatever else is in the same page getting nuked.
If you don't immediately reuse the whole region, it might spread out the zeroing/remapping over time rather than doing it in a single large go. I imagine some testing would be in order to see whether a syscall plus the mapping changes (which require TLB flushes for the process?) would be cheaper than a straight run of writing zeros at some point.
IIRC, the zeroing is not something you can expect from non-linux madvise implementations.
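For reference, the call discussed above is just this (a sketch relying on Linux semantics; addr must be page-aligned and len should cover whole pages):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch of the madvise trick: for a private anonymous mapping on
     * Linux, MADV_DONTNEED drops the pages, and the next access faults
     * in fresh zero-filled ones. */
    static int
    zero_by_dontneed (void *addr, size_t len)
    {
      return madvise (addr, len, MADV_DONTNEED);
    }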
In all fairness it needs to be said that the libc implementation has to consider portability to more "exotic" architectures. For example, not every CPU allows unaligned 32-bit or 64-bit writes, or it takes a huge penalty for such writes.
Can't look right now, but you might not have benchmarked against libc, but against an optimized version included in Raspbian (https://github.com/simonjhall/copies-and-fills). I'm not sure if that's still active in the latest Raspberry Pi OS releases.
Is that 32 bit ARM code on 64 bit kernel? I thought ARM (since v6) allows unaligned access, although it might have to be emulated through the kernel which is going to be super-slow.
Calls to memset outside of the benchmark may have heterogeneous sizes, which may heavily affect branch prediction, since every branch relates to size.
I'm not saying it would go either way; it's just a big flaw to consider with the benchmarking method, which does only repeated calls of the same size.
It is surprising the GCC version does an integer multiply, if I am reading right (several cycles, unless it is cheaper for uint32 * char).
There's a bunch of things that make benchmarking memset and similar functions really hard:
Measuring the time of _repeated_ small calls to memset usually doesn't make any sense, even when the lengths are heterogeneous; this results in an instruction stream that's almost all memset, but for small memsets you almost always have lots of "other stuff" mixed in in real use. This can lead you to a suboptimal implementation.
You have to factor in what distribution of sizes actually occurs in a live system. I haven't looked at this for a decade or so, but the last time I checked the majority of system-wide time in memset (on macOS running diverse applications) was spent in length-4096 calls, and the next highest spike was (perversely) length-zero. A system implementation has to balance the whole system's needs; a memset for just your program can certainly do better. Dtrace or similar tooling is invaluable to gather this information.
As with any benchmarking, the only way to actually know is to swap out the implementation and measure real app / system performance. All that said, Nadav's implementation looks pretty plausible. It's branchier than I would like, and doesn't take advantage of specialized instructions for large buffers, but for some input distributions that's a very reasonable tradeoff, and I don't doubt that it's competitive with system memsets.
As far as real-world performance goes, this paper claims (and shows) that code size is the relevant aspect of mem* functions, and concludes that `rep stosb` is optimal in practice, even though it obviously loses to exotic hand-rolled memset and memcmp in microbenchmarks.
rep stos _is_ worth using, but that paper makes no mention of it (it does show that in their use rep cmps beat a more complicated memcmp implementation).
Also repeatedly zeroing the same memory can have different performance characteristics than zeroing different memory each time. Haven't checked what the benchmark does though.
The claim that the new implementation is faster fails to do the sort of benchmarks that systems developers must look into to justify this kind of change. A benchmark that runs the new memset implementation repeatedly in a loop ends up priming the branch predictor, trace cache and all sorts of things, often making it look better in testing than it does in the system as a whole. This kind of microbenchmark is semi-useful during development, but is actually insufficient to justify the changes being adopted by a libc or kernel project. Cold caches and branch mispredicts are a major issue for memset/memcpy in a real world system, and other benchmarks need to be run - everything from SPEC to TPCC. I know as I have seen it with my own eyes. Using SSE memset looked promising in microbenchmarks but ended up having problems in a number of real world workloads due to the expensive floating point register saves/restores in the kernel outweighing the benefits.
On x86 the situation is in some ways worse. Quite a few x86 CPUs have had atrociously bad implementations of the string instructions. As a result, some high performance systems rolled their own memset/memcpy implementations. That resulted in feedback that led CPU designers to deprioritize further optimization of those string instructions. Thankfully, string instructions have kept getting better, so the general recommendation today is to just use them.
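For the curious, using the string instruction directly is tiny. A minimal sketch with GCC/Clang extended asm on x86-64 (not a drop-in memset replacement; no alignment or small-size handling):

    #include <stddef.h>

    /* Sketch: x86-64 "rep stosb".  dst goes in RDI, the count in RCX,
     * the fill byte in AL.  On CPUs with ERMS/FSRM this is often
     * competitive with hand-vectorized memset. */
    static void
    memset_rep_stosb (void *dst, unsigned char c, size_t n)
    {
      __asm__ volatile ("rep stosb"
                        : "+D" (dst), "+c" (n)
                        : "a" (c)
                        : "memory");
    }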
I have a memset that's not only 10x faster than glibc, but also secure. The trick is to bypass the generic asm, and let the compiler optimize it, esp. with constant sizes and known alignments. Esp. with clang.
I do it because nobody else implemented a secure memset. What they call secure is just preventing the compiler from optimizing the memset away. A secure memset also cleans the caches with a memory barrier, so that Meltdown cannot read it.
explicit_bzero and its numerous variants are not only insecure, but also slow (byte-wise!).
(I have not yet published the implementation because I still need to check its performance in real-world applications. But quoting '370% in microbenchmarks with ideal branch prediction and caching' seems fair, considering that's no less than what the linked post does.
That being said, preliminary instrumentation indicates the tradeoffs made were correct.
My main point, however, is that for such low-level, essential subroutines, assembly remains the correct implementation language; C is still inadequate.)
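Re: explicit_bzero above — the usual "prevent dead-store elimination" half of a secure memset is just this (a sketch; it does nothing about the cache-flushing concern raised above):

    #include <stddef.h>
    #include <string.h>

    /* Sketch: zero the buffer, then use an empty asm with a "memory"
     * clobber so the compiler must assume p's memory is read and
     * cannot eliminate the memset as a dead store. */
    static void
    secure_clear (void *p, size_t n)
    {
      memset (p, 0, n);
      __asm__ volatile ("" : : "r" (p) : "memory");
    }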
This is undefined behavior under C99 §6.3.2.3 Paragraph 7.
"If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined."
On the x86, in the P4 times the best performing bulk operations essentially required using SIMD, and that SIMD hated you unless you aligned your memory accesses. The result was horrible bloated code to handle leading and trailing data and thus also a need to split off the implementations for small sizes. The unaligned access penalty is much lower now, and REP-prefixed operations have microcoded implementations that use the maximum memory access width (which you can’t do otherwise without SIMD instructions).
I’m curious about what the referenced code compiles down to, actually, because not only could GCC be auto-vectorizing it, it could be replacing it with a REP STOSQ or indeed a call to memset.
I'm no asm expert, but it doesn't look like there are a lot of vector instructions in the gcc compilation of this, while the clang compilation seems to make more use of the 128-bit xmm registers (at least on x86_64). You can also just see how many more instructions the gcc version outputs.
Thank you! Indeed GCC does not use SIMD here unless you set -O3 (... I seem to remember this enables some vectorization?) or allow it to use AVX with -mavx or -march=x86-64-v3. For some reason I’m unable to get it to use plain SSE (always available on x86-64) with any -mtune setting or even with -march=x86-64-v2.
https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...
Example usage for sparsifying while copying disk images: https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...