There is an interesting related problem - how do you efficiently test if a buffer contains only zeroes? We use this for automatically sparsifying disk images. There's no standard C function for this. My colleague came up with the following nice trick. It reuses the (presumably already maximally optimized) memcmp function from libc:
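In sketch form, the idea looks like this (my reconstruction from the description here and the replies below — the real version is in the nbdkit source linked at the bottom of the thread): check the first 16 bytes by hand, then let memcmp compare the buffer against itself shifted by 16 bytes.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: true if the buffer contains only zero bytes.  Check the
     * first 16 bytes by hand; after that, buffer[0..size-17] can only
     * equal buffer[16..size-1] if every byte is zero, because the
     * known-zero prefix propagates through the overlapping compare. */
    static bool
    is_all_zeroes (const char *buffer, size_t size)
    {
      size_t i;

      if (size >= 16) {
        for (i = 0; i < 16; ++i)
          if (buffer[i] != 0)
            return false;
        return memcmp (buffer, buffer + 16, size - 16) == 0;
      }
      for (i = 0; i < size; ++i)
        if (buffer[i] != 0)
          return false;
      return true;
    }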
Assuming that vector instructions are available, shouldn't it be much faster to actually compare the buffer contents against a vector register initialized to all-zeros rather than comparing against some other memory? Or would memcmp automatically optimize that away because of the precondition that the first 16 bytes are already known to be 0?
It is probably faster, yes (half the number of reads) – but the point of this trick is that you can re-use the (hopefully) vectorized memcmp on every platform with portable code rather than getting on the SIMD ISA treadmill yourself.
Is the choice of 16 as the "limit" value based on benchmarking? As opposed to just doing something like "!buffer[0] && !memcmp(buffer, buffer + 1, size - 1)" which uses the same principle.
I wonder if it would be meaningfully faster if you checked the first 16 bytes as uint64_t or uint128_t instead of byte by byte. It would save you 14 or 15 comparisons per function call.
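Something like this, presumably (a hedged sketch; going through memcpy sidesteps alignment and aliasing problems, and compilers lower it to plain loads):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: check the 16-byte prefix as two 8-byte loads instead of
     * 16 byte compares.  memcpy avoids unaligned-access UB; compilers
     * turn it into two ordinary loads. */
    static int
    prefix16_is_zero (const char *buffer)
    {
      uint64_t a, b;
      memcpy (&a, buffer, 8);
      memcpy (&b, buffer + 8, 8);
      return (a | b) == 0;
    }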
Yeah, dealing with the smaller buffers is annoying. You could put buffers < 16 bytes in a separate codepath, but that's trading elegance for pretty minor gains.
I have the same issue. I want to use SIMD to do ANDs between large buffers and also be able to detect if a buffer is empty (all zeroes) after an AND. It doesn't seem possible to do this without iterating over the entire buffer again because vpand() doesn't affect the eflags register.
use the popcnt instruction or the popcnt intrinsic function.
It counts how many bits are set to 1; you're looking for 0.
You can also cast it into a uint64_t integer and do an equality test. There might be a way to use fused multiply add.
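One pattern that avoids a second pass entirely (a sketch, assuming AVX2 and a size that's a multiple of 32): OR every AND result into an accumulator as you go, then do a single zero test at the end with vptest, which, unlike vpand, does set ZF.

    #include <immintrin.h>
    #include <stddef.h>

    /* Sketch, assuming AVX2 and size % 32 == 0: AND two buffers while
     * OR-ing each result into an accumulator, then do one zero test at
     * the end via _mm256_testz_si256 (vptest). */
    static int
    and_buffers_all_zero (unsigned char *dst, const unsigned char *a,
                          const unsigned char *b, size_t size)
    {
      __m256i acc = _mm256_setzero_si256 ();
      for (size_t i = 0; i < size; i += 32) {
        __m256i va = _mm256_loadu_si256 ((const __m256i *) (a + i));
        __m256i vb = _mm256_loadu_si256 ((const __m256i *) (b + i));
        __m256i r  = _mm256_and_si256 (va, vb);
        _mm256_storeu_si256 ((__m256i *) (dst + i), r);
        acc = _mm256_or_si256 (acc, r);
      }
      return _mm256_testz_si256 (acc, acc);  /* 1 iff result all zero */
    }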
Also are you mmaping the file so you can just read it directly as a single buffer? You should be able to madvise to free pages after they’ve been checked.
It would be interesting if there was a way to measure the voltage difference between two memory addresses and if it was equal, the bits would be all one or zero and then you just need to read one byte to see which it is. I don't know how practical that is, but it would be a constant time check.
> There's no standard C function for this. My colleague came up with the following nice trick.
One of the big things about C is that there is no standard library function for anything remotely nontrivial. So successfully coding in C relies on "tricks", snippets, and lore that have been passed down over the years.
Rust, meanwhile, has a check_for_all_zeroes crate or something.
A long time ago, as I was working with the Nintendo SDK for the DS console I wondered if the provided memcpy implementation was optimal.
Turned out it was quite slow.
I replaced it with an Intel hand-optimized version written for the StrongARM, and replaced the prefetch opcode with a simple load because that opcode was not supported by this console's CPU architecture.
It was 50% faster, which is quite significant for such a low-level, already-optimized routine used extensively in many stages of a game engine.
I think we should never assume that standard implementations are optimal: trust, but verify.
Modern compilers have quite a deep understanding of memcpy, and they will recognize the pattern and emit optimal assembly (on x86, probably "rep movsb" or whatever), even if you don't literally call memcpy. This is why the GCC implementation of memcpy is, like, trivial: [1]. The compiler will recognize that this is a memcpy and sub in the better implementation.
I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?
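To illustrate the pattern-recognition point: a loop like the following is typically turned into a memcpy call by GCC and Clang at their higher optimization levels (the exact flags vary by compiler and version).

    #include <stddef.h>

    /* Usually recognized as the memcpy idiom and compiled into a call
     * to (or inline expansion of) memcpy, not a byte-by-byte loop. */
    void copy_bytes (char *dst, const char *src, size_t n)
    {
      for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    }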
Only on recent x86, and with a long list of caveats. Look up discussion about ERMS (enhanced rep movsb) online.
> I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?
Memory bandwidth is often the limiting factor, but not always. But your simple byte-by-byte loop is not going to get anywhere near saturating it; you'll need to unroll and use vector instructions, which might dispatch slower but move well over an order of magnitude more data per instruction.
If you think that a specific routine or algorithm is memory bound, you should always do a quick benchmark to check this assumption.
In practice everything is memory bound, because of course the CPU is faster than memory, but you'd be surprised by how difficult it can be to reach full CPU capacity.
"Memory bound" or "Network bound" are way too frequently used as poor excuses by lazy coders.
Sadly I don't have a link, but as far as I remember rep movsb was always hilariously slow. So memcpy implementations tried to optimize copies using half a page of vector instructions with size and alignment tests, which of course killed the CPU's instruction cache.
Yes, a compiler would at least add a combo of rep movsq + rep movsd + rep movsw to the mix before finishing the final remainder with rep movsb. Vector instructions might help tremendously too.
From my experience, good prefetching and pre-alignment change a lot of things.
Compiler-optimized memcpy is good for small copies that will be inlined, but copying big chunks is another story, and I've seen non-marginal differences depending on the implementation.
The most difficult problem is that each implementation is usually tuned for a specific CPU and might be sub-optimal with a different brand or revision...
memset is something the JEDEC SDRAM standard should have implemented at the hardware level back in 1993. Why even bother writing to RAM byte by byte when we could have had a dedicated command to fill up to a whole row (8-16 kbit per chip, 8-32 KB per DIMM) at a time with a _single command_. Safe zero-fill memory allocation would be free and standard.
For background: https://faculty-web.msoe.edu/johnsontimoj/EE4980/files4980/m... Since 1993 ram chips have integrated state machines receiving and interpreting higher level commands. They also have wide sense amplifier banks being loaded/stored all at once.
Modern microcontrollers can have DMA units that you can program to, among other things, do a memset or even a memcpy when the memory bus happens to be idle, and they’ll interrupt you when they’re done. The design point is different (a microcontroller application can be limited by processor cycles but rarely by memory bus bandwidth), but I still wonder why PCs don’t have anything like that.
Programming Microcontrollers was such an interesting and different experience, designing code to be asynchronous in regards to memory operations was a whole 'nother level of arranging code.
Likewise for copies from external RAM to internal SRAM: external access was slow enough compared to the 1-cycle latency of SRAM, and CPU cycles were precious enough, that code copying lots of memory from external memory was designed to stop execution, let other code run, and resume once the copy was finished.
We were able to get some serious speed out of the 96 MHz CPU because we optimized everything around our memory bus.
On a related note, Windows has a dedicated kernel thread solely for zeroing out freed memory, so, a new page allocation won't worry about zeroing the memory itself.
Just implement a driver for your memory controller and update all software to syscall into the kernel (about 10k total instructions per syscall), which will perform the memset or memcpy, then measure the performance improvement and tell us about it.
This reminds me of an interesting problem. In oral speech, I frequently say "wouldn't've" and "couldn't've", but in text form both look completely asinine and aren't generally even recognized by spellcheckers.
Language is pretty fun. My favored multi-contraction is y'all'r'nt, similar in flavor in that it feels natural/fun to say and use, but looks really ridiculous written out (not even sure I've done it right...)
I feel like there's also something in this topic that relates to things like "going to" getting reduced to "gonna".
What's the use of filling your RAM with zeros when the data needs to be in L1, L2 or L3? Unless you are memsetting hundreds of MBs of memory, memset/memcpy in practice need to be handled by the CPU or something very close to it.
Zen has CLZERO which can clear a cacheline in one go, but not sure how good it is.
This would be a CPU command that works with the RAM controller rather than something you control yourself (kernels to my knowledge don’t talk directly to the controller beyond maybe some basic power management, if that).
There is a definite need to do hundreds of MB - the Linux kernel has a background thread that does nothing but zero out pages. What do you think happens to the GBs of RAM freed by closing Chrome? Once it's made available in one spot, there's no reason others couldn't use it (e.g. a hardened malloc implementation, etc.).
I'm not sure about Linus's objections, but I've found that DMA accelerators for time sharing systems with general workloads haven't reaped benefits, as the overhead of multiplexing and synchronizing with them kills most of their benefits. At that point it's easier to blit memory yourself.
Such a RAM capability would result in implementing hybrid compressed caches. Why waste a whole cache line storing zeroes when you can have a dedicated compressed representation?
PPC has an instruction to 'load' a line while ignoring its previous contents (it just sets up the cache state). Useful in any case when you know you're going to overwrite the whole thing.
I wonder, if someone was zeroing enough memory and the memory is a private anonymous mapping, whether they might use madvise() with MADV_DONTNEED, which on Linux will effectively zero the pages.
It returns the memory to the OS, and will pagefault on later accesses remapping them as zero-filled. It works in pages. Sizes smaller than a page result in whatever else is in the same page getting nuked.
If you don't immediately reuse the whole region, it might spread out the zeroing/remapping over time rather than doing it in a single large go. I imagine some testing would be in order to see whether a syscall plus the mapping changes (which require TLB flushes for the process?) would be cheaper than a straight run of writing zeros at some point.
IIRC, the zeroing is not something you can expect from non-linux madvise implementations.
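For reference, the call discussed above is just this (a sketch relying on Linux semantics; addr must be page-aligned and len should cover whole pages):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch of the madvise trick: for a private anonymous mapping on
     * Linux, MADV_DONTNEED drops the pages, and the next access faults
     * in fresh zero-filled ones. */
    static int
    zero_by_dontneed (void *addr, size_t len)
    {
      return madvise (addr, len, MADV_DONTNEED);
    }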
In all fairness it needs to be said that the libc implementation has to consider portability to more "exotic" architectures. For example, not every CPU allows unaligned 32-bit or 64-bit writes, or it takes a huge penalty for such writes.
Can't look right now, but you might not have benchmarked against libc, but against an optimized version included in Raspbian (https://github.com/simonjhall/copies-and-fills). I'm not sure if that's still active in the latest Raspberry Pi OS releases.
Is that 32 bit ARM code on 64 bit kernel? I thought ARM (since v6) allows unaligned access, although it might have to be emulated through the kernel which is going to be super-slow.
Calls to memset outside of the benchmark may have heterogeneous sizes, which may heavily affect branch prediction, since every branch relates to size.
I'm not saying it would go either way; it's just a big flaw to consider with the benchmarking method, which does only repeated calls of the same size.
It is surprising the GCC version does an integer multiply, if I am reading right (several cycles, unless it is cheaper for uint32 * char).
There's a bunch of things that make benchmarking memset and similar functions really hard:
Measuring the time of _repeated_ small calls to memset usually doesn't make any sense, even when the lengths are heterogeneous; this results in an instruction stream that's almost all memset, but for small memsets you almost always have lots of "other stuff" mixed in in real use. This can lead you to a suboptimal implementation.
You have to factor in what distribution of sizes actually occurs in a live system. I haven't looked at this for a decade or so, but the last time I checked the majority of system-wide time in memset (on macOS running diverse applications) was spent in length-4096 calls, and the next highest spike was (perversely) length-zero. A system implementation has to balance the whole system's needs; a memset for just your program can certainly do better. Dtrace or similar tooling is invaluable to gather this information.
As with any benchmarking, the only way to actually know is to swap out the implementation and measure real app / system performance. All that said, Nadav's implementation looks pretty plausible. It's branchier than I would like, and doesn't take advantage of specialized instructions for large buffers, but for some input distributions that's a very reasonable tradeoff, and I don't doubt that it's competitive with system memsets.
As far as real-world performance goes, this paper claims (and shows) that code size is the relevant aspect of mem* functions, and concludes that `rep stosb` is optimal in practice, even though it obviously loses to exotic hand-rolled memset and memcmp in microbenchmarks.
rep stos _is_ worth using, but that paper makes no mention of it (it does show that in their use rep cmps beat a more complicated memcmp implementation).
Also repeatedly zeroing the same memory can have different performance characteristics than zeroing different memory each time. Haven't checked what the benchmark does though.
The claim that the new implementation is faster fails to do the sort of benchmarks that systems developers must look into to justify this kind of change. A benchmark that runs the new memset implementation repeatedly in a loop ends up priming the branch predictor, trace cache and all sorts of things, often making it look better in testing than it does in the system as a whole. This kind of microbenchmark is semi-useful during development, but is actually insufficient to justify the changes being adopted by a libc or kernel project. Cold caches and branch mispredicts are a major issue for memset/memcpy in a real world system, and other benchmarks need to be run - everything from SPEC to TPCC. I know as I have seen it with my own eyes. Using SSE memset looked promising in microbenchmarks but ended up having problems in a number of real world workloads due to the expensive floating point register saves/restores in the kernel outweighing the benefits.
On x86 the situation is in some ways worse. Quite a few x86 CPUs have had atrociously bad implementations of the string instructions. As a result, some high performance systems rolled their own memset/memcpy implementations. That resulted in feedback that led CPU designers to deprioritize further optimization of those string instructions. Thankfully, string instructions have kept getting better, so the general recommendation today is to just use them.
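For the curious, using the string instruction directly is tiny. A minimal sketch with GCC/Clang extended asm on x86-64 (not a drop-in memset replacement; no alignment or small-size handling):

    #include <stddef.h>

    /* Sketch: x86-64 "rep stosb".  dst goes in RDI, the count in RCX,
     * the fill byte in AL.  On CPUs with ERMS/FSRM this is often
     * competitive with hand-vectorized memset. */
    static void
    memset_rep_stosb (void *dst, unsigned char c, size_t n)
    {
      __asm__ volatile ("rep stosb"
                        : "+D" (dst), "+c" (n)
                        : "a" (c)
                        : "memory");
    }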
I have a memset that's not only 10x faster than glibc, but also secure. The trick is to bypass the generic asm, and let the compiler optimize it, esp. with constant sizes and known alignments. Esp. with clang.
I do it because nobody else implemented a secure memset. What they call secure is just preventing the compiler from optimizing the memset away. A secure memset also cleans the caches with a memory barrier, so that Meltdown cannot read it.
explicit_bzero and its numerous variants are not only insecure, but also slow (byte-wise!).
(I have not yet published the implementation because I still need to check its performance in real-world applications. But quoting '370% in microbenchmarks with ideal branch prediction and caching' seems fair, considering that's no less than what the linked post does.
That being said, preliminary instrumentation indicates the tradeoffs made were correct.
My main point, however, is that for such low-level, essential subroutines, assembly remains the correct implementation language; C is still inadequate.)
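Re: explicit_bzero above — the usual "prevent dead-store elimination" half of a secure memset is just this (a sketch; it does nothing about the cache-flushing concern raised above):

    #include <stddef.h>
    #include <string.h>

    /* Sketch: zero the buffer, then use an empty asm with a "memory"
     * clobber so the compiler must assume p's memory is read and
     * cannot eliminate the memset as a dead store. */
    static void
    secure_clear (void *p, size_t n)
    {
      memset (p, 0, n);
      __asm__ volatile ("" : : "r" (p) : "memory");
    }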
This is undefined behavior under C99 §6.3.2.3 Paragraph 7.
"If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined."
On the x86, in the P4 times the best performing bulk operations essentially required using SIMD, and that SIMD hated you unless you aligned your memory accesses. The result was horrible bloated code to handle leading and trailing data and thus also a need to split off the implementations for small sizes. The unaligned access penalty is much lower now, and REP-prefixed operations have microcoded implementations that use the maximum memory access width (which you can’t do otherwise without SIMD instructions).
I’m curious about what the referenced code compiles down to, actually, because not only could GCC be auto-vectorizing it, it could be replacing it with a REP STOSQ or indeed a call to memset.
I'm no asm expert, but it doesn't look like there are a lot of vector instructions in the gcc compilation of this, while the clang compilation seems to make more use of the 128-bit xmm registers (at least on x86_64). You can also just see how many more instructions the gcc version outputs.
Thank you! Indeed GCC does not use SIMD here unless you set -O3 (... I seem to remember this enables some vectorization?) or allow it to use AVX with -mavx or -march=x86-64-v3. For some reason I’m unable to get it to use plain SSE (always available on x86-64) with any -mtune setting or even with -march=x86-64-v2.
https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...
Example usage for sparsifying while copying disk images: https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...