A 100LOC C impl of memset, that is faster than glibc's (github.com/nadavrot)
99 points by Q26124 on Nov 12, 2021 | 103 comments


There is an interesting related problem - how do you efficiently test if a buffer contains only zeroes? We use this for automatically sparsifying disk images. There's no standard C function for this. My colleague came up with the following nice trick. It reuses the (presumably already maximally optimized) memcmp function from libc:

https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...

  static inline bool __attribute__((__nonnull__ (1)))
  is_zero (const char *buffer, size_t size)
  {
    size_t i;
    const size_t limit = size < 16 ? size : 16;

    for (i = 0; i < limit; ++i)
      if (buffer[i])
        return false;
    if (size != limit)
      return ! memcmp (buffer, buffer + 16, size - 16);

    return true;
  }
Example usage for sparsifying while copying disk images: https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...


Assuming that vector instructions are available, shouldn't it be much faster to actually compare the buffer contents against a vector register initialized to all-zeros rather than comparing against some other memory? Or would memcmp automatically optimize that away because of the precondition that the first 16 bytes are already known to be 0?


It is probably faster, yes (half the number of reads), but the point of this trick is that you can re-use the (hopefully) vectorized memcmp on every platform with portable code, rather than getting on the SIMD ISA treadmill yourself.
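
For the curious, a minimal sketch of what the direct-comparison approach could look like with SSE2 intrinsics (the helper name is made up, and it assumes x86-64 plus a size that is a multiple of 16; a real version would need a tail for odd sizes):

  #include <emmintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  is_zero_sse2 (const char *buffer, size_t size)
  {
    __m128i acc = _mm_setzero_si128 ();
    /* OR all 16-byte chunks together; any nonzero byte survives. */
    for (size_t i = 0; i < size; i += 16)
      acc = _mm_or_si128 (acc, _mm_loadu_si128 ((const __m128i *)(buffer + i)));
    /* Byte-wise compare the accumulator against zero; 0xFFFF means all zero. */
    return _mm_movemask_epi8 (_mm_cmpeq_epi8 (acc, _mm_setzero_si128 ())) == 0xFFFF;
  }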


The qemu implementation does indeed do it the hard way. It's a lot of code: https://gitlab.com/qemu-project/qemu/-/blob/master/util/buff...


Interestingly, we used to have a special-case aarch64 version, but we dropped it because the C version was faster: https://gitlab.com/qemu-project/qemu/-/commit/2250d3a293d36e...

(Might or might not still be true on more modern aarch64 hardware...)


Does your memset version beat QEMU's plain-old-C fallback version?


Is the choice of 16 as the "limit" value based on benchmarking? As opposed to just doing something like "!buffer[0] && !memcmp(buffer, buffer + 1, size - 1)" which uses the same principle.


Not the OP, but 16 has the benefit of keeping both pointers in the comparison 16-byte aligned if the buffer was initially aligned.

This would eliminate split loads and provide a decent speedup.


This, and loop unrolling, are two common misconceptions about uarch optimization.

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

Memory misalignment is mostly innocuous on modern hardware (other than that alignment tricks often compromise code legibility).

Loop unrolling, on the other hand, can slow down the code, especially in small loops.

See Agner Fog's microarchitecture PDF.


I wonder if it would be meaningfully faster if you checked the first 16 bytes as uint64_t or uint128_t instead of byte by byte. It would save you 14 or 15 comparisons per function call.


GCC (-O3) actually unrolls the loop completely into 16 x (cmpb + jne), which I find slightly surprising.

We can't easily use a larger size because we mustn't read beyond the end of the buffer if it's shorter than 16 bytes and not a multiple of 2, 4, etc.


Yeah, dealing with the smaller buffers is annoying. You could put buffers < 16 bytes in a separate codepath, but that's trading elegance for pretty minor gains.
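
A rough sketch of that separate codepath, with a made-up name (is_zero_word_head): check the first 16 bytes as two 64-bit words loaded via memcpy (to avoid alignment problems), and fall back to the byte loop for short buffers:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static inline bool
  is_zero_word_head (const char *buffer, size_t size)
  {
    if (size >= 16) {
      uint64_t a, b;
      memcpy (&a, buffer, 8);      /* memcpy avoids unaligned-cast UB */
      memcpy (&b, buffer + 8, 8);
      if (a | b)
        return false;
      return ! memcmp (buffer, buffer + 16, size - 16);
    }
    for (size_t i = 0; i < size; ++i)
      if (buffer[i])
        return false;
    return true;
  }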



You want this:

https://rusty.ozlabs.org/?p=560

Hope that helps!

(And yes, the CCAN memeqzero routine is the same as yours above in form).


just make sure it's not "overly optimized" - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 (in this case it was gcc's builtin memcmp that was broken, not glibc's) :)


I have the same issue. I want to use SIMD to do ANDs between large buffers and also be able to detect if a buffer is empty (all zeroes) after an AND. It doesn't seem possible to do this without iterating over the entire buffer again because vpand() doesn't affect the eflags register.


Clever, I love it. Maybe I'm just dumb, but it took me a lil bit to convince myself it's correct.


Use the popcnt instruction or the popcnt intrinsic function.

It counts how many bits are set to 1; you're looking for 0.

You can also cast it into a uint64_t integer and do an equality test. There might be a way to use fused multiply add.

Also are you mmaping the file so you can just read it directly as a single buffer? You should be able to madvise to free pages after they’ve been checked.

PSHUFB can also be used. http://0x80.pl/articles/sse-popcount.html

Essentially you want to vectorize this, though memcmp may already be vectorized and do the CPU detection for you.

Edit: also… You should be able to load 15 x 256 bits and then test them. Try VPTEST https://www.intel.com/content/www/us/en/develop/documentatio...
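
A sketch of the VPTEST idea via the corresponding intrinsic (hypothetical helper, assumes AVX support and a size that is a multiple of 32):

  #include <immintrin.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  chunks_are_zero_avx (const char *buffer, size_t size)
  {
    for (size_t i = 0; i < size; i += 32) {
      __m256i v = _mm256_loadu_si256 ((const __m256i *)(buffer + i));
      /* _mm256_testz_si256 compiles to vptest; it returns 1 iff (v & v) == 0. */
      if (! _mm256_testz_si256 (v, v))
        return false;
    }
    return true;
  }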


I'd be curious about this in practice. Would it make sense to trade off probing in various places as 0s may be spatially correlated?


It would be interesting if there was a way to measure the voltage difference between two memory addresses and if it was equal, the bits would be all one or zero and then you just need to read one byte to see which it is. I don't know how practical that is, but it would be a constant time check.


C++'s std::bitset has a bool none() member function, along with any(), all(), ...

I haven't looked at the implementation, but you could test it against yours.


There are a number of standard functions that can achieve this, namely in string.h. Performance is a question of course.


Which functions in particular?


    strchr(str, 0) == NULL
    memchr(str, 0, len) == NULL


Those find the first zero byte; they don't tell you whether the whole buffer is zero.


Oops, you're right. I was going to reach for `strcspn` but then realized it doesn't work for null bytes.

Huh. OP is right, there is no good function for this in the standard library.


Wouldn't Duff's device be significantly faster here?


> There's no standard C function for this. My colleague came up with the following nice trick.

One of the big things about C is that there is no standard library function for anything remotely nontrivial. So successfully coding in C relies on "tricks", and snippets and lore that have been passed on over the years.

Rust, meanwhile, has a check_for_all_zeroes crate or something.


A long time ago, as I was working with the Nintendo SDK for the DS console I wondered if the provided memcpy implementation was optimal.

Turned out it was quite slow.

I replaced it with an Intel hand-optimized version made for the StrongARM, and replaced the prefetch opcode with a simple load, because that opcode was not supported by this console's CPU.

50% faster, which is quite significant for such a low-level, already-optimized routine used extensively in many stages of a game engine.

I think we should never assume that standard implementations are optimal: trust, but verify.


Modern compilers have quite a deep understanding of memcpy, and they will recognize the pattern and emit optimal assembly (on x86, probably "rep movsb" or whatever), even if you don't literally call memcpy. This is why the GCC implementation of memcpy is, like, trivial: [1]. The compiler will recognize that this is a memcpy and substitute the better implementation.

I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?

[1]: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy....
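
For reference, the "trivial" loop in question is roughly the following (sketched from memory rather than copied from libgcc); modern GCC/Clang will recognize the pattern and may substitute rep movsb or vectorized code:

  #include <stddef.h>

  void *
  naive_memcpy (void *dest, const void *src, size_t len)
  {
    char *d = dest;
    const char *s = src;
    while (len--)
      *d++ = *s++;        /* compilers recognize this as a memcpy idiom */
    return dest;
  }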


> on x86, probably "rep movsb" or whatever

Only on recent x86, and with a long list of caveats. Look up discussion about erms online.

> I wonder though: it seems to me that memory bandwidth should far and away be the limiting factor for a memcpy, so I would think even a straight-forward translation of the "trivial" implementation wouldn't be that far off from an "optimal" one. I guess memory prefetching would make a difference, but would minimizing the number of loads/stores (or unrolling the loop) really matter that much?

Memory bandwidth is often the limiting factor, but not always. And a simple byte-by-byte loop is not going to get anywhere near saturating it; you'll need to unroll and use vector instructions, which might dispatch slower but move an order of magnitude more data per instruction.


If you think that a specific routine or algorithm is memory bound, you should always do a quick benchmark to check this assumption.

In theory everything is memory bound because the CPU is faster than memory, but in practice you'd be surprised how difficult it can be to actually saturate the memory bus.

"Memory bound" or "Network bound" are way too frequently used as poor excuses by lazy coders.


> on x86, probably "rep movsb" or whatever)

Sadly I don't have a link, but as far as I remember rep movsb was always hilariously slow. So memcpy implementations tried to optimize copies using half a page of vector instructions with size and alignment tests, which of course killed the CPU's instruction cache.


Yes, a compiler would at least add a combo of rep movsq + rep movsd + rep movsw to the mix before finishing the final remainder with rep movsb. Vector instructions might help tremendously too.


Always hilariously slow? That must have been before Ivy Bridge.


Hilariously slow before Intel engineers decided to optimize the shit out of the construction. :)


From my experience, good prefetching and pre-alignment change a lot of things.

Compiler-optimized memcpy is good for small copies that will be inlined, but copying big chunks is another story, and I've seen non-marginal differences depending on the implementation.

The most difficult problem is that each implementation is usually tuned for a specific CPU and might be sub-optimal with a different brand or revision...


This is brilliant and really interesting/neat, thanks for posting.


memset is something JEDEC SDRAM standard should of implemented on a hardware level back in 1993. Why even bother writing to ram byte by byte when we could of had dedicated command to fill up to whole row (8-16kbit per chip, 8-32KB per DIMM) at a time with _single command_. Safe zero fill memory allocation would be free and standard.

For background: https://faculty-web.msoe.edu/johnsontimoj/EE4980/files4980/m... Since 1993 ram chips have integrated state machines receiving and interpreting higher level commands. They also have wide sense amplifier banks being loaded/stored all at once.


Modern microcontrollers can have DMA units that you can program to, among other things, do a memset or even a memcpy when the memory bus happens to be idle, and they’ll interrupt you when they’re done. The design point is different (a microcontroller application can be limited by processor cycles but rarely by memory bus bandwidth), but I still wonder why PCs don’t have anything like that.


Programming Microcontrollers was such an interesting and different experience, designing code to be asynchronous in regards to memory operations was a whole 'nother level of arranging code.

Likewise for doing copies from external RAM to internal SRAM, it was slow enough compared to the 1 cycle latency accessing SRAM, and CPU cycles were precious enough, that code copying lots of memory from external memory was designed to stop execution and let other code run and resume once the copy was finished.

We were able to get some serious speed out of the 96mhz CPU because we optimized everything around our memory bus.


On a related note, Windows has a dedicated kernel thread solely for zeroing out freed memory, so a new page allocation doesn't have to worry about zeroing the memory itself.


Just implement a driver for your memory controller and update all software to use syscall into kernel (about 10k total instructions per syscall), which will perform memset or memcpy, then measure performance improvement and tell it to us.


Fwiw, and OT, but “could’ve” == “could have” and “should’ve” == “should have.” In no scenario would it be “could of” or “should of.”


This reminds me of an interesting problem. In oral speech, I frequently say "wouldn't've" and "couldn't've", but in text form both look completely asinine and aren't generally even recognized by spellcheckers.


Language is pretty fun. My favored multi-contraction is y'all'r'nt, similar in flavor of feeling natural/fun to say and use, but looking really ridiculous written out (not even sure I've even done it right...)

I feel like there's also something in this topic that relates to things like "going to" getting reduced to "gonna".


What's the use of filling your RAM with zeros when the data needs to be in L1, L2 or L3? Unless you are memsetting hundreds of MBs of memory, memset/memcpy in practice needs to be handled by the CPU or something very close to it.

Zen has CLZERO which can clear a cacheline in one go, but not sure how good it is.


This would be a CPU command that works with the RAM controller rather than something you control yourself (kernels to my knowledge don’t talk directly to the controller beyond maybe some basic power management, if that).

There is a definite need to do hundreds of MB: the Linux kernel has a background thread that does nothing but zero out pages. What do you think happens to the GBs of RAM freed by closing Chrome? Once it's made available in one spot, there's no reason others couldn't use it (e.g. a hardened malloc implementation, etc.).


If you find yourself doing this a lot, there's write combining memory to coalesce writes to be more friendly to the RAM controller.

Additionally, CLZERO ends up doing very similar work, since the resulting cache flush is seen by the RAM controller as a block write.


Interesting that you mention linux, because Linus has very very strong opinions about this :)


What is his opinion about this?


He strongly believes that something like rep stos/rep mov is the right interface for memset/memcpy and off-core accelerators (like DMA) are misguided.


His reasoning or rant for this?


I'm not sure about Linus's objections, but I've found that DMA accelerators for time sharing systems with general workloads haven't reaped benefits, as the overhead of multiplexing and synchronizing with them kills most of their benefits. At that point it's easier to blit memory yourself.


What are they?


Such a RAM capability would lend itself to hybrid compressed caches: why waste a whole cache line storing zeroes when you can have a dedicated compressed representation?

On a similar note, part of ATI's HyperZ from 2000 https://en.wikipedia.org/wiki/HyperZ was fast Z clear, today the norm on every GPU.


PPC has an instruction to 'load' a line ignoring its previous contents (it just sets up the cache state). Useful whenever you know you're going to overwrite the whole thing.


I used dcbz extensively back on the Wii.


>"Unless you are memsetting hundreds of MBs of memory"

Not hundreds, but in one of my apps I do have tens of MB of contiguous cache that has to be zeroed before use / reuse.


If someone were zeroing enough memory, and the memory is a private anonymous mapping, they might use madvise() with MADV_DONTNEED, which on Linux effectively zeroes the pages.

It returns the memory to the OS, and later accesses page-fault and get remapped as zero-filled. It works in whole pages; sizes smaller than a page result in whatever else is in the same page getting nuked.

If you don't immediately reuse the whole cache, it might spread the zeroing/remapping out over time rather than doing it in a single large go. I imagine some testing would be in order to see whether the syscall + mapping changes (which require TLB invalidation for the process?) end up cheaper than a straight run of writing zeros.

IIRC, the zeroing is not something you can expect from non-Linux madvise implementations.
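
A minimal sketch of that idea, assuming Linux and a private anonymous mapping (error handling mostly omitted):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int
  main (void)
  {
    size_t len = 1 << 20;
    char *buf = mmap (NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
      return 1;
    memset (buf, 0xAA, len);            /* dirty the pages */
    madvise (buf, len, MADV_DONTNEED);  /* drop them; later reads fault in zero pages */
    printf ("%d\n", buf[0]);            /* prints 0 */
    munmap (buf, len);
    return 0;
  }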


Interesting, but you must also take care of the CPU caches.


*should have


In all fairness it needs to be said that the libc implementation has to consider portability to more "exotic" architectures. For example, not every CPU allows unaligned 32-bit or 64-bit writes, or it takes a huge penalty for such writes.


What do I mean by "not every CPU allows unaligned 32-bit or 64-bit writes"? Let's test the code (as of commit eac67b6) on a Raspberry Pi 400:

  pi@rasppi400:~/memset_benchmark $ uname -a
  Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
  
  pi@rasppi400:~/memset_benchmark $ ./bench_memset
  size, alignment, offset, libc, local
  0, 16, 0, 1237452, 834116, 1.483549,
  1, 16, 0, 1612697, 945325, 1.705971,
  2, 16, 0, 1779538, 945320, 1.882472,
  3, 16, 0, 1557081, 945324, 1.647140,
  4, 16, 0, 1779527, 889736, 2.000062,
  5, 16, 0, 1557103, 1000940, 1.555641,
  6, 16, 0, 1779551, 1000944, 1.777873,
  7, 16, 0, 1557111, 1000945, 1.555641,
  8, 16, 0, 1334654, 889723, 1.500078,
  Bus error
  
  pi@rasppi400:~/memset_benchmark $ gdb ./bench_memset
  [...]
  (gdb) run
  Starting program: /home/pi/memset_benchmark/bench_memset
  size, alignment, offset, libc, local
  0, 16, 0, 1557105, 722928, 2.153887,
  1, 16, 0, 1557103, 889797, 1.749953,
  2, 16, 0, 1557107, 889849, 1.749855,
  3, 16, 0, 1557108, 889759, 1.750033,
  4, 16, 0, 1557117, 889789, 1.749985,
  5, 16, 0, 1557110, 889745, 1.750063,
  6, 16, 0, 1557116, 889754, 1.750052,
  7, 16, 0, 1557110, 889758, 1.750038,
  8, 16, 0, 1557109, 889803, 1.749948,
  
  Program received signal SIGBUS, Bus error.
  
  small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
      at /home/pi/memset_benchmark/src/lib.c:33
  33          *((uint64_t *)last) = val8;


Can't look right now, but you might not have benchmarked against libc, but against an optimized version included in Raspbian (https://github.com/simonjhall/copies-and-fills). I'm not sure if that's still active in the latest Raspberry Pi OS releases.


Is that 32-bit ARM code on a 64-bit kernel? I thought ARM (since v6) allows unaligned access, although it might have to be emulated by the kernel, which is going to be super slow.

On SPARC you have no choice, align or die!


Yes, it is 32-bit code on a 64-bit kernel. I didn't debug which instruction ultimately causes the bus error.

  pi@rasppi400:~/memset_benchmark $ uname -a
  Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
  pi@rasppi400:~/memset_benchmark $
  pi@rasppi400:~/memset_benchmark $ file ./bench_memset
  ./bench_memset: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=ebeb69b6cb9664d78c1256a2c862f3d28f11e15e, with debug_info, not stripped


PS: It's STRD, which as far as I understand the Arm Architecture Reference Manual always requires word alignment.

  Program received signal SIGBUS, Bus error.
  small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
      at /home/pi/memset_benchmark/src/lib.c:33
  33          *((uint64_t *)last) = val8;
  1: x/i $pc
  => 0x11c8c <local_memset+1560>: strd    r0, [r7, #-8]
  (gdb) info registers
  r0             0x0                 0
  r1             0x0                 0
  r2             0x0                 0
  r3             0x1475              5237
  r4             0x5f5e100           100000000
  r5             0x11674             71284
  r6             0x29690             169616
  r7             0x29699             169625


In all fairness we need the fastest memset on every architecture. Whatever the cost of maintenance.


glibc has hand crafted assembler implementations of memcpy (often specialized for specific size ranges) for many architectures.


Does glibc not have feature detection and conditional compilation for cases like this? That is surprising to me.


It does. Each subdirectory of sysdeps/ can contain specific implementations per platform, arch, etc. eg: the aarch64 assembler memset is:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar...

There's also the "ifunc" mechanism which can be used to make the choice at runtime, eg:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar...


Calls to memset outside of the benchmark may be of heterogeneous sizes, which may heavily affect branch prediction since every branch relates to size.

I'm not saying it would go either way, just that it's a big flaw to consider with a benchmarking method that only does repeated calls of the same size.

It is surprising that the GCC version does an integer multiply, if I am reading it right (several cycles, unless it is cheaper for uint32 * char).


Yes, it's hard to take the benchmark results seriously in light of the failure to use anything other than a single size at a time.


There's a bunch of things that make benchmarking memset and similar functions really hard:

Measuring the time of _repeated_ small calls to memset usually doesn't make any sense, even when the lengths are heterogeneous; it results in an instruction stream that's almost all memset, but in real use small memsets almost always have lots of "other stuff" mixed in. This can lead you to a suboptimal implementation.

You have to factor in what distribution of sizes actually occurs in a live system. I haven't looked at this for a decade or so, but the last time I checked the majority of system-wide time in memset (on macOS running diverse applications) was spent in length-4096 calls, and the next highest spike was (perversely) length-zero. A system implementation has to balance the whole system's needs; a memset for just your program can certainly do better. Dtrace or similar tooling is invaluable to gather this information.

As with any benchmarking, the only way to actually know is to swap out the implementation and measure real app / system performance. All that said, Nadav's implementation looks pretty plausible. It's branchier than I would like, and doesn't take advantage of specialized instructions for large buffers, but for some input distributions that's a very reasonable tradeoff, and I don't doubt that it's competitive with system memsets.


As far as real-world performance goes, this paper claims (and shows) that code size is the relevant aspect of mem* functions, and concludes that `rep stosb` is optimal in practice, even though it obviously loses to exotic hand-rolled memset and memcmp in microbenchmarks.

https://storage.googleapis.com/pub-tools-public-publication-...


https://research.google/pubs/pub50338.pdf goes into more depth on the mem* libc functions and principles for the implementations in llvm libc.


rep stos _is_ worth using, but that paper makes no mention of it (it does show that in their use rep cmps beat a more complicated memcmp implementation).


Also repeatedly zeroing the same memory can have different performance characteristics than zeroing different memory each time. Haven't checked what the benchmark does though.


This is why you should benchmark like you test.

Spot regressions early (locally), but make decisions based on the big picture.


I'm curious why the time is so much worse for sizes just slightly larger than 400, but then better again for sizes larger than this?


Could be hitting a critical cache stride. See also https://stackoverflow.com/questions/11413855/why-is-transpos...


Great! Now benchmark it with every compiler × architecture combination supported by glibc.


The claim that the new implementation is faster isn't backed by the sort of benchmarks that systems developers must look at to justify this kind of change. A benchmark that runs the new memset implementation repeatedly in a loop ends up priming the branch predictor, trace cache and all sorts of things, often making it look better in testing than it does in the system as a whole. This kind of microbenchmark is semi-useful during development, but is actually insufficient to justify the changes being adopted by a libc or kernel project. Cold caches and branch mispredicts are a major issue for memset/memcpy in a real-world system, and other benchmarks need to be run, everything from SPEC to TPC-C. I know because I have seen it with my own eyes. Using SSE memset looked promising in microbenchmarks but ended up causing problems in a number of real-world workloads because the expensive floating-point register saves/restores in the kernel outweighed the benefits.

On x86 the situation is in some ways worse. Quite a few x86 CPUs have had atrociously bad implementations of the string instructions. As a result, some high-performance systems rolled their own memset/memcpy implementations, which in turn gave CPU designers little feedback pushing them to prioritize further optimization of those string instructions. Thankfully, string instructions have kept getting better, so the general recommendation today is to just use the string instructions.
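
As a sketch, "just use the string instructions" boils down to something like this for memset on x86-64 (GCC/Clang inline asm, illustrative only, not a drop-in libc replacement):

  #include <stddef.h>

  static void *
  memset_rep_stosb (void *dest, int c, size_t n)
  {
    void *d = dest;
    /* rep stosb stores AL into [rdi], rcx times, advancing rdi. */
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n)
                      : "a" (c)
                      : "memory");
    return dest;
  }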


I think that Duff's device could be used on lines 65-86 ( https://github.com/nadavrot/memset_benchmark/blob/eac67b6205... )
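
For readers unfamiliar with it, the classic Duff's device shape applied to a byte fill looks roughly like this (purely illustrative; this is not the code at the link above):

  #include <stddef.h>

  static void
  duff_fill (char *to, char c, size_t count)
  {
    if (count == 0)
      return;
    size_t n = (count + 7) / 8;   /* passes through the unrolled body */
    switch (count % 8) {
    case 0: do { *to++ = c;
    case 7:      *to++ = c;
    case 6:      *to++ = c;
    case 5:      *to++ = c;
    case 4:      *to++ = c;
    case 3:      *to++ = c;
    case 2:      *to++ = c;
    case 1:      *to++ = c;
            } while (--n > 0);
    }
  }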


Is Duff’s device still relevant? Compilers can typically do such unroll logic for you these days.


I once maintained a project that had a custom memcpy implementation because they didn't want to link to libc.

They assumed aligned writes were safe and the compiler optimized it incorrectly, resulting in memory corruption.


But is it still faster when used in the real world instead of benchmarks?

Given its comparatively huge implementation, it probably messes significantly with the instruction cache of the rest of your program, or am I overlooking something?


Is this one of those times where you ignore 2% of the edge cases or legacy compatibilities and get a bunch of extra performance?


I have a memset that's not only 10x faster than glibc, but also secure. The trick is to bypass the generic asm, and let the compiler optimize it, esp. with constant sizes and known alignments. Esp. with clang.


Don't do this. Security. explicit_bzero() won't be optimized away.


I do it because nobody else has implemented a secure memset. What they call secure is just making sure the compiler doesn't optimize the call away. A secure memset also cleans the caches with a memory barrier, so that Meltdown cannot read it.

explicit_bzero and its numerous variants are not only insecure, but also slow (byte-wise!).

Only safeclib has a secure memset_s. https://github.com/rurban/safeclib/blob/master/tests/perf_me...


Also in assembly, up to 370% the speed of glibc - https://github.com/moon-chilled/fancy-memset


(I have not yet publicized the implementation because I still need to check its performance in real-world applications. But quipping '370% in microbenchmarks with ideal branch-prediction and caching' seems appropriate considering that's no less than what the linked post does.

That being said, preliminary instrumentation indicates the tradeoffs made were correct.

My main point, however, is that for such low-level, essential subroutines, assembly remains the correct implementation language; c is still inadequate.)


The author, Nadav, has a lot of performance gems out there. Another good one explaining how to implement a fast matrix multiply: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184...


> The implementation of memset in this repository is around 25% faster than the glibc implementation. This is the geometric mean for sizes 0..512.


> if (n == 0)
>   return s;

That branch is not needed, because memset() with a length of 0 is UB anyway, but it's nice that it's safer.


You wouldn't rather have an arguably very slightly faster memset, with the caveat that it might explode in your face?


I wonder how well this handles unaligned memory.

That used to be table stakes for this kind of thing, but maybe it doesn't matter much anymore?


Probably poorly. It is a violation to cast a pointer to a type the pointer is not suitably aligned for, and the code looks like it does just that right here: https://github.com/nadavrot/memset_benchmark/blob/main/src/l...

This is undefined behavior under C99 §6.3.2.3 Paragraph 7. "If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined."

The musl code referenced has handling for this.
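
One portable way to express the unaligned store without the pointer-cast UB is to go through memcpy, which compilers turn into a single store on targets that allow unaligned access (illustrative sketch, not the musl or repo code):

  #include <stdint.h>
  #include <string.h>

  static inline void
  store_u64 (void *p, uint64_t v)
  {
    memcpy (p, &v, sizeof v);   /* compiles to one unaligned store where legal */
  }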


On the x86, in the P4 era the best-performing bulk operations essentially required SIMD, and that SIMD hated you unless you aligned your memory accesses. The result was horribly bloated code to handle leading and trailing data, and thus also a need to split off the implementations for small sizes. The unaligned access penalty is much lower now, and REP-prefixed operations have microcoded implementations that use the maximum memory access width (which you can't do otherwise without SIMD instructions).

I’m curious about what the referenced code compiles down to, actually, because not only could GCC be auto-vectorizing it, it could be replacing it with a REP STOSQ or indeed a call to memset.


Here's the code in godbolt:

gcc: https://godbolt.org/z/6xG5dKjj9

clang: https://godbolt.org/z/Mh9zozjvK

I'm no asm expert, but there don't seem to be many vector instructions in the GCC compilation of this, while the clang compilation makes more use of the 128-bit xmm registers (at least on x86_64). You can also just see how many more instructions the GCC version outputs.


Thank you! Indeed GCC does not use SIMD here unless you set -O3 (... I seem to remember this enables some vectorization?) or allow it to use AVX with -mavx or -march=x86-64-v3. For some reason I’m unable to get it to use plain SSE (always available on x86-64) with any -mtune setting or even with -march=x86-64-v2.



