From my experience, good prefetching and pre-alignment change a lot of things.
Compiler-optimized memcpy implementations are good for small copies that will be inlined, but copying big chunks is another story, and I've seen non-marginal differences depending on the implementation.
The most difficult problem is that each implementation is usually tuned for a specific CPU and might be sub-optimal with a different brand or revision...
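
To make "prefetching and pre-alignment" concrete, here is a rough sketch of the idea, not a tuned implementation. It assumes GCC or Clang (for `__builtin_prefetch`), a 64-byte cache line, and a prefetch distance of 256 bytes; `copy_prefetch`, `BLOCK` and `PREFETCH_AHEAD` are names I made up for illustration, and the constants are exactly the kind of thing that would need re-tuning per CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64            /* assumed cache-line size */
#define PREFETCH_AHEAD 256  /* assumed prefetch distance, untuned */

/* Hypothetical helper: copy n bytes, aligning dst first and prefetching src. */
static void *copy_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Pre-alignment: copy the few head bytes so dst lands on a BLOCK boundary. */
    size_t head = (size_t)(-(uintptr_t)d & (BLOCK - 1));
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Main loop: one block at a time, hinting the cache a few blocks ahead. */
    while (n >= BLOCK) {
        if (n > PREFETCH_AHEAD)  /* keep the prefetch address inside the source */
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
        memcpy(d, s, BLOCK);     /* fixed size, so the compiler can inline it */
        d += BLOCK;
        s += BLOCK;
        n -= BLOCK;
    }

    /* Tail: whatever is left after the last full block. */
    memcpy(d, s, n);
    return dst;
}
```

Whether this beats the library memcpy on a given machine is something only a benchmark on that machine can tell you; the point is just to show where the alignment step and the prefetch hint sit in the loop.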