Converting integers to decimal strings faster with AVX-512 (lemire.me)
112 points by ibobev on March 29, 2022 | 64 comments


And meanwhile, after 9 years of AVX-512 existing, Intel is still not reliably adding it to flagship desktop/gaming CPUs, first because of 5+ years of using the same old architecture and process, now because they disabled it even when it was there, for reasons I don't understand. Why was it not a problem for Intel to add MMX, several SSE iterations, AVX, AVX2, but now 9 years of nothing, even though parallel computing is only gaining in importance and it's annoying to have to program for and depend on GPUs?


> Why was it not a problem for Intel to add MMX, several SSE iterations, AVX, AVX2, but now 9 years of nothing, even though parallel computing is only gaining in importance and it's annoying to have to program for and depend on GPUs?

AVX512 is significantly more intensive than previous generation SIMD instruction sets.

It takes up a lot of die space and it uses a massive amount of power.

It can use so much power and is so complex that the clock speed of the CPU is reduced slightly when AVX512 instructions are running. This led to an exaggerated outrage from enthusiast/gaming users who didn't want their clock speed reduced temporarily just because some program somewhere on their machine executed some AVX512 instructions. To this day you can still find articles about how to disable AVX512 for this reason.

AVX512 is also incompatible with having a heterogeneous architecture with smaller efficiency cores. The AVX512 part of the die is massive and power-hungry, so it's not really an option for making efficiency cores.

I think Intel is making the right choice to move away from AVX512 for consumer use cases. Very few, if any, consumer-target applications would even benefit from AVX512 optimizations. Very few companies would actually want to do it, given that the optimizations wouldn't run on AMD CPUs or many Intel CPUs.

It's best left as a specific optimization technique for very specific use cases where you know you're controlling the server hardware.


This part about "a lot of die space" is largely false. The die shots show that AVX-512 adds only a modest amount of area, mostly for the wider register file and 64-byte path to memory.

The "huge power hungry" execution units are in fact largely the same execution units used for AVX and AVX2 - the CPU simply uses 2 adjacent 256-bit AVX/2 units as a single AVX-512 unit: this is the so-called port 0+1 fusion. Only the larger shuffles and the (optional) extra FMA unit on port 5 take significant addition space.

In fact, the shared core die area for SKL (no AVX-512) and SKX (has AVX-512) is the same! There is some dead space on the SKL die due to lack of AVX-512 support, but it is small: you can bet Intel didn't ship 100s of millions of SKL and SKL-derivative chips with a big space-wasting AVX-512 unit that couldn't be used; they would have made a new, more targeted layout if that were the case.


MMX was 64-bit wide, SSE 128-bit wide and AVX 256-bit wide, and they succeeded each other rapidly (when compared to the situation now). AVX512 is another doubling, so what's the hold up with this one compared to the previous two doublings?

Maybe it's time for AMD to do something about this, like when they took the initiative to create x86-64


The right thing to do IMO would be to have vector instructions with a variable number of operands. Quit adding wider instructions, just tell the instruction how many operands you want to add/multiply and let the CPU take care of it. This could also allow the CPU to decide what's the most efficient way to prefetch, etc., or to run these operations asynchronously.


That's exactly what ARM did with SVE, so it's easy to discuss the pros and cons.

The biggest problem, to my understanding, is that SIMD extensions are mostly targeted with optimized hand-written assembly code. It's fine to say that the SIMD is variable-width, but you still have to run this on a processor where the hardware is not variable-width. Not just the SIMD unit, but also the feedpaths/etc. And some things a "generic" framework might let you do, might have catastrophic performance consequences.

In practice a lot of this will be avoided by just knowing the hardware you're going to target and not doing the things that hurt performance, but that also completely misses the point of variable-width, you might as well just target the hardware directly at that point.

And actually your data may not be "variable-width" either.


It seems to me it's extremely common to work with vectors that are much larger than the host CPU's SIMD width. In that case you could get much better performance letting the CPU schedule a large batch of work rather than trying to micro-optimize for multiple possible targets.

Software is typically compiled for x86-64, but not necessarily for each specific x86-64 processor available. You often just get a generic precompiled binary. That's not really optimized for the machine you're running the software on.

Maybe you want to have some restrictions on the size of the vectors you're working with to make it more convenient for the hardware, like "should be a multiple of 4x64 bits", but I still think there could be big gains with variable width in most cases.


I thought the same but that’s not how that works. It’s variable width at the assembly level. You give it entire buffers with the total length and it internally uses the appropriate width (or tells you the width it used during the loop? I forget that detail). Anyway, it’s not a SW framework but a core part of the CPU implemented in Hw.


That's not how SVE works. Any given SVE CPU still has a fixed vector width at runtime, but the CPU exposes instructions which tell you the vector width and other convenience instructions for performing loop counter math and handling the "final" iteration which may not use a full vector.

Then you need to write your loops in such a way that they work with a variable width: i.e., doing more iterations if the vector width is smaller.
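
To make that concrete, a width-agnostic SVE loop with the ACLE intrinsics looks roughly like the sketch below (a minimal illustration, assuming SVE hardware and a compiler that provides arm_sve.h; the function and array names are made up):

    #include <arm_sve.h>
    #include <stdint.h>

    // Add two float arrays of arbitrary length n without knowing the
    // hardware vector width at compile time.
    void add_arrays(float *dst, const float *a, const float *b, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntw()) {       // svcntw() = 32-bit lanes per vector, known only at runtime
            svbool_t pg = svwhilelt_b32_s64(i, n);        // predicate masks off the tail in the last iteration
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
        }
    }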


Zen 4 is reported to have AVX-512 support. I'm not sure if that'll be included on Ryzen or not, though; only Epyc is "confirmed" as far as I know.


It would be strange if AMD didn't offer this on Ryzens. First, they don't have a history of locking people out of instruction sets. Second, they need it for being competitive with Intel anyway. Third, they're using the same die for multiple purposes, not like Intel with separate server dies for Xeons.


As the 4th doubling, that's a 16X increase over the base.

512-bit registers, ALUs, data paths, etc, are all really really physically big.


My sense is that it's because it happened about when Intel's litho progress began stalling out. It was designed for a world the uarch folks were expecting with smaller transistors than they ended up getting.


The power/clockspeed hit for AVX512 is less on the most recent Intel CPUs, per previous discussion: https://news.ycombinator.com/item?id=24215022


The size of the die space and the amount of the power have very little to do with AVX-512.

AVX-512 is a better instruction set and many tasks can be done with fewer instructions than when using AVX or SSE, resulting in a lower energy consumption, even at the same data width.

The increase in size and power consumption is due almost entirely to the fact that AVX-512 has both twice the number of registers and double-width registers in comparison with AVX. Moreover the current implementations have a corresponding widening of the execution units and datapaths.

If SSE or AVX had been widened for higher performance, they would have had the same increases in size and power, but they would have remained less efficient instruction sets.

Even on the worst AVX-512 implementation, Skylake Server, doing a computation in AVX-512 mode reduces energy consumption considerably.

The problem with AVX-512 in Skylake Server and derived CPUs, e.g. Cascade Lake, is that those Intel CPUs have worse methods of limiting power consumption than the contemporaneous AMD Zen. Whatever method Intel used, it reacted too slowly to consumption peaks. Because of that, the Intel CPUs had to reduce the clock frequency in advance whenever they anticipated that excessive power consumption could happen in the future, e.g. when they see a sequence of AVX-512 instructions and expect that more will follow.

While this does not matter for programs that do long computations with AVX-512, when the clock frequency really needs to go down, it handicaps the programs that execute only a few AVX-512 instructions, but enough to trigger the decrease in clock frequency, which slows down the non-AVX-512 instructions that follow.

This was a serious problem for all Intel CPUs derived from Skylake, where you must take care to not use AVX-512 instructions unless you intend to use many of them.

However it was not really a problem of AVX-512 but of Intel's methods for power and die temperature control. Those can be improved and Intel did improve them in later CPUs.

AVX-512 is not the only thing that caused such undesirable behavior. Even on much older Intel CPUs, the same kind of problem appears when you want maximum single-thread performance but some random background process starts on another, previously idle core. Even if that background process consumes negligible power, the CPU fears that it might start to consume a lot, so it drastically reduces the maximum turbo frequency compared with the case when a single core was active, causing the program you actually care about to slow down.

This is exactly the same kind of problem, and it is visible especially on Windows, which has a huge number of enabled system services that may start to execute unexpectedly, even when you believe the computer should be idle. Nevertheless, people got used to this behavior, especially because there was little they could do about it, so it is much less discussed than the AVX-512 slowdown.


When AVX first came out it did make the process of overclock stability tuning more difficult, because there was a desire to test under both AVX and non-AVX pathological load conditions. Sometimes the change in voltage drop over the die, along with the frequency change, would cause instability in one case but not the other.

Having gone through the process, I could understand how this would be an annoying feature that violated previous assumptions about how Intel CPUs work.


AVX-512 still isn't one instruction set either. It is several that are overlapping, with different feature sets.

I have seen several accounts of them being inefficient (drawing much power, large chip area, dragging down the clock of the entire chip). They have also been criticised for making context switches longer.

There are some good ideas in there though. What I think Intel should do is redesign it as a single modern instruction set for 128- and 256-bit vectors (or for scalable vector lengths such as where ARM and RISC-V are going). Then express the older SSE-AVX2 instructions in the new instructions' microcode, with the goal of replacing SSE-AVX2 in the long term. Otherwise, if Intel doesn't have an instruction set that is modern and feature-complete, I think Intel will be falling behind.


You can already use the AVX-512 instructions with vectors smaller than 512 bits; that, in fact, is one of the extensions to the base instruction set, and is present in every desktop/server CPU that has it available (but not the original Xeon Phis, IIRC.) This is called the "Vector Length" extension.

Edit: I noted you said "redesign" which is an important point. I don't disagree in principle, I just point this out to be clear you can use the new ISA with smaller vectors already, to anyone who doesn't know this. No need to throw out baby with bathwater. Maybe their "redesign" could just be ratifying this as a separate ISA feature...
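
As a small sketch of what that buys you (assuming AVX-512F plus VL; the function here is just an illustration, not from the article), the mask registers and new encodings work directly on 256-bit vectors:

    #include <immintrin.h>

    // Clamp negative lanes to zero using a k-register compare and a masked
    // move, all on 256-bit vectors - no 512-bit datapath involved.
    static __m256i clamp_negatives(__m256i v) {
        __mmask8 neg = _mm256_cmplt_epi32_mask(v, _mm256_setzero_si256());
        return _mm256_mask_mov_epi32(v, neg, _mm256_setzero_si256());
    }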

> drawing much power, large chip area, dragging down the clock of the entire chip

The chip area thing has mostly been overblown IMO, the vector unit itself is rather large but would be anyway (you can't get away from this if you want fat vectors), and the register file is shared anyway so it's "only" 2x bigger relative to a 256b one, but still pales in comparison to the L2 or L3. The power considerations are significantly better at least as of Ice Lake/Tiger Lake, having pretty minimal impact on clock speeds/power licensing, but it's worth noting these have only 1 FMA unit in consumer SKUs and are traditionally low core count designs. Seeing AVX-512 performance/power profiles on Sapphire Rapids or Ice Lake Xeon (dual FMAs, for one) would be more interesting but again it wouldn't match desktop SKUs in any case.

It's also worth noting AVX2 has similar problems e.g. on Haswell machines where it first appeared, and would similarly cause downclocking due to thermal constraints; but at the time the blogosphere wasn't as advanced and efficient at regurgitating wives' tales with zero context or wider understanding, so you don't see it talked about as much.

In any case I suspect Intel is probably thinking about moving in the direction of variable-width vector sizes. But they probably think about a lot of things. It's unclear how this will all pan out, though I do admit AVX-512 overall counts as a bit of a "fumbled bag" for them, so to speak. I just want the instructions! No bigger vectors needed.


> "only" 2x bigger relative to a 256b one

4x bigger; avx512 has twice as many registers.


The number of microarchitectural registers is only loosely related to the number of architectural registers.


Skylake-X has 168 vector registers. Ice Lake and Tiger Lake have 224.

The connection to the 32 architectural registers is quite loose.


Granted. There is still more state to keep track of, though, and I would not be surprised if the physical register file increased in size by more than 2x.


people kinda oversell the complexity of the situation. like ooh big scary venn diagram right?

https://twitter.com/InstLatX64/status/1114141314441011200/ph...

The shorthand is that there's basically "xeon phi" and "everything else". Each of those sets has strictly increasing capabilities with each product generation - Phi Gen2 is a strict superset of Phi Gen1, but Consumer Gen2 doesn't necessarily include all the instructions in Phi Gen1.

There's a few exceptions and nuances (in particular, this is generation, not chronological order), but that's really all you need to know in practice. If you want to refine that a little further, there's "consumer", "server" and "xeon phi" set and each of those is very well-ordered. Basically they just took the consumer line and added some neural network instructions to it for server.

But if you're a consumer, there is only one feature set you need to worry about, and they are strictly ordered within that series. Tiger Lake will be strictly better than Ice Lake, and so on. Well, until Intel just dropped AVX-512 entirely of course, but AMD is adding support in Zen4, so it's not going away or anything.

I get that people enjoy taking the piss, but there were literally 11 different "feature sets" in MMX/SSE that come up in a trivial search, and if you counted every single combination that any chip ever implemented as being a "level" there are most likely 30 or 40 "levels" in MMX/SSE. Some of those even got dropped and abandoned, others became basically mandatory for running modern software. It would be a big scary venn diagram assuming it could even be represented at all.

Today we don't even think about it (except for Phenom II owners, hi guys!), we just found the useful sets of instructions and everything implements them. When was the last time you worried about whether your processor had SSE4 POPCNT? Over time it's a non-issue.

AMD will implement their own subset of features in Zen4 that they think are most useful, maybe add a few of their own, and over time things will settle in exactly the same way that MMX/SSE did.


It costs a lot of area for something that doesn't move the needle for the majority of desktop applications (browsers, games, office apps). The AVX512 units on Skylake-SP for example are a substantial portion of the area of each core. At some point you have to consider how many cores you could fit in the area used for vector extensions and make a trade-off.


This argument IMO is rather overblown. A quick die shot of Skylake-X indicates to me that even if you completely nerfed every AVX-512 unit appropriately (including the register files) on say, an 8-core or 10-core processor, you're going to get what? 1 or 2 more cores at maximum?[1] People make it sound like 70% of every Skylake-X die just sits there and you could have 50x more cores, when it's not that simple. You can delete tons of features from a modern microprocessor to save space but that isn't always the best way forward.

In any case, I think this whole argument really misses another important thing, which is that the ISA and the vector width are two separate things here. If Intel would just give us AVX-512 with 256b or 128b vectors, it would lead to a very big increase in use IMO, without massive register files and vector data paths complicating things. The instruction set improvements are good enough to justify this IMO. (Alder Lake makes it a more complex case since they'd need to do a lot of work to unify Gracemont/Golden Cove before they could even think about that, but I still think it would be great.) They could even just halve the width of the vector load/store units relative to the vector size like AMD did with Zen 2 (e.g. 256b vecs are split into 2x128b ops.) I'll still take it.

[1] https://twitter.com/GPUsAreMagic/status/1256866465577394181/...


Exactly. Even just one AVX512 ALU instead of the typical two would be an option (which in fact Intel has used on some *lake variants).


> It costs a lot of area for something that's doesn't move the needle for the majority of desktop applications

But that could be a bit of a chicken and egg situation, right?


Absolutely. It's useful in text processing, it's useful in emulation, it's useful for doing bit-math for various encodings and checksumming, etc. The newer revisions don't have any downclocking either.

https://old.reddit.com/r/emulation/comments/lzfpz5/what_are_...

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...

People have made some incredibly broad assertions about the technology as a whole on the basis of a first-gen product that lingered on the market for far too long because of the 10nm woes.

People also hang on Linus's words far, far too much. He's incredibly toxic and willing to opine about stuff he's been disconnected from for 20 years. Lisa Su is putting her money where her mouth is, AMD is making a billion-dollar bet on AVX-512 at a time when Intel has actually abandoned it in the consumer market entirely. Maybe she's doing it just for the server market, but that still means she thinks server-market customers are going to see an actual benefit on normal server tasks, such that it will be worth making every single customer pay more for a larger piece of silicon.

It's not the kind of thing you do "just to win a couple HPC benchmarks". Sorry, I trust the person with the financial stake in this discussion a lot more than I trust Linus's spit-takes.


Those are surprising claims. Browsers and their video/image codecs use SIMD. Other usages in office/productivity apps include Photoshop/Lightroom, video editing, compression, CAD.



They disabled it because it's not compatible with their efficiency-core design for big.LITTLE - right now, you either have to put AVX-512 on both, or neither.

An efficiency core with AVX-512 is basically a Xeon Phi core (and that approach actually probably would be good at the compute-heavy workloads efficiency cores are optimized for), so that's not an absurdity, but they didn't do that. So at that point the answer is "neither", and they turned it off on their big cores. But apparently they left the unit there in the silicon eating up a fairly large amount of space (I think it's around 10-20% of the core).

The real question is why nobody thought of this before... did they have some hardware/software solution to handle the instruction set differences that didn't work out? The fact that it was shoddily disabled before launch, and left in programmers' guides and other documents, suggests that maybe there was something like that and somehow it didn't end up happening.

(You could trap illegal-instruction faults and mark the thread as big-core-only; the Linux kernel did something similar at one point for AVX instructions in general (to save register state/bandwidth/etc. it wouldn't save the AVX registers unless a thread actually used AVX). But there are more problems that still have to be solved first - how does software know how many AVX-512-capable cores are available? How does software know which core its CPUID instruction will be executed on? etc.)

If they had a heterogeneous-ISA mechanism or something that will be introduced later, why isn't it on the roadmaps for next-gen, etc.? And if they didn't have those mechanisms, then why is it still on the core of an extremely high-volume consumer architecture?

You could say maybe it's a testbed for Sapphire Rapids, but every company has dedicated test silicon for that, and 10-20% of every single consumer chip sold is a huge waste of money. Sure, they'd have to design a core without it, but Intel absolutely has the money to do that - just like Skylake vs Skylake-X. The unit is designed to be modular so it's not hard, and they'd make it back extremely quickly.

Maybe they just needed some dead silicon to spread out the thermal load a bit? Obviously dead silicon has to be dead, so it doesn't matter if you print AVX-512 or the names of the design team, it's not supposed to be powered up either way.

Anyway that's my take - the "why take it back off" part is straightforward, the answer is big.LITTLE. The real interesting question is "so why is it still on the silicon".


Feels like the answer is Conway’s Law, one way or another.


> now because I don't know why they disabled it when it was there

E-cores (Gracemont) don't support AVX-512, only P-cores (Golden Cove) support AVX-512


Sure, but they deliberately released a BIOS version that prevented people from enabling it when disabling E-cores


To supplement other comments, AVX-512 is an umbrella for a bunch of related extensions. Some chips have a smaller subset of AVX-512 capabilities and others have more.

Linus has mentioned he wants AVX-512 to die a gruesome death. AMD also doesn't actually support AVX-512 yet (and when it does, it is expected to run it as 2 cycles of 256-bit wide operations, i.e. AVX2 width).

The whole AVX-512 situation is just so complicated in hardware and hence software (which subset to enable?), and this is probably the primary reason for the lack of adoption, even from Intel themselves.

E.g. the disabled AVX-512 that you alluded to in recent Intel CPUs is because of their hybrid nature: the little cores aren't capable of AVX-512, and since they want the same instructions to be able to run on both big and little cores, the only move is to disable it. AnandTech tested with the little cores disabled in BIOS, and while Intel said the AVX-512 units are fused off, they were actually able to run AVX-512.


Not sure why the AVX-512 situation is complicated. As paulmd commented, in practice all the various avx512* supported by more recent chips (except Phi) are supersets of the previous ones.

github.com/highway compiles the same source code to AVX2 or AVX512, and checks at runtime which to use. I've seen 1.5x speedups from this (even including throttling on 5 year old Skylakes). Why not use what's there?
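
For the curious, the underlying mechanism is roughly the sketch below (not Highway's actual API, just an illustration of compiling the same kernel for two targets and choosing one at runtime; the function names are made up, and the target attribute / builtin are GCC/Clang-specific):

    #include <stddef.h>

    __attribute__((target("avx2")))
    static void scale_avx2(float *x, size_t n, float f) {
        for (size_t i = 0; i < n; i++) x[i] *= f;   // same source, AVX2 code generation
    }

    __attribute__((target("avx512f")))
    static void scale_avx512(float *x, size_t n, float f) {
        for (size_t i = 0; i < n; i++) x[i] *= f;   // same source, AVX-512 code generation
    }

    void scale(float *x, size_t n, float f) {
        if (__builtin_cpu_supports("avx512f"))      // runtime CPUID check
            scale_avx512(x, n, f);
        else if (__builtin_cpu_supports("avx2"))
            scale_avx2(x, n, f);
        else
            for (size_t i = 0; i < n; i++) x[i] *= f;   // portable fallback
    }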


Caveat 1: My comment was intended to be a supplement to others so it isn't aimed to be self-contained nor exhaustive.

Caveat 2: I am a user of Intel Xeon Phi (KNL specifically) on an HPC system, so my view is biased in this way. It's probably right that the extensions have many fewer variants in the non-Xeon Phi line of products. But it still carries some weight, because it is a sign that Intel was struggling to strike a balance, at least in the beginning. They weren't sure what to offer; the fact that the Xeon Phi line was killed and replaced by GPUs is their admission of failure there.

Then vendors other than Intel (AMD) would hesitate to adopt it because even Intel wasn't sure what to offer. And if AMD doesn't pick it up, the usefulness of AVX-512 is in doubt. E.g. the traditional way of prepping binary releases (which many distribution methods still use) is to target the "highest common factor" of supported instructions on reasonably recent CPUs, meaning they would not target AVX-512 at all since the latest CPUs from AMD don't have it (they later added fake AVX-512 capability). So whether AVX-512 is useful to you as an end user then depends a lot on the distributions (and in practice none of the general-purpose binaries are targeting this; in the rare case they build using the Intel compilers, it is very easy to ship both code paths, which is not true with the GNU compilers). And if it isn't going to be useful for the majority of users, there's no hurry to upgrade specially for AVX-512.

Then it becomes a disadvantage for Intel to carry the cost of making AVX-512 capable CPUs to compete at the lower end. Note that there are other comments and articles on this already—AVX-512 is not cheap. It takes significant die area, which is cost.

That further complicates things when Intel introduces a big.LITTLE design, where on the little cores, low energy draw and low cost (hence smaller die size) push towards having no AVX-512. And since the only way to guarantee any binary can jump between the big and little cores is a common instruction set, they chose to disable AVX-512 on the big cores even though it's there. (The AnandTech article is much more detailed about this.)

P.S. I'm sure historians will give a better and more balanced account of why it failed, but I think it is safe to say the standard is now too fragmented, not necessarily in implementation (Xeon Phi aside) but in the market, so it probably remains a niche.

P.S.2 A fun fact is that in recent years AMD's CPUs are praised for their performance-per-watt because of advances in fabrication tech (basically TSMC's advances). But I did some calculations earlier showing that if we're only talking about FLOPS, and compare FLOPS per watt taking the AVX-512 units into consideration (many Intel CPUs even have 2 AVX-512 units), it is actually comparable if not lower for Intel. (Those were old calculations, and with the newest Intel parts getting onto a better fab, I'm sure the story has changed already.)

But then not many people, if any (I've heard none, but I don't read much except sometimes AnandTech), have mentioned this to make the comparison "fair" to Intel. Why? Because end users most of the time aren't going to benefit from AVX-512 (again because of distributions), so that's often not taken into account, sometimes indirectly by running benchmarks that are not AVX-512 aware.


Oh, you are using Phi :) My viewpoint is likewise biased towards desktop and server systems.

Fair point about heterogeneous cores. I suspect those aren't going away. Perhaps the industry should move to a model where threads are temporarily pinned to their core while running SIMD kernels, and re-check the current CPU capabilities rather than storing the result. That would allow using both AVX2 and AVX-512 depending on which core is active, and doesn't require OS changes (except to OSX which AFAIK still cannot reliably pin threads), right?

I totally agree that AVX-512 is helpful for power efficiency. Without it, I struggle to see how Intel/AMD are going to compete with current/future ARM and RISC-V.


It looks like they're shifting to efficiency as a target. In their latest 12th gen desktop CPUs, you could re-enable avx512 if you disabled the "efficiency" cores. Then they released a BIOS update with that feature removed. Has there been an instruction set that failed? Itanium maybe?


I don't know where you came up with 9 years. The very first CPU that had these features came out in May 2018. "Tiger Lake" has these features and it is/was Intel's mainstream laptop CPU for the past year or so. Alder Lake, their current generation, lacks these features but I think it's understandable because they had to add AVX, AVX2, BMI, BMI2, FMA, VAES, and a bunch of other junk to the Tremont design in order to make a big/little CPU that works seamlessly. Whether you think they should instead have made a heterogeneous design that is harder to use is another question.


Intel proposed AVX512 in 2013[0], with first appearance on Xeon Phi x200 announced in 2013, launched in 2016[1], and then on Skylake-X released in 2017[2]

[0]: https://en.wikipedia.org/wiki/AVX-512 [1]: https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing [2]: https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi...


Programmers aren't going to write codepaths for hardware that doesn't exist, so citing the draft proposal as being when coders would have started writing code is completely off, that's just not how code gets written at all.

Xeon Phi was always a very targeted and limited product for HPC. Nobody was doing tons of performance-sensitive JSON parsing or emulating a PS3 or ARM on a Xeon Phi. Emulator people aren't going to target hardware that doesn't exist in their target market.

In practice the first time AVX-512 was accessible to the general public was Skylake-X, so 2017. Five years ago. And it was a server/HEDT product. And it had some weird performance regressions (like the downclocking or pausing when you use too many heavy instructions) that bumped it back out of a lot of codebases where it might have hypothetically been useful.

The first consumer architecture that would have implemented it was massively late due to 10nm delays. So was the 10nm successor to Skylake-X that fixes the downclocking and other shortcomings.

Basically the answer here is "10nm". AVX-512 is a casualty of the eternal 10nm delays. Intel got stuck on Skylake forever, and couldn't push any of their other developments forward. They've had fixes for a while for a ton of the stuff people complain about, they just couldn't manufacture them at scale.

They are only just launching their first post-14nm server platform literally this year, and it's not even Intel 7/10nm ESF, it's Ice Lake based lol.


Actually AVX-512 is considerably older than AVX.

Initially AVX-512 was known as "Larrabee New Instructions".

This instruction set, which included essential features missing in both earlier and later Intel ISAs, e.g. mask registers and scatter-gather instructions, was developed a few years before 2009 by a team to which many people had been brought from outside Intel.

The "Larrabee New Instructions" have been disclosed publicly in 2009, then the first hardware implementation available outside Intel, "Knights Ferry" was released in May 2010. Due to poor performance against GPUs, it was available only in development systems.

A year later, in 2011, Sandy Bridge was launched, the first Intel product with AVX. Even though AVX had significant improvements over SSE, it was seriously crippled in comparison with the older AVX-512 a.k.a. "Larrabee New Instructions".

It would have been much better for Intel's customers if Sandy Bridge had implemented a 256-bit version of AVX-512 instead of AVX. However, Intel has always attempted to implement as few improvements as possible in each CPU generation, in order to minimize their production costs and maximize their profits. This worked very well for them as long as they did not have serious competition.

The next implementation of AVX-512 (using the name "Intel Many Integrated Cores Instructions"), was in Knights Corner, the first Xeon Phi, launched in Q4 2012. This version made some changes in the encoding of the instructions and it also removed some instructions intended for GPU applications.

The next implementation of AVX-512, which again changed the encoding of the instructions to the one used today, and which adopted the name AVX-512, was in Knights Landing, launched in Q2 2016.

With the launch of Skylake Server, in Q3 2017, AVX-512 appeared for the first time in mainstream Intel CPUs, but after removing some sets of instructions previously available on Xeon Phi.

AVX-512 is a much more pleasant ISA than AVX, e.g. by using the mask registers it is much easier to program loops when the length and alignment of data is arbitrary. Unfortunately the support for it is unpredictable, so it is usually not worthwhile to optimize for it.
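
For example, a loop over an arbitrary-length array needs no scalar tail at all. A minimal sketch, assuming only AVX-512F (the function is illustrative, not from the article):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Increment every element of an arbitrary-length int32 array; the final
    // partial vector is handled by a mask instead of a scalar tail loop.
    void add_one(int32_t *data, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            size_t remaining = n - i;
            __mmask16 m = remaining >= 16 ? (__mmask16)0xFFFF
                                          : (__mmask16)((1u << remaining) - 1);
            __m512i v = _mm512_maskz_loadu_epi32(m, data + i);   // inactive lanes read as zero
            v = _mm512_add_epi32(v, _mm512_set1_epi32(1));
            _mm512_mask_storeu_epi32(data + i, m, v);            // inactive lanes are not written
        }
    }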

Hopefully the rumors that Zen 4 supports AVX-512 are true, so its launch might be the real start of widespread use of AVX-512.


:) I'm curious what makes it not worthwhile?

From my point of view it's just an additional compilation pass of your SIMD kernels (written using github.com/highway; disclosure: I am the main author), plus shipping a few tens/hundreds KB extra code, plus ensuring your tests exercise that platform as well.


Also consumer Skylake started shipping in 2015 with some non-functional silicon area reserved for AVX-512 register files.


Skylake did not have IFMA or VBMI. The first microarchitecture with both of those was Cannon Lake, Q2 2018, which practically did not exist in the market, and the first mainstream CPU with both of these was Ice Lake, Q3 2019.


If you're curious what gets generated, Godbolt maps the C code to ASM, and provides explanations of the instructions on hover. Link below processes the source using CLANG, providing a cleaner result than GCC.

[0] https://godbolt.org/z/78KodqxaP


> The code is a bit technical, but remarkably, it does not require a table.

Those constants being used look a lot like a table.


I assume that he referred to the fact that the code does not need an array in memory, with constant values that must be accessed with indexed loads. Using such a table in memory can be relatively slow, as proven by the benchmarks for this method.

There are constants, but they are used directly as arguments for the AVX-512 intrinsics. They must still be loaded into the AVX-512 registers from memory, but they are loaded from locations already known at compile-time and the loads can be scheduled optimally by the compiler, because they are not dependent on any other instructions.

For a table stored in an array in memory, the loads can be done only after the indices are computed, and the loads may need to be done in a random order, not in the sequential order of the memory locations. When the latencies of the loads cannot be hidden, they can slow down any algorithm considerably.


If your program is sufficiently large and diverse the constants embedded in the instructions are no better than a table load/lookup. The jump to the conversion routine is unpredictable, and the routine will not be in the instruction cache, causing the CPU to stall while waiting for the instructions to load from RAM.

This will never show up in a microbenchmark where the function's instructions are always hot. In fact, a lot of microbenchmarking software "warms up" the code by calling it a bunch of times before starting to time it, to maximize the chances of them being able to ignore this reality.


The AVX-512 constants aren't embedded in the instructions, but the address to load being static means the CPU can start to cache them when the decoder gets to them, possibly even before it knows what the input to the function is.

In contrast, for the scalar code, the CPU must complete the divisions (which'll become multiplications, but even those still have relatively high latency) before it can even begin to look for what to cache.


The code itself sits in memory, which means the "table" is still in memory.


The arguments to _mm512_setr_epi64 are just reciprocals of powers of 10, multiplied by 2^52. The scalar code uses division by powers of 10, which'd compile down to similar multiplication constants; you just don't see them because the compiler does that for the scalar code, but you have to do so manually for AVX-512.

And permb_const is a list of indices for the result characters within the vector register - the algorithm works on 64-bit integers, but the result must be a list of bytes, so it picks out every 8th one.
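
To illustrate the first point (this is not the article's code, just the standard compiler trick for the scalar path): an unsigned 32-bit division by 10 compiles down to a multiply by a precomputed reciprocal and a shift.

    #include <stdint.h>

    // 0xCCCCCCCD is ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 equals x / 10
    // for every 32-bit x. Compilers emit exactly this kind of constant.
    static inline uint32_t div10(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }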


Especially given the fact that the table in the baseline code is only 200 bytes. That's less than four AVX-512 registers!


I'd be curious about use cases where this does or does not make sense. On the one hand, you saved a few nanoseconds on the integer-to-string encoder. But on the other hand you're committed to storing and transferring up to 15 leading zeros, and there must be some cost on the decoder to consuming the leading zeros. So this clearly makes sense on a write-once-read-never application but there must also be a point at which the full lifecycle cost crosses over and this approach is worse.


The article compares padded length 15 output for both cases; removing leading zeroes would have cost for both AVX-512 and regular code.


Yeah, that's the "this approach" to which I refer though. This is a micro-optimization of an approach that I'm not sure has many beneficial applications.


Ah. The article, as I see it, is primarily about AVX-512 vs scalar code, not the 15 digits though. The fixed-length restriction is purely to simplify the challenge.

To remove leading zeroes, you'd need to use one of bsf/lzcnt/pcmpistri, and a masked store, which has some cost, but still stays branchless and will probably easily be compensated by the smaller cache/storage/network usage & shorter decoding.
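
A rough sketch of one way to do that (assuming AVX-512VL/BW plus BMI1, and a hypothetical 16-byte buffer of ASCII digits with the most significant digit first; this is an illustration, not the article's code):

    #include <immintrin.h>
    #include <stddef.h>

    // Copy the digits to dst without leading zeros, keeping one digit when
    // the value is 0. Returns the number of bytes written.
    static size_t store_trimmed(char *dst, const char digits[16]) {
        __m128i v = _mm_loadu_si128((const __m128i *)digits);
        __mmask16 nz = _mm_cmpneq_epi8_mask(v, _mm_set1_epi8('0'));   // one bit per non-'0' byte
        unsigned lead = nz ? _tzcnt_u32(nz) : 15u;                    // count of leading zero digits
        size_t len = 16 - lead;
        __mmask16 m = (__mmask16)((1u << len) - 1);
        __m128i tail = _mm_maskz_loadu_epi8(m, digits + lead);        // masked load avoids reading past the buffer
        _mm_mask_storeu_epi8(dst, m, tail);                           // masked store writes only len bytes
        return len;
    }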


AVX512 has a compressed store which might be a bit easier than a normal masked store?


compress has less throughput, and probably a decent bit more latency. It would save offsetting the result pointer by the leading zero count, but I don't think that'd be enough to compensate for the slower compress. I don't know where regular masked store is done in the pipeline though, so maybe it has its own latency comparable to that of compress, but I doubt it.


Yeah, 2p5 (two uops on port 5) for a compressed store vs 1p05 (1 uop on either port 0 or port 5) for a masked move. For throughput's sake, shifting the pointer is better.


related to IEEE 754 double-precision floating-point round-trips?


That memcopy version is hilariously bad.

He could at least have made a proper assembly version for a proper comparison.


the memcopies will of course be compiled to unaligned loads & stores. Clang even seems to simplify the multiplications based on the size of the current portion of the argument, so I don't see much that could be improved. Maybe you could squeeze out a dozen or two percent more speed, but not 3.5x faster.



