Fond memories of Windows XP x64 on AMD Athlon 64. To think that I actually ran into the problem of no 16-bit executable support, lol. I reverse-engineered my first 64-bit executable with this processor: Photoshop.
Throwing ALUs and load/store units at the problem, specifically, seems like a great and obvious idea in retrospect. It seems to have become conventional wisdom. They don't take much area compared to all the scheduling and bookkeeping hardware that tries to keep them busy.
I was thinking specifically of Apple's M1 and recent upgrades from AMD and Intel, which all "go wider" (more execution units), and it works well for them. In some cases, these units can't even be kept fed in steady state, only, e.g., while catching up from a misprediction or cache miss.
I noticed that K8's 3-wide pipeline has 3 AGUs alongside the ALUs, which led me to wonder how modern RISC architectures handle address calculations. Do they still have separate AGU circuitry?
This was the first processor family I used for my own builds when I got into the hobby in 2004. I came of age when AMD was on top. I've only had two Intel processors in my main machines over the past 18 years - a Pentium D 805 (cheap dual core vs. AMD's offerings) and a Haswell in 2014 (K10 was a disaster).