Hot Chips 34: AMD’s Instinct MI200 Architecture (chipsandcheese.com)
102 points by ingve on Sept 18, 2022 | 30 comments


Very cool to see rundowns like this. As someone in the ML space, I'm a little concerned that AMD may be ceding the AI accelerator market to Nvidia. The fact that it's exposed as two separate GPUs, the focus on FP64, etc., all point to them aiming more at traditional HPC workloads than at AI. To be clear, I don't think that's a terrible move: HPC is a large market, and these cards look like they're going to be great there. My concern is that no one is really trying to challenge the Nvidia A100/H100 + CUDA ecosystem. I'd love to see better competition, both to bring down prices and to spur more innovation.

From what I've seen of Biren's new chip, it's nice but won't be competitive with any of Nvidia's chips. Maybe in a few generations it'll be more compelling, once they can build out their software stack. AMD seems to be aiming directly for HPC. Maybe Intel and their oneAPI strategy will get there? I want to root for Intel, but it's hard to have a ton of confidence after watching their performance through the late 2010s.


I think, quite frankly, AMD ceded the AI accelerator market to Nvidia the day they decided not to offer consistent compute-API support throughout their product range. ROCm support is limited to a very small set of products, and even flagship cards from recent generations are unsupported.

The future users of your highest-end HPC accelerators start with affordable hardware they can develop and test against on their desktop. No sane developer jumps headfirst into your most expensive product (with the highest power/cooling infrastructure requirements) just to test compatibility and get familiar with your APIs.

Nvidia understands this, and you can run the same basic algorithms on everything from mid-range desktop GPUs to high-end HPC accelerators (performance varies, of course). Intel somewhat understands this, although it has had major missteps in this area (e.g. limited AVX-512 support on desktop processors, a poor developer story for Xeon Phi).


I wonder how hard it would be to just sell a GPU and say it's CUDA compatible. Google built their own toolchain on top of PTX; AMD could do the same and have CUDA compatibility if they wanted. I think the difference here just might be that Google is still buying A100s, while AMD wouldn't be.

HIP/ROCm should absolutely be better supported across all AMD hardware to drive adoption; instead it seems to barely register, much like OpenACC or Vulkan compute. Intel might have better luck with oneAPI.


For PTX:

For sm_70 onwards (the first architecture with tensor cores), NVIDIA made the task significantly harder.

Those newer architectures keep a separate program counter per thread/lane (independent thread scheduling), notably so that C++ atomics and locks between threads in the same warp can make forward progress without deadlocking.

This doesn't match the execution semantics on AMD GPUs, which still use a single program counter per wavefront.

For HIP/ROCm:

I think they need an abstraction layer that can produce a single slice of binary code usable across multiple generations.

This is compounded by the fact that different dies get different binary slices on the AMD side, so the 6800 XT and 6700 XT run different code objects. ROCm only supports Navi 21 cards for RDNA2, not the other dies...

For oneAPI, OpenCL SPIR-V fulfills that role.


Reports are that the 6800 XT runs ROCm pretty well now. I don't have the hardware, but it seems like it took AMD a few years to get things sorted for RDNA/RDNA2.


Navi 21 (corresponding to the 6800/6800 XT/6900 XT consumer cards) is supported, but the 6700 XT and below are not.


I've been talking with some of the folks on the ROCm compiler team about this. It seems that each Navi 2x processor was assigned a unique architecture number just in case an incompatibility was discovered. Nobody I talked to knew of any actual incompatibilities, though nobody had done any comprehensive testing either.

You can tell HSA to pretend your GPU is Navi 21 by setting an environment variable:

    export HSA_OVERRIDE_GFX_VERSION=10.3.0
This is not a configuration that has gone through any QA testing, so I couldn't in good conscience recommend buying a GPU to use in that way. However, if you already have a 6000 series desktop GPU and you always wanted to play around with PyTorch... maybe set that variable and give it a try.
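
For reference, a minimal smoke test might look like this (a sketch assuming a ROCm build of PyTorch, with the override above exported before Python starts):

    # Hypothetical smoke test: confirm the card shows up and a kernel runs.
    import torch

    print(torch.cuda.is_available())      # ROCm devices appear through the torch.cuda API
    print(torch.cuda.get_device_name(0))  # should report the Navi 2x card

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                              # one matmul is enough to exercise the GPU
    torch.cuda.synchronize()
    print(y.mean().item())

If the matmul finishes without a crash or an HSA error, the override is at least functional for basic workloads.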


Yeah that's the workaround that some people use.

But you see the catch, right? People buy hardware to have support from the manufacturer. The no-QA part is very, very bad. :/

Nobody wants to be the one troubleshooting issues all the time, and that alone can make an NVIDIA GPU worth buying over an AMD one.

Hopefully this gets fixed in the future.

And maybe some very big past mistakes too. See the G4ad instances on AWS: they run on the Navi 12 ASIC, which never got (proper) ROCm support. Wouldn't it be awesome if a widely available AWS instance let people test their software with ROCm? The hardware is already there...


For another anecdote, it works excellently on my 6900 XT.


They could still participate in this space if they bring out a 32 GB card with ROCm support first.


The Radeon Pro W6800 has 32 GB of memory and is officially supported by ROCm.

https://www.amd.com/en/products/professional-graphics/amd-ra...


Noted! But I'm not sure if I should get that as a gaming card. The Radeon VII was more explicitly dual-use.


Ah. The W6800 has a very different set of features and performance characteristics. The Radeon VII is a better choice than the W6800 for some workloads, so it's not a clear upgrade for someone like yourself.


Ah, good to know. Yeah, that's why I'm holding out hope for the RX 7950 XT.

The goal is a card that is primarily for gaming (VR), but can also pull double duty training and running moderately large networks. I think AMD systematically underestimates the importance of that niche.


Full disclosure: I am not a pro ML'er. I dabble as a hobby and I am trying to learn as much as possible.

Question: does the innovation need to happen on the hardware side?

I ask because there is a concerted effort, at least for inference, to reduce memory requirements and provide more access to giant LLMs.[0] It seems like everyone for the past few years has just thrown more and more compute at the wall to see what sticks. Where does it stop? 3 or 4 trillion parameter models that cost a billion dollars to train? Is there a law of diminishing returns? Have we reached the apogee, or will it be the 15-trillion-parameter model that ate $100 billion?

Accessibility to all manner of models, for knuckleheads like myself, is relatively new. Maybe I should rephrase: accessibility to interesting and FUN models that mere mortals can use and hack on is relatively new. Stable Diffusion is nothing short of spectacular for just having some fun with ML/DL.

My naïve thought is that, similar to most natural systems, smaller and more specialized models that can be "glued" together are the way forward. In my mind this makes more sense than these ridiculously large language models. That is to say, purpose-built or fine-tuned models that can communicate seem like a good idea. GPT-NeoX "communicates" with Bloom, and you get a gaggle of small models that can be fine-tuned further. If regular people with a consumer GPU can train/fine-tune a model at home, this happens a lot faster.

[0] https://arxiv.org/abs/2208.07339


Beyond the comparison to the brain that the other replier gave, you also have to consider that some problems are just big. Really big. Not every AI problem is NLP or CV. For instance, in physical simulations, the representation of the system you're simulating can be on the order of tens of gigabytes (a very large mesh, a fine 3D grid, etc.). These are the kinds of problems I work on, and we benefit massively every time a new chip comes out with more memory or higher memory bandwidth, because a single training example, plus its activations, the model, and the optimizer state, can take about half a TB of memory or more depending on model size.

I'd also point out that as hardware gets better, it's good for everyone. It allows large corporations to train even larger, more capable models, and it allows smaller players to train models they couldn't have afforded before on smaller budgets. A rising tide lifts all boats, so to speak. Algorithmic and efficiency improvements are also important, but they're additive, not a replacement.

And while combining models is not novel and is certainly an avenue that should be and is being explored, the models with the most generalizability for recombination are these extremely large models! They have the capacity to learn very general patterns that are then useful for far more transfer/combination tasks.
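
To make the half-a-TB figure concrete, here's a rough back-of-envelope in Python (every number below is an illustrative assumption, not our actual workload):

    # Illustrative memory estimate for one training step; all numbers are made up.
    params = 2e9                        # a 2B-parameter surrogate model
    bytes_per_param = 4                 # fp32

    weights     = params * bytes_per_param
    grads       = params * bytes_per_param
    adam_state  = params * bytes_per_param * 2   # Adam keeps two moments per parameter
    activations = 100e9 * 4                      # say 100B activation values for one huge mesh sample

    total = weights + grads + adam_state + activations
    print(f"{total / 1e12:.2f} TB")              # ~0.43 TB under these assumptions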


Your brain has roughly a quadrillion synapses. These are (in a very rough sense) the closest thing there is to parameters in the ML sense. I think you can probably do better with smarter software, but if you want to match humans on general tasks, I doubt you can do 100x better.

So there is clearly still a ways to go.


I agree there is a way to go... I am just curious whether the answer lies in a supercomputer. I don't think it does, but I am also openly admitting my naivety.

Since you brought up the brain, I will roll with some thoughts. Instead of thinking about a hunk of wet meat that contains a quadrillion synapses... wouldn't it make sense to break that into discrete, specialized, smaller hunks of wet meat of, say, 10 billion synapses each? Then we link those discrete units under a unified command, which directs information to the correct specialized core and then the next, until some arbitrary stopping point is reached. AKA: an ensemble of models.

My point: it seems that throwing more compute at individual problems has probably reached a kind of apogee in time and cost. PaLM, Bloom, GPT-NeoX, and GPT-3 all live in silos. Cool story, but you can't use GPT-3 and PaLM at the same time without an absolute ton of overhead right now. At some point, I think there will be a unifying middleware that can utilize these models in their current form. That is, you will not need to train yet another model to combine the networks of one or more discrete models. Just a hunch... could be wrong.
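
A toy sketch of what I mean by "middleware", with lambdas standing in for actual models (everything here is hypothetical placeholder code, not a real framework):

    # Route a prompt to one of several frozen, specialized models.
    from typing import Callable, Dict

    ModelFn = Callable[[str], str]

    specialists: Dict[str, ModelFn] = {
        "code": lambda prompt: f"[code model answers] {prompt}",
        "math": lambda prompt: f"[math model answers] {prompt}",
        "chat": lambda prompt: f"[general model answers] {prompt}",
    }

    def route(prompt: str) -> str:
        # A real router could be a small classifier; keyword matching stands in here.
        if "def " in prompt or "compile" in prompt:
            return specialists["code"](prompt)
        if any(tok in prompt for tok in ("integral", "prove", "sum")):
            return specialists["math"](prompt)
        return specialists["chat"](prompt)

    print(route("prove that the sum of two even numbers is even"))

None of the constituent models has to be retrained; the glue is the only new piece.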


Custom AI accelerators are in the works that seem destined to beat general-purpose GPUs at those specific workloads by quite a lot (in performance per watt).

If they were to release a design with the same architecture, but with all compute wider than FP16 completely gone, they could probably get identical performance from a chip that is significantly smaller. I think both AMD and Nvidia will be releasing this kind of design (with more tweaks, I'd guess) in the future.


I've had the chance to test a few of the up-and-coming dedicated AI accelerators, and except for the TPU, none of them gave a speedup or perf/watt improvement that was worth the additional complexity of using something that didn't natively support CUDA. (And the reason the TPU is different is that XLA is a first-class citizen for TF/JAX, and has pretty good support in PyTorch.)

And I definitely see some future for FP16/BF16 or INT8 devices for inference, but I don’t know how widespread such devices will become for training. For many types of models, they simply won’t converge if they don’t at least use a mixed precision scheme. And there are certain problems where even for inference you get a significant benefit from using FP32 - for instance, I work on building DL surrogate models for physical simulations. We get better results with the increased dynamic range and precision.
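
For anyone curious what a mixed-precision scheme looks like in practice, here's a minimal PyTorch sketch (a toy model, not one of the surrogate models above); the weights and optimizer state stay in FP32 while the matmuls run in reduced precision:

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1)).cuda()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()     # loss scaling to avoid fp16 gradient underflow

    x = torch.randn(64, 1024, device="cuda")
    y = torch.randn(64, 1, device="cuda")

    for _ in range(10):
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():      # matmuls in fp16/bf16, reductions kept in fp32
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(opt)                     # optimizer still updates fp32 weights
        scaler.update()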


What might be needed from NN accelerator designers is not so much perf per watt as fast links between chips, between chips and memory, and between nodes in a cluster, so that a large distributed cluster of chips, each with its own memory, would appear to a user as a single chip with a huge amount of memory. Imagine 10 TB/s links everywhere in such a cluster: not only would training a large model be vastly simpler, it would also be a lot more efficient (e.g. no need for data-parallel model replication). In theory you could have a truly unified and shared memory space, with little need for synchronization.


Strongly agree. Bandwidth and latency are always the two big limiting factors, though, and unfortunately they tend to improve more slowly than compute. I think the fastest InfiniBand you can get right now is 300-400 GB/s; compare that to the H100's HBM3 memory bandwidth of up to 3 TB/s. So I don't see a time in the near future where we can naively treat our server fleets as a truly unified machine.

On the other hand, the work being done in the big three DL packages to make distributed training easier has been quite nice. I know the least about TF, but their DTensor looks promising. JAX's entire distributed paradigm using pmap/xmap etc. makes certain classes of models very easy to distribute. The one I'm following most closely, though, is PyTorch and their sharded tensor, and it looks like they're planning on implementing native distributed ops powered by their RPC framework, which should make full tensor and model parallelism significantly easier.
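
As a sense of how compact plain data parallelism already is in PyTorch, here's a minimal DDP sketch (toy model; launched with torchrun so LOCAL_RANK is set):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")              # the "nccl" backend maps to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    x = torch.randn(32, 1024, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()                              # DDP overlaps the all-reduce with backward
    opt.step()

    dist.destroy_process_group()

Tensor and pipeline parallelism are where it still gets painful; that's the part the sharded tensor work should help with.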


Technically, 8 H100 CNX cards on a Gen5 128 GB/s backplane can get one card a third of the HBM bandwidth. Given that Gen 6 and 7 are following much more quickly than the Gen 3 to Gen 4 transition did, we may be there in a couple of years. I'm hoping that Gen 6/7 will significantly affect the cost of the high end by making specialized boards less attractive and leveraging commodity switching.


I think neuromorphic stuff like Rain and others will speed past Nvidia so fast it won’t make a difference. AI screams for analog.


I think for inference, analogue makes a ton of sense, and within a pretty short timeline (10 years maybe?) we’ll see it deployed for those workloads.

For training, I’m certainly interested but not at all convinced that it will dominate. I work on multiple projects right now where the reduced dynamic range of analogue signals would be a complete non-starter given the problem domain. I’m not sure how they get around that.


There will be no hardware product able to run inference purely in the analog domain in the foreseeable future. There is simply no good, practical way to store analog signals yet, and unfortunately activations must be stored in models like transformers or convolutional networks.

We already have mixed-signal accelerators (e.g. Mythic), which are not much faster than the digital competition (for many reasons).

From this point of view there's no difference between inference and training, especially as the FP8 format is being adopted by Nvidia and others.


The big problem with analog is power. We have 40 years of experience building gates that are either on or off and use virtually no power in those states, because no current flows. The problem with analog is Ohm's law: you need current flowing to make analog work.


I am very thankful that this site exists and does these architecture deep dives. Ever since Ian and Andrei left AnandTech, it has been lacking in that territory, and this site is a good replacement.


Curious about workloads for these, I came across a stunning video simulating airflow around the space shuttle using 4x AMD Instinct MI250 for the computation: https://www.youtube.com/watch?v=5AzxwQpng0M


The HSA-rebirth news is pretty good. CDNA3 + Zen (Zen 3? Zen 4??) with shared HBM packages means extremely fast GPU/CPU communication, the likes of which we've probably never seen before.



