
CPUs are still king at the scale DreamWorks/Pixar/etc. operate at. GPUs are faster up to a point, but they hit a wall in extremely large and complex scenes: they just don't have enough VRAM, or the work is too divergent and the batches too small to keep all the threads busy. In recent years the high-end renderers (including MoonRay) have started supporting GPU rendering alongside their traditional CPU modes, but the GPU mode is meant for smaller-scale work like an artist iterating on a single asset; for larger tasks and final-frame rendering it's still off to the CPU farm.

Pixar did a presentation on bringing GPU rendering to Renderman, which goes over some of the challenges: https://www.youtube.com/watch?v=tiWr5aqDeck



So the idea is that one CPU can have hundreds of gigabytes of RAM at a time, and the speed of the CPU is no problem because you can scale the process over as many CPUs as you want?


Fundamentally yes.

Big studios are CPU farms. Small studios and indie artists like myself have largely moved to GPU.


Is that because work is usually divided by frame? And a frame for these big movies usually needs more than a typical GPU's VRAM?


It's more to do with the movies using higher-fidelity assets than you'd typically use for a game, which is what GPUs are made for. In a movie, a single element of a scene might have as many polygons as an entire character in a video game, and that's because of the differences in how you 'film' them.

Imagine a brick wall rendered for a video game vs. one rendered for a movie. The one for the game is probably going to be a plane with a couple of textures on it, because the wall is a background element the player isn't going to get up close to. In the movie, the wall is more likely to be made of individually modeled bricks with much more detailed surface textures, because maybe the director wants the camera to start really close to the surface of that wall and pull out to a wider shot. That means the individual brick you start zoomed in on might have a 4K texture for itself alone, whereas the entire wall in the video game could easily share a single 4K texture, since the player never gets close enough to notice the missing detail.

Now multiply that level of detail across every rendered thing in the scene, because the director may want to reframe the shot, or you need realistic lighting to sell that a rendered thing is integrated with filmed footage and 'real'. Every bit of that detail adds more data you have to track. So in my wall example, you might have two or three 4K textures for the game wall vs. literally hundreds for all the bricks, grout, defects, chipped faces, etc. of a movie-quality wall.
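To put rough numbers on that (the texture counts here are made up purely for illustration), a quick back-of-the-envelope in Python:

  # Illustrative only: uncompressed 4K RGBA textures at 8 bits per channel,
  # ignoring mipmaps and texture compression.
  bytes_per_4k_texture = 4096 * 4096 * 4          # ~64 MiB each

  game_wall  = 3   * bytes_per_4k_texture         # a few atlas textures
  movie_wall = 300 * bytes_per_4k_texture         # per-brick textures, grout, wear, ...

  print(game_wall  / 2**30)                       # ~0.19 GiB
  print(movie_wall / 2**30)                       # ~18.75 GiB, already past most GPUs' VRAM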


The work is always divided per frame, yes. Everything is baked so that this splits easily, even with water simulations: they've been reduced down to some form of geometry or something similar that can be rendered frame by frame.

The VRAM is indeed one of the main issues. But as someone else said, I believe cost per final pixel is still lower on CPU. That was particularly true during the GPU shortage.
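A minimal sketch of what "divided per frame" looks like in practice; the "render" command below is a hypothetical stand-in, not any particular renderer's CLI:

  # Hypothetical farm dispatch: every frame is an independent job, so
  # throughput scales with however many CPU workers you throw at it.
  from multiprocessing import Pool
  import subprocess

  def render_frame(frame):
      # Stand-in for submitting one baked frame to a render node.
      subprocess.run(["render", "shot.scene", "--frame", str(frame)], check=True)

  if __name__ == "__main__":
      with Pool(processes=64) as pool:               # 64 = "as many CPUs as you want"
          pool.map(render_frame, range(1001, 1241))  # frames 1001-1240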


What's your opinion on renderers such as Redshift which explicitly target production rendering and support out of core rendering on GPUs? See e.g. https://www.maxon.net/en/redshift/features?categories=631816 (Disclosure: I work on this.)


As someone said above: GPUs are fine and faster as long as your scene stays simple. As soon as you hit a certain scene-complexity ceiling, they become much slower than CPU renderers.

I would also argue that for this specific task, i.e. offline rendering of such frames, the engineering overhead needed to make stuff work on GPUs is better spent making stuff faster and scale more efficiently on CPUs.[1]

I worked in blockbuster VFX for 15 years. It's been a while, but I have a network of people in that industry, many working on these renderers. The above is kinda the consensus whenever I talk to them.

[1] With the aforementioned caveat: if the stuff you work on always stays under that complexity ceiling, targeting GPUs can certainly make sense.


So we just need GPUs with 128GB of RAM then? Or move towards the Apple M-series design, where the CPU and GPU both have insanely fast access to all of the RAM...


It's easier said than done; there's consistently a huge gulf between CPU and GPU memory limits. Even run-of-the-mill consumer desktops can run 128GB of RAM, which exceeds even the highest-end professional GPUs' VRAM, and the sky is the limit with workstation and server platforms. AMD EPYC can support 2TB of memory!


It's not just about memory. The path-tracing algorithm is a natural fit for CPU threads but very difficult to map efficiently onto GPU threads. It's very easy to leave many of your GPU's threads idle due to divergence, register pressure, and any number of other things that come naturally with path tracing.
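A toy way to see the divergence cost (this is not GPU code; it just models how a 32-wide warp serializes when its rays want different shader branches):

  import random

  WARP = 32
  # Each ray in the warp hit a different material and wants a different shader.
  materials = [random.choice(["diffuse", "glass", "metal", "hair", "volume"])
               for _ in range(WARP)]

  # A SIMT warp runs each distinct branch one after another, with only the
  # matching lanes active and the rest masked off (idle but occupied).
  distinct_branches = len(set(materials))
  useful_lane_steps = WARP                      # ideal: one pass, every lane busy
  actual_lane_steps = WARP * distinct_branches  # reality: one pass per distinct branch

  print(f"{distinct_branches} branches -> ~{useful_lane_steps / actual_lane_steps:.0%} utilization")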


GPUs need RAM that can handle a lot of bandwidth, so that all of the execution units can remain constantly fed. Bandwidth has both a width and a transfer rate (often bounded by the clock speed), which combined yield the overall bandwidth, i.e. a 384-bit bus at XXXX million transfers/second. It will never matter how much compute or RAM a GPU has if these don't align and you can't feed the cores. Modern desktop DDR has bandwidth that is, in general, far too low for this, given the compute characteristics of a modern GPU, which has shitloads of compute. On top of that, signal integrity on parallel RAM interfaces has very tight tolerances; DDR sockets are placed on motherboards very carefully with this in mind. GDDR, which most desktop-class graphics cards use instead of normal DDR, has much higher bandwidth (e.g. GDDR6X offers 21 Gbps per pin, while DDR5-4800 is only around 4.8 Gbps per pin) but even tighter interface characteristics. That's one reason why you can't socket GDDR: the physical tolerances required for the interface are extremely tight, and the signal integrity required means a socket is out of the question.
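Since bandwidth is just width times per-pin rate, the gap is easy to put numbers on (the per-pin rates below are typical figures, not exact for every part):

  def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
      # GB/s = (bus width in bits / 8 bits per byte) * Gbps per pin
      return bus_width_bits / 8 * gbps_per_pin

  print(bandwidth_gb_s(384, 21))    # 384-bit GDDR6X @ 21 Gbps/pin     -> ~1008 GB/s
  print(bandwidth_gb_s(128, 4.8))   # dual-channel DDR5-4800 (128-bit) -> ~76.8 GB/s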

For a concrete comparison, look at the RAM interfaces of an Nvidia A100 (HBM2e) versus an Nvidia RTX 3080 (GDDR6X), and see how this impacts performance. On bandwidth-hungry compute workloads, an A100 will absolutely destroy a 3080 in terms of overall efficiency. One reason is that the A100's memory interface is 5120 bits wide versus the 3080's 320-bit bus, so far more data can be fed into the execution units in the same clock cycle. That means you can clock the memory system much lower, and that means you're using less power, while achieving similar (or better) effective bandwidth. The only way a narrow GDDR bus can compete with a wide HBM interface is by pushing the clocks and per-pin rates much higher, but that costs more heat and power, and it scales very poorly in practice; e.g. a 10% clock-speed increase might result in a measly 1-2% improvement.
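Plugging in the published interfaces (A100 80GB: 5120-bit HBM2e at roughly 3.2 Gbps per pin; RTX 3080: 320-bit GDDR6X at 19 Gbps per pin) shows how the A100 buys its bandwidth with width rather than clocks:

  # Same width/8 * per-pin-rate arithmetic as above.
  print(5120 / 8 * 3.2)   # ~2048 GB/s from a very wide, slower-clocked HBM interface
  print(320  / 8 * 19)    # ~760 GB/s from a narrow bus pushed to high per-pin rates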

So now a bunch of things fall out of these observations. You can't have extremely high-bandwidth RAM today without very tight interface characteristics. Desktop and server-class CPUs don't need bandwidth like GPUs do, so they can get away with sockets. That has some knock-on benefits: CPU memory benefits from the economies of scale of selling RAM sticks (lots of people need RAM sticks, so you're in a good spot to buy more), and because sockets extend "in three dimensions", there's a huge increase in density per square inch of motherboard.

If you want a many-core GPU to remain fed, you need soldered RAM, which necessitates a fixed SKU for deployment, or you need to cut down on the compute so lower-bandwidth memory can feed things appropriately, negating the reason you went to GPUs in the first place (more parallel compute). Soldered RAM also means the compute/memory ratio is fixed forever. One nice thing about a CPU with sockets is that you can more flexibly arbitrage resources over time; if you find a way to speed something up with more RAM, you can just add it, assuming you aren't maxed out.

Note that Apple Silicon is designed for lower power profiles; it has great perf/watt, not necessarily the best absolute performance in every profile. It uses a 256- or 512-bit LPDDR5 interface on the Pro and Max parts, and goes as wide as 1024-bit(!!!) on the Ultra. But they can't just ignore the laws of physics; at extremely high bandwidths and bus widths you're going to be very subject to signal-interface requirements. There are physical limitations that rule out bountiful RAM sticks, each carrying multiple juicy Samsung DDR5 memory chips. The density suffers. So Apple is limited to only so much RAM; there's very little way around this unless they start stacking in three dimensions or something. That's likely one of the other reasons they have used soldered memory for so long now: it simply makes extremely high-performance interfaces like this possible.
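The same arithmetic lines up with Apple's advertised figures, assuming LPDDR5-6400 (6.4 Gbps per pin):

  # width/8 * per-pin rate, as before.
  print(256  / 8 * 6.4)   # ~205 GB/s  (Pro-class parts)
  print(512  / 8 * 6.4)   # ~410 GB/s  (Max-class parts)
  print(1024 / 8 * 6.4)   # ~819 GB/s  (Ultra: two Max dies)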

All in all, the economies of scale for RAM sticks, combined with their density, mean that GPUs will probably continue to be worse for workloads that benefit from lots of memory. You just can't meet the combined physical-interface and bandwidth requirements at the same density levels.


I created my account just to reply to this comment.

This is a great read. Where can I read more about this stuff? I love the insight you have here.


Fantastic comment, thank you!

Do you think there’s any hope for UMA on PC / x86 systems? Seems like Intel would have an incentive to offer parts, but would it be possible to remain Windows/legacy OS compatible with a UMA implementation?


Amazing post - thank you so much for typing that out.


Thoughts on HBM that Intel has been touting?


Redshift is cool; I use it, and I know many studios that use it.

It just feels "young". As a 3D artist, when I see the absurd level of shader detail and edge-case solutions in something like V-Ray, I know that level of software detail only comes from the years it's been in the field, taking in and solving customer feedback.


Those are generally being used on much smaller productions, or at least on "simpler"-fidelity things (i.e. non-photoreal CG animation like Blizzard's Overwatch).

So for Pixar/DreamWorks-style things (which look great, but aren't photo-real) they're usable and provide a definite benefit in terms of iteration time for lookdev artists and lighters, but they're not there yet for high-end rendering at scale.


I think it's mostly a question of price currently. AMD CPUs are much cheaper per pixel produced than GPUs.



