
Oversimplifying a bit, CPUs are smart and can do a lot of different things, and a lot of their chip area is devoted to that intelligence. GPUs are "dumb", but their chip area is heavily devoted to absolutely enormous throughput on really big batches. Looking at AWS prices right now, I can pay ~$4/hour for 50 4 GHz CPU cores or for 5,000 1.75 GHz GPU cores. Counting raw core-GHz, that makes the best-case scenario for GPUs roughly 44 times faster. In practice you'll rarely achieve that, but GPUs can be tremendously faster for extremely parallelizable tasks (which graphics are).
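Spelling out that best-case arithmetic (the core counts and clock speeds are the figures quoted above, not exact AWS SKUs):

```python
# Back-of-the-envelope throughput in "core-GHz" (cores x clock speed),
# using the AWS figures quoted above; rough numbers, not a benchmark.
cpu_cores, cpu_ghz = 50, 4.0
gpu_cores, gpu_ghz = 5_000, 1.75

cpu_core_ghz = cpu_cores * cpu_ghz   # 200.0
gpu_core_ghz = gpu_cores * gpu_ghz   # 8750.0

best_case_speedup = gpu_core_ghz / cpu_core_ghz
print(best_case_speedup)  # 43.75
```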


Yes. If you're doing the same calculation on a whole lot of different inputs, GPUs absolutely eat that for breakfast.

In my game, I have a complex procedural generation process that runs while loading a game, and since it's not a graphics process, I originally did it on the CPU. It took about three seconds to build the data in parallel across seven background threads on my quad-core processor. But testers on low-end dual-core i5s had only one background thread for the same calculation, and typically reported that the procgen took multiple minutes to complete.

After spending a week refactoring the algorithm to do the same calculation on the GPU instead of the CPU (basically by pretending it was a rendering calculation and writing the results out into a "texture" that we could then read back), calculation times dropped from seconds or even minutes to just fractions of a millisecond, even on low-spec machines.
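The general shape of that trick: make each output element a pure function of its coordinates, so every "texel" can be computed independently. A minimal CPU-side sketch of that shape (the hash-noise `procgen` function and texture size here are made-up stand-ins, not the game's actual code):

```python
# Each "texel" is a pure function of its (x, y) coordinate, so every
# element can be computed independently -- exactly the shape a fragment
# shader wants. On the GPU, the loop body becomes the shader and the
# loops themselves disappear.
# procgen() is a made-up stand-in for the real generation step.

def procgen(x, y, seed=12345):
    # Toy integer hash noise: any pure function of (x, y) works here.
    h = (x * 374761393 + y * 668265263 + seed) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    return (h ^ (h >> 16)) / 2**32  # value in [0, 1)

WIDTH, HEIGHT = 256, 256
texture = [[procgen(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]
```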

The calculation that I previously had to hide behind a loading screen was now quick enough that I could freely do it at runtime without even causing a blip in the frame rate. If you've got a problem that they can handle, GPUs are kind of astonishingly fast, even the (by modern standards) low-end ones.


Very true!

If you look at what most graphics pipelines do for one pixel, the amount of calculation is not very much: maybe a few dozen instructions. But at 1080p you do that for about two million pixels per frame. GPUs are exceedingly good at running semi-small programs over and over across 2,000+ compute units. On a CPU you may get 64 cores at best if you have a super nice top-of-the-line part (in reality, more like 2 or 4). Where the crossover happens varies considerably across workloads and the instructions used; in most cases it currently heavily favors the GPU. Throw in branching or something like that and the CPU might become more favorable. But you still have to try it out.
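For scale (the per-pixel instruction count is just an illustrative stand-in for "a few dozen"):

```python
# Rough scale of per-frame shading work at 1080p.
pixels = 1920 * 1080            # 2,073,600 -- about two million
instructions_per_pixel = 36     # "a few dozen" (illustrative guess)

per_frame = pixels * instructions_per_pixel
print(pixels)     # 2073600
print(per_frame)  # 74649600 -- tens of millions of instructions per frame
```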

In the case of this article, they are using hashing/caching, which, yeah, should produce a fairly nice speedup: basically the old trick of doing the work once and keeping the result. But that might not translate very nicely to the GPU. You could get it to run, but it may not be as performant. In the game world, the equivalent was what we used to do with sin/cos: instead of calling the instruction, we precalculated the most common cases and kept a copy lying around in an array, so it was just a memory lookup with very little compute. BUT that does come at a cost if you have to branch on a miss.
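That sin/cos trick, as a minimal sketch (the table size here is an arbitrary choice):

```python
import math

# Classic precomputed sine table: do the work once, keep the result,
# and turn sin() calls into a cheap memory lookup.
N = 4096  # table resolution (arbitrary choice)
SIN_LUT = [math.sin(2 * math.pi * i / N) for i in range(N)]

def fast_sin(x):
    # Map the angle onto a table index (nearest sample, wrapping).
    i = int(x / (2 * math.pi) * N) % N
    return SIN_LUT[i]
```

Accuracy is limited by the table step (2π/N here), which was fine for the common game-math cases.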

Now, if you could combine the two ideas, maybe with some sort of mask telling the GPU "don't do any work here, it's done already, work on something else" and pre-filling those results, this could be an even more interesting idea.
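One way to picture that combination, with a cache standing in for the GPU-side mask (a toy sketch; on a real GPU the mask might live in a stencil buffer or a texture):

```python
# Toy sketch of "mask off work that's already done": a cache plays the
# role of the mask here; on a GPU it might be a stencil buffer or texture.
cache = {}

def compute(x):
    return x * x  # stand-in for the expensive per-element work

def run(inputs):
    out = []
    for x in inputs:
        if x not in cache:       # not masked: do the work
            cache[x] = compute(x)
        out.append(cache[x])     # masked entries are just read back
    return out
```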


True. Of course the need for all that parallel processing can sometimes be replaced with intelligence, such as the new video codecs that leverage machine learning. Maybe CPUs will be better for video if those codecs start to dominate.


What do you mean by GPU cores here?

In their marketing terminology, the jesters at NVidia are calling SIMD lanes "cores".

For FP32 operations, 50 AVX-512 cores have 50 × (512/32) = 800 SIMD lanes. And of course there are about 64 cores per x86 server CPU now at the high end.


By cores, I mean what NVidia calls "NVidia CUDA Cores", e.g. there are 5,120 on the Tesla V100.


Ok. So translated to CPU terms, there would seem to be 80 cores ("SMs") in a V100, each of which can do up to 64-lane SIMD/SIMT with FP32 data. [1]

(SIMT for NVidia seems to be just a programming model that compiles to SIMD instructions[2], a bit like what you get with ispc on CPUs)

[1] https://images.nvidia.com/content/volta-architecture/pdf/vol... pages 17 and 10

[2] https://www.realworldtech.com/forum/?threadid=195094&curpost...
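The counting in this subthread, spelled out (the 80 SM / 64-lane figures are from the Volta whitepaper cited above):

```python
# "CUDA cores" counted as FP32 SIMD lanes, per the Volta figures above.
v100_sms = 80             # streaming multiprocessors ("SMs")
fp32_lanes_per_sm = 64    # FP32 "CUDA cores" per SM
cuda_cores = v100_sms * fp32_lanes_per_sm
print(cuda_cores)  # 5120

# Same accounting for the AVX-512 example upthread.
avx512_cores = 50
simd_lanes = avx512_cores * (512 // 32)  # 32-bit lanes per core
print(simd_lanes)  # 800
```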


Yeah, although a "Streaming Multiprocessor" is still less general than a CPU core IIUC.



