
Oversimplifying a bit, CPUs are smart and can do a lot of different things, and a lot of their chip area is devoted to that intelligence. GPUs are "dumb", but their chip area is heavily devoted to absolutely enormous throughput on really big batches. Looking at AWS prices right now, I can pay ~$4/hour for 50 4 GHz CPU cores or for 5,000 1.75 GHz GPU cores. Counting raw core-GHz, that makes the best-case scenario for GPUs roughly 44 times faster. In practice you'll rarely achieve that, but GPUs can be tremendously faster for extremely parallelizable tasks (which graphics are).
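Spelling out that best-case arithmetic (the core counts and clock speeds are the figures quoted above, not exact AWS SKUs):

```python
# Back-of-the-envelope throughput in "core-GHz" (cores x clock speed),
# using the AWS figures quoted above; rough numbers, not a benchmark.
cpu_cores, cpu_ghz = 50, 4.0
gpu_cores, gpu_ghz = 5_000, 1.75

cpu_core_ghz = cpu_cores * cpu_ghz   # 200.0
gpu_core_ghz = gpu_cores * gpu_ghz   # 8750.0

best_case_speedup = gpu_core_ghz / cpu_core_ghz
print(best_case_speedup)  # 43.75
```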


Yes. If you're doing the same calculation on a whole lot of different inputs, GPUs absolutely eat that for breakfast.

In my game, I have a complex procedural generation process that runs while loading a game, and since it's not a graphics process, I originally did it on the CPU. It took about three seconds to build the data in parallel across seven background threads on my quad-core processor. But testers on low-end dual-core i5s had only one background thread for the same calculation, and typically reported that the procgen took multiple minutes to complete.

After spending a week refactoring the algorithm to do the same calculation on the GPU instead of the CPU (basically by pretending it was a rendering calculation and writing the results out into a "texture" that we could then read back), calculation times dropped from seconds or even minutes to just fractions of a millisecond, even on low-spec machines.
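The general shape of that trick: make each output element a pure function of its coordinates, so every "texel" can be computed independently. A minimal CPU-side sketch of that shape (the hash-noise `procgen` function and texture size here are made-up stand-ins, not the game's actual code):

```python
# Each "texel" is a pure function of its (x, y) coordinate, so every
# element can be computed independently -- exactly the shape a fragment
# shader wants. On the GPU, the loop body becomes the shader and the
# loops themselves disappear.
# procgen() is a made-up stand-in for the real generation step.

def procgen(x, y, seed=12345):
    # Toy integer hash noise: any pure function of (x, y) works here.
    h = (x * 374761393 + y * 668265263 + seed) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    return (h ^ (h >> 16)) / 2**32  # value in [0, 1)

WIDTH, HEIGHT = 256, 256
texture = [[procgen(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]
```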

The calculation that I previously had to hide behind a loading screen was now quick enough that I could freely do it at runtime without even causing a blip in the frame rate. If you've got a problem that they can handle, GPUs are kind of astonishingly fast, even the (by modern standards) low-end ones.


Very true!

If you look at what most graphics pipelines do for one pixel, the amount of calculation is not very much: maybe a few dozen instructions. But at 1080p you do that for about two million pixels per frame. GPUs are exceedingly good at running semi-small programs over and over across 2,000+ compute units. On a CPU you may get 64 cores at best if you have a super nice top-of-the-line part (in reality, more like 2 or 4). Where the crossover happens varies considerably across workloads and the instructions used; in most cases it currently heavily favors the GPU. Throw in branching or something like that and the CPU might become more favorable. But you still have to try it out.
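For scale (the per-pixel instruction count is just an illustrative stand-in for "a few dozen"):

```python
# Rough scale of per-frame shading work at 1080p.
pixels = 1920 * 1080            # 2,073,600 -- about two million
instructions_per_pixel = 36     # "a few dozen" (illustrative guess)

per_frame = pixels * instructions_per_pixel
print(pixels)     # 2073600
print(per_frame)  # 74649600 -- tens of millions of instructions per frame
```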

In the case of this article, they are using hashing/caching, which, yeah, should produce a fairly nice speedup: basically the old trick of doing the work once and keeping the result. But that might not translate very nicely to the GPU. You could get it to run, but it may not be as performant. In the game world, the equivalent was what we used to do with sin/cos: instead of calling the instruction, we precalculated the most common cases and kept a copy lying around in an array, so it was just a memory lookup with very little compute. BUT that does come at a cost if you have to branch on a miss.
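That sin/cos trick, as a minimal sketch (the table size here is an arbitrary choice):

```python
import math

# Classic precomputed sine table: do the work once, keep the result,
# and turn sin() calls into a cheap memory lookup.
N = 4096  # table resolution (arbitrary choice)
SIN_LUT = [math.sin(2 * math.pi * i / N) for i in range(N)]

def fast_sin(x):
    # Map the angle onto a table index (nearest sample, wrapping).
    i = int(x / (2 * math.pi) * N) % N
    return SIN_LUT[i]
```

Accuracy is limited by the table step (2π/N here), which was fine for the common game-math cases.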

Now, if you could combine the two ideas, maybe with some sort of mask telling the GPU "don't do any work here, it's done already, work on something else" and pre-filling those results, this could be an even more interesting idea.
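One way to picture that combination, with a cache standing in for the GPU-side mask (a toy sketch; on a real GPU the mask might live in a stencil buffer or a texture):

```python
# Toy sketch of "mask off work that's already done": a cache plays the
# role of the mask here; on a GPU it might be a stencil buffer or texture.
cache = {}

def compute(x):
    return x * x  # stand-in for the expensive per-element work

def run(inputs):
    out = []
    for x in inputs:
        if x not in cache:       # not masked: do the work
            cache[x] = compute(x)
        out.append(cache[x])     # masked entries are just read back
    return out
```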


True. Of course the need for all that parallel processing can sometimes be replaced with intelligence, such as the new video codecs that leverage machine learning. Maybe CPUs will be better for video if those codecs start to dominate.


What do you mean by GPU cores here?

In their marketing terminology, the jesters at NVidia are calling SIMD lanes "cores".

For FP32 operations, 50 AVX-512 cores have 50 × (512/32) = 800 SIMD lanes. And of course there are about 64 cores per x86 server CPU now at the high end.


By cores, I mean what NVidia calls "NVidia CUDA Cores", e.g. there are 5,120 on the Tesla V100.


Ok. So translated to CPU terms, there would seem to be 80 cores ("SMs") in a V100, each of which can do up to 64-lane SIMD/SIMT with FP32 data. [1]

(SIMT for NVidia seems to be just a programming model that compiles to SIMD instructions[2], a bit like what you get with ispc on CPUs)

[1] https://images.nvidia.com/content/volta-architecture/pdf/vol... pages 17 and 10

[2] https://www.realworldtech.com/forum/?threadid=195094&curpost...
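The counting in this subthread, spelled out (the 80 SM / 64-lane figures are from the Volta whitepaper cited above):

```python
# "CUDA cores" counted as FP32 SIMD lanes, per the Volta figures above.
v100_sms = 80             # streaming multiprocessors ("SMs")
fp32_lanes_per_sm = 64    # FP32 "CUDA cores" per SM
cuda_cores = v100_sms * fp32_lanes_per_sm
print(cuda_cores)  # 5120

# Same accounting for the AVX-512 example upthread.
avx512_cores = 50
simd_lanes = avx512_cores * (512 // 32)  # 32-bit lanes per core
print(simd_lanes)  # 800
```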


Yeah, although a "Streaming Multiprocessor" is still less general than a CPU core IIUC.



