Hacker News
Code Optimization Techniques for Graphics Processing Units (hgpu.org)
31 points by ColinWright on Dec 17, 2011 | hide | past | favorite | 3 comments


Link to source, which includes slides and examples: http://homepages.dcc.ufmg.br/~fpereira/classes/gpuOpt/

You might also be interested in the work of a prof at the University of Alberta, Jose Nelson Amaral.

A Complete Description of the UnPython and Jit4GPU Framework https://www.cs.ualberta.ca/system/files/tech_report/2011/Gar...

Jit4OpenCL: A Compiler from Python to OpenCL http://webdocs.cs.ualberta.ca/~amaral/thesis/XunhaoLiMSc.pdf


Would it be economical to manufacture a GPU with hundreds, thousands, or even tens of thousands of processing elements, but with a much lower clock (say, 100 MHz)?

Cores are physically quite small; a lower clock rate reduces power issues, and I suspect it would also increase yield rates (perhaps by using thicker structures with a finer process, e.g. a 45nm design on a 32nm process).

One barrier may be that practitioners have few techniques for exploiting such massive parallelism (a catch-22). OTOH, it seems certain that the manufacturers have done their sums and worked out that they can deliver greater performance with their present number/clock tradeoff.
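The "lower clock reduces power" point can be sketched with the usual CMOS rule of thumb (an assumption for illustration, not something stated in the thread): dynamic power scales roughly as C·V²·f, and a lower clock often permits a proportionally lower supply voltage, so dynamic power can fall near-cubically with frequency.

```python
# Back-of-envelope sketch, not vendor data. Dynamic CMOS power scales
# roughly as P ~ C * V^2 * f. The assumption that voltage can track
# frequency linearly is a simplification (real parts hit a voltage floor).
def relative_dynamic_power(freq_scale, voltage_scale=None):
    """Dynamic power relative to baseline after scaling clock/voltage."""
    if voltage_scale is None:
        voltage_scale = freq_scale  # crude assumption: V scales with f
    return (voltage_scale ** 2) * freq_scale

# Dropping a part to 10% of its clock, with voltage following:
p_scaled = relative_dynamic_power(0.1)   # ~0.001 of baseline dynamic power

# Same clock drop but voltage held fixed: only the linear factor remains.
p_fixed_v = relative_dynamic_power(0.1, voltage_scale=1.0)  # 0.1 of baseline
```

Under these assumptions, ten slow cores at 100 MHz could in principle do the work of one 1 GHz core in a fraction of the power budget, which is roughly the intuition behind the question.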


Yield problems hurt superlinearly as you scale up the size of a chip. That's why the fastest graphics cards have two GPUs on board, and why two mid-range cards will usually offer a better price/performance ratio than the biggest, fastest single-GPU card. Besides, the biggest GPUs out there are already little more than arrays of hundreds or thousands of vector processors: AMD's current biggest is 2.6B transistors divided among 1536 shader processors running at 800-900 MHz, and NVidia's biggest is 3B transistors divided among 512 shader processors running at 1.5 GHz.
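The superlinear penalty can be illustrated with a simple Poisson defect model (an illustrative assumption; real fabs use more elaborate models, and the defect density used here is made up, not a real process number): yield Y = exp(-D·A), so doubling die area squares the yield fraction.

```python
import math

# Illustrative Poisson defect model: Y = exp(-D * A), with defect
# density D in defects/cm^2 and die area A in cm^2. D = 0.5 is an
# assumed value chosen only to make the effect visible.
def die_yield(area_cm2, defect_density=0.5):
    """Fraction of dies expected to be defect-free."""
    return math.exp(-defect_density * area_cm2)

big   = die_yield(5.2)   # one ~520 mm^2 die
small = die_yield(2.6)   # a half-size die

# Under this model, yield for a die of area 2A is the square of the
# yield at area A -- halving die size more than doubles the fraction
# of good silicon, which is the superlinear effect described above.
```

This is why splitting a design across two smaller GPUs can beat one giant die on cost, even before counting the wafer-edge waste discussed below.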

That NVidia monster has a die size of about 520 mm^2. At that size, there's already a lot of waste due to the fact that wafers are round and the chips are rectangular, and that can only be reduced by making physically smaller chips. (Rumor has it that by the time NVidia's 529mm^2 GF100 chip was originally supposed to launch, yields were bad enough that they were getting only about 2 usable chips per 300mm wafer. The cost of chip production has a pretty much linear relationship with the number of wafers processed, so that really hurt.)
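The round-wafer edge loss can be estimated with the standard dies-per-wafer approximation (this counts geometric die sites only, ignoring defects and scribe lines; the 300 mm wafer and ~520 mm² die size are taken from the comment above):

```python
import math

# Standard dies-per-wafer approximation: the wafer's area divided by the
# die area, minus an edge-loss term for rectangular dies on a round wafer.
def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Approximate number of whole die sites on a round wafer."""
    r = wafer_diameter_mm / 2
    return (math.pi * r * r / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

big   = dies_per_wafer(300, 520)   # sites for a ~520 mm^2 die: ~107
small = dies_per_wafer(300, 130)   # a quarter-size die: ~485 sites

# Four quarter-size dies cover the same silicon as one big die, yet the
# wafer fits noticeably more than 4x as many of them: the edge-loss term
# shrinks as dies get smaller.
```

Combined with the yield model upthread, this is why a 520+ mm² die is so expensive per good chip: you lose at both the wafer edge and the defect lottery.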



