Geometry shader - a threat for high performance

My suspicion is that the small local memory size in the original 8000 and GT200 series cards (16K) is why geometry shader performance is so poor on these earlier Nvidia architectures. The Fermi generation increased this to 64K. If the geometry shader stores its output vertices in local shared memory (which I suspect it does as every Nvidia CUDA tutorial hammers home the point that you need to use shared memory for optimal performance), then the smaller 16K size is likely limiting the number of threads that can be run at a time.

For example, if a single vertex requires 6 vec4s of storage and you output 3 vertices, each thread would require 6x4x4x3 = 288 bytes of vertex output storage. That means the maximum number of threads that can run simultaneously on a GT200 is 56, while on Fermi it would be 227 (assuming nothing else uses shared memory, which is probably a bit optimistic). With fewer threads in flight, the GT200 is more sensitive to memory latency. After a certain point, running more threads won’t improve performance, so perhaps that’s why you’re not seeing any variation with vertex output size on Fermi. There could be other architectural enhancements as well, such as memory speed and newer instructions that improve geometry shader throughput, but it seems like local memory is a big constraint in this context.
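To make the arithmetic above easy to rerun with other vertex layouts, here is a minimal sketch. It assumes, as the estimate does, that GS output lives entirely in shared memory and that nothing else competes for it; the function names are made up for illustration, and the 16K/64K sizes are the figures quoted above.

```python
FLOATS_PER_VEC4 = 4
BYTES_PER_FLOAT = 4


def gs_output_bytes(vec4s_per_vertex, vertices_out):
    """Bytes of output storage a single geometry shader thread would need."""
    return vec4s_per_vertex * FLOATS_PER_VEC4 * BYTES_PER_FLOAT * vertices_out


def max_threads(shared_bytes, bytes_per_thread):
    """Upper bound on resident threads if shared memory is the only limit."""
    return shared_bytes // bytes_per_thread


per_thread = gs_output_bytes(vec4s_per_vertex=6, vertices_out=3)
gt200 = max_threads(16 * 1024, per_thread)  # 16K figure quoted for 8000/GT200
fermi = max_threads(64 * 1024, per_thread)  # 64K figure quoted for Fermi

print(per_thread, gt200, fermi)  # 288 56 227
```

Doubling the output to 6 vertices halves both bounds, which is the kind of sensitivity I'd expect to show up on the GT200 first.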

As a test I compared two implementations of a 1.5M point/6M vertex model. The first was drawn with glDrawElements() on 1.5M point vertex arrays, plus a GS doing a 1.6M TBO lookup per primitive for a per-primitive attribute. The second was a 6M vertex VS-only glDrawArrays() implementation, which promoted points and per-triangle attributes to triangle vertex frequency (all VBOs of 6M-length). Both implementations took ~26ms to draw on a GeForce 670. As a reference, drawing 1.5M point VBOs with glDrawElements() and no GS or per-triangle attribute took ~6ms. So while the GS case is definitely slower than the non-GS case, it is no worse than promoting all attributes to per-triangle-vertex frequency (and takes 1/4 the memory).
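For anyone unsure what "promoting" to triangle vertex frequency means, here is a rough CPU-side sketch of the expansion. It assumes plain triangles and uses made-up names and toy data; it is not the code from my test, just an illustration of why the promoted arrays grow to the index count.

```python
def promote_to_vertex_frequency(points, indices, per_triangle_attr):
    """Expand indexed points and per-triangle attributes so that every
    triangle vertex gets its own entry (suitable for a non-indexed draw)."""
    if len(indices) != 3 * len(per_triangle_attr):
        raise ValueError("expected exactly one attribute per triangle")
    # Shared points are duplicated once per reference in the index list.
    flat_points = [points[i] for i in indices]
    # Each per-triangle attribute is repeated for the triangle's 3 vertices.
    flat_attr = [per_triangle_attr[k // 3] for k in range(len(indices))]
    return flat_points, flat_attr


# Toy data: a quad as two triangles sharing an edge (4 points, 6 indices).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
indices = [0, 1, 2, 0, 2, 3]
colors = ["red", "blue"]  # hypothetical per-triangle attribute

flat_points, flat_attr = promote_to_vertex_frequency(points, indices, colors)
print(len(points), len(flat_points))  # 4 6
```

In the GS version the point arrays stay at their original length and the per-primitive attribute is fetched once per primitive instead of being copied to every vertex.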

Now, with a 20M polygon model, the GS case beats the promoted-vertex case, 225ms to 357ms (VS-only draw elements: 49ms). Of course at this point both are terribly sluggish, so it’s hard to tell the difference when tumbling. For a game context you wouldn’t want either of these models, but for CAD purposes the geometry shader seems pretty useful.