Avoiding redundant vertex transformations

I draw an object consisting of many GL_QUADs using glDrawArrays(). Every vertex is transformed in a rather complex vertex shader, and I really don’t want every vertex to be transformed four times (once for each GL_QUAD it belongs to).

Should I look more into VBOs? Maybe two passes: one pass to transform the vertices and one pass to draw GL_QUADS using those vertices. How could I do that?

You can try using the pixel/fragment shader for vertex processing.

Put the vertices into an RGBA32F texture, create an RGBA32F offscreen buffer, write a fragment shader that processes the vertices from the texture (a replacement for your expensive vertex shader), render a screen-aligned quad, and you’ll get the result in the offscreen buffer. Now grab (read back) the offscreen buffer into a PBO, and rebind the PBO as a VBO.
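A sketch of that sequence in GL-style pseudocode (names like `srcTex`, `pbo` and `drawScreenAlignedQuad` are placeholders, and the exact float-texture enum depends on the vendor extension, e.g. GL_RGBA_FLOAT32_ATI or GL_FLOAT_RGBA32_NV):

```
// 1. Upload the vertex data as texels of a float texture
glBindTexture(GL_TEXTURE_2D, srcTex);
glTexImage2D(GL_TEXTURE_2D, 0, RGBA32F_format, texW, texH, 0,
             GL_RGBA, GL_FLOAT, vertexData);

// 2. Bind a float offscreen target (pbuffer/FBO) and the
//    "vertex processing" fragment shader, then draw one
//    screen-aligned quad covering texW x texH pixels
drawScreenAlignedQuad();

// 3. Read the result straight into a pixel buffer object
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
glReadPixels(0, 0, texW, texH, GL_RGBA, GL_FLOAT, 0);

// 4. Rebind the same buffer object as a vertex array source
glBindBuffer(GL_ARRAY_BUFFER_ARB, pbo);
glVertexPointer(4, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, vertexCount);
```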

Keep in mind the precision issue: NV hardware (NV3x, NV4x and G7x) offers 32-bit precision in the fragment shader, while ATI offers only 24-bit.


yooyo’s proposal may be overkill for you, depending on the number of QUADS you draw, the complexity of your vertex shader, and whether the vertex shader really is your performance bottleneck.

FWIW, if possible, you could perhaps take care to stripify your quads, or at least reorder the vertices to improve vertex cache usage, which should reduce the number of redundant vertex transforms. You could take a look at glDrawRangeElements instead of glDrawArrays, and perhaps use Nvidia’s NVTriStrip library…

You could also take a look at VBOs; they may improve performance.


It’s actually not that many vertices, a few thousand on average, but the vertex shader really is complex. How big is the cache, and how does it work? Are the heavy redundant transformations transparently avoided if the previous results are still in the cache?

I draw a square mesh of QUADS, row by row. If I’m sure the data from one entire row fits in the cache, can I assume no redundant transformations will be made?

I’m not sure how glDrawRangeElements would help here, or did you mean I should just use it to rearrange the drawing order into something more cache-friendly?

yooyo’s solution using the fragment shader would probably work otherwise, but it won’t be as clean as doing everything in the vertex shader, and it would be a hassle to implement.

If the cache stores not only memory contents but the results of vertex shader runs as well, then it’s probably not worth it. Does anyone know whether this is the case on the latest GPUs (GeForce 6800, Radeon X800)?

One more thing… A vertex shader can output up to 32 floats (8 vec4s) to the rasterizer (i.e. position - vec4, texcoord - vec2, color - vec4, …), but a fragment shader can deliver at most 4 vec4s, and only if you’re using MRT on the latest hardware.

In short… if you need more “fake” varyings from the fragment shader, you’ll need to do it in several passes (up to 4 vec4 varyings per pass with MRT, or 1 vec4 without MRT), and each varying should be stored in a separate texture. This means you have to split your complex vertex shader into several simpler fragment shaders. Finally, read them all back into PBOs and rebind those PBOs as VBOs.

Also… you can render intermediate results into a texture and later use it as an input texture in the passes that actually calculate the varyings.

I really don’t know how this will impact performance.


The post-transform cache stores the results of the vertex shader, holds around 20 vertices, and assuming an LRU scheme seems to work well. I don’t think any exact information about its size and replacement scheme is available, though.

You shouldn’t render the quads with glDrawArrays but use glDrawElements instead - otherwise the duplicate vertices can’t be identified.

And if you have a large grid, doing it row by row doesn’t give you the best cache utilization - it should be easy to find some good traversal orders for that (also better than what is in the more general NVTriStrip library).