I’m using instancing to render millions of quads in OpenGL, and it works okay. The base quad has 6 vertices, and each instance has a vec3 for its position. However, for each instance I need to calculate some vectors based on the instance’s position. If I do this in the vertex shader, I calculate these vectors 6 times for each instance. Is there a way to calculate them only once per instance?
Before anyone mentions the geometry shader, I used the geometry shader before but I switched to instancing because it was slow. Perhaps there’s a way to use both? And is it beneficial?
Preamble: do not render quads with instancing. Instancing, on various hardware, works best when the size of the instance is not exceedingly small. Small instances can throw away lots of performance on such hardware.
Also… why would your instanced base quad have 6 distinct vertices (as opposed to 4 vertices where 2 are shared)? If you’re going to use instancing, it would be better to use a triangle strip with each instance being a 4-element strip.
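To make the 4-versus-6 point concrete, here is a minimal CPU-side sketch of how a 4-vertex triangle strip turns into two triangles. The names `Vec2`, `kQuadStrip` and `stripTriangles` are made up for illustration, and the corner coordinates are an assumed unit quad, not anything from the original poster's code.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Assumed corner order for one unit quad drawn as a 4-vertex triangle strip.
constexpr std::array<Vec2, 4> kQuadStrip{{{0.f, 0.f}, {1.f, 0.f}, {0.f, 1.f}, {1.f, 1.f}}};

// Expand a strip of `vertexCount` vertices into the triangles assembled from it.
// Odd-numbered triangles swap their first two vertices so the winding stays consistent.
std::vector<std::array<std::size_t, 3>> stripTriangles(std::size_t vertexCount) {
    std::vector<std::array<std::size_t, 3>> tris;
    for (std::size_t i = 0; i + 2 < vertexCount; ++i) {
        if (i % 2 == 0) {
            tris.push_back({i, i + 1, i + 2});
        } else {
            tris.push_back({i + 1, i, i + 2});
        }
    }
    return tris;
}
```

With 4 strip vertices this yields the triangles (0, 1, 2) and (2, 1, 3), so vertices 1 and 2 are shared and the vertex shader runs 4 times per quad instead of 6.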
As for the main thrust of your question… no. There is no instance shader, and there is no way to make the VS execute some of its code per-instance rather than per-vertex.
However, you really shouldn’t stress about this. If the computation is “based on the instance’s position”, then that means the primary performance concern is reading the per-instance data, not the computation. And GPUs have to be able to efficiently deal with multiple invocations reading the same memory addresses, so you can assume that having 4 invocations read from that address will not be much slower than having 1 invocation read from that address.
So just accept it and move on.
Thanks for the reply.
The computation is doing various cross products and matrix multiplications using the instance’s position and some uniform variables, so I think the real cost is in the computation.
If instancing quads is inefficient, how would you recommend I render millions of them? Is the geometry shader the best option?
Geometry shaders are one option. Another option is to pre-calculate the per-quad data using a compute shader (or a “fake” compute shader, i.e. a vertex shader with transform feedback or a fragment shader with render-to-texture) then have the actual vertex shader read values using an index derived from gl_VertexID.
Geometry shaders are often discouraged on performance grounds, although much of that is due to the fact that they duplicate shared vertices, which increases cache consumption and reduces the number of vertices which can be cached. But if you’re rendering disjoint quads (which don’t share vertices with their neighbours) that’s less of an issue.
Which approach works best is likely to depend upon the specifics of the problem and the target hardware.
The obvious way: you write millions of quads to a buffer, computed either by the CPU or some on-GPU process, and then you render the quads in that buffer. People try to over-complicate the matter, but the simplest solution is generally the best here.
Note that this also means that all of those per-instance computations are being done by the CPU/GPU process that generates the vertices, not by the VS.
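As a rough sketch of the CPU variant of this, assuming each quad is described by a center point plus `right` and `up` half-extent vectors (those inputs, and the names `Vec3`, `QuadMesh` and `buildQuads`, are illustrative and not from the original question):

```cpp
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical container for the expanded geometry.
struct QuadMesh {
    std::vector<Vec3> vertices;          // 4 per quad
    std::vector<std::uint32_t> indices;  // 6 per quad (two triangles)
};

// Expand each quad center into 4 corner vertices and 6 indices.
QuadMesh buildQuads(const std::vector<Vec3>& centers, Vec3 right, Vec3 up) {
    // Corner signs: bottom-left, bottom-right, top-right, top-left.
    static const float sx[4] = {-1.f, 1.f, 1.f, -1.f};
    static const float sy[4] = {-1.f, -1.f, 1.f, 1.f};
    static const std::uint32_t tri[6] = {0, 1, 2, 0, 2, 3};

    QuadMesh mesh;
    mesh.vertices.reserve(centers.size() * 4);
    mesh.indices.reserve(centers.size() * 6);
    for (const Vec3& c : centers) {
        // Any per-quad work (cross products, matrix multiplies, ...) would go
        // here, so it runs once per quad rather than once per vertex.
        const auto base = static_cast<std::uint32_t>(mesh.vertices.size());
        for (int k = 0; k < 4; ++k) {
            mesh.vertices.push_back({c.x + sx[k] * right.x + sy[k] * up.x,
                                     c.y + sx[k] * right.y + sy[k] * up.y,
                                     c.z + sx[k] * right.z + sy[k] * up.z});
        }
        for (std::uint32_t t : tri) {
            mesh.indices.push_back(base + t);
        }
    }
    return mesh;
}
```

The resulting arrays would then be uploaded to a vertex buffer and an index buffer and drawn in one indexed draw call. The trade-off is memory: every quad now stores 4 full vertices instead of one position.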
I don’t know the specific details of your use case, so I can’t say for certain that the obvious method will be faster overall. But it’s generally the place to start from, and you should only search for a faster method if you have some profiling data in hand that requires one (or need to alleviate some CPU burden).
The main thing you’re trying to save with instancing is the amount of data used. That is, if most of your data is per-instance, you don’t have to replicate it 4 times for each quad (and yes, it’s only 4, not 6).
If that’s the primary issue, you don’t strictly need to use instancing to fetch per-quad data. You can use gl_VertexID/4 to compute the quad index, and then you can fetch data from an SSBO which contains your per-instance data. Now obviously, you’re still doing the computations per-vertex, but you’re not using instancing in your rendering operation. Indeed, you don’t even strictly need a vertex buffer in that case.
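The index arithmetic can be sketched on the CPU like this. It assumes an indexed draw with 4 unique vertex IDs and 6 indices per quad, so that gl_VertexID (which is the value read from the index buffer) runs from 0 to 4*N-1. The names `QuadRef`, `decodeVertexId` and `makeQuadIndices` are invented for the example, and the corner numbering is just one possible convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuadRef {
    std::uint32_t quad;    // which per-quad record to fetch (e.g. from an SSBO)
    std::uint32_t corner;  // which of the quad's 4 corners this vertex is
};

// Mirrors what the vertex shader would compute from gl_VertexID.
QuadRef decodeVertexId(std::uint32_t vertexId) {
    return {vertexId / 4u, vertexId % 4u};
}

// Index buffer for `quadCount` quads: two triangles per quad over 4 vertex IDs.
std::vector<std::uint32_t> makeQuadIndices(std::uint32_t quadCount) {
    static const std::uint32_t tri[6] = {0, 1, 2, 0, 2, 3};
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        for (std::uint32_t t : tri) {
            indices.push_back(q * 4u + t);
        }
    }
    return indices;
}
```

In the shader, the `quad` value would index the per-quad data and the `corner` value would select the offset for that corner, which is why no vertex buffer is required. If you drew non-indexed triangles with 6 vertices per quad instead, the divisor would be 6 rather than 4.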