I have a rendering engine with tiles; a frame may consist of a few hundred tiles. Most tiles will be rendered with a single texture, while a few may be rendered with an extra texture in a different texture unit.
Which is faster:
1. Using two different shaders, one with two samplers and the other with only one, and switching shaders according to which type of tile it is.
2. Using the same shader all the time, but binding a 1x1 transparent pixel texture to the second texture unit for most of the tiles.
Is there an alternative 3?
I can probably sort the tiles to reduce the number of shader switches in alternative 1, but there are already some other sort criteria on these tiles, so it may not work.
Putting the whole tile set in one render call is unfortunately not an option due to transforms and other uniforms that differ from tile to tile. So you’re saying that in this case it doesn’t matter if I’m switching shaders between tiles?
If values differ from tile to tile, they probably shouldn’t be uniforms.
I’m saying that the main factor is the number of draw calls. The number of shaders sets a lower bound on the number of draw calls (you can’t change shaders within a draw call). But using two shaders and thus two draw calls is likely to be preferable to a hundred draw calls all using the same shader.
Switching shaders between draw calls has some cost relative to the same set of draw calls with the same shader. That cost isn’t necessarily any higher than changing uniforms between draw calls (older hardware optimised uniform branches by compiling different variants of the shader for different values of the uniforms, so changing uniforms may actually switch shaders).
First of all, thanks for taking time to answer, and giving valuable input, it is highly appreciated!
The values that change are typically a transformation matrix for each tile (this is a 3D globe model, and the tiles are placed in a huge coordinate system with a local origin that moves from frame to frame), and some texture transformation parameters (scale/translation encoded in a vec4). Given that I need to target OpenGL ES 3.0 and possibly even 2.0, is there a better way to set these values than using uniforms?
[QUOTE=GClements;1289234]I’m saying that the main factor is the number of draw calls. The number of shaders sets a lower bound on the number of draw calls (you can’t change shaders within a draw call). But using two shaders and thus two draw calls is likely to be preferable to a hundred draw calls all using the same shader.
Switching shaders between draw calls has some cost relative to the same set of draw calls with the same shader. That cost isn’t necessarily any higher than changing uniforms between draw calls (older hardware optimised uniform branches by compiling different variants of the shader for different values of the uniforms, so changing uniforms may actually switch shaders).[/QUOTE]
OK, so in my case, given that I can’t reduce the number of draw calls any further, I guess it is better to switch shaders between tiles than to use a dummy texture?
It’s going to depend on your driver, but I suspect it’d be faster to use the same shader with a dummy texture than to switch shaders.
Realizing that you may not be targeting NVidia GPUs and GL drivers, see slide 48 here for the relative costs of various state changes on NVidia GL drivers: Beyond Porting (NVidia). However, check the “OpenGL ES Developer’s Guide” published by the GPU vendor you’re targeting for specific details on their hardware and drivers.
Also, I would add this to GClements’ comments: it’s not all about the number of draw calls. Yes, assuming the same number and type of pipeline state changes, fewer draw calls can be faster, but fewer draw calls alone won’t necessarily make things faster. What matters most is minimizing the number of state changes, and working hardest to reduce the state changes of a particular type the more expensive that type of state change is (see the relative cost table above). For instance, if you’re changing render targets a lot each frame, particularly on mobile, your performance is going to suffer. And on mobile, you need to be very careful to avoid state changes that will trigger texture ghosting, and to know exactly which operations on your GLES driver will cause implicit synchronization (i.e. halting your draw thread’s execution until some driver-internal condition is met).
Once you’ve minimized your state changes (e.g. by grouping batches – aka draw calls – with the same state), then consider merging batches with the same state if possible. This can be a win, but won’t necessarily be one. Example: if you’re sending batches containing lots of triangles that are outside the view frustum down the pipe, you’re going to waste lots and lots of GPU cycles letting it cull them out at the triangle level. If you have a lot of content in your scene, it’s better to group your batches spatially and then perform coarse-grain culling on the CPU, sending only the batches that at least partially overlap the view frustum. To support this, you’ll probably end up with > 1 batch per state permutation. So I wouldn’t obsess about having only one batch per unique pipeline state. That’s probably only going to be true for toy scenes, or scenes where you are dynamically generating draw indirect batches (which you probably won’t be). Rendering the 2nd and subsequent batches with the same pipeline state is often pretty efficient, so don’t sweat trying to have only one batch per state combination if that’s not easily done.
Realizing that you may not be targeting NVidia GPUs and GL drivers, see slide 48 here for the relative costs of various state changes on NVidia GL drivers: Beyond Porting (NVidia).
An important thing to note about this: the presentation is technically NVIDIA-specific. That doesn’t mean the information is useless on other hardware. Just don’t expect all of the performance statistics to line up with other hardware.
Take “vertex format”, for example. AMD’s GCN-based hardware doesn’t have “vertex format” hardware the way NVIDIA’s does. It’s all done through vertex shader logic. This means that vertex format changes will probably be more on the order of magnitude of program changes (the second most expensive on NVIDIA hardware).
Again, that doesn’t mean you should discard the general order of performance there. I would expect vertex buffer binding (pure buffer binding, via separate attribute format) to be faster than vertex format changes on any hardware. Similarly, you should expect binding for major objects like textures or buffers to be somewhat expensive. And ROP changes are almost certainly not cheap on any hardware.
The answer is going to depend strongly on how often the rendering ping-pongs between needing the second texture and not. If you are ping-ponging a lot, then you are most definitely better off with a single shader, and most likely without a dedicated dummy texture: for the tiles where the texture values do not matter, make the texture coordinates all the same across those vertices and put a multiplier of zero on the texel fetch to get zero. If you are not ping-ponging, then a separate shader is fine. The dummy-texture approach sounds like a compromise, but it is not really worth the bother, since a float-vec4 multiply is tiny next to a texture fetch. On quite a few drivers (like Intel’s open source Mesa/i965 driver), changing any texture makes the driver resend all the binding tables to the GPU, which is sort-of-heavy-ish.
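As a sketch, that multiply-by-zero idea could look something like this in a GLSL ES 1.00 fragment shader (the names and the v_detail_mix varying are made up for illustration; whatever texture was last bound can stay bound to u_detail so the fetch remains well defined):

```glsl
precision mediump float;

uniform sampler2D u_base;
uniform sampler2D u_detail;

varying vec2  v_base_uv;
varying vec2  v_detail_uv;  // constant across the tile when unused
varying float v_detail_mix; // 1.0 for two-texture tiles, 0.0 otherwise

void main()
{
    vec4 base = texture2D(u_base, v_base_uv);
    // multiplying by 0.0 discards the fetch result for one-texture tiles
    vec4 detail = texture2D(u_detail, v_detail_uv) * v_detail_mix;
    gl_FragColor = base + detail;
}
```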
If you need to support ES 2, there aren’t any good options for the more complex possibilities. For desktop OpenGL, you can coalesce chunks of geometry by putting the uniforms into an array (one entry for each chunk) then adding an integer attribute to identify the chunk.
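For what it’s worth, a sketch of that uniform-array approach for desktop GL might look like this vertex shader (the array size, names, and the vec4 packing are illustrative; the integer attribute has to be set up with glVertexAttribIPointer):

```glsl
#version 330

uniform mat4 u_chunk_transform[64];   // one entry per coalesced chunk
uniform vec4 u_chunk_tex_params[64];  // scale in .xy, translation in .zw

in vec3 a_position;
in vec2 a_uv;
in int  a_chunk_id;                   // identifies this vertex's chunk

out vec2 v_uv;

void main()
{
    v_uv = a_uv * u_chunk_tex_params[a_chunk_id].xy
               + u_chunk_tex_params[a_chunk_id].zw;
    gl_Position = u_chunk_transform[a_chunk_id] * vec4(a_position, 1.0);
}
```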
Splitting draw calls to change textures can be avoided by using array textures or arrays of samplers.
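With ES 3.0, an array texture version can be as simple as this fragment shader sketch (names hypothetical; the layer index is passed down from the vertex shader):

```glsl
#version 300 es
precision mediump float;

uniform mediump sampler2DArray u_tiles;

in vec2 v_uv;
in float v_layer;  // which layer of the array texture this tile uses

out vec4 o_color;

void main()
{
    // selecting a layer replaces a per-tile glBindTexture call
    o_color = texture(u_tiles, vec3(v_uv, v_layer));
}
```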