I have an OpenGL ES 2 engine running on Windows and iOS, and I also have it running on DirectX 11.
As soon as I learned about GL_EXT_separate_shader_objects I expected huge performance benefits, because with it OpenGL would look and feel more like DirectX 9+, and my cross-API code could stay similar while maintaining high performance.
To my surprise, after implementing GL_EXT_separate_shader_objects my performance was halved and my GPU usage dropped from ~95% to ~45%. So basically, a monolithic linked program is twice as fast as separate shader objects. This is on an AMD HD7850 under Windows 8 with an OpenGL 4.2 Core context.
I originally imagined that this extension was created to boost performance by separating constant buffers and shader stages, but it seems it might have been created for people wanting to port DirectX shaders more directly, regardless of any performance hit.
So my question is: if you have implemented this feature in a reasonable scene, what is your performance difference compared to monolithic shader programs?
Those “links between shader parts” can be made offline. That’s the whole point of the program pipeline object: it encapsulates a sequence of programs, so all of that verification work can be done up front rather than at bind time.
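To illustrate, a minimal sketch of the pipeline-object pattern being described (assumes a current GL 4.1+ context with ARB_separate_shader_objects; error checking omitted, variable names hypothetical):

```c
/* Compile and link each stage as a separable program once, at load time. */
GLuint vsProg = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);
GLuint fsProg = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSource);

/* The pipeline object holds the combination, so the stage matching
   can be validated up front instead of at bind time. */
GLuint pipeline;
glGenProgramPipelines(1, &pipeline);
glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT,   vsProg);
glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fsProg);
glValidateProgramPipeline(pipeline);

/* At draw time: no glUseProgram, just bind the pre-built pipeline. */
glBindProgramPipeline(pipeline);
```

One pipeline object per vertex/fragment combination keeps the mixing and matching off the hot path.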
Sounds more like someone’s slacking off at their job of implementing this correctly.
I’m adding some new GL3 features, and I figured I’d post here instead of making a new thread.
Next on my list was uniform blocks. I had previously worked with DX11 constant buffers, so I expected this to be an improvement.
I have around 2 out of 10 constants (the view and projection matrices) that I update for every shader every frame before drawing.
To my surprise, if I put all constants into a uniform block, then instead of calling glUniformMatrix4fv twice per program I now do one glBufferSubData call for each program’s (single) uniform block (I know I could share them, I’m just not there yet). What seems odd is that I’m actually making fewer GL calls, yet performance is again slashed nearly in half (more like 60% of the glUniform* path). Why would glBufferSubData be twice as slow as two glUniform* calls?
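For reference, a sketch of the setup being described (names like `program` and `"PerFrame"` are hypothetical; assumes a current GL 3.1+ context, error checking omitted):

```c
/* One uniform block holding view + projection (two mat4s, which pack
   contiguously under std140). */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, 2 * 16 * sizeof(GLfloat), NULL, GL_DYNAMIC_DRAW);

/* Done once: tie the program's block to binding point 0. */
GLuint blockIndex = glGetUniformBlockIndex(program, "PerFrame");
glUniformBlockBinding(program, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

/* Per frame: a single upload replaces two glUniformMatrix4fv calls. */
GLfloat matrices[32];               /* view in [0..15], projection in [16..31] */
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(matrices), matrices);
```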
Also, what really struck me is that samplers can’t be put into uniform blocks, so you end up setting samplers with glUniform*, meaning you basically have to use part of the GL2 pipeline and part of GL3. Should this be normal?
It’s because they’re not really uniforms. At least, not in the same way that the non-opaque types are. They’re not pieces of memory that store the value of the texture unit.
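In other words, a sampler uniform is just an opaque handle that selects a texture unit, which is why it still goes through the old glUniform* path. A minimal sketch (names hypothetical, assumes a current GL context):

```c
/* The sampler uniform's "value" is only a texture unit index,
   not buffer-backed memory, so it can't live in a uniform block. */
GLint loc = glGetUniformLocation(program, "uDiffuseMap");
glUseProgram(program);
glUniform1i(loc, 0);                      /* sampler reads from unit 0 */

/* The actual texture is whatever is bound to that unit. */
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, diffuseTex);
```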
Switching to glMapBuffer has increased my performance a lot. But this is probably mainly because, under the hood, some double buffering is going on now. The bottom line is still that using uniform blocks is ~10% slower than plain old glUniform* calls. I’m now thinking they might only be useful for large arrays or large sets of data.
EDIT: I now do just a single buffer update once per frame, and to my surprise the performance is still lower than using glUniform*, but the difference is now only 3-5% or so.
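The single per-frame update might look something like this sketch (assumes GL 3.0+ glMapBufferRange; `totalSize` and `cpuSideConstants` are hypothetical names for the CPU-side copy of all per-frame constants). The invalidate flag orphans the old storage, which is what lets the driver double-buffer behind the scenes:

```c
/* One mapped write per frame for all per-frame constants. */
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
void *dst = glMapBufferRange(GL_UNIFORM_BUFFER, 0, totalSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(dst, cpuSideConstants, totalSize);
glUnmapBuffer(GL_UNIFORM_BUFFER);
```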
Right now I have ~30 vertex and pixel shaders that I mix and match. I haven’t counted exactly, but I usually have one program per material (one vertex and one pixel shader from that pool), I have ~100 materials (sometimes the same program gets reused), and it’s also ~100 draw calls. So it was one buffer update versus 100 * 2 (constants) glUniform* calls.
That’s most interesting. I haven’t benchmarked the difference between uniforms and a uniform buffer. Although if it is only 3-5% with 2 uniforms, there may be a crossover point as the number of unique items in the buffer increases. I have about 16, so I would need 16 uniform calls.
He also mentioned, “I now do an glBufferSubData once for each program’s (single) uniform block”, which is probably not the most efficient way to update uniform blocks. If you’ve got a large series of objects that use the same uniform block layout, it’s probably more efficient to create an array of uniform blocks within one buffer and change them all in one call (either mapping the buffer and writing all the blocks, or doing just one glBufferSubData call).
I later added that “I now just do a single buffer update once per frame”, and that’s with glMapBuffer, which is roughly twice as fast as glBufferSubData.
I suppose there is a crossover point, but I have no idea where. I’m anxious to test this on GLES3 (mobile) GPUs, which I assume will have drivers written specifically for it, compared to desktop where vendors originally had the DX11 driver and (probably?) patched in GL3+ features once they were approved.
I just finished implementing sampler objects (the ones you use with glBindSampler), and to my surprise these are also slower than setting the sampler state just once with glTexParameter. I assume the speed difference might come from switching states (since I have textures both with and without mipmaps). I also thought that since the hardware is designed for DX11 (I use a HD7850), it might already keep a sampler object per individual texture, so in hardware it could be switching between a lot of objects, and with sampler objects I would be minimizing that. Then again, I have no idea how the hardware actually does it, so I’m left to speculation.
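For context, a sketch of the sampler-object pattern being benchmarked (assumes GL 3.3+ / ARB_sampler_objects; `hasMips` is a hypothetical per-texture flag):

```c
/* Two shared samplers: one for mipmapped textures, one for the rest. */
GLuint samplers[2];
glGenSamplers(2, samplers);

/* Mipmapped textures: trilinear filtering. */
glSamplerParameteri(samplers[0], GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glSamplerParameteri(samplers[0], GL_TEXTURE_MAG_FILTER, GL_LINEAR);

/* No mips: plain bilinear. */
glSamplerParameteri(samplers[1], GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glSamplerParameteri(samplers[1], GL_TEXTURE_MAG_FILTER, GL_LINEAR);

/* At draw time, the bound sampler overrides the texture's own
   glTexParameter state on that unit. */
glBindSampler(0, hasMips ? samplers[0] : samplers[1]);
```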