I have been working on an image decimator using a shader and I began to benchmark its performance. I have been measuring the execution speed of various aspects of the process and found that glBindTexture() seems to have some kind of deferred processing.
In the very simple test case below the first call takes 0.013 ms but the second 24.59 ms. If the second call to glBindTexture() is omitted then the delay is simply deferred until a future glBindTexture() occurs. Subsequent calls to glBindTexture() are again fast. I’m using Nvidia GPUs.
This was very perplexing because I thought that my code that sets the uniforms for the shader was very slow. In fact, the slow spot was merely the next place there was a call to glBindTexture(). Is this to be expected?
GLuint image_to_texture_id(IMAGE *image)
{
    GLuint texture_id;
    glGenTextures(1, &texture_id);
    glBindTexture(GL_TEXTURE_2D, texture_id);   // first call: fast (0.013 ms)
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D,
                 0,                             // the 0-th level mipmap
                 GL_RGBA, image->width, image->height,
                 0,                             // NO borders
                 GL_RGBA, GL_UNSIGNED_BYTE, image->buffer);
    glBindTexture(GL_TEXTURE_2D, texture_id);   // second call: slow (24.59 ms)
    return texture_id;
}
Most OpenGL commands have some kind of deferred processing. Apart from commands such as glFinish() and glGet() (and maybe some others) which inherently have to wait for processing to complete, OpenGL functions return as soon as the command (including its data) has been enqueued in the output buffer; they won’t wait for the command to be sent to the GPU or for the GPU to actually execute the command.
Commands which have to copy bulk data to the output buffer (e.g. glTexImage() with a pointer to client memory rather than a buffer offset) may take longer, but the slowest ones will be commands which wait for processing to complete, or any command which is executed when the output buffer is full (as it will have to wait for at least some processing to complete before the current contents of the output buffer can be sent to the GPU, leaving space for new commands).
In your example, the second glBindTexture() is probably waiting for the last part of the texture data to be sent to the hardware.
If you want to time the actual execution time of a function call, call glFinish() immediately beforehand to ensure that the pipeline is empty. But pipelining means that the execution time of function calls is largely irrelevant. If you want to measure the time taken for the GPU to actually execute the command, you need to bracket the command with a timer query (glBeginQuery(GL_TIME_ELAPSED) etc).
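For reference, a sketch of such a timer query (it assumes a current GL 3.3+ context, and width, height and pixels stand in for your actual image data):

```c
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);     /* command under test */
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsed_ns = 0;
/* GL_QUERY_RESULT blocks until the GPU has actually finished the
   bracketed commands, so this measures GPU execution, not call time. */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);
printf("GPU time: %.3f ms\n", elapsed_ns / 1.0e6);
glDeleteQueries(1, &query);
```

Reading the result with GL_QUERY_RESULT stalls the CPU, so in production code you would poll GL_QUERY_RESULT_AVAILABLE instead; for benchmarking, the stall is the point.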
The behavior I’m experiencing seems to be something more than a consequence of the GPU’s pipeline. The glTexImage2D() call takes 80 ms (I’m texturizing a big image) but whenever the very next glBindTexture() occurs there is a further ‘hit’ of ~25 ms. It is true that the execution time of a single OpenGL call is “largely irrelevant”; however, in this case, it seems as if there is work left undone from the glTexImage2D() call which is deferred to (or somehow initiated by) the next glBindTexture(). A ‘normal’ glBindTexture() takes much less than 1 ms.
I shall bracket the curiosity with glFinish() and glBeginQuery(GL_TIME_ELAPSED) as you suggest.
But is it specific to glBindTexture()? Or is it specific to glBindTexture() with the texture parameter equal to the texture which has just been defined? Or does it apply to the next OpenGL call of any type?
One plausible behaviour is that glTexImage2D() will copy the command and its basic parameters to the output buffer, then start copying the texture data. It will get some of it into the output buffer immediately, then go into a loop, copying more data as soon as there is space in the output buffer. It will return once it has copied the last chunk of data to the output buffer.
An OpenGL command executed immediately afterwards is likely to find that the output buffer is almost full. If it’s close enough to full that there isn’t space to add the new command, it will have to wait for more data to be consumed before it can complete.
Having said all of that, there may be additional issues with executing glBindTexture() either while the active texture unit is busy with a previous command, or while the texture being bound is still in the process of being defined. Uploading a texture doesn’t inherently involve any processing by the GPU, so the driver may be able to copy the texture data to a different buffer from the command stream and upload it concurrently with other commands. But if those commands try to do anything with either the texture or the texture unit while the texture is still being defined, they may have to wait.
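Incidentally, if you want to make that staging copy explicit yourself, a pixel unpack buffer is the usual tool: the image is first copied into a driver-owned buffer, and glTexImage2D() then sources from that buffer rather than from client memory, which gives the driver the best chance to schedule the upload concurrently. A sketch (width, height and pixels are placeholders for your image):

```c
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, (GLsizeiptr)width * height * 4,
             pixels, GL_STREAM_DRAW);   /* copy into driver-owned memory */

/* With a PIXEL_UNPACK buffer bound, the final argument is an offset
   into that buffer, not a client-memory pointer.                      */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```

Whether this actually helps depends on the driver; it merely removes the driver's need to finish reading your client memory before glTexImage2D() can return.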
I have liberally placed glFinish() calls to reveal which OpenGL functions are demanding – this helps a whole lot in understanding the performance of my shader. The GPU’s task begins with the conversion of an image to a texture and ends with a pixel read soon after, so the application code always waits for the pipeline to be emptied.
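For the record, this is the bracketing pattern I mean. The GL part (shown as a comment) assumes a current context; now_ms() itself is plain POSIX C:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Monotonic wall clock in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Bracketing a GL command (assumes a current context):

       glFinish();               // empty the pipeline first
       double t0 = now_ms();
       glTexImage2D(...);        // command under test
       glFinish();               // wait for it to really complete
       printf("%.3f ms\n", now_ms() - t0);
*/
```

Without the first glFinish() the measurement includes whatever the pipeline still owed from earlier commands, which is exactly the confusion I started with.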
Things are now making much more sense. For example I mistakenly thought that a call to glReadPixels() was taking ~17ms but in fact it is only taking ~3ms and the difference is, in fact, the rendering of my shader.
Thank you for your help.
As an aside, my decimator spends:

    setting up FBO rendering    2.42%
    shader rendering           11.22%
So there is little point in my doing further work on my shader code!