ReadPixels performance and pipeline stall

Hi everyone,

In my applications I’m using shaders to perform calculations in parallel on the GPU (I can’t use compute shaders or image load/store; this code path aims to be compatible with older versions of OpenGL).

The draw call that makes the shaders execute looks quite fast to me (glDrawArrays in the image), but glReadPixels (I read the image back to retrieve the results) is really slow.

As far as I understand, commands are buffered before being sent to the hardware, but certain commands require all the previously buffered ones to be executed first (like glReadPixels from a texture).

Is this the reason why glReadPixels takes so long to execute? (You can see from the profiler screenshot that the texture is quite small.)

If so, why does starting a CPU-side timer after calling glFinish() and immediately before glReadPixels still give results on the order of a millisecond, rather than a tenth of a millisecond? (Same scenario as the results shown in the image.)

The size of the data being read can be largely irrelevant for glReadPixels performance. You could read a 1x1 block of data and still get the same stalling behaviour.

Two things happen in glReadPixels that can cause it to run slow.

The first is that - as you’ve indicated - all queued-up GL commands must be flushed, executed, and you must wait for them to complete before glReadPixels can run.

The second is that if the parameters you use for glReadPixels mismatch the format of your current read buffer, the driver must do a format conversion. (This is a case where the size of the data being read can make a difference, although the pipeline flush will likely still be the major time factor.)

I can’t answer your question about the glFinish call because I can’t see in your screenshot where you make that glFinish call.

Here’s a screenshot with the glFinish()

Unless the framebuffer is GL_RGBA32F, that glReadPixels will perform a conversion. Use format and type parameters that match the format of the framebuffer to avoid this.

Calling glGetFramebufferParameteriv with GL_IMPLEMENTATION_COLOR_READ_FORMAT and GL_IMPLEMENTATION_COLOR_READ_TYPE will return the preferred format and type.
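
For example, something along these lines (untested sketch; glGetIntegerv also accepts these pnames, but only on GL 4.1+ / ARB_ES2_compatibility, so on older versions you may just have to hard-code the format/type that matches your attachment):

```c
#include <GL/glew.h>   /* or whatever GL loader you use */

/* Read back the currently bound read framebuffer using the driver's
 * preferred format/type, so no pixel conversion is needed.  The caller
 * must supply a 'dst' buffer large enough for w*h pixels of that format. */
static void read_back_preferred(GLint w, GLint h, void *dst)
{
    GLint fmt  = GL_RGBA;            /* fallbacks if the query isn't supported */
    GLint type = GL_UNSIGNED_BYTE;
    glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &fmt);
    glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE,   &type);

    glPixelStorei(GL_PACK_ALIGNMENT, 1);    /* tightly packed rows */
    glReadPixels(0, 0, w, h, (GLenum)fmt, (GLenum)type, dst);
}
```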

If the call is still slow, then it looks like the issue is with glFinish. According to the standard, glFinish forces all previously issued GL commands to complete, and does not return until all effects from those commands on GL state and the framebuffer are fully realized. But the standard doesn’t elaborate on what “complete” or “fully realized” mean.

If you’re still having issues, I’d suggest using a PBO and a fence (requires OpenGL 3.2 or the ARB_sync extension).
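
Roughly like this (untested sketch, assuming you’re reading GL_RGBA / GL_FLOAT, in which case dataSize is width * height * 16 bytes; names are just illustrative):

```c
#include <string.h>    /* memcpy */
#include <GL/glew.h>   /* or your preferred GL loader; needs GL 3.2 or ARB_sync */

/* Start the readback into a PBO, fence it, do other work, then map the
 * PBO once the fence has signalled.  'results' is your own CPU buffer of
 * at least dataSize bytes. */
static void readback_with_fence(GLsizei width, GLsizei height,
                                GLsizeiptr dataSize, void *results)
{
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, dataSize, NULL, GL_STREAM_READ);

    /* With a pack PBO bound, glReadPixels returns without waiting for the
     * data: the transfer is queued on the GPU like any other command. */
    glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, (void *)0);
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    /* ...ideally do other CPU/GL work here instead of waiting... */

    /* Poll (or block) until the readback has actually completed. */
    while (glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000)
               == GL_TIMEOUT_EXPIRED)
        ;  /* could do more useful work per iteration instead of spinning */
    glDeleteSync(fence);

    /* Now the map should not stall; copy the data out. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    void *src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, dataSize,
                                 GL_MAP_READ_BIT);
    if (src)
        memcpy(results, src, (size_t)dataSize);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glDeleteBuffers(1, &pbo);
}
```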

Also worth noting here is that you’re asking the driver to fill 2.6 MB of memory with that glReadPixels() call (411x411 GL_RGBA32F), presumably at interactive rates. That mem/bus bandwidth has a time cost. If your framebuffer uses a GL_RGBA8 color buffer, try reading that format back instead (e.g. GL_RGBA / GL_UNSIGNED_INT_8_8_8_8_REV). That’ll cut the mem B/W by 4X and possibly also cut out a pixel format conversion in the driver (if you’ve chosen a different format to read back).
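
i.e. roughly this (dst being your own 411 * 411 * 4-byte buffer):

```c
/* GL_RGBA8 color buffer: read back 4 bytes/pixel with no float conversion. */
glPixelStorei(GL_PACK_ALIGNMENT, 4);   /* 411*4 bytes per row is already 4-aligned */
glReadPixels(0, 0, 411, 411, GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, dst);
```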

And as mentioned above, this glReadPixels() will trigger a full pipeline flush. You’ve just done a glFinish(), so much of the work “should” be caught up, if the driver handles glFinish() properly (in practice, the glFinish() here is probably useless except for timing purposes). However, there may still be rendering-related work to do before the readback can actually occur that wasn’t explicitly required by the previously-queued commands. For instance, if you’re rendering to an MSAA framebuffer, the glFinish() won’t have done the downsample (unless you’d queued a command before it that requires it to occur), so the glReadPixels() call will have to trigger that and wait on the result. The color buffer also needs to be de-tiled into a linear format, and any required pixel format conversions performed.
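
If the MSAA case applies to you, the usual way to control when that downsample happens is to resolve explicitly into a single-sample FBO and read back from that. A rough sketch (msaaFbo, resolveFbo, width, height and pixels are placeholders you’d set up yourself; needs GL 3.0 or EXT_framebuffer_blit):

```c
/* Resolve (downsample) the MSAA framebuffer into a single-sample FBO,
 * then read back from the resolved one. */
glBindFramebuffer(GL_READ_FRAMEBUFFER, msaaFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFbo);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

glBindFramebuffer(GL_READ_FRAMEBUFFER, resolveFbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, pixels);
```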

I’d also 2nd the PBO recommendation. Even with a blocking copy through a PBO (or even 2), you can more than 2X your effective readback bandwidth (cutting the time required by over 2X). And of course, spacing apart the readback into the PBO and the copy-out from the PBO may let you reduce the total CPU frame time required even further (at the tradeoff of latency, possibly meaning using last-frame data).
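
A rough sketch of that spacing, using two PBOs round-robin so that what you map each frame is the previous frame’s readback (untested; names are placeholders):

```c
#include <string.h>    /* memcpy */
#include <GL/glew.h>   /* or your preferred GL loader */

static GLuint pbos[2]; /* created lazily below */
static int    frame = 0;

/* Each call queues this frame's readback into one PBO and copies out the
 * previous frame's data from the other, so the map rarely has to stall. */
void readback_round_robin(GLsizei width, GLsizei height,
                          GLsizeiptr dataSize, void *results)
{
    if (pbos[0] == 0) {                          /* one-time setup */
        glGenBuffers(2, pbos);
        for (int i = 0; i < 2; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, dataSize, NULL, GL_STREAM_READ);
        }
    }

    int cur  = frame % 2;        /* receives this frame's readback */
    int prev = (frame + 1) % 2;  /* holds last frame's results     */

    /* Queue the readback; with a pack PBO bound this returns immediately. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[cur]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, (void *)0);

    /* Copy out last frame's results (nothing to copy on the very first call). */
    if (frame > 0) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[prev]);
        void *src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (src) {
            memcpy(results, src, (size_t)dataSize);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
    }

    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}
```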

Thanks to all, from the little testing I’ve done so far it looks like it was a bandwidth-related problem (conversion doesn’t seem to affect performance too much… but probably I haven’t done enough testing).

However, what could be a reliable way to unpack a float or integer (maybe easier) into 4 unsigned bytes inside a fragment shader? Since I’m writing for #version 120, it seems like bit shifts are not available.

There isn’t one. The GLSL 1.20 spec treats integers mainly as a programming aid: there is no requirement that an integer in the language maps to an integer type in hardware, which is why bitwise and shift operations aren’t supported in 1.20. If you want actual integers, you’ll need to use at least GLSL 1.30, which requires genuine 32-bit integer support and adds the shift and bitwise operators.