Any methods to achieve fast read back?


I’m currently working on a video-processing application and I need to copy back the data from the GPU to the CPU as fast as possible.

I tried several approaches with PBOs and glReadPixels (single- and double-buffered) and got the results below.

For 1920x1080 RGB byte images on the following configurations:
I7 920 - NV FX 3800
I7 950 - NV GTX 280M

single PBO / RGB - BYTE : 21ms
double PBO / RGB - BYTE : 14-17ms
double PBO / BGRA - BYTE : 12ms

But this is still too slow for me, because I must read back two of these images at each step of the process. So, for the moment, I’m not able to break the “30 FPS limitation”.
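To make the “30 FPS limitation” concrete, here is a rough budget calculation from the timings quoted above. It is only a sketch: it assumes the two readbacks per step run back to back with no overlap and that nothing else costs time, and the names `READBACK_MS` and `max_step_rate_hz` are made up for illustration.

```python
# Rough frame-budget arithmetic using the timings quoted in this post.
# Assumption: the two readbacks per step are serialized and nothing overlaps.

READBACK_MS = {
    "single PBO / RGB": 21.0,
    "double PBO / RGB": 17.0,   # upper end of the quoted 14-17 ms range
    "double PBO / BGRA": 12.0,
}
IMAGES_PER_STEP = 2


def max_step_rate_hz(readback_ms, images=IMAGES_PER_STEP):
    """Upper bound on steps per second if readback were the only cost."""
    return 1000.0 / (readback_ms * images)


if __name__ == "__main__":
    for name, ms in READBACK_MS.items():
        total = ms * IMAGES_PER_STEP
        print(f"{name}: {total:.0f} ms per step, at most {max_step_rate_hz(ms):.1f} Hz")
```

Under these assumptions, only the BGRA path (24 ms per step) leaves any headroom above 30 Hz, and that is before rendering and upload are counted.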

Is there any way to achieve a faster read back using anything like CUDA or OpenCL or some OpenGL tricks? (and without involving some new hardware…)

Thanks for your help!

After changing some parts of my code, I can now read back one texture (BGRA / BYTE) in 9.8ms, and my program runs at 38Hz.

Is there any way to write directly from a (fragment) shader to a Pixel Buffer Object (PBO)?

This must be faster. Try this:

  1. create two 1920x1080x4 PBOs
  2. render
  3. bind PBO2, map it (glMapBuffer), memcpy to system memory, unmap it (glUnmapBuffer)
  4. bind PBO1, glReadPixels
  5. swap buffers
  6. swap the PBO1 & PBO2 names
  7. loop from step 2
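The ordering in the steps above can be checked with a small simulation. This sketch makes no OpenGL calls at all: the dictionary entries stand in for the two PBOs, and the "read" and "map" steps are hypothetical stand-ins for glReadPixels into the bound PBO and for map + memcpy + unmap. It only shows which frame reaches system memory on each iteration.

```python
# Pure-Python simulation of the two-PBO ping-pong order (no GL involved).
# Assumption: each buffer just remembers the index of the frame last read into it.

def run_ping_pong(num_frames):
    pbo = {"PBO1": None, "PBO2": None}   # frame index currently held by each buffer
    read_target, map_target = "PBO1", "PBO2"
    delivered = []                        # frames that reached system memory, in order

    for frame in range(num_frames):
        # Step 3: map the buffer filled on the previous iteration and copy it out.
        if pbo[map_target] is not None:
            delivered.append(pbo[map_target])
        # Step 4: start the readback of the frame that was just rendered.
        pbo[read_target] = frame
        # Step 6: swap the roles of the two buffers.
        read_target, map_target = map_target, read_target

    return delivered


if __name__ == "__main__":
    print(run_ping_pong(4))
```

The point of the scheme is visible in the result: the CPU always copies the frame from the previous iteration, so the readback for the current frame has a whole iteration to complete before anyone maps its buffer. The price is one frame of latency.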

Thanks for the answer!

That’s what I am currently using in my code. But, as I said, I have two video streams to process, and BOTH go through this pipeline:

Data (in RAM) > PBO > 2D Texture > FBO (1st processing) > Texture > FBO (2nd processing) > Texture

I need to read back the last texture very fast (60Hz would be very cool). For the moment, each final FBO renders to its own color attachment (one uses GL_COLOR_ATTACHMENT0_EXT and the other GL_COLOR_ATTACHMENT1_EXT). So I have two PBOs for reading per processing line, exactly as you described.

Is there any way to write into a PBO directly, i.e. without using glReadPixels?


If your per-pixel processing uses a very small, limited kernel, you can use TransformFeedback and do the processing in a vertex shader, writing directly from one buffer object to another.

Yes, the fragment program is very small: it just modifies the texel coordinates…

I will try this!


If you are willing to use either OpenCL or CUDA, both have GL interop capabilities where a GL texture can be used directly by OpenCL or CUDA. There are some rules for the interop to give well-defined results (for example, don’t change the values in GL while an OpenCL or CUDA kernel is using them). I am about 99.99% sure that CUDA is NVIDIA-only. I don’t know how well OpenCL works on ATI or which ATI generations support it; on NVIDIA, both OpenCL and CUDA need a GeForce 8 or newer.

Thanks, but both CUDA and OpenCL work with buffer objects, which means I must still use glReadPixels… and that’s the problem…

Transform Feedback doesn’t seem to fit, because it is designed specifically for vertex operations (with GLSL, sure, but only vertex programs).

I’m also looking at the Texture Buffer Object (TBO) to do the trick…

Why don’t you like vertex programs? What stops you from using TransformFeedback?

You can supply the input either by binding it like vertex attributes or through the textures (including TBO), if you need random access.

I need to process a texture and do it with the Fragment program…

So no vertex buffer, no vertex program… only texture and frame buffers

You are processing some ‘data’, currently via the Texture object. Since you are already using PBO, your data representation is not that ‘clear’. Generally speaking, you just have a piece of memory and you want to process it into another piece of memory.

For that, instead of copying PBO1->Texture1->Texture2->PBO2 you can simply do PBO1->PBO2 (using TF).

Thanks, but both CUDA and OpenCL work with buffer objects, which means I must still use glReadPixels… and that’s the problem…

OpenCL can also use a texture directly: see clCreateFromGLTexture2D, the gl_sharing extension (cl_khr_gl_sharing) for the broader picture, and section 9.8.3 of the OpenCL specification.

There is a pretty good benchmark tool in the NVIDIA CUDA SDK;
you can use it to measure readback speed for different data block sizes and memory allocation types.


First, I must use a fragment program to process the texture, because I need the linear sampler. So I can’t use TF.

About OpenCL, yes it is a very cool thing! Might be helpful!

I also ran the CUDA SDK examples and found that the GPU-to-CPU bandwidth is nearly 3 GB/s, which is far above my 500 MB/s…
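For comparison, the timings reported earlier in the thread can be converted into effective throughput. This is a back-of-the-envelope sketch, assuming tightly packed 1920x1080 frames and decimal units (1 MB = 10^6 bytes); the helper names are made up for illustration.

```python
# Effective readback throughput implied by the timings quoted in this thread.
# Assumptions: tightly packed frames, decimal megabytes and gigabytes.

WIDTH, HEIGHT = 1920, 1080


def frame_bytes(bytes_per_pixel):
    """Size of one tightly packed frame in bytes."""
    return WIDTH * HEIGHT * bytes_per_pixel


def throughput_mb_s(num_bytes, millis):
    """Throughput in MB/s for num_bytes transferred in millis milliseconds."""
    return num_bytes / (millis / 1000.0) / 1e6


def transfer_ms(num_bytes, bandwidth_gb_s):
    """Time in milliseconds to move num_bytes at the given GB/s."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1000.0


if __name__ == "__main__":
    print(f"RGB,  21 ms:  {throughput_mb_s(frame_bytes(3), 21.0):.0f} MB/s")
    print(f"BGRA, 12 ms:  {throughput_mb_s(frame_bytes(4), 12.0):.0f} MB/s")
    print(f"BGRA, 9.8 ms: {throughput_mb_s(frame_bytes(4), 9.8):.0f} MB/s")
    print(f"BGRA at 3 GB/s: {transfer_ms(frame_bytes(4), 3.0):.2f} ms per frame")
```

With these numbers the measured paths land between roughly 296 and 846 MB/s, while 3 GB/s would move one BGRA frame in under 3 ms, so there is still a large gap between the glReadPixels path and the raw bus bandwidth.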

I hope one of them will work correctly…


I have had good results with GL_RGBA as the format and GL_UNSIGNED_INT_8_8_8_8_REV as the type. That combination even beat GL_BGRA with GL_UNSIGNED_BYTE.
Also, don’t forget to set GL_PACK_ALIGNMENT to the highest value that suits your format/type combination.
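One way to read “the highest value that suits” is: the largest of the legal GL_PACK_ALIGNMENT values (1, 2, 4, 8) that divides the row size in bytes, so that no padding is added between rows. The sketch below computes that for byte-sized components; it is my interpretation rather than a rule from the post, and the function names are hypothetical.

```python
# Row padding implied by GL_PACK_ALIGNMENT for byte-sized components.
# Assumption: "suits" means the alignment adds no padding at the end of a row.

LEGAL_ALIGNMENTS = (8, 4, 2, 1)   # the only values GL_PACK_ALIGNMENT accepts


def padded_row_bytes(width, bytes_per_pixel, alignment):
    """Row stride in bytes after rounding up to a multiple of alignment."""
    row = width * bytes_per_pixel
    return -(-row // alignment) * alignment   # ceiling division, then scale back


def largest_padding_free_alignment(width, bytes_per_pixel):
    """Largest legal alignment that leaves rows tightly packed."""
    row = width * bytes_per_pixel
    for alignment in LEGAL_ALIGNMENTS:
        if row % alignment == 0:
            return alignment
    return 1   # unreachable because 1 divides everything; kept for clarity


if __name__ == "__main__":
    for bpp in (3, 4):
        best = largest_padding_free_alignment(1920, bpp)
        print(f"1920 px at {bpp} bytes/pixel: alignment {best}")
```

For 1920-pixel rows, both the 3-byte and the 4-byte formats work out to an alignment of 8 with no padding. An odd width such as 1921 with 3 bytes per pixel would only be padding-free at alignment 1.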