fastest way to transfer texture data from GPU to RAM and back

Hi everyone,

does anyone know the fastest way to transfer texture data from main memory to the GPU, and back from the GPU to main memory?




I think your options are pretty limited there. For uploading texture data you use glTexImage*; for downloading texture data, glReadPixels.

The problem is, of course, that glReadPixels especially is pretty slow. You might want to run some tests and benchmarks to find out which glReadPixels parameters work best on your system. Some people have suggested that limiting the size of the readback area or using certain image tiles (e.g., 64x64 tiles starting at (0,0)) might speed up image readback in OpenGL. However, I do not believe there is a general rule.

As far as texture upload is concerned, all you can really do is a) take care of efficient texture caching (no multiple uploads) b) use texture compression if available c) use an appropriate texture format (e.g., no RGBA texture for single-channel data).


The whole idea of higher-level APIs is to rid the developer of these nuisances. Why do you want to transfer data from GPU to RAM or vice versa? There are different techniques for different scenarios, and as such there are no hard and fast rules. Add in different implementations from different drivers and the wide variety of hardware out there, and I think you are just asking for trouble. If you have a specific scenario in mind for which you need fast transfer, you should mention it explicitly, so that better suggestions come up.

For fast texture uploads on nVidia-based AGP cards, I have been able to realize significant improvements by using the glXAllocateMemoryNV (wglAllocateMemoryNV on Windows) function to allocate the CPU-side memory (instead of using new / malloc):

  /* glXAllocateMemoryNV(size, readFrequency, writeFrequency, priority) */
  AGPmem = (GLubyte *) glXAllocateMemoryNV( gTexWidth * gTexHeight * 3, 0.0f, 0.1f, 1.0f );
  glEnableClientState( GL_WRITE_PIXEL_DATA_RANGE_NV );
  glPixelDataRangeNV( GL_WRITE_PIXEL_DATA_RANGE_NV, gTexWidth * gTexHeight * 3, (void *) AGPmem );

I think you may be able to get similar readback benefits by allocating memory the same way (read up on it) and marking it as GL_READ_PIXEL_DATA_RANGE_NV, etc… (I’ve never done this.)

Be warned that everything might have to be “perfect” in order to use the fast path (e.g. byte ordering in memory matches how the card wants it, data dimensions are multiples of 8, etc)

If you are doing this multiple times, you definitely want to allocate a texture first (glTexImage2D), and later copy in data with glTexSubImage2D…

Also be warned that these glTexSubImage calls (on the special memory) don’t block – the memory copy happens in parallel. There are ways to figure out when the card/driver is done with it (the NV_fence extension) – I generally just allocate two pieces of memory for my texture data and alternate between them to avoid corruption.

What are you doing with the data on the CPU side? If you are just pushing it back to the card to use as a texture, you would be way better off using glCopyTexSubImage2D() instead. (Even if you are modifying the data, you would be better off using a GLSL program to do that.)

Finally, if you buy a machine with PCI express, you will find that texture uploads are pretty darned fast as is.


Thank you very much for your information… well, my question was admittedly a bit unspecific. To go into a little more detail:
I’m writing an application in which I need a small texture containing the average RGB values of some parts of another (much larger) texture - the parts have different sizes… Calculating the averages on the GPU might be quite slow, especially if the parts I need the averages of are large (because of the different part sizes, parallelism gets lost…), so I was thinking about reading that texture data back to the CPU, calculating the averages there, and transferring the results back to the GPU. I have a PCI Express card, so it might not take that much time… I was just reading something about PBOs and thought they might be the fastest way, but I’ve never worked with that extension.
Thanks again for your information - I’ll let you know about the results I get…

Doing the average on the CPU is probably the most straightforward way, but if performance is critical you might want to look into hardware-accelerated convolutions. There are limits on the kernel size you can use (depending on the graphics card) – but I remember reading an online article that described how to emulate larger kernels with multiple passes. (I can’t seem to find it.)

Good luck with your project.

Thank you for the hint. I implemented multi-pass averaging with Cg and FBOs, using an image pyramid and averaging the 4 neighbouring texels in each step. At the end this gives me the overall average of all texels, and it runs really fast (~300 fps at VGA resolution on an NV 6600GT). But in my case I need the averages of so many different regions of the texture that it might be easiest and fastest to solve it on the CPU.


You should be able to do better. Clear a buffer and render each of the regions into a different area of a single buffer. Transfer to a texture and use automatic mipmap generation (a very fast 2x2 box filter). Read back the single texel of the highest mipmap level. (You may find it faster to render this mipmap level and read back the pixel.) Multiply by the buffer area and divide by the sum of the region areas to scale the result.

Use a lower level mip (more texels) if you require higher accuracy and average them on the CPU.

That sounds like a very good solution - I was thinking about that, but I’ve never used mipmaps and wasn’t sure it would really do the right thing for me. But if you say so, I will try it. Thank you very much! I have to use a POT TEXTURE_2D for that, right? Mipmapping is not supported for TEXTURE_RECTANGLE (which is what I’m using right now), right?

Just want to mention something I’ve thought of, hoping it could help (maybe not :) )

In my motherboard BIOS I can enable/disable AGP fast writes, but it seems my driver doesn’t want to enable it. From some reading, I concluded that this option only works on 64-bit CPUs (I don’t know why).
Also, I don’t know if there is an equivalent setting for PCI Express…