Compress images before reading back from GPU

I need to read back (glReadPixels through PBOs) huge amounts of data from GPU memory to system memory. Of course this is the biggest bottleneck in my application. Since most of the time the images would compress very well even with something as simple as RLE, I am thinking about saving time by RLE-compressing them before readback.

Is this at all possible with OpenGL and/or GLSL? And if it isn’t, can I use CUDA/OpenCL for this purpose, and is it even a good idea if I presume that I can achieve (on average) at least 2:1 compression?

Thanks.

RLE is hard to parallelize, so I’m not sure how a CUDA/CL kernel could help.
I am no expert on compressed GL textures, but a good guess might be to try glCopyTexSubImage2D into a compressed texture, then glGetCompressedTexImage to retrieve the compressed version.
http://www.opengl.org/sdk/docs/man/xhtml/glGetCompressedTexImage.xml
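Something along these lines might work (an untested sketch; it uses glCopyTexImage2D to allocate and copy in one step, assumes the S3TC extension is present and that the read framebuffer is XRES x YRES, and note that DXT1 is lossy):

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
/* Copy the framebuffer into compressed storage; the driver does the
   (lossy) DXT1 compression during the copy. */
glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT1_EXT,
                 0, 0, XRES, YRES, 0);
/* Ask how big the compressed image actually is, then read it back. */
GLint compressedSize = 0;
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0,
                         GL_TEXTURE_COMPRESSED_IMAGE_SIZE, &compressedSize);
void* blocks = malloc(compressedSize);
glGetCompressedTexImage(GL_TEXTURE_2D, 0, blocks);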

Then there is the choice of a usable compression format. All those I am aware of are lossy, so that may not be suitable for you…
These venerable S3TC formats are often available:
http://www.opengl.org/registry/specs/EXT/texture_compression_s3tc.txt
For very modern hardware, there is a newer, better block compressor available:
http://www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt

Thanks for your reply.

I was afraid that it would be difficult to parallelize something like this. However, I do not dare to risk lossy compression.

To do parallel RLE with good performance, you’d probably want to block the data so you can farm out the compression across a large number of cores in parallel.
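For illustration, here is a minimal byte-wise RLE that each worker could run on its own strip of the image (the strip layout is up to you; each output slot must be sized for the worst case of 2 output bytes per input byte, so strips never overlap in the output buffer):

/* Compress one strip with byte-wise RLE.  Each worker thread gets its
   own strip and its own output slot, so no two threads ever write to
   the same memory. */
size_t rle_compress(const unsigned char* src, size_t n, unsigned char* dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char v = src[i];
        size_t run = 1;
        while (i + run < n && src[i + run] == v && run < 255)
            run++;
        dst[out++] = (unsigned char)run;  /* run length */
        dst[out++] = v;                   /* repeated value */
        i += run;
    }
    return out;  /* caller records this per-strip compressed size */
}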

But before that, it’s also worth measuring what kind of readback performance you’re getting right now. For PCIe v2 x16, you should be getting up to 8 GB/sec (theoretical) or ~6.4 GB/sec (in practice).

This is what I am thinking about. Probably the best solution would be to divide the image into maybe 32, 64, or 128 rectangular blocks, one for each thread. The problem is that the RLE-compressed blocks would not all be the same size. Is it possible to read back data from the GPU using CUDA/OpenCL in parallel (i.e. each thread reads back its own block to a system-memory buffer big enough to make sure they don’t overlap)? Sorry to pick your brains with something that should be trivial for someone with CUDA/OpenCL knowledge, but it would be nice to know whether it is worth the effort for me to learn one of the two.

(I haven’t measured the exact readback performance before; I only know that if I comment out the single glReadPixels in my GL thread and render only to the VGA preview window, it speeds up the render tremendously. I have a ‘benchmark project’ where I render a model as a display list consisting of ~30,000 vertices as many times as playback at 50 or 60 fps allows. I only use mid- to high-end GeForce cards.)

And for how many RGBA8/BGRA8 pixels?
Try rendering to 2 buffers, ping-ponging between them, so while one is being rendered to by the GPU, the other is read by the CPU.

Can you detail the kind of data you are retrieving? It may benefit from a different colorspace, such as 2 components, or 3 components with 8 bits for one and 4 bits each for the two others, etc.

I read back BGRA8 pixels. I need at least two streams of HD (1920*1080@30fps) and I already use PBO ping-pong. Two streams of HD should be possible, but since my experience tells me that reading back just one already limits what I can render, my idea was to try to reduce the number of bytes I need to read back. Colour-space compression is a possibility, but that is a kind of lossy compression too, and my images would compress much, much better with simple RLE.

1 stream (1920*1080@30fps, BGRA8) is 250MB/sec (1920 * 1080 pixels * 4 bytes * 30 fps ≈ 249 MB/sec). Even with a few-years-old motherboard/graphics card you can do 2.5GB/sec. Give up the idea of RLE compression.

A plausible solution would be YCrCbA 4:2:2:4 (3 bytes per pixel on average instead of 4); you will get about 190MB/sec per stream. Or maybe YCrCbA 4:1:1:4.

Do you mean read data back from the GPU in parallel with kernels executing on the GPU?

In CUDA, yes. Not sure about OpenCL. There’s an excellent description of using streams in CUDA to do exactly this (termed “device overlap”) in the book “CUDA By Example” (Chapter 10). In short, the GPU has a Kernel Engine and a Copy Engine, each of which has its own work queue. If you order your tasks into streams properly, both engines can execute tasks in parallel.

This requires a GPU which supports “device overlap”, which IIRC is any compute capability 1.1 or better GPU (later GeForce 8 chips and newer).
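A rough sketch of the pattern with the CUDA runtime (compress_block, grid, block, NUM_CHUNKS, and CHUNK_BYTES are made-up names for illustration; d_in/d_out are device buffers allocated elsewhere):

/* Needs cuda_runtime.h and compilation with nvcc. */
cudaStream_t stream[2];
for (int s = 0; s < 2; ++s)
    cudaStreamCreate(&stream[s]);

unsigned char* h_out;  /* page-locked, so async copies can overlap kernels */
cudaHostAlloc((void**)&h_out, NUM_CHUNKS * CHUNK_BYTES, cudaHostAllocDefault);

for (int i = 0; i < NUM_CHUNKS; ++i) {
    int s = i & 1;  /* alternate between the two streams */
    /* While the Copy Engine is still draining an earlier chunk on one
       stream, the Kernel Engine can compress the next chunk on the other. */
    compress_block<<<grid, block, 0, stream[s]>>>(d_in + i * CHUNK_BYTES,
                                                  d_out + i * CHUNK_BYTES);
    cudaMemcpyAsync(h_out + i * CHUNK_BYTES, d_out + i * CHUNK_BYTES,
                    CHUNK_BYTES, cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();  /* wait for both engines to finish */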

You may be able to use this capability in OpenGL (when available) via multi-part readbacks through PBOs. Not sure about that though.

“i.e. each thread reads back its own block to a system-memory buffer big enough to make sure they don’t overlap”

I’m no guru here, but the only capability I know about for having GPU threads writing back to CPU (host) memory in parallel is to have your kernel writing directly to page-locked (pinned) CPU memory.
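Very roughly, that looks like this (compress_block is again a stand-in kernel name, and the device must report canMapHostMemory):

/* Needs cuda_runtime.h; d_in is a device buffer allocated elsewhere. */
cudaSetDeviceFlags(cudaDeviceMapHost);        /* before the context is created */

unsigned char *h_buf, *d_alias;
cudaHostAlloc((void**)&h_buf, TOTAL_BYTES, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);

/* The kernel writes through d_alias and the bytes land directly in the
   page-locked host buffer, so no separate cudaMemcpy is needed. */
compress_block<<<grid, block>>>(d_in, d_alias);
cudaDeviceSynchronize();                      /* h_buf now valid on the CPU */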

“(I haven’t measured the exact readback performance before; I only know that if I comment out the single glReadPixels in my GL thread and render only to the VGA preview window, it speeds up the render tremendously.)”

I would time whatever approach you are using to ensure you are getting decent download rates. If you can solve your problems merely by improving your download approach, then on-GPU compression might not even be required.
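One crude way to time it (a sketch using the Win32 high-resolution timer, since the thread already uses Win32 types; pbo and N are placeholders):

/* Requires <windows.h>.  'pbo' is an already-created pack PBO of
   XRES*YRES*4 bytes; N readbacks are timed end to end. */
const int N = 100;
LARGE_INTEGER t0, t1, freq;
QueryPerformanceFrequency(&freq);
glFinish();                                   /* drain pending GL work first */
QueryPerformanceCounter(&t0);
for (int i = 0; i < N; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    glReadPixels(0, 0, XRES, YRES, GL_BGRA_EXT, GL_UNSIGNED_BYTE, 0);
    glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY); /* blocks until done */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
}
QueryPerformanceCounter(&t1);
double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
double mbPerSec = (double)N * XRES * YRES * 4 / (1024.0 * 1024.0) / secs;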

Some recent high-end cards seem to have abnormally slow download rates. Which GPUs are you using?

One cool thing about reading back through page-locked memory is that you can reportedly eliminate the extra copies that can otherwise occur. That may be an option for you. But you might post your ping-pong PBO approach (formats used, etc.) before jumping off into OpenCL or vendor-specific APIs.

“Do you mean read data back from the GPU in parallel with kernels executing on the GPU?”

Yes. I am thinking about assigning the task of compressing and downloading a block of pixels to each thread.

“Which GPUs are you using?”

I use G80 or better GPUs, mainly the GeForce 200 series, and I also have a 460 to try the Fermi line. I had some problems with the latter, but I haven’t noticed anything wrong with the download speed.

I have tried to isolate the lines in my code that are responsible for reading back the images:

// Release the PBO the CPU has finished reading.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pbo[1]);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

// Swap the two PBOs (ping-pong).
GLuint pb = Pbo[0];
Pbo[0] = Pbo[1];
Pbo[1] = pb;

// Start an asynchronous readback of the current frame into Pbo[0].
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pbo[0]);
glReadPixels(0, 0, XRES, YRES, GL_BGRA_EXT, GL_UNSIGNED_BYTE, 0);

// Map Pbo[1], which holds the previous frame, for CPU access.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pbo[1]);
DWORD* Pixbuffloc = (DWORD*)glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);