Strange thing with glReadPixels

Hi,

I had a few problems with it in the past, and having read quite a few posts about it, I guess I am not alone. Still, I thought my solution worked well.

I render to an FBO at HD resolution (1920*1080, 25 FPS). I did not go the PBO route for reading back the final buffer holding my rendered frames, because I thought the CPU is the one that has to wait for the end result anyway, so I would not gain anything by making the readback asynchronous with a PBO. At this resolution the readback is about 200 MB/sec (1920*1080*4 bytes at 25 FPS), which the PCIe bus is well capable of.
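
Stripped down, the synchronous readback is essentially just this (a simplified sketch; fbo and sysmemBuffer stand in for my actual FBO handle and destination buffer):

// Simplified sketch of the synchronous path: bind the FBO holding the
// finished frame and read it straight into system memory. The call
// blocks until rendering and the transfer are done.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);          // fbo: placeholder name
glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glReadPixels(0, 0, XRES, YRES, GL_BGRA_EXT, GL_UNSIGNED_BYTE, sysmemBuffer);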

But when the render becomes complicated (lots of polygons, shadows, reflections, etc.), playback starts to stutter, and according to Task Manager the CPU core handling the OpenGL thread is close to 100%. If I do nothing else but comment out the glReadPixels line, that near-100% load drops back to 2-3%, and playback on the VGA output looks perfect (which also means my GPU still has not reached its limits).

I simply cannot explain this. I switched to using two PBOs to make the readback asynchronous, and although the CPU load now does not climb over 10-11%, playback still stutters sometimes. If I comment out just the glReadPixels call with the NULL offset at the end, which AFAIK only involves DMA to fill the PBO, playback becomes perfectly smooth, but of course I don't get anything in the system RAM buffer that is supposed to hold the read-back data.

Can someone explain why this happens, and why only when the render becomes complicated? What does the render's complexity have to do with filling a buffer that should, in theory, only involve DMA?

Thanks.

“play back still stutters sometimes”
Not very precise. Can you log the milliseconds per frame and show some of the slow ones?
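
Something along these lines would do (a rough sketch using QueryPerformanceCounter on Windows; the 45 ms threshold and the printf are just examples):

#include <windows.h>
#include <stdio.h>

// Log the time between two consecutive calls; call once per frame,
// right around the readback / frame hand-off.
void LogFrameTime(void)
{
    static LARGE_INTEGER prev;
    LARGE_INTEGER now, freq;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);

    if (prev.QuadPart != 0)
    {
        double ms = (now.QuadPart - prev.QuadPart) * 1000.0 / (double)freq.QuadPart;
        if (ms > 45.0)                  // anything over ~40 ms is a dropped frame at 25 fps
            printf("slow frame: %.2f ms\n", ms);
    }
    prev = now;
}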

The high CPU usage with glReadPixels is probably because the driver busy-loops while waiting for the video card to be ready for the transfer. A plain glReadPixels is often not the most optimized GL call :) so PBOs seem unavoidable in your case.

Can you share your hardware/software specs, such as video card, CPU, driver version, OS, etc.?

Thanks for the reply. The hardware is an Intel Q9550 CPU and an NVIDIA GTX 260 GPU.

I know it is not precise, but it is difficult to measure. The finished frames are transferred to an HD video card, hence the need for glReadPixels. The whole thing is synced to the video card, not to the VGA VSync.

I changed it to an async PBO solution and it helps somewhat, but it is still a mystery to me why the first part of the PBO transfer (i.e. glReadPixels(0,0,XRES,YRES,GL_BGRA_EXT,GL_UNSIGNED_BYTE,0)) adds so much to the CPU load. I thought this needs no work from the CPU (it is asynchronous).

Do you display the rendered image on the GTX 260 too? Maybe you have vsync turned on, so it waits for the retrace, burning CPU cycles.
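
If it is on, you can force it off from code with WGL_EXT_swap_control, assuming your driver exposes it (sketch):

// Disable vsync on the preview context via WGL_EXT_swap_control
// (only if the extension is actually exposed by the driver).
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);
PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
if (wglSwapIntervalEXT)
    wglSwapIntervalEXT(0);              // 0 = do not wait for the vertical retrace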

If you are on Windows, turn off all the eye-candy stuff (full window drag, shadows under menus, etc.). A simple operation like minimizing a window stops everything for 250 msec, even time-critical threads.

In your pipeline you have two independent workers, the HD card and the GPU. To avoid any stalls you have to double-buffer on the GPU side: when the HD card requests a frame, give it one immediately and post the next frame request to the GPU render thread. While the HD card processes a frame, the GPU prepares the next one.
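
A rough sketch of that hand-off with two sysmem frames and two Win32 events (RenderAndReadBack and SendFrameToHDCard are placeholders for your actual render/readback and output calls):

#include <windows.h>

// Two finished frames in sysmem plus two auto-reset events.
// frameTaken starts signalled so the GL thread can produce the first frame.
void  *frame[2];
int    readyIx    = 0;
HANDLE frameReady = CreateEvent(NULL, FALSE, FALSE, NULL);
HANDLE frameTaken = CreateEvent(NULL, FALSE, TRUE,  NULL);

// Inside the GL thread's per-frame loop: render + read back into the frame
// the HD card is NOT using, wait until the previous frame has been taken,
// then publish the new one.
RenderAndReadBack(frame[1 - readyIx]);      // placeholder: render + PBO readback
WaitForSingleObject(frameTaken, INFINITE);
readyIx = 1 - readyIx;
SetEvent(frameReady);

// Inside the HD-card thread's loop: hand the finished frame over immediately,
// then let the GL thread reuse the other buffer.
WaitForSingleObject(frameReady, INFINITE);
SendFrameToHDCard(frame[readyIx]);          // placeholder: the real output call
SetEvent(frameTaken);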

Yes I do, there is a preview window. Since the render is not synced to the GPU, and 1080i requires me to render at twice the frame rate while only every other frame is displayed on the VGA, I cannot expect playback in this window to look smooth, and it doesn't.

The pipeline is more or less what you describe. I have two threads, one for the HD card and one for OpenGL, each assigned to a different CPU core, so they really do run in parallel. The whole problem happens only when the GPU has complicated stuff to do; even moderately big scenes are perfectly fine. Still, the GPU itself cannot be the bottleneck: without reading back the buffer, the render is fine even if I give the GPU a lot more to render.

Do you know whether glReadPixels, when used to fill a PBO as part of an async transfer, waits for all previous commands to finish first?

If you have a PBO bound, glReadPixels returns immediately, but if you try to map that PBO it will stall until all pending operations have finished. So the best approach is to use two PBOs: "post" the glReadPixels into the first PBO, then map the second one and copy its data to sysmem. Then swap the PBO names.

Try to use non-pageable memory (VirtualAlloc/VirtualLock/VirtualFree) for the sysmem buffers. Use a fast memcpy function. Move the rendering code to a worker thread and leave the main thread to the UI. Try turning off threaded optimization in the NV control panel…
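
For the sysmem buffers that could look roughly like this (one BGRA HD frame; if VirtualLock fails you may need to raise the working set size with SetProcessWorkingSetSize):

#include <windows.h>

// Non-pageable destination buffer for the readback (one 1920x1080 BGRA frame).
SIZE_T size = 1920 * 1080 * 4;
void  *buf  = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
if (buf != NULL)
    VirtualLock(buf, size);             // pin the pages so they cannot be paged out

/* ... use buf as the memcpy destination ... */

VirtualUnlock(buf, size);
VirtualFree(buf, 0, MEM_RELEASE);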

This is how it looks:

in glthread:

at the ‘start’:

// unmap the PBO the video thread copied from last frame
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pixbuff[1]);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
// swap: last frame's readback target becomes the one to map,
// the freshly unmapped PBO becomes this frame's readback target
GLuint pbu=Pixbuff[0];
Pixbuff[0]=Pixbuff[1];
Pixbuff[1]=pbu;
Pixbuffloc[0]=NULL;

after render is finished:

// post the async readback of the just-rendered frame into Pixbuff[0]
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pixbuff[0]);
glReadPixels(0,0,XRES,YRES,GL_BGRA_EXT,GL_UNSIGNED_BYTE,0);

// map the other PBO (filled by the previous frame's readback)
// and hand its pointer to the video thread
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, Pixbuff[1]);
DWORD* Src = (DWORD*) glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB,GL_READ_ONLY);
Pixbuffloc[0]=Src;

in video thread:

DWORD* Src = Pixbuffloc[0];                 // mapped PBO (previous frame's pixels)
DWORD* Dest=(DWORD*) BuffFrame[1-GlBuffIx]; // sysmem buffer for the HD card
int wihe=XRES/2*YRES;                       // number of 8-byte chunks (2 BGRA pixels each)

// MMX copy loop: moves 8 bytes (one quadword) per iteration
__asm
{
push esi
push edi
mov esi,dword ptr Src
mov edi,dword ptr Dest

mov ecx,wihe

lxz0:
movq mm0,[esi]
add esi,8
movq [edi],mm0
add edi,8
loop lxz0
pop edi
pop esi
EMMS
}

Since it is the video thread that signals the GL thread, glUnmapBuffer is guaranteed to execute after the memcpy in the video thread has finished.

The UI is handled in a different thread assigned to a different core, so there should be nothing that 'disturbs' the GL and video threads. Also, the system memory buffers are allocated on 16-byte boundaries.

Do you see anything that might be wrong here? Thanks.

I remember an older NV driver had a stall problem after 44.5 MB of transferred data. My test app showed that after 44.5 MB of downloaded data it stalled for 20 msec. Newer drivers have this fixed, so maybe you should try changing the driver.

I have the latest driver, so it can’t be that.

From this, I assume there is nothing obviously wrong in my code. That is not good news; I had hoped I made a mistake or did something wrong.

Try with faster memcpy function. Also add some timing code to your threads just to pinpoint problem.
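
For example something like this with SSE2 streaming stores (just a sketch; it assumes 16-byte-aligned pointers and a byte count that is a multiple of 64):

#include <emmintrin.h>
#include <stddef.h>

// SSE2 copy with non-temporal stores: the destination is written past the
// cache, which can help when the frame is only consumed once by the video card.
void CopySSE2(void *dst, const void *src, size_t byteCount)
{
    const __m128i *s = (const __m128i *)src;
    __m128i       *d = (__m128i *)dst;
    size_t         n = byteCount / 64;      // 64 bytes (4 x 16) per iteration

    while (n--)
    {
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
        s += 4;
        d += 4;
    }
    _mm_sfence();                            // make the streaming stores visible
}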

I might be mistaken, but I thought the one I use with MMX is the fastest (although I haven’t even looked at extensions beyond SSE2).

Anyway, it is in the video thread and it is not a problem.

The problem is - at least it looks like it is - the glReadPixels line. If I comment it out, the problem is gone (of course, I don’t get the data either).