Asynchronous read-back performance with PBO

Cyril · October 25, 2004, 12:35am

Hello,
I’m trying to achieve an asynchronous read-back of the frame buffer so that pixel transfers can overlap with their processing. My application uses 4 AUX buffers on a 6800, and I use PBOs like this:

  ReadBuffer(AUX0)
  BindBuffer(PBO 0)
  ReadPixel
  For(i=0; i<4; i++)
    if( i<3 )
       ReadBuffer(AUX[i+1])
       BindBuffer(PBO i+1)
       ReadPixel
    endif

    BindBuffer(PBO i)
    MapBuffer
    ProcessPixels(MappedBuffer)
    UnmapBuffer
  EndFor

To sum up, I start reading pixels into a PBO, then map the previous buffer and process its pixels while the next buffer is filled asynchronously. But this seems to be far slower than simply reading each AUX buffer back into system memory and then processing the pixels.
Has anybody experienced the same problem? Do you know if asynchronous pixel transfers from the framebuffer are really functional?
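
The overlap this loop is meant to achieve can be modelled roughly as follows. This is only a sketch under stated assumptions: every transfer takes a fixed `transfer_ms`, every processing step a fixed `process_ms`, transfers run one at a time, and ReadPixels returns immediately. The function names are made up for illustration and are not part of any GL API.

```c
#include <assert.h>

/* Time for the plain path: read one buffer, process it, repeat. */
int serial_ms(int n, int transfer_ms, int process_ms)
{
    assert(n >= 0);
    return n * (transfer_ms + process_ms);
}

/* Time for the pipelined loop: the read of buffer i+1 is issued
   before buffer i is mapped and processed. Assumes ReadPixels
   returns at once and transfers execute one after another. */
int pipelined_ms(int n, int transfer_ms, int process_ms)
{
    assert(n >= 0);
    if (n == 0) return 0;

    int cpu = 0;               /* current CPU time */
    int ready = transfer_ms;   /* when buffer i has fully arrived (buffer 0 issued at t=0) */

    for (int i = 0; i < n; i++) {
        int next_ready = ready;
        if (i < n - 1) {
            /* the next transfer starts once it is issued and the previous one is done */
            int start = cpu > ready ? cpu : ready;
            next_ready = start + transfer_ms;
        }
        /* MapBuffer blocks until buffer i has arrived */
        if (cpu < ready) cpu = ready;
        cpu += process_ms;
        ready = next_ready;
    }
    return cpu;
}
```

With illustrative values of four buffers, 3 ms per transfer and 2 ms of processing, `serial_ms(4, 3, 2)` gives 20 and `pipelined_ms(4, 3, 2)` gives 14. Under this model the PBO path should never be slower than the plain read-back, so a slowdown suggests the transfers are not actually overlapping.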

yooyo · October 25, 2004, 2:28am

Which pixel format do you use with glReadPixels? You should use GL_BGR or GL_BGRA. Also, glMapBuffer can be a bottleneck, because the transfer might not be finished and the CPU has to wait until it is.

Maybe you can try to use NV_pixel_data_range and fences. Something like:

  
// Initiate async transfer
for (i=0; i<4; i++)
{
 ReadBuffer(AUX i)
 SetPDRPointer(pointer[i])
 ReadPixels()
 SetFence(fence[i]) // fence completes once the preceding ReadPixels has completed
}
// do something useful on the CPU while the transfers are in flight

// process transferred data
for (i=0; i<4; i++)
{
 if (TestFence(fence[i]))
 {
  // data has been transferred!
  // use data in pointer[i]
 }
 else
 {
  // worst case! CPU has to wait :(
  // transfer is not yet finished
  FinishFence(fence[i]) // wait for the transfer to finish
  // use data in pointer[i]
 }
}

If your code is constantly in the else branch, then your app is transfer limited. In this case, reorganize the main loop like this:

bool bProcessing = false;
while (!bQuit)
{
 if (bProcessing)
   ProcessDataInSystemMemory();
 RenderScene();
 InitiateTransferFromAUXBuffersToSystemMemory();
 bProcessing = true;
 SwapBuffers();
}

In this case, if you enable vsync, the app will wait in SwapBuffers until vblank. This time should be enough for the DMA transfer to finish.

yooyo

Cyril · October 25, 2004, 6:01am

Thank you yooyo. I have tried GL_BGRA instead of GL_RGBA in glReadPixels, but I still see no change.
I have also made a small test with NV_pixel_data_range and fences:

 
...
glReadPixels(0,0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, getBuffs[0]); // Force previous drawing to complete
printf("time readbuffers 0 : %d\n", glutGet(GLUT_ELAPSED_TIME)-tstart);

for(int i=0; i<4; i++){
  glReadBuffer(buffers[i]);
  glEnable(GL_READ_PIXEL_DATA_RANGE_NV);

  glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, sizeX*sizeY*4, getBuffsPDR[i]);
  // The fence is set before the read, so it only covers commands issued up to this point
  glSetFenceNV(fences[i], GL_ALL_COMPLETED_NV);
  glReadPixels(0,0, sizeX/4.0, sizeY/4.0, GL_BGRA, GL_UNSIGNED_BYTE, getBuffsPDR[i]);
}

printf("time after read : %d\n", glutGet(GLUT_ELAPSED_TIME)-tstart);
/*
for(int i=0; i<10000; i++)
  printf("Foo");

printf("time after foo : %d\n", glutGet(GLUT_ELAPSED_TIME)-tstart);
*/

for(int i=0; i<4; i++){
  if(glTestFenceNV(fences[i]))
    printf("time fence %d : %d\n", i, glutGet(GLUT_ELAPSED_TIME)-tstart);
  else{
    printf("Too early, waiting... %d\n", i);
    glFinishFenceNV(fences[i]);
  }
}

glDisable(GL_READ_PIXEL_DATA_RANGE_NV);

But the “time after read” is always 12ms, the time to read back the 4 buffers, and still no overlapping seems to occur. I never go into the else case of the TestFence, even if nothing is done after ReadPixels. I allocate getBuffsPDR[i] with “wglAllocateMemoryNV(5000000, 1.0, 0.0, 1.0)”…
I must be doing something wrong, but I don’t know what…

NiCo1 · October 25, 2004, 12:46pm

Correct me if I’m wrong, but IMHO each call returns directly after starting the DMA transfer, but it does not imply that the buffer can be mapped before the transfer is complete.

So calling glReadPixels the old-fashioned way is faster than using PBOs and mapping at full resolution.

If you map the buffers like the example in the EXT_pixel_buffer_object spec, using 2 buffers at half resolution, a buffer can be mapped twice as fast, because each glReadPixels call finishes twice as fast and the mapping can take place sooner.

Nico

yooyo · October 25, 2004, 12:53pm

Hmmm… is glutGet(GLUT_ELAPSED_TIME)-tstart the time since app start? Try this:
int starttime = glutGet(GLUT_ELAPSED_TIME);
//for loop with glReadPixels
int endtime = glutGet(GLUT_ELAPSED_TIME);
int elapsedtime = endtime - starttime;
Then print elapsedtime!

In the PDR case it is normal that glReadPixels doesn’t “eat” CPU time, because it only initiates the pixel transfer and returns immediately.

Also, you can try allocating one big PDR buffer (4 * imageX * imageY * sizeof(pixel)) and then calling glPixelDataRangeNV(…, imageX * imageY * sizeof(pixel), pointer calculated from the buffer index).
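
The pointer calculation could look roughly like this. It is a minimal sketch: `image_bytes` and `sub_buffer` are hypothetical helper names, and `base` stands in for whatever pointer the single large allocation returned.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one image. */
size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Start of image `index` inside one large allocation `base`.
   `base` is assumed to hold at least (index + 1) images. */
unsigned char *sub_buffer(unsigned char *base, size_t index,
                          size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(base != NULL);
    return base + index * image_bytes(width, height, bytes_per_pixel);
}
```

For example, with 256x256 images at 4 bytes per pixel, each image takes 262144 bytes, so image 1 starts 262144 bytes after `base` and image 3 starts 786432 bytes after it.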

yooyo

Cyril · October 25, 2004, 9:32pm

No, tstart is the start time of the frame, and the 4 glReadPixels calls “eat” 12ms, the entire read-back time. So it doesn’t seem to just initiate the transfer. I have also allocated one big PDR buffer, because I wasn’t able to allocate more than one with wglAllocateMemoryNV: the last 3 calls returned NULL, I don’t know why. So still no overlapping.
Nico, you are right, but in my case I used 4 buffers, allowing a buffer to be mapped 4 times faster. My problem is being able to process the data that is already mappable while the rest is transferred.
I don’t know if it is important, but I’m rendering into a pbuffer…

Cyril · October 26, 2004, 12:22am

OK, with PDR I have got glReadPixels to return in 2ms instead of the 22ms of transfer time, by using glEnableClientState(GL_READ_PIXEL_DATA_RANGE_NV) instead of glEnable and with only one call to glPixelDataRangeNV, because each call seems to flush the previous one.
But there is still no overlapping, because the program always goes into the else case of glTestFenceNV (even if the foo loop takes more than 200ms) and executes glFinishFenceNV, which returns in… 22ms! The transfer time. So the whole transfer seems to happen while waiting for the fence, and not while the foo loop is running…

yooyo · October 26, 2004, 1:28am

In this case, reorganize your main render loop as I suggested in my previous post. Transfer speed is the same with or without PDR, but PDR allows the CPU to do something else during the read-back.
Depending on your image size, the transfer might be finished before the next loop iteration. The worst case is that you have to wait for the fences even at the beginning of the rendering loop.

Can you tell me the image sizes?

yooyo

Cyril · October 26, 2004, 2:05am

The total image size in my last test, with PDR and only one ReadPixels, is 1024x1024. In the first test, with 4 draw buffers, each image was 256x256.
But in my last test with PDR I am not transfer limited: I know the time to transfer the 1024x1024 image from the framebuffer is 22ms on average, and after initiating the transfer I ran a foo loop for more than 2 seconds. The TestFence still fails and glFinishFence still takes 22ms to return…
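
As a sanity check on the 22ms figure, the implied read-back rate can be worked out with a small helper. This is a sketch that assumes 4 bytes per pixel (BGRA with GL_UNSIGNED_BYTE, as in the earlier test); `readback_mib_per_s` is a made-up name.

```c
#include <assert.h>

/* Read-back throughput in MiB/s for a width x height image with
   bytes_per_pixel bytes per pixel, transferred in `ms` milliseconds. */
double readback_mib_per_s(int width, int height, int bytes_per_pixel, double ms)
{
    assert(ms > 0.0);
    double bytes = (double)width * height * bytes_per_pixel;
    return bytes / (1024.0 * 1024.0) / (ms / 1000.0);
}
```

Calling `readback_mib_per_s(1024, 1024, 4, 22.0)` gives roughly 182 MiB/s, that is 4 MiB moved in 0.022 seconds.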

imported_Adrian1 · October 26, 2004, 3:03am

You have hit exactly the same problems as I did.

You need to add a glFlush() after your glReadPixels call to actually start the read. I don’t understand why.

As you have already discovered, making GL calls after the glReadPixels can force the read to finish before processing continues.

Cyril · October 26, 2004, 3:47am

Yes, it works!!! Thank you very much, Adrian!
It is also true with PBO: there must be a glFlush after ReadPixels to initiate the transfer.
It’s very good, but it is still a little bit tedious that multiple asynchronous transfers can’t be in flight at the same time, because each ReadPixels flushes the previous one…
I also wonder if there is a way, other than fences (which are NV-only), to detect the completion of a transfer with PBO?