PBO + glReadPixels not so fast?

Hi,

I’m trying to speed up some code that pulls data from the framebuffer using glReadPixels.

I’ve created two PBOs with usage set to GL_STREAM_READ_ARB. My rendering code then alternates between the two PBOs and does the following:

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, current);
glReadPixels(…, 0);
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);

I was under the impression that glReadPixels would return immediately, but these three lines take about 6 ms for a 1024 x 1024 framebuffer.

Currently the code isn’t doing any map/unmap so the data should never leave the GPU.

I’m using a Quadro FX 3500 with the latest drivers (169.96).

Am I doing something wrong or is this to be expected?

/A.B.

glReadPixels has to wait until the rendering is done before it can start reading, so yes, it’s as expected.

Use two PBOs. Bind the 1st, read pixels, bind the 2nd, map and copy the data to sysmem, unmap, unbind all PBOs, then swap the PBO names. Repeat this every frame.
One more thing: use the GL_BGR or GL_BGRA pixel format.

Good reading on PBO upload/readback - http://www.songho.ca/opengl/gl_pbo.html

PS: Yooyo seems to be a little bit tired of answering the same question several times a month :)

Exactly what I’m doing. But as I said, it’s not as fast as I expected. I don’t see much improvement over just doing an ordinary glReadPixels.

/A.B.

Then you are doing something wrong…

  1. Render frame
  2. bind pbo1
  3. readback
  4. bind pbo2
  5. copy previous frame from pbo to sysmem… map buffer, copy data to sysmem (use some fast memcpy code)
  6. unmap buffer
  7. unbind pbos
  8. swap pbo1 and pbo2

So the frame will be in sysmem with a one-frame delay. With a PBO, the readback call is non-blocking. But glMapBuffer can block if there is a pending operation on the currently bound buffer. So if you call glMapBuffer too soon, it will block; if there are no pending operations, glMapBuffer returns very quickly.
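The one-frame-delay scheme above can be sketched without any GL calls, just to make the slot bookkeeping explicit. This is only an illustration: `PboRing` and its members are hypothetical names, not something from the posts. It generalizes the two-PBO swap to a pool of N buffers, where the buffer being mapped is always the one whose readback was issued longest ago.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the thread): tracks which PBO slot receives
// this frame's glReadPixels and which slot should be mapped this frame.
struct PboRing {
    std::size_t count;      // number of PBOs in the pool, at least 2
    std::size_t frame = 0;  // frames submitted so far

    explicit PboRing(std::size_t n) : count(n) { assert(n >= 2); }

    // Slot that this frame's asynchronous readback should target.
    std::size_t writeSlot() const { return frame % count; }

    // Slot holding the oldest pending readback, so mapping it is least likely to block.
    std::size_t readSlot() const { return (frame + 1) % count; }

    // False during warm-up, while readSlot() has not been written yet.
    bool readValid() const { return frame + 1 >= count; }

    // Call once per frame after the readback and the map/copy/unmap.
    void advance() { ++frame; }
};
```

With two PBOs the data arrives one frame late; with N PBOs it arrives N-1 frames late, in exchange for giving each readback more time to finish before its buffer is mapped.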

Ok, so the question is what am I doing wrong?

The initialization code looks roughly like this:


void SomeClass::initPBO() {
  glGenBuffersARB(2, m_ids);
  for (int i = 0; i < 2; ++i) {
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, m_ids[i]);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, m_width * m_height * 4, 0, GL_STREAM_READ_ARB);
  }
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
  m_active = 0;
}

and the capture code looks like this:


void SomeClass::capture() {
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, m_ids[m_active]);
  glReadPixels(0, 0, m_width, m_height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
  glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
  m_active = 1 - m_active;
}

As you can see, right now I’m not even mapping the buffers (eventually I will of course, otherwise this whole exercise would be kind of pointless), and the code in capture still takes about 6 ms. This could still stall, I guess, if one were rendering at a high enough framerate? However, my rendering is capped at 15 fps so this shouldn’t be an issue.

The values of m_width and m_height have not changed since I created the buffers, so their sizes are still valid.

/A.B

Could I for some reason be getting a PBO in system memory? From what I can see in the spec:

http://www.opengl.org/registry/specs/ARB/pixel_buffer_object.txt

there’s really nothing preventing this, am I wrong?

For the record, I’ve tested Song Ho Ahn’s Asynchronous Read-back example and there I see a very clear difference in read speed when using PBO. From what I can see I’m not doing anything differently in my code, except that I’m using a lot more GPU memory for other things.

/A.B.

Yes.

From what I can see in the spec: […] there’s really nothing preventing this, am I wrong?
No.

Try with GL_STATIC_READ. Check your driver control panel… maybe you have checked some forced AA or such… can you post a repro case?

Song Ho Ahn’s demo shows 3.1 Mpix/sec on my laptop, but I achieved 1.6 GB/sec (same as CUDA).

Unfortunately the precompiled binary for Song Ho Ahn’s demo uses a screen size of 256 x 256 and waits for vertical refresh. With a refresh rate of 60 Hz this means the transfer rate will cap at 3.7 Mpixels/s regardless of whether PBOs are on or off. (The figure of 3.1 Mpixels/s suggests you’re using a refresh rate of 50 Hz, correct?)
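The cap can be checked with a little arithmetic. The sketch below assumes one readback per displayed frame and that a "Mpixel" is counted as 1024 x 1024 pixels; the second assumption is an inference, made because it reproduces the two figures quoted here (3.75 at 60 Hz and 3.125 at 50 Hz). `vsyncCapMpixels` is a made-up name for illustration.

```cpp
#include <cassert>

// Upper bound on readback throughput when exactly one frame is read per
// vertical refresh. Assumes 1 Mpixel = 1024 * 1024 pixels (an inference
// from the figures in the thread, not something the demo documents here).
double vsyncCapMpixels(int width, int height, double refreshHz) {
    assert(width > 0 && height > 0 && refreshHz > 0.0);
    return static_cast<double>(width) * height * refreshHz / (1024.0 * 1024.0);
}
```

For example, `vsyncCapMpixels(256, 256, 60.0)` gives 3.75 and `vsyncCapMpixels(256, 256, 50.0)` gives 3.125, so with vsync on the measured rate says nothing about how fast the readback itself is.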

You will have to recompile the project yourself and increase the buffer sizes and disable vsync. When doing this you will see a clear difference between using PBO and not using PBO.

I’m using the exact same code in my application and I’m not seeing any improvement over not using PBOs. For lack of better theories, this leads me to believe that I’m getting system-memory PBOs because there’s not enough GPU RAM left to allocate the PBOs there.

Any other theories for what could be holding glReadPixels up?

/A.B.

This is my way…


#define valloc(size, prot) VirtualAllocEx(GetCurrentProcess(), NULL, (size), MEM_COMMIT, (prot))
#define vfree(mem)  VirtualFreeEx(GetCurrentProcess(), mem, 0, MEM_RELEASE)
#define vlock(mem, size) VirtualLock((mem), (size))

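#define BUFSIZE (4*1024)
#define NUMR_PBO 2 // assumed pool size; not defined in the original post

// forward declaration; the definition follows ReadBack() below
void FastMemCopy(void *dst, const void *src, void *buf, size_t bufsize, size_t nbytes);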

// globals
	GLuint m_pbos[NUMR_PBO]; // PBO pool
	int vram2sys;		 // index of PBO used to copy from vram to sysmem
	int gpu2vram;		 // index of PBO used to copy framebuffer to vram
	unsigned char* membuffer = NULL;
	unsigned char* tempbuff = NULL; // used during fast mem copy
	unsigned int memsize;


// call this with size of framebuffer
void InitReadback( int xsize, int ysize)
{
	tempbuff = (unsigned char*)valloc(BUFSIZE, PAGE_READWRITE);	  
	vlock(tempbuff, BUFSIZE);

	memsize = xsize * ysize * 4;

	if (m_pbos[0] == 0)
		glGenBuffers(NUMR_PBO, m_pbos);

	for (int i=0; i<NUMR_PBO; i++)
	{
		glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[i]);
		glBufferData(GL_PIXEL_PACK_BUFFER_ARB, memsize, NULL, GL_STATIC_READ);
	}

	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
	vram2sys = 0;
	gpu2vram = NUMR_PBO-1;

	if (membuffer != NULL)
	{
		vfree(membuffer);
		membuffer = NULL;
	}

	membuffer = (unsigned char*)valloc(memsize, PAGE_READWRITE);
	vlock(membuffer, memsize);
}

// call this once per frame or slice...
void ReadBack(int xsize, int ysize)
{
// first.. post read pixels 

	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[gpu2vram]);
	glReadPixels(0, 0, xsize, ysize, GL_BGRA, GL_UNSIGNED_BYTE, 0);

// then copy previous frame from vram to sysmem (membuffer)

	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[vram2sys]);

	void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
	if (data != NULL)
	{
		FastMemCopy(membuffer, data, tempbuff, BUFSIZE, memsize);
	}

	glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);

// unbind PBO
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

	// shift names
	GLuint temp = m_pbos[0];
	for (int i=1; i<NUMR_PBO; i++)
		m_pbos[i-1] = m_pbos[i];
	m_pbos[NUMR_PBO - 1] = temp;
}

// audiofreak tnx for this
void FastMemCopy(void *dst, const void *src, void *buf, size_t bufsize, size_t nbytes)
{
	__asm 
	{
		mov  esi, src
		mov  edi, dst
		mov  eax, buf
		mov  ebx, bufsize
		bsr  ecx, ebx
		mov  ebx, nbytes
		shr  ebx, cl
main_loop:
		test  ebx, ebx
		jz  main_loop_end
		mov  edx, eax
		mov  ecx, bufsize
		shr  ecx, 7
L1_cache_loop:
		test  ecx, ecx
		jz  L1_cache_loop_end
		prefetchnta [esi + 64 * 10]
		movaps  xmm0, [esi]
		movaps  xmm1, [esi + 16]
		movaps  xmm2, [esi + 32]
		prefetchnta [esi + 64 * 11]
		movaps  xmm3, [esi + 48]
		movaps  xmm4, [esi + 64]
		movaps  xmm5, [esi + 80]
		movaps  xmm6, [esi + 96]
		movaps  xmm7, [esi + 112]

		movaps  [edx], xmm0
		movaps  [edx + 16], xmm1
		movaps  [edx + 32], xmm2
		movaps  [edx + 48], xmm3
		movaps  [edx + 64], xmm4
		movaps  [edx + 80], xmm5
		movaps  [edx + 96], xmm6
		movaps  [edx + 112], xmm7

		add  esi, 128
		add  edx, 128

		sub  ecx, 1
		jmp  L1_cache_loop
L1_cache_loop_end:
		mov  edx, eax
		mov  ecx, bufsize
		shr  ecx, 7
stream_loop:
		test  ecx, ecx
		jz  stream_loop_end
		movaps  xmm0, [edx]
		movaps  xmm1, [edx + 16]
		movaps  xmm2, [edx + 32]
		movaps  xmm3, [edx + 48]
		movaps  xmm4, [edx + 64]
		movaps  xmm5, [edx + 80]
		movaps  xmm6, [edx + 96]
		movaps  xmm7, [edx + 112]

		movntps  [edi], xmm0
		movntps  [edi + 16], xmm1
		movntps  [edi + 32], xmm2
		movntps  [edi + 48], xmm3
		movntps  [edi + 64], xmm4
		movntps  [edi + 80], xmm5
		movntps  [edi + 96], xmm6
		movntps  [edi + 112], xmm7

		add  edx, 128
		add  edi, 128

		sub  ecx, 1
		jmp  stream_loop
stream_loop_end:
		sub  ebx, 1
		jmp  main_loop
main_loop_end:
		sfence
	}
}
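The assembly above is 32-bit MSVC inline asm, so it will not build for x64 or with other compilers. It also copies only whole multiples of bufsize (any remaining tail bytes are skipped), expects bufsize to be a power of two, and needs 16-byte-aligned pointers because of movaps. As a rough portable stand-in, here is a sketch that keeps the same two-stage structure (source to small staging buffer, staging buffer to destination) using plain memcpy. `StagedMemCopy` is a hypothetical name, and this version makes no attempt to reproduce the prefetch or the non-temporal stores, so it may well be slower than the original on the hardware discussed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Portable stand-in (an assumption, not the poster's code) for the staged copy:
// same argument order as FastMemCopy, but also copies the final partial chunk.
void StagedMemCopy(void* dst, const void* src, void* buf,
                   std::size_t bufsize, std::size_t nbytes) {
    assert(dst && src && buf && bufsize > 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (nbytes > 0) {
        const std::size_t chunk = nbytes < bufsize ? nbytes : bufsize;
        std::memcpy(buf, s, chunk);  // source -> small staging buffer
        std::memcpy(d, buf, chunk);  // staging buffer -> destination
        s += chunk;
        d += chunk;
        nbytes -= chunk;
    }
}
```

It can be dropped into ReadBack() in place of FastMemCopy with the same arguments; whether the staging step pays off at all without the streaming stores is something to measure rather than assume.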


The demo “pboPack” does not measure the performance of glReadPixels() alone. It does three things:

  1. Read pixels from framebuffer with glReadPixels().
  2. Modify the pixels in add().
  3. Draw the modified pixels with glDrawPixels().

You will get the pure throughput of glReadPixels() + PBO by disabling steps #2 and #3 in my code.

Also, I’d like to mention that the pboPack demo does not use a PBO for glDrawPixels() because of OpenGL driver bugs. When I released the demo, glDrawPixels() + PBO failed on most video cards except nVidia Quadro, so I took it out of the code.

The proper usage of glDrawPixels() with a PBO looks like this. You may get a better result by replacing glDrawPixels() in my code:


if(pboUsed) // with PBO
{
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
    glDrawPixels(SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}
else // without PBO
{
    glDrawPixels(SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, colorBuffer);
}

I tested today on an ATI Radeon X1900 with the above changes and with the pixel modification block disabled, and I got these numbers
(combination of glReadPixels() and glDrawPixels()):

256*256
with PBO: 68.6 Mpixels/s = 274.4 MB/s
without PBO: 38.6 Mpixels/s = 154.4 MB/s

512*512
with PBO: 273.7 Mpixels/s = 1094.8 MB/s
without PBO: 63.3 Mpixels/s = 253.2 MB/s

1024*1024
with PBO: 568.7 Mpixels/s = 2274.8 MB/s
without PBO: 79.7 Mpixels/s = 318.8 MB/s

You will get higher numbers if you test glReadPixels() only.
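For reference, the MB/s column above is just the Mpixels/s column times 4 bytes per pixel. The sketch below shows that conversion, plus a way to turn a per-frame time into a pixel rate for a rough comparison with the 6 ms per 1024 x 1024 frame reported earlier in the thread. Both function names are made up for illustration, and counting a Mpixel as 1024 x 1024 pixels is an assumption.

```cpp
#include <cassert>

// Mpixels/s -> MB/s for a given pixel size (4 bytes for BGRA / UNSIGNED_BYTE).
double mpixelsToMBytes(double mpixelsPerSec, int bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return mpixelsPerSec * bytesPerPixel;
}

// Pixel rate implied by the time spent per frame.
// Assumes 1 Mpixel = 1024 * 1024 pixels.
double frameTimeToMpixels(int width, int height, double secondsPerFrame) {
    assert(width > 0 && height > 0 && secondsPerFrame > 0.0);
    return static_cast<double>(width) * height / (1024.0 * 1024.0) / secondsPerFrame;
}
```

Under those assumptions, 6 ms for a 1024 x 1024 frame works out to roughly 167 Mpixels/s, or about 667 MB/s. That is only a loose comparison, since the table measures a read plus a draw while the 6 ms is the time spent blocked in the readback call.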

I’m pretty certain that there’s nothing wrong with my PBO code. So I guess my question is what could make glReadPixels stall (when using PBO that is)? So far the only thing I can think of is that I may be getting a software fallback PBO because I’ve used up all the GPU ram on other stuff.

Other theories?

/A.B.

Can you benchmark (profile) the following calls:

  • glBindBuffer
  • glReadPixels
  • glMapBuffer
  • glUnmapBuffer.

glBindBuffer should be instant, and glReadPixels too. If glReadPixels
stalls, then something is really wrong. glMapBuffer can stall if a pending glReadPixels has not finished.
If you have frequent glReadPixels calls, use several PBOs for that.
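One simple way to do that profiling is to wrap each call in a CPU-side timer. The sketch below is a minimal helper under that assumption; `blockedMs` is a hypothetical name. It measures only how long the calling thread is held up inside the call, which is exactly what matters for spotting a stall, but it says nothing about how long the GPU takes to finish the work afterwards.

```cpp
#include <cassert>
#include <chrono>

// Returns how many milliseconds the calling thread spent inside call().
// For asynchronous GL commands this is the blocking time, not the GPU time.
template <typename F>
double blockedMs(F&& call) {
    const auto t0 = std::chrono::steady_clock::now();
    call();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    assert(ms >= 0.0);  // steady_clock never runs backwards
    return ms;
}
```

Usage would look something like `double ms = blockedMs([&] { glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, 0); });` with a PBO bound, and likewise for glBindBuffer, glMapBuffer and glUnmapBuffer, logging each figure separately so the call that actually blocks stands out.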

I never managed to get glReadPixels any faster with PBO. The 6 ms spent in glReadPixels wasn’t a huge problem at the time, so I simply left the problem alone.

Now, however, the problem has become more urgent. Since we upgraded to revision 182.08 of NVIDIA’s Quadro driver, the glReadPixels operation takes over 30 ms!

I’ve tried every combination of usage flag (GL_STREAM_READ etc.) and format (GL_BGRA etc.) but with no difference in speed.

I also tried another approach: instead of using two PBOs, I used two FBOs to which I transferred the framebuffer with glCopyTexImage2D; I then used glReadPixels on the FBO that was not currently being copied to. Unfortunately this was exactly as slow.

Does anyone have any theories what could be causing this huge stall?

/A.B.