Low readback performance with PBO , help !!!!!

PixIn · March 16, 2008, 11:51am

Hi

i use a PBO approach to grab a bitmap from my 3d scene . steps are the following :

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo1);

glReadPixels(0, 0,bWIDTH,bHEIGHT,GL_BGRA, GL_UNSIGNED_BYTE, 0);

copymem(glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB,GL_READ_ONLY_ARB)^, buffer,BWidth * BHeigh * 4);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo2);

glReadPixels(0, 0,pWIDTH,pHEIGHT ,GL_BGRA, GL_UNSIGNED_BYTE, 0);

copymem(glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB)^,bitmapbuf, bWidth * bHeigh * 4);
glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);

swap(pbo1,pbo2)

with methode i get 15~25 fps low than a direct :

glReadPixels(0, 0,pWIDTH,pHEIGHT ,GL_BGRA, GL_UNSIGNED_BYTE, bitmapbuf);

as i see in many forum the PBO should be more Faster than the glReadPixels one :eek:

my card is : nVidia Geforce 7600 GS
forceware version : 169.21
Bus PCI Express x16
CPU : P4 3.0Ghz

are there any bad implementation in my PBO code, how can i boost it ?

help please

Thanks in Advance

Brolingstanz · March 16, 2008, 11:59am

Is this for a screenshot?

You want to use PBOs when you have something else to do while the transfer is taking place behind the scenes, not when you need the results straight away.

Check out the PBO spec for some common usage scenarios and example code.

mfort · March 16, 2008, 12:10pm

Why are you reading the data twice?
Why are you coping data out of PBO? the PBO memory is just fine.
Do not call MapBuffer just after read pixels. There is no benefit of PBO then.

PixIn · March 16, 2008, 12:14pm

yes it is for a screenshot.

just now i change it to this to acheive a Asynchron readback

[b]glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
glReadPixels(0, 0, imagewidth, imageheight/2, GL_BGRA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
glReadPixels(0,imageheight/2,imagewidth,imageheight/2,GL_BGRA,GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

// Process partial images. Mapping the buffer waits for
// outstanding DMA transfers into the buffer to finish.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
pboMemory1 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory1);

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
pboMemory2 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory2);
[/b]
but still get low performance instead the glreadpixels

can someone please guide me to the correct steps

Thanks

yooyo · March 16, 2008, 2:28pm

In current frame bind pbo and do readback.
In next frame map pbo and copy it’s content to sysmem.

Why this? When app call glReadPixels while pbo is bind, then glReadPixels is nonblocking call. But if you try to map pbo buffer soon after glReadPixels then this glMapBuffers will be blocked until glReadPixels is finished.
When to call map buffer is hard to tell, because it depends on underlaying hardware, driver, screen size, chipset, … So the best will be to do that operation (glMapBuffer and memcpy) in next frame.

Also… this pbo memory is not cacheable so do not try to do some weird access pattern. Plain memcpy in sysmem buffer is best approach.

PixIn · March 16, 2008, 3:54pm

Thanks for reply

ok, i am not sure that i get it well but i change my code a bit to :

[b]index = (index + 1) % 2;
nextIndex = (index + 1) % 2;

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[index]);
glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);

GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[nextIndex]);
GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);

glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);[/b]

but still same problem … low performance

any idea, perhaps some code will help me better

Thanks

yooyo · March 16, 2008, 4:23pm

No… thats wrong… see this


// At end of frame before SwapBuffers call
// to use this... just set bDoScreenShot to true.
if (bCopyToSysMem)
{
 bCopyToSysMem = false;
 glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo); 
 GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
 memcpy(sysme, src, imgsize);
 glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0); 
}

if (bDoScreenShot)
{
 glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo); 
 glReadPixels(0, 0, SCREEN_WIDTH, SCREEN_HEIGHT, PIXEL_FORMAT, GL_UNSIGNED_BYTE, 0);
 glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0); 
 bCopyToSysMem = true;
 bDoScreenShot = false;
}

SwapBuffers(...);

Above code snippet is just for single screenshot!

PixIn · March 16, 2008, 7:47pm

Thanks yooyo for the hints

your code give a faster result (10~15 fps faster).
but :s i get a black bitmap it seems that the src is empy :S

can u tell me what the sysme is ?

Thanks

Jackis · March 17, 2008, 1:58am

yooyo wants to say, that in order to use befits from PBO, you should make asynchronous readbacks. In the code above yooyo advice you to make ReadPixels with PBO on the first frame, but you can use this memory only next frame (or some frames later, maybe 2-3), and only way like that may get you PBO benefit.

yooyo · March 17, 2008, 7:06am

Just insert my code before you call SwapBuffers, at end of render frame. Something like…


// this is a very basic render loop
while (bQuit == false)
{
 UpdateGame();
 RenderGame();
 // insert my code here
 SwapBuffers(); // present frame
}

sysme is a typo… it should be sysmem
sysmem is pointer to system memory buffer. Applicatio should allocate this buffer. Size should be SCR_WIDTH * SCR_HEIGHT * BYTES_PER_PIXEL.

PixIn · March 17, 2008, 9:58am

Great.
so, if i understood well, this use only one pbo.
and for iterations (0…2…4…6…), it copy data from the pbo to SysMem, then in iteration (1…3…5…7…), it copy data from SysMem to my bitmap buffer…
that’s fine.but in my app i get a black result it seems like the bitmapbuffer is filled with 0 value (i already allocate my SysMem). :(.

also, if the code use only one pbo why should Swapbuffers ?.

Thanks

yooyo · March 17, 2008, 10:27am

Now, you are confusing me
You stated before that you need to readback backbuffer just for screenshot… not for streaming readback. Then I write code for that usage pattern. Now… If you want to do streaming readback then above code hav to be modifyed… something like:



#define NUMR_PBO 4
GLuint m_pbos[NUMR_PBO];
int vram2sys;
int gpu2vram;
unsigned char* membuffer;
unsigned int memsize;

// call this once
void Init()
{	
	memset(m_pbos, 0, sizeof(m_pbos));
	membuffer = NULL;
}

// call this at least once... and whenever screen size has been changed
void OnScreenSize(vec2i newsize)
{
	memsize = newsize.x * newsize.y * 4;  // BGRA

	// gen PBO names
	if (m_pbos[0] == 0)
		glGenBuffers(NUMR_PBO, m_pbos);

	// create empty PBO buffers
	for (int i=0; i<NUMR_PBO; i++)
	{
		glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[i]);
		glBufferData(GL_PIXEL_PACK_BUFFER_ARB, memsize, NULL, GL_STATIC_READ);
	}

	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
	
	// vram to sysmem pbo index
	vram2sys = 0;
	// backbuffer to vram pbo index
	gpu2vram = NUMR_PBO-1;

 	if (membuffer != NULL)
 	{
 		delete [] membuffer;
 		membuffer = NULL;
 	}
 
 	membuffer = new unsigned char[memsize];
}

// call this at end of frame render
void ReadBack()
{
	// readback current frame into PBO
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[gpu2vram]);
	glReadPixels(0,0,m_ViewPortSize.x, m_ViewPortSize.y, GL_BGRA, GL_UNSIGNED_BYTE, 0);

	// copy previous frame from PBO to sysmem
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, m_pbos[vram2sys]);
	void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
	if (data != NULL)
	{
		memcpy(membuffer, data, memsize);
		// Do something with image in membuffer.. 
	}
	glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

	// shift names
	GLuint temp = m_pbos[0];
	for (int i=1; i<NUMR_PBO; i++)
		m_pbos[i-1] = m_pbos[i];
	m_pbos[NUMR_PBO - 1] = temp;
}

Regarding black image… can you post your code… or atlease pseudo code of your rendering loop. It is very strange why do you getting black image.

PixIn · March 17, 2008, 10:42am

Sorry for that.

i will try this piece of code

PixIn · March 17, 2008, 12:07pm

Hi

finaly it works , thanks yooyo for the Streaming code you are my hero ,i get my image perfectly.

now back to the performance issus, normaly with the pbo implementation what is the factor of performance VS the “no pbo” one ? because with my machine i get more (2~5 fps) with the pbo approach.

my config is :
Nvidia 7600 GS
PCI Express 16x
CPU : P4 3.0Ghz

yooyo · March 17, 2008, 12:25pm

Memory bandwidth is same in both case, but with PBO you can avoid CPU/GPU stall. This mean, you can do something else on CPU side (like decoding or encoding).

Using some advanced mem copy functions (using MMX registers, prefetch, cache align) and using memory buffers allocated with VirtualAlloc and locked with VirualLock you can increase transfer speed 25-30%. Also, underlaying hardware (mem controller, mem speed, mem latency) affect transfer speed. My laptop HP 8710w with Quadro 1600M with optimizes memcopy function can readback 1250MB/sec. Friends machine (Penryn E8200 + 8800GT + DDR2-800+) have ~2GB/sec.

NiCo1 · March 17, 2008, 12:50pm

Is this the same page-locking mechanism that CUDA uses? I just ran the CUDA bandwidthTest app. Without page-locked memory I get


Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               834.4

Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               789.0

and with page-locked memory I get


Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2124.4

Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1629.9

PS. I also ran this on a HP 8710w laptop with Quadro FX 1600M

yooyo · March 17, 2008, 2:35pm

CUDA is still mystery for me.
I don’t have time to play with CUDA, but I’ll do that ASAP.

Hampel · March 18, 2008, 4:16am

@Nico: can you post the code snippet used in CUDA for this page-locking thing?

NiCo1 · March 18, 2008, 5:36am

The memory is being allocated with cudaMallocHost(void **ptr, size_t size). It’s implemented in their cudart dynamic library so I don’t know exactly how they do it.

Here’s a related thread on the gpgpu forum.

yooyo · March 19, 2008, 4:35am

I was able to raach CUDA speed (at least on my machine ~1650M/sec)… Readback from FBO buffer 512x512xRGBA (1MB size) with optimized memcpy and with mem buffers created using VirtualAlloc + VirtualLock.

Anyway, the problem with readback stall every 47.5MB is still there.