Concurrent render and readback

My problem is that readback does not seem to be executing concurrently with rendering using the GL_EXT_pixel_buffer_object extension.

I am trying to render multiple viewpoints and read them back to system memory as fast as possible.

To do this I’ve split my draw() function into 2 threads: the main thread renders to the back buffer, enters the mutex, binds a pixel buffer object, calls glReadPixels to read the back buffer into the PBO, and maps the PBO to a global pointer in system memory.

The second “readback” thread does a memcpy on the aforementioned global pointer to a buffer in system memory.

Each thread executes in an alternating fashion with the intention of allowing the graphics card to render while the system does a DMA transfer from the VRAM to the system RAM.

This actually gives worse performance than the non-threaded version, which executes the same instructions serially.

I then tried to increase performance by following the asynchronous readback example in the GL_EXT_pixel_buffer_object extension spec, which uses 2 PBOs, splits the image in half, and calls glReadPixels once for each half. This does not improve performance either.

I believe the memcpy is blocking. Am I missing something important? Or is this just not possible? Is this hardware dependent?

I’m testing it on a new dell something with PCIe x16 nVIDIA 7300GS on Windows and on another intel core 2 duo with dual 7300GS PCIe x16’s on Red Hat EL 4 and Windows.

Code on request… Thanks

Originally posted by texas_zebra:
[b]
To do this I’ve split my draw() function into 2 threads: the main thread renders to the back buffer, enters the mutex, binds a pixel buffer object, calls glReadPixels to read the back buffer into the PBO, and maps the PBO to a global pointer in system memory.

I then tried to increase performance by following the asynchronous readback example in the GL_EXT_pixel_buffer_object extension spec, which uses 2 PBOs, splits the image in half, and calls glReadPixels once for each half. This does not improve performance either.
[/b]
The asynchronous transfer must be completed by the time MapBuffer returns, so if your thread calls MapBuffer immediately after ReadPixels, you will gain nothing because MapBuffer will wait until the transfer is done. The example in the specification inserts the first call to processImage() between mapping of the first and second PBO, so the transfer of the second PBO can continue during processing of the data from the first PBO.
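For reference, a minimal sketch of that pattern, assuming the GL_EXT_pixel_buffer_object enums with GL 1.5-style buffer entry points; width, height and processImage() are placeholders here, this is not the spec’s verbatim code:

// queue both transfers first
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo1);
glReadPixels(0, 0, width, height/2, GL_BGRA, GL_UNSIGNED_BYTE, 0);        // lower half
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo2);
glReadPixels(0, height/2, width, height/2, GL_BGRA, GL_UNSIGNED_BYTE, 0); // upper half

// map the first PBO (this waits for its transfer) and process it
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo1);
void *data1 = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
processImage(data1);   // the second transfer can still be in flight here
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_EXT);

// only now map the second PBO and process it
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo2);
void *data2 = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
processImage(data2);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);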

Thank you Komat.

It would help to process the data during transfer of the buffer, but the nature of the program disallows processing until a large number of these arrays have been collected. They are assembled into a single large array and then processed.

The bottleneck which the program faces is the transfer of the pixel arrays across the system bus (in this case PCIe).

I’m looking for a way to render and transfer pixels across the system bus simultaneously.

I was under the impression that PBOs can be read back across the bus while the graphics card renders the next frame. This should bump performance up close to that of a no-readback render.

The bottleneck which the program faces is the transfer of the pixel arrays across the system bus (in this case PCIe).
On a low-ish end 7600 I did a test and got (extrapolated) ~600MB/s readback from card to system memory (using slow-ish DDR2 RAM, which may have been a limiting factor), but that was on a “silent” system with no other serious activity going on. Depending on what you do this may not be enough, even though it’s faster than what we could upload over AGP not too long ago. :)

If the bus speed is indeed the limiting factor, could image quality degradation be acceptable? I’m thinking especially of DXT compression, or even switching color space (both approaches require some “shader” programming though).

I’m looking for a way to render and transfer pixels across the system bus simultaneously.
While I don’t think that should be a problem, it may be depending on how you deliver and pull the data.

This also strikes me as a situation where you do not want to map buffers - well, at least not keep different buffers mapped for reading and writing at the same time. You may even want to treat the gfx card (and the bus!) as a half-duplex pipeline. Upload everything to the card, tell it “render this”. While it’s busy rendering into one framebuffer, you should be able to pull the contents from a previous frame (now in an unrelated buffer).

One potential problem with this is that bandwidth isn’t unlimited. While the card is drawing to the current frame, it may actually use so much memory bandwidth that there simply isn’t much (or any) to spare for reading back the previous buffer to system memory.

If that is the problem, you probably have no choice but to put in another card and render alternate frames to them (should buy you enough bandwidth, so long as there is no bus contention).

++luck

I guess your problem is that you want to read from the buffer while it is being rendered to, so there is little hope for asynchronous operation. I would suggest using two FBOs with different renderbuffers. You could ping-pong between the FBOs, reading from one while rendering to the other.
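A rough sketch of that ping-pong, assuming GL_EXT_framebuffer_object with two pre-created FBOs fbo[0] and fbo[1] (each with its own color renderbuffer) plus one PBO; the names and the omitted setup are placeholders, not a tested implementation:

// frame N: render into fbo[cur] while reading back the frame stored in fbo[prev]
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo[cur]);
RenderScene();

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo[prev]);
glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);   // into the PBO, not sysmem
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);

// swap roles for the next frame
int tmp = cur; cur = prev; prev = tmp;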

Use two PBOs. In one frame initiate a readback into PBO1, then map PBO2 (which contains the previous frame) and do the memcpy. In the next frame do the same thing, but swap PBO1 and PBO2. This introduces a 1-frame lag, but it keeps good sync between the CPU and GPU.

FYI: Up to today I don’t know of any GPU that would let you read back data while the GPU is rendering. The pixel transfers are only asynchronous with the CPU, so you can e.g. process the previous pixel data on the CPU while reading back another pixel region.

Up to today I don’t know of any GPU that would let you read back data while the GPU is rendering.
Are you thinking of driver or hardware here?

Assuming a local system, mapping RAM on a gfx card (into user process address space by extension) is only an MMU thing (right?).

Therefore, if that buffer is still mapped (from the OS’ POV), and the server (the GPU) is modifying data, whether within that region or not, you could read that memory even while the GPU was updating it (a sure recipe for disaster).

Therefore, there must be a locking mechanism, and that must be implemented by the “OpenGL driver” (in quotes, as “driver” in my world means kernel-mode, while here we have megabyte after megabyte of code in DLL’s in user-mode only communicating with the kernel-mode part).

Meaning OpenGL function calls can stop execution of user-mode code until processing is completed (for a particular memory area), but if I have an OS-mapped buffer I don’t see how an implementation could prevent that memory from being accessed.

So how would the GPU be able to stop data readback while it’s running? It sure isn’t allowed to hang the hardware.

Reading the contents of a PBO asynchronously is not a problem, but getting the framebuffer content into a PBO is.

You have to call ReadPixels for that, and this is put in the normal command stream and will not be executed in parallel with other rendering commands.

The synchronisation happens at MapBuffer. You can’t do a ReadPixels into a mapped buffer, and when you try to map a buffer, it waits until all pending read commands are complete. So it is never possible that the GPU writes to a mapped buffer.

Originally posted by tamlin:
[b] Assuming a local system, mapping RAM on a gfx card (into user process address space by extension) is only an MMU thing (right?).

Therefore, if that buffer is still mapped (from the OS’ POV), and the server (the GPU) is modifying data, whether within that region or not, you could read that memory even while the GPU was updating it (a sure recipe for disaster).[/b]
Good hypotheses, I must say. :) I also used to assume things were like that, but at least from NVIDIA I get the information that their current graphics hardware is not able to both render and push data back to the CPU at the same time. This is emphasized by Mark Harris, e.g. in his post in this thread on the GPGPU forums.

There would indeed be advantage in being able to make all the operations asynchronous on the CPU, on the GPU and in the bus between them. That, however, seems not to be the current case.

In the following scenario w/o a PBO:

...
Render();
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, sys_mem); // destination is a sysmem pointer

glReadPixels will force the CPU to wait until all pending OpenGL commands are finished and glReadPixels is executed.

But with PBOs the situation is different:

OnInit()
{
 // create two PBOs, pbo1 and pbo2, each sized for one frame (glGenBuffers + glBufferData)
}

...
Render();

// copy previous frame into sysmem
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo2);
void *ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
memcpy(sys_mem, ptr, size);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_EXT);

// initiate readback of current frame
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo1);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);

// swap pbo handles
GLuint tmp; tmp = pbo1; pbo1 = pbo2; pbo2 = tmp;

In this case, the app maps the PBO that contains the previous frame and copies its content into sysmem.
glReadPixels in this case will return immediately. The glReadPixels command is inserted into the internal command stream, and when the GPU finishes all pending tasks it can start copying the framebuffer into the PBO memory. Meanwhile, the CPU can do other tasks.

Just as we have double-buffered rendering, using double-buffered readback is a normal and logical approach.

Even more… the app can prepare 3 or 4 PBOs and add another readback thread. The render thread just maps a buffer and passes the pointer to the other thread, which copies the content to sysmem or does whatever else. When that thread finishes the job it can notify the render thread to unmap the buffer. The render thread must take care to NOT use mapped PBOs for any other operation. This is useful on dual-core (or multi-core) systems.
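As a very rough sketch of that ring-of-PBOs idea (NUM_PBO, IsStillMapped(), WaitForCopyDone() and HandToCopyThread() are made-up placeholders for whatever threading primitives the app already uses; this is a sketch, not a tested implementation):

#define NUM_PBO 4
GLuint pbo[NUM_PBO];
int frame = 0;

// render thread, once per frame (first-frame special cases skipped)
int cur  = frame % NUM_PBO;                  // ReadPixels target this frame
int prev = (frame + NUM_PBO - 1) % NUM_PBO;  // filled by ReadPixels last frame

// before reusing 'cur', make sure the copy thread is done with it and unmap it
if (IsStillMapped(cur)) {
    WaitForCopyDone(cur);
    glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo[cur]);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER_EXT);
}

Render();

// queue asynchronous readback of this frame
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo[cur]);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);

// map last frame's PBO (its transfer has had a whole frame to finish) and hand
// the pointer to the copy thread; it does memcpy(sys_mem, ptr, size) and
// signals when done, so the buffer gets unmapped when it cycles back around
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo[prev]);
void *ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
HandToCopyThread(prev, ptr);
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);

frame++;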

Thank y’all very much!

That answers my question.

I would love for this feature to appear in the future, as it would increase performance for GPGPU applications and such.