I recently implemented (or tried to implement) async buffer copies in my renderer. The use case is copying a render target (FBO color texture) to a PBO, and subsequently copying the PBO data to main memory.
I have currently disabled the copy to main memory to focus on the FBO->PBO copy. It does actually work, and I get the results I want. It is just very slow.
I am using two ‘rendertargets’, two PBOs and GLsync objects to go with them (though I can use more). The problem is that it does not look like an async copy is happening, since performance is not increasing: if I use just one FBO and one PBO, performance is the same, and using more than two FBO/PBO buffers doesn’t affect performance either. Omitting the copy increases performance vastly. Here is a short description of the render/copy loop. I use a writeindex and a readindex, which get incremented/swapped after each frame.
I use a small structure to keep track of fbo/pbo/syncobjects (maybe there is a problem in using those).
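A minimal sketch of the kind of bookkeeping described above, for illustration only. The names `ReadbackSlot` and `ReadbackRing` are hypothetical, and the handle fields are plain placeholders for `GLuint`/`GLsync` so that the snippet compiles without GL headers. It only shows how a write index and a read index can rotate over N slots; it makes no GL calls and is not the structure from the renderer in question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-slot bookkeeping. In a real renderer the handles would be
// GLuint (FBO, PBO) and GLsync (fence) rather than these placeholders.
struct ReadbackSlot {
    unsigned int fbo = 0;        // render target the copy reads from
    unsigned int pbo = 0;        // pixel buffer the copy writes to
    void*        fence = nullptr; // sync object signalled once the copy is done
};

class ReadbackRing {
public:
    explicit ReadbackRing(std::size_t count) : slots_(count) {
        assert(count >= 2 && "need at least two slots to overlap render and copy");
    }

    // Slot that is rendered and copied in the current frame.
    std::size_t writeIndex() const { return write_; }

    // Oldest slot still in the ring: its copy was issued size()-1 frames ago,
    // so it has had the most time to complete before the CPU touches it.
    std::size_t readIndex() const { return (write_ + 1) % slots_.size(); }

    // Call once per frame, after issuing the copy and consuming the read slot.
    void advance() { write_ = (write_ + 1) % slots_.size(); }

    ReadbackSlot& slot(std::size_t i) { return slots_[i]; }
    std::size_t size() const { return slots_.size(); }

private:
    std::vector<ReadbackSlot> slots_;
    std::size_t write_ = 0;
};
```

With two slots, `advance()` simply swaps the two indices each frame; with three, the slot being read is the one written two frames earlier.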
I am using an NVidia GTX970, and I have read about NVidia’s dual copy engines, available on Quadro GPUs. Is this the problem? Does my GPU just serialize the copy?
Or maybe I just got it all wrong, and it is only possible to do async copy from a CPU view and not on the GPU itself?
The FBO I am copying is quite large (16k x 2k). The PBOs are persistently mapped, so the actual copy to main memory runs from a separate thread. Using NSight with copying enabled, I can see that the first draw call in the render loop takes up most of the frame time, making it slow (maybe that is of importance).
So, it does work, I can get the data to main memory (and it looks ok), my question is regarding performance.
Thanks for any hints or clarifications,
For starters, I’d recommend putting aside the async part of this (separate threads, persistent mapping, the sync objects, and especially binding and unbinding your framebuffer, which you don’t need to do) and focusing on timing your readback method and your synchronous readback performance. How long does the readback take on your CPU thread, and what effective GB/sec readback bandwidth does that imply? Optimize that first. Then throw in “other things” with a careful eye on making sure that your CPU thread time goes down.
For timing purposes only, be sure to put a glFinish() right before you do the readback call to ensure that all previously queued pipeline work is complete and you’re not timing anything but the readback. Then start a timer, do the readback to the CPU, and then stop the timer. How many msec? Now compute the effective bandwidth in GB/sec. What do you get?
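The measurement steps above can be sketched roughly as follows. This is a separate illustration, not the GLUT/GLEW test program from this post. The function names are hypothetical, and the two callables are stand-ins: `finish` would wrap `glFinish()` and `readback` would wrap whatever call actually brings the pixels to the CPU. The bandwidth helper assumes 1 GB = 1024^3 bytes, which is what the "GBytes/sec" figures quoted in this thread work out to if the image is 16384 x 2048 at 4 bytes per pixel (also an assumption).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Times only the readback. `finish` is a stand-in for glFinish(); `readback`
// is a stand-in for the synchronous readback call being measured.
template <typename Finish, typename Readback>
double timeReadbackMs(Finish finish, Readback readback) {
    finish();  // drain previously queued GPU work so it is not included
    const auto t0 = std::chrono::steady_clock::now();
    readback();  // must not return before the data is in CPU memory
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Effective readback bandwidth in GiB/sec (1 GiB = 1024^3 bytes).
double effectiveGiBPerSec(std::size_t width, std::size_t height,
                          std::size_t bytesPerPixel, double ms) {
    assert(ms > 0.0);
    const double bytes = static_cast<double>(width) *
                         static_cast<double>(height) *
                         static_cast<double>(bytesPerPixel);
    const double gib = bytes / (1024.0 * 1024.0 * 1024.0);
    return gib / (ms / 1000.0);
}
```

For example, `effectiveGiBPerSec(16384, 2048, 4, 32.092)` gives about 3.895, and the same call with 42.969 msec gives about 2.909.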
As a starter, here is a short little GLUT/GLEW test program (which compiles on Windows and Linux) which does just that:
You’ll notice I’ve plugged in the image resolution you mentioned (16k x 2k).
Here on a GTX1080, here’s one of the typical readback timings I get:
--- FRAME ---
Readback FAST time = 32.092 msec ( 3.895 GBytes/sec)
Readback SLOW time = 42.969 msec ( 2.909 GBytes/sec)
Now even that is a small fraction of the ~15.75 GB/sec theoretical peak of the PCI Express v3 x16 bus my GPU is plugged into. But perhaps that’s good enough for you.
If you want to read more on this and why it might not be faster, see this thread.
Once you get your synchronous readback performance up as high as possible, then try mixing in other things like async, multiple buffers, etc. to try to hide some of that readback latency.
“Using NSight, when copying is enabled, I can see the first drawcall in the renderloop takes up most of the frametime, making it slow (maybe that is of importance)”
Don’t know for sure, but I do have some theories. One is that this may be instigated by you binding and unbinding your FBO needlessly. You should avoid changing render targets more than absolutely necessary as changing render targets is very expensive. It could be that the overhead of that is deferred until the first draw call that actually renders on a render target. Try removing the bind/unbind of your FBO.
Past that, make sure (for timing purposes) that you are doing a glFinish() before you do the readback. That isolates your timings from the other “stuff” (which you and/or the driver may be doing) which may make the rest of your frame slow.
I ran the timing tests on my project, with the following results:
min: 39.0039 ms (3.2 GB/s)
max: 49.0049 ms (2.5 GB/s)
I removed all unnecessary unbind calls on shaders and FBOs, which increased performance by only a really small margin. Now the question remains how to speed up the readback copy. I read about the dual copy engines on NVidia Quadro cards, but I only have limited access to one of those (M6000) and “only” a GTX970 in my development machine. Is it still the case that only the Quadro cards have this dual copy engine? I might try using CUDA to copy the data to the CPU to see if that is faster. With a minimum duration of 39 ms for the copy and the requirement to copy every frame, I won’t hit my target of at least 30 frames per second even if the copy is completely async.
Any more ideas on how to speed up the copy? Decreasing resolution is out of the question for now.
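The frame budget arithmetic behind that concern can be written out as a small sketch. The function names are hypothetical, and two assumptions are baked in: one copy is needed per frame and successive copies cannot overlap each other, and the frame is 16384 x 2048 at 4 bytes per pixel (consistent with the timings quoted in this thread, but not confirmed).

```cpp
#include <cassert>

// Time available per frame at a given frame rate, in milliseconds.
double frameBudgetMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Upper bound on the sustained frame rate when each frame needs one copy of
// `copyMs` milliseconds and copies run strictly one after another (assumed).
double maxFpsForCopy(double copyMs) {
    assert(copyMs > 0.0);
    return 1000.0 / copyMs;
}

// Readback bandwidth needed to move `bytesPerFrame` every frame at `fps`,
// in GiB/sec (1 GiB = 1024^3 bytes).
double requiredGiBPerSec(double bytesPerFrame, double fps) {
    assert(fps > 0.0);
    return bytesPerFrame * fps / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions, `frameBudgetMs(30.0)` is about 33.3 ms, `maxFpsForCopy(39.0039)` is about 25.6 frames per second, and `requiredGiBPerSec(16384.0 * 2048.0 * 4.0, 30.0)` is 3.75, so the copy alone would have to sustain roughly 3.75 GB/s to keep up with a 30 fps source.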
Are you saying you’ve already put full async readback in and it’s no faster than 39ms? On frame N, are you processing the frame data from frame N-1 or N-2 to give the readback time to migrate in the background?
No, the 39 ms is from the timing measurements I did on the readback alone. I still have to try to make that faster; currently I do a simple readback, not the NVidia workaround you used in your sample. My main concern is that we have a 30 fps source and readback needs to be done for every frame, with no frames skipped.