I’m confused about the function of the parameters of glTexImage2D when used with a PBO. I assume that the format and type parameters are ignored? (As I understand it, in traditional non-PBO usage that conversion is done on the CPU.)
The format and type parameters are not ignored when using PBOs; they mean exactly the same thing as in the non-PBO case. The only difference is that the last argument of glTexImage2D is an offset into the pixel unpack buffer instead of a pointer to client-side data.
Thus format conversion does in fact happen (if needed) with PBOs too.
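A minimal sketch of what that looks like in code (the function name and the BGRA/unsigned-byte choice are just illustrative):

```c
#include <GL/gl.h>

/* Upload a texture image from an already-filled PBO. The format/type
 * parameters (GL_BGRA / GL_UNSIGNED_BYTE here) are interpreted exactly
 * as in the non-PBO case; only the last argument changes meaning. */
void upload_from_pbo(GLuint pbo, GLuint tex, int w, int h)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* With an unpack buffer bound, the last argument is a byte offset
     * into that buffer, not a client-memory pointer. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, (const void*)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```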
Further, non-PBO usage does not require the conversion to be done on the CPU either. The driver can still choose to do it on the GPU; however, non-PBO usage might require an additional copy from application memory to driver memory, and it also blocks the application to some extent because it is carried out synchronously, while PBO transfers work asynchronously.
[QUOTE]Translation can still happen when using PBOs. And yes, it will be done on the CPU. It’s up to you to provide image data in a format that your implementation won’t have to convert.[/QUOTE]
Wow, so that implies that the data efficiently blasted to the GPU in the PBO will be copied back to RAM, converted, and sent to the texture the old way. Would it still be asynchronous? OK, I’ll keep an eye on it.
[QUOTE]Further, non-PBO usage does not require the conversion to be done on the CPU either. The driver can choose to still do it on the GPU.[/QUOTE]
Of course, and I’m dying for info on how to trigger those new hardware-based ‘Copy Engines’ in the NVIDIA Fermi+ architecture. But in practice (except for the most common formats) the data transfer/conversion is astonishingly slow. It’s not just done on the CPU, it’s done in a sub-optimal way. So I definitely don’t want to trigger the PBO data being sent back to the CPU, processed the old way, and then sent back to the texture. Anyway, thanks for the heads up.
Wow so that implies that the data efficiently blasted to the GPU in the PBO will be copied back to RAM converted and sent to the texture the old way.
You assume that a buffer object will always be allocated using GPU memory.
Would it still be asynchronous?
All uploads are asynchronous. It’s simply a question of how much.
Using client memory, the driver will generally copy your data into an internal buffer, then upload that asynchronously. PBOs simply cut out the middle-man; you get to specify the “internal buffer” yourself.
The main use for PBOs is downloading. That’s not to say that they aren’t useful for uploading data. But doing downloads is where you really gain something. Downloads without PBOs are never asynchronous.
Really, the biggest bang for your buck will be figuring out what the optimal pixel transfer format and data type will be for your internal format of choice. If you pick the right format and data type, then there will be no need for any modification of the data outside of what the DMA engine can do (ie: swizzling).
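As a concrete example of a "matching" transfer format (whether this actually avoids a conversion pass is driver-dependent, but it commonly holds on desktop hardware for an RGBA8 internal format):

```c
/* For a GL_RGBA8 internal format, BGRA component order with the
 * packed 8_8_8_8_REV type typically matches the texture's native
 * memory layout, so the driver can DMA the data straight through
 * instead of running a CPU conversion pass. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
```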
Why do you think that drivers upload and convert textures using the CPU? While there may in fact be certain scenarios where a CPU path is necessary, most uploads and conversions can easily be performed on the GPU. Not to mention that a PBO does not necessarily reside in video RAM, but in memory that is accessible by the GPU (which can be either video RAM or some special type of system memory).
Actually, a GPU path could be used by the driver even without PBOs; however, without PBOs glTexImage still works synchronously. That’s why PBOs perform better: they perform the whole thing asynchronously.
And you don’t need those “copy engines” to do so. Any GPU not older than 5-6 years definitely has some sort of support for GPU uploads/conversions.
Well, forgive me for being sceptical, but most of the time when people complain about very basic functionality being slow, like VBOs, PBOs, etc., it is because they don’t use it properly.
Yes, generally all forms of uploads, downloads and conversions are time consuming, because the CPU or GPU has to crunch through that data.
But, if used correctly, these operations can be made very fast.
[QUOTE]Really, the biggest bang for your buck will be figuring out what the optimal pixel transfer format and data type will be for your internal format of choice. If you pick the right format and data type, then there will be no need for any modification of the data outside of what the DMA engine can do (ie: swizzling).[/QUOTE]
Yes that’s exactly what I’m trying to do. There is also the promise in the link in my first post to this thread that the transfer can be made completely asynchronous using threading. With my initial tests I’m seeing none of this though.
First of all, why do you do the additional memcpy from your own buffer to the PBO instead of writing the data directly to the PBO and skipping the intermediate buffer? That more or less defeats the purpose.
If you really have to use your own data buffer (which I don’t really see why), you should consider using AMD_pinned_memory to avoid the additional copy.
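A sketch of what using AMD_pinned_memory looks like, assuming the GL_AMD_pinned_memory extension is exposed by your driver (the function name is illustrative; the enum value is taken from the extension spec):

```c
#include <stdlib.h>
#include <GL/gl.h>

#define GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD 0x9160

/* Hand a page-aligned application buffer to the driver as the PBO's
 * actual storage, so pixels written into it need no extra memcpy.
 * The memory must stay valid for the lifetime of the buffer object. */
GLuint make_pinned_pbo(size_t size, void **out_mem)
{
    void *mem = NULL;
    posix_memalign(&mem, 4096, size);   /* storage must be page-aligned */

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, buf);
    /* With this target, glBufferData adopts the client pointer as the
     * buffer's backing store instead of allocating driver memory. */
    glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD,
                 (GLsizeiptr)size, mem, GL_STREAM_DRAW);
    glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, 0);

    *out_mem = mem;   /* write pixel data here directly */
    return buf;
}
```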
ReadPixels is presently accelerated only for color components. To get the best performance an application should be programmed to read colors back in BGRA format as GLubyte’s, GLushort’s, or GLfloat’s with four components on a 32 bit desktop. To prepare for future acceleration an app should read back depth values as 32-bit floats.
Yes, you’re right, it’s actually the memcpy in my code that is taking an inordinate amount of time (first of all I’m on x64, so I figured that might be a factor; secondly, I thought of using pinned memory). But as it turns out, it’s the first access to the PBO (via any memory access) that takes the most time… below is the output of doing the memory copy repeated times… consistently the first access is ~4 times as long, regardless of the type of memory (pinned or unpinned) or the API used… so is there some overhead to the first access of mapped PBO data?
Yes, as long as I’m not simply moving the overhead to a different piece of code, removing the extra copy would be imperative. This article implies that even with the copy the transfer via PBO can be faster, so that’s what I was trying first.
Well, that’s not surprising. Drivers tend to delay the allocation of memory resources until first use, so your first access to the PBO will almost always be slow, but any subsequent access will be fast, as the memory resource has already been allocated.
[QUOTE=Dan Bartlett;1238782]Try GL_BGRA instead of GL_RGBA.
From http://developer.amd.com/media/gpu_assets/ATI_OpenGL_Programming_and_Optimization_Guide.pdf (bit old, but probably still relevant on target hardware)[/QUOTE]
Yes, this sort of thing is still relevant on the two NVIDIA boards I’ve been testing. The 8-bit code path seems relatively consistent, with transfer to the card (4k x 4k, 4 channels) being ~50ms and read time ~30ms (though with reading I often see stalls where it sometimes takes 200ms). The 16-bit code path takes 6x longer to write, which makes no sense; reading is 2x longer (as expected), but only to shorts: reading to unsigned shorts is 6x longer, which again makes no sense. It’s that sort of thing that makes me want to bypass the driver conversion.
But the problem is that it obliterates any advantage of using the PBO: that first hit takes much longer than sending all the data through glTexImage2D.
I’ll look at some different PBO flags…
Also interesting to see if this can be handled in a non-blocking thread, as it seems it should be…
No, that’s not true. You create your PBO (or PBOs if you have to) only once and you can perform as many texture uploads with the same PBO as you wish, thus you pay the cost only once, when performing the first upload.
Of course, you see stalls often when reading back stuff. glReadPixels needs to flush the complete pipeline and wait until all rendering commands are done, then it does the download and returns. It’s like executing a glFinish, i.e. very, very expensive.
The GPU and the CPU work asynchronously. When you execute GL commands, they are queued up and executed by the GPU itself at some later point in time; however, commands that require synchronization (e.g. glReadPixels or glMapBuffer, unless used properly) will wait on the CPU until the GPU finishes its job, both stalling the CPU until then and afterwards starving the GPU, as there are no new commands ready.
Again, with PBOs, using a pixel pack buffer, glReadPixels won’t stall your application.
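A sketch of that non-stalling readback path (the deferred-map placement is the point; the BGRA/unsigned-byte choice follows the advice above):

```c
/* Asynchronous readback: with a pack buffer bound, glReadPixels
 * returns immediately and the transfer happens in the background.
 * Any waiting is deferred to the map, ideally issued much later
 * (or a frame later) so the transfer has finished by then. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height,
             GL_BGRA, GL_UNSIGNED_BYTE, (void*)0);   /* offset, not pointer */

/* ... do other CPU work here while the DMA proceeds ... */

void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {
    /* process pixel data */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```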
Yes, I understood that much, but there is still the potential of a hit after a PBO has been unmapped and remapped. I’m testing with 3 fixed PBOs, mapping, unmapping, and remapping them, and there are cases when the first access after the remap is slow again (not always though).
[QUOTE=aqnuep;1238795]Of course, you see stalls often when reading back stuff. glReadPixels needs to flush the complete pipeline and wait until all rendering commands are done, then it does the download and returns. It’s like executing a glFinish, i.e. very, very expensive.
The GPU and the CPU work asynchronously. When you execute GL commands, they are queued up and executed by the GPU itself at some later point in time; however, commands that require synchronization (e.g. glReadPixels or glMapBuffer, unless used properly) will wait on the CPU until the GPU finishes its job, both stalling the CPU until then and afterwards starving the GPU, as there are no new commands ready.[/QUOTE]
My test is really simple and I’m not doing any rendering, just reading and writing to the card. I also put 200ms sleeps before timing the glReadPixels calls to allow for any asynchronicity to run its course. Regardless, I’m seeing about a third of the glReadPixels calls taking several hundred ms longer…
Unfortunately that doesn’t help me much, as in my app I can’t start rendering my next frame until the data of the current one is done. So I have to wait for glReadPixels to complete. I need it to go faster, not asynchronous; writing stuff to the card can be asynchronous though, as there is processing I can do while waiting for that to complete.
But that’s because of the same reason why glReadPixels is slow. If your PBO is still in use by the GPU (as I said, the GPU works asynchronously), then glMapBuffer will “do” a glFinish, i.e. it will wait until the GPU finishes everything and goes idle.
You can avoid this issue by e.g. having a large enough pool of PBOs that you use in a round-robin fashion to avoid the chance of accessing a buffer that is currently still in use by the GPU.
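A sketch of that round-robin scheme (pool size and names are illustrative; three buffers is a common choice since the GPU is typically a couple of frames behind):

```c
#include <GL/gl.h>

#define N_PBOS 3
static GLuint pbos[N_PBOS];   /* created once, sized for one frame each */
static int frame = 0;

/* Each frame: kick off a readback into the newest buffer, and map the
 * oldest one, which the GPU filled N_PBOS-1 frames ago and has almost
 * certainly finished with, so the map should not stall. */
void readback_round_robin(int w, int h, void (*consume)(const void *pixels))
{
    int write_idx = frame % N_PBOS;
    int read_idx  = (frame + 1) % N_PBOS;   /* oldest buffer in the pool */

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[write_idx]);
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, (void*)0);

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[read_idx]);
    void *data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (data) {
        consume(data);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}
```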
The fact that your test is simple doesn’t change anything. Actually, in this case it makes it even worse, as it is very likely that when you call glReadPixels the GPU hasn’t even started to process the commands that output to that renderbuffer/texture.
One has to hide this latency by performing other work in the meantime, thus increasing the chance that by the time you call glReadPixels the GPU has had enough time to pick up and execute the commands.
You see? This is your problem: you use glReadPixels and glMapBuffer on the same resource in the same frame. At the time you issue your rendering commands that you want to read back the results of using glReadPixels, it is very likely that your GPU is still processing a previous frame.
In many cases the GPU is 2-3 frames behind the application’s latest GL commands. For ultimate performance, one should never use glReadPixels or glMapBuffer (without the UNSYNCHRONIZED or INVALIDATE bits) for a framebuffer/buffer object that was used within the last one or two frames, otherwise a CPU stall and then a GPU pipeline idle will happen and both CPU and GPU performance suffer.
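For the upload direction, those map flags look like this (glMapBufferRange requires GL 3.0 or ARB_map_buffer_range; this is a sketch, and correctness with the unsynchronized bit is entirely the application’s responsibility):

```c
/* Map an upload PBO without forcing synchronization.
 * INVALIDATE_BUFFER tells the driver the old contents can be thrown
 * away, so it may hand back fresh memory instead of waiting for the
 * GPU; UNSYNCHRONIZED skips the wait entirely, so the application
 * must guarantee the GPU is no longer reading this buffer. */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_BUFFER_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
if (dst) {
    /* fill dst with pixel data, then: */
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```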
In most cases, with a little bit of thinking and designing, you can easily reorder your renderer’s tasks this way and thus you can really get ultimate performance.
In those few cases when such reordering is not possible, one simply has to live with decreased performance. Neither “copy engines” nor PBOs provide any “magic bullets” for applications that want to use the asynchronous CPU-GPU processor pair in a way that it was not designed for.