I’ve been working with (sparse) bindless buffers and textures for quite a while, and everything works fine. Currently, however, I set the residency of every bindless object only once, upon creation, i.e. with glMakeNamedBufferResidentNV or glMakeTextureHandleResidentARB. I have reached the point where the total amount of data sometimes exceeds the physical memory of the video card, so I would like to make currently unused resource objects non-resident and allow the driver to swap them out. The problem is that I don’t know when to call glMakeNamedBufferNonResidentNV or glMakeTextureHandleNonResidentARB.
If I use a bindless buffer in a compute shader and call glMakeNamedBufferNonResidentNV immediately after glDispatchCompute, the driver will crash occasionally. I assume this is because the buffer is made non-resident immediately although the compute shader is still running. I obviously don’t want to use a fence, as I don’t want to block anything and wait 10 ms on a dispatch without doing anything else. What I would like to have is a callback once the dispatch is done. Does OpenGL provide such a mechanism? I could not find any extension for that.
// If the application doesn’t need to use texHandles for a while, it
// can make it non-resident to reduce the overall memory footprint.
The “for a while” part bugs me, because this is precisely what I don’t know how to determine. In a rendering pipeline with dozens of compute shaders and draw calls, all executed asynchronously, some of them modifying buffers and textures, some of them only consuming them, it is quite impossible to tell when a buffer is not being accessed anymore. And I cannot just assume a time in milliseconds.
Am I missing something here? Can I really only make resources non-resident safely after fences?
I obviously don’t want to use a fence, as I don’t want to block anything and wait 10 ms on a dispatch without doing anything else.
Nobody’s forcing you to wait on that fence. You can query it whenever you like.
What I would like to have is a callback once the dispatch is done. Does OpenGL provide such a mechanism?
No. Use a fence and query it.
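To make "use a fence and query it" concrete, here is a minimal sketch of a deferred-release queue that is polled instead of waited on. DeferredReleaseQueue and PendingRelease are hypothetical names, not part of OpenGL. The GL calls appear only in comments so the logic can be read and tested without a context. It also assumes all fences come from a single context's command stream, so they signal in submission order.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper, not an OpenGL API. With a real context one would use:
//   after the dispatch: GLsync s = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
//   isSignaled:         GLint st = 0;
//                       glGetSynciv(s, GL_SYNC_STATUS, sizeof(st), nullptr, &st);
//                       return st == GL_SIGNALED;
//   release:            glMakeNamedBufferNonResidentNV(buffer); glDeleteSync(s);
struct PendingRelease {
    std::function<bool()> isSignaled;  // non-blocking fence query
    std::function<void()> release;     // makes the resource non-resident
};

class DeferredReleaseQueue {
public:
    // Register a resource whose release must wait for a fence.
    void push(std::function<bool()> isSignaled, std::function<void()> release) {
        assert(isSignaled && release);
        pending_.push_back({std::move(isSignaled), std::move(release)});
    }

    // Call between batches or once per frame. Never blocks.
    // Stops at the first unsignaled fence (assumes in-order signaling).
    std::size_t poll() {
        std::size_t released = 0;
        while (!pending_.empty() && pending_.front().isSignaled()) {
            pending_.front().release();
            pending_.pop_front();
            ++released;
        }
        return released;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::deque<PendingRelease> pending_;
};
```

One caveat: a fence that has never been flushed to the GPU may never report as signaled when polled, so a glFlush (or a first glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT and a timeout of 0) after glFenceSync is probably needed.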
In a rendering pipeline with dozens of compute shaders and draw calls, all executed asynchronously, some of them modifying buffers and textures, some of them only consuming them, it is quite impossible to tell when a buffer is not being accessed anymore.
But… you are the one who gave those “dozens of compute shaders and draw calls” the buffers and textures they use. You put them there for them to use.
Unless you’ve moved everything to the GPU (including positioning of the camera, visibility testing, and the like), then you have some idea about what can be seen from the camera’s current location. And therefore, you have some idea of what is being used, may soon be used, and almost certainly won’t be used in the immediate future. So use that to determine what is and is not resident.
I was not aware of that, thanks! I thought fences could only be waited on. I should have read the ARB_sync spec.
This is what I wanted to point out: “almost certain” leads to crashes. glDispatchCompute returns immediately, so if I have 50 dispatches in a frame, how should I know (without a fence) which of them have completed, and therefore which compute shaders’ input buffers I can make non-resident? I cannot just start changing the residency of each set of input buffers every 0.5 ms. Run the application on an older card and you get driver crashes. But, as mentioned, now that you have made me aware that fences can be queried without blocking, this of course solves the problem.
You need to be certain about what is or isn’t being used. But for things which aren’t being used right now, which ones to keep resident is basically an educated guess.
Anything you’re going to be using for the current frame should be resident. When it comes to the next frame, you may need to make some other data resident. To free up memory, you may need to evict data you’re no longer using. As for which data to evict, the main factor is to avoid evicting anything that’s still required by pending operations; beyond that, it’s just a case of how soon you expect to need the data.
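The eviction rule described above could be sketched roughly as follows. Everything here is hypothetical bookkeeping on the application side, not a GL API: the Resource struct, the gpuPending flag (true while an unsignaled fence still covers the resource) and the nextUseFrame estimate are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-resource bookkeeping kept by the application.
struct Resource {
    int id;
    std::uint64_t bytes;
    bool gpuPending;             // still referenced by commands not yet known to be complete
    std::uint64_t nextUseFrame;  // application's guess of when it is needed next
};

// Returns the ids to make non-resident: never anything still pending,
// otherwise the resources expected to be needed furthest in the future,
// until at least bytesNeeded has been freed or no candidates remain.
std::vector<int> chooseEvictions(std::vector<Resource> resources,
                                 std::uint64_t bytesNeeded) {
    // Rule 1: anything required by pending operations is off limits.
    resources.erase(std::remove_if(resources.begin(), resources.end(),
                                   [](const Resource& r) { return r.gpuPending; }),
                    resources.end());

    // Rule 2: prefer evicting what is expected to be needed latest.
    std::stable_sort(resources.begin(), resources.end(),
                     [](const Resource& a, const Resource& b) {
                         return a.nextUseFrame > b.nextUseFrame;
                     });

    std::vector<int> evict;
    std::uint64_t freed = 0;
    for (const Resource& r : resources) {
        if (freed >= bytesNeeded) break;
        evict.push_back(r.id);
        freed += r.bytes;
    }
    return evict;
}
```

The first rule is the hard constraint; the ordering in the second is only a heuristic and could be replaced by whatever the application knows about visibility or upcoming passes.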
If the amount of data required for one frame exceeds available video memory, there aren’t any simple solutions. You may well end up needing to have threads which simply wait on a fence, evict the data which is now redundant, make new data resident and enqueue new commands using that data.
If you’re using compute shaders extensively, you might be better off with OpenCL or CUDA.
Thanks for your answer as well, GClements! Maybe I was not clear enough in my original post, but I am talking about making stuff non-resident within one frame. I am perfectly aware that this is a serious performance impairment, but it is a way to circumvent the limited GPU memory to some degree. In a current example from scientific visualization, I need to render a data set of 4 GB, then another data set of 4 GB on top of that, and then additional data amounting to approximately 1 GB. Which is horrible for a GTX 1080 with only 8 GB of memory. So in these edge cases, rather than crashing, I want to tell the GL to make the first data set non-resident after all its compute and draw calls have finished so the driver can swap it. It might take a few milliseconds, but at least I get an image at the end of the frame.
In the worst case scenario of your example, where the new data set is not in GPU memory at all, that means you have to transfer all of that data before you even begin to use any of it. 4GB does not cross the PCIe bus in “a few milliseconds”. Even in the best possible case, at least 1GB has to be transferred every frame. Again, that doesn’t happen in “a few milliseconds”.
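For a rough sense of scale, here is a back-of-the-envelope calculation. It assumes a PCIe 3.0 x16 link (which is what a GTX 1080 uses) running at its theoretical peak, and decimal gigabytes; real-world throughput is lower, so these numbers are optimistic. The constant and function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Theoretical peak of PCIe 3.0 x16 (assumption for this estimate):
// 8 GT/s per lane * 16 lanes * 128b/130b encoding / 8 bits per byte,
// which is roughly 15.75e9 bytes per second.
constexpr double kPcie3x16BytesPerSec = 8e9 * 16.0 * (128.0 / 130.0) / 8.0;

// Minimum time in milliseconds to move `bytes` over a link of the given speed.
double transferMillis(double bytes, double bytesPerSec) {
    return bytes / bytesPerSec * 1000.0;
}
```

Under these assumptions, transferMillis(4e9, kPcie3x16BytesPerSec) comes to about 254 ms and transferMillis(1e9, kPcie3x16BytesPerSec) to about 63 ms, i.e. well below 4 frames per second in the worst case and below 16 in the best case, before any rendering happens.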
If you want to frequently deal with more data than the GPU can hold, which will require DMA’ing data on the order of gigabytes to the GPU every frame, “a few milliseconds” is just not realistic. There’s a reason why high-performance graphics applications will do anything it takes to avoid this.
In any case, I would strongly suggest switching to Vulkan if that is at all possible for you. It’s not going to make the DMA any faster, but it gives you more low-level control over memory, so that you can control exactly what is being stored in memory and what is being evicted. You may even be able to restructure your compute operations to be better able to compute things while you’re DMA’ing other things.