I have a delay of several milliseconds caused by a call to Linux poll() inside vkWaitForFences, which waits on the fence for the vkQueueSubmit of the central draw command buffer. What would that indicate? Too many draw-related commands in the buffer? If yes, what’s the easiest way to find out which commands cause the most delay?
Broadly speaking, if you call vkWaitForFences with a non-zero timeout, you’re doing that because you ran out of anything useful to do on the CPU (otherwise, you’d use a 0 timeout and go do those things if the fence wasn’t signaled). That means the GPU is taking longer than the CPU to do stuff. And since you should be waiting on the fence for the previous frame (i.e., not the one you just submitted), this would only happen if the GPU is taking significantly longer than the CPU to do stuff.
So you need to profile what’s happening on the GPU. I understand that RenderDoc is a useful tool for that.
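To make the "0 timeout, go do other things" idea concrete, here is a minimal sketch of that pattern. It is not taken from anyone's renderer: `retirePreviousFrame`, `FenceWaitStats`, and the three callbacks are hypothetical names, and the Vulkan calls are hidden behind callbacks so the snippet compiles without the Vulkan headers.

```cpp
#include <functional>

// Hypothetical bookkeeping for one attempt to retire the previous frame's fence.
struct FenceWaitStats {
    int  polls       = 0;      // non-blocking fence checks performed
    int  cpuTasksRun = 0;      // units of CPU work done while the GPU was busy
    bool blocked     = false;  // true if we ran out of CPU work and had to block
};

// Poll the fence without blocking, keep the CPU busy while there is work,
// and fall back to a blocking wait only when nothing useful is left.
//
// Assumed mapping to real Vulkan calls (not used here directly):
//   isSignaled         -> vkGetFenceStatus(device, fence) == VK_SUCCESS
//   blockUntilSignaled -> vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX)
// runOneCpuTask is application-specific and returns false when no work remains.
inline FenceWaitStats retirePreviousFrame(
    const std::function<bool()>& isSignaled,
    const std::function<bool()>& runOneCpuTask,
    const std::function<void()>& blockUntilSignaled)
{
    FenceWaitStats stats;
    for (;;) {
        ++stats.polls;
        if (isSignaled()) {
            return stats;            // GPU finished; no blocking wait needed
        }
        if (!runOneCpuTask()) {
            break;                   // nothing useful left to do on the CPU
        }
        ++stats.cpuTasksRun;
    }
    stats.blocked = true;            // this is where the poll() stall would show up
    blockUntilSignaled();
    return stats;
}
```

If `blocked` comes back true every frame, the CPU is idling on the GPU, which is the situation described above and the cue to profile the GPU side.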
Ok, so according to NVIDIA Nsight there is a 1 ms delay in three vkCmdDraw calls that draw a full-screen quad with 6 vertices. That’s about 3 times what the same drawcall takes on OpenGL. The VS and FS are the same as in the OpenGL implementation, so I assume they are not the reason for the performance loss.
Which pipeline parameters have the greatest influence on performance in such a scenario?
Edit: One major difference between the two implementations is that OpenGL has 2 attachments (color, depth/stencil), with color being a 4x MSAA glRenderbufferStorageMultisample renderbuffer, while the Vulkan implementation has 3 attachments, the third being a resolve color target for the same purpose.
That doesn’t change anything. It’s just a more explicit restatement of what you did, listing out the temporaries the hardware would have to compute to make your code work.
Stripped the vertex and fragment shaders to a minimum (MVP transform and color pass-through). No difference according to Nsight. Duration seems to be proportional to pixel count, though, when comparing to other drawcalls. Is there any other per-pixel overhead that typically arises when switching to Vulkan if one is careless? Do multisampled + resolve targets work differently with glRenderbufferStorageMultisample in a way that makes them more efficient?
Wait: you can’t do that. All images attached to an FBO must have the same sample count. So the depth/stencil also has to be a 4x multisample buffer, right?
These are the commands from the beginning of the render pass up to the vkCmdDraw. It takes 1.12 ms, whereas the corresponding glDrawArrays call on OpenGL takes only 0.34 ms. The results are roughly the same with NVIDIA Nsight and RenderDoc.
Another drawcall also draws a screen rect, this time without multisampling, and it still takes only a little over half the time in OpenGL that it takes in Vulkan (0.28 ms vs. 0.48 ms) according to Nsight and RenderDoc.
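For reference, the two measurements quoted above work out to the following Vulkan-to-OpenGL slowdown factors. This is plain arithmetic on those numbers, not a new measurement:

```latex
\[
\frac{1.12\ \text{ms}}{0.34\ \text{ms}} \approx 3.3
\quad \text{(multisampled full-screen quad)},
\qquad
\frac{0.48\ \text{ms}}{0.28\ \text{ms}} \approx 1.7
\quad \text{(screen rect without multisampling)}
\]
```

So the gap is noticeably larger for the multisampled pass than for the non-multisampled one.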
The problem is reproducible with the Diligent engine and the example:
The test app starts in Vulkan mode when run without parameters and in OpenGL mode when “-mode GL” is appended to the command line. Using RenderDoc, I get the following result from the performance counters:
In the Vulkan results, 14 and 18 clear the color and depth buffers.
In the OpenGL results, 7 and 12 clear the color buffer and depth buffer, respectively.
It can be observed that Vulkan takes 2-3 times longer (second column from the left).
On the other hand, Vulkan wins for the less pixel-heavy drawcalls that come afterwards.