Profile a delay in vkWaitForFences

Greetings,

I have a delay of several milliseconds caused by a call to Linux poll() within the vkWaitForFences that waits for the fences for the vkQueueSubmit of the central draw command buffer. What would that indicate? Too many draw related commands in the buffer? If yes, what’s the easiest way to find out the commands that cause the most delay?

Regards

Broadly speaking, if you call vkWaitForFences with a non-zero timeout, then you’re doing that because you ran out of anything useful to do on the CPU (otherwise, you’d use a 0 timeout and go do those things if the fence wasn’t signaled). That means the GPU is taking longer than the CPU to do stuff. And since you should be waiting on the fence for the last frame (i.e., not the one you just submitted), this would only happen if the GPU is taking significantly longer than the CPU to do stuff.
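To make that reasoning concrete, here is a minimal sketch of a toy frame-pacing model. It is not Vulkan code: the function name, its parameters and the fixed per-frame costs are all assumptions made for illustration. The CPU records a frame, submits it, and before starting frame i it waits for the (simulated) fence of frame i - frames_in_flight.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (hypothetical, not a Vulkan API): returns how long the CPU blocks
// in the simulated fence wait before the final frame, in milliseconds.
// cpu_ms: CPU time to record and submit one frame (assumed constant).
// gpu_ms: GPU time to execute one frame (assumed constant).
double steady_state_wait_ms(double cpu_ms, double gpu_ms,
                            int frames_in_flight, int frame_count) {
    assert(frames_in_flight > 0 && frame_count > 0);
    std::vector<double> gpu_done(frame_count, 0.0);  // fence signal times
    double cpu_time = 0.0;   // current CPU timeline position
    double gpu_free = 0.0;   // time at which the GPU finishes its last job
    double last_wait = 0.0;
    for (int i = 0; i < frame_count; ++i) {
        last_wait = 0.0;
        if (i >= frames_in_flight) {
            // Wait on the fence of an older frame, not the one just submitted.
            const double signaled = gpu_done[i - frames_in_flight];
            last_wait = std::max(0.0, signaled - cpu_time);
            cpu_time += last_wait;
        }
        cpu_time += cpu_ms;  // record and submit frame i
        const double start = std::max(cpu_time, gpu_free);
        gpu_done[i] = start + gpu_ms;
        gpu_free = gpu_done[i];
    }
    return last_wait;
}
```

In this model, calling steady_state_wait_ms(2.0, 5.0, 2, 100) gives a wait of gpu_ms - cpu_ms, while swapping the two costs gives no wait at all. A multi-millisecond stall in the fence wait is therefore what you would expect when the GPU side is the slower one.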

So you need to profile what’s happening on the GPU. I understand that RenderDoc is a useful tool for that.


Ok, so according to Nvidia Nsight there is a 1 ms delay in three vkCmdDraw calls that draw a full-screen quad with 6 vertices. That is about 3 times what the same draw call takes on OpenGL. The VS and FS are the same as in the OpenGL implementation, so I assume they are not the reason for the performance loss.

Which pipeline parameters have the greatest influence on performance in such a scenario?

Edit: One major difference between the two implementations is the attachment setup. OpenGL has 2 attachments (color, depth/stencil), with the color attachment being a 4x MSAA renderbuffer created with glRenderbufferStorageMultisample. The Vulkan implementation has 3 attachments, the third being a resolve color target for the same purpose.

Interesting: SPIRV-Cross seems to expand this

void main()
{
    color = u_color;
    float m = calcFFactor( gl_FragCoord.z );
    color.rgb = mix( u_f_color, color.rgb, m );
}

to this

void main()
{
    color = _68.u_color;
    highp float param = gl_FragCoord.z;
    highp float m = calcFFactor(param);
    highp vec3 _90 = mix(_16.u_f_color, color.xyz, vec3(m));
    color = vec4(_90.x, _90.y, _90.z, color.w);
}

I wonder how this affects performance.

That doesn’t change anything. It’s just a more explicit restatement of what you did, listing out the temporaries the hardware would have to compute to make your code work.
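As a rough illustration of why the two forms are equivalent, here is a small C++ sketch that emulates both versions of the color computation on the CPU. The helper names (glsl_mix, shade_original, shade_expanded) are made up for this example, and the factor m is passed in directly because calcFFactor is not shown in the thread.

```cpp
#include <array>

using Vec3 = std::array<float, 3>;
using Vec4 = std::array<float, 4>;

// GLSL mix(x, y, a) is defined as x * (1 - a) + y * a, per component.
inline float glsl_mix(float x, float y, float a) {
    return x * (1.0f - a) + y * a;
}

// Hand-written form: color.rgb = mix(u_f_color, color.rgb, m) with scalar m.
Vec4 shade_original(Vec4 color, const Vec3& f_color, float m) {
    for (int i = 0; i < 3; ++i) {
        color[i] = glsl_mix(f_color[i], color[i], m);
    }
    return color;
}

// Expanded form: broadcast m to vec3(m), compute a vec3 temporary,
// then rebuild the vec4 while keeping the original alpha.
Vec4 shade_expanded(const Vec4& color, const Vec3& f_color, float m) {
    const Vec3 weight = {m, m, m};
    const Vec3 t = {glsl_mix(f_color[0], color[0], weight[0]),
                    glsl_mix(f_color[1], color[1], weight[1]),
                    glsl_mix(f_color[2], color[2], weight[2])};
    return Vec4{t[0], t[1], t[2], color[3]};
}
```

Both functions perform the same three mixes and leave alpha untouched; the expanded one only names the intermediate values.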

I stripped the vertex and fragment shaders down to a minimum (MVP transform and color pass-through). No difference according to Nsight. Compared to other draw calls, though, the duration seems to be proportional to the pixel count. Is there any other per-pixel overhead that typically arises when switching to Vulkan if one is careless? Do multisampled + resolve targets work differently with glRenderbufferStorageMultisample in a way that makes them more efficient?

Wait: you can’t do that. All images attached to an FBO must have the same sample count. So the depth/stencil also has to be a 4x multisample buffer, right?

Yeah should be. Cannot look right now but should be.
Edit: Yes it is.

These are the commands from the beginning of the render pass up to the vkCmdDraw. The draw takes 1.12 ms, whereas the corresponding glDrawArrays on OpenGL takes only 0.34 ms. The results are roughly the same with Nvidia Nsight and RenderDoc.

Event    Description    CPU ms    GPU ms
280    "void vkCmdBeginRenderPass(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', const VkRenderPassBeginInfo* pRenderPassBegin = { .sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO, .pNext = nullptr, .renderPass = '0x00007fd658b2b1e0', .framebuffer = '0x00007fd65815e6a0', .renderArea = { .offset = { .x = 0, .y = 0 }, .extent = { .width = 2240, .height = 1260 } }, .clearValueCount = 0, .pClearValues = nullptr }, VkSubpassContents contents = VK_SUBPASS_CONTENTS_INLINE)"    -    -
281    "void vkCmdBindPipeline(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkPipelineBindPoint pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS, VkPipeline pipeline = '0x00007fd658ebdb80')"    -    -
282    "void vkCmdBindVertexBuffers(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', uint32_t firstBinding = 0, uint32_t bindingCount = 1, const VkBuffer* pBuffers = '0x00007fd658fbee60', const VkDeviceSize* pOffsets = 0)"    -    -
283    "void vkCmdSetStencilReference(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkStencilFaceFlags faceMask = VkStencilFaceFlags(VK_STENCIL_FACE_FRONT_AND_BACK), uint32_t reference = 0)"    -    -
284    "void vkCmdPushConstants(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkPipelineLayout layout = '0x00007fd658eba230', VkShaderStageFlags stageFlags = VkShaderStageFlags(VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT), uint32_t offset = 0, uint32_t size = 80, const void* pValues = 0x00007fd64bfd2a00)"    -    -
285    "void vkCmdDraw(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', uint32_t vertexCount = 6, uint32_t instanceCount = 1, uint32_t firstVertex = 0, uint32_t firstInstance = 0)"    -    1.12
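For what it is worth, the reported timings can be turned into a per-pixel cost with a little arithmetic. The sketch below assumes that the quad covers the whole 2240 x 1260 render area from the capture and that the OpenGL target has the same size; the constant and function names are made up for this example.

```cpp
#include <cstdint>

// Render area taken from the vkCmdBeginRenderPass event in the capture.
constexpr std::uint64_t kWidth = 2240;
constexpr std::uint64_t kHeight = 1260;

// Number of pixels covered, assuming the quad fills the whole render area.
constexpr std::uint64_t pixel_count() { return kWidth * kHeight; }

// Average GPU time per covered pixel in nanoseconds for a draw that
// took draw_ms milliseconds.
constexpr double ns_per_pixel(double draw_ms) {
    return draw_ms * 1.0e6 / static_cast<double>(pixel_count());
}

// Reported timings: 1.12 ms (vkCmdDraw) and 0.34 ms (glDrawArrays).
constexpr double kVulkanNsPerPixel = ns_per_pixel(1.12);
constexpr double kOpenGlNsPerPixel = ns_per_pixel(0.34);
```

With 2,822,400 pixels this comes out at roughly 0.40 ns per pixel for Vulkan versus roughly 0.12 ns for OpenGL, a factor of about 3.3. A cost this small per pixel with trivial shaders points toward attachment and memory traffic rather than shader work, although the numbers alone do not prove that.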

Render Pass Parameters:

vkCreateRenderPass                vkCreateRenderPass({ { VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_D24_UNORM_S8_UINT, VK_FORMAT_R8G8B8A8_UNORM }, { { { 0 }, 1 } } })
  device                          Device 11
  CreateInfo                      VkRenderPassCreateInfo()
    sType                         VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO
    pNext                         NULL
    flags                         VkRenderPassCreateFlagBits(0)
    attachmentCount               3
    pAttachments                  VkAttachmentDescription[3]
      [0]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_R8G8B8A8_UNORM
        samples                   VK_SAMPLE_COUNT_4_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_DONT_CARE
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_DONT_CARE
        initialLayout             VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
      [1]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_D24_UNORM_S8_UINT
        samples                   VK_SAMPLE_COUNT_4_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_LOAD
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_STORE
        initialLayout             VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
      [2]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_R8G8B8A8_UNORM
        samples                   VK_SAMPLE_COUNT_1_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_LOAD
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_STORE
        initialLayout             VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    subpassCount                  1
    pSubpasses                    VkSubpassDescription[1]
      [0]                         VkSubpassDescription()
        flags                     VkSubpassDescriptionFlagBits(0)
        pipelineBindPoint         VK_PIPELINE_BIND_POINT_GRAPHICS
        inputAttachmentCount      0
        pInputAttachments         VkAttachmentReference[0]
        colorAttachmentCount      1
        pColorAttachments         VkAttachmentReference[1]
          [0]                     VkAttachmentReference()
            attachment            0
            layout                VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        pResolveAttachments       VkAttachmentReference[1]
          [0]                     VkAttachmentReference()
            attachment            2
            layout                VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        pDepthStencilAttachment   VkAttachmentReference()
          attachment              1
          layout                  VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
        preserveAttachmentCount   0
        pPreserveAttachments      uint32_t[0]
    dependencyCount               2
    pDependencies                 VkSubpassDependency[2]
      [0]                         VkSubpassDependency()
        srcSubpass                UINT32_MAX
        dstSubpass                0
        srcStageMask              VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
        dstStageMask              VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
        srcAccessMask             VK_ACCESS_MEMORY_READ_BIT
        dstAccessMask             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
        dependencyFlags           VK_DEPENDENCY_BY_REGION_BIT
      [1]                         VkSubpassDependency()
        srcSubpass                0
        dstSubpass                UINT32_MAX
        srcStageMask              VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
        dstStageMask              VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
        srcAccessMask             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
        dstAccessMask             VK_ACCESS_MEMORY_READ_BIT
        dependencyFlags           VK_DEPENDENCY_BY_REGION_BIT
  pAllocator                      NULL
  RenderPass                      Render Pass 219
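One way to reason about this render pass is to add up the logical size of the three attachments and of their load and store operations. The sketch below does only that arithmetic. It assumes 4 bytes per sample for both formats, the 2240 x 1260 render area from the capture, and no compression, so the result is an upper bound on logical data and not a measurement of real memory traffic. The struct and function names are made up for this example.

```cpp
#include <cstdint>

// One attachment of the render pass, reduced to what matters for size.
struct Attachment {
    std::uint64_t bytes_per_sample;  // 4 for R8G8B8A8_UNORM and D24_UNORM_S8_UINT
    std::uint64_t samples;           // 4 for the MSAA targets, 1 for the resolve
    bool load;                       // loadOp is LOAD
    bool store;                      // storeOp is STORE
};

constexpr std::uint64_t kRenderWidth = 2240;
constexpr std::uint64_t kRenderHeight = 1260;

// Uncompressed size of one attachment over the render area.
constexpr std::uint64_t attachment_bytes(const Attachment& a) {
    return kRenderWidth * kRenderHeight * a.bytes_per_sample * a.samples;
}

// Logical bytes touched by the load and store ops of one attachment.
constexpr std::uint64_t load_store_bytes(const Attachment& a) {
    return attachment_bytes(a) * ((a.load ? 1u : 0u) + (a.store ? 1u : 0u));
}

// The three attachments as listed in the dump above.
constexpr Attachment kColorMsaa{4, 4, true, true};
constexpr Attachment kDepthStencilMsaa{4, 4, true, true};
constexpr Attachment kResolve{4, 1, true, true};

constexpr std::uint64_t total_load_store_bytes() {
    return load_store_bytes(kColorMsaa) + load_store_bytes(kDepthStencilMsaa) +
           load_store_bytes(kResolve);
}
```

Under these assumptions each 4x target is about 45 MB, the resolve target about 11 MB, and the load and store ops together cover about 203 MB of logical data per pass. The resolve attachment is overwritten by the resolve inside the render area, so its LOAD is a natural candidate for VK_ATTACHMENT_LOAD_OP_DONT_CARE, and storing the multisampled color is only needed if it is read again later. Whether either change closes the gap on this driver would have to be measured.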

Another draw call that also draws a screen rect, this time without multisampling, still takes only about half as long in OpenGL as in Vulkan (0.28 ms vs. 0.48 ms) according to Nsight and RenderDoc.

Can anyone tell me what the overhead of a push constant is in a shader that does not use it?

The problem is reproducible with the Diligent engine and the example:

The test app starts in Vulkan mode when run without parameters, and in OpenGL mode when “-mode GL” is appended to the command line. Using RenderDoc, I get the following result from the performance counters:

In the Vulkan capture, events 14 and 18 clear the color and depth buffers.
In the OpenGL capture, events 7 and 12 clear the color buffer and depth buffer respectively.

It can be observed that Vulkan takes 2-3 times longer for these clears (second column from the left).
For the less pixel-heavy draw calls that come afterwards, on the other hand, Vulkan wins.

Vulkan:
14 116.128 0 0 0 0 0 0 0 0 0 0 0 0
18 119.68 0 0 0 0 0 0 0 0 0 0 0 0
23 27.424 3 1 0 1 1 280208 3 0 0 0 280208 0
34 8.384 303 101 0 101 101 8948 116 0 0 0 8948 0

OpenGL:
7 40.672 0 0 0 0 0 0 0 0 0 0 0 0
12 11.136 0 0 0 0 0 0 0 0 0 0 0 0
18 38.048 3 1 0 1 1 287847 3 0 0 0 287847 0
39 18.976 303 101 0 101 101 8948 116 0 0 0 8948 0
40 1.12 0 0 0 0 0 0 0 0 0 0 0 0
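To compare the two captures side by side, the duration column can be divided pairwise. The sketch below pairs the clears as described above. Pairing Vulkan event 23 with OpenGL event 18 and Vulkan event 34 with OpenGL event 39 is an assumption based on the matching vertex counts in the third column (3 and 303). The unit of the duration column is not stated, which does not matter for a ratio. All names are made up for this example.

```cpp
// One matched pair of events from the two counter tables.
struct EventPair {
    const char* what;
    double vulkan;  // duration column of the Vulkan capture
    double opengl;  // duration column of the OpenGL capture
};

// Values copied from the tables above; the pairing of the two draws is assumed.
constexpr EventPair kPairs[] = {
    {"color clear (VK 14 / GL 7)", 116.128, 40.672},
    {"depth clear (VK 18 / GL 12)", 119.68, 11.136},
    {"draw, 3 vertices (VK 23 / GL 18)", 27.424, 38.048},
    {"draw, 303 vertices (VK 34 / GL 39)", 8.384, 18.976},
};

// Ratio above 1 means Vulkan is slower, below 1 means Vulkan is faster.
constexpr double vulkan_over_opengl(const EventPair& p) {
    return p.vulkan / p.opengl;
}
```

With these values the ratios are about 2.9 for the color clear and about 10.7 for the depth clear, against about 0.72 and 0.44 for the two draws, so in these captures the slowdown is concentrated in the full-target clear operations.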

Nvidia Quadro M1000M Driver Version: 510.73.05