Profile a delay in vkWaitForFences

Desperado17 · August 13, 2022, 2:58pm

Greetings,

I have a delay of several milliseconds caused by a call to Linux poll() inside vkWaitForFences, which waits on the fence passed to the vkQueueSubmit of the central draw command buffer. What would that indicate? Too many draw-related commands in the buffer? If so, what’s the easiest way to find out which commands cause the most delay?

Regards

Alfonse_Reinheart · August 13, 2022, 3:16pm

Broadly speaking, if you call vkWaitForFences with a non-zero timeout, then you’re doing that because you ran out of anything useful to do on the CPU (because otherwise, you’d use a 0 timeout and go do those things if the fence wasn’t signaled). That means the GPU is taking longer than the CPU to do stuff. And since you should be waiting on the fence for the last frame (i.e., not the one you just submitted), this would only happen if the GPU is taking significantly longer than the CPU to do stuff.

So you need to profile what’s happening on the GPU. I understand that RenderDoc is a useful tool for that.
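
A minimal sketch of the zero-timeout pattern described above. To keep it compilable without the Vulkan headers, the fence is represented by a hypothetical `FenceWait` callback; in a real renderer it could wrap `vkWaitForFences(device, 1, &fence, VK_TRUE, timeoutNs)` and map `VK_SUCCESS` to true and `VK_TIMEOUT` to false. The `runOneJob` callback and the `WaitStats` struct are likewise assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical stand-in for a fence wait. In a Vulkan renderer this could wrap
// vkWaitForFences(device, 1, &fence, VK_TRUE, timeoutNs) and return true for
// VK_SUCCESS and false for VK_TIMEOUT.
using FenceWait = std::function<bool(std::uint64_t timeoutNs)>;

struct WaitStats {
    int jobsDone = 0;      // CPU jobs finished while the GPU was still busy
    bool blocked = false;  // true if a blocking wait was needed in the end
};

// Polls the previous frame's fence with a zero timeout while CPU work remains,
// and only blocks once there is nothing useful left to do.
// runOneJob is a hypothetical callback that returns false when no work is left.
inline WaitStats waitForPreviousFrame(const FenceWait& wait,
                                      const std::function<bool()>& runOneJob) {
    WaitStats stats;
    while (!wait(0)) {       // timeout 0: query the fence state, never sleep
        if (!runOneJob()) {  // out of CPU work, so blocking costs nothing extra
            stats.blocked = true;
            wait(std::numeric_limits<std::uint64_t>::max());
            break;
        }
        ++stats.jobsDone;
    }
    return stats;
}
```

If `blocked` is set on most frames, the CPU is regularly idle while the GPU finishes, which is the situation the poll() delay suggests.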

Desperado17 · August 15, 2022, 10:38am

OK, so according to Nvidia Nsight there is a 1 ms delay in three vkCmdDraw calls that draw a full-screen quad with 6 vertices. That is about 3 times what the same draw call takes on OpenGL. The VS and FS are the same as in the OpenGL implementation, so I assume they are not the reason for the performance loss.

Which pipeline parameters have the greatest influence on performance in such a scenario?

Edit: One major difference between the OpenGL and Vulkan implementations is that OpenGL has 2 attachments (color and depth/stencil), with color being a 4x MSAA renderbuffer created with glRenderbufferStorageMultisample, whereas the Vulkan implementation has 3 attachments, the third being a resolve color target for the same purpose.

Desperado17 · August 15, 2022, 2:25pm

Interesting: SPIRV-Cross seems to expand this

void main()
{
    color = u_color;
    float m = calcFFactor( gl_FragCoord.z );
    color.rgb = mix( u_f_color, color.rgb, m );
}

to this

void main()
{
    color = _68.u_color;
    highp float param = gl_FragCoord.z;
    highp float m = calcFFactor(param);
    highp vec3 _90 = mix(_16.u_f_color, color.xyz, vec3(m));
    color = vec4(_90.x, _90.y, _90.z, color.w);
}

I wonder how this affects performance.

Alfonse_Reinheart · August 15, 2022, 2:48pm

That doesn’t change anything. It’s just a more explicit restatement of what you did, listing out the temporaries the hardware would have to compute to make your code work.

Desperado17 · August 15, 2022, 3:45pm

Stripped the vertex and fragment shaders to a minimum (MVP transform and color pass-through). No difference according to Nsight. Duration seems to be proportional to pixel count, though, when compared to other draw calls. Is there any other per-pixel overhead that typically arises when switching to Vulkan if one is careless? Do multisampled and resolve targets work differently with glRenderbufferStorageMultisample in a way that makes them more efficient?

Alfonse_Reinheart · August 15, 2022, 4:21pm

Wait: you can’t do that. All images attached to an FBO must have the same sample count. So the depth/stencil also has to be a 4x multisample buffer, right?

Desperado17 · August 15, 2022, 4:56pm

Yeah should be. Cannot look right now but should be.
Edit: Yes it is.

Desperado17 · August 16, 2022, 8:21am

These are the commands from the beginning of the render pass up to the vkCmdDraw. The draw takes 1.12 ms, whereas the corresponding glDrawArrays call on OpenGL takes only 0.34 ms. The results are roughly the same with Nvidia Nsight and RenderDoc.

Event    Description    CPU ms    GPU ms
280    "void vkCmdBeginRenderPass(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', const VkRenderPassBeginInfo* pRenderPassBegin = { .sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO, .pNext = nullptr, .renderPass = '0x00007fd658b2b1e0', .framebuffer = '0x00007fd65815e6a0', .renderArea = { .offset = { .x = 0, .y = 0 }, .extent = { .width = 2240, .height = 1260 } }, .clearValueCount = 0, .pClearValues = nullptr }, VkSubpassContents contents = VK_SUBPASS_CONTENTS_INLINE)"    -    -
281    "void vkCmdBindPipeline(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkPipelineBindPoint pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS, VkPipeline pipeline = '0x00007fd658ebdb80')"    -    -
282    "void vkCmdBindVertexBuffers(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', uint32_t firstBinding = 0, uint32_t bindingCount = 1, const VkBuffer* pBuffers = '0x00007fd658fbee60', const VkDeviceSize* pOffsets = 0)"    -    -
283    "void vkCmdSetStencilReference(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkStencilFaceFlags faceMask = VkStencilFaceFlags(VK_STENCIL_FACE_FRONT_AND_BACK), uint32_t reference = 0)"    -    -
284    "void vkCmdPushConstants(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', VkPipelineLayout layout = '0x00007fd658eba230', VkShaderStageFlags stageFlags = VkShaderStageFlags(VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT), uint32_t offset = 0, uint32_t size = 80, const void* pValues = 0x00007fd64bfd2a00)"    -    -
285    "void vkCmdDraw(VkCommandBuffer commandBuffer = '0x00007fd6581fe750', uint32_t vertexCount = 6, uint32_t instanceCount = 1, uint32_t firstVertex = 0, uint32_t firstInstance = 0)"    -    1.12

Desperado17 · August 16, 2022, 9:17am

Render Pass Parameters:

vkCreateRenderPass                vkCreateRenderPass({ { VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_D24_UNORM_S8_UINT, VK_FORMAT_R8G8B8A8_UNORM }, { { { 0 }, 1 } } })
  device                          Device 11
  CreateInfo                      VkRenderPassCreateInfo()
    sType                         VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO
    pNext                         NULL
    flags                         VkRenderPassCreateFlagBits(0)
    attachmentCount               3
    pAttachments                  VkAttachmentDescription[3]
      [0]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_R8G8B8A8_UNORM
        samples                   VK_SAMPLE_COUNT_4_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_DONT_CARE
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_DONT_CARE
        initialLayout             VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
      [1]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_D24_UNORM_S8_UINT
        samples                   VK_SAMPLE_COUNT_4_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_LOAD
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_STORE
        initialLayout             VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
      [2]                         VkAttachmentDescription()
        flags                     VkAttachmentDescriptionFlagBits(0)
        format                    VK_FORMAT_R8G8B8A8_UNORM
        samples                   VK_SAMPLE_COUNT_1_BIT
        loadOp                    VK_ATTACHMENT_LOAD_OP_LOAD
        storeOp                   VK_ATTACHMENT_STORE_OP_STORE
        stencilLoadOp             VK_ATTACHMENT_LOAD_OP_LOAD
        stencilStoreOp            VK_ATTACHMENT_STORE_OP_STORE
        initialLayout             VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        finalLayout               VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    subpassCount                  1
    pSubpasses                    VkSubpassDescription[1]
      [0]                         VkSubpassDescription()
        flags                     VkSubpassDescriptionFlagBits(0)
        pipelineBindPoint         VK_PIPELINE_BIND_POINT_GRAPHICS
        inputAttachmentCount      0
        pInputAttachments         VkAttachmentReference[0]
        colorAttachmentCount      1
        pColorAttachments         VkAttachmentReference[1]
          [0]                     VkAttachmentReference()
            attachment            0
            layout                VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        pResolveAttachments       VkAttachmentReference[1]
          [0]                     VkAttachmentReference()
            attachment            2
            layout                VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
        pDepthStencilAttachment   VkAttachmentReference()
          attachment              1
          layout                  VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
        preserveAttachmentCount   0
        pPreserveAttachments      uint32_t[0]
    dependencyCount               2
    pDependencies                 VkSubpassDependency[2]
      [0]                         VkSubpassDependency()
        srcSubpass                UINT32_MAX
        dstSubpass                0
        srcStageMask              VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
        dstStageMask              VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
        srcAccessMask             VK_ACCESS_MEMORY_READ_BIT
        dstAccessMask             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
        dependencyFlags           VK_DEPENDENCY_BY_REGION_BIT
      [1]                         VkSubpassDependency()
        srcSubpass                0
        dstSubpass                UINT32_MAX
        srcStageMask              VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
        dstStageMask              VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
        srcAccessMask             VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
        dstAccessMask             VK_ACCESS_MEMORY_READ_BIT
        dependencyFlags           VK_DEPENDENCY_BY_REGION_BIT
  pAllocator                      NULL
  RenderPass                      Render Pass 219
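
Since the draw time appears to scale with pixel count, it may help to put a rough number on the memory traffic that the load and store operations listed above could generate. The following is a back-of-the-envelope sketch only: it assumes uncompressed surfaces and that every LOAD and STORE touches the full 2240x1260 render area, which real drivers may well avoid. The struct and function names are hypothetical.

```cpp
#include <cstdint>

// One attachment as listed in the render pass dump above.
struct Attachment {
    std::uint32_t bytesPerTexel;  // 4 for R8G8B8A8_UNORM and D24_UNORM_S8_UINT
    std::uint32_t samples;        // 4 for the MSAA targets, 1 for the resolve target
    bool loaded;                  // loadOp is VK_ATTACHMENT_LOAD_OP_LOAD
    bool stored;                  // storeOp is VK_ATTACHMENT_STORE_OP_STORE
};

// Uncompressed size of one full pass over an attachment, in bytes.
constexpr std::uint64_t surfaceBytes(std::uint32_t width, std::uint32_t height,
                                     const Attachment& a) {
    return std::uint64_t{width} * height * a.bytesPerTexel * a.samples;
}

// Worst-case bytes moved by load and store operations for one render pass
// instance, under the assumptions stated above (no compression, full area).
constexpr std::uint64_t loadStoreBytes(std::uint32_t width, std::uint32_t height,
                                       const Attachment* atts, int count) {
    std::uint64_t total = 0;
    for (int i = 0; i < count; ++i) {
        const std::uint64_t size = surfaceBytes(width, height, atts[i]);
        if (atts[i].loaded) total += size;
        if (atts[i].stored) total += size;
    }
    return total;
}

// Attachments 0..2 from the dump: MSAA color, MSAA depth/stencil, resolve color.
constexpr Attachment kAttachments[3] = {
    {4, 4, true, true},
    {4, 4, true, true},
    {4, 1, true, true},
};
```

For the 2240x1260 render area this gives about 45 MB per pass over each 4x MSAA surface and about 203 MB in total for all loads and stores. Setting `loaded` to false for an attachment models the most that VK_ATTACHMENT_LOAD_OP_DONT_CARE could save when the previous contents are not needed. The estimate does not show that load/store traffic accounts for the measured 1.12 ms; it only indicates the scale involved.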

Desperado17 · August 16, 2022, 10:37am

Another draw call, which also draws a screen rect but without multisampling, still takes only about half the time in OpenGL that it takes in Vulkan (0.28 ms vs. 0.48 ms) according to Nsight and RenderDoc.

Desperado17 · August 18, 2022, 1:17pm

Can anyone tell me what the overhead of a push constant is in a shader that does not use it?

Desperado17 · August 18, 2022, 4:38pm

The problem is reproducible with the Diligent engine and the example:

https://github.com/DiligentGraphics/DiligentSamples/blob/master/Tutorials/Tutorial01_HelloTriangle/src/Tutorial01_HelloTriangle.cpp

The test app can be started in Vulkan mode without parameters and in OpenGL mode by appending “-mode GL” to the command line. Using RenderDoc, I get the following results from the performance counters:

In the Vulkan capture, events 14 and 18 clear the color and depth buffers, respectively.
In the OpenGL capture, events 7 and 12 clear the color and depth buffers, respectively.

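It can be observed that Vulkan takes considerably longer for the clears (second column from the left): about 2.9 times longer for the color clear and about 10.7 times longer for the depth clear.
For the less pixel-heavy draw calls that come afterwards, on the other hand, Vulkan wins.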

Vulkan:
14 116.128 0 0 0 0 0 0 0 0 0 0 0 0
18 119.68 0 0 0 0 0 0 0 0 0 0 0 0
23 27.424 3 1 0 1 1 280208 3 0 0 0 280208 0
34 8.384 303 101 0 101 101 8948 116 0 0 0 8948 0

OpenGL:
7 40.672 0 0 0 0 0 0 0 0 0 0 0 0
12 11.136 0 0 0 0 0 0 0 0 0 0 0 0
18 38.048 3 1 0 1 1 287847 3 0 0 0 287847 0
39 18.976 303 101 0 101 101 8948 116 0 0 0 8948 0
40 1.12 0 0 0 0 0 0 0 0 0 0 0 0

Nvidia Quadro M1000M Driver Version: 510.73.05
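
The timings above can be compared pairwise with a few lines of C++. The pairing of events is partly an assumption: the clears are matched as described in the post, while the two draws are matched by their identical values in the third column (3 and 303). The struct and function names are hypothetical, and the durations are copied from the second column without assuming a unit.

```cpp
// One matched pair of events from the two captures.
struct EventPair {
    const char* what;
    double vulkanDuration;  // second column of the Vulkan rows
    double openglDuration;  // second column of the OpenGL rows
};

// Pairing of the draws is an assumption based on matching third-column values.
constexpr EventPair kPairs[] = {
    {"clear color",             116.128, 40.672},  // Vulkan 14 vs OpenGL 7
    {"clear depth",             119.68,  11.136},  // Vulkan 18 vs OpenGL 12
    {"draw (3 in column 3)",     27.424, 38.048},  // Vulkan 23 vs OpenGL 18
    {"draw (303 in column 3)",    8.384, 18.976},  // Vulkan 34 vs OpenGL 39
};

// A ratio above 1 means the Vulkan event took longer than its OpenGL counterpart.
constexpr double vulkanOverOpenGL(const EventPair& p) {
    return p.vulkanDuration / p.openglDuration;
}
```

With these numbers the two clears come out at roughly 2.9 and 10.7, while the two draws come out at roughly 0.72 and 0.44, so the slowdown is confined to the full-surface clears in this sample.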
