vkQueuePresentKHR blocks

Buxxy11 · September 30, 2018, 12:58am

Hi,

I have been timing portions of my code as part of an attempt to get a better grasp of how the presentation engine behaves. The code I’m using looks something like this:


// imageCount==2 for FIFO, 3 for Mailbox
// minImageCount==2
uint32_t idx;
acquiredImageAvailableSemaphore = device.createSemaphoreUnique({});
device.acquireNextImageKHR(*swapchain, timeout_infinite, *acquiredImageAvailableSemaphore, {}, &idx);
imageAvailableSemaphores[idx].swap(acquiredImageAvailableSemaphore);

device.waitForFences(1, &*presentationBufferExecutionFences[idx], VK_TRUE, timeout_infinite);
device.resetFences(1, &*presentationBufferExecutionFences[idx]);

vk::CommandBuffer& cb = *presentationCommandBuffers[idx];
cb.begin(&beginInfo);
cb.beginRenderPass(&renderPassInfo, vk::SubpassContents::eInline);

// I don't actually record any commands here at the moment

cb.endRenderPass();
cb.end();

vk::SubmitInfo submitInfo = {};
const vk::PipelineStageFlags waitStage = { vk::PipelineStageFlagBits::eColorAttachmentOutput };
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &*imageAvailableSemaphores[idx]; // dereference the unique handle to get the vk::Semaphore
submitInfo.pWaitDstStageMask = &waitStage;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &cb;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &presentWaitSemaphores[idx];
graphicsQueue.submit(1, &submitInfo, *presentationBufferExecutionFences[idx]);

vk::PresentInfoKHR presentInfo = {};
presentInfo.waitSemaphoreCount = 1;
presentInfo.pWaitSemaphores = &presentWaitSemaphores[idx];
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &*swapchain;
presentInfo.pImageIndices = &idx;
presentQueue.presentKHR(&presentInfo);

The timings I get with mailbox look like this ([milliseconds::microseconds], release, no validation layers):


[ 5089:: 65] > acquiring image
[ 5089:: 72] > acquired image: 0
[ 5089:: 78] > waitForFences start
[ 5089:: 80] > waitForFences end
[ 5089:: 85] > submit
[ 5089::137] > presentKHR
[ 5089::300] > end

[ 5089::323] > acquiring image
[ 5089::330] > acquired image: 1
[ 5089::335] > waitForFences start
[ 5089::336] > waitForFences end
[ 5089::341] > submit
[ 5089::396] > presentKHR
[ 5089::532] > end

[ 5089::536] > acquiring image
[ 5089::558] > acquired image: 2
[ 5089::563] > waitForFences start
[ 5089::565] > waitForFences end
[ 5089::569] > submit
[ 5089::603] > presentKHR
[ 5089::705] > end

[ 5089::710] > acquiring image
[ 5089::715] > acquired image: 0
[ 5089::734] > waitForFences start
[ 5089::736] > waitForFences end
[ 5089::740] > submit
[ 5089::788] > presentKHR
[ 5089::957] > end

...

There are some things I’m wondering about:

The acquired images are always in consecutive order [0, 1, 2, 0, 1, 2, etc], though I would expect the presentation engine to be presenting one of them, resulting in something like [0, 1, 2, 1, 2, 1, 0, 2, 0, 2]. I guess the presentation engine works a bit differently internally and makes a copy of the relevant data?
Submit takes a bit of time, this makes sense. PresentKHR takes significantly more time. Is this normal?
Am I handling the semaphores correctly?

However, the really odd part was when I used the FIFO present mode. I expected vkAcquireNextImageKHR to block, but what I got instead was this:


[ 7305:: 69] > acquiring image
[ 7305:: 84] > acquired image: 1
[ 7305:: 92] > waitForFences start
[ 7305:: 94] > waitForFences end
[ 7305::106] > submit
[ 7305::166] > presentKHR
[ 7321::533] > end

[ 7321::553] > acquiring image
[ 7321::583] > acquired image: 0
[ 7321::604] > waitForFences start
[ 7321::607] > waitForFences end
[ 7321::620] > submit
[ 7321::676] > presentKHR
[ 7338::135] > end

...

As you can see, acquiring the image is instantaneous. Instead, vkQueuePresentKHR seems to be the synchronization point for my code. Why? Am I doing something wrong? Is this expected (undocumented?) behaviour?
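To put numbers on the logs above, here is a minimal sketch that converts the [milliseconds::microseconds] stamps into elapsed microseconds. The `Stamp` type and the helper names are made up for illustration; only the timestamps come from the logs.

```cpp
#include <cassert>
#include <cstdint>

// One log timestamp in the "[milliseconds::microseconds]" format used above.
// Hypothetical helper type, not part of the original code.
struct Stamp {
    std::int64_t ms;
    std::int64_t us;
};

// Flatten a timestamp into a single microsecond count.
inline std::int64_t toMicros(Stamp s) {
    assert(s.us >= 0 && s.us < 1000);
    return s.ms * 1000 + s.us;
}

// Elapsed microseconds between two log lines.
inline std::int64_t elapsedMicros(Stamp begin, Stamp end) {
    return toMicros(end) - toMicros(begin);
}

// Time spent between the "presentKHR" and "end" lines of each frame.
inline std::int64_t fifoFrame1()    { return elapsedMicros({7305, 166}, {7321, 533}); }
inline std::int64_t fifoFrame2()    { return elapsedMicros({7321, 676}, {7338, 135}); }
inline std::int64_t mailboxFrame1() { return elapsedMicros({5089, 137}, {5089, 300}); }
```

With the stamps above, the two FIFO frames spend 16367 and 16459 microseconds inside presentKHR, while the first mailbox frame spends 163 microseconds. Roughly 16.4 ms is close to the 16.7 ms period of a 60 Hz display, which is an assumed refresh rate, since none is stated here.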

I’m using a G-SYNC compatible laptop with a GTX 980M. The drivers are approximately one week old and G-SYNC is disabled in the NVIDIA control panel.

Any help and advice is appreciated (relevant to the topic or not)!

Best,

krOoze · October 1, 2018, 5:15pm

What does the imageAvailableSemaphores[idx].swap(acquiredImageAvailableSemaphore); do?
What’s the purpose of presentationBufferExecutionFences; what signals it?
What OS and Compositor is this on?

Generally yes, it is a valid choice for the presentation engine to copy out the image. E.g. the DRI3/Present extension says:

When the X server has finished using ‘pixmap’ for this
operation, it will send a PresentIdleNotify event and arrange
for any ‘idle-fence’ to be triggered. This may be at any time
following the PresentPixmap request – the contents may be
immediately copied to another buffer, copied just in time for
the vblank interrupt or the pixmap may be used directly for
display (in which case it will be busy until some future
PresentPixmap operation).

Buxxy11 · October 2, 2018, 1:40pm

It swaps the two handles stored in the referenced UniqueHandles. Because I don’t know which image comes next, I wanted to make sure I didn’t use any semaphore that may still be in use, so I swap them after I obtain the next image index.

They are the fences used with command buffer submission.

Windows 10 with Visual Studio 2017 (I assumed that’s what you meant by compositor). I am currently using Vulkan 1.1.70, as the latest version a week or so back had some problems in vulkan.hpp. I obtained the surface I am presenting to using GLFW.

krOoze · October 3, 2018, 6:17am

I think that is not necessary. Apparently you use vkAcquire and vkPresent in discrete 1:1 pairs. But can’t hurt to be paranoid…
Although, I assume the swapped out semaphore is destroyed at the end of scope, so you are assuming it is not used at that point anyway.

Oh, right. I missed the fence being referenced in the submit command. You apparently need to do that, because you are re-recording the cmdbuffer in the render loop.
You are waiting on it before signal though; I assume it was created pre-signaled?

By Compositor I mean whether you e.g. use Wayland, or X. On Windows it does not matter; there is only one available.

On AMD it does indeed block for me on vkAcquire in FIFO.
Apparently it is a known behavior of NVIDIA: Problems with VK_KHR_swapchain - Vulkan - NVIDIA Developer Forums
You could try to create the Swapchain with one extra image. Your fence should already make sure you do not queue more than one Present at a time.
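A minimal sketch of the "one extra image" suggestion, assuming the usual approach of asking for minImageCount + 1 and clamping to maxImageCount. `SurfaceImageLimits` is a hypothetical stand-in for the two relevant fields of VkSurfaceCapabilitiesKHR so the snippet compiles without the Vulkan headers, and `chooseImageCount` is a made-up helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Stand-in for the two VkSurfaceCapabilitiesKHR fields used here
// (hypothetical type, so the sketch does not need vulkan.h).
struct SurfaceImageLimits {
    std::uint32_t minImageCount;
    std::uint32_t maxImageCount; // 0 means "no upper limit" in Vulkan
};

// Request `extra` images beyond the minimum, clamped to the surface's maximum.
inline std::uint32_t chooseImageCount(SurfaceImageLimits caps, std::uint32_t extra = 1) {
    assert(caps.minImageCount > 0);
    std::uint32_t wanted = caps.minImageCount + extra;
    if (caps.maxImageCount != 0) {
        wanted = std::min(wanted, caps.maxImageCount);
    }
    return wanted;
}
```

The result would go into the minImageCount field of the swapchain create info. The implementation may still create more images than requested, so the actual count should be queried after creation.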

Buxxy11 · October 3, 2018, 1:50pm

The swapped out semaphore is actually kept until the next vkAcquireNextImageKHR call, when it is swapped with a new one. I omit the creation of the semaphore every frame at this point, as the behaviour is the same anyway.

I figured that if I don’t know what the next image is, I don’t want to accidentally give the vkAcquireNextImageKHR call a semaphore that is in use, thus I have one spare that I swap out. I was under the impression the semaphore is signaled when the image is ready to be presented to screen, which may not be when it is acquired or submitted. Only when I acquire an image will I be certain that it is not still in use, as I assume the acquire will not give me the same image twice without presenting it on screen (and thus signalling the semaphore).
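The spare-semaphore bookkeeping described here can be sketched roughly as follows. This is only a model of the ownership shuffle: `UniqueSemaphore` is a stand-in built on std::unique_ptr<int> rather than the real vk::UniqueSemaphore, and `SemaphorePool` with its member names is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for vk::UniqueSemaphore: an owning handle identified by an int.
using UniqueSemaphore = std::unique_ptr<int>;

// Keeps one semaphore per swapchain image plus one spare (imageCount + 1 total).
struct SemaphorePool {
    std::vector<UniqueSemaphore> perImage; // slot i belongs to swapchain image i
    UniqueSemaphore spare;                 // handed to the next acquire call

    explicit SemaphorePool(std::size_t imageCount) {
        for (std::size_t i = 0; i < imageCount; ++i) {
            perImage.push_back(UniqueSemaphore(new int(static_cast<int>(i))));
        }
        spare = UniqueSemaphore(new int(static_cast<int>(imageCount)));
    }

    // Handle to pass to the acquire call; the image index is not known yet.
    int* semaphoreForAcquire() const { return spare.get(); }

    // Call once the acquire has returned image index idx: the semaphore just
    // used moves into that image's slot, and the slot's old one becomes the spare.
    void onAcquired(std::size_t idx) {
        assert(idx < perImage.size());
        perImage[idx].swap(spare);
    }
};
```

In a real loop, `semaphoreForAcquire()` would correspond to the semaphore passed to acquireNextImageKHR and `onAcquired(idx)` to the swap that follows it. Whether the old per-image semaphore is really safe to reuse as the spare depends on the synchronization questions discussed in this thread.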

And indeed, the fences are created pre-signaled.

Interesting that it works as expected on AMD. For me, vkQueuePresentKHR blocks even if I use more images (I tried 3 and 5). I have also tried calling acquireNextImageKHR with a fence or changing the timeout, but it does not change anything. vkAcquireNextImageKHR always returns VK_SUCCESS without blocking and vkQueuePresentKHR always blocks.

JiuShei · November 4, 2018, 5:08am

I see the same behavior on Windows 10 with an NVIDIA 1060 when using VK_PRESENT_MODE_FIFO_KHR: vkQueuePresentKHR blocks the calling thread for a delay equal to the display refresh interval. We can assume that this is normal behavior, but how can I determine whether the queue is free or busy? As it stands, I cannot externally synchronize the queue. In practice, the queue is not really blocked for the whole display refresh interval. How can I determine whether the queue is busy when using vkQueuePresentKHR with VK_PRESENT_MODE_FIFO_KHR?

Aasimon · February 25, 2019, 3:07am

What timings do you get if you increase the number of images in the swap chain? (limits allowing) Do the later function calls start to block?

Would an async fence check on the present, or a later semaphored commit, allow you to check for a free queue?

krOoze · February 25, 2019, 6:38am

BTW, if I am not mistaken, the DirectX swapchain behaves that way too, which may be the reason they chose to implement it this way…