Radeon performance dropped on recent driver

mgoodfel · August 10, 2016, 12:30am

I’ve been messing with the Vulkan API and wrote a simple test program based on the Demos/cube.c code. On my Radeon 270X, I was getting 1500 fps (32K instanced cubes.) I was using the Apr 19 beta driver.

I just upgraded to the official 16.7.3 driver and although the demo still runs, it only gets 500 fps. A friend tried it on his Radeon 290, and it only gets 100 fps on his (faster) machine! So something is seriously wrong here.

It all works fine on my NVidia production drivers, with no change from the beta. Is this a known issue with the Radeon driver?

Thanks.
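For scale, the frame rates quoted above can be converted to time per frame, which is usually easier to reason about than fps. A minimal sketch; frame_time_ms and added_ms_per_frame are made-up helper names for illustration, not part of any API:

```cpp
#include <cassert>

// Time spent on one frame, in milliseconds, for a given frame rate.
// (Hypothetical helper, named here only for illustration.)
double frame_time_ms(double fps) {
    assert(fps > 0.0);  // a frame rate of zero has no finite frame time
    return 1000.0 / fps;
}

// Extra milliseconds per frame after a drop from fps_before to fps_after.
double added_ms_per_frame(double fps_before, double fps_after) {
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

With the numbers from the post, 1500 fps is about 0.67 ms per frame, 500 fps is 2 ms, and 100 fps is 10 ms, so the driver upgrade added roughly 1.3 ms to every frame on the 270X.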

Salabar · August 10, 2016, 1:51am

This could be either a regression, which you should report on AMD’s website, or a side effect of the cube demo taking a couple of shortcuts for clarity (e.g. I see vkQueueWaitIdle in a very old SDK; perhaps it is fixed by now, perhaps not). If the latter is the case, it’s not really a problem, as the change should not affect real Vulkan code.

mgoodfel · August 10, 2016, 8:19am

First, Demos/cube.c still uses vkQueueWaitIdle and even vkDeviceWaitIdle in places. It would be nice to have a demo with a good main loop!

Second, I didn’t use either of these. I’m using fences and semaphores.

Third, we have a binary here that runs 3 times slower on the newer driver on the 270X, and 15 times slower on the 290. Can you suggest any feature that might slow things down that much? There must be other people developing on Radeon. I can post on their dev site, but I hate to just say “my code runs slow” without any details.

Thanks.

Sascha_Willems · August 10, 2016, 10:02am

I just plugged in an R9 390 and compared the last public driver with the most recent one (16.7.3) and could not see any performance degradation with my examples. The compute ones seem to be a tad faster with the recent drivers, and all other demos show the same performance as with the earlier drivers.

mgoodfel · August 10, 2016, 11:06am

Thanks! So what could slow down this code? My draw loop looks like this:

vkWaitForFences(m_device, 1, &m_drawFence, VK_TRUE, UINT64_MAX);
vkResetFences(m_device, 1, &m_drawFence);

AcquireNextImageKHR(m_device, m_swapChain, UINT64_MAX, m_presentComplete,
(VkFence) VK_NULL_HANDLE, &m_currentBuffer);

VkSubmitInfo submitInfo;
memset(&submitInfo, 0, sizeof(submitInfo));
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.pNext = NULL;
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &m_presentComplete;
VkPipelineStageFlags waitFlags = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
submitInfo.pWaitDstStageMask = &waitFlags;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &m_buffers[m_currentBuffer].cmd;

vkQueueSubmit(m_queue, 1, &submitInfo, m_drawFence);

VkPresentInfoKHR present;
memset(&present, 0, sizeof(present));
present.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present.pNext = NULL;
present.swapchainCount = 1;
present.pSwapchains = &m_swapChain;
present.pImageIndices = &m_currentBuffer;

QueuePresentKHR(m_queue, &present);

I’m still learning the API, so this probably looks sad. I’m just kicking out the same queue over and over. Am I using the fences or present semaphore incorrectly?

Thanks.

Sascha_Willems · August 10, 2016, 11:22am

Hard to tell from that small part of the code. If you have a performance problem it’s probably caused by something outside of that code excerpt. One thing I’m missing in your code above is a signal semaphore for the queue presentation, but I don’t think that’s the problem.

Did you try profiling your application with e.g. CodeXL to find out what could cause this problem?

mgoodfel · August 10, 2016, 11:30am

AcquireNextImageKHR signals the semaphore given as argument, right? This is really the entire inner loop. I just draw some cubes over and over.

I’ll look at it with CodeXL. I guess I’ll have to install the beta driver that’s fast, profile that, then install the new one again and profile it. I was hoping someone would know what the problem is!

Thanks.

krOoze · August 10, 2016, 4:15pm

Your waitFlags value is bad (with BOTTOM_OF_PIPE there is effectively no waiting).

I don’t see a semaphore between the submit (pSignalSemaphores) and the present (pWaitSemaphores).

mgoodfel · August 11, 2016, 12:20am

Thanks. Can anyone point me to code with an efficient inner loop (no QueueWait calls) that’s correct?

Sascha_Willems · August 11, 2016, 12:33am

There is nothing fundamentally wrong with your code (which seems to have no queue waits).

But as krOoze said, you should use a different flag for the wait stages. Try VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT instead of VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT (just realized that I don’t use that in my examples, gotta change that) and also add a signal semaphore to your submit info that you use as wait on your present info.

krOoze · August 11, 2016, 7:42am

@mgoodfel Well, I offer mine: krOoze/Hello_Triangle on GitHub, a Hello World-like demo for the Vulkan API. It should show the proper render loop for a Hello Triangle style app.

mgoodfel · August 14, 2016, 5:47am

krOoze, somehow I didn’t get notified about your message, so I put together my own version. It looks about the same as yours. I notice that in your code there are no VkImageMemoryBarrier calls to change the layout of the frame buffer. Are those optional?

Anyway, I still have the same problem. Works fine on NVidia, and performance is terrible on Radeon. Any idea what the problem can be?

Here’s my latest loop and the command buffer.

// Get the index of the next available swapchain image:
err = AcquireNextImageKHR(m_device, m_swapChain, UINT64_MAX, m_acquireComplete,
(VkFence) VK_NULL_HANDLE,
&m_currentBuffer);

VkSubmitInfo submitInfo;
memset(&submitInfo, 0, sizeof(submitInfo));
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.pNext = NULL;
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &m_acquireComplete;
VkPipelineStageFlags waitFlags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
submitInfo.pWaitDstStageMask = &waitFlags;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &m_submitComplete;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &m_buffers[m_currentBuffer].cmd;

err = vkQueueSubmit(m_queue, 1, &submitInfo, (VkFence) VK_NULL_HANDLE);
assert(!err);

VkPresentInfoKHR presentInfo;
memset(&presentInfo, 0, sizeof(presentInfo));
presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
presentInfo.pNext = NULL;
presentInfo.waitSemaphoreCount = 1;
presentInfo.pWaitSemaphores = &m_submitComplete;
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &m_swapChain;
presentInfo.pImageIndices = &m_currentBuffer;

err = QueuePresentKHR(m_queue, &presentInfo);

The command buffer looks like this (I’ve removed some of the structure elements for brevity):

VkImageMemoryBarrier barrier;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
barrier.newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

vkCmdPipelineBarrier(cmdBuf, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
0, 0, NULL, 0, NULL, 1, &barrier);

// begin render pass
vkCmdBeginRenderPass(cmdBuf, &passBegin, VK_SUBPASS_CONTENTS_INLINE);

vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS, m_pipeline);
vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS,
m_pipelineLayout, 0, 1, &m_descSet, 0, NULL);

// set viewport and scissor
vkCmdSetViewport(cmdBuf, 0, 1, &viewport);
vkCmdSetScissor(cmdBuf, 0, 1, &scissor);

VkDeviceSize offsets[1] = {0};
vkCmdBindVertexBuffers(cmdBuf, VERTEX_BUFFER_BIND_ID, 1, &m_vertices.buf, offsets);
vkCmdBindVertexBuffers(cmdBuf, INSTANCE_BUFFER_BIND_ID, 1, &m_instances.buf, offsets);
vkCmdBindIndexBuffer(cmdBuf, m_indexes.buf, 0, VK_INDEX_TYPE_UINT32);

vkCmdDrawIndexed(cmdBuf, 6 * 6, m_cubeCount * m_cubeCount * m_cubeCount, 0, 0, 0);
vkCmdEndRenderPass(cmdBuf);

VkImageMemoryBarrier barrier;
barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

barrier.image = m_buffers[m_currentBuffer].image;

vkCmdPipelineBarrier(cmdBuf, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0, 0, NULL, 0,
NULL, 1, &barrier);

So what is wrong with this??

krOoze · August 15, 2016, 5:30am

Use a CODE block (in advanced mode) for code listings. It’s horrible to read in plain text.

Barriers are not “optional”. There are just three programming elements that do the same thing: barriers, events and subpasses. I used subpasses (only).

You should mostly use subpasses, because you must use them anyway to draw anything. In the early examples I often saw people supply a fake dependency and fake layout transitions to the subpass and then go on to use barriers instead. But why not use the subpass you already have for that?

krOoze · August 15, 2016, 5:46am

Re the code: your barriers are unnecessarily strict. Use the clearer ALL_COMMANDS when that is the meaning you want, instead of TOP or BOTTOM. Oh my, how large is m_cubeCount? Well, it should work though; by eye I don’t see anything functionally wrong with it.
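To put a number on the workload, here is a small sketch of the arithmetic behind the draw call, assuming the instance count is m_cubeCount cubed and each cube uses the 6 * 6 indices seen in the listing. The value 32 used below is only an inference from the “32K instanced cubes” mentioned at the top of the thread; DrawWorkload and cube_grid_workload are made-up names.

```cpp
#include <cassert>
#include <cstdint>

// Per-frame totals for one instanced indexed draw (hypothetical struct).
struct DrawWorkload {
    std::uint64_t instances;
    std::uint64_t indices;
    std::uint64_t triangles;
};

// cubes_per_side stands in for m_cubeCount; the grid is assumed to be cubic.
DrawWorkload cube_grid_workload(std::uint64_t cubes_per_side) {
    assert(cubes_per_side > 0);
    const std::uint64_t indices_per_cube = 6 * 6;  // 6 faces, 2 triangles each, 3 indices per triangle
    DrawWorkload w;
    w.instances = cubes_per_side * cubes_per_side * cubes_per_side;
    w.indices = w.instances * indices_per_cube;
    w.triangles = w.indices / 3;
    return w;
}
```

For cubes_per_side = 32 this gives 32768 instances, 1179648 indices and 393216 triangles per frame.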

I have AMD. Any chance to have the full project code, to try?

mgoodfel · August 15, 2016, 7:51am

You can find it at http://sea-of-memes.com/misc/vulkan-test-2016-08-13.zip

There’s a compiled version at the top (windows), and the VS 2015 project file is in the Vulkan/BuildWin directory. Excuse the mess – I’ve been working through the Demos/cube.c code and gradually putting together my own test cases. In the Vulkan/Source directory, you want TestCube.cpp and TestCube.h.

But… this may all be a wild goose chase. A friend installed the latest Radeon driver, since it said it supported Vulkan games, and saw his performance drop from 2000 fps or something down to 100 fps. He blamed my code, since I clearly didn’t have much experience with Vulkan. I upgraded the drivers on my 270X and saw the same slowdown, although not as much.

It didn’t occur to me to try some other Vulkan code, but I did that today. Sascha Willems’ demos are also 3-5 times slower on the Radeon 270X than on my NVidia 1060. The benchmarks out there say the 1060 is something like 88% faster, so this is wildly wrong.
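To make the mismatch concrete, a quick sketch of the arithmetic: an “N% faster” figure can be converted into the fraction of the faster card’s frame rate that the slower card should reach. slower_card_fraction is a hypothetical helper name, and the 88% figure is just the one quoted above.

```cpp
#include <cassert>

// Given that card A is `percent_faster` percent faster than card B, return
// B's frame rate as a fraction of A's. (Hypothetical helper for illustration.)
double slower_card_fraction(double percent_faster) {
    assert(percent_faster > -100.0);  // -100% or less would make no sense here
    return 1.0 / (1.0 + percent_faster / 100.0);
}
```

An 88% lead means the slower card should run at about 53% of the faster card’s frame rate, i.e. a bit under 2 times slower. Being 3 to 5 times slower corresponds to only 20% to 33%.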

I tried backing out the new display driver, but can’t seem to get back to the old performance, even with the old code. Radeon just doesn’t completely uninstall for some reason.

Anyway, I will continue to try and bring this code up to some kind of standard that I can use as a template. Any advice is welcome. But it doesn’t look like the problem is just with my code (although it was wrong.)

Thanks.