Poor multithreading performance compared to DX12

Well, at least on the current NVIDIA driver.

I made a simple test program to gauge the performance of both APIs (and DX9 for comparison).
It runs two different tests in succession. The first draws 20,000 small quads to measure API call overhead, and the second draws a Julia set on a large, (somewhat) animated quad made of 125,000 triangles to test shader execution performance.

Those look like this:

Here are the source code and binaries for those interested.
[Src] https://drive.google.com/open?id=0BzeNJCHJJEyjUTZDTmF6andRZUE
[Bin] https://drive.google.com/open?id=0BzeNJCHJJEyjeDVURWlTaWVBNWM

You will probably need the Visual Studio 2015 redistributable package to run the .exe.
If you want to compile the project, you need Visual Studio 2015, LodePNG and the Vulkan SDK.
I also used VLD, but you can disable it by simply commenting out “#include <vld.h>” in WinMain.cpp.
(Sorry for the lack of active links, this board doesn’t allow me to post too many URLs.)

Anyway, for the Julia set rendering, the performance of both APIs is almost identical, as expected.
But that wasn’t the case for the heavy draw call test.

With multithreading off, both APIs show similar performance (about 300fps) on my system (i7 4770, GeForce GTX 980).
But with MT on, DX12 runs at 600fps while Vulkan stays at the same 300fps, with no performance gain whatsoever.

The catch is that even though both renderers ran at the same 300fps in ST, GPU usage was only 50% for DX12, while for Vulkan it was well over 90%.
DX12 runs at only 300fps in this setup because of a CPU bottleneck: the single thread is busy recording and submitting commands. Vulkan, on the other hand, was already GPU-bound, even though the shader workload is minimal.
Hence, as soon as the CPU bottleneck is alleviated by MT, DX12 shows a huge performance leap while Vulkan shows none.
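
The reasoning above can be put into a rough back-of-the-envelope model. This is only a sketch under assumptions I did not measure: CPU recording time divides evenly across threads, CPU and GPU work overlap fully, and the frame rate is set by whichever side takes longer. The name `estimate_fps` and its millisecond inputs are made up for illustration and are not part of the test program.

```cpp
#include <algorithm>
#include <cassert>

// Idealised frame-rate model (an assumption, not a measurement):
// a frame takes as long as the slower of the CPU side and the GPU side.
//   cpu_ms  : single-threaded CPU time to record and submit one frame
//   gpu_ms  : GPU time to execute one frame
//   threads : number of recording threads, assumed to scale perfectly
double estimate_fps(double cpu_ms, double gpu_ms, int threads) {
    assert(cpu_ms > 0.0 && gpu_ms > 0.0 && threads > 0);
    const double cpu_per_frame_ms = cpu_ms / threads;
    const double frame_ms = std::max(cpu_per_frame_ms, gpu_ms);
    return 1000.0 / frame_ms;
}
```

With a CPU-bound frame such as `estimate_fps(10.0, 5.0, 1)` the model gives 100 fps, and going to two threads doubles that to 200 fps, which is the kind of jump DX12 showed. With a GPU-bound frame such as `estimate_fps(9.0, 10.0, n)` the result stays at 100 fps for any `n`, which is the pattern I saw with Vulkan.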

I ran various setups (batch count, quad size, different shaders) and profilers to understand this situation.
My conclusion is this:
Vulkan can record and submit rendering commands very fast, even faster than DX12.
But for whatever reason, it imposes a heavier workload on the GPU than DX12 for each API call.

As a result, with MT off, if you artificially set up the test to be CPU-bound by increasing the batch count and reducing the quad size, Vulkan runs faster than DX12.
But if you make it more GPU-intensive by increasing the quad size or using a more complex pixel shader, DX12 quickly outperforms Vulkan.
With MT on, DX12 always runs faster than Vulkan, sometimes more than twice as fast.

Microsoft’s GPUView also shows different characteristics of drivers for both APIs.

The first trace shows DX12 drawing 8,000 batches, the second is Vulkan with the same setup.
In the “Hardware Queue” section, you can see small boxes stacked up.
Each of those boxes is a “command packet”, a stream of API commands that the driver sends to the hardware for execution.
Note the wide horizontal gaps in the DX12 trace. That is GPU idle time, and the Vulkan trace doesn’t have any.

The boxes are also arranged differently: in the Vulkan trace they are much smaller and far more numerous.
If you click one of those boxes, you can see basic information about that particular command packet.
Regardless of the batch count, a DX12 command packet is a uniform 32k bytes, while a Vulkan one is rather small and varies in size (~2044 bytes).

If this information is accurate, it means the DX12 driver batches commands into large uniform packets, while the Vulkan driver behaves somewhat differently.
Whatever it does differently from the DX12 driver, it doesn’t look very effective.
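
To get a feel for the scale, here is a small arithmetic sketch. It assumes that “32k” means 32768 bytes, that ~2044 bytes is a typical upper size for a Vulkan packet, and that both drivers emit the same total amount of command data per frame. None of that is confirmed by the trace, and `packets_needed` is a hypothetical helper, not anything from GPUView or the drivers.

```cpp
#include <cassert>
#include <cstdint>

// Number of packets needed to carry `total_bytes` of commands when each
// packet holds at most `packet_bytes` (integer division rounded up).
std::uint64_t packets_needed(std::uint64_t total_bytes, std::uint64_t packet_bytes) {
    assert(packet_bytes > 0);
    return (total_bytes + packet_bytes - 1) / packet_bytes;
}
```

For a hypothetical 1 MiB command stream, `packets_needed(1048576, 32768)` is 32 while `packets_needed(1048576, 2044)` is 514, so roughly 16 times as many packets. If each packet carries some fixed cost on the GPU side, that alone could add up, but this is a guess on my part.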

Honestly, I don’t understand why the drivers for the two APIs have to behave so differently, with such a significant performance gap, because to me the two APIs look damn close to each other.
Yes, this test is an extreme case, and real-world games won’t exhibit this much of a performance difference.
But the bottom line is that the GPU workload per API call is always higher in Vulkan than in DX12. And in today’s games, thousands of draw calls per frame are common.
The extra CPU overhead in DX12 can be mitigated by MT, but there’s no such option for the extra GPU overhead in Vulkan.

That’s somewhat disappointing for a developer who plans to build a new engine on Vulkan.
I’ll probably stick with Vulkan because of its multiplatform nature, and in my opinion it’s a slightly cleaner API than DX12.
So hopefully a future driver update will fix this issue.

I’d also like to know the situation on AMD GPUs.
So feel free to download the test program and leave some feedback.
Thank you.

I know there are some broken image links, but the board doesn’t allow me to fix them. :(
I contacted the admin and I’ll fix them when I can.

A few points to make:

  1. You don’t need to do CmdSetImageLayout at all. That can be folded into the renderpass by setting the initialLayout and finalLayout. Add a subpass dependency from EXTERNAL to 0 with the correct stage mask to ensure the semaphore gets waited on.

  2. Your Vulkan MT code splits each set of batches and then waits on all of them. However, you cannot resubmit a batch unless you create another set of secondary command buffers for that frame. Not an error in this case, but something to keep in mind.

  3. Did you check where most of the time is spent? Timing monolithic parts of your codebase doesn’t tell you much about what is actually slow.
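
To illustrate point 2, here is a minimal sketch of the bookkeeping only, with no Vulkan calls. It assumes each worker thread records a contiguous range of batches into its own secondary command buffer, and that each frame in flight gets its own set of those buffers. The names `split_batches` and `buffer_set_for_frame` are hypothetical and are not taken from the test program.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `batch_count` batches into `worker_count` contiguous [begin, end)
// ranges whose sizes differ by at most one. Each range would be recorded
// by one worker into its own secondary command buffer.
std::vector<std::pair<std::size_t, std::size_t>>
split_batches(std::size_t batch_count, std::size_t worker_count) {
    assert(worker_count > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(worker_count);
    const std::size_t base = batch_count / worker_count;
    const std::size_t extra = batch_count % worker_count;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < worker_count; ++i) {
        // The first `extra` workers take one additional batch.
        const std::size_t size = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}

// Pick which set of secondary command buffers a frame should record into,
// cycling through `sets_in_flight` sets so that consecutive frames never
// share a set.
std::size_t buffer_set_for_frame(std::size_t frame_index, std::size_t sets_in_flight) {
    assert(sets_in_flight > 0);
    return frame_index % sets_in_flight;
}
```

For example, `split_batches(10, 4)` yields the ranges [0, 3), [3, 6), [6, 8) and [8, 10). Whether two or three buffer sets are enough depends on how many frames the application keeps in flight, which I am only assuming here.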

  1. Yep, I know that setting initialLayout/finalLayout in the attachment description can do the same thing, but I just decided not to use it. I think I assumed that specifying the initial layout as “undefined” might be faster, and doing that in the attachment description gave me an error from the debug layers.
    I haven’t tried a subpass dependency yet, but anyway CmdSetImageLayout() is only called twice per frame, so it should have minimal impact in this case.

  2. I don’t quite follow you here. I know I cannot resubmit a command buffer unless it is specified with the “VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT” flag. Is that what you are trying to say? If so, that’s pretty much what I intended.

  3. I checked CPU time with profilers, and according to them recording commands in Vulkan is a lot faster than in DX12, while submitting them is somewhat slower. Overall, CPU time for one frame is slightly lower in Vulkan than in DX12. But note that it’s GPU overhead in Vulkan I’m talking about, not CPU overhead.

What I mean is that you can only call Draw once per frame when multithreading is enabled, even with different batches.

Oh, I see.
I understand it can look weird because Draw() is called in both QuadPool and JuliaSet.
Still, my intention was that it should be called once a frame. I just messed it up a little while amending the code many times.

I learned that I can’t edit a post after 15 minutes.
So I’m uploading the broken images here.

Those look like this:

Microsoft’s GPUView also shows different characteristics of drivers for both APIs.

Regardless of the batch count, a DX12 command packet is a uniform 32k bytes, while a Vulkan one is rather small and varies in size (~2044 bytes).

Today I ran the same test again after a long time.
And whoa, Vulkan performs much faster than before.
It runs at 700fps in multithreading mode on my PC, which is even faster than D3D12.

For those interested, this is a simple benchmark program mainly focused on testing how each API performs when drawing a large number of batches.
The images are gone, but you can still download the src/bin from the links above.

At the time, Vulkan was significantly slower than D3D12 in MT mode, but apparently not anymore.
I’ve run this test from time to time, so I think the performance boost happened not very long ago.
NVIDIA finally did something right with their driver!