I am currently porting a CUDA algorithm to Vulkan.
Functionally, the port is complete - given the same input, the Vulkan version produces exactly the same output as the CUDA version.
The problem is performance - Vulkan is about 2-4 times slower than CUDA, depending on the NVIDIA GPU I run the tests on.
I am aware that there may be many reasons why Vulkan is slower, but first I would like to focus on a single symptom that may point to the root cause of the problem.
The point is that the CUDA kernels used by the algorithm require __syncwarp() calls in certain places - if I comment these calls out, the algorithm stops working and hangs.
I rewrote the CUDA kernels in HLSL and use dxc to compile the HLSL to SPIR-V so the shaders can be used by Vulkan. However, I haven't added any __syncwarp() counterparts to the HLSL code yet (because I haven't figured out how to do that).
BUT this doesn't prevent the Vulkan version from working correctly! It looks like the driver runs Vulkan shaders in a mode that makes those syncs unnecessary. If that is indeed the case, it seems obvious that such a mode would also hurt performance.
Any idea how to make CUDA and Vulkan behave equivalently with regard to syncing?
As you probably know from writing CUDA, good performance with GPU kernel code requires deliberate intent. It's not forgiving like the CPU. I'd suggest pulling out Nsight Systems, Nsight Compute, and/or Nsight Graphics and digging into what's performing differently between the two. With a 2X-4X time difference, you should be able to see where the extra time is coming from.
Don't just look at how long your kernels run on the GPU, but also at how much time is spent mapping/unmapping resources and getting the driver into a state where it can actually execute your kernels. Yes, you can see this kind of thing in Nsight Systems.
But before you spend any time optimizing, I'd first make sure that both the CUDA and the Vulkan implementations are 100% correct. For instance, your comment about simply removing the sync calls from the kernels concerns me, unless you absolutely know they aren't required ("seems to work" doesn't mean you're not violating the spec and won't experience misbehavior down the road).
Also, re the Nsight tools: if you haven't already, be sure to use NVTX markup to give context to what's going on in your application (e.g. nvtxNameOsThread(), nvtxRangePush(), nvtxRangePop(), etc.). That can accelerate the process of mapping your code to the thread timing diagrams you'll see in Nsight Systems.
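For illustration, here is a minimal sketch of what that markup could look like around one of your work segments. The range names and the helper functions (uploadInput(), runKernels(), readbackOutput()) are placeholders; nvtxRangePushA()/nvtxRangePop() are the actual NVTX C API calls:

```cpp
#include <nvtx3/nvToolsExt.h>  // NVTX v3 header (ships with recent CUDA toolkits); older toolkits use <nvToolsExt.h>

// Hypothetical stand-ins for the real per-segment work.
static void uploadInput()    { /* map/copy input buffers, CPU -> GPU */ }
static void runKernels()     { /* launch/dispatch the two compute kernels */ }
static void readbackOutput() { /* copy results back, GPU -> CPU */ }

void runSegmentWithMarkers()
{
    nvtxRangePushA("segment");    // outer range: one whole work segment

    nvtxRangePushA("upload");
    uploadInput();
    nvtxRangePop();

    nvtxRangePushA("compute");
    runKernels();
    nvtxRangePop();

    nvtxRangePushA("readback");
    readbackOutput();
    nvtxRangePop();

    nvtxRangePop();               // closes "segment"
}
```

The named ranges then show up on the corresponding thread rows in Nsight Systems, so you can line up your CPU-side code with the GPU activity underneath it.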
I was able to significantly improve the performance of the Vulkan variant - it is still not as fast as CUDA, but it is much faster than it originally was. Let me explain.
My algorithm is based on a number of work segments. Every segment looks like this:
send data CPU → GPU
run two compute kernels
read data GPU → CPU
I assumed I would get a performance benefit by running both transfers on a separate transfer queue (a queue with only the transfer bit set) and the compute work on a compute queue.
So I used both queues to execute a segment as described above. I synced the two queues using timeline semaphores (in-GPU syncing), with a single wait on the CPU at the very end of the segment (also using a timeline semaphore).
Using two queues of course means more submit operations. I am also aware that submits can be expensive to perform. I assumed that wouldn't be a problem, since I created a separate thread for each queue to perform the submits asynchronously to the main thread.
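For reference, a minimal sketch of that two-queue scheme (assuming the Vulkan 1.2 timeline semaphore feature is enabled; the device, queues, command buffers and the timeline semaphore are assumed to already exist, and error handling plus queue-family ownership transfers are omitted):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Submit one command buffer that waits for `waitValue` and signals `signalValue`
// on the same timeline semaphore.
void submitWithTimeline(VkQueue queue, VkCommandBuffer cmd, VkSemaphore timeline,
                        uint64_t waitValue, uint64_t signalValue,
                        VkPipelineStageFlags waitStage)
{
    VkTimelineSemaphoreSubmitInfo timelineInfo{};
    timelineInfo.sType                     = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
    timelineInfo.waitSemaphoreValueCount   = 1;
    timelineInfo.pWaitSemaphoreValues      = &waitValue;
    timelineInfo.signalSemaphoreValueCount = 1;
    timelineInfo.pSignalSemaphoreValues    = &signalValue;

    VkSubmitInfo submit{};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.pNext                = &timelineInfo;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &timeline;
    submit.pWaitDstStageMask    = &waitStage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &timeline;
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}

// One work segment: upload (transfer queue) -> compute (compute queue)
// -> readback (transfer queue), chained on a single timeline semaphore,
// with one CPU-side wait at the very end. `base` is the timeline value
// reached by the previous segment.
void runSegment(VkDevice device, VkQueue transferQueue, VkQueue computeQueue,
                VkCommandBuffer uploadCmd, VkCommandBuffer computeCmd,
                VkCommandBuffer readbackCmd, VkSemaphore timeline, uint64_t base)
{
    submitWithTimeline(transferQueue, uploadCmd,   timeline, base,     base + 1,
                       VK_PIPELINE_STAGE_TRANSFER_BIT);
    submitWithTimeline(computeQueue,  computeCmd,  timeline, base + 1, base + 2,
                       VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    submitWithTimeline(transferQueue, readbackCmd, timeline, base + 2, base + 3,
                       VK_PIPELINE_STAGE_TRANSFER_BIT);

    // Single CPU-side wait for the whole segment.
    uint64_t waitValue = base + 3;
    VkSemaphoreWaitInfo waitInfo{};
    waitInfo.sType          = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
    waitInfo.semaphoreCount = 1;
    waitInfo.pSemaphores    = &timeline;
    waitInfo.pValues        = &waitValue;
    vkWaitSemaphores(device, &waitInfo, UINT64_MAX);
}
```

Each segment advances the timeline by three values (upload, compute, readback), and the vkWaitSemaphores() call at the end is the only place the CPU blocks.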
That's the theory. And this is how it looked when I inspected the execution of a single segment in Nsight Graphics:
Green - memory transfers
Orange/yellow - compute execution
Red - waiting on semaphores
If I am interpreting the graph correctly, it looks like waiting on the semaphores alone introduced significant overhead. Remember that this is 100% in-GPU syncing - no CPU is involved after the workload has been submitted to the GPU. I was shocked to see how large the overhead was.
I reworked the implementation to use a single queue and a single submit per work segment. No cross-queue syncing is needed now - only pipeline barriers within the one queue. It is much faster now.
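Roughly, a recorded segment now looks like this. This is only a sketch: buffer handles, copy regions and dispatch sizes are placeholders, the command buffer is assumed to already be in the recording state, and pipeline/descriptor binding is assumed to happen elsewhere:

```cpp
#include <vulkan/vulkan.h>

void recordSegment(VkCommandBuffer cmd, VkBuffer stagingIn, VkBuffer inputBuf,
                   VkBuffer outputBuf, VkBuffer stagingOut,
                   VkBufferCopy uploadRegion, VkBufferCopy readbackRegion)
{
    // 1) Upload: staging -> device-local input buffer.
    vkCmdCopyBuffer(cmd, stagingIn, inputBuf, 1, &uploadRegion);

    // Make the copied data visible to the compute shaders.
    VkMemoryBarrier toCompute{};
    toCompute.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    toCompute.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    toCompute.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                         1, &toCompute, 0, nullptr, 0, nullptr);

    // 2) The two compute kernels.
    vkCmdDispatch(cmd, 256, 1, 1);

    // The first kernel's writes must be visible to the second kernel.
    VkMemoryBarrier betweenKernels{};
    betweenKernels.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    betweenKernels.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    betweenKernels.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                         1, &betweenKernels, 0, nullptr, 0, nullptr);

    vkCmdDispatch(cmd, 256, 1, 1);

    // Compute results must be visible to the readback copy.
    VkMemoryBarrier toTransfer{};
    toTransfer.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    toTransfer.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    toTransfer.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                         1, &toTransfer, 0, nullptr, 0, nullptr);

    // 3) Readback: device-local output buffer -> staging.
    vkCmdCopyBuffer(cmd, outputBuf, stagingOut, 1, &readbackRegion);
}
```

The whole segment is then a single vkQueueSubmit() on the one queue, followed by one wait (fence or timeline semaphore) on the CPU.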
Unfortunately, that doesn't mean the problem is completely solved. Perhaps I didn't express it clearly enough:
While the Vulkan variant is much faster than it originally was, it is still slower than the CUDA variant. I am still looking for places where I can speed it up and will post if I have more observations.