Suggestions on profiling headless Vulkan compute?

Hello everyone,

I have a headless Vulkan compute application that just dispatches a single compute shader in its pipeline. I’d like to profile the compute shader and improve its performance, but right now I’m having trouble with tool selection. I’m wondering whether somebody has good suggestions on this.

To give more details, I’d like to understand the compute shader’s low-level characteristics, such as the timing of the generated hardware ISA, register/memory usage, barrier overhead, and so on. This pretty much means I need to look into vendor-specific tools like Nsight and RGP. But IIUC at the moment they are all graphics and frame oriented; my application has no frames and completes very quickly, so I’m not able to capture anything. Is there a programmatic way to perform captures with these tools?

To my knowledge, RenderDoc provides an API that I can use to do captures programmatically, and it has nice integration with RGP. But I’m not able to get instruction timing information out of RGP via the RenderDoc API. I might be missing something, but I think RGP instruction timing still requires one to perform captures the traditional graphics way?

I’ve also checked tools like Tracy. It’s awesome and comprehensive, but the information I can get stops at the shader level, with no insight into the shader itself.

So to summarize, my questions are

  1. Is there a programmatic way to do shader-level profiling for Vulkan compute? What kind of tools should I look into?
  2. If not, what tricks can I use to make Nsight/RGP/etc. work better with headless Vulkan compute applications?

Thanks in advance!

So from my experience, the main things to profile for compute workloads are

  • The shader itself
  • Synchronization bubbles between dispatches

Since you only seem to have a single dispatch, the second might not be as important. Shader Playground is a good way to play around with how your shader gets compiled. I also know GPUOpen has a standalone shader compiler that can dump out some good information.

As you said, a lot of the fine-grained details are specific to each hardware implementation. I am not familiar enough with Nsight or RGP to speak about them, but I have used the RenderDoc API to “fake the frame” for debugging compute workloads:

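#include <dlfcn.h>
#include "renderdoc_app.h"

RENDERDOC_API_1_4_0* pRenderDocApi = nullptr;

// RTLD_NOLOAD only finds librenderdoc.so if RenderDoc already injected it,
// so pRenderDocApi stays null when the app is not launched through RenderDoc
if (void* mod = dlopen("librenderdoc.so", RTLD_NOW | RTLD_NOLOAD)) {
    auto getApi = (pRENDERDOC_GetAPI)dlsym(mod, "RENDERDOC_GetAPI");
    getApi(eRENDERDOC_API_Version_1_4_0, (void**)&pRenderDocApi);
}

// Start of a "frame" to capture AFTER vkCreateInstance
pRenderDocApi->StartFrameCapture(nullptr, nullptr);
// compute workload
pRenderDocApi->EndFrameCapture(nullptr, nullptr);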

So maybe RGP can use that as well (again, I have no idea how RGP is integrated with RenderDoc, but you mentioned it was somehow). I hope Nsight and RGP have a similar “API” to mark where the frame is for offline workloads such as compute.
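For anyone who wants the “fake frame” bracket to be safe when the app is run outside RenderDoc, here is a minimal sketch of a wrapper. The name captureWorkload is hypothetical and not part of RenderDoc; it only assumes the API struct exposes StartFrameCapture and EndFrameCapture the way RENDERDOC_API_1_4_0 does, and that EndFrameCapture returns 1 on success. It is templated so the sketch does not depend on renderdoc_app.h.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of RenderDoc). Api is any type that exposes
// StartFrameCapture/EndFrameCapture like RENDERDOC_API_1_4_0 does.
// Returns true only if a capture was started and ended successfully.
template <typename Api, typename Workload>
bool captureWorkload(Api* api, Workload&& workload) {
    if (api == nullptr) {
        // Not launched through RenderDoc: run the workload uncaptured.
        std::forward<Workload>(workload)();
        return false;
    }
    // Passing nullptr for device and window lets RenderDoc pick the active ones.
    api->StartFrameCapture(nullptr, nullptr);
    std::forward<Workload>(workload)();
    // EndFrameCapture reports 1 on success and 0 on failure.
    return api->EndFrameCapture(nullptr, nullptr) == 1;
}
```

Usage would look something like captureWorkload(pRenderDocApi, [&] { /* record, submit, and wait for the dispatch */ });. Submitting and waiting inside the lambda is probably what you want, so the work actually falls between the two capture calls. Note the sketch does not end the capture if the workload throws.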

Thanks @sfricke_samsung! Yes, as you said, we are using the RenderDoc API to capture programmatically for now. Together with its RGP generation functionality (the forum does not allow me to post link…) we can get intra-command-buffer-level information. But I still haven’t gotten it to work with instruction timing information. Trying to capture with RGP itself does not work.

So I just recently came across this

which might be helpful for you. The tool has both a CLI and a GUI; you give it your shader (with optional pipeline information) and it gives you a pretty good breakdown of the ISA with time per instruction, registers used, etc.

Go ahead and post the link without the leading http:// (and with spaces added in the URL as needed), and I’ll fix it for you.

New users can’t post more than 2 links until they’ve spent more time in the forums (spam avoidance).