Abysmal performance - How to find cause?

I have some serious rendering performance issues in my engine (using Vulkan), and I’m having a lot of trouble pinning down the cause.
I’ve used a mesh with 29952 vertices and 9984 triangles as reference. I can render the mesh 262 times in the Source engine and my frame rate still holds at a steady 90 FPS:

That’s ~0.0415ms per mesh (Ignoring level geometry and such, which means it would actually be even faster).

In my engine I already run into massive performance problems if I just render a handful.
I’ve used timestamp queries to time how long it takes to render 36 of them and it turned out to be ~46.1373ms (~1.282ms per mesh). That’s about 30 times slower than the Source engine (and that’s without a texture, lighting effects, etc!).
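For anyone reproducing this kind of measurement: timestamp queries return raw ticks, not milliseconds, so the values have to be scaled by the device's timestamp period (nanoseconds per tick, reported as VkPhysicalDeviceLimits::timestampPeriod). The following is only a minimal sketch of that arithmetic. The helper names are made up for illustration, the sketch assumes the counter did not wrap between the two queries, and it is not the code used for the numbers above.

```cpp
#include <cassert>
#include <cstdint>

// Converts the difference between two GPU timestamp values (in ticks) to
// milliseconds. timestampPeriodNs is the number of nanoseconds per tick, which
// Vulkan reports as VkPhysicalDeviceLimits::timestampPeriod.
// Assumption: the counter did not wrap between the two queries.
double timestampDeltaToMs(std::uint64_t beginTicks, std::uint64_t endTicks,
                          float timestampPeriodNs)
{
    assert(endTicks >= beginTicks);
    const std::uint64_t deltaTicks = endTicks - beginTicks;
    return static_cast<double>(deltaTicks) *
           static_cast<double>(timestampPeriodNs) / 1.0e6;
}

// Average GPU cost of one mesh, given the time measured for a whole batch.
double perMeshMs(double batchMs, unsigned meshCount)
{
    assert(meshCount > 0);
    return batchMs / static_cast<double>(meshCount);
}
```

With the figures from this post, perMeshMs(46.1373, 36) comes out at roughly 1.28 ms.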

I’ve already ruled a few things out:
State Changes: I can render hundreds of different small objects (= a lot of state changes) just fine, no issues whatsoever.
I did some measurements on the CPU and was able to narrow it down to this function call:


I’m using FIFO present mode with 2 swapchain images (Mailbox isn’t supported on my GPU). For each image there is a fence, to make sure all previous render-calls for the command buffer for that swapchain image have been completed. The above call waits for that fence, and thus waits until the command-buffer has executed all commands in the queue. This is where my program spends most of its time (~95%), which means it’s mostly just waiting for the GPU. (Which concurs with the timestamp measurements I mentioned.)
Shaders: The shaders I’ve used for testing are as simple as can be:
Fragment Shader:

#version 440

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(location = 0) out vec4 fs_color;

void main()
{
	fs_color = vec4(1,0,0,1);
}

Vertex Shader:

#version 440

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(location = 0) in vec3 in_vert_pos;

layout(push_constant) uniform Matrices {
	mat4 MVP;
} u_matrices;

void main()
{
	gl_Position = u_matrices.MVP * vec4(in_vert_pos,1);
}

(Back-facing triangles are discarded by the cull mode.)
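Coming back to the CPU-side measurement mentioned earlier (most of the frame spent blocked on the per-image fence): a measurement like that can be sketched roughly as below. This is not the engine's code. FrameTiming, measureFrame and waitForFrameFence are hypothetical names, and the fence wait is simulated with a sleep so the sketch compiles without the Vulkan SDK. In a real renderer the stand-in would be the blocking fence wait (e.g. vkWaitForFences).

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Total CPU time of one frame and the part of it spent blocked on the fence.
struct FrameTiming {
    std::chrono::duration<double, std::milli> total{0.0};
    std::chrono::duration<double, std::milli> waiting{0.0};

    // Share of the frame spent waiting, in the range 0..1.
    double waitFraction() const
    {
        return total.count() > 0.0 ? waiting.count() / total.count() : 0.0;
    }
};

// Hypothetical stand-in for the blocking fence wait. The sleep simulates the
// GPU still being busy with the previous frame for this swapchain image.
inline void waitForFrameFence(std::chrono::milliseconds simulatedGpuTime)
{
    std::this_thread::sleep_for(simulatedGpuTime);
}

// Times one simulated frame: first the fence wait, then the CPU work that
// records and submits the next command buffer (passed in as a callable).
template <typename RecordFn>
FrameTiming measureFrame(std::chrono::milliseconds simulatedGpuTime,
                         RecordFn&& recordAndSubmit)
{
    FrameTiming timing;
    const auto frameStart = Clock::now();

    waitForFrameFence(simulatedGpuTime);
    const auto waitEnd = Clock::now();

    recordAndSubmit();
    const auto frameEnd = Clock::now();

    timing.waiting = waitEnd - frameStart;
    timing.total = frameEnd - frameStart;
    return timing;
}
```

A wait fraction close to 1 means the CPU is mostly idle and the GPU is the bottleneck, which is the situation described in this post.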

I’m using Vulkan 1.0.26 and my drivers are up to date.

Everything points towards the GPU struggling to render the meshes, but that doesn’t explain how so many can be rendered in the Source engine (and my GPU definitely should be able to handle it).
I haven’t posted any code because I don’t even know what to look for. What can I try to narrow the problem down further?

I don’t know much about the Source engine, but since you’re rendering the same mesh multiple times I bet they use instancing. That would be the first route I’d go for if you’re currently rendering mesh by mesh instead.

Other than that, use a tool like NVIDIA’s Nsight or AMD’s CodeXL for profiling GPU bottlenecks.

I have tried copying the mesh 36 times into a single mesh (So it’s all one instance), but the results are almost exactly the same. If it were something like that, I’d also expect bad performance when rendering a lot of small objects/simple meshes without instancing, but that’s not the case.

I’ve tried CodeXL, but it just confirms what I already know:

(1.455ms for one mesh)
All other commands are in the μs or ns range and not worth mentioning.
The CPU profiler also doesn’t show anything out of the ordinary.

So, it turns out the reason for the terrible performance was that I hadn’t specified the vk::MemoryPropertyFlagBits::eDeviceLocal flag for my vertex buffers. Setting the flag makes all the difference:

That’s about 400 of them (Individual meshes, no instancing), frame rate at a stable 56 FPS (17ms).
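For readers hitting the same problem, here is a rough sketch of how a memory type with the device-local property is usually picked. It is not the engine's code. MemoryType, MemoryProperties and findMemoryType are stand-ins that mirror VkMemoryType, VkPhysicalDeviceMemoryProperties and the usual helper, so the example compiles without the Vulkan SDK. The flag values match the real VK_MEMORY_PROPERTY_* bits, and typeBits corresponds to VkMemoryRequirements::memoryTypeBits.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins with the same values as the real Vulkan flag bits.
constexpr std::uint32_t kDeviceLocal  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherent = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Simplified mirrors of VkMemoryType / VkPhysicalDeviceMemoryProperties.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};

struct MemoryProperties {
    std::uint32_t memoryTypeCount;
    MemoryType memoryTypes[32]; // VK_MAX_MEMORY_TYPES is 32
};

// Returns the index of the first memory type that is allowed by typeBits
// (bit i set means type i may be used for the resource) and that has all
// of the required property flags. Returns -1 if no type qualifies.
int findMemoryType(const MemoryProperties& props, std::uint32_t typeBits,
                   std::uint32_t requiredFlags)
{
    assert(props.memoryTypeCount <= 32);
    for (std::uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool hasFlags =
            (props.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For a vertex buffer, the call would be something like findMemoryType(props, requirements.memoryTypeBits, kDeviceLocal), with the returned index going into the allocation info.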

This is what the Vulkan specification says about the flag:

So, as long as memory for the types I need is available, I should be able to just use it for all of my vertex, uniform and storage buffers, correct?

Yes, one of the most basic things in Vulkan is to learn and use the proper memory types for your use cases. For stuff like vertices, indices etc. you should always go with a device local buffer type (note that mobile devices have buffers that are device local and can be accessed by the host) and stage from host to device. For data that’s updated frequently (like UBOs) you still may want to go with a host visible buffer as the staging and copies may be slower for smaller buffers.
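To make that rule of thumb concrete, here is a small sketch of how the choice could be encoded. It is only a heuristic under the assumptions stated above, not a rule from the specification. UpdatePattern, AllocationPlan and planBufferMemory are hypothetical names, and the flag constants are stand-ins with the same values as the real VK_MEMORY_PROPERTY_* bits.

```cpp
#include <cstdint>

// Stand-ins with the same values as the real Vulkan flag bits.
constexpr std::uint32_t kPropDeviceLocal  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kPropHostVisible  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kPropHostCoherent = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// How often the CPU rewrites the buffer contents.
enum class UpdatePattern {
    Static,   // uploaded once, e.g. vertex and index data
    PerFrame  // rewritten frequently, e.g. small uniform buffers
};

struct AllocationPlan {
    std::uint32_t requiredFlags; // property flags to request for the buffer
    bool useStagingCopy;         // true: upload through a staging buffer
};

// Static data goes into device-local memory and is uploaded through a
// host-visible staging buffer (map, memcpy, then a buffer-to-buffer copy
// such as vkCmdCopyBuffer). Frequently updated data stays host-visible so
// the CPU can write it directly without a copy every frame.
AllocationPlan planBufferMemory(UpdatePattern pattern)
{
    if (pattern == UpdatePattern::Static) {
        return AllocationPlan{kPropDeviceLocal, true};
    }
    return AllocationPlan{kPropHostVisible | kPropHostCoherent, false};
}
```

Where exactly the cut-off between the two lies depends on the hardware and the buffer sizes, so it is worth measuring both paths.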