High GPU execution time. Sync issue or memory bandwidth limit?

Hi!

I’m currently attempting to implement a Vulkan-based deferred renderer and I’m having trouble figuring out whether the performance characteristics I’m seeing are expected, or whether I’m losing performance due to inefficient use of the API. (I’m still working out how to synchronize execution/memory via barriers and external subpass dependencies.)

During debugging I noticed that the GPU execution time of some shaders was rather high, basically every time a 3D mesh was written into the GBuffer over a larger portion of the screen, and even during post-processing effects like SSAO.

Note that I’m testing this on a GTX 1060 at Full HD (1080p) resolution. I’m aware that this GPU model is considered old nowadays, but I still didn’t expect the kind of (horrible) performance I’m experiencing here.

Example of the SSAO shader:

#version 450
#extension GL_ARB_separate_shader_objects : enable
#include "normalCompression.glsl"

layout(binding = 0) uniform SSAOshaderUniform {
mat4 invViewProj;
mat4 projection;
mat4 normalViewMatrix;
vec4 ssaoKernel[32];
vec2 resolution;
vec2 noiseScale;
int kernelSize;
float radius;
float bias;
} ubo;

layout(binding = 1) uniform sampler2D texSsaoNoise;
layout(binding = 2) uniform sampler2D texViewDepth;
layout(binding = 3) uniform sampler2D texViewNormal;


layout(location = 0) in vec2 fragTexCoord;


layout(location = 0) out vec4 outColor;


vec3 depthToWorld(sampler2D depthMap, vec2 texcoord, mat4 inverseProjView){
    float depth = texture(depthMap, texcoord).r;
    vec4 position = vec4(texcoord * 2.0 - 1.0, depth, 1.0);

    position = inverseProjView * position;
    return vec3(position / position.w);
}

vec3 reconstructViewPos(vec2 texcoord, float depth, mat4 invProj){
    vec4 clipSpaceLocation;
    clipSpaceLocation.xy = texcoord * 2.0f - 1.0f;
    clipSpaceLocation.z = depth;
    clipSpaceLocation.w = 1.0f;

    vec4 homogenousLocation = invProj * clipSpaceLocation;
    return homogenousLocation.xyz / homogenousLocation.w;
}


//Plane equation. Define a plane pointing towards the +Z axis, use "coords" to select a point on the plane. Returns the z-coordinate at this specific point
float calcDepthOnPlane(vec3 planeNormal, vec2 coords){
    return (-planeNormal.x * coords.x - planeNormal.y * coords.y) / planeNormal.z;
}


void main()
{
    int kernelSize = ubo.kernelSize;
    float radius = ubo.radius;
    float bias = ubo.bias;

    // position and normal should be in viewspace!
    vec2 fragPosCentered = (floor(fragTexCoord * ubo.resolution) + vec2(0.5, 0.5)) / ubo.resolution;

    vec3 fragPos = depthToWorld(texViewDepth, fragPosCentered, inverse(ubo.projection));
    vec3 normal = (ubo.normalViewMatrix * vec4(normalDecode(texture(texViewNormal, fragPosCentered).rg), 1.0)).xyz;
    vec3 randomVec = (texture(texSsaoNoise, fragTexCoord * ubo.noiseScale).xyz * 2.0) - 1.0;
    randomVec.z = 0.0;

    vec3 tangent = normalize(randomVec - normal * dot(randomVec, normal));
    vec3 bitangent = cross(normal, tangent);
    mat3 TBN = mat3(tangent, bitangent, normal);

    // iterate over the sample kernel and calculate the occlusion factor
    float occlusion = 0.0;
    for(int i = 0; i < kernelSize; ++i)
    {
        // get sample position
        vec3 samplePos = TBN * ubo.ssaoKernel[i].xyz; // from tangent to view-space
        samplePos = fragPos + samplePos * radius; // viewspace pos

        // project sample position to get its position on screen/texture
        vec4 offset = vec4(samplePos, 1.0);
        offset = ubo.projection * offset; // from view to clip-space
        offset.xyz /= offset.w; // perspective divide
        offset.xyz = offset.xyz * 0.5 + 0.5; // transform to range 0.0 - 1.0

        // get depth value of kernel sample
        float sampleDepth = depthToWorld(texViewDepth, offset.xy, inverse(ubo.projection)).z;

        // range check & accumulate
        float rangeCheck = smoothstep(0.0, 1.0, radius / abs(fragPos.z - sampleDepth));
        occlusion += (sampleDepth >= samplePos.z + bias ? 1.0 : 0.0) * rangeCheck;
    }
    occlusion = 1.0 - (occlusion / kernelSize);

    vec3 fColor = texture(texViewDepth, fragTexCoord).rgb; // unused

    outColor = vec4(occlusion, occlusion, occlusion, 1.0);
}

Note that I set the kernel size to 16 and the radius to 0.1.
The SSAO is nothing fancy: one texture tap for the depth at the fragment position (the depth is a 32-bit floating-point buffer, btw), one tap for the scene normal (for subpixel accuracy), and then 16 taps of the depth buffer around the kernel. (The normal buffer isn’t touched there.)

Here is the output:

SSAO output

And that’s the execution time:
Nsight execution time

The SSAO timing here was captured with 16 samples and a very low sample radius, to make sure that every texel fetch during kernel sampling lands close in memory to the others (to check whether large memory jumps cause a caching issue, which doesn’t seem to be the case). With a bigger radius the execution time gets much worse.
Spending 3-4 ms on SSAO seems a bit excessive for this GPU and the quality of the result, hence why I’m wondering whether I’m maybe overlooking something on the API end. (Increasing the sampling to 32 taps or increasing the radius can push the execution time past 5-6 ms. On a GTX 1060.)

Another thing I noticed is that GBuffer rendering also takes a long time.
Here is a test where I write into 4 attachments (each 32 bits wide) and get 0.4 ms of execution time rendering a very simple floor. (Pointing the camera down to render this over the entire screen makes this worse, of course.)
Shader:

#version 450
#include "normalCompression.glsl"
#include "normalFilter.glsl"

layout(binding = 1) uniform sampler2D sAlbedo;
layout(binding = 2) uniform sampler2D sNormal;
layout(binding = 3) uniform sampler2D sMetal;
layout(binding = 4) uniform sampler2D sRoughness;
layout(binding = 5) uniform sampler2D sEmissive;
layout(binding = 6) uniform sampler2D sAo;
layout(binding = 7) uniform sampler2D sShadow;

layout(location = 0) in vec3 fragColor;
layout(location = 1) in vec2 vTexcoord;
layout(location = 2) in vec3 vNormal;
layout(location = 3) in vec3 vModelViewPosition;
layout(location = 4) in vec2 fragTexCoordLightmap;
layout(location = 5) in float emissionMultiplier;
layout(location = 6) in mat3 tangentToWorldMatrix;


layout (location = 0) out vec4 gAlbedo;          // 32 bit RGBA8
layout (location = 1) out vec2 gNormal;          // 32 bit R16G16
layout (location = 2) out vec2 gNormalGeometry;  // 32 bit R16G16
layout (location = 3) out vec2 gNormalClearcoat; // 32 bit R16G16


void main() {
    gAlbedo = texture(sAlbedo, vTexcoord) * vec4(fragColor.rgb, 1.0);

    vec3 normal = filterNormalMap(sNormal, vTexcoord);
    gNormal.xy = normalEncode(normal * tangentToWorldMatrix);

    gNormalGeometry.xy = normalEncode(normalize(vNormal));
    gNormalClearcoat = gNormalGeometry;
}

Results: (Note that only the mesh marked with the wireframe was rendered into the GBuffer.)

Render result
Nsight

I noticed that in both cases, removing the writes to the attachments (simply commenting out the line that writes to the attachment) in the shader reduces the execution time to below 0.1 ms. (So I assume that the write access is the bottleneck?)

My question would be whether that’s a “normal” performance characteristic for this GPU, or whether I’m potentially stalling the GPU pipeline somewhere, which leads to these results.

In case that information is needed: I only render with render passes that contain a single (main) subpass, using these execution/memory dependencies:

{
		VkSubpassDependency dependency;
		dependency.srcSubpass = VK_SUBPASS_EXTERNAL;
		dependency.dstSubpass = 0; // First subpass attachment is used in
		dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
		dependency.srcAccessMask = 0;

		dependency.dstStageMask = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
		dependency.dstAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;

		dependency.dependencyFlags = 0;

		dependencies.push_back(dependency);
	}
	{
		VkSubpassDependency dependency;
		dependency.srcSubpass = 0;
		dependency.dstSubpass = VK_SUBPASS_EXTERNAL;
		dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
		dependency.srcAccessMask = 0;

		dependency.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
		dependency.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;

		dependency.dependencyFlags = 0;

		dependencies.push_back(dependency);
	}

VkRenderPassCreateInfo renderPassCreateInfo = {};
	renderPassCreateInfo.dependencyCount = dependencies.size();
	renderPassCreateInfo.pDependencies = dependencies.data();

Hello!

You are showing virtually none of that synchronization, so it’s hard to know.

It is slightly sus that your depth attachment depends on your color attachment. Best would be to document your overall resource synchronization strategy (e.g. with the ol’ pen and paper; the horror!). That can help you determine whether your synchronization is water-tight and sensible.

Great GPU, if it’s the 6 GB variant.

Performance does not exist in a vacuum. What constitutes a “horrible” performance, and what performance was expected instead?

Do you have some kind of baseline to compare it to? Say, why not download someone’s example and see if it does any better for a similar workload. Your scene looks basic, but I cannot know just by looking at it; it might be made from a trillion triangles or something.

for(int i = 0; i < kernelSize; ++i)

I am perhaps old-school, but a lot of divisions in a loop makes me uneasy. Also, you keep recalculating inverse(ubo.projection), which can’t be cheap if the compiler is dumb.
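If the compiler does not hoist it for you, the inverse can at least be computed once per fragment before the loop (or, cheaper still, uploaded through the UBO instead of inverted in the shader at all). A minimal sketch of the sample loop from the shader above with the per-sample inverse pulled out, assuming the same ubo layout and helper functions:

```glsl
// Sketch: hoist the matrix inverse out of the sample loop.
mat4 invProj = inverse(ubo.projection); // computed once per fragment
// (cheaper still: precompute the inverse on the CPU and put it in the UBO)

float occlusion = 0.0;
for(int i = 0; i < kernelSize; ++i)
{
    vec3 samplePos = fragPos + (TBN * ubo.ssaoKernel[i].xyz) * radius;

    vec4 offset = ubo.projection * vec4(samplePos, 1.0);
    offset.xyz = offset.xyz / offset.w * 0.5 + 0.5; // one divide per sample

    float sampleDepth = depthToWorld(texViewDepth, offset.xy, invProj).z;

    float rangeCheck = smoothstep(0.0, 1.0, radius / abs(fragPos.z - sampleDepth));
    occlusion += (sampleDepth >= samplePos.z + bias ? 1.0 : 0.0) * rangeCheck;
}
```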

That’s not awfully surprising considering dead-code elimination.

Do you refer to the subpass dependency?
In that case:

dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstStageMask = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;

means (please correct me if I’m wrong) that the depth access waits for the color attachment write of the previous render pass. (I did this “just in case” I add texture fetch capabilities to the vertex shader.)
But you are right, the dstStageMask should be changed to the fragment shader stage.
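For reference, a sketch of what that corrected ingoing dependency could look like, under the assumption that the previous pass’s color output is only sampled in the fragment shader of this pass (not a definitive setup, just the stage/access pairing the correction implies):

```cpp
// Sketch: fragment-shader reads in this pass wait for color writes
// of whatever was submitted before this render pass.
VkSubpassDependency dependency{};
dependency.srcSubpass      = VK_SUBPASS_EXTERNAL;
dependency.dstSubpass      = 0;
dependency.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; // make writes available
dependency.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT; // where the result is read
dependency.dstAccessMask   = VK_ACCESS_SHADER_READ_BIT;             // ...as a sampled image
dependency.dependencyFlags = 0;
```

Note that the srcAccessMask is no longer 0 here: without it, the dependency would order execution but not make the color writes visible to the reads.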

I have to admit, I still have to fully grasp the concept of synchronisation. (Especially which GPU action requires which pipeline stage to be blocked, and how to use dstAccessMask, srcAccessMask, etc.)
That’s the reason why I, for example, omitted the use of additional subpasses (to keep things as simple as possible for now).

It is :) (Would love to have some RTX cores to play with, though.)

That’s true. However, spending around 5 ms on the GPU for a fairly standard SSAO shader (which for a 60 FPS target is already about 30% of the overall render budget) doesn’t seem right. (Looking up other posts about SSAO implementations that included performance comparisons, from what I was able to find the target for SSAO is often in the range of 1, at most 2 ms, not more?) I might be wrong here, though.

I fixed that after posting this topic. (It didn’t make a difference in the overall performance, though.)

Regarding synchronisation:
I have one semaphore which waits for the availability of the incoming swapchain image before I start rendering my render passes for the current frame.
(Originally I chained semaphores between each render pass to make sure they don’t overlap, just to get things up and running quickly. That is of course not great for performance.)

Currently the only synchronisation I have is the two VK_SUBPASS_EXTERNAL dependencies which I add to every render pass I submit to the queue (the same dependencies as shown above).


The only thing I’m doing in addition to that is transitioning the image layouts of the render targets in the render passes. (For example, color attachments are always in the VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL layout; I only transition them to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL during render passes and then back to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL using the automatic layout transitions of the render passes.)
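For context, a sketch of what such an attachment description could look like under that scheme (the format value is an assumption; the per-subpass layout would be set to COLOR_ATTACHMENT_OPTIMAL in the VkAttachmentReference):

```cpp
// Sketch: automatic layout transitions via the attachment description.
// The render pass transitions SHADER_READ_ONLY -> COLOR_ATTACHMENT_OPTIMAL
// at the start and back to SHADER_READ_ONLY at the end, with no manual barrier.
VkAttachmentDescription colorAttachment{};
colorAttachment.format         = VK_FORMAT_R8G8B8A8_UNORM; // assumption
colorAttachment.samples        = VK_SAMPLE_COUNT_1_BIT;
colorAttachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;
colorAttachment.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
colorAttachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
colorAttachment.initialLayout  = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
colorAttachment.finalLayout    = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
```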

The way this is set up doesn’t throw validation errors and the output from the renderer looks as expected.

The only weird thing I noticed is that removing those two dependencies (basically not having any subpass dependencies, semaphores/fences, etc.) and submitting the render pass/command buffers as-is into the queue still produces the same results (no graphical corruption, same performance profile).

I’m not entirely sure whether that’s the case, but according to the spec:
https://registry.khronos.org/vulkan/specs/1.2-khr-extensions/html/chap8.html#VkSubpassDependency
if one doesn’t supply render pass dependencies with VK_SUBPASS_EXTERNAL, the driver adds implicit dependencies automatically which look like this:

VkSubpassDependency implicitDependency = {
    .srcSubpass = VK_SUBPASS_EXTERNAL,
    .dstSubpass = firstSubpass, // First subpass the attachment is used in
    .srcStageMask = VK_PIPELINE_STAGE_NONE,
    .dstStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    .srcAccessMask = 0,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT |
                     VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                     VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                     VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT |
                     VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT,
    .dependencyFlags = 0,
};

VkSubpassDependency implicitDependency = {
    .srcSubpass = lastSubpass, // Last subpass the attachment is used in
    .dstSubpass = VK_SUBPASS_EXTERNAL,
    .srcStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_NONE,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                     VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = 0,
    .dependencyFlags = 0,
};

Which means that separate render passes are already (to an extent) synchronised with each other? (As, from my understanding, a VkSubpassDependency already functions as a memory and execution barrier.)

But I might be totally wrong here. (Because if that’s the case, then I don’t understand how render passes can be reordered/executed in parallel by the GPU if they already receive implicit synchronisation from the driver as described above.)

That actually makes things more complicated. Breaking things into subpasses offers a clear place for dependencies; your main command buffer recording logic doesn’t have to care.

Does it? From the information I was able to gather, subpasses do have additional restrictions compared to regular (single-subpass) render passes. For example (I might be wrong here), a later subpass can only read the output of a previous subpass at the same pixel/fragment, so things like SSAO or any other shader that samples neighbouring pixels aren’t possible. (Having to deal with those limits/gotchas seemed a bit much for now.)
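That restriction comes from input attachments: within a later subpass, the previous subpass’s output is read with subpassLoad, which takes no coordinate. A small GLSL sketch of what such a read looks like (binding numbers are placeholders):

```glsl
// Sketch: reading a previous subpass's color output as an input attachment.
// subpassLoad has no coordinate parameter; you only get the value at the
// current fragment's own position, so neighbourhood sampling (SSAO kernels,
// blurs, etc.) is not possible this way.
layout(input_attachment_index = 0, binding = 0) uniform subpassInput inColor;

layout(location = 0) out vec4 outColor;

void main() {
    outColor = subpassLoad(inColor);
}
```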

Regarding synchronisation:
I think that constantly switching between two image layouts on each render pass (shader_read_optimal > color_attachment > shader_read) might be the culprit here.
The spec states that IF you specify a layout transition, THEN (and only then) an implicit subpass dependency with VK_SUBPASS_EXTERNAL is added to the render pass (unless you specify a subpass dependency yourself with dstSubpass/srcSubpass = VK_SUBPASS_EXTERNAL).
Subpass dependencies seem to be treated as memory barriers (possibly even execution barriers, but I’m not sure about that), which means that you automatically get synchronisation between render passes (if you perform image layout transitions on the attachments).
If initialLayout, layout and finalLayout are all the same, no layout transition takes place, which in turn doesn’t create external subpass dependencies. (So you need to manually insert barriers in such cases if necessary.)
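A sketch of what such a manually inserted barrier could look like, assuming a color image that was written as an attachment in one pass and is sampled in the next (the image handle and command buffer are placeholders):

```cpp
// Sketch: manual transition + synchronisation when the render pass itself
// performs no layout transition. Makes the color attachment writes visible
// to fragment-shader sampled reads in later submissions.
VkImageMemoryBarrier barrier{};
barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image               = gbufferImage; // placeholder handle
barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // producer stage
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // consumer stage
    0, 0, nullptr, 0, nullptr, 1, &barrier);
```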

So to avoid pipeline stalls, one has to make sure to only perform layout transitions in a render pass when necessary (otherwise parallel execution/reordering of render passes is not possible). This could explain why I don’t see any synchronisation issues between render passes even without explicit subpass dependencies on my end. (And it possibly affects the overall performance, too.)

If that’s the case, then I’m not sure how to design a usable abstraction around the Vulkan API, as you need to know (at least) the final layout of an attachment in advance, at render pass creation time. (This probably lends itself greatly to a render graph system. Otherwise one has to, in the worst case, correct the layouts of images between render passes manually with pipeline barriers.)