Compute Shader poor write performance

I’ve been developing a CAE app using the compute shader example by sascha_willems as a starting point. The computation itself works fine.

#version 450

layout (local_size_x = 2, local_size_y = 2) in; // Trials have been done with 1 to 32

layout (binding = 0, rg32f) uniform readonly image2D inputImage;
layout (binding = 1, rg32f) uniform image2D resultImage;

#define accum(SUM, X, Y) \
...
#define accum16 \
...	
#define accum32 \
...

void main()
{	
	highp uint kernelDim = 32;

	vec2 sum = {0, 0 };

	vec2 phase;

	highp uint dstX = gl_GlobalInvocationID.x;
	highp uint dstY = gl_GlobalInvocationID.y;

	highp uint x = kernelDim * dstX;
	highp uint y = kernelDim * dstY;

	highp uint dx = x;
	highp uint dy = y;

	for (int i = 0; i < kernelDim; i++) {
		accum32; 
		dx++; 
		dy = y;
	}
// *************** The following call is extremely time consuming
	imageStore(resultImage, ivec2(dstX, dstY), vec4(sum.rg, 0, 0));
}

dstImage has usage = VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_STORAGE_BIT;

I need the transfer because I read out the result of the operation to the host.

I’m running on a low end NVIDIA card (MX-150) and I usually get about 225 to 300 FPS (pardon the use of FPS).

With the compute shader step added my rate drops to 100 FPS.

If I comment out the imageStore statement, the rate goes back up to about 200 to 250 FPS. It’s relatively invariant to the amount of math being done (as expected) or the number of threads calling imageStore (unexpected).

This led me to believe I was dealing with a buffer usage type issue. I’ve tried various modifications to the usage flags with no effect.

I had a frag shader implementation working, but integer indexing into the image is required and getting that to work consistently in a fragment shader is a bit painful. However, the performance was much better for the fragment shader (225+ FPS).

Any assistance/advice would be appreciated.

Do a more fine-grained measurement. Get a timestamp, submit your compute work with a fence, wait for the fence, then take the timestamp difference. (You could also use a timestamp query, but that seems like overkill.)

Notebooks work in mysterious ways. Make sure the work actually runs on the dedicated GPU, and that it kicks the GPU out of its stand-by power state and idle clocks.


Your proposal would seem appropriate if my difference were on the order of 25%. My difference is clearly on the order of 100% or more, and it’s tied to one call only: removing the imageStore call restores the performance as if the compute shader stage were never run.

For the past two weeks the app has had pretty steady performance in the 225 FPS / 4 ms-per-frame region. Simply writing out the result adds 5 ms per frame. I may just ditch the compute shader and wing it with fragment shaders; that was delivering the speed I needed, but it’s hard to verify that I’m hitting exactly on the integer texel boundaries, which this algorithm requires.
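For reference, integer texel addressing in a fragment shader can be done with texelFetch, which takes unnormalized integer coordinates and bypasses filtering entirely, so there are no texel-boundary rounding concerns. A minimal sketch (the binding and output location are illustrative, not taken from the app):

```glsl
#version 450

layout (binding = 0) uniform sampler2D inputImage; // source bound as a sampled image

layout (location = 0) out vec4 outColor;

void main()
{
    // gl_FragCoord sits at the pixel center (x + 0.5, y + 0.5), so truncating
    // to ivec2 yields the exact integer texel index of this fragment.
    ivec2 texel = ivec2(gl_FragCoord.xy);
    outColor = vec4(texelFetch(inputImage, texel, 0).rg, 0.0, 0.0); // LOD 0
}
```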

I’ll try what you propose, but I’m not hopeful. It will give me a more precise measure of how bad it is, but I don’t see it pointing directly to the problem.

I’m really looking to see if anyone has encountered something similar and has an “Ah hah!” response.

Yes, notebooks are odd. I upgraded my NVIDIA driver over the weekend and lost the GPU entirely; my code won’t even run if it isn’t accessing the GPU. Also, the Vulkan stats app is reporting correctly.

There are already fences on that stage. I’ll go digging for how to get timing data around them; I didn’t know that was possible.

Compute shaders do not have outputs. As such, they only do something if they perform atomic, SSBO, or imageStore operations. Which means if a CS doesn’t do these things… it doesn’t do anything and the compiler is perfectly free to optimize it into a big no-op.

That’s probably what’s happening to you here. It’s not that imageStore is causing a problem. It’s the difference between “doing something” and “doing nothing”; obviously doing nothing is faster.

It seems more likely that whatever is hiding behind your macros is taking a long time, thus leading to an ~5ms CS execution time. It’s also important to test the CS execution time vs. the latency such execution causes (since you will need to synchronize its completion with whatever does the reading). For example, if your CS does nothing more than write a value (no reading from the other image), will it execute any faster?
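One way to run that experiment is a compute shader that keeps the store but drops all the loads and math. A minimal sketch, reusing the result binding from the shader above:

```glsl
#version 450

layout (local_size_x = 2, local_size_y = 2) in;

layout (binding = 1, rg32f) uniform writeonly image2D resultImage;

void main()
{
    // The store still happens, so the compiler cannot optimize the shader
    // away, but every image load and all of the arithmetic are gone.
    imageStore(resultImage, ivec2(gl_GlobalInvocationID.xy), vec4(1.0, 0.0, 0.0, 0.0));
}
```

If this version is fast, the cost is in the accumulation loads; if it is still slow, the store path itself (image layout, memory type, synchronization) is the suspect.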

I’ll try what you propose, but I’m not hopeful. It will give me a more precise measure of how bad it is, but I don’t see it pointing directly to the problem.

My direction of thinking aligns with what Alfonse is saying above. Without the load and without the store it can basically be optimized away to void main(){}.

It is not really a matter of measuring how bad exactly it is. But rather localizing the problem (for all I know the problem might be your sync, or your readback to host), and also if it is indeed optimized away in the case of the no store, then that case should also be visible in the measurements.

That’s the Ah Hah I was looking for.

Without output (from the function), the compiler may have zeroed out the code. I feel like an idiot; I’ve been burned by that since 1982.

Looks like I’m going to be returning to the fragment shader, it performed a lot better.

BTW, in the broad definition, imageStore is an output operation: it moves data from the shader to storage, crossing the shader code boundary in an outward direction.
True, it is not an output stream.

I omitted the big Thank You, sorry.

I have more information and suspect that I know what the issue is. I have a theory on how to fix it, but would like some feedback before investing time in developing and testing it.

I wrote an alternate pipeline that uses a vert/frag shader pair in place of the compute shader. Pretty standard: render two triangles covering (-1,-1) to (1,1) and have the fragment shader do what the compute shader did. I’m achieving similar results as far as numeric values, but…
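The vertex stage of such a fullscreen pass can be reduced to a sketch like the following, which uses gl_VertexIndex to generate one oversized triangle covering the viewport; it is equivalent to the two-triangle quad but needs no vertex buffer (a common Vulkan idiom, not necessarily what was used here):

```glsl
#version 450

void main()
{
    // Vertex 0 -> (-1,-1), vertex 1 -> (3,-1), vertex 2 -> (-1,3):
    // one big triangle that clips down to the full viewport.
    vec2 pos = vec2((gl_VertexIndex << 1) & 2, gl_VertexIndex & 2);
    gl_Position = vec4(pos * 2.0 - 1.0, 0.0, 1.0);
}
```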

The speed of the vert/frag pair is 50% to 100% better (10 ms reduced to 4 to 5 ms per frame).

The algorithm uses a source image and a destination image. The process reduces the image size by summing blocks, so each invocation of the stage reduces the image dimension by 2x, 4x, etc. The stages are chained to reduce a 4m px x 4m px image to 16 px by 16 px. It’s not a graphics application, so don’t worry about doing all that work to get a useless image.

I assume the vert/frag version knows that the sampler is read-only and the framebuffer’s image is write-only, and optimizes accordingly. I assume that if I can mark each image as read or write, the speed will improve.
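On the compute side, storage images can carry explicit readonly/writeonly layout qualifiers, which hand the compiler the same information the sampler and framebuffer paths get for free. Applied to the declarations from the compute shader above:

```glsl
layout (binding = 0, rg32f) uniform readonly  image2D inputImage;
layout (binding = 1, rg32f) uniform writeonly image2D resultImage;
```

Whether a given driver exploits the qualifiers is implementation-dependent, but they cost nothing, and the compiler will also reject accidental stores to an image marked readonly.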

Since the output image of one stage is the input image of the next, I suspect I will need to change each image’s role with an image memory barrier between stages.

Do the assumptions and approach seem valid?

Oops, 4k px by 4k px.

That’s still lotsa pixels. Yeah