Compute Shader slower than expected

I have a program that is supposed to run on the GPU. I’ve measured the performance of one function, initEdgesX, on the CPU, which gave me roughly 150 ms for a 400³ array of data. Now I want to parallelize it on the GPU, and I expected a large speedup due to the GPU's parallel nature.

However, when I run the code on the GPU, it’s only roughly twice as fast as the CPU version. I’m using an OpenGL compute shader.

This is my code:

CPU:

ComputeShader computeShader("./AVISE_GPU/Shader/initEdgesX.cs");
computeShader.use();

Buffer scalarFieldBuffer(GL_SHADER_STORAGE_BUFFER, scalarFieldSizeTotal * 4, scalarField, GL_DYNAMIC_COPY);
scalarFieldBuffer.bindBufferBase(0);

Buffer heightmapBufferNeg(GL_SHADER_STORAGE_BUFFER, sizeEdgesX * sizeY * sizeZ * 4, nullptr, GL_DYNAMIC_COPY);
heightmapBufferNeg.bindBufferBase(1);

Buffer heightmapBufferPos(GL_SHADER_STORAGE_BUFFER, sizeEdgesX * sizeY * sizeZ * 4, nullptr, GL_DYNAMIC_COPY);
heightmapBufferPos.bindBufferBase(2);

Buffer heightmapIndexOffsetBufferNeg(GL_SHADER_STORAGE_BUFFER, sizeY * sizeZ * 4, nullptr, GL_DYNAMIC_COPY);
heightmapIndexOffsetBufferNeg.bindBufferBase(3);

Buffer heightmapIndexOffsetBufferPos(GL_SHADER_STORAGE_BUFFER, sizeY * sizeZ * 4, nullptr, GL_DYNAMIC_COPY);
heightmapIndexOffsetBufferPos.bindBufferBase(4);

unsigned int testCounter = 0;

Buffer atomicCounter(GL_ATOMIC_COUNTER_BUFFER, 4, &testCounter, GL_DYNAMIC_COPY);
atomicCounter.bindBufferBase(5);

computeShader.setUInt("sizeX", sizeX);
computeShader.setUInt("sizeY", sizeY);
computeShader.setUInt("sizeZ", sizeZ);
computeShader.setUInt("sizeEdgesX", sizeEdgesX);

glfwSetTime(0.0);

/*for (int x = 0; x < sizeX - 1; ++x) {
    computeShader.setUInt("currentX", x);
    glDispatchCompute(1, ceil((float)sizeY / 8), ceil((float)sizeZ / 8));
}*/
glDispatchCompute(1, ceil((float)sizeY / 8), ceil((float)sizeZ / 8));
glFinish();
std::cout << glfwGetTime() << std::endl;

And the shader:

#version 450 core

const int localSizeX = 1;
const int localSizeY = 8;
const int localSizeZ = 8;
layout(local_size_x = localSizeX, local_size_y = localSizeY, local_size_z = localSizeZ) in;

uniform uint sizeX;
uniform uint sizeY;
uniform uint sizeZ;
uniform uint currentX;
uniform uint sizeEdgesX;

layout(binding = 5) uniform atomic_uint testCounter;

layout(std430, binding = 0) readonly buffer scalarField
{
        float density [];
} inputScalarField;

layout(std430, binding = 1) buffer heightmapBuffer1
{
        uint height [] ;
} heightmapZYNeg;

layout(std430, binding = 2) buffer heightmapBuffer2
{
        uint height [] ;
} heightmapZYPos;

layout(std430, binding = 3) buffer heightmapIndexOffsetBuffer1
{
        uint indexOffset [] ;
} heightmapIndexOffsetZYNeg;

layout(std430, binding = 4) buffer heightmapIndexOffsetBuffer2
{
        uint indexOffset [] ;
} heightmapIndexOffsetZYPos;

uint getScalarIndex(uint x, uint y, uint z)
{
    return z * sizeX * sizeY + y * sizeX + x;
}

uint getHeightmapIndex(uint widthIndex, uint heightIndex, uint depthIndex, uint width, uint depth)
{
    return heightIndex * width * depth + widthIndex * depth + depthIndex;
}

void main()
{
    uint currentYIndex = gl_LocalInvocationID.y + (gl_WorkGroupID.y * localSizeY);
    if (currentYIndex >= sizeY)
    {
        return;
    }

    uint currentZIndex = gl_LocalInvocationID.z + (gl_WorkGroupID.z * localSizeZ);
    if (currentZIndex >= sizeZ)
    {
        return;
    }

    uint heightmapIndexOffsetIndex = currentYIndex * sizeZ + currentZIndex;
    heightmapIndexOffsetZYNeg.indexOffset[heightmapIndexOffsetIndex] = 0;
    heightmapIndexOffsetZYPos.indexOffset[heightmapIndexOffsetIndex] = 0;

    atomicCounterIncrement(testCounter);

    for (int x = 0; x < sizeX - 1; ++x)
    {
        float scalar1 = inputScalarField.density[getScalarIndex(x, currentYIndex, currentZIndex)];
        float scalar2 = inputScalarField.density[getScalarIndex(x + 1, currentYIndex, currentZIndex)];

        if (scalar1 < 0 && scalar2 >= 0)
        {
            uint currentHeightmapIndexOffset = heightmapIndexOffsetZYNeg.indexOffset[heightmapIndexOffsetIndex];
            uint arrayIndex = getHeightmapIndex(currentZIndex, currentYIndex, currentHeightmapIndexOffset, sizeZ, sizeEdgesX);
            heightmapZYNeg.height[arrayIndex] = x;
            heightmapIndexOffsetZYNeg.indexOffset[heightmapIndexOffsetIndex] = currentHeightmapIndexOffset + 1;
        }
        else if (scalar1 >= 0 && scalar2 < 0)
        {
            uint currentHeightmapIndexOffset = heightmapIndexOffsetZYPos.indexOffset[heightmapIndexOffsetIndex];
            uint arrayIndex = getHeightmapIndex(currentZIndex, currentYIndex, currentHeightmapIndexOffset, sizeZ, sizeEdgesX);
            heightmapZYPos.height[arrayIndex] = x;
            heightmapIndexOffsetZYPos.indexOffset[heightmapIndexOffsetIndex] = currentHeightmapIndexOffset + 1;
        }
    }

}

Yeah, I hear that a lot. “We’ll just throw it on the GPU. It’s got a lot of compute power. It’ll be faster.”

This is nonsense. On the GPU, you have to work a lot harder than on the CPU to rework your algorithm so that it isn’t memory bound. If you’re heavily bound on GPU memory access, you aren’t going to get anywhere near peak GPU FLOPS. I’m no compute shader expert, but you’re doing some inefficient things in this short section of code, like incrementing values in global memory locations hundreds of times within the same compute shader invocation, with each increment involving a global memory read and a global memory write. No shared memory caching, much less register caching. Just pure, unnecessary global memory read and write bandwidth. And I haven’t even looked at your access patterns yet.
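
As a sketch of what that means here (reusing the uniforms, buffers, and helper functions from the shader above, and replacing everything in main after heightmapIndexOffsetIndex is computed): the two offsets can live in registers for the whole loop and be written to global memory once at the end, and the scalar read in the previous iteration can be reused instead of fetched again.

    // Sketch only: keep the per-column offsets in registers for the whole loop.
    uint offsetNeg = 0u;
    uint offsetPos = 0u;

    // Read the first sample once; each later iteration reuses the previous one.
    float scalar1 = inputScalarField.density[getScalarIndex(0u, currentYIndex, currentZIndex)];

    for (uint x = 0u; x < sizeX - 1u; ++x)
    {
        float scalar2 = inputScalarField.density[getScalarIndex(x + 1u, currentYIndex, currentZIndex)];

        if (scalar1 < 0.0 && scalar2 >= 0.0)
        {
            heightmapZYNeg.height[getHeightmapIndex(currentZIndex, currentYIndex, offsetNeg, sizeZ, sizeEdgesX)] = x;
            ++offsetNeg;
        }
        else if (scalar1 >= 0.0 && scalar2 < 0.0)
        {
            heightmapZYPos.height[getHeightmapIndex(currentZIndex, currentYIndex, offsetPos, sizeZ, sizeEdgesX)] = x;
            ++offsetPos;
        }

        scalar1 = scalar2;
    }

    // One global write per offset buffer instead of one read-modify-write per crossing.
    heightmapIndexOffsetZYNeg.indexOffset[heightmapIndexOffsetIndex] = offsetNeg;
    heightmapIndexOffsetZYPos.indexOffset[heightmapIndexOffsetIndex] = offsetPos;

That removes the repeated read-modify-write of the offset buffers and halves the scalar-field reads; whether the remaining reads are laid out well for the hardware is a separate access-pattern question.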

I think you should take a close look at the memory accesses that your compute shader performs, along with the access patterns. First, get rid of the waste. Then, match what’s left to the access patterns that are efficient on the GPU(s) you are targeting. See the GPU vendor documentation for details.


A second example:

I have two arrays filled with numbers, and a third array that stores the result of adding the corresponding numbers from the two arrays together.

My CPU implementation of this on a Ryzen 5 4600H takes about 70 microseconds for a 160,000-element array. My GPU implementation using OpenGL compute shaders, running on the integrated graphics of the same processor (AMD Radeon Vega 6), takes about 5 milliseconds.

Why on earth is my compute shader so slow?

the code:

ComputeShader test("./AVISE_GPU/Shader/TestShader.cs");
test.use();


unsigned int* testArray1 = new unsigned int[testSize];
unsigned int* testArray2 = new unsigned int[testSize];

for (int i = 0; i < testSize; ++i) {
    testArray1[i] = i;
    testArray2[i] = i;
}

unsigned int* resultingArrayTest = new unsigned int[testSize];

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();


for (int i = 0; i < testSize; ++i) {
    resultingArrayTest[i] = testArray1[i] + testArray2[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "CPU time: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[microSec]" << std::endl;

Buffer testArrayBuffer1(GL_SHADER_STORAGE_BUFFER, testSize * 4, testArray1, GL_STREAM_DRAW);
testArrayBuffer1.bindBufferBase(0);

Buffer testArrayBuffer2(GL_SHADER_STORAGE_BUFFER, testSize * 4, testArray2, GL_STREAM_DRAW);
testArrayBuffer2.bindBufferBase(1);

Buffer resultArray(GL_SHADER_STORAGE_BUFFER, testSize * 4, nullptr, GL_STREAM_DRAW);
resultArray.bindBufferBase(2);

glFinish();
glfwSetTime(0.0);

glDispatchCompute(ceil((float)testSize / 64), 1, 1);
glFinish();
double time = glfwGetTime();
std::cout << "GPU time: " << time << std::endl;

the shader:

#version 450 core

layout(local_size_x = 64) in;

uint arraySize = 160000;

layout(std430, binding = 0) buffer arrayBuffer1
{
    uint content [];
}
array1;

layout(std430, binding = 1) buffer arrayBuffer2
{
    uint content [];
}
array2;

layout(std430, binding = 2) buffer resultArrayBuffer
{
    uint content [];
}
resultArray;

void main()
{
    uint currentIndex = gl_GlobalInvocationID.x;

    if(currentIndex >= arraySize)
    {
        return;
    }
    resultArray.content[currentIndex] = array1.content[currentIndex] + array2.content[currentIndex];
}

How do you know that? What are you using to measure this time? Are you also measuring the time it takes to transfer the data to GPU-accessible memory and transfer it back?

And how fast is it if you actually make use of the vector features of your GPU? That is, instead of having one instance compute one value, why not have it do 4 values via a uvec4? Or even better, maybe each instance could compute 16 values.
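
As a sketch of the uvec4 variant (untested on that hardware, and assuming the element count is a multiple of 4; the vec4Count uniform is an added assumption equal to arraySize / 4):

#version 450 core

layout(local_size_x = 64) in;

// Assumed uniform: number of uvec4 elements, i.e. arraySize / 4 (40000 for 160000).
uniform uint vec4Count;

// Same bindings as the shader above, but each element is now a uvec4,
// so one invocation adds four values at a time.
layout(std430, binding = 0) readonly buffer arrayBuffer1 { uvec4 content[]; } array1;
layout(std430, binding = 1) readonly buffer arrayBuffer2 { uvec4 content[]; } array2;
layout(std430, binding = 2) writeonly buffer resultArrayBuffer { uvec4 content[]; } resultArray;

void main()
{
    uint currentIndex = gl_GlobalInvocationID.x;

    if (currentIndex >= vec4Count)
    {
        return;
    }

    resultArray.content[currentIndex] = array1.content[currentIndex] + array2.content[currentIndex];
}

The dispatch shrinks accordingly, e.g. glDispatchCompute(ceil((float)(testSize / 4) / 64), 1, 1); computing 16 values per invocation would just mean each invocation handling four consecutive uvec4s.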

For the timing I used the glfwGetTime function, and I only start timing after the upload of the data has finished.

I updated my graphics driver, and now it runs faster, somewhere between 100 microseconds and 1 millisecond.

But I also tried running the shader with an empty main function, and the execution still was never faster than 100 microseconds.

I’m very new to graphics card programming, so it’s likely that I made a mistake somewhere along the line or that my code isn’t optimized (like you said, I didn’t even know about the vector features of a GPU).

If you’re asking how long it takes to perform a GPU operation, you need to time the GPU, not the CPU. OpenGL has mechanisms for timing how long it takes for GPU commands to complete.
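
One such mechanism is a timer query. A minimal sketch wrapped around the dispatch from the code above (reusing testSize; error handling omitted):

// Measure GPU execution time of the dispatch with a GL_TIME_ELAPSED query.
GLuint timerQuery = 0;
glGenQueries(1, &timerQuery);

glBeginQuery(GL_TIME_ELAPSED, timerQuery);
glDispatchCompute((GLuint)ceil((float)testSize / 64), 1, 1);
glEndQuery(GL_TIME_ELAPSED);

// Asking for GL_QUERY_RESULT blocks until the timed commands have completed on the GPU.
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(timerQuery, GL_QUERY_RESULT, &elapsedNs);
std::cout << "GPU time: " << elapsedNs / 1.0e6 << " ms" << std::endl;

glDeleteQueries(1, &timerQuery);

If you issue queries every frame you would poll GL_QUERY_RESULT_AVAILABLE instead of blocking, but for a one-off measurement this is enough.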

If you’re timing how long it takes for the CPU to tell the GPU to do some work and then have the CPU figure out when the GPU is done, that’s something glfwGetTime can roughly tell you. But in any case, there’s always overhead when submitting work to the GPU. 100 microseconds is not alarmingly large in terms of submission overhead; you could issue larger dispatches, or multiple dispatch calls in sequence (without state changes between them), and probably get about the same CPU timing.
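
A quick way to check that, as a sketch reusing the dispatch and glfwGetTime timing from the code above: issue a batch of dispatches and wait only once at the end. If the total stays close to the single-dispatch number, most of what was measured was submission and synchronization overhead rather than shader execution.

glfwSetTime(0.0);

// 100 identical dispatches, no state changes in between.
// The results are never read back here, so no memory barriers are inserted;
// this only probes how much of the measured time is per-submission overhead.
for (int i = 0; i < 100; ++i)
{
    glDispatchCompute((GLuint)ceil((float)testSize / 64), 1, 1);
}

glFinish();
std::cout << "100 dispatches: " << glfwGetTime() << " s" << std::endl;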

