Compute shader slower than same "non-compute" operation

I’m building a simple 2D particle “system”, and I’ve recently started moving the transformation and trajectory calculations to a compute shader. For some reason it is slower than doing it directly on the CPU. I’m new to graphics programming, so any help is welcome.

Compute shader code.

layout(local_size_x = 256) in;

struct Particle
{
    mat4 Transform;
    vec2 Trajectory;
};

layout(std430, binding = 0) readonly buffer InParticlesBuffer
{
    Particle InParticles[];
};

layout(std430, binding = 1) writeonly buffer OutTransformsBuffer
{
    mat4 OutParticleTransforms[];
};


uniform mat4 ParticleEmmiterTransform;

uniform uint WindowWidth;
uniform uint WindowHeight;

uniform float ParticleScaleFactor;


vec2 CartesianToNDC(in vec2 cartesianPosition)
{
    return vec2(((2.0f * cartesianPosition.x) / WindowWidth), 
                ((2.0f * cartesianPosition.y) / WindowHeight));
};

// Matrix translation converted from glm::translate to GLSL
mat4 Translate(in mat4 inputMatrix, in vec3 translationVector)
{
    mat4 result = mat4(inputMatrix);

    result[3] = inputMatrix[0] * translationVector[0]
              + inputMatrix[1] * translationVector[1]
              + inputMatrix[2] * translationVector[2]
              + inputMatrix[3];

    return result;
};

void main()
{
    const Particle particle = InParticles[gl_GlobalInvocationID.x];

    const vec2 ndcPosition = CartesianToNDC(particle.Trajectory) / ParticleScaleFactor;
    
    const mat4 screenTransfrom = (Translate(ParticleEmmiterTransform, vec3(ndcPosition.x, ndcPosition.y, 0.0f))) * particle.Transform;

    OutParticleTransforms[gl_GlobalInvocationID.x] = screenTransfrom;
};

GLSL Particle struct definition in C++


struct alignas(16) ComputeShaderParticle
{
    glm::mat4 Transform;
    glm::vec2 Trajectory;
};

How I create the SSBOs. Compute shader input buffer takes “GL_DYNAMIC_COPY”, and output buffer takes “GL_STATIC_DRAW”

glGenBuffers(1, &_bufferId);

glBindBuffer(GL_SHADER_STORAGE_BUFFER, _bufferId);

glBufferData(GL_SHADER_STORAGE_BUFFER, bufferSizeInBytes, bufferData, usageType);

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, bindIndex, _bufferId);

How I retrieve and upload the data to the Compute Shader.
(GL calls are abstracted away in the actual code, but they still follow the same principle.)
When I call glMapBuffer (or glBufferSubData and similar functions), OpenGL outputs a warning stating that moving an SSBO from video memory to RAM may result in a performance penalty.

_inputBuffer.get().Bind();
ComputeShaderParticle* inputBuffer = static_cast<ComputeShaderParticle*>(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_WRITE_ONLY));

for(std::size_t i = 0; i < _numberOfParticles; i++)
{
    inputBuffer[i].Trajectory = _particles[i].Trajectory;
    inputBuffer[i].Transform = _particles[i].Transform;
};

glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

glDispatchCompute((_numberOfParticles / 256) + 1, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

_outputBuffer.get().Bind();
glm::mat4* screenTransformsBuffer = static_cast<glm::mat4*>(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));

I think you’ll need to consider what you want to measure here. The snippet you posted:

  • maps a buffer and fills it with data,
  • immediately dispatches a compute shader that operates on that buffer (meaning copying the data to GPU memory must be complete before the GPU can process the dispatch)
  • maps the buffer again for read access by the CPU, so dispatch has to fully complete and results must be copied back to CPU accessible memory. At that point the GL driver decides it would be better to have the buffer in GPU accessible main memory (if your hardware supports that) and issues a warning that now GPU access will be slower.

I would not be surprised if the majority of the time is spent transferring your data around. In general the biggest wins of moving computations to the GPU can be had if all consumers of the data are on the GPU as well, so that you don’t have to transfer the data back to CPU memory. Ideally those consumers run at a later point in the frame (or on the next frame) so that other work can overlap with the tail end of the computation (where not all compute units are fully loaded) - if you have to copy back to CPU memory this becomes even more important to avoid the stall where the CPU has to wait for the computation to finish.

Thanks for your reply, I’ll try and keep what you said in mind. However, I don’t think I can easily move my “consumers” onto the GPU, because after the computation is complete I need to check collisions/bounds, which obviously implies conditional checks, which to my understanding is something to be avoided when writing shaders. I don’t really understand what you mean by “Ideally those consumers run at a later point in the frame”, my “game-loop” right now goes bind-update-draw, can you please elaborate?

Performance is a holistic exercise. It’s not about making each part as fast as theoretically possible; it’s about making the operation as fast as possible. This will often involve tradeoffs.

If the GPU can do collision and bounds checking, then it should. You may want to optimize it so that operations that execute in the same workgroup are more likely to hit the same collisions, but putting everything on the GPU has inherent advantages. Namely, it’s asynchronous.

This is especially true for something like a particle system, which typically the CPU doesn’t need to know about. If you’re talking about an interactive application, it may need to add new particles or possibly remove some (though the GPU can probably handle that itself). But outside of injecting new particles into the mix, the CPU doesn’t really care.

Removing the CPU->GPU->CPU synchronization and memory transfer is worth a lot in terms of overall performance.

Furthermore, even if the GPU executes it a bit more slowly than it could have… that doesn’t mean it won’t still be faster than the CPU would have been at the same job. This will depend on the GPU itself, but they tend to have a lot of memory bandwidth and the like.

And lastly, if the GPU’s doing it, that means the CPU isn’t. So the CPU is now free to do other things. Even if the GPU was a bit slower overall, that still can be a win.

Basically, don’t sell an idea short just because it doesn’t map perfectly into GPU shader wavefronts. There are lots of benefits to getting more stuff running on the GPU.

Well, as I said the snippet you posted immediately after issuing the dispatch accesses the results on the CPU, that means your CPU must block at that point waiting for the GPU to compute the results. And while it is blocked it cannot issue other work for the GPU to do, so you’ll end up with what is sometimes called a “pipeline bubble”, a spot in the GPU timeline where all preceding work is done and no new work has been issued (or at least can not be processed yet) and so the GPU is sitting idle.
If instead you issue the dispatch at some point early in your frame, then submit some other work that does not depend on the compute results (this depends on your app, maybe there is none in your case), and only then access the results on the CPU, you could keep the GPU busy for more of your frame.
Sometimes it is possible to use compute results from a previous frame instead of those submitted this frame and then you e.g. use two buffers A and B to submit compute work on A in frame N, use results in buffer B (computed in frame N-1) in frame N and so on. Basically increase the latency of your results to increase throughput and avoid stalls.
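
The A/B scheme described above could be tracked with something like the following. This is only a minimal sketch: `PingPong` and its member names are hypothetical, and it only decides which of two buffers is written and which is read in a given frame. The actual buffer objects, the dispatch and the mapping are left out.

```cpp
#include <cstddef>

// Hypothetical helper: tracks which of two output buffers the compute
// dispatch writes this frame and which one the CPU may read.
class PingPong
{
public:
    // Index of the buffer the compute shader writes during the current frame.
    std::size_t WriteIndex() const { return _frame % 2; }

    // Index of the buffer holding the results dispatched in the previous frame.
    std::size_t ReadIndex() const { return (_frame + 1) % 2; }

    // False only on the very first frame, when nothing has been computed yet.
    bool HasResults() const { return _frame > 0; }

    // Call once at the end of every frame.
    void NextFrame() { ++_frame; }

private:
    std::size_t _frame = 0;
};
```

Each frame you would bind the output buffer at `WriteIndex()` and dispatch, and map the buffer at `ReadIndex()` only when `HasResults()` is true. The results are one frame old, which is the latency-for-throughput trade mentioned above.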

So let me get this straight, you’re suggesting something like “chunking”? As in instead of computing the entire set of transforms and drawing, I should maybe compute half, or a quarter of the set and in the next frame use the results from earlier? If so, that’s an interesting idea. I’ll try it tomorrow

You’ve got some great suggestions so far. I’ll just add that you need to profile your existing technique and determine what the bottleneck is. Don’t make assumptions. Determine what the primary limiting factor is. Then you’ll know that you’re not wasting your time trying various optimizations which will reduce or eliminate the cost of that limiting factor.

In development, there’s little that’s more frustrating than “optimizing” your code, only to find that the new code is no faster, or even slower, than before. Profiling (and learning how to profile well) is how you avoid that feeling and the otherwise needless waste of your time.
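
As a rough starting point for that kind of measurement, a small CPU-side timer like the one below may help. It is a sketch, and `SectionTimer` is a hypothetical helper, not part of any library. Keep in mind that GL commands are queued, so a CPU timer around glDispatchCompute will usually show almost nothing; the cost tends to show up where the CPU blocks, for example in a glMapBuffer for reading. For GPU-side timings, GL timer queries (GL_TIME_ELAPSED) are the usual tool.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper that accumulates CPU wall-clock time per named section.
class SectionTimer
{
public:
    using Clock = std::chrono::steady_clock;

    // Runs `work` and adds its elapsed time to the total for `name`.
    template <typename Work>
    void Measure(const std::string& name, Work&& work)
    {
        const Clock::time_point start = Clock::now();
        work();
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
        _totals[name] += elapsed.count();
    }

    // Accumulated milliseconds for `name`, or 0 if it was never measured.
    double TotalMilliseconds(const std::string& name) const
    {
        const auto it = _totals.find(name);
        return it == _totals.end() ? 0.0 : it->second;
    }

private:
    std::map<std::string, double> _totals;
};
```

Wrapping the upload, the dispatch and the read-back in separate `Measure` calls should make it visible which of them the CPU is actually waiting on.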

This. :+1:

This is legacy “conventional wisdom” which is often wrong nowadays. It’s far from that simple.
Some light reading for you. Search for branch and conditional:


I have just now tried your suggestion. ATM it’s done in the simplest way possible: a single Boolean that indicates whether to dispatch or to retrieve the dispatch results. The performance gain was quite unexpected: 500+ FPS (compared to the code I originally showed). However, now I’m faced with another problem, which leans more into OpenGL architecture. Maybe I should’ve mentioned it before, but my particle system has the idea of an “emitter” of particles. The problem with your solution is that it only works when there’s a single emitter. Currently every emitter contains a reference to a single VBO(s) and SSBO(s). Should I create new VBOs and SSBOs for every single emitter?

As usual for performance questions the answer is: it depends :wink:
If it is easy/convenient to implement your emitters this way and you don’t expect to have many hundreds/thousands of them at once, I would say go with what is efficient in terms of your developer time. If you later measure your performance and it needs to be improved and you’ve determined your particle system/emitters to be the bottleneck then you can optimize them - see also what @Alfonse_Reinheart and @Dark_Photon said about performance optimization above.


I would split the processing of emitters into its own compute operation. That is “emit particles” is a separate phase of particle processing relative to “move particles”. The compute operation creating new particles can add more particles to the particle system (it may even be reasonable to have it remove old ones). Through indirect dispatch calls, you can even adjust the number of particles that get processed without the CPU ever knowing how many workgroups are in the system.

ATM I have about 130 emitters with 250 particles on each one, totalling 32,500 particles. I do need (want) to be able to render hundreds of emitters, so is creating an SSBO for every single emitter not a good idea? What should I do then? I have looked at the answers from @Alfonse_Reinheart and @Dark_Photon, but there’s something that I still don’t understand: how can I move my particles to the GPU? How do I handle trajectory, collisions, and others on the GPU? That is, how do I render the particles without having to call glMapBuffer?

When you’re talking about phases do you mean something similar to my “bind-update-draw” loop? Should I add a function that “uploads” the particles and dispatches, another which “downloads” and moves the particle? (No offense) that kinda looks like my idea with extra steps. However, I never used indirect anything, so I will read up and give it a try.

By writing code. I’m not sure I understand. All of that “trajectory, collisions, and others” is just code.

Write it in shader code instead of CPU code.

This is not easy to do, mind you. But that’s what you do. Any details beyond this would require intimate knowledge of exactly what you’re doing.

With indirect dispatching and rendering. The GPU determines how many particles need to be processed, so it writes how many work-groups need to be processed. The GPU then determines how many of what kind of rendering operations to perform, so it writes that to memory used for indirect rendering.

All the CPU does is say “compute particle stuff” and “render particle stuff.” It may manage the list of emitters, but that’s a one-way street: at no point does it need to read anything about them.

No, I’m saying you have a compute operation which generates (or removes?) particles to be processed based on a set of emitters. Then you have a compute operation which processes those particles and generates rendering commands for them. Then you have a rendering operation that renders with those particles.

Basically, break your system down into all the steps that need to be done… then put those steps on the GPU until the CPU no longer has to synchronize with it.

I understand what you’re saying. It’s exactly the last sentence (or paragraph) which I’m having trouble comprehending. How do I send the resulting particles from the compute shader directly to draw? Is it related to indirect dispatch?

So let’s say your particle system is broken down into the following steps:

  1. Generate particles from emitters.
  2. Process particles into triangles (this may involve collision detection or whatever).
  3. Draw triangles.

The CPU knows that each of these processes is going to happen. The CPU is what causes these processes to happen. But the CPU doesn’t need to know the details of these processes.

The CPU manages the list of emitters using CPU data. It passes GPU-appropriate copies of the current list of emitters to GPU-accessible memory. The CPU then starts a CS operation to generate particles to the particle system from that list of emitters.

This communication is entirely one-way. The GPU is not going to modify these emitters.

So the “generate particles” operation generates new particles to be added to the existing list of particles. This means that the GPU has:

  1. An existing list of particles along with their state.
  2. The number of elements within this list to be processed.

So, when the “generate particles” operation adds a particle, it increments the count of particles in the array appropriately.

To perform the “generate triangles from particles” operation, you need a dispatch call to dispatch enough work groups to process all of the particles in the list. That is, the work group count needs to be based on the number of particles.

But the number of particles is on the GPU. So… just use that. Or more likely, when you atomically bump the number of particles, you also atomically adjust the work group count appropriately. For example, if the work group size is 32, then you increment the group count every time you add a 32nd particle to the list.

The work group count is used with an indirect dispatch operation, thus allowing the GPU to read the work group count directly from the buffer.
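
To make the counter bookkeeping concrete, here is a CPU-side stand-in for what the shader would do with atomicAdd. It is a sketch under assumptions: `ParticleCounters`, `AppendParticle` and `GroupCountFor` are hypothetical names, and `std::atomic` only imitates the GPU atomics. The `DispatchIndirectCommand` struct matches the three unsigned ints that glDispatchComputeIndirect reads from the buffer bound to GL_DISPATCH_INDIRECT_BUFFER. Note that this version bumps the group count when a particle starts a new group, so a partially filled last group still gets dispatched; the shader then needs a bounds check against the particle count.

```cpp
#include <atomic>
#include <cstdint>

// Layout read by glDispatchComputeIndirect: three tightly packed unsigned ints.
struct DispatchIndirectCommand
{
    std::uint32_t NumGroupsX;
    std::uint32_t NumGroupsY;
    std::uint32_t NumGroupsZ;
};
static_assert(sizeof(DispatchIndirectCommand) == 12, "must be tightly packed");

constexpr std::uint32_t kGroupSize = 32; // matches the example in the text

// CPU stand-in for counters that would live in a GPU buffer (hypothetical).
struct ParticleCounters
{
    std::atomic<std::uint32_t> ParticleCount{0};
    std::atomic<std::uint32_t> NumGroupsX{0};
};

// Mirrors what one shader invocation would do: reserve a slot in the
// particle list and open a new work group if that slot starts one.
std::uint32_t AppendParticle(ParticleCounters& counters)
{
    const std::uint32_t slot = counters.ParticleCount.fetch_add(1);
    if (slot % kGroupSize == 0)
    {
        counters.NumGroupsX.fetch_add(1);
    }
    return slot;
}

// Equivalent closed form: ceil(particleCount / kGroupSize).
constexpr std::uint32_t GroupCountFor(std::uint32_t particleCount)
{
    return (particleCount + kGroupSize - 1) / kGroupSize;
}
```

After any number of `AppendParticle` calls, `NumGroupsX` equals `GroupCountFor(ParticleCount)`, which is the value the indirect dispatch would pick up.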

The same thing happens with triangle generation. Each particle contributes some number of triangles. Each time it writes a vertex to the output buffer, it atomically increments a vertex count accordingly. This vertex count would then be part of the indirect rendering call, providing the number of vertices in the rendering command.

So all the CPU does is:

  1. A regular dispatch with a work group count based on the number of emitters provided to the GPU.
  2. An indirect dispatch which reads from the work group count provided by the previous operation.
  3. An indirect rendering command which reads the vertex count provided by the previous operation.

And of course, you need appropriate glMemoryBarriers between each operation. But at no time does the CPU need to read anything written by the GPU.
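
The three CPU-side steps plus barriers could be laid out roughly as follows. This is a sketch, not a drop-in implementation: `Gl` is a hypothetical thin wrapper whose methods stand for glDispatchCompute, glMemoryBarrier, glDispatchComputeIndirect and glDrawArraysIndirect, and the `Barrier` values stand in for the corresponding GL barrier bits. It assumes the draw reads the generated vertices as vertex attributes; if the vertex shader reads them from an SSBO instead, the second barrier would need the shader storage bit.

```cpp
#include <cstdint>

// Stand-ins for GL_SHADER_STORAGE_BARRIER_BIT, GL_COMMAND_BARRIER_BIT and
// GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT (the numeric values here are made up).
enum Barrier : std::uint32_t
{
    ShaderStorage = 1u << 0,
    Command = 1u << 1,
    VertexAttribArray = 1u << 2,
};

// `Gl` is a hypothetical wrapper; each method stands for the GL call
// named in the comment above it.
template <typename Gl>
void RunParticleFrame(Gl& gl, std::uint32_t emitterCount, std::uint32_t groupSize)
{
    // 1. glDispatchCompute: enough groups for one invocation per emitter.
    gl.DispatchCompute((emitterCount + groupSize - 1) / groupSize);

    // glMemoryBarrier: step 2 reads the particle list (shader storage) and
    // its work group count (indirect command data) written by step 1.
    gl.InsertBarrier(ShaderStorage | Command);

    // 2. glDispatchComputeIndirect: group count comes from a GPU buffer.
    gl.DispatchComputeIndirect();

    // glMemoryBarrier: the draw reads the vertices and the draw command
    // written by step 2.
    gl.InsertBarrier(VertexAttribArray | Command);

    // 3. glDrawArraysIndirect: vertex count comes from a GPU buffer.
    gl.DrawArraysIndirect();
}
```

Nothing in this function reads a GPU result back, so the CPU never has to wait for the compute work to finish.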

Thank you very much for this explanation, it feels somewhat advanced, relative to my understanding of OpenGL. I will give indirect drawing and dispatch a try tomorrow (I’m not sure where the triangles came from)

Well, you’re trying to render the particles eventually, right? That means you need to go from particle data to vertex data. Probably with some viewport culling.

I’m assuming the particles are being rendered as GL_POINTS.

The principle is the same regardless of primitive type. The compute shader calculates vertex positions and the number of vertices. Both are written to buffers. A call to glDrawArraysIndirect with a buffer bound to GL_DRAW_INDIRECT_BUFFER is equivalent to a call to glDrawArraysInstancedBaseInstance with the parameters taken from the buffer. This means that the data never needs to leave the GPU, which avoids synchronisation. Similarly, glDispatchComputeIndirect can take the workgroup counts from a buffer, so you can chain compute operations without having to send data through the CPU.
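
For reference, the parameters glDrawArraysIndirect reads from the buffer are four unsigned ints in the order shown below. The struct layout is the one GL defines; `MakeEmptyPointDraw` is a hypothetical helper, and it assumes the CPU (or a buffer clear) resets the command once per frame before the compute pass, with the shader then raising `Count` atomically as it emits vertices.

```cpp
#include <cstdint>

// Layout read by glDrawArraysIndirect from the buffer bound to
// GL_DRAW_INDIRECT_BUFFER.
struct DrawArraysIndirectCommand
{
    std::uint32_t Count;         // number of vertices to draw
    std::uint32_t InstanceCount; // number of instances
    std::uint32_t First;         // index of the first vertex
    std::uint32_t BaseInstance;  // offset applied to instanced attributes
};
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "must be tightly packed");

// Hypothetical helper: the starting value for a frame, drawing zero vertices
// in a single instance until the compute shader raises Count.
constexpr DrawArraysIndirectCommand MakeEmptyPointDraw()
{
    return DrawArraysIndirectCommand{0u, 1u, 0u, 0u};
}
```

Uploading this value is a one-way CPU-to-GPU write, so it does not reintroduce the read-back stall discussed earlier in the thread.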

It isn’t always feasible to do everything on the GPU. E.g. if collisions involving particles trigger events which affect gameplay logic, those events will need to get sent to the CPU. The main principle in such cases is to avoid reading a result immediately after issuing the commands which calculate it. Try to wait at least one complete frame in between (i.e. read the result from frame n just before issuing the commands for frame n+1). There may be an advantage to waiting longer, but that usually requires multiple sets of buffers.