How to speed up the simulation? Compute shaders or CUDA?

I am working on a 3D simulation of photons hitting an object with a spherical shape. Here is the video:

link to video on YT

The problem is that the spacing between the flying photons is too large, and when I increase the number of them even more, the animation starts to stutter. Everything runs from the CPU. Can I somehow improve performance? Maybe by using compute shaders or CUDA?

In my rendering loop I’m calling the “render” method:

while (!glfwWindowShouldClose(window))

Simulation::render() looks like this:

void Simulation::render()
    for (size_t i = 0; i < number_of_particles; ++i)

    for (size_t i = 0; i < number_of_photons; ++i)

The number of photons in the photons[] array is in the order of several thousand. Is it possible to use parallel computing on the graphics card instead of executing this loop sequentially on the processor?

Here is the “draw” method:

void Object_sphere::draw(SpriteRenderer_sphere& renderer)
    renderer.drawSprite(position, size, rotation, color);

And here is the “drawSprite” method:

void SpriteRenderer_sphere::drawSprite(glm::vec3 position, glm::vec3 size, float rotate, glm::vec3 color)

glm::mat4 projection = glm::perspective(glm::radians(camera.Zoom), static_cast<float>(SCR_WIDTH) / static_cast<float>(SCR_HEIGHT), 0.1f, 100.0f);

glm::mat4 view = camera.GetViewMatrix();
shader.setMat4("view", view);
shader.setMat4("projection", projection);

glm::mat4 model = glm::mat4(1.0f);
model = glm::translate(model, position);

model = glm::translate(model, glm::vec3(0.5f * size.x, 0.5f * size.y, 0.5f * size.z)); // move origin of rotation to center of quad
model = glm::rotate(model, glm::radians(rotate), glm::vec3(0.0f, 0.0f, 1.0f)); // then rotate
model = glm::translate(model, glm::vec3(-0.5f * size.x, -0.5f * size.y, -0.5f * size.z)); // move origin back

model = glm::scale(model, glm::vec3(size));
shader.setMat4("model", model);

shader.setVec3f("spriteColor", color);

glm::mat3 normalMatrix = transpose(inverse(view * model));
shader.setMat3("normalMatrix", normalMatrix);


glDrawElements(GL_TRIANGLES, indices.size(), GL_UNSIGNED_INT, 0);

There is no interaction between photons.
Is it possible to run this code on the GPU?

Yes. Probably the simplest way to do this is to use instancing (requires OpenGL 3.3 or the ARB_instanced_arrays extension). Replace the uniforms (other than the view/projection matrices) with instanced attributes.

An alternative is to store all of the data in uniform arrays (or SSBOs, or textures) which are indexed using values derived from gl_VertexID (gl_VertexID/4 for the sprite index, gl_VertexID%4 for the index of the vertex within the sprite).

Either way, you want to draw all of the sprites with a single draw call. Or at least thousands of sprites per call, not one per call.

Also: you don’t need a separate normal matrix if the model-view matrix contains only translation, rotation and uniform scale…

1 Like

In case it comes down to it, you could probably slide through with the instancing support in OpenGL 3.1 or the ARB_draw_instanced extension. However, you’d need to pull the per-instance data into the shader from a texture or buffer object indexed by gl_InstanceID instead of having OpenGL “push” it into the shader for you using a special vertex attribute array containing per-instance attributes (i.e. one with divisor > 0).

1 Like

Thank you. Instancing seems to be the best solution.

So, my plan is to use vertex attribute arrays, which will contain the model transformation matrices of all photons. The photons are moving, hence the matrices will have to be computed in each iteration and will need to be sent to the vertex shader in each iteration.
Is there a better solution? I don’t know if this will be the optimal approach

You are right. I do not have non-uniform scale. Thank you for pointing this out. I have removed the normal matrix.

Well, you can send the translation, scale and rotation as attributes and construct the matrix in the vertex shader. That would only require 5 floats (2 or 3 attributes; each is internally a vec4) rather than 16 (a matrix attribute is treated as 4 consecutive vec4 attributes).

You can probably run the simulation on the GPU as well, but the biggest performance gain will come from not using a separate draw call per sprite.

One reason why “fake instancing” (storing the data in uniform arrays / SSBOs / textures rather than using instanced attributes) might be faster is that AFAIK implementations typically don’t batch multiple instances into a workgroup. So if the GPU uses 32 threads per workgroup but your instances only have 4 vertices, it will only use 4/32 threads with the other 28/32 idle. Instancing was designed around instances being relatively complex meshes. The fragment shader will still use as many threads as possible, but if these are small sprites then the vertex shader will use a significant portion of the GPU’s capacity.

OTOH, attribute values are automatically fetched prior to invocation of the shader, whereas array/texture lookups are dependent upon the shader executing them, which adds latency.

If the sprites are very small, you might be better off with point sprites (i.e. GL_POINTS) than with quads. The main issue with point sprites is that it’s entirely up to the implementation what sizes it supports.

1 Like

Thanks for your valuable tips. So far, I have managed to use instancing. Although the gaps between photons did not decrease, it became possible to render up to a million photons without reducing the FPS.
Due to this, I have significantly reduced the velocity of the photons, so that the gaps between them have decreased. For now, this solution is enough for me. If I need more performance then I will follow your tips.

The sprites are spheres generated by the following method:

void SpriteRenderer_sphere_instance::initRenderData()
float xSegment, ySegment; //we consider segments step by step 
float xPos, yPos, zPos;
float phi, theta;

for (size_t y = 0; y <= ySegments; ++y)
	ySegment = static_cast<float>(y) / ySegments;
	phi = ySegment * M_PI;

	for (size_t x = 0; x <= xSegments; ++x)
		xSegment = static_cast<float>(x) / xSegments;
		theta = xSegment * (2 * M_PI);

		xPos = cos(theta) * sin(phi);
		yPos = cos(phi);
		zPos = sin(theta) * sin(phi);


// indices
for (int y = 0; y < ySegments; ++y)
	for (int x = 0; x < xSegments; ++x)
		indices.push_back((y + 1) * (xSegments + 1) + x);
		indices.push_back(y * (xSegments + 1) + x);
		indices.push_back(y * (xSegments + 1) + x + 1);

		indices.push_back((y + 1) * (xSegments + 1) + x);
		indices.push_back(y * (xSegments + 1) + x + 1);
		indices.push_back((y + 1) * (xSegments + 1) + x + 1);

glGenVertexArrays(1, &VAO);

unsigned int VBO;
glGenBuffers(1, &VBO);
glBufferData(GL_ARRAY_BUFFER, sizeof(float) * positions.size(),, GL_STATIC_DRAW);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(0); //0 = location
glBindBuffer(GL_ARRAY_BUFFER, 0);

unsigned int EBO;
glGenBuffers(1, &EBO);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(unsigned int) * indices.size(),, GL_STATIC_DRAW);

glGenBuffers(1, &bufferMatrices);
glBindBuffer(GL_ARRAY_BUFFER, bufferMatrices);
glBufferData(GL_ARRAY_BUFFER, number_of_instances * sizeof(glm::mat4), &modelMatrices[0], GL_STATIC_DRAW);

std::size_t vec4Size = sizeof(glm::vec4);
glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 4 * vec4Size, (void*)0);
glVertexAttribPointer(2, 4, GL_FLOAT, GL_FALSE, 4 * vec4Size, (void*)(1 * vec4Size));
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 4 * vec4Size, (void*)(2 * vec4Size));
glVertexAttribPointer(4, 4, GL_FLOAT, GL_FALSE, 4 * vec4Size, (void*)(3 * vec4Size));

glVertexAttribDivisor(1, 1);
glVertexAttribDivisor(2, 1);
glVertexAttribDivisor(3, 1);
glVertexAttribDivisor(4, 1);


I thought about replacing the spheres with points (because the photons are small), but I don’t know if it is possible - I need to be able to set the radius of the photon so that I can later create interactions between the photons and the object they hit. I don’t know if it is possible to draw points of various sizes.

Sure. You can do this in the vertex shader by writing gl_PointSize (see the GLSL 4.60 Spec).

Here’s one tutorial that mentions this:

1 Like