Compute Shaders (Particle System)

hi, i’m trying to implement a simple particle system
last time i used transform feedback objects and double buffering, which gave (in my judgement) very good results:
without collision detection, about 3 million particles can be simulated at 60 frames per second (only gravity and collision with the y = 0 plane enabled)
with a simple line-triangle intersection test to detect collisions between particles and a few triangles in the scene, i can render about 800,000 particles at 60 frames per second
(my graphics card: NVIDIA GT 640, about 3 years old)

this time i want to push the limits further by using compute shaders, i managed to build this application:
web.engr.oregonstate.edu/~mjb/cs557/Handouts/compute.shader.1pp.pdf

i changed that to only 1 particle buffer for position / velocity / color / etc, but double buffered
the rendering method looks like this:


void ParticleSystem::Render(const glm::mat4 & view, const glm::mat4 & projection, float timestep)
{
	// double buffered, switch vertex array every frame
	static unsigned int flipflop = 1;
	flipflop = !flipflop;
	
	// bind both particle buffers
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, m_particle_buffer[1 - flipflop].ID());		// source
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, m_particle_buffer[flipflop].ID());			// results
	
	// compute shader
	unsigned int program = m_program_update.ID();

	// simulate 1 frame
	glUseProgram(program);
	glDispatchCompute(m_particle_count / PARTICLES_WORK_GROUP_SIZE, 1, 1); // work group size = 128; assumes m_particle_count is a multiple of it
	glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);


	// render 1 frame
	program = m_program_render.ID();

	glUseProgram(program);

	glUniformMatrix4fv(glGetUniformLocation(program, "Model"), 1, GL_FALSE, glm::value_ptr(glm::mat4(1)));
	glUniformMatrix4fv(glGetUniformLocation(program, "View"), 1, GL_FALSE, glm::value_ptr(view));
	glUniformMatrix4fv(glGetUniformLocation(program, "Projection"), 1, GL_FALSE, glm::value_ptr(projection));

	glBindVertexArray(m_vertexarray[flipflop].ID());
	glDrawArrays(GL_POINTS, 0, m_particle_count);
	glBindVertexArray(0);

	glUseProgram(0);
}

question 1:
i’ve read that glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); is used to synchronize and is relatively expensive: it makes sure that if i read back data from that buffer, the compute shader has already finished processing it
BUT: i use 2 buffers, the compute shader calculates data for the next frame, the current one renders the “old” frame from which the compute shader ONLY reads data
do i actually need to synchronize ?
or can i delete glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); without problems ?

compute shader source:


#version 450

layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

layout (std140, binding = 0) buffer Source		{ vec4 DataSource[]; };			// particle buffer to read from
layout (std140, binding = 1) buffer Destination	{ vec4 DataDestination[]; };		// particle buffer to write into


const vec3 gravity = vec3( 0, -9.81, 0);
const float timestep = 0.016;


void main()
{
	// read old data
	// this is a 1-dimensional calculation because the data is a 1D array (of particles)
	uint index = gl_GlobalInvocationID.x;  // .y and .z are 0 for a 1D dispatch

	vec4 data0 = DataSource[3 * index + 0];
	vec4 data1 = DataSource[3 * index + 1];
	vec4 data2 = DataSource[3 * index + 2];

	vec3 position = data0.xyz;
	float lifetime = data0.w;
	vec3 velocity = data1.xyz;
	float unused = data1.w;
	vec4 color = data2;

	// calculate new data
	//vec3 acceleration = gravity;
	vec3 acceleration = vec3(0, 0, 0);

	vec3 position_new =		position + velocity * timestep;
	float lifetime_new =		lifetime - timestep;
	vec3 velocity_new =		velocity + acceleration * timestep;
	vec4 color_new =		color;

	if (position_new.x < -1) { position_new.x = -1; velocity_new.x *= -0.9; }
	if (position_new.y < -1) { position_new.y = -1; velocity_new.y *= -0.9; }
	if (position_new.z < -1) { position_new.z = -1; velocity_new.z *= -0.9; }
	if (position_new.x > +1) { position_new.x = +1; velocity_new.x *= -0.9; }
	if (position_new.y > +1) { position_new.y = +1; velocity_new.y *= -0.9; }
	if (position_new.z > +1) { position_new.z = +1; velocity_new.z *= -0.9; }

	// write new data
	DataDestination[3 * index + 0] = vec4(position_new, lifetime_new);
	DataDestination[3 * index + 1] = vec4(velocity_new, 0);
	DataDestination[3 * index + 2] = color_new;
}

question 2:
what about the ModelxViewxProjection matrix calculation in the vertex shader (for rendering the particles) ?
should i move this calculation to the compute shader as well and store the results in a third buffer ? what about synchronizing ?

question 3:
what about a struct Particle { … }; in the compute shader as data source / destination array, can i assume that the data is packed tightly together or do i have to bother about any offsets between struct members ??
(i would like to avoid this ugliness)


	vec4 data0 = DataSource[3 * index + 0];
	vec4 data1 = DataSource[3 * index + 1];
	vec4 data2 = DataSource[3 * index + 2];

	vec3 position = data0.xyz;
	float lifetime = data0.w;
	vec3 velocity = data1.xyz;
	float unused = data1.w;
	vec4 color = data2;

BUT: i use 2 buffers, the compute shader calculates data for the next frame, the current one renders the “old” frame from which the compute shader ONLY reads data
do i actually need to synchronize ?

Yes.

First, glMemoryBarrier defines how you intend to read the data, not how it was written. So if you intend to use it in a rendering process to provide vertex attributes, the proper barrier to use is GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT. Of course, if you also plan to read from it in the next frame’s compute shader, you still need the storage buffer barrier too.

Second, you don’t have to synchronize right away. You only need to sync when you use it.

Unfortunately, such synchronization is not fine-grained. That is, you’re not just synchronizing the particular buffer you’re going to read from; you’re synchronizing all buffers. OpenGL doesn’t have a way to be more explicit about this.
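As a rough sketch of where the barrier could go in a function like the Render method from the question: it is issued after the dispatch and before the first command that reads the buffer, and its bits name the upcoming reads. The gl* functions and constants in fake_gl are local stand-ins that only record the call order so the snippet compiles without a GL context; they are not the real API. RenderFrame and its parameters are made-up names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for the OpenGL entry points (assumption: a real program gets
// these from its GL loader). They only record the order of the calls.
namespace fake_gl {
std::vector<std::string> calls;
unsigned last_barrier_bits = 0;
const unsigned GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT = 0x00000001u;  // stand-in value
const unsigned GL_SHADER_STORAGE_BARRIER_BIT = 0x00002000u;       // stand-in value
const unsigned GL_POINTS = 0x0000u;                               // stand-in value

void glDispatchCompute(unsigned, unsigned, unsigned) { calls.push_back("dispatch"); }
void glMemoryBarrier(unsigned bits) { last_barrier_bits = bits; calls.push_back("barrier"); }
void glBindVertexArray(unsigned) { calls.push_back("bind_vao"); }
void glDrawArrays(unsigned, int, int) { calls.push_back("draw"); }
}  // namespace fake_gl

// Hypothetical frame function: update the particles, then draw the buffer
// that the compute shader has just written.
void RenderFrame(unsigned group_count, unsigned vao, int particle_count) {
    using namespace fake_gl;

    glDispatchCompute(group_count, 1, 1);

    // Describe how the written data will be read next: as vertex attributes
    // by the draw below, and as a storage buffer by the next frame's dispatch.
    glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

    glBindVertexArray(vao);
    glDrawArrays(GL_POINTS, 0, particle_count);
    glBindVertexArray(0);
}
```

The only thing the sketch is meant to show is the ordering: the barrier sits between the compute dispatch and the draw that consumes its output. The storage bit could instead be issued just before the next dispatch.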

what about the ModelxViewxProjection matrix calculation in the vertex shader (for rendering the particles) ?

That’s up to you. You’ll have to performance test it to see where it performs best.

what about a struct Particle { … }; in the compute shader as data source / destination array, can i assume that the data is packed tightly together or do i have to bother about any offsets between struct members ??

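It will be packed in accordance with the layout qualifier you specified on the storage block (std140 in your shader), so the member offsets are well defined, but they are not necessarily tightly packed.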
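To make that concrete, here is a small sketch of how std140 would place the members of a struct such as Particle { vec3 position; float lifetime; vec3 velocity; float unused; vec4 color; }. That struct, Kind, Rule and Std140Offsets are hypothetical names for illustration. The sketch only encodes the std140 rules for float, vec2, vec3 and vec4 members, with no arrays, matrices or nested structs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Member kinds covered by this sketch.
enum class Kind { Float, Vec2, Vec3, Vec4 };

struct Rule {
    std::size_t align;  // base alignment in bytes
    std::size_t size;   // bytes occupied by the member
};

// std140 base alignment and size: a vec3 aligns like a vec4 but occupies 12 bytes.
Rule RuleFor(Kind k) {
    switch (k) {
        case Kind::Float: return {4, 4};
        case Kind::Vec2:  return {8, 8};
        case Kind::Vec3:  return {16, 12};
        default:          return {16, 16};  // Kind::Vec4
    }
}

std::size_t AlignUp(std::size_t offset, std::size_t align) {
    return (offset + align - 1) / align * align;
}

// Returns the byte offset of every member, followed by the padded struct size
// (std140 rounds a struct's size up to a multiple of 16 bytes).
std::vector<std::size_t> Std140Offsets(const std::vector<Kind>& members) {
    std::vector<std::size_t> out;
    std::size_t offset = 0;
    for (Kind k : members) {
        const Rule r = RuleFor(k);
        offset = AlignUp(offset, r.align);  // insert padding if needed
        out.push_back(offset);
        offset += r.size;
    }
    out.push_back(AlignUp(offset, 16));
    return out;
}
```

For the Particle above this gives offsets 0, 12, 16, 28, 32 and a stride of 48 bytes, so a C++ struct of twelve floats in the same order would match it. Putting a float in front of a vec3 instead yields offsets 0 and 16, which is the kind of padding to watch for.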

thanks for your answer

Second, you don’t have to synchronize right away. You only need to sync when you use it.

does that mean that i should call that function later, right before glBindVertexArray(vao) or right before glDrawArrays(…) ?
and then with both
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);
?
(the vao uses that buffer as attribute input)