Optimizing particle rendering

I’m currently trying to implement a particle system for my game.

First of all, the relevant source code I’m going to discuss can be downloaded here:
http://www.fl-tw.com/opengl/Particles.h http://www.fl-tw.com/opengl/Particles.cpp

The CPU processing is pretty well optimized; my profiling shows it takes very little CPU time (around 0.01 milliseconds for 1000 particles).

The rendering, however (in the source code, it’s the whole CParticleGroup::render() code), takes forever. Rendering the scene, including a 50K-poly terrain with LOD and 1000 objects, takes 4 milliseconds. Rendering the particles takes 10 milliseconds! More than twice as long, just for particles.

I’m not really sure how I could improve the code, so suggestions are welcome.

Vertex shaders are out of the question, as I want it to run on older hardware (GeForce 2s). I cannot use point sprites (or can I?), because my particles all share the same texture and have different texture coordinates. One thing I’m thinking of doing is to use a triangle instead of a quad for each particle… but that will not fundamentally solve the problem. And as you can see, I’m already using a dynamic VBO to fill my vertices in. If somebody has implemented an optimized particle system, please share…


If you’re only drawing 1000 particles, then how big are they? Could it be that you’re fillrate limited? Try halving your window size and see if it has any impact on speed.

Why are you using DrawElements for particles? Maybe try DrawArrays, or use a VBO for the indices as well. You might be seeing some sort of performance glitch related to mixing regular vertex arrays and VBOs.

Vertex shaders aren’t out of the question. Even the GF1 has ARB_vertex_program, but it’s emulated on the CPU by the driver (with very acceptable speed even on a PII-350; I worked with vertex programs on a GF2 GTS a lot).

The problem is, they’re not very big. It’s hard to tell: with perspective, some of them are small and some are bigger, but I’d say none of them exceeds 100x100 pixels (on a 1024x768 screen). Most of them are probably 30x30.

Changing the resolution has no impact on performance.

Harshman: you’re right, that’s an idea, but am I really mixing regular vertex arrays and VBOs? All the objects, the terrain and the particles in my scene are done with VBOs.

Madman: sure, but then I’m going to lose all the benefits of VBOs, namely writing the data directly to AGP/video memory. If vertex programs run in software, the driver has to read the vertex data back. That’s why I want to avoid vertex programs.


Well, the indices aren’t in a VBO, are they? That’s what I meant by “regular vertex array”. It shouldn’t really matter on a GF2, but it might be a glitch in the VBO implementation.

Couldn’t the UnmapBuffer call have something to do with it? Worth a try.

Can you not use UNSIGNED_SHORT instead of UNSIGNED_INT in your DrawElements call? That’s not causing your problem, but it will be a little faster.
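For illustration, here is a minimal sketch of building a 16-bit index list that could be drawn with GL_UNSIGNED_SHORT. It assumes each particle is a quad made of two triangles (four vertices, six indices), which may not match the original code; buildQuadIndices is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a static index list for `quadCount` quads,
// two triangles per quad, using 16-bit indices (GL_UNSIGNED_SHORT).
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount)
{
    // 16-bit indices address at most 65536 vertices, i.e. 16384 quads.
    assert(quadCount <= 16384);

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q)
    {
        const std::size_t base = q * 4;
        // Triangle 1 uses corners 0, 1, 2; triangle 2 uses corners 0, 2, 3.
        const std::size_t corners[6] = {0, 1, 2, 0, 2, 3};
        for (std::size_t c : corners)
            indices.push_back(static_cast<std::uint16_t>(base + c));
    }
    return indices;
}
```

With 1000 particles this gives 4000 vertices, comfortably inside the 16-bit range, and the list only needs to be built once since it never changes.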

I would be interested to see a time for the DrawElements call alone.

I can get 7 million particles a second with point sprites, so if you could use them (not sure if you can), it’s probably worth doing.

[This message has been edited by Adrian (edited 08-02-2003).]

Thanks for all the feedback. I do not have access to the computer causing the problem now, but I’ll test everything ASAP.

I’m pretty convinced it’s not a fillrate problem. I think the bottleneck definitely has something to do with writing the vertices into the VBO. I’ll try the glBufferSubData way; maybe my current lock/unlock is causing the CPU to wait for rendering to finish. A double-buffered version has also come to mind.
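The double-buffer idea could be sketched as rotating between two (or more) buffer objects, so that the buffer being filled this frame is not the one the GPU may still be reading from the previous frame. This is only a sketch under assumptions: BufferRing is a hypothetical helper, the handles stand in for VBO names obtained from glGenBuffersARB, and the GL calls appear only in comments. Whether it actually removes the stall depends on the driver.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: hands out N buffer handles in round-robin order.
template <std::size_t N>
class BufferRing
{
    static_assert(N > 0, "BufferRing needs at least one buffer");

public:
    explicit BufferRing(const std::array<unsigned, N>& handles)
        : m_handles(handles)
    {
    }

    // Returns the handle to fill this frame and advances to the next one.
    unsigned acquire()
    {
        const unsigned handle = m_handles[m_next];
        m_next = (m_next + 1) % N;
        return handle;
    }

private:
    std::array<unsigned, N> m_handles;
    std::size_t m_next = 0;
};

// Intended use per frame (GL calls shown as comments only):
//   unsigned vbo = ring.acquire();
//   glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
//   ... map, fill, unmap, draw ...
```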

I do not think using short indices will solve anything. Anyway, I think I’ll switch to DrawArrays, as it’s a bit stupid to use indices that will never get reused.

I’m not sure point sprites are a good solution. I want to pack my particles into a single texture to avoid switching textures, and I don’t think you can specify sub-texture coordinates with point sprites?
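For the quad-based approach, packing into a single texture just means each particle type gets its own sub-rectangle of texture coordinates. A minimal sketch, assuming the texture is laid out as a uniform grid of tiles (the thread does not say how the real texture is arranged); UVRect and atlasTile are hypothetical names.

```cpp
#include <cassert>

// Texture-coordinate rectangle of one tile inside the shared texture.
struct UVRect
{
    float u0, v0; // lower corner
    float u1, v1; // upper corner
};

// Hypothetical helper: returns the rectangle for tile number `tile`
// in a grid of `cols` x `rows` equally sized tiles, numbered row by row.
UVRect atlasTile(unsigned tile, unsigned cols, unsigned rows)
{
    assert(cols > 0 && rows > 0 && tile < cols * rows);

    const float du = 1.0f / static_cast<float>(cols);
    const float dv = 1.0f / static_cast<float>(rows);
    const unsigned cx = tile % cols; // column of the tile
    const unsigned cy = tile / cols; // row of the tile

    return {static_cast<float>(cx) * du,
            static_cast<float>(cy) * dv,
            static_cast<float>(cx + 1) * du,
            static_cast<float>(cy + 1) * dv};
}
```

The four corners of a particle quad would then take (u0, v0), (u1, v0), (u1, v1) and (u0, v1).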


If you’re telling the driver to copy the data out of a memory buffer, then you might want to try turning off VBO and just drawing out of your “preparation area” instead. The extra copy might be what hurts you (if you prove it’s not fill rate).

I’m assuming your VBO is allocated with STREAMING usage, btw. You might want to try mapping it, and writing your particle outputs into the buffer directly; that might be faster than doing the copy.

I also think it’s a good idea to use DrawArrays, although with a fixed, never-changing index buffer, it might not be that big of a deal (depending on hardware).

Last, where does VTune tell you you’re spending your time?

More news.

The problem seems to be CPU-dependent. I benchmarked the different operations in my particle system code. Mapping and unmapping the buffer generally takes 1 millisecond (I’ll try the double-buffer version to reduce that to 0, btw). Rendering the particles generally takes 0.5-1 millisecond, which means rendering is NOT the bottleneck. But filling the vertex array takes up to 10 milliseconds for a few thousand particles… that doesn’t look too good. I’m not sure what I’m doing that hogs the CPU.
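A per-stage breakdown like this can be collected with a small timing helper wrapped around each step (map, fill, unmap, draw). This is a minimal sketch, not the code used for the numbers above; elapsedMilliseconds is a hypothetical name and it relies only on std::chrono from the standard library.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the wall-clock time
// it took, in milliseconds.
template <typename Work>
double elapsedMilliseconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Note that GL calls can return before the GPU has done the work, so a figure measured around the draw call alone mostly reflects CPU-side submission cost.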

Jwatte: I’m already mapping/unmapping a VBO. I use the GL_DYNAMIC_DRAW_ARB creation flag. I do not keep a system-memory copy of the vertex array. I just have an array of particles, each of which is converted to 4 vertices and appended to the VBO.
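The conversion from one particle to four vertices might look roughly like the sketch below. It assumes camera-facing quads built from the camera’s unit-length right and up vectors; the linked source may do this differently, and Vec3 and billboardCorners are hypothetical names.

```cpp
#include <array>

struct Vec3
{
    float x, y, z;
};

// Hypothetical helper: returns the four corners of a camera-facing quad
// centred on `center` with edge length `size`. `right` and `up` are assumed
// to be the camera's unit-length right and up vectors in world space.
std::array<Vec3, 4> billboardCorners(const Vec3& center, float size,
                                     const Vec3& right, const Vec3& up)
{
    const float h = size * 0.5f; // half the edge length
    const Vec3 r{right.x * h, right.y * h, right.z * h};
    const Vec3 u{up.x * h, up.y * h, up.z * h};

    return {{
        {center.x - r.x - u.x, center.y - r.y - u.y, center.z - r.z - u.z}, // bottom-left
        {center.x + r.x - u.x, center.y + r.y - u.y, center.z + r.z - u.z}, // bottom-right
        {center.x + r.x + u.x, center.y + r.y + u.y, center.z + r.z + u.z}, // top-right
        {center.x - r.x + u.x, center.y - r.y + u.y, center.z - r.z + u.z}, // top-left
    }};
}
```

Computing the corners into locals like this keeps all the arithmetic in ordinary system memory; only the finished values need to be stored to the mapped buffer.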


A pretty obvious question, but I hope you are filling the vertex array sequentially (to make full use of the CPU cache)?

[This message has been edited by Stebet (edited 08-03-2003).]

You’re right, I wasn’t filling it sequentially. I have since fixed that, and although it runs a bit faster, it didn’t really fix the problem. It’s maybe 30-40% faster (not bad, but not enough).


Do you use this code on the VBO?
Note that read-modify-write ops technically require readback over the AGP, which is very bad.
I’ve marked them in red.

I also don’t think that keeping the map established until just before rendering is a good idea, but I’m not sure (never tried it, seemed natural).

I’m not quite sure how the code works (where’s your map?), so feel free to explain

///		Updates the particles in the group
TVoid CParticleGroup::update(const TFloat a_elapsed)
{
	SVec3D f(0, 0, 0);
	if (m_gravity)
		f.y = -4;

	for (TUInt i = 0; i < m_alive->getSize(); i++)
	{
		TUInt id = m_alive->get(i);
		SParticle *p = m_particles->getByAdr(id);

		/// update the particle..
		p->m_time += a_elapsed;
		if (p->m_time >= p->m_life)
		{
			/// particle dies.. (code omitted in this excerpt)
		}

		/// particle is alive and rocking..
[i]		p->m_pos += p->m_vel * a_elapsed;
		p->m_vel += f * a_elapsed;
		p->m_size += p->m_dsize * a_elapsed;
		p->m_size = MMax(p->m_size, 0);[/i]
		for (TUInt k = 0; k < 4; k++)
		{
[i]			p->m_color[k] += p->m_dcolor[k] * a_elapsed;
			p->m_color[k] = MMax(p->m_color[k], 0);
			p->m_color[k] = MMin(p->m_color[k], 1);[/i]
		}
	}
}

edit: Oh my, I’ve never noticed that this forum lacks the color tag

[This message has been edited by zeckensack (edited 08-04-2003).]

What are CArray::get() and CArray::getByAdr() up to?
If they are doing something non-trivial, it might hurt, because you call them for every particle…


[This message has been edited by Mezz (edited 08-04-2003).]

Read-modify-write is a sure recipe for bad performance, when you’re using streaming memory.

When writing to AGP memory (which dynamic VBO ends up using 99 times out of 100) you should write sequentially, and you should write EVERY BYTE – if there’s padding or un-changing data in between, you should still re-write those bytes.

And you should never read from it.
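A minimal sketch of that write-only pattern: the particle state lives in ordinary system memory, each vertex is assembled in a local variable, and the destination is only ever stored to, front to back, padding included. The names Particle, StreamVertex, packColor and streamVertices are hypothetical, the vertex layout is an assumption, and `dst` stands in for the pointer returned by mapping the VBO. To keep it short it writes one vertex per particle; the same pattern applies when writing four corners.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Particle
{
    float x, y, z;
    float size;
    float r, g, b, a;
};

// Assumed 32-byte layout; the padding is explicit so every byte gets rewritten.
struct StreamVertex
{
    float x, y, z;
    std::uint32_t rgba;
    float u, v;
    std::uint32_t pad[2];
};
static_assert(sizeof(StreamVertex) == 32, "unexpected vertex size");

// Clamps each channel to [0, 1] and packs it into one 32-bit value
// (red in the lowest byte, alpha in the highest).
inline std::uint32_t packColor(float r, float g, float b, float a)
{
    auto toByte = [](float c) -> std::uint32_t {
        const float clamped = c < 0.0f ? 0.0f : (c > 1.0f ? 1.0f : c);
        return static_cast<std::uint32_t>(clamped * 255.0f + 0.5f);
    };
    return toByte(r) | (toByte(g) << 8) | (toByte(b) << 16) | (toByte(a) << 24);
}

// Writes one vertex per particle into `dst` and returns the number written.
// `dst` is written sequentially and never read.
std::size_t streamVertices(const std::vector<Particle>& particles,
                           StreamVertex* dst, std::size_t capacity)
{
    assert(particles.size() <= capacity);

    StreamVertex* out = dst;
    for (const Particle& p : particles)
    {
        StreamVertex v; // assembled in a local, i.e. cached memory
        v.x = p.x;
        v.y = p.y;
        v.z = p.z;
        v.rgba = packColor(p.r, p.g, p.b, p.a);
        v.u = 0.0f;
        v.v = 0.0f;
        v.pad[0] = 0;
        v.pad[1] = 0;
        *out++ = v; // one full store per vertex, no read-modify-write
    }
    return static_cast<std::size_t>(out - dst);
}
```

The point of the design is that operators such as `+=` are applied only to the Particle array, never to memory behind the mapped pointer.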