Optimizing particle rendering

Ysaneya · August 1, 2003, 3:51am

I’m currently trying to implement a particle system for my game.

First of all, all the relevant source code I’m going to discuss can be downloaded here:
http://www.fl-tw.com/opengl/Particles.h http://www.fl-tw.com/opengl/Particles.cpp

The CPU processing is pretty optimized; from my profiling I found it takes very little CPU time (around 0.01 milliseconds for 1000 particles).

The rendering, however (in the source code, it’s the whole CParticleGroup::render() code), takes forever. Rendering the scene, including a 50K-poly terrain with LOD and 1000 objects, takes 4 milliseconds. Rendering the particles takes 10 milliseconds! More than twice as long, just for particles.

I’m not really sure how I could improve the code, so suggestions are welcome.

Vertex shaders are out of the question, as I want it to run on older hardware (GeForce 2s). I cannot use point sprites (or can I?), because my particles all share the same texture and have different texture coordinates. One thing I’m thinking of doing is to use a triangle instead of a quad for each particle… but that will not fundamentally solve the problem. And as you can see, I’m already using a dynamic VBO to fill my vertices in. If somebody has implemented an optimized particle system, please share…

Y.

Nutty · August 1, 2003, 3:59am

If you’re only drawing 1000 particles, then how big are they? Could it be that you’re fillrate limited? Try halving your window size and see if it has any impact on speed.

harsman · August 1, 2003, 4:11am

Why are you using DrawElements for particles? Maybe try DrawArrays, or use a VBO for the indices as well. You might be seeing some sort of performance glitch related to mixing regular vertex arrays and VBOs.

M_dm_n · August 1, 2003, 4:11am

Vertex shaders aren’t out of the question; even the GF1 has ARB_vertex_program, but it’s emulated on the CPU by the driver (with very acceptable speed even on a PII-350; I worked with vertex programs on a GF2 GTS a lot).
http://www.delphi3d.net/hardware/extsupport.php?extension=GL_ARB_vertex_program

Ysaneya · August 1, 2003, 5:11am

The problem is, they’re not very big. It’s hard to tell: with perspective some of them are small, some are bigger, but I’d say none of them exceeds 100x100 pixels (on a 1024x768 screen). Most of them are probably 30x30.
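As a rough sanity check on the fillrate question, the figures quoted in this post can be turned into a pixel count. This is only a back-of-the-envelope sketch: `particlePixels` is a hypothetical helper, and it ignores clipping, depth rejection and the real size distribution.

```cpp
// Estimated pixels touched by `count` square particles of side `sidePx`.
// Hypothetical helper for illustration; ignores clipping and depth rejection.
constexpr long long particlePixels(long long count, long long sidePx)
{
    return count * sidePx * sidePx;
}

// Screen size quoted in the post.
constexpr long long kScreenPixels = 1024LL * 768LL;   // 786,432
```

With 1000 particles at 30x30 this gives 900,000 pixels, a little more than one full 1024x768 screen of overdraw. The pessimistic case of 1000 particles at 100x100 gives 10,000,000 pixels, roughly 12.7 screens.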

Changing the resolution doesn’t affect performance.

Harsman: you’re right, that’s an idea, but I’m not mixing regular vertex arrays and VBOs, am I? All the objects, the terrain and the particles in my scene are done with VBOs.

Madman: sure, but then I’m going to lose the main benefit of VBOs, which is writing the data directly to AGP/video memory, because if you’re doing vertex programs in software, the driver has to read the vertex data back. That’s why I want to avoid vertex programs.

Y.

harsman · August 2, 2003, 5:58am

Well, the indices aren’t in a VBO, are they? That’s what I meant by “regular vertex array”. Shouldn’t really matter on a GF2 but it might be a glitch in the VBO implementation.

rwilco · August 2, 2003, 6:46am

Couldn’t the UnmapBuffer call have something to do with it? Worth a try.

imported_Adrian1 · August 2, 2003, 7:17am

Can you not use UNSIGNED_SHORT instead of UNSIGNED_INT in your DrawElements call? That’s not causing your problem but it will be a little faster.
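For illustration, a minimal sketch of a static 16-bit index list for particle quads. It assumes each quad is drawn as two triangles over 4 consecutive vertices, which may not match the posted code; `buildQuadIndices` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `quadCount` quads, each expanded into two triangles
// (6 indices per quad) over 4 consecutive vertices.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount)
{
    // 16-bit indices can only address vertices 0..65535.
    assert(quadCount * 4 <= 65536);

    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::uint16_t base = static_cast<std::uint16_t>(q * 4);
        // First triangle: corners 0, 1, 2.
        indices.push_back(base);
        indices.push_back(static_cast<std::uint16_t>(base + 1));
        indices.push_back(static_cast<std::uint16_t>(base + 2));
        // Second triangle: corners 0, 2, 3.
        indices.push_back(base);
        indices.push_back(static_cast<std::uint16_t>(base + 2));
        indices.push_back(static_cast<std::uint16_t>(base + 3));
    }
    return indices;
}
```

The list never changes, so it can be built once and passed to glDrawElements with GL_TRIANGLES and GL_UNSIGNED_SHORT. At 4 vertices per particle, 16-bit indices cover up to 16384 particles per draw call.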

I would be interested to see a time for the DrawElements call alone.

I can get 7 million particles a second with point sprites, so if you could use point sprites (not sure if you can) it’s probably worth doing.


Ysaneya · August 2, 2003, 8:03am

Thanks for all the feedback. I do not have access to the computer causing the problem now, but I’ll test everything asap.

I’m pretty convinced it’s not a fillrate problem. I think the bottleneck definitely has something to do with writing the vertices into the VBO. I’ll try the glBufferSubData way; maybe my current lock/unlock is making the CPU wait for rendering to finish. A double-buffered version has also come to mind.
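The double-buffer idea can be sketched independently of OpenGL as a small ring of buffer handles, so that the buffer filled this frame is not the one submitted for drawing in the previous frame. `BufferRing` is a hypothetical helper, not part of the posted code, and whether it actually removes the stall depends on the driver.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: cycles through N buffer handles, one per frame.
template <typename Handle, std::size_t N = 2>
class BufferRing {
public:
    explicit BufferRing(const std::array<Handle, N>& handles)
        : m_handles(handles) {}

    // Call once per frame before filling; returns the handle to write into.
    Handle beginFrame()
    {
        m_current = (m_current + 1) % N;
        return m_handles[m_current];
    }

    // Handle selected by the most recent beginFrame() call.
    Handle current() const { return m_handles[m_current]; }

private:
    std::array<Handle, N> m_handles;
    std::size_t m_current = N - 1;   // so the first beginFrame() yields handles[0]
};
```

With ARB_vertex_buffer_object the handles would be the ids returned by glGenBuffersARB. Each frame, the id from beginFrame() is bound with glBindBufferARB, then mapped, filled, unmapped and drawn as before.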

I do not think using short indices will solve anything. Anyway, I think I’ll switch to DrawArrays, as it’s a bit stupid to use indices that will never get reused.

I’m not sure point sprites are a good solution. I want to pack my particles into a single texture to avoid switching textures, and I don’t think you can specify sub-texture coordinates with point sprites?
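To make the sub-texture idea concrete, here is a minimal sketch of computing the texture rectangle for one particle image inside a packed texture. It assumes a hypothetical atlas laid out as a uniform grid, which the post does not specify; `UVRect` and `atlasCell` are illustrative names.

```cpp
// Texture-coordinate rectangle of one atlas cell.
struct UVRect { float u0, v0, u1, v1; };

// Assumes a uniform grid of cols x rows cells, numbered row by row
// starting at (u, v) = (0, 0). Hypothetical layout for illustration.
UVRect atlasCell(unsigned cell, unsigned cols, unsigned rows)
{
    const float w = 1.0f / static_cast<float>(cols);   // cell width in UV space
    const float h = 1.0f / static_cast<float>(rows);   // cell height in UV space
    const unsigned cx = cell % cols;                    // column of the cell
    const unsigned cy = cell / cols;                    // row of the cell
    return { cx * w, cy * h, (cx + 1) * w, (cy + 1) * h };
}
```

With quads, the four corners of each particle simply take (u0, v0), (u1, v0), (u1, v1) and (u0, v1) from its own rectangle, so all particles can share one texture bind.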

Y.

imported_jwatte · August 2, 2003, 10:29pm

If you’re telling the driver to copy the data out of a memory buffer, then you might want to try turning off VBO and just drawing out of your “preparation area” instead. The extra copy might be what hurts you (if you prove it’s not fill rate).

I’m assuming your VBO is allocated with STREAMING usage, btw. You might want to try mapping it, and writing your particle outputs into the buffer directly; that might be faster than doing the copy.

I also think it’s a good idea to use DrawArrays, although with a fixed, never-changing index buffer, it might not be that big of a deal (depending on hardware).

Last, where does VTune tell you you’re spending your time?

Ysaneya · August 3, 2003, 6:31am

More news.

The problem seems to be on the CPU side. I benchmarked the different operations in my particle system code: mapping and unmapping the buffer generally takes 1 millisecond (I’ll try the double-buffer version to reduce that to 0, btw), and rendering the particles generally takes 0.5-1 millisecond (which means rendering is NOT the bottleneck), but filling the vertex array takes up to 10 milliseconds for a few thousand particles… doesn’t look too good. I’m not sure what I’m doing that hogs the CPU.
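For reference, per-stage timings like these can be collected with a small wrapper. This is a sketch using std::chrono from modern C++, not the profiling method used for the numbers above; `timeMs` is a hypothetical helper.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double timeMs(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

One caveat: OpenGL draw calls can return before the GPU has finished, so a CPU timer around a draw call measures submission time rather than GPU execution time unless a glFinish() is included in the timed section.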

Jwatte: I’m already mapping/unmapping from a VBO. I use the GL_DYNAMIC_DRAW_ARB creation flag. I do not keep a system-memory copy of the vertex array. I just have an array of particles, which are then converted to 4 vertices per particle and appended to the VBO.

Y.

Stebet · August 3, 2003, 4:20pm

A pretty obvious question, but I hope you are filling the vertex array sequentially (to make full use of the CPU caching)?


Ysaneya · August 3, 2003, 9:20pm

You’re right, I wasn’t filling it sequentially. But I fixed it in a previous version, and although it runs a bit faster, it essentially didn’t fix the problem. It’s maybe 30-40% faster (not bad, but not enough).

Y.

zeckensack · August 3, 2003, 10:10pm

Do you use this code on the VBO?
Note that read-modify-write ops technically require readback over the AGP, which is very bad.
I’ve marked them in red.

I also don’t think that keeping the map established until just before rendering is a good idea, but I’m not sure (never tried it, seemed natural).

I’m not quite sure how the code works (where’s your map?), so feel free to explain.

///
///		Updates the particles in the group
///
TVoid CParticleGroup::update(const TFloat a_elapsed)
{
	SVec3D f(0, 0, 0);
	if (m_gravity)
		f.y = -4;

	for (TUInt i = 0; i < m_alive->getSize(); i++)
	{
		TUInt id = m_alive->get(i);
		SParticle *p = m_particles->getByAdr(id);

		/// update the particle..
		p->m_time += a_elapsed;
		if (p->m_time >= p->m_life)
		{
			/// particle dies..
			remove(i);
			i--;
			continue;
		}

		/// particle is alive and rocking..
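		p->m_pos += p->m_vel * a_elapsed;	// marked: read-modify-write
		p->m_vel += f * a_elapsed;	// marked: read-modify-write
		p->m_size += p->m_dsize * a_elapsed;	// marked: read-modify-write
		p->m_size = MMax(p->m_size, 0);	// marked: read-modify-write
		for (TUInt k = 0; k < 4; k++)
		{
			p->m_color[k] += p->m_dcolor[k] * a_elapsed;	// marked: read-modify-write
			p->m_color[k] = MMax(p->m_color[k], 0);	// marked: read-modify-write
			p->m_color[k] = MMin(p->m_color[k], 1);	// marked: read-modify-write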
		}
	}
}

edit: Oh my, I’ve never noticed that this forum lacks the color tag


Mezz · August 4, 2003, 12:21am

What are the CArray::get() and CArray::getByAdr() up to?
If they are doing something non-trivial it might hurt, because you do it for every particle…

-Mezz


imported_jwatte · August 4, 2003, 5:09am

Read-modify-write is a sure recipe for bad performance, when you’re using streaming memory.

When writing to AGP memory (which dynamic VBO ends up using 99 times out of 100) you should write sequentially, and you should write EVERY BYTE; if there’s padding or unchanging data in between, you should still re-write those bytes.

And you should never read from it.
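A minimal sketch of a fill loop that follows these rules: particles are read from system memory, each vertex is built in a local variable, and the destination is only ever assigned to, front to back. The `Particle` and `Vertex` layouts and the `fillQuads` name are hypothetical, not taken from the posted source; `dst` stands in for the mapped VBO pointer.

```cpp
#include <cstddef>

struct Particle { float x, y, z, size; float color[4]; };           // hypothetical
struct Vertex   { float x, y, z; float r, g, b, a; float u, v; };   // hypothetical, no padding

// Expands each particle into 4 billboard vertices using the camera's
// right and up vectors. Returns the number of vertices written.
std::size_t fillQuads(const Particle* particles, std::size_t count,
                      const float right[3], const float up[3], Vertex* dst)
{
    static const float corner[4][2] = {
        { -1.0f, -1.0f }, { 1.0f, -1.0f }, { 1.0f, 1.0f }, { -1.0f, 1.0f }
    };
    std::size_t written = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const Particle p = particles[i];        // read from system memory only
        const float half = 0.5f * p.size;
        for (int c = 0; c < 4; ++c) {
            const float sx = corner[c][0] * half;
            const float sy = corner[c][1] * half;
            Vertex v;                            // build the whole vertex locally
            v.x = p.x + right[0] * sx + up[0] * sy;
            v.y = p.y + right[1] * sx + up[1] * sy;
            v.z = p.z + right[2] * sx + up[2] * sy;
            v.r = p.color[0]; v.g = p.color[1];
            v.b = p.color[2]; v.a = p.color[3];
            v.u = 0.5f * (corner[c][0] + 1.0f);
            v.v = 0.5f * (corner[c][1] + 1.0f);
            dst[written++] = v;                  // one sequential store, never read back
        }
    }
    return written;
}
```

Operators such as `+=` applied directly to the mapped pointer are what would reintroduce a read of the destination; keeping all arithmetic on the local `Vertex` avoids that.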