100.000 polys w/lighting @ 30 fps - how?

karx11erx · May 27, 2008, 4:23am

How on earth does one render 100,000 polys w/ lighting @ 30 fps on a system with a decent CPU and, let’s say, a Radeon X1900 XT gfx card?

Dark_Photon · May 27, 2008, 4:57am

I assume you live someplace where the 10^3 separator is . instead of , and mean “100,000 polys”.

You’ll get a lot more responses if you just post a short snippet of what your batch submission code looks like, and let folks rip it to shreds.

But even before that, have you run a CPU or GPU profiler to determine where you are bound? Are you doing basic things such as culling? What do your timings suggest is the main bottleneck?

For starters, eliminate all state changes and get your batch submission code as streamlined as possible. Then add them back and see how you need to regroup/reorder to keep performance up.

karx11erx · May 27, 2008, 5:08am

Oops, yes, meant one hundred thousand polys.

I only have Visual C++ Express 2008, which comes w/o a built-in profiler, and my AQtime profiler doesn’t connect to it. I will add some profiling code myself for a start.
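Hand-rolled profiling code of that kind could look roughly like the sketch below. It is only an illustration: tProfSection and the Prof* functions are made-up names, not part of D2X-XL, and clock() from the C standard library has coarse resolution, so the numbers only mean something when accumulated over many frames.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical per-section timer; all names here are illustrative. */
typedef struct {
    const char *name;    /* label printed in the report */
    clock_t     total;   /* accumulated clock ticks */
    clock_t     start;   /* tick count at the last ProfBegin */
    unsigned    calls;   /* number of Begin/End pairs */
} tProfSection;

void ProfBegin (tProfSection *s)
{
    assert (s != NULL);
    s->start = clock ();
}

void ProfEnd (tProfSection *s)
{
    assert (s != NULL);
    s->total += clock () - s->start;
    s->calls++;
}

/* Accumulated time of a section in seconds. */
double ProfSeconds (const tProfSection *s)
{
    return (double) s->total / CLOCKS_PER_SEC;
}

void ProfReport (const tProfSection *s)
{
    printf ("%-10s %u calls, %.3f s\n", s->name, s->calls, ProfSeconds (s));
}
```

Wrapping culling, lighting and the actual draw submission in separate sections and printing the report every few hundred frames should be enough to show which of them dominates.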

I have eliminated state changes as much as I could.

I am doing face culling ahead of processing faces (omitting that decreases frame rates).

My face rendering code has over 2000 lines of code.

I could give you the entire project plus game data plus setup guide and you could take a look … j/k.

Currently with a test case I am having about 200 fps for 909 faces, 8 state changes on a system w/ Athlon 64 3500+ and Radeon X850 XT w/o lighting (everything just bright). That’s pretty poor I think, but I have no idea what to improve here.

I am using vertex arrays, but no VBOs because I can have a lot of dynamic face color changes and would need to update the color buffer every frame anyway.

knackered · May 27, 2008, 7:11am

That would only require a card capable of 3 million triangles per second. That is frankly nothing. I imagine that even with immediate mode you could well exceed that.
To put it in perspective, my card has a throughput of 300 million triangles per second, and it’s a year old. I regularly get more than 300 million out of it.
You should go back to basics, read some basic documentation, and post on the beginners forum.

karx11erx · May 27, 2008, 7:46am

Sorry, you won’t like that, but that is of entirely no help to me.

Read some basics then. Hm, calling glColor and glDrawArrays? Stuff like that?

I’d need at least some key words, some ideas. I know about VBOs and am using them for object rendering, I know about client arrays, etc. I just don’t know how to put all that together so that I get decent frame rates with that decade old game I am trying to hack up.

But maybe you want to teach John Carmack something, too. As far as I know, none of his engines pushes 300M triangles/sec with full lighting and fx. Obviously that has nothing to do with raw (theoretical) triangle throughput of some gfx hardware.

speedy · May 27, 2008, 8:28am

Did you try to render it without per-frame dynamic updates? Also you could try using VBOs set to streaming mode?

Do post important parts of your batch submission code, as Dark Photon suggested.

karx11erx · May 27, 2008, 8:47am

The problem with VBOs is that there are a lot of faces with variable color/alpha values, and I’d need to update the color buffer for these each time I render them.

If you have links to some documents with good outlines about efficient OpenGL rendering, I’d love to read them.

I really see no way to post “relevant” parts of my code here. It’s just too much.

There’s code involved doing some (software) culling before faces are further processed. Turning it off slows the renderer down.

There’s code changing the alpha values of certain special faces.

There’s code (efficiently) sorting the faces by texture to reduce texture state changes later on (I have profiled that code, it is very fast and negligible compared to overall frame rendering time).

There’s code detecting required state (texture) changes and performing these.

There’s code buffering faces until a state change is required and that renders the entire batch of buffered faces at once using glDrawElements before the state change happens.

That’s just the basic stuff.

There’s also code dispatching transparent faces to another buffer for subsequent transparency rendering.

There’s code activating or deactivating (simple) shader programs handling things like color key transparency or monochrome rendering.

I could post all this here, but I think even snippets would be too much.

knackered · May 27, 2008, 8:51am

karx, are you saying you’re hacking around with the quake3 source code? As far as I remember, that engine was designed around poor/absent hardware T&L and fill rate. He used BSPs for rendering, which meant lots of CPU work (half-space tests), lots of batches being fully submitted every frame, desperately trying to eliminate overdraw. This is the antithesis of today’s approaches. Carmack would not be writing a renderer like that on today’s hardware - and he most certainly will be getting 300Mtps on today’s hardware with most advanced features turned off.
Your problem is fundamental - you’re using yesterday’s algorithms on today’s hardware. Just go to the nvidia or ATI developer sites and RTFM. This information is literally pushed in your face with the most simple searches - saying you need keywords is frankly bollocks.

karx11erx · May 27, 2008, 8:55am

Thanks for the kind words. I tried to avoid these, but your replies are just bollox for me as well.

If you feel offended by my stupid questions, why the heck don’t you just stay away and have a good day somewhere else instead of trying to ruin mine? Nobody forces you to deal with stuff you don’t like here, and if you have a generally negative attitude towards noobs asking the same old questions every day of every year, I have bad news and good news for you: the bad news is that this will never change, and the good news is that you can avoid them.

I am not playing around with Quake 3. I am coding around in Descent 2 (D2X-XL -> http://www.descent2.de ). FYI: Compared to that engine, Q3 is brand spankin’ new. I would love to see someone more skilled in OpenGL coding than me do it, but there is no one willing to.

CatDog · May 27, 2008, 9:21am

Concerning VBOs: make sure your code follows these guidelines (at least).

CatDog

skynet · May 27, 2008, 9:29am

karx11erx, you seem to do a lot on a per-face basis. Try to minimize that. Today, it doesn’t matter if you render 1000 or 900 faces. Don’t do culling, state-sorting or anything else at the per-face level.

Also, try GLIntercept. Recently it helped me to find a stupid bug that in some cases almost halved my render performance. Maybe you just do a stupidly high number of glLoadMatrix() calls, or, just like in my case, a stupidly high number of glPushAttrib()/glPopAttrib() calls. You can make the log of one frame public, so we can take a look at it and maybe pinpoint what’s going wrong.

Also, for static geometry, use static VBOs! Even for dynamic stuff, VBOs should be your choice. Stay away from immediate mode or “old” vertex arrays.

Our engine renders a CAD scene per-frame:
2.6 Mtris
5000 glDrawElements
at 40fps (that’s 104Mtris/s)

on a simple GF8800GTS.
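For comparison, the throughput figures quoted in this thread are just triangles per frame times frames per second. A tiny sketch of that arithmetic (RequiredTrisPerSecond is a made-up helper name):

```c
#include <assert.h>

/* Triangles per second needed to draw trisPerFrame triangles at a given frame rate. */
double RequiredTrisPerSecond (double trisPerFrame, double framesPerSecond)
{
    assert (trisPerFrame >= 0.0 && framesPerSecond >= 0.0);
    return trisPerFrame * framesPerSecond;
}
```

With 100,000 triangles at 30 fps this gives 3 million triangles per second, while 2.6 million triangles at 40 fps gives 104 million, i.e. roughly 35 times the target asked about in the first post.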

knackered · May 27, 2008, 10:38am

typical and predictable - you pick the one negative thing I said and made that the focus of your reply. Totally ignored the other stuff. Totally failed to give any more information on how you’re submitting your vertices.
It’s the vagueness of your question coupled with the inappropriateness of the forum (advanced GL) that annoys me. I could ignore you, but I thought I’d try and kick-start some kind of thought process in you. I have failed miserably. You just want to be spoon fed the basics.
Pray continue…

karx11erx · May 27, 2008, 10:38am

I know for sure that I am not doing excessive amounts of glPush/glPop calls, and the model view and projection matrices are only set once per frame. It’s true though that I am doing a lot of stuff per face, and the reason is that Descent has an excessive amount of light sources, often 16 or more per face, and they vary very frequently.

I have also found that simply pushing all faces to the OpenGL driver and not doing any dynamic lighting, culling or stuff does not speed up rendering for me, and I am clueless why.

I can try VBOs, but I am using them for rendering 3D objects in my scene already (they get loaded once during level load, so there’s no frequent changing them or so, they just stay untouched in the gfx card’s memory after that), and they only about doubled rendering speed.

I should probably use deferred lighting, but I am not that far yet.

knackered · May 27, 2008, 10:57am

What format are your vertices (float3 pos, float3 norm, uint32 rgba)?
When you say you’ve tried pushing all faces to GL, what method do you use?
Still not enough information to give a sensible reply - you’re just playing a guessing game with us.

karx11erx · May 27, 2008, 11:29am

Ok, I will try to post enough information to be useful here.

vertices and normals are float3. color values are float4.

The face buffer renderer looks like this:


#define FACE_BUFFER_SIZE			1000
#define FACE_BUFFER_INDEX_SIZE	(FACE_BUFFER_SIZE * 4 * 4)

typedef struct tFaceBuffer {
   grsBitmap   *bmBot;
   grsBitmap   *bmTop;
   short       nFaces;
   short       nElements;
   int         bTextured;
   int         index [FACE_BUFFER_INDEX_SIZE];
} tFaceBuffer;


void G3EnableClientStates (int bTexCoord, int bColor, int bNormals, int nTMU)
{
// select the texture unit; only the texture coordinate array state is per unit
glActiveTexture (nTMU);
glClientActiveTexture (nTMU);
glEnableClientState (GL_VERTEX_ARRAY);
if (bNormals)
   glEnableClientState (GL_NORMAL_ARRAY);
else
   glDisableClientState (GL_NORMAL_ARRAY);
if (bTexCoord)
   glEnableClientState (GL_TEXTURE_COORD_ARRAY);
else
   glDisableClientState (GL_TEXTURE_COORD_ARRAY);
if (bColor)
   glEnableClientState (GL_COLOR_ARRAY);
else
   glDisableClientState (GL_COLOR_ARRAY);
}


void BeginRenderFaces (void)
{
G3EnableClientStates (1, 1, 1, GL_TEXTURE0);
glNormalPointer (GL_FLOAT, 0, gameData.segs.faces.normals);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.texCoord);
glColorPointer (4, GL_FLOAT, 0, gameData.segs.faces.color);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
G3EnableClientStates (1, 1, 0, GL_TEXTURE1);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.decalTexCoord);
glColorPointer (4, GL_FLOAT, 0, gameData.segs.faces.color);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
G3EnableClientStates (1, 0, 0, GL_TEXTURE2);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.texCoord);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
}


void G3FlushFaceBuffer (void)
{
//basic vertex ordering is quads, but program can turn that into tris
if (gameStates.render.bTriangleMesh)
   glDrawElements (GL_TRIANGLES, faceBuffer.nElements, GL_UNSIGNED_INT, faceBuffer.index);
else
   glDrawElements (GL_QUADS, faceBuffer.nElements, GL_UNSIGNED_INT, faceBuffer.index);
}

Descent 1+2 have a segment based engine. A segment is a cuboid, and levels consist of such cuboids attached to each other by their faces.

D2X-XL builds a face list from a level’s segment list. Each face has properties like base texture, decal texture, etc.

BeginRenderFaces() is called before faces get rendered.
G3EnableClientStates() accepts the desired client states and the TMU to use as parameters.
The renderer then walks through a list of all faces, culls them (doing a software vertex transformation for that) and calls the face render function for each visible face.
The face render function checks whether a state change would occur, and if so flushes the face buffer.
After that check and eventual flush, the new face is pushed into the face buffer.
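Stripped of everything game specific, the check/flush/push logic described above boils down to something like the following sketch. It is not the actual D2X-XL code: tBatch and the Batch* functions are illustrative names, the texture id stands in for the whole render state, and the flush only counts draw calls where the real renderer would call glDrawElements.

```c
#include <assert.h>

#define BATCH_MAX_INDICES 4096          /* illustrative capacity */

typedef struct {
    int      textureId;                 /* state key of the faces currently buffered */
    int      nIndices;                  /* number of buffered vertex indices */
    unsigned index [BATCH_MAX_INDICES];
    int      nFlushes;                  /* draw calls issued so far */
    int      nStateChanges;             /* texture binds issued so far */
} tBatch;

void BatchInit (tBatch *b)
{
    b->textureId = -1;                  /* no texture bound yet */
    b->nIndices = 0;
    b->nFlushes = 0;
    b->nStateChanges = 0;
}

/* Submit whatever is buffered. A real renderer would call glDrawElements here. */
void BatchFlush (tBatch *b)
{
    if (b->nIndices == 0)
        return;
    b->nFlushes++;
    b->nIndices = 0;
}

/* Queue one face; flush first if its texture differs or the buffer would overflow. */
void BatchAddFace (tBatch *b, int textureId, const unsigned *faceIndex, int nFaceIndices)
{
    int i;

    assert (nFaceIndices > 0 && nFaceIndices <= BATCH_MAX_INDICES);
    if (textureId != b->textureId) {
        BatchFlush (b);
        b->textureId = textureId;       /* a real renderer would bind the texture here */
        b->nStateChanges++;
    }
    else if (b->nIndices + nFaceIndices > BATCH_MAX_INDICES)
        BatchFlush (b);
    for (i = 0; i < nFaceIndices; i++)
        b->index [b->nIndices++] = faceIndex [i];
}
```

Feeding it five quads whose textures are 7, 7, 7, 9, 9 and flushing once at the end results in two texture changes and two draw calls, which is why the order in which faces arrive matters so much.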

For the rendering, hardware transformation is used.
If the face culling is omitted, the renderer gets slower.

knackered, I’d rather have no help from you than in that tone of yours.

skynet was more helpful when telling me I should just throw my polys at the gfx driver no matter what. I simply have no clue what tasks to leave to modern gfx hardware. I know the basic OpenGL stuff (and then some), and I have understood the Descent renderer (which is 15 years old and is a software renderer). That’s about it.

knackered · May 27, 2008, 2:15pm

Your wish is my command. Good luck, with that tone of yours.

karx11erx · May 27, 2008, 3:16pm

“Your wish is my command. Good luck, with that tone of yours.”
The only one who was constantly impolite and generally behaving like an arrogant prick who believes he is smarter than everybody else and that that gives him the right to behave like a jerk here was you. If you were as smart as you’re trying to make us believe, you’d understand that to ask good questions you already need to know half of the answer, and that apparently that is not the case for me.

Good bye.

karx11erx · May 27, 2008, 3:35pm

skynet,

I have found out that software visibility culling and lighting cost 75% of time spent in the renderer, so it won’t help much fiddling around with the actual rendering code.

Unfortunately I cannot do w/o the software culling because simply lighting and rendering all faces makes the program even slower.

I would first have to look into different lighting methods (like deferred lighting).

yooyo · May 27, 2008, 3:41pm

wow… this is wrong! Why do you set the vertex pointer 3 times? You have to read some OpenGL manuals before you start coding. Your question is not for the advanced forum.

Regarding your piece of code… Set up the vertex, color & normal pointers once, and then set up the texture coordinate pointers for each TMU. Without a VBO, your vertices are copied every time your app calls glDrawElements.

Anyway, that alone will not give you any performance boost. I believe there is more inappropriate usage of the OpenGL API in your code.

You have to do the following:

Use a VBO for vertices
Use a VBO for faces
If you have some vertex attributes that change every frame, split your vertex into a static and a dynamic part and store them in two separate VBOs. Update only the VBO that contains the dynamic vertex data.
Optimize your software culling. You don’t have to check every face.
Try to minimize the number of draw calls (glDrawElements, glDrawArrays,…). Sort faces based on material (textures).
Do not call glGetXXXXXX. It causes a pipeline stall.
The data layout suitable for the CPU and the programmer is not good for the GPU. Here is a suggestion for a data storage layout.


typedef struct tagVertex
{
 float pos[3];
 float norm[3];
 float color[4];
 float tex0[2];
 float tex1[2];
 float tex2[2];
} tVertex;

typedef struct tagFace
{
 unsigned int indices[3]; // or unsigned short.. depending on vertex number
 // indices are related to array of tVertex
} tFace;

typedef struct tagCuboid
{
 unsigned int faces[12]; // index in tFace array.
 unsigned int face_material_ids[12]; // or 6? 
} tCuboid;

typedef struct tagMaterial
{
 GLuint textures[3]; // up to 3 texture per face
 unsigned int* pRenderingQueue;
 unsigned int max_queue_size;
 unsigned int queue_pos;
 unsigned int additional_flags; // transparency, texture stages or shader...
} tMaterial;
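For an interleaved vertex like this, the stride and attribute offsets that the gl*Pointer calls need can be taken straight from the struct. A minimal sketch, using a renamed copy of the vertex struct (tVertexLayout is an illustrative name) so that it stands alone:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float pos [3];
    float norm [3];
    float color [4];
    float tex0 [2];
    float tex1 [2];
    float tex2 [2];
} tVertexLayout;   /* same fields as the suggested tVertex */

/* Byte distance between consecutive vertices: the stride argument of gl*Pointer. */
size_t VertexStride (void) { return sizeof (tVertexLayout); }

/* Byte offsets of the attributes from the start of a vertex. With a VBO bound,
   these offsets (cast to a pointer) are passed instead of client memory addresses. */
size_t PosOffset (void)   { return offsetof (tVertexLayout, pos); }
size_t NormOffset (void)  { return offsetof (tVertexLayout, norm); }
size_t ColorOffset (void) { return offsetof (tVertexLayout, color); }
size_t Tex0Offset (void)  { return offsetof (tVertexLayout, tex0); }
size_t Tex1Offset (void)  { return offsetof (tVertexLayout, tex1); }
size_t Tex2Offset (void)  { return offsetof (tVertexLayout, tex2); }
```

With 4-byte floats the struct is 16 floats, i.e. 64 bytes per vertex, and every attribute starts on a 4-byte boundary.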

Store all your vertices in a VBO. tVertex fits the hardware perfectly.
Store all your faces in another VBO. tFace fits the hardware perfectly.
Do software occlusion culling. Select only visible cuboids. In each cuboid you have faces and their materials. Add the cuboid’s face indices to the queue of the material each face belongs to (don’t forget to reset queue_pos at the beginning of each frame). At the end, iterate through the materials, set up textures, and render faces using a glMultiDrawElementsEXT call. This would eliminate frequent texture switches and reduce draw calls, but it will not handle multiple lights. I suggest you do lighting in a separate pass without any textures, then turn on additive blending and render the textures.
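The per-material queueing described here could be sketched as follows. This is only an illustration under simplifying assumptions: tMaterialQueue uses a fixed-size array instead of the pRenderingQueue pointer, all names are made up, and SubmitQueues merely counts the draw calls where a renderer would set up textures and issue the multi-draw call.

```c
#include <assert.h>

#define MAX_QUEUE 1024                  /* illustrative capacity */

typedef struct {
    unsigned queue [MAX_QUEUE];         /* face indices queued for this material */
    unsigned queuePos;                  /* number of queued faces */
} tMaterialQueue;

/* Reset every queue at the start of a frame. */
void ResetQueues (tMaterialQueue *materials, unsigned nMaterials)
{
    unsigned i;

    for (i = 0; i < nMaterials; i++)
        materials [i].queuePos = 0;
}

/* Append a visible face to the queue of the material it uses. */
void QueueFace (tMaterialQueue *materials, unsigned materialId, unsigned faceId)
{
    tMaterialQueue *m = materials + materialId;

    assert (m->queuePos < MAX_QUEUE);
    m->queue [m->queuePos++] = faceId;
}

/* Walk the materials once. Returns the number of non-empty queues, i.e. the number
   of texture setups and draw calls a renderer would issue for this frame. */
unsigned SubmitQueues (const tMaterialQueue *materials, unsigned nMaterials)
{
    unsigned i, nDrawCalls = 0;

    for (i = 0; i < nMaterials; i++)
        if (materials [i].queuePos > 0)
            nDrawCalls++;
    return nDrawCalls;
}
```

The point of the design is that the number of state changes depends on the number of materials in view, not on the number of faces or on the order in which the culling step finds them.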

knackered is an OpenGL guru… sometimes his comments might offend some people. He gave you a few good suggestions, so it’s up to you to read again what he said or wait until somebody else says the same.

skynet · May 27, 2008, 4:45pm

karx11erx:

I am unsure what to tell you. Yooyo already stated the most important parts about getting best performance from OpenGL.

Extrapolating from this little piece of code to the rest of it, there might be some other pitfalls you have fallen into. A GLIntercept log would have told us.

For instance, you are using glDrawElements plus old vertex arrays. This forces the driver into a kind of immediate mode, since it cannot know in advance how many vertices get touched by your draw call. Thus, the driver just submits one vertex after another, until the last triangle has been submitted. This is bad and gets worse the more passes you render. glDrawRangeElementsEXT + VBO is the way to go.
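The extra information the range variant gives the driver is simply the smallest and largest vertex index used by the draw call (its start and end arguments). A small sketch of computing them, with IndexRange being an illustrative helper name:

```c
#include <assert.h>

/* Find the smallest and largest vertex index referenced by an index list.
   These are the values the range draw call takes as its start and end arguments. */
void IndexRange (const unsigned *index, int nIndices, unsigned *minIndex, unsigned *maxIndex)
{
    int i;

    assert (nIndices > 0);
    *minIndex = *maxIndex = index [0];
    for (i = 1; i < nIndices; i++) {
        if (index [i] < *minIndex)
            *minIndex = index [i];
        if (index [i] > *maxIndex)
            *maxIndex = index [i];
    }
}
```

For static level geometry this only has to be computed once at load time and can be stored next to each batch.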

Also, I cannot imagine (from the screenshots I’ve seen) how you come close to rendering 100k triangles at all. One map having 10k faces would already be a lot, I guess. In that case I suggest you put the whole map geometry into one static VBO and either render it all every frame or only the visible sub-ranges of it. Do not try to build face lists from visible cuboids at runtime. Convert all your cuboid level geometry at loading time.
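Rendering only the visible sub-ranges could be sketched like this, assuming (and this is an assumption, not something stated about D2X-XL) that the geometry of segment i is stored right after that of segment i-1 in the static buffer. tSegRange and BuildVisibleRanges are illustrative names.

```c
#include <assert.h>

typedef struct {
    int first;   /* index of the first visible segment in the run */
    int count;   /* number of consecutive visible segments */
} tSegRange;

/* Collapse a per-segment visibility array into runs of consecutive visible segments.
   Returns the number of runs written to ranges (at most maxRanges). */
int BuildVisibleRanges (const unsigned char *visible, int nSegments, tSegRange *ranges, int maxRanges)
{
    int i, nRanges = 0;

    for (i = 0; i < nSegments; i++) {
        if (!visible [i])
            continue;
        if (nRanges > 0 && ranges [nRanges - 1].first + ranges [nRanges - 1].count == i)
            ranges [nRanges - 1].count++;     /* extends the previous run */
        else {
            assert (nRanges < maxRanges);
            ranges [nRanges].first = i;
            ranges [nRanges].count = 1;
            nRanges++;
        }
    }
    return nRanges;
}
```

If every segment owns a fixed number of indices, each run maps to one draw call starting at first times that number and covering count times that number of indices, so the per-frame CPU work is one pass over the visibility flags rather than a pass over every face.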

Another dubious statement of yours is that you spend most of the time for culling and lighting. Lighting? Shouldn’t this be done by the GPU? Is this the reason why you have to change the faces’ vertex colors so often?