Performance improvement

Hello,
I need to improve the performance of my application. To give some context: I have to display dB SPL levels from a number of enclosures (200+) onto surfaces.
Below is the main idea of how I do this:


[u]Done only one time at initialization:[/u]
- Per enclosure, build a cubemap (GL_R16, GLushort) using an FBO, storing the distances between the enclosure position/orientation and the fragment coordinates.

[u]Done each time scene is rendered:[/u]
- Draw only the front faces of the surfaces
- In an FBO with a GL_RGBA32F texture, draw the surfaces for each enclosure, using the appropriate cubemap to determine whether a fragment is occluded. Blending (GL_MAX in my example) is done here.
- Draw the surfaces to the default FBO using the previous texture (textureProj). This leaves pixels hidden from all enclosures at the default color used in the first step.
- Draw only the back faces of the surfaces
- Draw the floor grid
- Draw the enclosures

This works for 192 enclosures on my GTX260 at 20-30 fps, but on my HD3450 it only runs at 2 to 10 fps.

I must improve this application so that it works on low-end graphics cards, nothing more powerful than a GT520/NVS4200M/HD3670.

Currently, the complexity of the rendering process is (M + N*M + M + M) surface draws, with M surfaces and N enclosures.

Could you tell me if there is a way to improve this?

Make sure you REALLY need the precision offered by GL_RGBA32F. Unless it’s absolutely mandatory, switch to GL_RGBA16F to save bandwidth. This can have a huge impact on performance, especially on lower-end cards.

How do you draw? How’s your VAO/VBO/IBO setup?

Right, RGBA16F will save bandwidth and improve performance. But I really need to take a closer look at my precision requirements; I will come back with more information about that later.

For the rest of your question:

  • cubemaps are GL_R16 to keep 4.6 cm precision for detecting occlusions
  • currently, the VBOs all use floats, and the IBOs use ubyte or uint
  • each enclosure and surface has its own VAO so it can be drawn independently

currently, the VBOs all use floats, and the IBOs use ubyte or uint. each enclosure and surface has its own VAO so it can be drawn independently

Why leave out ushort? I’m not concerned about the format of the VBO contents since I’m already expecting floats in there. My question was aimed at whether you use one VAO/VBO/IBO per object, or only as many buffers as you really need to hold all objects.

Using a single VAO/VBO/IBO for each object results in one binding call for the VAO and one draw call per object. You can dramatically reduce these calls by using a single (or a few) VAOs for all objects and invoking, for instance, glDrawElementsBaseVertex. An even greater reduction of draw calls may be achieved with glMultiDrawElementsBaseVertex, if you can be sure that you don’t need to switch VAOs and shaders. Then you can assemble all indices, base vertices and vertex counts and draw the whole batch with a single call.

I already did this a few days ago.
Surprisingly, drawing the surfaces using their own VAOs (as you said, binding each one and sending a draw command) takes the same time as drawing all surfaces in one shot using glDrawElementsBaseVertex or glMultiDrawElementsBaseVertex.
So for convenience, I chose to keep a VAO per drawable object.

Well, of course it depends on whether your app is actually bound by bindings/draw calls. With a Radeon you can use GPU PerfStudio to get an estimate of what limits your app.

What thokra said: first determine what the bottleneck is (fill rate? data transfer? etc.). The steep drop to ~2 fps looks more like a fallback to software rendering, however.

You may also see a performance boost by drawing opaque geometry from front to back, to take advantage of the hardware’s early-z rejection test. The pixels of occluded geometry will then never be rasterized, invoke the shader, sample textures, etc. And if you’re really serious about this, you can pre-sort (per material/draw call) the polygons to minimize overdraw. Note: you may already be doing this depending on the BSP implementation.

How can I tell if the program falls back to software rendering?

I have made an improvement to my work: I draw a depth texture and use it to determine whether a fragment is visible. If not, it is discarded in the fragment shader.
This took me from ~2 fps to 6+ fps.

Ok, so you’re obviously not bound by draw calls. What you describe is also known as a z pre-pass, and it helps alleviate heavy fragment processing. IMHO, the fact that you get a noticeable increase may point to overly complex fragment shader code.

I noticed one thing that might help. How do you compute the distance? Do you use squared or linear distance (i.e. the length of a vector)?

For the cubemaps where distances are stored, they are computed using distance(p1, p2) and normalized against a maximum distance value. I remind you that distances are stored in a GL_R16/GLushort color texture.

For the fragment value of surfaces, I compute the distance from the fragment to the source and compare it with the one in the cubemap, to know whether the fragment is lit by that source. If so, I currently compute the angle between the source orientation and the vector from source to fragment (cosine * 0.5 + 0.5, to map it into [0,1]). I return a vec2, as the produced texture is GL_RG16F.

The produced texture is then used to redraw the surfaces into the default FBO, applying a simple formula to get a color interpolation.

I come back with new info:
when I changed my code to use glMultiDrawElementsBaseVertex instead of one glDrawElements call per surface, there was no gain at all.
GPU PerfStudio tells me that my application is GPU bound.

I think my main problem is that I have 192 point sources, and that’s a lot of operations… ^^’

If someone can help me… or just tell me that it is normal that I cannot increase my FPS, that would be nice!

Well… in order to get an answer, let me restate my problem:

  • I have 192 point sources
  • I have many surfaces
  • I have to compute light values on those surfaces for a specific wavelength
  • I may have to compute the sum of light values for many (100+) wavelengths
  • I have to deal with occlusion of surfaces by other surfaces.

So far, I have built a cubemap for each source, in which I store the distance from the source to each fragment.
I generate a depth texture to filter out fragments not visible from the scene POV.
Then I generate the texture of light values on the surfaces for each source, using both the cubemap and the depth texture to avoid computing values for fragments hidden from the scene POV or from the source.
Each texture is then blended with the previous one to get the sum of values.

For now, generating the result takes 2 ms per texture when an 800x600 window is full of visible, lit fragments. So it takes around 380 ms in total, i.e. 2.7 fps.

Could you help me with this topic, and let me know if you have better ideas or some hints to improve performance?

Thank you, and best regards!

Hmm, I’ve only read the summary problem description in your last post, but

Then I generate the texture of light values on the surfaces for each source, using both the cubemap and the depth texture to avoid computing values for fragments hidden from the scene POV or from the source.

this sounds like you are switching through the 192 cubemaps for each surface? Instead of rendering a pass for each of those, I guess you may gain some performance by doing, say, 4 or 8 sources in one pass.

  • I have to deal with occlusion of surfaces by some other surfaces.

Are you ignoring this aspect for now? Otherwise you also need some way to determine whether the path between source and surface is blocked or not, no?

It almost sounds to me as if it may be worthwhile to use a ray tracer for this problem (maybe Nvidia’s OptiX). With that you’d only ever compute values on visible surface parts, and occlusion between surfaces is handled nicely. I have absolutely no idea if it has a chance to be faster though :wink:

First, thanks for your answer !

this sounds like you are switching through the 192 cubemaps for each surface? Instead of rendering a pass for each of those, I guess you may gain some performance by doing, say, 4 or 8 sources in one pass.
Are you ignoring this aspect for now? Otherwise you also need some way to determine whether the path between source and surface is blocked or not, no?

For each source, I bind the corresponding cubemap and then draw the surfaces. The depth texture is also bound, but only once at the beginning, just after glUseProgram.
As I said, this is done to exclude fragments hidden from the screen POV (depth texture) and fragments hidden from the source POV (cubemap).
I must do this because the sources have an orientation (they are point sources but not homogeneous in space, so they have a position and an orientation).
All of this is already managed by my application. My main issue is that I now have to do this for many wavelengths, repeating these steps 100+ times, whereas they already take 380 ms in the worst case.

It almost sounds to me as if it may be worthwhile to use a ray tracer for this problem (maybe Nvidia’s OptiX). With that you’d only ever compute values on visible surface parts, and occlusion between surfaces is handled nicely. I have absolutely no idea if it has a chance to be faster though :wink:

I didn’t know about it; I will take a look. Clues like this really help me, even when they are not fully applicable to my problem, because they let me learn about other things and may lead me to a solution that fits.

So, thank you!