Would instancing be good in this case?

OniDaito · April 26, 2011, 3:22pm

Hi guys. I’m trying to render a pinboard over two heads. I’ve decided to render both the rear and front views to a 2048 x 768 window, which I’ll split over the two screens. The problem I’m having is that, according to gDebugger, I’m moving around 1815000 triangles! I’m only on a little GeForce 330M on OSX, so I’m guessing some optimisation is needed here.

I’ve run this through OpenGL Profiler and indeed, the calls to glDrawArrays (as I remember) take up the majority of the time.

Each pin is loaded into a VBO and then drawn. There are 60 x 49 pins in that image and each one has 180 faces (MeshLab doesn’t tell me the exact triangle count), but adding it up comes pretty close to the figure given by gDebugger.

In addition to drawing the colour step, there is also a step for linear depth (in order to set up some SSAO). At the moment, I’m getting around 12-15fps. I’d like to get it to 30.
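As a rough check of the triangle count quoted above, the sums can be written out. This is only a sketch: the pin grid, face count and gDebugger figure come from the post, but the factor of four (two views times two passes, each drawing every pin once) is an assumption, as is treating each face as one triangle.

```python
# Figures quoted in the post.
PIN_COLUMNS = 60
PIN_ROWS = 49
FACES_PER_PIN = 180
REPORTED_TRIANGLES = 1_815_000  # per-frame figure from gDebugger

pins = PIN_COLUMNS * PIN_ROWS
faces_one_pass = pins * FACES_PER_PIN

# Assumption: every pin is drawn once per view (front, rear)
# and once per pass (colour, linear depth).
ASSUMED_DRAWS_PER_PIN = 2 * 2
faces_per_frame = faces_one_pass * ASSUMED_DRAWS_PER_PIN

# How close the estimate is to the profiler's number.
ratio = REPORTED_TRIANGLES / faces_per_frame

print(pins, faces_one_pass, faces_per_frame, round(ratio, 2))
```

Under those assumptions the estimate lands in the same ballpark as the reported figure, which fits the "pretty close" remark; the gap could come from culling or from faces that are not single triangles.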

I thought about trying a non-linear depth buffer and reading the depth buffer and colour buffer from the FBO in one go to save a pass, but commenting out the depth pass for now seemed to make little difference (oddly).

I tried ‘pseudo instancing’ (I think) by passing in the transformation matrix as a texture to my vertex shader. This didn’t give much of a speedup.

As OSX has limited support (annoyingly), the only method I can see to get more speed is to use GL_ARB_instanced_arrays somehow, but I’m not exactly sure whether this will improve things. There may be something else I can do to get things a little faster, but I’m not sure what. I’ve gotten the triangles per pin down about as far as I can. Any thoughts, chaps? Cheers

denizdiktas · April 27, 2011, 2:48pm

I guess by ‘pseudo-instancing’ you mean calling glDrawArrays 60x49 times, in which case the bottleneck would be in sending commands to the driver. With instanced arrays you should see better performance than with that method. But with so many triangles and a big viewport, your performance might be pixel-bound (depending on the view parameters, of course).

Also, if your pins are placed according to a predefined rule (say a function, or regular spacing as it seems in your screenshot), there is no need to store the transformation matrix for each pin in an additional texture. You can read the value of the gl_InstanceID variable in the vertex shader to compute the transformation for each pin on the fly, avoiding extra texture fetches.
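To illustrate the idea of deriving a pin's placement from its instance index, here is a CPU-side Python sketch of the mapping a vertex shader could perform with gl_InstanceID. The row-major numbering, the `SPACING` value and the `pin_offset` name are assumptions for illustration; only the 60 x 49 grid comes from the thread.

```python
COLUMNS = 60   # pins per row, from the original post
ROWS = 49
SPACING = 1.0  # hypothetical distance between pin centres


def pin_offset(instance_id, columns=COLUMNS, spacing=SPACING):
    """Return the (x, y) offset of a pin, assuming row-major numbering."""
    row, col = divmod(instance_id, columns)
    return (col * spacing, row * spacing)


# One offset per instance; in a shader this would be computed per vertex
# from the instance index instead of being stored anywhere.
offsets = [pin_offset(i) for i in range(COLUMNS * ROWS)]
```

In GLSL the same mapping would use integer division and modulo on the instance index, so no per-pin matrix needs to be fetched as long as the layout stays regular.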

Ilian_Dinev · April 27, 2011, 10:33pm

Try putting the whole object in a display list, and also try a single VBO (single draw-call). (pre-transform the pins)
That will quickly show you if you’re cpu-bottlenecked, and if instancing is necessary. Still, 3k drawcalls per frame shouldn’t be too heavy on the cpu.
Modern Geforces and Radeons handle MRT and MSAA quite well, that’s why you didn’t see a speedup when you removed the linear-depth output.
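The "single VBO, pre-transformed pins" test suggested above could be prepared roughly like this. It is a minimal sketch, not anyone's actual loader: the `pretransform` helper, the one-triangle stand-in mesh and the offsets are all hypothetical, and only translation is applied.

```python
def pretransform(pin_vertices, offsets):
    """Return one flat vertex list with a translated copy of the pin per offset."""
    merged = []
    for ox, oy, oz in offsets:
        for x, y, z in pin_vertices:
            merged.append((x + ox, y + oy, z + oz))
    return merged


# Hypothetical stand-in for a pin mesh: a single triangle.
pin = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Hypothetical 3 x 2 grid of pin positions.
offsets = [(col * 2.0, row * 2.0, 0.0) for row in range(2) for col in range(3)]

merged = pretransform(pin, offsets)
```

The merged list would then be uploaded once and drawn with a single call, which separates the cost of issuing draw calls from the cost of processing the triangles.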

OniDaito · April 28, 2011, 6:03am

Ok, I’m not sure what you mean about the linear depth output bit; everything still needs to be rendered anyway, I’d have thought.

Well, there is a possibility I could reduce the poly count, but it doesn’t look great:

http://farm6.static.flickr.com/5268/5664472054_8d420c7e9f_z.jpg

This is with 100 faces as opposed to 180. You can begin to see the polygon outlines, which is not nice. Also, you can see I’ve reduced the overall number of pins. This double view runs at 30fps. With the original number of pins, 180 faces vs 100 faces makes almost no difference in speed. They both go at around 10fps. It’s almost as if there is a cutoff point beyond which you get no speedup or change at all.

I’m thinking the bottleneck could very well be sending the commands.

I should point out this is static at the moment. The only thing that will change is the Z value for each pin (i.e., they get ‘pushed’ in and out), so a changing value will need to be passed in for sure.

By display list, do you mean something like a vertex array?

Also, yes, one VBO could be a cool idea IF I could still move each individual pin (I suppose I could with a pre-transform).

I don’t think it’s a CPU thing really. The graphics card is only the GeForce 330M 512MB in a MacBook Pro, but it should certainly run a little faster.

BionicBytes · April 28, 2011, 7:33am

You never mentioned before that the pins could ‘move’.
A single display list is no longer an option, since this is now a dynamic situation rather than a static model.

BionicBytes · April 28, 2011, 7:42am

Instead of uploading each pin into its own VBO each frame, perhaps split the VBO up into separate parts:
One part is the geometry of a pin. As all pins are the same cylinder, this will never change.
The second part is some sort of transform per pin, such as x,y,z position. This will be dynamic, and will need to be uploaded to a VBO each frame. You could use ‘instancing’ as a technique to access the dynamic VBO and retrieve the transform for each pin.
This is better than what you have already since the amount of data to upload per frame will be less. Also, you can switch between alternate sets of ‘instance’ data, so the GL can be reading one set and you could be updating the other.
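To give a feel for why splitting static geometry from per-pin data cuts the upload, here is a rough size comparison. It is a sketch under stated assumptions: non-indexed triangles, position-only vertices, 32-bit floats, and 180 triangles per pin (the thread only says 180 faces). The array names are hypothetical.

```python
from array import array

PINS = 60 * 49

# Assumption: non-indexed triangles, 3 vertices each, 3 floats per vertex.
TRIANGLES_PER_PIN = 180
FLOATS_PER_PIN = TRIANGLES_PER_PIN * 3 * 3

pin_geometry = array("f", [0.0] * FLOATS_PER_PIN)  # static, uploaded once
pin_depths = array("f", [0.0] * PINS)              # dynamic, one z per pin

static_bytes = len(pin_geometry) * pin_geometry.itemsize
dynamic_bytes = len(pin_depths) * pin_depths.itemsize

# For comparison: every pin's geometry stored pre-transformed in one buffer.
pretransformed_bytes = static_bytes * PINS

print(static_bytes, dynamic_bytes, pretransformed_bytes)
```

With these assumptions the per-frame update is a few kilobytes of z values, against roughly 19 MB if every pin's transformed vertices had to be rewritten whenever a pin moved.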

denizdiktas · April 28, 2011, 9:17am

I agree with BionicBytes. Display lists are troublesome once you have dynamic geometry, although it depends on how the pins move (if they simply translate all together you could still use display lists). But instancing would be the best and most elegant way to go. Also consider the following: using the transform feedback mechanism you could update the pin locations entirely on the GPU, so you won’t have to transfer a single byte from CPU to GPU during the animation (except in the initialization phase). There is a good example of this in the latest edition of the OpenGL SuperBible.

By the way: which OpenGL profile are you using, core or compatibility? I recommend using the core profile. In the core profile, the OpenGL driver is optimized further (less state checking and tracking). But remember that you can’t use deprecated functionality in the core profile: the matrix stack, display lists, the fixed-function pipeline and many other things won’t be available. You will have to implement them on your own, or use a framework that does the heavy lifting for you.

Ilian_Dinev · April 28, 2011, 10:51am

I recommended trying display lists simply to check, in a static situation, what the best possible framerate could be. It’s an easy way to stop wondering what to do if there’s little improvement.

denizdiktas · April 28, 2011, 2:29pm

Yes of course, I got it already. This makes sense.
That’s what I usually do as well: quickly check the performance with display lists to see what’s going on. We had some performance issues with a commercial game engine (I don’t want to mention its name): I quickly implemented a very brute-force display list to show the implementors that even using a display list to send the whole scene to the GPU was much better than their supposedly optimal culling algorithm. We are still waiting for their response.

OniDaito · April 30, 2011, 9:40am

Thanks for the input! As a test, and I think this is what you were getting at, I’ve loaded 60 x 40 pins into a single obj file. This is then loaded into a VBO and run in the static case. I get around 15fps.

I’ve then used the same pin, loaded into a VBO, with the VBO drawn 60x40 times. The framerate drops to 9fps.

These pins could stand to lose almost half of their faces, which might get me closer to the 20fps mark without losing much in visual appearance, IF I can alter the z position. Seems like you guys were on to something! Thanks!

BionicBytes · April 30, 2011, 1:11pm

The z position could come from another, separate VBO that is uploaded only when a pin’s z position changes.
To make things faster, you could have two of these buffers, one for reading and the other for writing.
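The two-buffer idea can be sketched on the CPU side as a simple ping-pong pair. This is an illustration only: the `PingPongDepths` class and its members are hypothetical names, and plain Python lists stand in for the two GPU buffers.

```python
class PingPongDepths:
    """Two depth arrays: one read for drawing, one written for the next frame."""

    def __init__(self, pin_count):
        self._buffers = [[0.0] * pin_count, [0.0] * pin_count]
        self._read_index = 0

    @property
    def read_buffer(self):
        return self._buffers[self._read_index]

    @property
    def write_buffer(self):
        return self._buffers[1 - self._read_index]

    def swap(self):
        """Make the freshly written buffer the one that is read next."""
        self._read_index = 1 - self._read_index


depths = PingPongDepths(pin_count=4)
depths.write_buffer[2] = 0.5         # update data for the next frame
before_swap = depths.read_buffer[2]  # the drawn data is untouched so far
depths.swap()
after_swap = depths.read_buffer[2]   # the update is now visible to drawing
```

One thing to keep in mind with this scheme: after a swap, the new write buffer still holds values from two frames ago, so either every pin is rewritten each frame or the changed values are copied across first.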

dukey · April 30, 2011, 3:07pm

Reduce the poly count of the model.
~2 million triangles is a lot for any hardware really.

180 faces for a pin … seems to me like massive overkill.

OniDaito · May 1, 2011, 12:28pm

Yeah, I think I was cheating when I said I had a speedup when using one VBO. Turns out that I’d recreated the model incorrectly and it was rotated by 90 degrees (I believe this happens with OBJ files). Long story short, some of the model was being culled out, hence the speedup.

I’ve reduced the polycount to around 80 triangles per pin. It has given me a couple of extra frames. The real speed increase has been to cull the tips of the pins in the reverse view.

This actually makes the scene look better, as being able to see the heads of the pins from the back makes the scene look confusing. I’m up to about 15fps again with both the scenes. It’s an improvement, certainly. I’ll keep looking for these extra frames, but it seems like lots of small adjustments will be needed.

Cheers for the advice so far!