optimising GL_POINTS rendering

RenderMonkey · July 3, 2003, 10:44pm

I’m trying to render as many points as I can and am having trouble with optimising…
I’m currently trying to render 30000 points, although this varies with the platform I run on. The values for each point (colour and position) change every frame, so unfortunately there is no chance to use display lists.

The original implementation (at the start of the code) is by far the fastest, even though looking at the OpenGL documentation it should be the slowest…

The simple for->glColor->glVertex->next loop takes 9.5ms; using the glColor4fv and glVertex3fv calls slows the render down to 11ms.
I’ve also tried vertex arrays, two types, the OpenGL version and the OpenGL_EXT version (see code segment below), which I also thought should be faster, but rendering time now rises to 13.5ms for both implementations (are there any performance differences between these two implementations?).

So am I missing something very obvious? Or am I getting the performance people expect…

Platforms this has been tested on include:

NVidia Quadro
NVidia GeForce 440 Go
SGI Onyx IR3
SGI Onyx IR4

Any ideas would be very welcome!

  /* Draw Points */
  glEnable( GL_POINT_SMOOTH );
  glPointSize( sh->pntSize );

  if( sh->useVertexArray == 0 )
  {
    /* traditional immediate mode */
    glBegin( GL_POINTS );
    for( i = 0 ; i < sh->numStars ; i++ )
    {
      /* this is faster than... */
      glColor3f( sh->col[i][0], sh->col[i][1], sh->col[i][2] /*, sh->col[i][4] */  );
      glVertex3f( sh->pos[i][0], sh->pos[i][1], sh->pos[i][2] );

      /* than this ... */
      // glColor4fv( sh->col[i] );
      // glVertex3fv( sh->pos[i] );
    }
    glEnd();

  } else {
  
    int loop = 0;
    int numLoops;
    int loopVerts;
    
/*--- vertex arrays version 1 ---*/
    
    /*
    glGetIntegerv( GL_MAX_ELEMENTS_VERTICES, &loopVerts );
    
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_COLOR_ARRAY );

    glColorPointer(  4, GL_FLOAT, 0, sh->col );
    glVertexPointer( 3, GL_FLOAT, 0, sh->pos );
    
    numLoops = sh->numStars / loopVerts;
    for( i = 0 ; i < numLoops ; i++ )
    {
      glDrawArrays( GL_POINTS, loop, loopVerts );
      loop += loopVerts;
    }
    glDrawArrays( GL_POINTS, loop, (sh->numStars - loop) );
    
    
    glDisableClientState( GL_VERTEX_ARRAY );
    glDisableClientState( GL_COLOR_ARRAY ); */
    
    
/*--- vertex arrays version 2 ---*/
    
    glEnable( GL_VERTEX_ARRAY_EXT );
    glEnable( GL_COLOR_ARRAY_EXT );
    
    glVertexPointerEXT( 3, GL_FLOAT, 0, sh->numStars, sh->pos );
    glColorPointerEXT(  4, GL_FLOAT, 0, sh->numStars, sh->col );
    
    glDrawArraysEXT( GL_POINTS, 0, sh->numStars );
  
    glDisable( GL_VERTEX_ARRAY_EXT );
    glDisable( GL_COLOR_ARRAY_EXT );

  }

Relic · July 4, 2003, 12:07am

I wouldn’t expect the -fv calls to be slower. That’s strange.

Your vertex arrays version #1 is broken or effectively the same as #2.
You check for GL_MAX_ELEMENTS_VERTICES but that only applies to glDrawRangeElements.

Other things to try:

vertex array range extension,
vertex buffer object extension.

Both should be faster than generic vertex arrays.

30000 points/9.5 ms is 3.158 Mpoints/s.
I would expect 16+ Mpoints/s from the given NVIDIA boards.
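
For anyone wanting to redo that arithmetic with their own numbers, here is a minimal sketch in plain C. The function names are made up for illustration and nothing here touches OpenGL; it only converts between a draw time and a point rate.

```c
#include <assert.h>

/* Convert a point count and a draw time in milliseconds into
 * millions of points per second (Mpoints/s). */
double mpoints_per_second(long num_points, double draw_ms)
{
    assert(num_points >= 0);
    assert(draw_ms > 0.0);
    double seconds = draw_ms / 1000.0;
    return (double)num_points / seconds / 1.0e6;
}

/* Draw time in milliseconds that a given rate (in Mpoints/s)
 * would imply for num_points points. */
double draw_ms_for_rate(long num_points, double mpoints_s)
{
    assert(num_points >= 0);
    assert(mpoints_s > 0.0);
    return (double)num_points / (mpoints_s * 1.0e6) * 1000.0;
}
```

With the figures from this thread, mpoints_per_second(30000, 9.5) gives roughly 3.16, and draw_ms_for_rate(30000, 16.0) gives a little under 2ms, which is the draw time a 16 Mpoints/s rate would correspond to.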

How fast is your host computer? Maybe you’re CPU limited.
Did you compare the speed to a non-changing display list?
How does it perform with aliased points?

[This message has been edited by Relic (edited 07-04-2003).]

RenderMonkey · July 4, 2003, 12:34am

The computer I’m doing much of my testing on at the moment is a 1.6GHz Dell laptop (NV 440).
The computations behind the vertex positions/colours take ~30ms, and the draw time is pretty much unaffected whether I do the calcs or not (this version will be run on multi-CPU machines with the vertex calcs threaded over multiple CPUs, so I can achieve >60Hz for that side of things no problem).

I replaced the glVertex calls with glColor calls, which I believe should show whether any CPU limitation is present, and the timings are identical…

I tried the vertex arrays #1 using glDrawArrays( GL_POINTS, 0, sh->numStars );
and got the same timings as I do with the loop.

You mention vertex array extensions to try; is this not what I was doing in #2?

I’ll take a look into your other suggestions … many thanks!

Relic · July 4, 2003, 3:37am

“I tried the vertex arrays #1 using glDrawArrays( GL_POINTS, 0, sh->numStars );
and got the same timings as I do with the loop.”

Of course, because GL_MAX_ELEMENTS_VERTICES may be huge, so the loop is never entered and the final glDrawArrays call draws everything, which is effectively the same thing.
You’ve mistakenly assumed that GL_MAX_ELEMENTS_VERTICES has anything to do with glDrawArrays.
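
To make that concrete, here is a small sketch in plain C of the range splitting that the loop in #1 performs. DrawRange and split_into_batches are hypothetical names, and the limit of 1048576 used in the note below is only an illustrative value, not something reported by any particular driver. Each range would correspond to one glDrawArrays( GL_POINTS, first, count ) call.

```c
#include <assert.h>
#include <stddef.h>

/* One glDrawArrays-style range: first vertex and vertex count. */
typedef struct {
    int first;
    int count;
} DrawRange;

/* Split num_verts vertices into ranges of at most batch_size each.
 * Writes up to max_ranges entries into out and returns how many
 * were written. Unlike the original loop, no empty trailing range
 * is produced when num_verts is an exact multiple of batch_size. */
int split_into_batches(int num_verts, int batch_size,
                       DrawRange *out, int max_ranges)
{
    assert(num_verts >= 0 && batch_size > 0 && out != NULL);
    int n = 0;
    int first = 0;
    while (first < num_verts && n < max_ranges) {
        int remaining = num_verts - first;
        out[n].first = first;
        out[n].count = remaining < batch_size ? remaining : batch_size;
        first += out[n].count;
        n++;
    }
    return n;
}
```

With 30000 vertices and a limit of 1048576, this yields a single range covering all 30000 vertices, which is exactly the one-call case. A limit smaller than the vertex count, say 4096, would produce several ranges instead.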

“You mention vertex array extensions to try; is this not what I was doing in #2?”

No, you used the vertex array extension available for OpenGL 1.0, which is pretty old and was merged into the OpenGL core in version 1.1 with slight simplifications to the API. glDrawArrays and glDrawArraysEXT do the same thing.

Read again:

Vertex array range (VAR), an NVIDIA extension.
Vertex buffer object (VBO) extension, a brand-new multi-vendor extension with a simpler API (recommended!).

Both allow you to store vertices in a memory location that the GPU can read faster.

Other things:
If this is a double-buffered animation, make sure you measure with wait on vertical blank (vsync) switched off in the display control panel.
Otherwise you can only achieve animation rates that are integer divisors of the refresh rate (like 60Hz, 30, 20, 15, 12, 10, …).
Try benching without SwapBuffers.
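
The vsync effect can be sketched with a simplified model in plain C. It assumes a constant per-frame time, plain double buffering, and a swap that always waits for the next vertical blank; the function names are made up for illustration.

```c
#include <assert.h>

/* Number of whole refresh intervals one frame occupies when the
 * buffer swap waits for vertical blank. */
int refresh_intervals_per_frame(double frame_ms, double refresh_hz)
{
    assert(frame_ms > 0.0);
    assert(refresh_hz > 0.0);
    double period_ms = 1000.0 / refresh_hz;
    int n = (int)(frame_ms / period_ms);   /* full intervals used */
    if ((double)n * period_ms < frame_ms) {
        n++;                               /* round up to next vblank */
    }
    return n < 1 ? 1 : n;
}

/* Resulting animation rate in Hz under the same assumptions. */
double vsynced_rate_hz(double frame_ms, double refresh_hz)
{
    return refresh_hz / refresh_intervals_per_frame(frame_ms, refresh_hz);
}
```

On a 60Hz display, this model reports 60Hz for a 9.5ms frame, 30Hz for a 20ms frame and 20Hz for a 40ms frame, so a small change in draw time near an interval boundary can halve the measured frame rate.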

[This message has been edited by Relic (edited 07-04-2003).]

Ysaneya · July 4, 2003, 3:53am

But none of these extensions are available on the Onyx IRx, so if your target platform includes these (which I suspect, since you mention multi-CPU machines!), you pretty much have to stick to OpenGL 1.1.

The Onyx internally converts all the vertex arrays into immediate mode, so you might gain something by using IM (ie. glVertex and glColor) directly. If possible, use display lists, but that’s assuming you don’t have a lot of points (only 15 Mb are available for DLs on these systems…).

Y.

RenderMonkey · July 5, 2003, 10:01am

I’m trying to keep everything at the OpenGL 1.1 level so I don’t have any issues running on Onyx IR platforms. But on my laptop I’m looking at the various extensions that are available, and will adjust everything as necessary.

After re-reading my OpenGL 1.2 manual I spotted that the vertex array extension was a 1.0 thing that got included into 1.1, makes sense that the speeds should be identical!

I disabled antialiasing and got no speed-up at all.

The app is double buffered, but my timing code is wrapped tightly around the clearing of the channel and the rendering of the points, hence the timings of 9ms I’m getting rather than 16.67ms or 33.33ms.

I’ll try using display lists to see whether I get better overall performance rendering the points, which would indicate some throughput issue with immediate mode. Unfortunately I can’t use DLs in the real app, as the values used by glColor and glVertex change every frame, but it may suggest why I’m seeing performance that is so dramatically different from the NVidia benchmarks…

Also I suspect my drivers on my laptop may not be able to fully utilise the agp port, and might be throttled down in some way so I’ll look at upgrading to the latest drivers and check the settings…

thanks again!

Robert_Osfield · July 7, 2003, 4:33am

I have done similar work with rendering large arrays of points, all with unique colours, with the points computed on every frame on the host CPU, tested on both PCs and Onyxes.

I used interleaved vertex arrays to ensure that the writes to the arrays were efficient w.r.t. the CPU cache, and used a straightforward glDrawArrays to minimize the OpenGL calling overhead. Using vertex arrays also makes taking advantage of the ARB_vertex_buffer_object extension very easy, and it can be done without too much complication in maintaining the straight OpenGL 1.1 route.
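
As a rough sketch of what an interleaved layout could look like in C (StarVertex and fill_stars are hypothetical names, and the values written are placeholders rather than real star calculations), assuming four colour floats followed by three position floats per point:

```c
#include <assert.h>
#include <stddef.h>

/* One point: colour and position stored side by side, so a
 * per-frame update touches memory sequentially. */
typedef struct {
    float col[4];   /* r, g, b, a */
    float pos[3];   /* x, y, z    */
} StarVertex;

/* Hypothetical per-frame update writing every vertex in order.
 * The values are placeholders for the real computation. */
void fill_stars(StarVertex *stars, size_t count, float t)
{
    assert(stars != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        stars[i].col[0] = 1.0f;
        stars[i].col[1] = 1.0f;
        stars[i].col[2] = 1.0f;
        stars[i].col[3] = 1.0f;
        stars[i].pos[0] = (float)i + t;
        stars[i].pos[1] = 0.0f;
        stars[i].pos[2] = 0.0f;
    }
}

/*
 * With a GL context current, OpenGL 1.1 can read this layout by
 * passing the struct size as the stride to both pointer calls:
 *
 *   glColorPointer(  4, GL_FLOAT, sizeof(StarVertex), stars[0].col );
 *   glVertexPointer( 3, GL_FLOAT, sizeof(StarVertex), stars[0].pos );
 *   glDrawArrays( GL_POINTS, 0, (GLsizei)count );
 */
```

The GL calls are left in a comment so the sketch compiles without GL headers. The stride argument is what lets the separate colour and vertex pointers walk the same interleaved buffer.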

A major problem with the Onyx is that the bandwidth to the graphics pipe is a bottleneck. Using display lists is really important, as it overcomes this by downloading the objects once to be reused many times; I’ve often seen a doubling in throughput when moving to display lists. However, if your data changes dynamically every frame, creating a display list is not an option. You could try it, but it’ll be slower, as the display list will need to be created and downloaded on every frame, only to be used once.

The T&L performance of the Onyx can also be a bottleneck; it just can’t match a modern PC graphics card, even the low-end cards. The real strength of the Onyx lies in its FSAA implementation, which alas you aren’t able to appreciate in this case, since the bottleneck almost certainly lies in the bandwidth to the graphics pipe.

In the end I think you’ll just have to settle for modest performance on the Onyx, and try to make the code clean enough that you can utilise VBO under Windows/Linux/OSX. The latter probably won’t even need optimizing, as they’ll probably outperform the Onyx as is.

Robert.

RenderMonkey · July 7, 2003, 5:39pm

Well, using the vertex array range extension I got a +100% improvement in performance on my laptop, so keeping down this route is certainly paying off Linux-wise.

With respect to running the same stuff on Onyx IR3/4, I tried drawing quads rather than points and got a significant performance improvement. That is interesting, as I’m calling glVertex and glColor 4 times rather than once per object, which suggests that the number of points is not the bottleneck but rather the actual rendering of the points themselves on IR…
Increasing the point size much above 5 pixels significantly degraded performance too…

thanks for your help!

m21 · July 8, 2003, 6:17am

Originally posted by RenderMonkey:
Increasing the point size much above 5 pixels significantly degraded performance too…

It is documented (I wish I could find the reference right now – I thought it was in the IRIX OpenGL manpages) that rendering of points at arbitrary sizes (either fractional or above a certain limit) might fall off the fast path. This is affected by the “niceness” settings. This is also true for commodity graphics hardware.