How to use OpenGL instancing?

Hi, folks,
I need to draw 16k tori. With plain glDrawElements() it costs about 15 ms, so I tried glDrawElementsInstancedEXT(), but I can't get all of the tori to draw; only 2 show up. Can anyone give me any advice? Here is the code snippet. Thanks very much!
void Demo::_initGL()
{
    GLuint vShader = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(vShader, 1, &vertexShader, 0);
    glCompileShader(vShader);

    program = glCreateProgram();
    glAttachShader(program, vShader);

    glLinkProgram(program);

    // check if program linked
    GLint success = 0;
    glGetProgramiv(program, GL_LINK_STATUS, &success);

    if (!success) {
        char temp[256];
        glGetProgramInfoLog(program, 256, 0, temp);
        printf("Failed to link program:\n%s\n", temp);
        glDeleteProgram(program);
        program = 0;
    }
}
void Demo::render()
{
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, texture1);
    glActiveTexture(GL_TEXTURE1);
    glBindTexture(GL_TEXTURE_2D, texture2);

    GLint texLoc;
    texLoc = glGetUniformLocationARB(program, "posTex");
    glUniform1iARB(texLoc, 0);
    texLoc = glGetUniformLocationARB(program, "quatTex");
    glUniform1iARB(texLoc, 1);

    glUseProgram(program);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, indexVBO);
    glBindBufferARB(GL_ARRAY_BUFFER, vertexVBO);
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), 0);
    glNormalPointer(GL_FLOAT, sizeof(Vertex), (void*)(sizeof(Vertex)/2));
    glDrawElementsInstancedEXT(GL_QUADS, 150*4, GL_UNSIGNED_INT, 0, nBodies);
    glDisableClientState(GL_VERTEX_ARRAY);
    glDisableClientState(GL_NORMAL_ARRAY);
    glUseProgram(0);
}

// Vertex shader.
const char * vertexShader =
    "#version 120\n"
    "#extension GL_EXT_gpu_shader4 : enable\n"
    "uniform sampler2D posTex;\n"
    "uniform sampler2D quatTex;\n"
    "const float PI = 3.1415926;\n"
    "varying vec4 quat;\n"
    "void main(void)\n"
    "{\n"
    "    vec4 centerOfMass;\n"
    "    vec2 texCrd = vec2(gl_InstanceID, 0);\n"
    "    centerOfMass = texture2D(posTex, texCrd);\n"
    "    vec4 rotatedPos = gl_Vertex + centerOfMass;\n"
    "    rotatedPos.w = 1.0;\n"
    "    gl_Position = gl_ModelViewProjectionMatrix * rotatedPos;\n"
    "}\n";

" vec2 texCrd = vec2(gl_InstanceID, 0);
"
" centerOfMass = texture2D(posTex, texCrd);
"

“texture2D” takes normalized texture coordinates. gl_InstanceID is an integer. So it’s only ever going to pull one texture coordinate. Also, you should probably be using 1D textures here.

Also, don’t forget that there are two main types of GPU-supported geometry instancing:

  1. draw-call instancing (ARB_draw_instanced / EXT_draw_instanced), where the shader fetches its own per-instance data, e.g. from a texture indexed with gl_InstanceID, and
  2. instanced arrays (ARB_instanced_arrays), where per-instance data is fed in as vertex attributes with an attribute divisor.

The latter can be faster, which makes some intuitive sense, because it should stream better.

Thanks very much!
How do I get the normalized texture coordinate in the shader?
I can’t use glTexCoord2f(); posTex is a texture that stores the position data for nBodies instances.

What Alfonse was referring to for your ARB_draw_instanced code is that texture2D takes a texcoord in the 0…1 range. gl_InstanceID is in the 0…num_instances-1 range. So you need to map this to 0…1 range (or 0…f, if you’re not using the full width of the 2D texture).
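For example, if posTex were a 2D texture that stores one position per instance along a single row, the mapping might look roughly like this in the vertex shader (texWidth is an assumed uniform you would set from the application; the 0.5 offsets sample texel centers):

uniform float texWidth;   // assumed: number of texels along the row, set from the app
vec2 texCrd = vec2((float(gl_InstanceID) + 0.5) / texWidth, 0.5);
vec4 centerOfMass = texture2D(posTex, texCrd);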

Another solution is to just use a texture buffer which doesn’t require normalized texcoords as input. You just specify the absolute texel index (0…N-1).
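A minimal sketch of that approach, assuming posTex is re-declared as a samplerBuffer bound to a texture buffer object holding one vec4 per instance:

uniform samplerBuffer posTex;   // texture buffer with one vec4 position per instance
vec4 centerOfMass = texelFetchBuffer(posTex, gl_InstanceID);   // or texelFetch() with newer GLSL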

Now with ARB_instanced_arrays, you don’t even have to mess with any of this (no texture lookup) because the right data is just fed into your shader automatically in vertex attributes.
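A rough sketch of that path, assuming a generic attribute named instancePos and a VBO instancePosVBO holding one vec4 per instance (both names are made up for illustration):

// C++ side: advance this attribute once per instance instead of once per vertex
GLint loc = glGetAttribLocation(program, "instancePos");
glBindBuffer(GL_ARRAY_BUFFER, instancePosVBO);
glEnableVertexAttribArray(loc);
glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisorARB(loc, 1);   // divisor of 1 = per-instance data

// GLSL side (sketch):
// attribute vec4 instancePos;
// gl_Position = gl_ModelViewProjectionMatrix * (gl_Vertex + vec4(instancePos.xyz, 0.0));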

Thanks a lot!
Now I find the exe file can’t be executed from the release folder,
but in debug mode in Visual Studio I can run the program,
even though I already copied shader.obj into the release folder.

I still have a question. I bound 2 textures. Why can I only access the first texture?
// Vertex shader.
const char * vertexShader =
    "#version 120\n"
    "#extension GL_EXT_gpu_shader4 : enable\n"
    "uniform sampler2D posTex;\n"
    "uniform sampler2D quatTex;\n"
    "const float PI = 3.1415926;\n"
//  "varying vec4 quat;\n"
    "void main(void)\n"
    "{\n"
    "    vec4 centerOfMass;\n"
    "    vec2 texCrd;\n"
    "    texCrd = vec2((gl_InstanceID/128)/128.0, (gl_InstanceID-gl_InstanceID/128*128)/128.0);\n"
    "    centerOfMass = texture2D(posTex, texCrd);\n"
    "    vec4 quat = texture2D(quatTex, texCrd);\n"
    "}\n";
I found that quat’s value is the same as centerOfMass.
I set these 2 textures like this. It’s weird.
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture1);
texLoc = glGetUniformLocationARB(program, "posTex");
glUniform1iARB(texLoc, 0);

glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_2D, texture2);
texLoc = glGetUniformLocationARB(program, "quatTex");
glUniform1iARB(texLoc, 1);

It’s not immediately obvious what you’re doing wrong. It works for me.

Perhaps post a short GLUT test program that illustrates your problem, so folks can try it and help you out.

I know the reason.
I should put glUseProgram(program) in front of the glActiveTexture(GL_TEXTURE0) block, so the program is already active when glUniform1iARB() sets the sampler uniforms (glUniform*() applies to the currently active program).
It works fine now.
Thanks.
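For reference, a sketch of the corrected ordering (the same calls as before, just with the program made current first):

glUseProgram(program);

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture1);
texLoc = glGetUniformLocationARB(program, "posTex");
glUniform1iARB(texLoc, 0);   // applies to the currently active program

glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_2D, texture2);
texLoc = glGetUniformLocationARB(program, "quatTex");
glUniform1iARB(texLoc, 1);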

And I measured the timing of glDrawElementsInstancedEXT(GL_QUADS, 150*4, GL_UNSIGNED_INT, 0, nBodies) versus glCallList(torus_display_list), and I found the former is slower than the latter. Is it because there are too many GL_QUADS per primitive?
The glCallList way is like this:
for (int i = 0; i < nBodies; i++) {
    glTexCoord2f(i, 0);
    glCallList(torus_display_list1);
}
I use the same tessellation for both methods.
Can anyone give me any information?

It is expensive for the OpenGL driver to bind VBO buffers.

This is especially true as you have more batches per frame. You end up wasting a lot of your time on “CPU work” in the GL driver setting up to render a batch.

If your buffers aren’t large, even client arrays draw calls are faster than VBO draw calls.

The only way I’ve been able to get even close to display list performance with explicit draw calls is by using NVidia’s bindless extensions (in particular, NV_vertex_buffer_unified_memory). For example, see this simple test prog. The main thing this extension lets you do is provide 64-bit VBO addresses to the GPU instead of VBO handles. That’s it. No rocket science. Very, very simple.
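To give a feel for it, here is a minimal sketch of that bindless path applied to the vertex/normal/index buffers from the render() code above (vertexBufferSize and indexBufferSize are assumed sizes you track yourself; error checking omitted):

// One-time setup: make the buffers resident and query their 64-bit GPU addresses
GLuint64EXT vtxAddr, idxAddr;
glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vtxAddr);

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVBO);
glMakeBufferResidentNV(GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &idxAddr);

// At draw time: hand GL addresses instead of buffer handles
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);

glVertexFormatNV(3, GL_FLOAT, sizeof(Vertex));
glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, vtxAddr, vertexBufferSize);
glNormalFormatNV(GL_FLOAT, sizeof(Vertex));
glBufferAddressRangeNV(GL_NORMAL_ARRAY_ADDRESS_NV, 0, vtxAddr + sizeof(Vertex)/2,
                       vertexBufferSize - sizeof(Vertex)/2);
glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, 0, idxAddr, indexBufferSize);

glDrawElementsInstancedEXT(GL_QUADS, 150*4, GL_UNSIGNED_INT, 0, nBodies);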

Given that the main purpose of this extension is simply eliminating CPU-side cache misses caused by the driver having to translate VBO handles into VBO addresses (think address = buffer_list[ handle ]), then putting two-and-two together, one can surmise that this is the key slowdown at issue between “naked” classic (non-bindless) VBO draw calls and display list draw calls. To fully get rid of this inefficiency, you have two choices:

  1. display lists, or
  2. bindless VBOs (currently NVidia-only however)

Also note that in many instances, client-arrays batches will bench faster than naked classic (i.e. non-bindless) VBO batches. If you use VBOs naively (1-2 VBOs per batch), they are typically slower than client arrays. So how do you get client-arrays perf while still streaming? I’ve gotten essentially the same performance as client arrays using streaming VBOs (sans bindless) as described by Rob Barris here (ref this thread for details). All kudos to Rob. It’s not display-list perf of course (you only get that if your data is already on the GPU), but it’s pretty darn good! Better than naive classic VBOs. Especially for streaming!
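A rough sketch of that streaming pattern (buffer orphaning plus unsynchronized appends; streamVBO, streamOffset, STREAM_BUFFER_SIZE, vertexData and the per-batch sizes are all placeholder names for illustration):

// Append each batch into one large streaming VBO; orphan it when it fills up
glBindBuffer(GL_ARRAY_BUFFER, streamVBO);
if (streamOffset + bytesThisBatch > STREAM_BUFFER_SIZE) {
    // Orphan: the driver hands back fresh storage, so we never stall on the GPU
    glBufferData(GL_ARRAY_BUFFER, STREAM_BUFFER_SIZE, NULL, GL_STREAM_DRAW);
    streamOffset = 0;
}
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, streamOffset, bytesThisBatch,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, vertexData, bytesThisBatch);   // write this batch's vertices
glUnmapBuffer(GL_ARRAY_BUFFER);

glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)streamOffset);
glDrawArrays(GL_QUADS, 0, vertexCountThisBatch);
streamOffset += bytesThisBatch;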

Is it because that too many GL_QUADS for each primitive?

Don’t think so. Your index list is only 600 elements long… That’s not big.

Thanks Dark Photon.
That is to say, bindless extensions benefit HW instancing.
I’m trying to understand and use them.

Given that the main purpose of this extension is simply eliminating CPU-side cache misses caused by the driver having to translate VBO handles into VBO addresses (think address = buffer_list[ handle ]), then putting two-and-two together, one can surmise that this is the key slowdown at issue between “naked” classic (non-bindless) VBO draw calls and display list draw calls.

At the danger of getting off-topic… but I never really grokked that statement. A simple cached array access should slow down rendering by a factor of 7? That can’t be… the rest of the app usually accesses much more data throughout a frame. And think about it: if the driver doesn’t do such a lookup, then you’ll do one yourself in a very similar fashion, like my_stored_gpu_address[my_object_id]. I do not deny the effects of memory accesses, I just can’t believe that a simple lookup should slow things down so much. There must be more to it, like re-configuring the HW for changed vertex formats and the like…

Hello Dark Photon. I tried to follow your sample (see this simple test prog), but I get these errors:
error C2065: 'GL_BUFFER_GPU_ADDRESS_NV' : undeclared identifier
error C3861: 'glGetBufferParameterui64vNV': identifier not found
error C3861: 'glMakeBufferResidentNV': identifier not found
I think I’m missing some header files, such as glext.h. Is that correct? Thanks a lot!

That was NVidia’s assertion. Perhaps possible (bunches and bunches of tiny batches each in their own VBOs might come close), but the closest I’ve seen on our real-world data is 2X, not 7X. The greater the batch count, the greater the speedup. The lower the batch count, the lower the speedup. Which isn’t too surprising.

Still, I’ll take up to 2X the draw perf on the same hardware any day! More frame time to render more complex and interesting content. And besides, I really don’t have to change anything to get it (just swap a few GL calls around).

There must be more to it, like re-configuring the HW for changed vertex formats and the like…

Dunno. Not an NVidia driver guy. I do know that when I compare bindless vs. not bindless, the vertex formats, VBO packings, interleavings, enables, batch composition, batch order, etc. remain exactly the same, and the draw speed-up was up to 2X, real-world, not contrived.

And all I’m doing differently is providing 64-bit VBO addresses instead of VBO handles + offsets. Whatever the “special sauce” is, that minor change can yield impressive results.

Use the latest glext.h, always available from the OpenGL registry: http://www.opengl.org/registry/

gl.h pulls in glext.h internally, but feel free to #include it yourself. Just ensure that it comes after the #define GL_GLEXT_PROTOTYPES.
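For example, a typical include order might look like this (paths are assumptions for your project; note that on Windows you still need to fetch the actual entry points at runtime, e.g. via wglGetProcAddress or a loader such as GLEW):

#include <windows.h>            // needed before gl.h on Windows
#define GL_GLEXT_PROTOTYPES     // must be defined before glext.h is pulled in
#include <GL/gl.h>
#include <GL/glext.h>           // the latest copy from the OpenGL registry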

After I include glext.h, I still get errors C2065 and C3861, as above.

#define GL_GLEXT_PROTOTYPES
#include "glext.h"

And in glext.h, the lines after
#ifndef _glext_h
#define _glext_h
are greyed out (inactive).
Is _glext_h already defined by some other header file?

Better not be. That would be ridiculous. It’s not here.