Fastest way of moving and drawing VBOs

Inagawa · August 28, 2012, 10:20am

Hello all,

I am creating a GUI system with colored geometry (i.e. the only textures in the GUI will be bitmap text). I am using OpenGL 3.1+. Since everything will be built from quads, and since it’s a GUI, it needs to be fast. How should I approach this?

Should I create a VAO for the entire GUI system and a VBO for each of the quads? Do I then move and resize the quads by updating the VBO’s data with glBufferSubData()? I’ve never used VAOs before, so I have no real idea of the benefits.

Or maybe I could have a single VBO quad and instance it with transformations?

I don’t know, I’m a total beginner to OpenGL, so maybe these aren’t viable options. In that case, is there a best way of doing this for my specific purposes? For the GUI, I only have to be able to create quads, move them and resize them.

Also, what about drawing the quads? From what info I found, DrawElements seems to be the best way, assuming I use an element array buffer and don’t upload the indices every frame.

Please, could I get an opinion on this?

codepilot · August 28, 2012, 3:22pm

I have heard of about 4 different ways to get data into VBOs and each is fastest on a different set of hardware. I like using GL_AMD_pinned_memory or glFlushMappedBufferRange.

codepilot · August 28, 2012, 3:30pm

Using AMD’s pinned memory is really the best because it lets the GPU see the CPU memory. That way you can edit the CPU memory and the GPU already sees it; just use one buffer and cycle through it, like a ring buffer. In case you wonder, the GPU can read the VBO from CPU memory faster than it can process the vertices, so no speed is lost using it. However, it is only available on AMD; I wish it were on NVIDIA too.
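As a rough illustration of the "cycle through one buffer like a ring buffer" idea, here is a minimal CPU-side sketch in C++. The RingCursor type and its allocate() method are hypothetical names made up for this example; it only tracks offsets and makes no OpenGL calls. It also assumes the caller makes sure (for instance with fences) that the GPU has finished reading a region before it is reused.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: hands out byte ranges from one fixed-size buffer,
// wrapping back to the start when a request would run past the end.
struct RingCursor {
    std::size_t capacity;  // total buffer size in bytes
    std::size_t head = 0;  // next free byte

    explicit RingCursor(std::size_t cap) : capacity(cap) {}

    // Returns the byte offset at which `bytes` may be written.
    std::size_t allocate(std::size_t bytes) {
        assert(bytes <= capacity && "request larger than the whole buffer");
        if (head + bytes > capacity) {
            // Wrap around. Assumption: the GPU is done with the old data here.
            head = 0;
        }
        const std::size_t offset = head;
        head += bytes;
        return offset;
    }
};
```

Each frame you would ask the cursor for a range, write the new vertices at that offset in the pinned or mapped memory, and draw from the same offset.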

codepilot · August 28, 2012, 3:37pm

See http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf, it says everything.

mhagain · August 28, 2012, 3:45pm

For this kind of drawing, vertex submission is highly unlikely to be anywhere near your primary bottleneck. Having said that, I have seen cases where draw calls can mount up in terms of performance overhead, but these are generally limited to rather extreme and unlikely cases which won’t be encountered in real-world apps (e.g. running a few hundred character quads using D3D, not OpenGL, under VMWare’s display driver), so again it’s not something you need to worry overmuch about.

The three main options that seem viable to me are: (1) streaming VBO with 4 verts per quad, (2) streaming VBO with 1 vert per quad and geometry shader, or (3) streaming VBO with 1 vert per quad and instancing. Of these, (2) is to be avoided, as having the geometry shader stage active will more than wipe out any gains you may get from just 1 vert per quad. That leaves (1) or (3), and generally it’s a total wash: they come out roughly equal in performance. I’ve a preference for (3) as it results in less C/C++ code, but that’s the only reason.

Option (1) is the only one of these where there is a choice between glDrawArrays and glDrawElements, otherwise we’re using glDrawArrays always. With option (1) you get to choose between GL_TRIANGLES with 6 indexes per quad or GL_TRIANGLE_FAN/STRIP with 5 per quad (and primitive restart enabled). GL_TRIANGLES may be slightly faster on slightly older hardware that may not have the best support for primitive restart. Either way your index buffer can be completely static so long as it’s big enough.
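To make the "completely static index buffer" concrete, here is a small C++ sketch that generates the GL_TRIANGLES variant (6 indices per quad). The function name buildQuadIndices and the corner ordering are assumptions chosen for this example; the result is what you would upload once to an element array buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds GL_TRIANGLES indices for `quadCount` quads, assuming each quad's
// four vertices are stored consecutively in the order
// 0 = bottom-left, 1 = bottom-right, 2 = top-right, 3 = top-left.
std::vector<std::uint32_t> buildQuadIndices(std::uint32_t quadCount) {
    // Guard against overflow of the 32-bit vertex index.
    assert(quadCount <= 0xFFFFFFFFu / 4u);

    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        const std::uint32_t base = q * 4;
        // First triangle: 0, 1, 2
        indices.push_back(base + 0);
        indices.push_back(base + 1);
        indices.push_back(base + 2);
        // Second triangle: 0, 2, 3
        indices.push_back(base + 0);
        indices.push_back(base + 2);
        indices.push_back(base + 3);
    }
    return indices;
}
```

Because the pattern never changes, the buffer can be sized for the largest number of quads you expect and only the vertex data needs to be streamed.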

Option (3) is, as I said, my preferred approach. This doesn’t need indexes, involves less code, and has a submission of 1 vert per quad but avoids the overhead of having a geometry shader stage enabled. You don’t even need a VBO for your single quad either, as you can use the gl_VertexID builtin instead; so all that you do is set up x, y, w, h and s-low, t-low, s-high, t-high as per-instance data in a streaming buffer, then use gl_VertexID in your vertex shader to figure out the final values for each vertex, and you’re done.
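The per-vertex math for option (3) is small enough to sketch. The real thing would live in the vertex shader; the C++ below is only a CPU-side mirror of that arithmetic so it can be run and checked. QuadInstance, Vertex and expandCorner are hypothetical names, and the sketch assumes the quad is drawn as a 4-vertex GL_TRIANGLE_STRIP, so vertexId stands in for gl_VertexID in the range 0..3.

```cpp
#include <cassert>

// Per-instance data as described above: position/size plus texcoord range.
struct QuadInstance {
    float x, y, w, h;
    float sLow, tLow, sHigh, tHigh;
};

// What the vertex shader would output for one corner.
struct Vertex {
    float px, py;
    float s, t;
};

// CPU mirror of the shader logic: derive the corner from the vertex ID.
// IDs 0..3 map to corners (0,0), (1,0), (0,1), (1,1), i.e. strip order.
Vertex expandCorner(const QuadInstance& q, int vertexId) {
    assert(vertexId >= 0 && vertexId < 4);
    const float u = static_cast<float>(vertexId & 1);   // 0, 1, 0, 1
    const float v = static_cast<float>(vertexId >> 1);  // 0, 0, 1, 1
    Vertex out;
    out.px = q.x + u * q.w;
    out.py = q.y + v * q.h;
    out.s = q.sLow + u * (q.sHigh - q.sLow);
    out.t = q.tLow + v * (q.tHigh - q.tLow);
    return out;
}
```

In a shader the same two lines that compute u and v would use gl_VertexID, and the QuadInstance fields would arrive as instanced vertex attributes.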

codepilot · August 28, 2012, 4:23pm

I have had a little time to play with glMultiDrawElementsIndirect, and it works very well for simplifying draw commands. I can draw all my stuff in the whole game in one call. Everything comes from buffers: an indirect command buffer, element buffers, vertex attrib buffers, and texture buffers. The buffers can be updated according to the fastest method in that PDF I mentioned. I agree, stay away from the geometry shader; it gets wonky with serializing the vertex stream of chunks of unknown size, and it really messes with performance. I think instancing would be best; remember to use glVertexAttribDivisor/glVertexBindingDivisor to get the one-attrib-per-instance setup to work. But with glMultiDrawElementsIndirect or glMultiDrawArraysIndirect you can group some attribs by class, draw lots of disparate classes, and jump all over the place with instance count, vertex and instance offsets. And all in one call. You can even source and change the buffers on the GPU and draw with null commands in the list to completely remove the CPU from the loop.
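For reference, here is a hedged C++ sketch of how a command list for glMultiDrawArraysIndirect might be assembled on the CPU. The DrawArraysIndirectCommand layout follows the OpenGL specification; QuadBatch and buildCommands are hypothetical names for this example, and the sketch assumes every batch draws instanced 4-vertex quads whose per-instance data is packed back to back in one buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One command as read by glMultiDrawArraysIndirect (layout per the GL spec).
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // vertices per instance
    std::uint32_t instanceCount;  // 0 makes this a "null" command
    std::uint32_t first;          // first vertex
    std::uint32_t baseInstance;   // offset into per-instance attributes
};
static_assert(sizeof(DrawArraysIndirectCommand) == 16,
              "commands must be tightly packed");

// Hypothetical: one class of GUI element and how many quads it has.
struct QuadBatch {
    std::uint32_t instanceCount;
};

std::vector<DrawArraysIndirectCommand>
buildCommands(const std::vector<QuadBatch>& batches) {
    std::vector<DrawArraysIndirectCommand> cmds;
    cmds.reserve(batches.size());
    std::uint32_t baseInstance = 0;
    for (const QuadBatch& b : batches) {
        // Guard against wrapping the running instance offset.
        assert(baseInstance + b.instanceCount >= baseInstance);
        DrawArraysIndirectCommand cmd;
        cmd.count = 4;  // assumption: each quad is a 4-vertex strip
        cmd.instanceCount = b.instanceCount;
        cmd.first = 0;
        cmd.baseInstance = baseInstance;
        cmds.push_back(cmd);
        baseInstance += b.instanceCount;
    }
    return cmds;
}
```

The resulting vector is what would be uploaded to the buffer bound to GL_DRAW_INDIRECT_BUFFER; a batch with zero instances stays in the list but draws nothing.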

Inagawa · August 29, 2012, 2:42am

Awesome, although I understand about half of what you guys wrote, you’ve given me something to think about. I also have read most of the pdf from codepilot - thanks a lot for that, it’s packed full with information.

streaming VBO with 4 verts per quad

That is what I’m doing now. I am drawing a GL_TRIANGLE_STRIP with glDrawElements. I only have 4 indices and they go like 0, 1, 2, 3.
Well, at least I’m doing a part of it, I don’t understand what you mean by streaming. Do I have as many VBOs as the GUI elements need and only update those that change their position/size?

And can/should I try to put all of these VBOs in a VAO? Would it actually have any benefits? I’ve read this post http://www.opengl.org/discussion_boards/showthread.php/167743-VBO-to-VAO?p=1183855&viewfull=1#post1183855

codepilot · August 29, 2012, 9:37am

Say you are drawing 100 buttons, which are really just rectangles I guess. You don’t want 100 vertex buffers. It is so much easier to have 1 vertex buffer sized for 400 vertices. Simply transfer any changes to the buttons into that one buffer object using one of the six or so methods in the PDF, and draw again. I really like the pinned memory method because you can just write the changes to the buffer and the GPU sees them automatically.
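A minimal C++ sketch of the "one buffer, 400 vertices" layout, kept on the CPU side so it runs without a GL context. GuiVertex, ByteRange and QuadBuffer are hypothetical names; the sketch assumes 4 vertices per button in triangle-strip order and returns the byte range that changed, which is what you would hand to whichever upload method you pick (for example glBufferSubData or a mapped pointer).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GuiVertex {
    float x, y;
};

// Byte range inside the shared vertex buffer that needs re-uploading.
struct ByteRange {
    std::size_t offset;
    std::size_t size;
};

// CPU-side copy of the single vertex buffer shared by all quads.
class QuadBuffer {
public:
    explicit QuadBuffer(std::size_t quadCount) : vertices_(quadCount * 4) {}

    // Rewrites the four corners of quad `index` and reports what changed.
    ByteRange setQuad(std::size_t index, float x, float y, float w, float h) {
        assert(index * 4 + 3 < vertices_.size());
        GuiVertex* v = &vertices_[index * 4];
        v[0].x = x;     v[0].y = y;
        v[1].x = x + w; v[1].y = y;
        v[2].x = x;     v[2].y = y + h;
        v[3].x = x + w; v[3].y = y + h;
        ByteRange range;
        range.offset = index * 4 * sizeof(GuiVertex);
        range.size = 4 * sizeof(GuiVertex);
        return range;
    }

    const std::vector<GuiVertex>& vertices() const { return vertices_; }

private:
    std::vector<GuiVertex> vertices_;
};
```

Moving or resizing button 2, for instance, touches only the four vertices starting at index 8, so the upload stays small no matter how many buttons exist.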

Alfonse_Reinheart · August 29, 2012, 10:27am

codepilot wrote:

I have had a little time to play with glMultiDrawElementsIndirect, and it works very well for simplifying draw commands. I can draw all my stuff in the whole game in one call. Everything comes from buffers: an indirect command buffer, element buffers, vertex attrib buffers, and texture buffers. The buffers can be updated according to the fastest method in that PDF I mentioned. I agree, stay away from the geometry shader; it gets wonky with serializing the vertex stream of chunks of unknown size, and it really messes with performance. I think instancing would be best; remember to use glVertexAttribDivisor/glVertexBindingDivisor to get the one-attrib-per-instance setup to work. But with glMultiDrawElementsIndirect or glMultiDrawArraysIndirect you can group some attribs by class, draw lots of disparate classes, and jump all over the place with instance count, vertex and instance offsets. And all in one call. You can even source and change the buffers on the GPU and draw with null commands in the list to completely remove the CPU from the loop.

Does this actually improve performance in any noticeable way?

codepilot · August 29, 2012, 10:40am

Well, I’m calling opengl from nodejs. So if I called glDrawElementsIndirect a bunch of times it would make a lot of trips from javascript land to c++ land, and that is really expensive for nodejs. Using just 1 call saves that amount. I would be interested in the speed up in plain c code for the difference. I imagine the multi commands are faster, and couldn’t be slower, but how much, I don’t know.

Alfonse_Reinheart · August 29, 2012, 12:28pm

Are you using WebGL, or some other JavaScript-based runtime?

codepilot · August 29, 2012, 12:44pm

No WebGL, I don’t like “ES”. I made a c++ extension for nodejs. It allows me to use opengl 4.3 and compatibility context or core so getting off the ground is quick.

A whole mess of calls, see below a simple snippet. The javascript calls OpenGL::vertexAttrib4Ii, which queues an APC in the gui thread calling APC_vertexAttrib4Ii, that executes it. Calling c++ from javascript is super super slow. So calling vertexAttrib4Ii in immediate mode, a bunch of times was crawling, >500ms per frame. When I used a display list, back to super fast, but not dynamic. Using glMulti*Indirect it was back to super fast, so I think it is the right direction for me.

VOID CALLBACK OpenGL::APC_vertexAttrib4Ii(__in ULONG_PTR dwParam){
    // Runs on the GUI thread: unpack the queued arguments and issue the GL call.
    PParam8 params = (PParam8)dwParam;
    PSLIST_ENTRY entry = (PSLIST_ENTRY)dwParam; // unused in this snippet
    params->obj->glVertexAttribI4iv(params->i32args[0], &params->i32args[1]);
    glGetErrorFileLine(params->obj->tryError(TEXT("glVertexAttribI4iv")));
    freeParam8(params);
}

Handle<Value> OpenGL::vertexAttrib4Ii(const Arguments& args) {
    // Called from JavaScript: copy the arguments and queue the APC above.
    HandleScope scope;
    if(args.Length() < 5){
        return scope.Close(Undefined());
    }
    OpenGL *obj = OpenGL::Unwrap<OpenGL>(args.This());
    PParam8 params = getParam8();
    params->obj = obj;
    // Argument 0 is the attribute index, arguments 1..4 are the four values.
    params->i32args[0] = argnint(args, 0);
    params->i32args[1] = argnint(args, 1);
    params->i32args[2] = argnint(args, 2);
    params->i32args[3] = argnint(args, 3);
    params->i32args[4] = argnint(args, 4);
    DWORD success = QueueUserAPC(APC_vertexAttrib4Ii, obj->threadHandle, (ULONG_PTR)params);
    if(!success){
        DWORD err = GetLastError();
        OutputDebugString(TEXT("vertexAttrib4Ni error\n"));
    }
    return scope.Close(Undefined());
}

codepilot · August 29, 2012, 12:45pm

If you are a nodejs junky, I might pass it your way if you are interested.

codepilot · August 29, 2012, 1:22pm

Through lots of magical macro work (see below), each line defines and instantiates both the JavaScript-facing call and the APC, and everything else. The code above is part of what one such line gets expanded to.

glGen(genBuffer, params->obj->glGenBuffers);
glGen(genFramebuffer, params->obj->glGenFramebuffers);
glGen(genProgramPipeline, params->obj->glGenProgramPipelines);
glGen(genQuery, params->obj->glGenQueries);
glGen(genRenderbuffer, params->obj->glGenRenderbuffers);
glGen(genSampler, params->obj->glGenSamplers);
glGen(genTexture, glGenTextures);
glGen(genTransformFeedback, params->obj->glGenTransformFeedbacks);
glGen(genVertexArray, params->obj->glGenVertexArrays);
glMethod0(end, glEnd);
glMethod0(loadIdentity, glLoadIdentity);
glMethod0(endList, glEndList);
glMethod_u(activeTexture, params->obj->glActiveTexture);
glMethod_u(clear, glClear);
glMethod_u(enableClientState, glEnableClientState);
glMethod_u(enable, glEnable);
glMethod_u(frontFace, glFrontFace);
glMethod_u(cullFace, glCullFace);
glMethod_u(disable, glDisable);
glMethod_u(begin, glBegin);
glMethod_u(generateMipmap, params->obj->glGenerateMipmap);
glMethod_u(callList, glCallList);
glMethod_eo(bindTexture, glBindTexture);
glMethod_eo(bindTransformFeedback, params->obj->glBindTransformFeedback);
glMethod_eo(bindBuffer, params->obj->glBindBuffer);
glMethod_eo(bindFramebuffer, params->obj->glBindFramebuffer);
glMethod_eo(bindRenderbuffer, params->obj->glBindRenderbuffer);
glMethod_eo(bindSampler, params->obj->glBindSampler);
glMethod_eo(beginQuery, params->obj->glBeginQuery);
glMethod_e(endQuery, params->obj->glEndQuery);
glMethod_uu(hint, glHint);
glMethod_uu(newList, glNewList);
glMethod_uuu(texParameteri, glTexParameteri);
glMethod_f(clearDepth, glClearDepth);
glMethod_ff(depthRange, glDepthRange);
glMethod_fff(scale, glScalef);
glMethod_ffff(clearColor, glClearColor);
glMethod_ffff(texCoord4f, glTexCoord4f);
glMethod_ffff(vertex4f, glVertex4f);
glMethod_ffff(color4f, glColor4f);
glMethod_uuuu(color4ub, glColor4ub);
glMethod_iiii(vertex4i, glVertex4i);
glMethod_ffff(rotate, glRotatef);
glMethod_uuuuu(texStorage2D, params->obj->glTexStorage2D);
glMethod_uuuuuu(texStorage3D, params->obj->glTexStorage3D);

Inagawa · August 31, 2012, 6:16am

glMapBuffer, how I mock thee!

Anyway, I am now playing with glMapBuffer, drawing 75,000 quads at 30 FPS and glMapBuffer-ing all of those vertices every frame. Quite nice performance, though I have to ask:

When creating a GUI, should I scratch the buffer as I do now and reconstruct all of the vertices every frame, even if only a handful of gui elements change? At first it seems wasteful, but I wouldn’t have to keep any state of the vertices, I could just fill the buffer as I go along.
Or should I instead map the buffer, write only the changes that occur in that frame at once?
Or throw away glMapBuffer and use glBufferSubData? I’ve read the PDF, so I know that glMapBuffer should be the winner, but I’ve read this - http://www.stevestreeting.com/2007/03/16/glmapbuffer-how-i-mock-thee/ and it made me a little bit suspicious.

codepilot · August 31, 2012, 11:18am

I read that link; the replies at the bottom indicate that the writer wasn’t orphaning the buffers before mapping them. Without orphaning, mapping can force the driver to wait for the GPU and read the buffer back to the CPU. Orphaning is calling glBufferData with null for the data pointer; it lets the GL know the old contents are no longer needed. After orphaning, the glMapBuffer command maps fresh storage, causing no readback, and is much faster. That assumes the GL driver doesn’t get badly confused by the whole process. Orphaning is how the article I posted gets the good numbers with glMapBuffer.

Inagawa · September 1, 2012, 10:52am

They indicate no such thing. In the writer’s own words: “Yep, that’s what they say, but it doesn’t work. I went through all the papers and tips articles I could find and tried all combinations of glBufferData with NULL pointers, all the access modes. Nada.”

That’s what has got me worried. I will of course test all of this on my own, I just wanted something of a last-minute reassurance that glMapBuffer is a good way to update a lot of disconnected parts of a buffer, which it seems to be.

Anyway, your PDF was very helpful, I appreciate it.

Alfonse_Reinheart · September 1, 2012, 12:51pm

They indicate no such thing. In the writer’s own words: Yep, that’s what they say, but it doesn’t work. I went through all the papers and tips articles I could find and tried all combinations of glBufferData with NULL pointers, all the access modes. Nada.

Did you read the very next page? It’s titled “glMapBuffer vs glBufferSubData, the return” on SteveStreeting.com.

Also, this is from like five years ago. I wouldn’t trust any OpenGL performance information that old.

Inagawa · September 2, 2012, 6:17am

I haven’t, I stand corrected. And I didn’t trust it, that’s why I asked here.