Fastest way of moving and drawing VBOs

Hello all,

I am creating a GUI system with colored geometry (i.e. the only textured content in the GUI will be bitmap text). I am using 3.1+ OpenGL. Since everything will be formed from quads, and since it’s a GUI, it needs to be fast. How would I approach this?

Should I create a VAO for the entire GUI system and a VBO for each of the quads? Do I then move and resize the quads by updating the VBO’s data with glBufferSubData()? I’ve never used VAOs before, so I have no real idea of the benefits.

Or maybe I could have a single VBO quad and instance it with transformations?

I don’t know, I’m a total beginner to OpenGL, so maybe these aren’t viable options. In that case, is there a best way of doing this for my specific purposes? For the GUI, I only have to be able to create quads, move them and resize them.

Also, what about drawing the quads? From what info I found, glDrawElements seems to be the best way, assuming I use an element array buffer and don’t upload the indices every frame.

Please, could I get an opinion on this?

I have heard of about 4 different ways to get data into VBOs and each is fastest on a different set of hardware. I like using GL_AMD_pinned_memory or glFlushMappedBufferRange.

Using AMD’s pinned memory is really the best because it lets the GPU see CPU memory directly. You can edit the CPU memory and the GPU sees the changes immediately; just use one buffer and cycle through it like a ring buffer. In case you’re wondering, the GPU can read a VBO in CPU memory faster than it can process the vertices, so no speed is lost using it. However, it is only available on AMD; I wish NVIDIA had it too.
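The ring-buffer cycling described above boils down to cursor bookkeeping; a minimal sketch (the function name is mine, the GL_AMD_pinned_memory buffer setup itself is omitted, and a real renderer would fence each region before reusing it):

```c
#include <stddef.h>

/* Advance a write cursor through the persistently visible buffer,
 * wrapping to the start when the next write would not fit.
 * Returns the byte offset the caller should write at. */
static size_t ring_alloc(size_t *cursor, size_t bytes, size_t capacity) {
    if (*cursor + bytes > capacity)
        *cursor = 0; /* wrap; assumes the GPU is done with that region */
    size_t offset = *cursor;
    *cursor += bytes;
    return offset;
}
```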

See, it says everything.

For this kind of drawing, vertex submission is highly unlikely to be anywhere near your primary bottleneck. Having said that, I have seen cases where draw calls can mount up in terms of performance overhead, but these are generally limited to rather extreme and unlikely cases that won’t be encountered in real-world apps (e.g. drawing a few hundred character quads using D3D - not OpenGL - under VMware’s display driver), so again it’s not something you need to worry overmuch about.

The three main options that seem viable to me are: (1) a streaming VBO with 4 verts per quad, (2) a streaming VBO with 1 vert per quad and a geometry shader, or (3) a streaming VBO with 1 vert per quad and instancing. Of these, (2) is to be avoided, as having the geometry shader stage active will more than wipe out any gains you may get from submitting just 1 vert per quad. That leaves (1) or (3), and generally it’s a total wash - they come out roughly equal in performance. I’ve a preference for (3) as it results in less C/C++ code, but that’s the only reason.

Option (1) is the only one of these where there’s a choice between glDrawArrays and glDrawElements; otherwise we’re using glDrawArrays always. With option (1) you get to choose between GL_TRIANGLES with 6 indices per quad or GL_TRIANGLE_FAN/STRIP with 5 per quad (with primitive restart enabled). GL_TRIANGLES may be slightly faster on older hardware that doesn’t have the best support for primitive restart. Either way your index buffer can be completely static so long as it’s big enough.
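For the GL_TRIANGLES variant, that static index buffer is trivial to build up front; a sketch (the function name is mine - note that 16-bit indices cap you at 16384 quads):

```c
/* Write 6 indices per quad, two triangles over vertices 4q..4q+3.
 * Upload once with GL_STATIC_DRAW; it never changes. */
static void fill_quad_indices(unsigned short *idx, int quad_count) {
    for (int q = 0; q < quad_count; ++q) {
        unsigned short base = (unsigned short)(q * 4);
        idx[q * 6 + 0] = base + 0;
        idx[q * 6 + 1] = base + 1;
        idx[q * 6 + 2] = base + 2;
        idx[q * 6 + 3] = base + 2;
        idx[q * 6 + 4] = base + 1;
        idx[q * 6 + 5] = base + 3;
    }
}
```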

Option (3) is, as I said, my preferred approach. It doesn’t need indices, involves less code, and submits 1 vert per quad while avoiding the overhead of an enabled geometry shader stage. You don’t even need a VBO for your single quad, as you can use the gl_VertexID builtin instead; so all you do is set up x, y, w, h and s-low, t-low, s-high, t-high as per-instance data in a streaming buffer, then use gl_VertexID in your vertex shader to figure out the final values for each vertex, and you’re done.
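The gl_VertexID expansion amounts to picking a corner from two bits of the vertex index; here is the same arithmetic in C for clarity (in the real thing it lives in the vertex shader, and the names are mine):

```c
/* Corner 0..3 of a quad from per-instance x, y, w, h, matching a
 * 4-vertex triangle strip: (x,y), (x+w,y), (x,y+h), (x+w,y+h).
 * In GLSL the same bit tests run on gl_VertexID. */
static void quad_corner(int vertex_id, float x, float y, float w, float h,
                        float *out_x, float *out_y) {
    *out_x = x + ((vertex_id & 1) ? w : 0.0f);
    *out_y = y + ((vertex_id & 2) ? h : 0.0f);
}
```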

I have had a little time to play with glMultiDrawElementsIndirect, and it works very well for simplifying draw commands. I can draw everything in the whole game in one call. Everything comes from buffers: the indirect command buffer, element buffers, vertex attrib buffers, and texture buffers. The buffers can be updated according to the fastest method in that PDF I mentioned. I agree: stay away from the geometry shader - it gets wonky serializing a vertex stream of chunks of unknown size, and it really hurts performance. I think instancing would be best; remember to use glVertexAttribDivisor/glVertexBindingDivisor to get the one-attrib-per-instance behaviour to work. But with glMultiDrawElementsIndirect or glMultiDrawArraysIndirect you can group attribs by class, draw lots of disparate classes, and jump all over the place with instance counts and vertex/instance offsets - all in one call. You can even source and change the buffers on the GPU and put null commands in the list to completely remove the CPU from the loop.
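For reference, the indirect command buffer is just a tightly packed array of this fixed-layout struct (layout as in the GL 4.3 spec; an instanceCount of 0 gives the “null” draws mentioned above):

```c
#include <stdint.h>

/* One record per draw in the buffer bound to GL_DRAW_INDIRECT_BUFFER. */
typedef struct {
    uint32_t count;         /* number of indices                */
    uint32_t instanceCount; /* 0 = skip this draw entirely      */
    uint32_t firstIndex;    /* offset into the element buffer   */
    int32_t  baseVertex;    /* added to every index             */
    uint32_t baseInstance;  /* offset into per-instance attribs */
} DrawElementsIndirectCommand;
```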

Awesome - although I understand about half of what you guys wrote, you’ve given me something to think about. I’ve also read most of the PDF from codepilot - thanks a lot for that, it’s packed full of information.

streaming VBO with 4 verts per quad

That is what I’m doing now. I am drawing a GL_TRIANGLE_STRIP with glDrawElements. I only have 4 indices and they go 0, 1, 2, 3.
Well, at least I’m doing part of it - I don’t understand what you mean by streaming. Do I have as many VBOs as there are GUI elements, and only update those that change position/size?

And can/should I try to put all of these VBOs in a VAO? Would that actually have any benefits? I’ve read this post

Say you are drawing 100 buttons, which are really just rectangles I guess. You don’t want 100 vertex buffers. It is so much easier to have 1 vertex buffer sized for 400 vertices. Simply transfer any changes to the buttons into that one buffer object using one of the six or so methods in the PDF, and draw again. I really like the pinned memory method because you can just write the changes to the buffer and the GPU sees them automatically.
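The bookkeeping for that single shared buffer is just offset arithmetic; a sketch with a made-up vertex layout (all names are mine):

```c
#include <stddef.h>

/* Hypothetical per-vertex layout: position plus a packed color. */
typedef struct { float x, y; unsigned char rgba[4]; } GuiVertex;

/* Byte offset of button i's 4 vertices inside the shared VBO; update
 * just that range, e.g. with
 *   glBufferSubData(GL_ARRAY_BUFFER, quad_offset(i),
 *                   4 * sizeof(GuiVertex), verts);
 * or a mapped write at the same offset. */
static size_t quad_offset(int button_index) {
    return (size_t)button_index * 4 * sizeof(GuiVertex);
}
```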

Does this actually improve performance in any noticeable way?

Well, I’m calling OpenGL from Node.js. So if I called glDrawElementsIndirect a bunch of times it would make a lot of trips from JavaScript land to C++ land, and that is really expensive for Node.js. Using just one call avoids that cost. I would be interested in the speedup in plain C code for comparison. I imagine the multi commands are faster, and couldn’t be slower, but by how much I don’t know.

Are you using WebGL, or some other JavaScript-based runtime?

No WebGL, I don’t like “ES”. I made a C++ extension for Node.js. It lets me use OpenGL 4.3 with either a compatibility or a core context, so getting off the ground is quick.

It’s a whole mess of calls; see a simple snippet below. The JavaScript calls OpenGL::vertexAttrib4Ii, which queues an APC in the GUI thread calling APC_vertexAttrib4Ii, which executes it. Calling C++ from JavaScript is super slow, so calling vertexAttrib4Ii in immediate mode a bunch of times was crawling at >500 ms per frame. When I used a display list it was back to super fast, but not dynamic. Using glMulti*Indirect it was back to super fast too, so I think it is the right direction for me.

VOID CALLBACK OpenGL::APC_vertexAttrib4Ii(__in ULONG_PTR dwParam) {
    PParam8 params = (PParam8)dwParam;
    params->obj->glVertexAttribI4iv(params->i32args[0], &params->i32args[1]);
}

Handle<Value> OpenGL::vertexAttrib4Ii(const Arguments& args) {
    HandleScope scope;
    if (args.Length() < 5)
        return scope.Close(Undefined());
    OpenGL *obj = OpenGL::Unwrap<OpenGL>(args.This());
    PParam8 params = getParam8();
    params->obj = obj;
    params->i32args[0] = argnint(args, 0);
    params->i32args[1] = argnint(args, 1);
    params->i32args[2] = argnint(args, 2);
    params->i32args[3] = argnint(args, 3);
    params->i32args[4] = argnint(args, 4);
    DWORD success = QueueUserAPC(APC_vertexAttrib4Ii, obj->threadHandle, (ULONG_PTR)params);
    if (!success) {
        // QueueUserAPC failed; GetLastError() has the reason
        OutputDebugString(TEXT("vertexAttrib4Ii error"));
    }
    return scope.Close(Undefined());
}

If you are a nodejs junky, I might pass it your way if you are interested.

Through lots of magical macro work (see below), each line defines and instantiates both the JavaScript-facing call and the APC, and everything else. The code above is part of what one line gets expanded to.

glGen(genBuffer, params->obj->glGenBuffers);
glGen(genFramebuffer, params->obj->glGenFramebuffers);
glGen(genProgramPipeline, params->obj->glGenProgramPipelines);
glGen(genQuery, params->obj->glGenQueries);
glGen(genRenderbuffer, params->obj->glGenRenderbuffers);
glGen(genSampler, params->obj->glGenSamplers);
glGen(genTexture, glGenTextures);
glGen(genTransformFeedback, params->obj->glGenTransformFeedbacks);
glGen(genVertexArray, params->obj->glGenVertexArrays);
glMethod0(end, glEnd);
glMethod0(loadIdentity, glLoadIdentity);
glMethod0(endList, glEndList);
glMethod_u(activeTexture, params->obj->glActiveTexture);
glMethod_u(clear, glClear);
glMethod_u(enableClientState, glEnableClientState);
glMethod_u(enable, glEnable);
glMethod_u(frontFace, glFrontFace);
glMethod_u(cullFace, glCullFace);
glMethod_u(disable, glDisable);
glMethod_u(begin, glBegin);
glMethod_u(generateMipmap, params->obj->glGenerateMipmap);
glMethod_u(callList, glCallList);
glMethod_eo(bindTexture, glBindTexture);
glMethod_eo(bindTransformFeedback, params->obj->glBindTransformFeedback);
glMethod_eo(bindBuffer, params->obj->glBindBuffer);
glMethod_eo(bindFramebuffer, params->obj->glBindFramebuffer);
glMethod_eo(bindRenderbuffer, params->obj->glBindRenderbuffer);
glMethod_eo(bindSampler, params->obj->glBindSampler);
glMethod_eo(beginQuery, params->obj->glBeginQuery);
glMethod_e(endQuery, params->obj->glEndQuery);
glMethod_uu(hint, glHint);
glMethod_uu(newList, glNewList);
glMethod_uuu(texParameteri, glTexParameteri);
glMethod_f(clearDepth, glClearDepth);
glMethod_ff(depthRange, glDepthRange);
glMethod_fff(scale, glScalef);
glMethod_ffff(clearColor, glClearColor);
glMethod_ffff(texCoord4f, glTexCoord4f);
glMethod_ffff(vertex4f, glVertex4f);
glMethod_ffff(color4f, glColor4f);
glMethod_uuuu(color4ub, glColor4ub);
glMethod_iiii(vertex4i, glVertex4i);
glMethod_ffff(rotate, glRotatef);
glMethod_uuuuu(texStorage2D, params->obj->glTexStorage2D);
glMethod_uuuuuu(texStorage3D, params->obj->glTexStorage3D);

glMapBuffer, how I mock thee!

Anyway, I am now playing with glMapBuffer, drawing 75,000 quads at 30 FPS and glMapBuffer-ing all of those vertices every frame. Quite nice performance, though I have to ask:

When creating a GUI, should I scratch the buffer as I do now and reconstruct all of the vertices every frame, even if only a handful of gui elements change? At first it seems wasteful, but I wouldn’t have to keep any state of the vertices, I could just fill the buffer as I go along.
Or should I instead map the buffer and write only the changes that occur in that frame?
Or throw away glMapBuffer and use glBufferSubData? I’ve read the PDF, so I know that glMapBuffer should be the winner, but I’ve read this - and it made me a little bit suspicious.

I read that link; the replies at the bottom indicate that the writer wasn’t orphaning the buffers before mapping them. Without orphaning, the mapping process can read the buffer back from the GPU to the CPU. Orphaning is calling glBufferData with a null data pointer; it lets the GL know the old contents are no longer needed. After orphaning, the glMapBuffer command maps fresh storage, causing no read-back, and is much faster - assuming the GL driver doesn’t get badly confused by the whole process. Orphaning is how the article I posted gets its good numbers with glMapBuffer.
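A related note: with glMapBufferRange (core since GL 3.0) the orphan can be folded into the map itself via the invalidate bits; a sketch (the defines mirror the glext.h values, and the helper name is mine):

```c
/* Access bits for glMapBufferRange, values as in glext.h. */
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT  0x0004
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008

/* Invalidating the whole buffer on map is the glMapBufferRange
 * equivalent of orphaning with glBufferData(..., NULL, ...);
 * invalidating only the mapped range suits partial updates. */
static unsigned map_flags(int whole_buffer) {
    return GL_MAP_WRITE_BIT |
           (whole_buffer ? (unsigned)GL_MAP_INVALIDATE_BUFFER_BIT
                         : (unsigned)GL_MAP_INVALIDATE_RANGE_BIT);
}
```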

They indicate no such thing. In the writer’s own words: “Yep, that’s what they say, but it doesn’t work. I went through all the papers and tips articles I could find and tried all combinations of glBufferData with NULL pointers, all the access modes. Nada.”

That’s what has got me worried. I will of course test all of this on my own, I just wanted something of a last-minute reassurance that glMapBuffer is a good way to update a lot of disconnected parts of a buffer, which it seems to be.

Anyway, your PDF was very helpful, I appreciate it.

They indicate no such thing. In the writer’s own words: “Yep, that’s what they say, but it doesn’t work. I went through all the papers and tips articles I could find and tried all combinations of glBufferData with NULL pointers, all the access modes. Nada.”

Did you read the very next page?

Also, this is from like five years ago. I wouldn’t trust any OpenGL performance information that old.

I haven’t - I stand corrected. And I didn’t trust it; that’s why I asked here. :)