glMapBuffer CPU Usage Peaking

GL3.x AFAIK is not 100% stable or available to many PC users, just as it is not available at all to anyone on OS X

OpenGL is not 100% stable period, particularly on Windows.

However, a lot of really very useful and well thought out extensions which you yourself say are only available on GL3.x are available on OS X now (across all of Apple’s machines) and will continue to be available when GL3.x becomes part of OS X.

There are exactly 15 Apple-specific extensions, no more and no fewer. I would only consider the following to be “really very useful and well thought out”:

GL_APPLE_client_storage
GL_APPLE_fence
GL_APPLE_vertex_array_object
GL_APPLE_flush_buffer_range

GL_APPLE_object_purgeable might have made the list, save that it provides no API to tell you when such purging has taken place. Without that, it isn’t very well thought out.

Four out of 15 is not a good track record. Better than the ARB’s record. But that is a low bar indeed.

Overall that’s a better situation for developers on that platform than having to worry about which GL driver / context / version they are running, then deal with all the various teething problems discussed at length on these forums, and support the hundreds of different flavours of drivers and OSs out there in PC land.

Portability, i.e. platform independence, is OpenGL’s one remaining strength. Having to write to the “hundreds of different flavours of drivers and OSs out there in PC land” in addition to the Apple-specific extensions is helpful to no one. Apple-specific extensions are nice for Mac OS X developers, but they are just as bad for OpenGL’s progress as nVidia’s bindless graphics API.

Your point is? Where do you want to go with this?
Nothing is 100% stable.

GL2.x and below are however mature APIs which are shipped as standard and readily available. GL3.x is a way off that yet.

Well email Apple then and let their engineers know that Alfonso Reinheart does not approve of 10 of their extensions and that they should recode them to your spec. Good luck with that.

So what would you have Apple do?
Or any manufacturer or that matter?

Stop innovating?
Stop developing?
Simply implement what the ARB tells them is OK to?

Your last point really only confirms my view and the points I made above.

Yep, Apple shamelessly tries to provide the best API to its developers working on its operating system in the Apple ecosystem. And that’s exactly what a manufacturer who also writes its own OS should do.

They don’t always get it right, or provide us with what we need. But what they do give us is a stable OS and OpenGL API which is guaranteed to work on any machine they make.

I don’t see any reason for a big manufacturer to have to wait on any governing body or advisory board to offer facilities over and above the standard to their customers.

scratt (and everyone else but Reinheart),

apparently I am ping-ponging wrong, but I have no idea why it is not working. I set one buffer to teal and the other to red. When I render, I now see teal for the first frame, and then only red. The CPU usage for this low-poly model is at 0%, though! But when I try the same method on a higher-poly model, CPU jumps to the infamous 50-55%.

I set my two buffers up like this

glGenBuffers(1, &ColorVBOID);
glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), ColorVBO, GL_DYNAMIC_DRAW); // Color Data

glGenBuffers(1, &ColorVBOID2);
glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), ColorVBO2, GL_DYNAMIC_DRAW); // Color Data

This is all done in my init() function.
The code for my display function, below, contains

glEnableVertexAttribArray(attributeIndex);

Then in display, when I ping-pong I have a global counter to keep track of the iteration.

if (counter%2 != 0)
{
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ); // Color Data
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        //pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    //memcpy(pData, ColorVBO, VertexTotal);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    //glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}
else
{
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ); // Color Data
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        //pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    //memcpy(pData, ColorVBO2, VertexTotal);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    //glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}

Am I making a simple mistake in the way I swap the two VBOs? Any suggestions are appreciated.

Can you describe what’s going wrong?
You’re mixing two systems here by the looks of it.

It appears that you are trying to draw with the data you have just provided to the GPU because the buffer you just uploaded to is still bound when you call glDrawRangeElements.

If you use the “NULL” trick the GPU effectively does what you are trying to do here, but for you behind the scenes, and you may well get away without ping-ponging.
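
For reference, the “NULL” trick is nothing more than a glBufferData call with a NULL pointer right before you map; a minimal sketch against your single colour buffer:

glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
// Reallocating with NULL gives the driver licence to hand you fresh storage,
// so the following map does not have to wait for the GPU to finish
// reading the old contents.
glBufferData(GL_ARRAY_BUFFER, VertexTotal * sizeof(float), NULL, GL_DYNAMIC_DRAW);
float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
// ... write the new colours, then glUnmapBuffer and draw as usual ...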

Here’s a fairly nice conceptual discussion of buffer use… (scroll to the bottom)
http://developer.apple.com/documentation…vertexdata.html

scratt,

As far as I can tell my double-VBO setup is wrong. I believe I need to ping-pong to achieve maximum performance, but I don’t know how to bind to the buffer I want to use as the vertex attribute data. I can explain where the CPU spikes in the following code, but I can’t work out where my double-VBO setup goes wrong. I thought the order was

  1. Bind Buffer to Map
  2. Map Buffer with correct void*
  3. Loop through buffer editing what needs to be changed
  4. Unmap buffer
  5. Bind OTHER buffer to draw
  6. glVertexAttribPointer to the OTHER buffer
  7. glDrawRangeElements

Reverse for Other iteration

if (counter%2 != 0)
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ); // Color Data
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        pData[i] = 0.0;
        //float tempFloat = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
        //pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}
else
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ); // Color Data
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        pData[i] = 0.0;
        //float tempFloat = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
        //pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}

In that code I found two terrible errors.
First, when I try to implement the double VBOs as I previously described, by adding a glBindBuffer to the OTHER VBO after glUnmapBuffer, my CPU spikes to 50%. No ideas on this one.

Second, when I keep the functioning “double” VBOs (the version without the correct ordering, I believe), my error lies in the inner for loop. When I call

float tempFloat = ComputeFinalColor(finalActivations[i].ElectrodeTriplet,i);

the CPU again spikes to 50%. I can only deduce this is from the CPU and GPU not being synchronized. I can’t seem to use the CPU for a custom function while I am using these double VBOs. If I assign pData[i] = 0.0; the CPU stays under 5%.

Any clues, scratt? This CPU and GPU synchronization is all new ground for me!

One more anomaly I have seen: when I run my tests on 5,000-, 50,000-, and 650,000-poly models, my CPU appears to stall more on the large model than on the smaller ones, if I use the incorrect but still-rendering configuration posted above.

Addition The CPU usage is not consistent. One run with the 650k-poly model will yield 2% CPU with no memory increase. I’ll change the model I load, go back to the 650k-poly model without any code modifications, and get 50% CPU with memory usage slowly increasing for the span of the program… help!

Last Addition The “random” CPU increase appears to correlate with recompiling and executing my program after exiting it. I.e. I compile and run program.exe in VS2008; CPU is at 3% and it works fine. I close the program, quickly re-compile and run program.exe in VS2008, and CPU is now 50%…

When I run this incorrect configuration with a small change, I get a spike in CPU for every single model, even without the function call in the inner for loop, but only if I order the following two lines like this:

glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ);
glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);

Instead of the two lines in this order:

glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
glBufferData(GL_ARRAY_BUFFER, ((VertexTotal) * sizeof(float)), NULL, GL_DYNAMIC_READ); // Color Data

My memory does not continually increase in that case, though. I’m really at a loss on how to get this resolved.

Thanks to everyone who contributes input.

Second, when I keep the functioning “double” VBOs (the version without the correct ordering, I believe), my error lies in the inner for loop. When I call


float tempFloat = ComputeFinalColor(finalActivations[i].ElectrodeTriplet,i);

the CPU again spikes to 50%. I can only deduce this is from the CPU and GPU not being synchronized. I can’t seem to use the CPU for a custom function while I am using these double VBOs. If I assign pData[i] = 0.0; the CPU stays under 5%.

Are you saying that now, only when you compute the tempFloat value with the ComputeFinalColor function, the CPU usage goes high?

One more anomaly I have seen: when I run my tests on 5,000-, 50,000-, and 650,000-poly models, my CPU appears to stall more on the large model than on the smaller ones, if I use the incorrect but still-rendering configuration posted above.

This is completely normal, since your program is single-threaded.
When you reach this amount of data, you may need to update it in a second thread. Simply do all the OpenGL work in the main thread, but perform the buffer update (the for loop) in the second thread.
When the update thread has finished, it means that the VBO is up to date and ready to be unmapped in the main thread. Then you can switch between the VBO used for rendering and this one.

Note that with this method, the VBO update may take several frames depending on its size and the computational complexity of the new float values. But IMO, that is the price to pay to prevent any application stall caused by the update.
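
A minimal sketch of this scheme, assuming C++11 threads for brevity (frontVBO, backVBO, fillDone and updateThread are hypothetical names standing in for your two ColorVBOIDs and whatever signalling you use; any thread API works the same way):

#include <atomic>
#include <thread>
#include <utility>

std::atomic<bool> fillDone(false);
std::thread updateThread;

// --- when a new update is needed (GL thread): map the back buffer and
// --- hand the pointer to the worker.
glBindBuffer(GL_ARRAY_BUFFER, backVBO);
float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
updateThread = std::thread([=]() {
    // The heavy per-vertex math now runs off the GL thread.
    for (int i = 0; i < VerticesInRadii; i++)
        pData[finalActivations[i].Vertex] =
            ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    fillDone = true;
});

// --- every frame (GL thread): keep rendering from frontVBO; once the
// --- worker is finished, unmap on the GL thread and swap the buffers.
if (fillDone)
{
    updateThread.join();
    glBindBuffer(GL_ARRAY_BUFFER, backVBO);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    std::swap(frontVBO, backVBO);
    fillDone = false;
}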

When scratt suggested calling glBufferData with a NULL pointer, that requires you to fill the entire VBO afterwards, since the whole buffer memory is invalidated. I am not sure it would help with this amount of data.

Are you saying that now, only when you compute the tempFloat value with the ComputeFinalColor function, the CPU usage goes high?

I cannot reproduce this error now. I get low CPU usage for models under 100k polys, even when I use the ComputeFinalColor function, as I hoped. This is good.

When the update thread has finished, it means that the VBO is up to date and ready to be unmapped in the main thread. Then you can switch between the VBO used for rendering and this one.

I never thought, or knew, that this could be due to my application being single-threaded. Great suggestion! I still don’t think I’m using double buffering right, though. It does render, with low CPU on low-poly models, and I can use my custom function per frame to get the right value I need. But I use these calls

if (counter%2 != 0)
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal/3) * sizeof(float)), NULL, GL_DYNAMIC_DRAW);
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}
else
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal/3) * sizeof(float)), NULL, GL_DYNAMIC_DRAW);
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        int index = finalActivations[i].Vertex;
        pData[index] = ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}

Is this correct? I don’t seem to be binding the appropriate buffer. But when I add a glBindBuffer(otherVBO) after glUnmapBuffer(), my CPU stalls… and I get no stall for low-poly models now with the above code. Thanks dletozeun.

That’s good advice from dletozeun.

I use a similar method with VBOs for a very large number of point sprites, which are actually distant stars with 128-bit coordinates projected onto a viewing sphere around the viewpoint. There is some maths involved any time we move a very large distance, and the buffers then need to be rebuilt.

I actually use a second thread to update a buffer subdivided into 10 sub-buffers for each “galaxy” of stars. In total there are about 100 “galaxies”. Because the stars move very little (even when the viewpoint is moving at great speed) I can get away with updates arriving a few frames late, and simply use the buffer at whatever stage of updating it has reached.

I also use those buffers to render to Impostors, but that’s a story for another day!

Of course that may not help if you can’t split your geometry.

I don’t think it is. Perhaps I am reading the code wrong, but I think you are updating and drawing from the same buffer at the same time… So you’re kind of negating the whole double buffering thing.

I would try this…
Draw from buffer A.
Update buffer B.
As soon as that is done set A to NULL and then map it and update it.
Once you’ve done that draw from buffer B.

In theory you are then giving each side as much “alone” time with each buffer as possible. AFAIK the DMA transfer still takes some time, and the idea of the “NULL” call is that you don’t stall: the CPU can go and do other things.

As dletozeun says, your larger data-sets are perhaps going to take longer to move around / update than you think.

You may also want to consider using GL_STREAM_DRAW instead of GL_DYNAMIC_DRAW.

The specification is somewhat unclear on this, but the idea behind the usage patterns is this:

STATIC: call glBufferData once, maybe call glBufferSubData once if you used “NULL” to allocate it, and that’s it. Never change what’s there.

STREAM: call glBufferData every frame. If you used “NULL”, then you may map the buffer or use glBufferSubData to modify its contents.

DYNAMIC: call glBufferData once. Call glBufferSubData or map the buffer to change its contents whenever you feel the need, but never reallocate it with glBufferData.

I’m fairly sure the idea behind STREAM is that you wouldn’t need to ping-pong, but I’m not 100% sure how that all works in the driver.
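
As a rough sketch of the three patterns in code (assuming the target buffer is already bound; size and the data pointers are placeholders):

// STATIC: allocate and fill once, never change it.
glBufferData(GL_ARRAY_BUFFER, size, staticData, GL_STATIC_DRAW);

// STREAM: reallocate every frame, then fill via map or SubData.
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
float* p = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
// ... write the whole buffer ...
glUnmapBuffer(GL_ARRAY_BUFFER);

// DYNAMIC: allocate once at init time...
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);
// ...then update in place whenever needed, without reallocating.
glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);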

I would try this…
Draw from buffer A.
Update buffer B.
As soon as that is done set A to NULL and then map it and update it.
Once you’ve done that draw from buffer B.

  1. glVertexAttribPointer & glDrawRangeElements
  2. Do you mean map buffer B and do my updates to it? Or just bind to it?
  3. glBindBuffer(buffer A), glBufferData(NULL). Is this correct?
  4. Repeat?

In no way am I saying that you guys are wrong. The following code appears to produce exactly the behaviour I’m looking for: I’m updating the correct number of elements in my vertex attribute buffer, and the values are changing per frame, with 3% CPU usage even on the 650k-poly brain (apparently changing to STREAM draw fixed that, thanks Reinheart). But it doesn’t appear to use ping-ponging.
Any ideas?

if (counter%2 != 0)
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID2);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal/3) * sizeof(float)), NULL, GL_STREAM_DRAW); // Color Data
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        pData[i] = 0.50;
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}
else
{
    glBindBuffer(GL_ARRAY_BUFFER, ColorVBOID);
    glBufferData(GL_ARRAY_BUFFER, ((VertexTotal/3) * sizeof(float)), NULL, GL_STREAM_DRAW); // Color Data
    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (int i = 0; i < VerticesInRadii; i++)
    {
        pData[i] = 1.0;
    }
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, 0, 0, 0);
    glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal, GL_UNSIGNED_INT, NULL);
}

Note
The above code stays at 3% CPU for my three test models. If I change pData[i] = 0.5; to pData[i] = ComputeColor(); and assign a color in the function, my CPU will spike.

This CPU business is starting to wear my patience thin!
Thanks for everyone’s help.

Are you sure that ComputeColor() is a quick function? What happens if the function simply returns 0.5?

Glad to see some progress has been made… STREAM_DRAW was an interesting call.
I always find the way it’s defined in the glBufferData spec to be slightly confusing.
But if that works then great.

Have you considered SSE for the function in ComputeColor?
You could potentially speed up the CPU side by about 2.5 - 3 times if you parallelize whatever you are doing in ComputeColor()…

If you are not too confident with SSE directly then compiler intrinsics may be an option also.
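
For example, if ComputeColor boiled down to a scale-and-bias per vertex (a pure assumption on my part; ComputeColorsSSE below is hypothetical), the intrinsics version might look like this:

#include <xmmintrin.h> // SSE intrinsics, shipped with VS2008

// Writes scale * src[i] + bias into the mapped pointer, four floats per
// iteration. Assumes count is a multiple of 4; unaligned loads/stores are
// used so the alignment of pData does not matter.
void ComputeColorsSSE(float* pData, const float* src, int count,
                      float scale, float bias)
{
    __m128 s = _mm_set1_ps(scale);
    __m128 b = _mm_set1_ps(bias);
    for (int i = 0; i < count; i += 4)
    {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(pData + i, _mm_add_ps(_mm_mul_ps(v, s), b));
    }
}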

After running 15 tests with STREAM_DRAW and the ComputeColor() function, which only returns 0.5, two CPU spikes occurred. I couldn’t correlate them to anything and could not replicate them by increasing the for-loop and buffer size.

The spike occurred when I did change models though, which is odd.

Now when I change back to DYNAMIC_DRAW, I see more of a correlation: no CPU spikes over my test runs, but when I increased the for-loop and buffer size, the CPU usage went up linearly.

Does this make sense to anyone? The problem remains that my “double buffering” isn’t truly double buffering. I assume that if I can ping-pong correctly, I will not have these CPU and GPU difficulties, since apparently I’m only using one buffer at a time now.

Can anyone help out with the correct way to implement the two buffers?
Thanks!

Let’s try this again, conceptually.

You make 2 VBOs. On frame N, you update VBO 2 and you render with VBO 1. On frame N+1, you update VBO 1 and you render from VBO 2.

Forget the binding with NULL for now, and implement that and see where you are in terms of performance.
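
Something like this (a sketch reusing your variable names; drawVBO / updateVBO are hypothetical handles that just alias your two ColorVBOIDs):

GLuint drawVBO   = ColorVBOID;   // rendered this frame
GLuint updateVBO = ColorVBOID2;  // written this frame, rendered next

// --- every frame ---

// Update the buffer that is NOT being drawn.
glBindBuffer(GL_ARRAY_BUFFER, updateVBO);
float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
for (int i = 0; i < VerticesInRadii; i++)
    pData[finalActivations[i].Vertex] =
        ComputeFinalColor(finalActivations[i].ElectrodeTriplet, i);
glUnmapBuffer(GL_ARRAY_BUFFER);

// Draw from the other buffer.
glBindBuffer(GL_ARRAY_BUFFER, drawVBO);
glVertexAttribPointer(attributeIndex, 1, GL_FLOAT, GL_FALSE, 0, 0);
glDrawRangeElements(GL_TRIANGLES, 0, IndexTotal-1, IndexTotal,
                    GL_UNSIGNED_INT, NULL);

// Swap roles for the next frame.
GLuint tmp = drawVBO; drawVBO = updateVBO; updateVBO = tmp;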

If you use the NULL thing with this method, the only real advantage is that if the GPU is still drawing from VBO 1 when you try to update it, it will buy you some time and the CPU won’t stall, because the OpenGL driver will reallocate the VBO you are binding with NULL, and any data the GPU was still using will be kept in a separate temporary buffer while it finishes up with it.

Are you frame locked or running free?

scratt,

First, thanks for having patience with me! I feel like an OpenGL, for lack of a better word, noob.

Conceptually, on frame N I should glMapBuffer(VBO2).

  1. Does this mean I must glBindBuffer prior to mapping?
  2. After mapping and changing the VBO data, I then use glDrawElements(VBO1). Does this mean I must use glBindBuffer prior to glDrawElements, too?

Once I get this settled, I think this ping-pong game will become much clearer to me. Hopefully.

When you bind something you make it the current object other commands refer to.

If you look at the glMapBuffer command it has no way to specify which buffer it is mapping, so it must get it from somewhere else…

glMapBuffer maps to the client’s address space the entire data store of the buffer object currently bound to target

Likewise with glDrawElements.

They all act on the currently bound buffer.

So the short answer to both your questions is yes. :)

Alright, so I have this correctly implemented: bind one, map it, then bind the other and call glDrawElements while it is bound. Rinse and repeat.

With STREAM and DYNAMIC draw, my CPU is at 50% no matter what. The FPS drops linearly with the size of my for loop and/or the size of my mapped buffer data, i.e. with the total number of vertices in the model and the number of vertex attributes to change.

What can I do to prevent this CPU and FPS drop? It can’t be right; the FPS is terribly slow even if I color half of a 5,000-poly brain, which it wasn’t before, when I was using one buffer (in the failed double-buffer implementation) but had the glBufferData(NULL).