glMapBuffer CPU Usage Peaking

So if you then alter your re-factored code to simply return 0.5 as before, does the same problem persist?

Have you tried profiling your code?
I don’t profile on PCs, so I can’t recommend an app to do that. But something along the lines of Shark (part of the OS X dev tools) will give you an idea of where your app is spending its time. It should become very clear to you then.

The FPS drops linearly with the size of my for loop and/or the size of my mapBuffer data.

Although I can’t say for certain, and your poly numbers in your latest post seem low, it does seem that using the “NULL” trick was buying you some extra time when your current frame caught up with the last (so to speak). As I explained before, that effectively gives you a second “double buffer” on the GPU side for each VBO, which is handled automatically by the driver.

Without seeing your code in its entirety I don’t think I can help much more.

If you use the glBufferData(… NULL) trick, you do not need the double VBO and mapping approach. It is worth a try since you can implement it quickly, but I would not expect good performance.

Programmatically, with one big VBO with STREAM_DRAW usage, you would do the following each frame (a rough sketch follows the list):

  1. bind the buffer
  2. invalidate its data by calling glBufferData(…, NULL)
  3. update the entire VBO with a call to glBufferSubData
  4. use it for rendering
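
Here is a minimal sketch of those four steps, assuming the VBO was created once with GL_STREAM_DRAW. The function and variable names are placeholders, not code from this thread:

```cpp
// Minimal sketch of the four steps above. Assumes the VBO was created once
// with glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);
// all names here (vbo, cpuData, vertexCount) are placeholders.
void streamAndDraw(GLuint vbo, const float* cpuData,
                   GLsizeiptr bufferSize, GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);                                // 1. bind the buffer

    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);   // 2. invalidate (orphan) the old data

    glBufferSubData(GL_ARRAY_BUFFER, 0, bufferSize, cpuData);          // 3. upload the whole new data set

    glVertexPointer(3, GL_FLOAT, 0, 0);                                // 4. render from it
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```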

Very simple but very brutal, and you will rapidly hit the bus bandwidth limits. Furthermore, if your update function is computationally expensive, you can simply forget it.

I do think that the only robust solution is the ping/pong and multithreaded approach. Simply returning 0.5 ignores the real problem, which is the update delay that stalls the whole program every frame.

I agree, which is why I suggested a little way back in this thread to look at coding that function in SSE or with compiler intrinsics. :slight_smile:

scratt:

So if you then alter your re-factored code to simply return 0.5 as before, does the same problem persist?

CPU usage is at 55% for every poly model size and every for-loop size. The FPS slows down with a larger poly model and/or a larger for loop / more buffer element changes.

scratt:

Have you tried profiling your code?

I have not. I could only find gDebugger, which allows just a 3- or 7-day trial and then a license must be bought. I haven’t had any OpenGL training, so I wasn’t sure what the standard tools are for GPU/CPU usage tracing.

scratt:

But something along the lines of Shark (part of the OS X dev tools)…

Right now I’m programming on Windows in Bootcamp. If I boot into OS X and get the dev tools, would my code port easily? Perhaps I should start programming this in OS X to take advantage of Shark?

scratt:

Without seeing your code in its entirety I don’t think I can help much more.

I hope that’s not the case! This CPU/GPU stalling is quite out of my league. I have compiled a list of performance stats to help pin down where things are going wrong; hopefully you can still take a look.

dletozeun:

If you use the glBufferData(… NULL) trick, you do not need the double VBO and mapping approach. It is worth a try since you can implement it quickly, but I would not expect good performance.

I have two VBOs set up for the attributes. I then BindBuffer(Buffer1), glBufferData(NULL), glMapBuffer(Buffer1), update, UnmapBuffer, and DrawElements (Buffer1 is still bound).

I alternate this each frame: on frame N I do this to Buffer1, on frame N+1 to Buffer2. This provides slightly better performance than the double buffering described above. The stats will follow below.
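
Roughly, the per-frame alternation looks like this (a simplified sketch; the names and the updateAttributes() call are placeholders, not my actual code):

```cpp
// Simplified sketch of the ping/pong alternation described above. Buffer
// handles, sizes, and updateAttributes() are placeholder names only.
void updateAttributes(float* pData);          // hypothetical per-frame update loop

void updateAndDraw(GLuint vbo[2], unsigned frame, GLsizeiptr bufferSize,
                   GLsizei indexCount, const GLuint* indices)
{
    GLuint current = vbo[frame & 1];          // Buffer1 on frame N, Buffer2 on frame N+1

    glBindBuffer(GL_ARRAY_BUFFER, current);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);   // orphan the old storage

    float* pData = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (pData)
    {
        updateAttributes(pData);              // write the new attribute values
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }

    glVertexPointer(3, GL_FLOAT, 0, 0);       // the just-updated buffer is still bound
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, indices);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```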

dletozeun:

Very simple but very brutal, and you will rapidly hit the bus bandwidth limits. Furthermore, if your update function is computationally expensive, you can simply forget it.

What sort of bus bandwidth limits would I be looking at? Am I encountering them now? My update function has a for loop of approximately 5 iterations. Each iteration uses an index to look up a value in an array (linear time?) and then multiplies this by another value in another array (linear time?). It adds the result to a float, which it then returns to the main glMapBuffer for loop to change the element in memory.

dletozeun:

I do think that the only robust solution is the ping/pong and multithreaded approach.

How is it that the ping/pong method now always peaks the CPU at 55% regardless of how many buffer elements change, while the other method with glBufferData(NULL) appears to depend entirely on the number of elements I change in my for loop?

Does my CPU max out at around 50% because the Intel Mac mini is a dual-core Intel?

With the correct double buffer implementation, the CPU is always at 55% no matter what the size of the for loop or the number of buffer changes is. That is a slightly higher average CPU (55%) than the incorrect double buffer implementation, which is around 50% when spiking.

Incorrect Buffer Implementation Statistics
Per frame: bind one buffer, glBufferData(NULL), glMapBuffer, update, glUnmapBuffer, glDrawElements

With pData[i] = 0.5;


650k Model
10% of total model, for loop size 64000 -> CPU 50%
Under 10% of total model, for loop < 64000 -> CPU 4%
Above 10% of total model, for loop > 64000 -> CPU 50%

50k Model -> Buffer Size of 25000 Floats -> Always 3%

25k Model -> 20k Floats Buffer Size -> Always 3%


With pData[index] = Compute()


650k Poly Model - Floats, Size of Buffer

50k Poly Model - 25k Floats, Size of Buffer

Low Poly Model - 20k Floats, Size of Buffer
1)
For loop size 576, change 576 elements -> 20% CPU
(CPU jumps to 50%, then drops to 20% after 2 seconds)

For loop size 720, change 720 elements -> 27% CPU

For loop size 886, change 886 elements -> 33% CPU

For loop size 1061, change 1061 elements -> 39% CPU

For loop size 1536, change 1536 elements -> 50% CPU

Addition
When I moved the single-buffer glBufferData(NULL) code to my laptop (a Lenovo T61 with a Quadro FX 570M), the data is no longer correct. The brain model has random coloring that looks like noise; the correct coloring is still there, just with all this random noise on top. The double buffer method, though, works perfectly on the laptop.

I have two VBOs set up for the attributes. I then BindBuffer(Buffer1), glBufferData(NULL), glMapBuffer(Buffer1), update, UnmapBuffer, and DrawElements (Buffer1 is still bound).

Do not call glBufferData(NULL) if you then map the buffer. If you do, you tell the driver to reallocate the buffer memory and you will have to upload the entire VBO content again.

Try exactly what I suggested in my last post above.

What sort of bus bandwidth limits would I be looking at?

The bandwidth between system memory and GPU memory is very high but limited. If you upload a huge amount of data every frame, it may take significant time (depending on the hardware) and affect your program’s performance.
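
For a rough, purely illustrative sense of scale (my numbers, not measurements from your program): uploading N floats per frame costs N × 4 bytes, so at 60 fps one million floats per frame is already 1,000,000 × 4 × 60 = 240 MB/s of upload traffic, before counting normals, colors, or any other attributes.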

Like Scratt said, it is quite hard to help from a distance at this point, but I think you are not far from making it work. The problem, IMHO, is that all this stuff is not entirely clear to you yet, and that is totally normal.
Clean up your code and carefully re-read everything we suggested in this thread. Read the specs, tutorials, and the OpenGL wiki about VBOs until every VBO-related OpenGL command you write is perfectly clear to you.

I guess my Intel Mac integrated graphics card isn’t up to the task. Just ran the program on a better Dell, better graphics, and 3% CPU with double buffering.

I’m still in shock because of this, but Scratt and dletozeun, thanks for the incredibly helpful advice.

One day I hope to answer some young OpenGLer’s questions on the forum, à la the circle of life.

No worries on the help. It’s interesting from this end also. :slight_smile:
Seems like you’ve got your answer on the performance problem.

A couple of things to think about for the future:

Indexed Arrays: If you can generate your vertices and detect duplicate vertices efficiently (no mean feat by any means), then you could perhaps limit your upload bandwidth with indexed arrays. You mentioned a brain model earlier. That must be pre-generated data, so I would have thought it might be possible in that case.
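
Something like this, purely as an illustration of the de-duplication idea (the types and names are mine, not from your code):

```cpp
// Illustration only: build an indexed mesh by de-duplicating identical
// positions with a std::map. Types and names are hypothetical.
#include <map>
#include <vector>

struct Vertex
{
    float x, y, z;
    bool operator<(const Vertex& o) const
    {
        if (x != o.x) return x < o.x;
        if (y != o.y) return y < o.y;
        return z < o.z;
    }
};

void buildIndexedMesh(const std::vector<Vertex>& rawTriangleVerts,
                      std::vector<Vertex>& uniqueVerts,
                      std::vector<unsigned int>& indices)
{
    std::map<Vertex, unsigned int> seen;
    for (size_t i = 0; i < rawTriangleVerts.size(); ++i)
    {
        const Vertex& v = rawTriangleVerts[i];
        std::map<Vertex, unsigned int>::const_iterator it = seen.find(v);
        if (it == seen.end())
        {
            unsigned int idx = (unsigned int)uniqueVerts.size();
            seen[v] = idx;
            uniqueVerts.push_back(v);
            indices.push_back(idx);
        }
        else
        {
            indices.push_back(it->second);
        }
    }
    // uniqueVerts go into the VBO, indices into a GL_ELEMENT_ARRAY_BUFFER,
    // and the draw call becomes glDrawElements instead of glDrawArrays.
}
```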

SSE / Compiler Intrinsics: I know I keep bleating on about them, but you can theoretically speed up the maths by up to 4 times. In real terms, however, it’s more like 2.5 times. But still worth having.
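
As a taste of what that looks like, a multiply-accumulate over arrays of floats with SSE intrinsics might be written like this (illustrative only; it assumes n is a multiple of 4 and the arrays are 16-byte aligned):

```cpp
// Illustrative SSE sketch: out[i] += a[i] * b[i], four floats at a time.
// Assumes n is a multiple of 4 and all three arrays are 16-byte aligned.
#include <xmmintrin.h>

void multiplyAccumulate(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        __m128 vo = _mm_load_ps(out + i);
        vo = _mm_add_ps(vo, _mm_mul_ps(va, vb));
        _mm_store_ps(out + i, vo);
    }
}
```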

With regards to Apple’s developer tools: well, you know which camp I am in, but I have used Visual Studio as well and it has its good points too. Out of preference I develop in OS X. And it is relatively easy, particularly if you base your code on SDL or GLUT or something like that, to move your code base back and forth between the two environments. :slight_smile:

With Apple’s developer tools, though, you get a cohesive set of OpenGL debuggers and profilers, code profilers, and a really nice integrated dev system with a visual interface builder and XCode, which relies on gcc to build code. And it’s all free and updated more often than is comfortable to live with, because the downloads are so huge! To put that together on a PC is, to the best of my knowledge, expensive if you want the real deal (i.e. brand-name stuff), or a bit tortuous if you try to do it all with open source / free solutions.

Just my 2c. Not trying to start a flame war! :slight_smile:

I guess my Intel Mac integrated graphics card isn’t up to the task. Just ran the program on a better Dell, better graphics, and 3% CPU with double buffering.

I’m still in shock because of this

I did not realize before that your graphics card is an Intel one… don’t be shocked; the OpenGL driver implementation for this hardware is, to say the least, not optimized and not very reliable.
My advice is not to take performance results on this hardware seriously; it is targeted more at desktop usage than at multimedia or games.

Keep up the good work and above all, don’t despair, it is not an easy task! :slight_smile:

I am confused because I read that earlier and then checked the beginning of the thread and Sean said he had a 9400M.

NVidia GeForce 9400m, Driver Version 6.14.11.8585

That is actually a fairly good GPU from NVidia.
It’s so good (for a mobile GPU) that I am disappointed with the performance of my 9600 compared to it.

When I run my procedural planet stuff with atmospheric shaders and all the jazz, the 9400 is only about 20% off the performance of my 9600, and it makes my ATI X1600 (a different generation, I know) look like an iPhone GPU!

If Sean’s GPU is one of the GMAs, particularly the 950, then I am surprised he got as far as he did! :wink: :stuck_out_tongue:

My card and driver is most assuredly the NVidia GeForce 9400m, Driver Version 6.14.11.8585.

I didn’t know this card would let me down! I thought it was fairly decent.

Scratt, what does ‘GMA’ mean?

And I’m surprised it works. Only 50 posts later in this thread and… success!

Graphics Media Accelerator, AKA “a really crappy integrated graphics chipset from Intel, with really, really ropey drivers.” :wink:

It may be the memory.

Remember, your issue isn’t with rendering performance; it’s with uploading vertices to video memory. Integrated graphics chips don’t have dedicated video memory. It may be that nVidia hasn’t, or can’t, optimize streamed buffer objects for integrated graphics chips, due to differences in their memory architecture.

I never knew that integrated graphics were causing me such problems. Thanks, everyone, for the great information!
