Performance impact of multiple writes to a small VBO

I’m currently working on a rendering engine that uses OpenGL 1.2 to draw certain graphs. In the current design, the canvas is cleared on every render and each object that needs to be drawn is issued its own draw call through the OpenGL 1.2 API. Separate draw calls are used because each object’s visibility may be controlled by external logic.
In a performance test with 50,000 similar objects, drawing is seamless.

For business reasons, we want to move the rendering engine to OpenGL 2.0+. So I brought in simple vertex and fragment shaders and rewrote certain device-rendering classes to keep the design above but route the calls through the shader path. In practice this means that for each object drawn, we write its data into a single one-object VBO (created and bound once at initialization with GL_DYNAMIC_DRAW) and then issue its draw call. Let’s call this Option 1, the least-change path.
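
In code, the Option 1 path looks roughly like the sketch below (identifiers such as Vertex, kMaxObjectBytes and positionAttrib are placeholders, not the actual engine classes, and GL 2.0 entry points are assumed to be loaded already):

```cpp
// Rough sketch of the Option 1 path (identifiers are illustrative only).
// Assumes GL 2.0 entry points are already loaded (GLEW/GLAD/wglGetProcAddress, etc.).
struct Vertex { float x, y, z; };
static const GLsizeiptr kMaxObjectBytes = 1024 * sizeof(Vertex); // assumed per-object upper bound

GLuint vbo = 0;

void initOnce()
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Allocate storage for a single object; the data is streamed in later.
    glBufferData(GL_ARRAY_BUFFER, kMaxObjectBytes, NULL, GL_DYNAMIC_DRAW);
}

void drawOneObject(const Vertex* verts, GLsizei count, GLuint positionAttrib)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Overwrite the same small buffer for every object...
    glBufferSubData(GL_ARRAY_BUFFER, 0, count * sizeof(Vertex), verts);
    glEnableVertexAttribArray(positionAttrib);
    glVertexAttribPointer(positionAttrib, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)0);
    // ...and draw it before the next object's data replaces it.
    glDrawArrays(GL_TRIANGLES, 0, count);
}
```
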
We then tested on three cards: Intel, NVIDIA, and Matrox.

On NVIDIA this works fairly well: the application loads in 3-4 seconds. On Intel it worked fine with the 24.x driver, but with newer drivers (27.x and 30.x) the application has become unusable (rendering at initialization takes up to a minute), and on Matrox it’s sluggish (10-12 seconds to initialize).

For technical argument’s sake, I then rewrote (with dirty hacks) all the calls to go through one big VBO containing all 50k objects, and everything works fine: the application loads in 3-4 seconds on all cards. Let’s call this Option 2, the ideal path.
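
The Option 2 hack keeps the per-object draw calls (so the external visibility logic still works) but uploads everything once, roughly like this sketch (again with placeholder types):

```cpp
#include <vector>   // GL headers/loader assumed to be included elsewhere

// Vertex is the same placeholder struct as in the Option 1 sketch above.
struct ObjectRange { GLint firstVertex; GLsizei vertexCount; bool visible; };

GLuint bigVbo = 0;

void initAllObjects(const std::vector<Vertex>& allVerts)
{
    glGenBuffers(1, &bigVbo);
    glBindBuffer(GL_ARRAY_BUFFER, bigVbo);
    // Single upload of all 50k objects' vertex data at load time.
    glBufferData(GL_ARRAY_BUFFER, allVerts.size() * sizeof(Vertex),
                 allVerts.empty() ? NULL : &allVerts[0], GL_STATIC_DRAW);
}

void drawVisible(const std::vector<ObjectRange>& objects, GLuint positionAttrib)
{
    glBindBuffer(GL_ARRAY_BUFFER, bigVbo);
    glEnableVertexAttribArray(positionAttrib);
    glVertexAttribPointer(positionAttrib, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)0);
    // One draw call per visible object, each sourcing a different range of the big VBO.
    for (size_t i = 0; i < objects.size(); ++i)
        if (objects[i].visible)
            glDrawArrays(GL_TRIANGLES, objects[i].firstVertex, objects[i].vertexCount);
}
```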

Option 1 is what I want to stick with for its simplicity and maintainability, and Option 2 is expensive to implement, maybe even infeasible.

So here is a list of questions; answers to each of them individually would really help me conclude this exercise:

Q1) WHY does the application become sluggish/unusable when a single one-object VBO is written to multiple times to draw many objects? My assumption was that the overhead would be the data transfer to the GPU, which should still be fairly fast. Is there something else about the pipeline or the driver that I’m breaking?

Q2) If there’s technically nothing wrong with Option 1, what aspects should I check are being done right to ensure good performance? While I understand and appreciate that each vendor may optimize differently, can I get the drivers to perform well in this scenario?

Q3) If Option 1 is technically incorrect, does a programmable pipeline implicitly mean creating larger VBOs, transferring the data to the VBO once and modifying it only when something changes? Basically fewer transfers and more draw calls?

Q4) If the answer to Q3 is that yes, larger VBOs are required, does sticking to older OpenGL versions remain a viable alternative, at least until one of the vendors drops support for the older GL versions?

Synchronisation. OpenGL commands are, by default, executed asynchronously. Each OpenGL function call appends a command to a queue and returns; it doesn’t wait for the GPU to actually process the command. If you modify a buffer which is used as a data source for some pending command, the driver has to wait until that command has completed before it can transfer the data.

Synchronisation is a potential issue for any command which transfers data between CPU memory and GPU memory. Transfers within GPU memory are queued like any other command.

See the Buffer Object Streaming wiki page for some tips on how to transfer dynamic data without synchronisation.
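
For example, the re-specification (“orphaning”) technique from that page looks roughly like this; bufferBytes, bytesThisObject and vertexData are placeholders for whatever your engine already uses:

```cpp
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Re-specify ("orphan") the storage: the driver can hand back fresh memory
// instead of waiting for pending draws that still read the old contents.
glBufferData(GL_ARRAY_BUFFER, bufferBytes, NULL, GL_DYNAMIC_DRAW);
// Now write into the new storage without stalling on the previous draw.
glBufferSubData(GL_ARRAY_BUFFER, 0, bytesThisObject, vertexData);
```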

The programmable pipeline under OpenGL 2.0 does not require VBOs.

This is a common mistake so it needs to be said up front.

If you are using OpenGL 2.0 or 2.1 and you are using shaders, they work perfectly fine and are fully featured with client-side vertex arrays or even with glBegin/glEnd code.

I know this doesn’t answer your question, but it’s important to stop you now and correct this misunderstanding before you go too far down the rabbit hole.
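
To illustrate the point, a GL 2.x draw with a shader bound but no buffer object at all can look like this sketch (program, positionAttrib, clientVertices and vertexCount are placeholders):

```cpp
// Client-side vertex array with a shader: the attribute pointer points at
// ordinary system memory because no buffer is bound to GL_ARRAY_BUFFER.
glUseProgram(program);                       // your GLSL program object
glBindBuffer(GL_ARRAY_BUFFER, 0);            // explicitly no VBO
glEnableVertexAttribArray(positionAttrib);
glVertexAttribPointer(positionAttrib, 3, GL_FLOAT, GL_FALSE,
                      3 * sizeof(float), clientVertices);  // pointer into your own array
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```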

This seems to be THE answer I’m looking for. In fact, buffer re-specification seems most apt.
I will experiment and share results soon.
Can I expect the GPU drivers mentioned above to behave consistently with this strategy, or may it vary? Is there anything you can tell from experience?

Thank you, that’s useful.
In my case, the design needs to support newer GL versions as the need arises, hence the more generic shader-based design.

The more efficient buffer-object streaming methods depend on having glMapBufferRange, which isn’t available in core OpenGL 2.x (but check for the GL_ARB_map_buffer_range extension).
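
If the extension is advertised, the streaming path looks roughly like this (a sketch only; writeOffset, writeBytes and vertexData are placeholders, and the function pointer is assumed to be loaded from the extension):

```cpp
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Map just the range you are about to overwrite, telling the driver not to
// preserve its old contents and not to synchronise with pending commands.
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, writeOffset, writeBytes,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
if (dst) {
    memcpy(dst, vertexData, writeBytes);   // fill the mapped range
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
```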

If you’re constrained to the GL calls from the original 1.5 buffer objects, then you’ll probably get more efficiency by decoupling your buffer updates from your draw calls. Doing it that way, you write your data to an intermediate system-memory staging array; once all of your data is written, you make a single glBuffer(Sub)Data call to upload it, and finally you set up and issue all of your draw calls.
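
Something along these lines (a sketch; Vertex, Range, GraphObject and appendVertices are placeholders for however your engine builds its geometry):

```cpp
#include <vector>   // GL headers/loader assumed to be included elsewhere

struct Vertex { float x, y, z; };
struct Range  { GLint first; GLsizei count; };

void updateAndDraw(GLuint vbo, const std::vector<GraphObject>& visibleObjects)
{
    // 1) Write phase: accumulate every visible object into a CPU-side staging array.
    std::vector<Vertex> staging;
    std::vector<Range>  ranges;
    for (size_t i = 0; i < visibleObjects.size(); ++i) {
        Range r;
        r.first = (GLint)staging.size();
        appendVertices(visibleObjects[i], staging);   // hypothetical helper
        r.count = (GLsizei)(staging.size() - r.first);
        ranges.push_back(r);
    }

    // 2) Single glBufferData upload for the whole frame.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, staging.size() * sizeof(Vertex),
                 staging.empty() ? NULL : &staging[0], GL_DYNAMIC_DRAW);

    // 3) Draw phase: one call per object, all reading the buffer just uploaded.
    //    (Attribute pointers set up as in your existing code.)
    for (size_t i = 0; i < ranges.size(); ++i)
        glDrawArrays(GL_TRIANGLES, ranges[i].first, ranges[i].count);
}
```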

If you know you’re going to be constrained to these older versions, it’s also perfectly valid to use buffer objects for static data and client-side vertex arrays for dynamic data, at least until you’re ready to step up to a GL version that does implement MapBufferRange.