Using Only Core Profile to Render Many Changing Objects

Hello,

How should I go about rendering a large number of low-polygon objects that have different vertex formats and render states?
I tried using Map/Unmap on a buffer for every object, but this reduced performance significantly compared to what I get using immediate-mode glBegin/glEnd.
Another problem is the vertex buffer I need to allocate for all the objects: since the objects change dynamically and very frequently, it’s hard to predict a buffer size that fits all of them, so I sometimes need to render an object in multiple passes to fit the buffer size, with additional map/unmaps.

Any suggestions/guidelines?

Thanks

Don’t map/unmap frequently, and don’t overwrite data that has been used by recently-submitted commands (because “recently-submitted” means “not yet executed”, so the CPU will have to wait until the commands have completed before copying the data).

Figure out how much data you have in total, allocate one or more buffers, and copy all of the data. If these are “low polygon” objects, the issue isn’t going to be the total amount of data; it will be the per-command overhead.

And mainly, don’t try to create an engine which will just take whatever is thrown at it. “Real” engines don’t do that; the program (and by extension the developer) is forced to make some choices in advance. Typically, the structure of the data is fixed (including limits on its size), only the actual data varies.

Don’t map/unmap frequently

Adding to this, don’t unmap at all. Use persistent mapped buffers and employ proper buffer object streaming techniques.
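As a sketch of the bookkeeping side of that (the names StreamCursor and acquire are made up for illustration; the real buffer would be created once with glBufferStorage using GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, mapped once with glMapBufferRange, and never unmapped):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bookkeeping for a persistently mapped streaming buffer. The GL side
// would be created once with:
//   glBufferStorage(GL_ARRAY_BUFFER, size, nullptr,
//       GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
//   void* ptr = glMapBufferRange(...same flags...);
// and never unmapped. Here we only model the offset arithmetic.
struct StreamCursor {
    std::size_t capacity;   // total buffer size in bytes
    std::size_t head = 0;   // next free byte

    // Returns the offset at which 'bytes' can be written, or SIZE_MAX
    // if the buffer is out of space (the caller then fences this buffer
    // and moves on to the next one in the ring, or waits).
    std::size_t acquire(std::size_t bytes, std::size_t align = 4) {
        std::size_t offset = (head + align - 1) & ~(align - 1);
        if (offset + bytes > capacity)
            return SIZE_MAX;    // out of space: wrap / next buffer
        head = offset + bytes;
        return offset;
    }
};
```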

the objects change dynamically and very frequently, it’s hard to predict a vertex buffer that fits all objects, so I sometimes need to render the object in multiple iterations to fit the buffer size with additional map/unmap’s.

For this kind of rendering scenario, I would suggest imposing slightly on the outside world here.

The big problem with glBegin/End is that this API doesn’t provide enough information for you to know how much storage the vertex data in the glBegin/End pair will actually need. The format of the vertex data is not provided, and there is no indication of how many vertices the caller will provide. Both of these pieces of information are critical.

The thing is, most people using glBegin/End already know how many vertices they’re about to send (or at least, it’s easy for them to figure it out). And their vertex data format is hard-coded into which glVertex/TexCoord/etc functions they call. So in both cases, the information is known by the caller.

So simply require that the caller provide that information to you.

Your equivalent to glBegin ought to take a vertex format descriptor (of some kind) and a vertex count. That way, you can compute how many bytes of data you will need.
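A minimal sketch of what such a descriptor could look like (VertexFormat and bytesNeeded are hypothetical names for illustration, not part of any API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex-format descriptor: a list of attribute sizes in
// bytes (e.g. position = 12, normal = 12, texcoord = 8). The stride is
// their sum, so the glBegin replacement can size its allocation up front.
struct VertexFormat {
    std::vector<std::size_t> attributeBytes;

    std::size_t stride() const {
        std::size_t s = 0;
        for (std::size_t a : attributeBytes) s += a;
        return s;
    }
};

// Bytes the "Begin" call must reserve for 'vertexCount' vertices.
std::size_t bytesNeeded(const VertexFormat& fmt, std::size_t vertexCount) {
    return fmt.stride() * vertexCount;
}
```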

Given that information, what you need next is a ring of buffer objects of known allocation size. If the number of bytes exceeds the size of the buffer you currently have “open”, then you “close” that buffer (if you’re not using coherent mapping, this is where you flush the buffer), pull the next one out of the ring, “open” it up and start writing to it. If the next buffer is still in use from previous rendering commands (verified via a fence placed when the buffer is “closed”)… then you’re out of luck and have to stall the CPU until the GPU is done with that buffer.
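Here’s a toy model of that ring logic, with a bool standing in for the GLsync fence (in real code you’d call glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) when “closing” a buffer and glClientWaitSync before reusing it; the struct and function names are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the ring described above. 'inFlight' stands in for an
// outstanding GLsync, and the "stall" branch stands in for a blocking
// glClientWaitSync, so only the control flow is shown here.
struct RingBuffer {
    std::size_t capacity;
    std::size_t used = 0;
    bool inFlight = false;   // stands in for an outstanding fence
};

struct BufferRing {
    std::vector<RingBuffer> buffers;
    std::size_t current = 0;
    int stalls = 0;          // times we had to wait on the GPU

    // Returns the index of a buffer with room for 'bytes', advancing
    // the ring (and "waiting" on fences) as needed.
    std::size_t acquire(std::size_t bytes) {
        RingBuffer& b = buffers[current];
        if (b.used + bytes <= b.capacity) { b.used += bytes; return current; }
        b.inFlight = true;                        // "close": place a fence
        current = (current + 1) % buffers.size();
        RingBuffer& next = buffers[current];
        if (next.inFlight) {                      // GPU still using it:
            ++stalls;                             // blocking wait here
            next.inFlight = false;
        }
        next.used = bytes;                        // "open" and start writing
        return current;
    }
};
```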

How many buffers do you need in the ring, and what sizes should they be? That’s up to you. You may even have multiple rings for different use cases: GUI rendering might only need 2x 8MB buffers, while serious vertex rendering could use 4x 128MB buffers.

Thanks!

I managed to create one relatively large buffer with persistent mapping, and this requires synchronization. How fast is using fence sync objects compared to glFinish()? Is glDeleteSync slow as it may require memory deallocation?

How fast is using fence sync objects compared to glFinish()?

If you’re talking about glClientWaitSync, I can’t imagine how it could be slower than glFinish, because then it would just be glFinish.

Fences are generally not something you should spend any significant time waiting on. You should have enough buffers/suballocations that, most of the time, a fence has already completed by the time you get back around to that buffer/suballocation.

Is glDeleteSync slow as it may require memory deallocation?

It might. It might not. There’s no way to know without profiling it. Even so, it’s not like you’re calling it a bunch every frame.

Worst-case, you can add fences that you’re done with to some array and delete them all outside of the main rendering loop.
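Something like this (Sync stands in for GLsync here, and the actual glDeleteSync call is shown as a comment; the names are made up for illustration):

```cpp
#include <cassert>
#include <vector>

// Deferred fence cleanup: instead of calling glDeleteSync inside the
// hot loop, retired sync handles go into a list that is drained once
// per frame (or less often), outside the main rendering loop.
using Sync = void*;   // stands in for GLsync

struct FenceGraveyard {
    std::vector<Sync> retired;
    int deleted = 0;

    void retire(Sync s) { retired.push_back(s); }  // cheap: just a push

    void drain() {                                 // called outside the loop
        for (Sync s : retired) {
            (void)s;                               // glDeleteSync(s);
            ++deleted;
        }
        retired.clear();
    }
};
```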

Now let’s say the primitive’s vertices don’t fit into this vertex buffer; is there a way I can tell OpenGL to split the drawing of this large primitive into multiple iterations? For instance, a triangle strip contains 10K vertices but the buffer only holds 4K vertices.

You can’t split a “connected” primitive (strip, loop, or fan) across draw calls or across buffers. You’ll have to deal with the connection manually, i.e. the common vertices will have to be included in each buffer.
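One way to do that manual splitting for a triangle strip — a sketch, assuming chunks of at least 4 vertices. Each chunk repeats the previous chunk’s last two vertices, and advances by an even vertex count so the strip’s winding parity is preserved across draws:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits a triangle strip of 'total' vertices into chunks of at most
// 'maxPerChunk' vertices (maxPerChunk must be >= 4). The last two
// vertices of each chunk reappear at the start of the next one, and the
// step between chunk starts is rounded down to an even number so every
// chunk begins on an even global index, keeping the winding correct.
// Returns (firstVertex, vertexCount) pairs, one per draw call.
std::vector<std::pair<std::size_t, std::size_t>>
splitStrip(std::size_t total, std::size_t maxPerChunk) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    std::size_t advance = (maxPerChunk - 2) & ~std::size_t(1); // even step
    for (std::size_t start = 0; start + 2 < total; start += advance) {
        std::size_t count = std::min(maxPerChunk, total - start);
        chunks.emplace_back(start, count);
        if (start + count >= total) break;   // strip fully covered
    }
    return chunks;
}
```

Each chunk draws `count - 2` triangles, so the totals add up to the original strip’s `total - 2` triangles with no gap and no duplicate triangle.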

is there a way I can tell OpenGL to split the drawing of this large primitive into multiple iterations?

If you mean, besides issuing multiple rendering commands, no. Even if you did manual pulling of vertex shader inputs, you’d be talking about indexing into an array of SSBOs with a non-dynamically uniform index. That’s not allowed.

If this sort of thing is happening frequently, then you need bigger blocks of storage. Or perhaps to sort (where possible) by approximate length of vertex data.