OpenGL UBO performance issue XOR fence sync problem

Asylum · August 15, 2017, 2:56am

(this is a duplication of the discussion I started on the AMD community forums)

This topic is meant to discuss two different problems:

1. UBO updates with glBufferSubData are extremely slow.
2. Sync objects seem to ignore that the buffer I would like to reuse is no longer in use by the GPU (the same thing happens on Intel, but with somewhat different effects).

First of all, screenshots from the expected and actual results:

[Attached screenshots: attachments 1529 and 1530]

And some measurements on two different cards (I refer to the methods by their number):

Intel HD 4600:

1 - 23 fps
2 - 23 fps
3 - 7 fps
4 - 23 fps and the above bug (camera movement reveals that the missing meshes are drawn to the same place as the visible ones)

AMD R7 360:

1 - 103 fps
2 - 5 fps
3 - 46 fps
4 - 96 fps and the above bug (missing objects are flickering)

Repro source code: https://www.dropbox.com/s/x04p8zq1lvskv7d/FOR_AMD_4.zip?dl=0

I have two questions:

1. Why is there such a (huge) difference between methods 2 and 3 on Intel vs. AMD?
2. Why are the other teapots missing in method 4 (unless I orphan, which drops performance again)?

mhagain · August 15, 2017, 6:03am

For problem 1, “UBO with glBufferSubData is extremely slow”, performance will heavily depend on how often you update your UBO(s).

Generally speaking, there are three main update strategies you can use.

Assume that you are drawing 1000 objects.

Strategy 1 is to have one UBO, sized for a single object, and update it 1000 times.
Strategy 2 is to have 1000 UBOs, each sized for a single object, and update each once.
Strategy 3 is to have one UBO, sized for 1000 objects, and update it once only.

The important thing to realize is that of these strategies, and under OpenGL, strategy 3 is the only one that will run fast.

The problem isn’t UBOs, it’s OpenGL’s buffer object API which is causing your performance loss. These problems don’t happen in Direct3D where strategies 1 or 2 are also viable (with 1 having the performance edge). Note also that I haven’t benchmarked this using the GL_ARB_buffer_storage API so things may be different with that too.
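
As a rough illustration of strategy 3, here is a minimal sketch of the CPU side only. The `ObjectUniforms` struct and both function names are hypothetical, and `alignment` stands in for whatever `glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...)` reports on the target driver. The GL calls appear only as comments because they need a live context, and selecting each object's slice with `glBindBufferRange` is one possible approach, not necessarily the one meant above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-object uniform block (illustrative layout only).
struct ObjectUniforms {
    float world[16];
    float color[4];
};

// Round `value` up to the next multiple of `alignment` (alignment > 0).
inline std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Pack all objects into one staging array, one aligned slot per object,
// so the whole UBO can be filled with a single update per frame.
inline std::vector<unsigned char> packUniforms(
        const std::vector<ObjectUniforms>& objects, std::size_t alignment) {
    const std::size_t stride = alignUp(sizeof(ObjectUniforms), alignment);
    std::vector<unsigned char> staging(stride * objects.size(), 0);
    for (std::size_t i = 0; i < objects.size(); ++i) {
        std::memcpy(staging.data() + i * stride, &objects[i],
                    sizeof(ObjectUniforms));
    }
    return staging;
}

// With a live context, a frame would then do roughly:
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, staging.size(), staging.data()); // once
//   for each object i:
//     glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo, i * stride,
//                       sizeof(ObjectUniforms));
//     ... draw call ...
```

The cost is padding: each slot is rounded up to the offset alignment, so the buffer is larger than `objects.size() * sizeof(ObjectUniforms)`.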

Asylum · August 15, 2017, 6:32am

@mhagain:

Yes, these are (mostly) obvious. But as I mentioned, on the Intel HD 4600 my 2nd method is just as fast as the 1st (and, from your solutions, as the 3rd). So, looked at this way, it is heavily hardware dependent which of your suggestions is going to be fast.

Another thing: this is a test app. In the real application I can’t preallocate a large enough UBO because it would waste memory, so I have to stream (thus my 4th approach).

And lastly: I came to realize the reason behind the artifact in my 4th method: it’s frame queueing. So a little modification solves it and it’s almost as fast as my 1st method. Theoretically it could be improved further by ringbuffering, but as it turned out the number of fences has negligible effect on performance (at least on AMD).

mhagain · August 15, 2017, 9:29am

The Intel is a shared memory architecture; in general you can expect different performance, yes.

…I can’t preallocate a large enough UBO because it would waste memory…

Define “waste memory”, please. This is something we see a lot of; people can be reluctant to use extra memory because they see it as “waste”, whereas in reality if you’re allocating extra memory and getting a return for it (in this case the return is performance) it’s not “waste”, it’s “use”. This is often a very worthwhile tradeoff. Seeing the term “waste memory” always raises my suspicions, because it’s almost always a false tradeoff - less memory but performance falls off a cliff.

If you’re really concerned about memory usage then update and draw in groups of 1000 or some other arbitrary number which you’ll determine by benchmarking. The important thing is to get the number of UBO updates you do as low as possible, because that’s what’s going to be slow.
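
The grouping idea can be sketched as follows. This is only an illustration: `Batch` and `makeBatches` are hypothetical names, and the batch size of 1000 is the arbitrary starting point mentioned above, to be tuned by benchmarking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous run of objects that share one UBO update.
struct Batch {
    std::size_t first;  // index of the first object in the batch
    std::size_t count;  // number of objects in the batch
};

// Split `objectCount` objects into batches of at most `batchSize` objects.
// Each batch costs one UBO update followed by `count` draw calls.
inline std::vector<Batch> makeBatches(std::size_t objectCount,
                                      std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<Batch> batches;
    for (std::size_t first = 0; first < objectCount; first += batchSize) {
        batches.push_back({first, std::min(batchSize, objectCount - first)});
    }
    return batches;
}
```

For 2500 objects and a batch size of 1000 this gives three batches (1000, 1000 and 500 objects), so three UBO updates per frame instead of 2500, while the UBO only needs room for 1000 objects.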

Asylum · August 15, 2017, 9:44am

Well it is a CAD program and it already has a lot of problems with GPU memory. Of course I don’t know what the driver does in the background with glUniformXX, but I certainly don’t want to allocate drawcalls * ubosize sized buffers as they can be pretty large (for example ubosize is around 1 kB). Not to mention the memory used up for framebuffers, geometry data and shadow volumes (geeeeeez…)

Nowadays of course this “shouldn’t be” a problem, but how do you convince a user to buy a video card with sufficient RAM? (depending on the project he is working on…)
But really, I am with you in this one. I always say, “fck that lame”… however the management doesn’t agree with me because that means they lose customers…

Therefore I have to prepare this thing to work efficiently on GL 3.2 (3.3) hardware… Now that’s a challenge indeed…

stimulate · August 21, 2017, 12:11am

glBufferSubData + syncing is indeed very slow. What you can try, if your target hardware supports it, are persistently mapped buffers.

The idea is that you keep around a pointer to a block of memory and directly update the data without any driver overhead. In order to avoid collisions with data which is currently being used, you allocate about 2-3 times as much memory as your maximum upload will take up and you will update the block like a ring buffer. So you increment the upload offset and jump back to 0 when an upload would exceed the storage size.

You do this by first allocating the storage you need using gl(Named)BufferStorage with the GL_MAP_WRITE_BIT, GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags set. Then you map the entire buffer using glMap(Named)BufferRange with the same flags set. You may need to use glBindBufferBase to permanently bind your uniform buffer object to a binding point of GL_UNIFORM_BUFFER.

Now you can use the pointer to update the storage. To make the shaders aware of what range to use for their interface blocks, call glBindBufferRange after each upload. Because of this, you need to always keep your upload offset aligned with GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which can be queried using glGetIntegerv.
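
The offset bookkeeping described above might look like the following sketch. It covers only the CPU-side arithmetic (no GL calls), `RingAllocator` is a hypothetical name, and `alignment` is assumed to be the value queried for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.

```cpp
#include <cassert>
#include <cstddef>

// Hands out aligned write offsets into a ring-buffered UBO.
class RingAllocator {
public:
    // `capacity`:  total buffer size in bytes (e.g. 2-3x the largest upload).
    // `alignment`: queried GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
    RingAllocator(std::size_t capacity, std::size_t alignment)
        : capacity_(capacity), alignment_(alignment), head_(0) {
        assert(alignment > 0 && capacity >= alignment);
    }

    // Returns the aligned offset at which `size` bytes may be written,
    // jumping back to 0 when the upload would exceed the storage size.
    std::size_t allocate(std::size_t size) {
        assert(size <= capacity_);
        std::size_t offset = roundUp(head_);
        if (offset + size > capacity_) {
            offset = 0;  // wrap around to the start of the buffer
        }
        head_ = offset + size;
        return offset;
    }

private:
    std::size_t roundUp(std::size_t v) const {
        return (v + alignment_ - 1) / alignment_ * alignment_;
    }

    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t head_;  // first byte after the most recent upload
};
```

With a capacity of 1024 and an alignment of 256, successive 80-byte uploads land at offsets 0, 256, 512 and 768 and then wrap to 0. Note that the wrap itself does not check whether the GPU has finished reading the old data; this sketch relies on the buffer being oversized as described, or on a fence added separately.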

Asylum · August 21, 2017, 12:46am

I never said that I use the two together… these are distinct solutions to the same problem; that is, only my 4th method uses a fence (and as I already said, I realized what the driver was doing in that case, so a little modification solves the “half missing” problem).

I also mentioned that my target is GL 3.3, as OS X doesn’t support GL 4.4 (which is required for persistent mapping).

Now my problem is that while the test app runs okay, the real app (with unsynced_bit + fence) runs very slowly, indicating that fences do more than just syncing buffers… For now, most of the CPU (!) time is spent in glBindBufferRange and glDrawBaseElements. I just can’t figure out why…

Dark_Photon · August 22, 2017, 6:07am

A suggestion: You might describe what you’re doing for Method #4 with your assumptions, and then list a few brief code snippets that show the implementation. This’ll probably spark more suggestions from the group on things you can try.

Scanning your code, your glClientWaitSync() logic bothers me, but you implied above that you fixed that.

…indicating that fences do more than just syncing buffers…

I don’t think the premise implies that conclusion, but you may be implicitly factoring in other data you’ve collected that you didn’t describe here.

That said, if you think the fences are a problem, you can get rid of them completely using buffer orphaning (see Buffer Object Streaming). With that, the synchronization is implicit (handled internally by the driver) rather than explicit. As a data point, this works exceptionally well on NVidia GL drivers. It can also work well on mobile/sort-middle architectures (Intel is likely one of these) with GL-compliant drivers, but you have to be more careful with mid-frame render target flushes as these can generate artifacts.

For now, most of the CPU (!) time is spent in glBindBufferRange and glDrawBaseElements. I just can’t figure out why…

What are your current app frame times on Intel, AMD (and NVidia if you’ve got it) with Method #4 in each of these cases?

1. as is (updates every frame)
2. with no buffer updates (i.e. no BindBuffer/MapBufferRange/UnmapBuffer nor glClientWaitSync), just re-using the previous buffer contents
3. with no bind buffer range calls
4. with no buffer updates nor bind buffer range calls (i.e. using the previous buffer contents and buffer bindings)

Also, please list frame times (in milliseconds), not fps. fps is nearly useless (Performance).
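
To illustrate why frame times compare better than fps, here is a small conversion sketch (`frameTimeMs` is just an illustrative helper name). Using the figures from the first post: on the AMD card, 103 fps vs. 96 fps is about 9.7 ms vs. 10.4 ms, a difference of roughly 0.7 ms per frame, while on the Intel, 23 fps vs. 7 fps is about 43.5 ms vs. 142.9 ms, a difference of roughly 99 ms per frame.

```cpp
#include <cassert>
#include <cmath>

// Convert frames per second to milliseconds per frame.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

Because the relationship is reciprocal, equal fps differences do not correspond to equal amounts of extra work per frame, which is why costs are easier to attribute in milliseconds.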

Asylum · September 26, 2017, 8:36am

The (fixed) demo code will be available once I have finished my article (“research”). In the meantime AMD is fixing the glBufferSubData slowness.
Buffer orphaning doesn’t help too much (and I would have to assume that it works on all hardware, which is unlikely). Persistent mapping, on the other hand, helps a lot (unfortunately I can’t use it because of macOS).

Btw. I kinda fixed the slowness issue in the real app; as it turned out, the high-level implementation overfed the GPU with drawing commands (which resulted in a lot of fences too). Now it runs relatively ok, but tbh. its low performance is more of a high-level design issue (“draw a line here” <— do that X million times).

(ps.: for comparison purposes on this scale, fps is fine)