glMapBuffer time reduction

Hi all,

Recently I ran into something very interesting, and a little frustrating, since it cannot be hidden on slow GPUs.
Namely, I have discovered that glMapBuffer/glMapBufferRange calls are synchronized with the GPU frame time. This is the code my renderer uses to retrieve values from the transform feedback (TF) buffer.

    void GLRenderer::CreateTFBuffer()
    {
        unsigned int size = ...;//
        glGenBuffers(1, &m_TF_ID);
        glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, m_TF_ID);
        // ...
    }

    bool GLRenderer::QueryTF()
    {
        if(!m_bTFRead) return false;
        GLint available = 0;
        glGetQueryObjectiv(m_tfQuery, GL_QUERY_RESULT_AVAILABLE, &available);
        if(available == 0) return false;
        //... Other code ...

        int first = 0, count = ...;
        glDrawArrays(GL_POINTS, first, count);
        m_bInitTF = false;
        m_bTFRead = false;
        return true;
    }

    float GLRenderer::ReadTF()
    {
        if(m_bInitTF) return -1000.0f;
        GLint available = 0;
        glGetQueryObjectiv(m_tfQuery, GL_QUERY_RESULT_AVAILABLE, &available);
        if(available == 0) return -1000.0f;
        //float* ptr = (float*)glMapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, GL_READ_ONLY); // The same as following
        float* ptr = (float*)glMapBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 4 * sizeof(float), GL_MAP_READ_BIT); // <= Extremely costly
        // ...
    }

As you can see I’m reading only when the result is available. Well, maybe the mechanism of reporting availability is not correct, but the result is the same on various NV drivers and cards. The following formula is always true:

GPU_frame_time - (n-1) * CPU_frame_time > MapBuffer_time > GPU_frame_time - n * CPU_frame_time

where n is the number of frames ReadTF() waits for the TF buffer to become available. It’s always 3. That means values can be read every third frame, but every third frame has an extremely long CPU time. On slow GPUs that frame can take 30 times longer than a normal one (since CPU time is less than 1ms).
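To make the inequality concrete, here is a minimal sketch that evaluates both bounds. The struct and function names are made up for illustration and are not part of the renderer shown above; the numbers in the note below are hypothetical, not measurements.

```cpp
#include <cassert>

// Bounds on the blocking map time implied by the inequality above.
// All times must use the same unit (e.g. milliseconds).
struct MapTimeBounds {
    double lower; // GPU_frame_time - n * CPU_frame_time
    double upper; // GPU_frame_time - (n-1) * CPU_frame_time
};

// Hypothetical helper: n is the number of frames ReadTF() waits
// before the TF buffer is reported as available.
MapTimeBounds mapTimeBounds(double gpuFrameTime, double cpuFrameTime, int n)
{
    assert(n >= 1);
    MapTimeBounds b;
    b.lower = gpuFrameTime - n * cpuFrameTime;
    b.upper = gpuFrameTime - (n - 1) * cpuFrameTime;
    return b;
}
```

For a hypothetical 30 ms GPU frame, a 1 ms CPU frame and n = 3, mapTimeBounds(30.0, 1.0, 3) gives a lower bound of 27 ms and an upper bound of 28 ms, so the stalled frame costs close to a full GPU frame.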

Can anyone explain why this happens? And is there any way to improve performance?
Note that this is a read from the buffer, so GL_MAP_UNSYNCHRONIZED_BIT is not applicable.

Why use mapping rather than glGetBufferSubData? Mapping on AMD/NV suffers from app/driver thread synchronization.

Or take a look at buffer storage with permanent mapping. I’m not sure, but I suspect that buffer storage is supported on a lot of older hardware.

Another option you might look at and bench is using a bounce buffer to do the GPU->CPU readback in the background. For instance, using DSA and bindless with readpixels:

    glBindBuffer       ( GL_PIXEL_PACK_BUFFER, buf1 );
    glReadPixels       ( 0, 0, res[0], res[1], GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
    glBindBuffer       ( GL_PIXEL_PACK_BUFFER, 0 );
    glNamedCopyBufferSubDataEXT( buf1, buf2, 0, 0, size );
    // ... lots of other work here ...
    GLuint *p = (GLuint *) glMapNamedBufferRangeEXT( buf2, 0, size, GL_MAP_READ_BIT );

Do the appropriate thing to force buf1 to be a GPU-mem buffer and buf2 to be a CPU-mem buffer. There’s probably a better way to force this nowadays than I’m doing.

Thanks guys for the suggestions!

Let me elaborate on what I have found in the meantime…

  1. glGetBufferSubData() suffers from the same “disease”. It behaves exactly the same as glMapBuffer/glMapBufferRange, although it is faster (as a function call) than glMapBuffer.

  2. glGetBufferSubData() is the fastest function call. glMapBuffer() is about 26-36% slower, and glMapBufferRange() is about 35-43% slower than glGetBufferSubData(). The values depend on the system and drivers and range from 2us to 10us, so they are very difficult to measure precisely (the error margin is too high in that range). Nevertheless, all of these functions behave approximately the same.

  3. The only way to solve the problem is to wait until the result is really available. The availability reported by glGetQueryObjectiv() is not quite correct. After adding a countdown that starts once availability is reported and waits two additional frames, the latency of glGetBufferSubData()/glMapBuffer()/glMapBufferRange() drops by three orders of magnitude (the stall is removed completely). The only drawback is that the result is then 5 frames old, which means inaccurate collision (that’s what the code is used for).
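The countdown from item 3 could look roughly like the sketch below. It is only an illustration under assumptions: the class name and interface are invented here, and the value passed to tick() is assumed to come from glGetQueryObjectiv(..., GL_QUERY_RESULT_AVAILABLE, ...) once per frame.

```cpp
#include <cassert>

// Hypothetical helper: allows the readback only after the query has been
// reported available and `extraFrames` further frames have passed.
class ReadbackGate {
public:
    explicit ReadbackGate(int extraFrames) : m_extraFrames(extraFrames)
    {
        assert(extraFrames >= 0);
    }

    // Call whenever a new transform feedback pass has been issued.
    void reset() { m_armed = false; m_countdown = 0; }

    // Call once per frame. Returns true once the readback may be attempted;
    // it keeps returning true until reset() is called.
    bool tick(bool queryAvailable)
    {
        if (!m_armed) {
            if (!queryAvailable) return false;
            m_armed = true;               // availability seen for the first time
            m_countdown = m_extraFrames;  // start waiting the extra frames
        }
        if (m_countdown > 0) {
            --m_countdown;
            return false;
        }
        return true;
    }

private:
    int  m_extraFrames;
    bool m_armed = false;
    int  m_countdown = 0;
};
```

With ReadbackGate gate(2), the read is allowed two frames after the frame in which availability was first reported, which matches the "3 + 2 = 5 frames old" result described above.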

On the other hand, buffer storage is not supported in older drivers, so it is not the way to go, at least for a while, although I’m not sure whether it would help anyway. Bounce-buffer copying would probably behave the same as adding the extra wait in (3), plus it adds an additional buffer copy. Currently I have very little CPU workload on the drawing thread, so the latency can hardly be hidden within a single frame.

  1. glGetBufferSubData() suffers from the same “disease”. It behaves exactly the same as glMapBuffer/glMapBufferRange, although it is faster (as a function call) than glMapBuffer.

I’ve found exactly the same thing, whether mapping a buffer for read or for write. I’ve been replacing every glMapBuffer() call I can find with glBufferSubData() and getting a nice little performance boost. Not exactly confidence-inspiring :)

However, I’m a few days out from testing coherent-persistent buffers. I’ll let you know how that works out. Even if you’re targeting GL3 hardware, you could still branch on GL_ARB_buffer_storage (as long as you don’t mind maintaining two codepaths, that is).
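Branching on the extension might be sketched as follows. This is an assumption-laden illustration: the enum and function names are invented, and the extension list is assumed to have been collected beforehand (for example by looping over glGetStringi(GL_EXTENSIONS, i) on a GL3+ context).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names for the two codepaths.
enum class ReadbackPath { PersistentMapping, Fallback };

// `extensions` is assumed to hold the extension strings reported by the
// driver; collecting them needs a live GL context and is not shown here.
ReadbackPath chooseReadbackPath(const std::vector<std::string>& extensions)
{
    for (const std::string& e : extensions) {
        if (e == "GL_ARB_buffer_storage")
            return ReadbackPath::PersistentMapping;
    }
    return ReadbackPath::Fallback; // e.g. glGetBufferSubData or glMapBufferRange
}
```

The decision is made once at startup, so the per-frame code only has to switch on the stored ReadbackPath value.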

GL_ARB_buffer_storage gave me a significant improvement, but I am writing to the buffer, not reading.

The 3 frame latency should be pretty standard, but if I’m reading this right the specific problem is that it’s not 3 frames, it’s actually 5? And that the driver is telling you the readback is ready after 3 frames but it’s really not?

The first thing I’d do to tackle this is try putting some strategically-placed glFlush calls around the code and see if they can give the driver a hint that you’d really like it to start processing buffered-up work now, please. At the end of each frame might be a good place to start, and maybe after the code that builds the TF buffer in the first place might be another.

If that doesn’t work, then another approach may be to put a glFinish or other sync object at the end of each frame. You’ll run slower overall, but at least your framerates will be consistent, which seems better overall than getting fast/fast/fast/fast/sloooooooooooooooooowwwww/fast/fast/etc.

Also consider if you actually need the TF data each frame. I’m assuming that you’re using this for per-polygon collision, but I’m not certain if you’re running the collision on the GPU and reading back a result, or if you’re reading back transformed meshes in order to run the collision on the CPU. In either case you may be able to cache and reuse a result. In the former, if two meshes don’t move between frames then the previous result is good to reuse. In the latter, if any given mesh doesn’t move then the previous result is good to reuse. You may already be doing that, of course.

Another thing you may also be doing is coarser bounding-box tests before the finer-grained per-polygon tests. If not you should do so: getting a fast reject would mean that you don’t even need to run the TF stage.

If there is a 3 frame delay between command submission and execution, it makes sense that you get the delay twice: when you submit the buffer readback command in frame 3, there are two more frames of commands in the pipeline, and the driver has to process the commands in the order they were submitted.

First I have to apologize for this very long delay, but I was on a trip last week and unable to try anything.

Yes, that is what is going on. If there is no additional waiting, two consecutive frames take about 0.88ms each, while the third one takes 2.2ms (CPU time). The average GPU time is about 2.74ms (it depends on the scene, but is pretty steady).

Great hint! Thanks a lot!
Calling glFlush right after filling the TF buffer actually removes the additional waiting.
It is quite interesting that putting glFlush at the end of the frame (SwapBuffers actually calls it under the hood) changes nothing. That is quite strange and proves I actually don’t know how glFlush works.
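A rough sketch of where the flush goes, assuming the TF pass is a plain begin/draw/end sequence (the posted code elides that part, so this ordering is a guess). The Gl type and its member names are hypothetical stand-ins so the call order can be checked without a GL context; in a renderer they would forward to the GL functions named in the strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in that records which GL call would be made, in order.
struct RecordingGl {
    std::vector<std::string> calls;
    void beginTransformFeedback() { calls.push_back("glBeginTransformFeedback"); }
    void drawArrays(int, int)     { calls.push_back("glDrawArrays"); }
    void endTransformFeedback()   { calls.push_back("glEndTransformFeedback"); }
    void flush()                  { calls.push_back("glFlush"); }
};

// Issues the TF pass and flushes immediately afterwards, instead of
// relying on the implicit flush at the end of the frame.
template <class Gl>
void runTransformFeedbackPass(Gl& gl, int first, int count)
{
    gl.beginTransformFeedback();
    gl.drawArrays(first, count);
    gl.endTransformFeedback();
    gl.flush(); // submit the queued commands now
}
```

The only point of the sketch is the position of the flush: directly after the commands that fill the TF buffer, before any other per-frame work is queued.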

I’m doing the collision test on the CPU, but I need data from the GPU since it is created on the GPU. Maybe some efficient way of reading values from a texture would be better, but I really doubt texture reading could outperform the current approach.
Also, I’ll certainly optimize TF and prevent unnecessary readings, but at the moment I’m forcing it in each frame to find the most efficient way to solve slow readings.

Completely agree! Optimization will follow as soon as I find the best solution for reading.

Probably! As the previous experiment showed, if glFlush is called right after filling the TF buffer, everything works as expected, but if it is delayed it has no effect.
But the problem is: if there are new commands waiting for execution, why are the previous ones not already flushed?

I have tried buffer storage and got no improvement. Maybe there can be some, but I have to find the right combination of flags. (GL_MAP_PERSISTENT_BIT | GL_MAP_READ_BIT) gains no speedup.