How to apply PBOs to multiple glReadPixels calls?

I am trying to use PBOs in a situation with multiple glReadPixels calls.
The original code looks like this:

for(int i = 0; i < 4; i++)
{
    glReadPixels(x0[i], y0[i], w0[i], h0[i], GL_RGBA, GL_UNSIGNED_BYTE, data0[i]);
    glReadPixels(x1[i], y1[i], w1[i], h1[i], GL_RGBA, GL_UNSIGNED_BYTE, data1[i]);
}

Now I am using PBOs.
First, I initialize them:

int _read = 0, _dma = 1;
for(int i = 0; i < 4; i++)
{
    int size0 = w0[i] * h0[i] * 4;
    int size1 = w1[i] * h1[i] * 4;

    glGenBuffers(2, pbo0[i]);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo0[i][0]);
    glBufferData(GL_PIXEL_PACK_BUFFER, size0, 0, GL_STREAM_READ );
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo0[i][1]);
    glBufferData(GL_PIXEL_PACK_BUFFER, size0, 0, GL_STREAM_READ );
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    glGenBuffers(2, pbo1[i]);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo1[i][0]);
    glBufferData(GL_PIXEL_PACK_BUFFER, size1, 0, GL_STREAM_READ );
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo1[i][1]);
    glBufferData(GL_PIXEL_PACK_BUFFER, size1, 0, GL_STREAM_READ );
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

Then I use the PBOs inside the rendering while loop:

for(int i = 0; i < 4; i++){
   int size0 = w0[i] * h0[i] * 4;
   glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo0[i][_dma]);
   glReadPixels(x0[i], y0[i], w0[i], h0[i], GL_RGBA, GL_UNSIGNED_BYTE, NULL);
   glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
            
   glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo0[i][_read]);
   GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size0, GL_MAP_READ_BIT);
   memcpy(data0[i], ptr, size0);
   glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
   glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

   int size1 = w1[i] * h1[i] * 4;
   glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo1[i][_dma]);
   glReadPixels(x1[i], y1[i], w1[i], h1[i], GL_RGBA, GL_UNSIGNED_BYTE, NULL);
   glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

   glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo1[i][_read]);
   ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size1, GL_MAP_READ_BIT);
   memcpy(data1[i], ptr, size1);
   glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
   glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}
swap(_read, _dma);

However, the screen went black at first, and when it returned to normal, the fps was no higher.
Could someone give me some advice on how to correct my code? Thanks!
Please forgive my poor English and bad text formatting.

Please describe what problem you’re seeing, what you’ve tried, and what you’re trying to accomplish by the above. Also, what GPU and GPU driver are you running on?

You seem to be trying to do deferred buffer map after readback, but there’s nothing (no frames) queued for rendering between each pass. You’re also using a blocking mapbuffer call. So it’s unclear what you expect to gain with this.

Yes, using a PBO will give you no benefit if you grab the data from it immediately after your ReadPixels call. The intended use is that you ReadPixels to the PBO, wait at least one frame, then map and read the PBO contents. That allows the driver to perform the data transfer asynchronously. Otherwise you may as well be just using ReadPixels to a system memory buffer - your code will certainly be much simpler, and may even be faster.

A more subtle performance bug lies in the parameters to your ReadPixels call. If your format and type don’t match your current read buffer (typically the back buffer) then your driver will need to do a format conversion. This may be a slow operation, may be done in software, and may involve transferring the data to system memory anyway. Even with an otherwise well-behaved PBO usage pattern, this can totally destroy your performance. Resolving this typically means using GL_BGRA instead of GL_RGBA but I’d encourage you to try different combinations until you find what works best for you.

But according to the code, he’s mapping a different buffer. There’s the _dma buffer index and the _read buffer index.

Hmmm, yes, I missed that.

Thank you for replying!

The glReadPixels code actually reads pixel values that are used for white balancing between camera images.
The original code causes an obvious stutter. I saw a performance boost in the PBO examples on SongHo’s website, so I thought PBOs would solve the stutter by reducing the glReadPixels execution time and would also increase the fps. That is how the code above came about.

When it runs, the code above leaves one camera image normal and the others black during the first round of white balance. In the second round, all camera images appear. Using PBOs this way did not increase the fps; it caused a slight drop instead. However, the stutter does seem to be somewhat alleviated.

I also reduced the execution time of the white balance algorithm from about 1k ms to under 15 ms, which did not help with the problem. I also tried GL_BGR_IMG instead of GL_RGBA, which made no difference either. GL_BGRA is not recognized in my code, maybe because some library is missing.

I am running the code on a QNX system. I cannot get the GPU information.

Thanks for the background information.

By your comment " I saw a performance boost in the examples using pbo on SongHo’s website."…

…do you mean that he reported a performance boost, or you tried his code and got a performance boost on this same QNX platform (incl. GPU+GPU driver)?

Also I too missed that you are updating 1 of 2 disjoint sets of 8 buffer objects each for each frame (presumably swap(_read, _dma) is performed once per frame).

Ok. A couple thoughts here.

First, you should consult the OpenGL ES Developers Guide for QNX for perf recommendations on optimizing for this case. If you don’t find this, you need to request (demand) it from the driver providers. What will be most efficient on that platform depends on the details of the OpenGL ES implementation, the GPU graphics driver, and the GPU that you are running on. QNX is most often used on mobile platforms, which suggests an embedded tile-based GPU, which makes it that much more critical that you follow these recommendations exactly.

Tile-based GPUs have deep pipelines and long frame latency, often performing rasterization a frame or two late, to compensate for the slow access speeds of ordinary DRAM. Any blocking or potentially blocking operations such as readbacks or buffer mapping need to be handled with care or they will flush this long pipeline and totally trash your frame rate.

Second, if there’s a decent GPU profiling tool for your QNX platform, use it! A good one will graphically show you 1) your app-side CPU queuing work (via the front-end GLES driver) and 2) the GPU execution work in the back-end graphics driver on a single timeline graph. This’ll give you considerable insight on whether your work is being scheduled efficiently and, if not, what exactly is “clogging up the pipe”.

Third, in your code, _dma = 0 and _read = 1 (or vice versa), correct? You might double-check this. You might also try triple-buffering instead, particularly if this is a mobile GPU. Since you’re reading back the results of rasterization, on a mobile GPU those results will most likely not be available until the 2nd or 3rd frame after the frame in which you submit the draw work, not the very next frame. When exactly depends on the details of the GPU and the GPU driver. If detiling isn’t hardware accelerated, you’re reading back large regions, and/or you’re having the driver perform pixel format conversions, the performance may be poor regardless.
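To make the triple-buffering idea concrete, here is a minimal sketch of the index bookkeeping only. It makes no GL calls; the comments mark where they would go. The class name ReadbackRing and its methods are hypothetical, not part of any API. It also records whether the slot about to be mapped has ever been written, which might be related to the black images reported for the first pass (that link is a guess, not something confirmed in this thread).

```cpp
#include <cstddef>
#include <vector>

// Bookkeeping for a ring of `depth` pixel-pack buffers (depth >= 2 assumed).
// Hypothetical helper: it only tracks indices, the GL calls stay in your code.
class ReadbackRing {
public:
    explicit ReadbackRing(std::size_t depth) : filled_(depth, false), write_(0) {}

    // Slot to bind before glReadPixels(..., NULL) in the current frame.
    std::size_t write_slot() const { return write_; }

    // Oldest slot in the ring, i.e. the one written depth-1 frames ago.
    std::size_t read_slot() const { return (write_ + 1) % filled_.size(); }

    // True only once read_slot() has been the target of a readback.
    // While this is false, mapping the slot would return data nobody wrote.
    bool read_ready() const { return filled_[read_slot()]; }

    // Call once per frame, after the readback has been issued.
    void advance() {
        filled_[write_] = true;
        write_ = (write_ + 1) % filled_.size();
    }

private:
    std::vector<bool> filled_;
    std::size_t write_;
};
```

With a depth of 2 this reproduces the _read/_dma swap from the original code; with a depth of 3 the mapped buffer is always two frames old. In either case the map and memcpy would be skipped while read_ready() is false.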

Also as a test, instead of doing 8 glReadPixels() and 8 glMapBufferRange() calls per frame (16 total), cut that back to 1 glReadPixels() and 1 glMapBufferRange() per frame. Re-check perf. You could just be overloading the driver or memory system. If perf is still bad, comment out the glMapBufferRange() and re-check perf with just 1 glReadPixels() per frame. This’ll give you some idea where the problem is coming in. Possibly you need to switch to a different method to query the pixels to app CPU memory, such as glGetBufferSubData().

You can also reduce the width/height of this 1 readback region to cut back mem and driver CPU overhead costs to assess the impact of that on your total cost. Also, be sure to bench with all readbacks and buffer maps disabled to verify that your bottleneck is definitely associated with these glReadPixels() and/or glMapBufferRange() commands you’re queuing with the driver.
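For the isolation tests described above, a small timing wrapper may be enough to see which call the CPU is stuck in. This is only a sketch; time_ms is a hypothetical helper name. Keep in mind that it measures CPU wall-clock time around a call, so it shows where the application thread blocks, not how long the GPU works.

```cpp
#include <chrono>

// Runs fn() once and returns the elapsed wall-clock time in milliseconds.
// Hypothetical helper: wrap a single GL call or a whole stage in the lambda.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, wrapping only the glReadPixels() line, then only the glMapBufferRange() line, then only the memcpy, and logging the three values per frame should show which of them takes the time.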

I’d also follow up on the comment you got about checking the format you have glReadPixels() return to avoid needless per-pixel format conversions. Given a non-optimal format, these conversions may need to be performed manually on the CPU down in the driver. Again, check the OpenGL ES Developers Guide for your platform for details on this. If your driver supports something like ARB_internalformat_query2, then you might be able to just ask the driver to tell you what format it’d prefer that you request.

For instance:

glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_INTERNALFORMAT_SUPPORTED,  1, &res );
glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_INTERNALFORMAT_PREFERRED,  1, &res );
glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_READ_PIXELS,               1, &res );
glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_READ_PIXELS_FORMAT,        1, &res );
glGetInternalformativ( GL_TEXTURE_2D, GL_RGBA8, GL_READ_PIXELS_TYPE,          1, &res );

Just noting on the format conversions.

Any 3-component format (assuming it’s 24-bits) will definitely require a format conversion, unless you have really weird hardware that actually supports 24-bit 3-component formats. So a 24-bit GL_BGR format can be expected to perform no better than GL_RGBA. This often confuses some people new to OpenGL fairly badly, as it might seem logical to assume that a 24-bit format would use less memory and therefore transfer faster than a 32-bit one, but factors such as format matching and alignment are actually far more important than size in memory.
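One place where alignment shows up directly in the code from this thread is buffer sizing. With the default GL_PACK_ALIGNMENT of 4, each row written by glReadPixels starts on a 4-byte boundary, so a 3-byte-per-pixel readback can need more than w * h * 3 bytes. The sketch below assumes GL_PACK_ROW_LENGTH and the pack skip parameters are left at 0; the function names are hypothetical.

```cpp
#include <cstddef>

// Bytes from the start of one row to the start of the next, as laid out by
// glReadPixels when rows are padded to pack_alignment (1, 2, 4 or 8).
std::size_t packed_row_stride(std::size_t width,
                              std::size_t bytes_per_pixel,
                              std::size_t pack_alignment = 4) {
    const std::size_t raw = width * bytes_per_pixel;
    return (raw + pack_alignment - 1) / pack_alignment * pack_alignment;
}

// Safe upper bound for the size of the destination buffer or PBO.
std::size_t packed_image_size(std::size_t width,
                              std::size_t height,
                              std::size_t bytes_per_pixel,
                              std::size_t pack_alignment = 4) {
    return packed_row_stride(width, bytes_per_pixel, pack_alignment) * height;
}
```

For 4-byte pixels the stride is always w * 4, which is why the w * h * 4 sizing in the original code is fine for GL_RGBA. For a 5-pixel-wide GL_RGB region the stride would be 16 bytes rather than 15.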

You might actually have a 16-bit backbuffer on your hardware, the backbuffer format might be something like 555x, 5551 or 565, and the component ordering might be RGB or BGR. OpenGL has quite a comprehensive set of formats and types so you should be able to find a match; if you’re just missing a #define in your GL headers, then you can download updated headers or copy/paste from updated headers.

If you require the data in a different format to whatever is native on your hardware, you’re typically faster reading it in the native format anyway, then converting to the other format in software in your own code, rather than letting the GL driver do the conversion for you.
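As an illustration of converting in your own code, here is a minimal sketch that expands RGB565 pixels to 8-bit RGBA on the CPU. It assumes the native format really is 565 with red in the top five bits, which has not been confirmed for this hardware; the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Expands one RGB565 pixel (red in bits 15..11, green in 10..5, blue in 4..0)
// to four bytes in R, G, B, A order. The top bits of each channel are
// replicated into the low bits so that full intensity maps to 255.
inline void rgb565_to_rgba8888(std::uint16_t p, std::uint8_t out[4]) {
    const unsigned r5 = (p >> 11) & 0x1Fu;
    const unsigned g6 = (p >> 5) & 0x3Fu;
    const unsigned b5 = p & 0x1Fu;
    out[0] = static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2));
    out[1] = static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4));
    out[2] = static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2));
    out[3] = 255;  // RGB565 carries no alpha, so treat every pixel as opaque
}

// Converts `count` pixels; dst must have room for count * 4 bytes.
inline void convert_rgb565_image(const std::uint16_t* src,
                                 std::uint8_t* dst,
                                 std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        rgb565_to_rgba8888(src[i], dst + i * 4);
    }
}
```

The readback itself would then request the 16-bit format, which also halves the amount of data transferred compared with GL_RGBA and GL_UNSIGNED_BYTE.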

Thanks for pointing out the format issue. And sorry, the GL_BGR_IMG in my previous post should have been GL_BGRA_IMG. It’s just weird that only one camera was working normally while the others went black when using GL_RGB or GL_BGR_EXT. Maybe I got something wrong.
The input format is YUYV. My colleague wrote #define IMG_BPP 2 //RGB565. It is natural to think that RGB or BGR would always be faster than RGBA or BGRA. I think I have to go back to GL_RGBA for now.

Thank you for such a comprehensive reply. I just found out that using one PBO is simpler than swapping two PBOs and just as efficient, and it does not cause the blackout. It seems that the bottleneck has now moved from glReadPixels to the glMapBufferRange and memcpy part. I think I have to dig and try more.

Whoah. That’s a completely different pixel format than the one you’re requesting. It’s likely that this is triggering a slow CPU-based pixel conversion path in the graphics driver. In your app, this’ll just look like a long stall in some GL call.

See if you can request the pixel data in the format that the graphics driver is storing the pixel data in on the GPU.

At this stage I’m regretting the lack of a GL_DONT_CARE option for these params that just transfers the raw data without any intermediate jiggery-pokery, and lets the programmer deal with interpreting the format in their own code. There are probably edge cases that need to be worked out though.

There is one. You can send the data as unsigned integers (perhaps GL_R32UI with GL_RED_INTEGER and GL_UNSIGNED_INT for the pixel transfer parameters) and let the shader do whatever it wants. Of course, there’s no filtering on unsigned integers, so you’ll have to do that yourself where appropriate. And in my parenthetical example, you’ll have to deal with any endian issues.

I am now using a single PBO without swapping, and the original GL_RGBA, for glReadPixels.
Apart from that, I replaced the memcpy statement with the following:
data = Mat(x.height, x.width, CV_8UC4, ptr);
The data is then used for white balance, like this:
Mat_<Vec4b>::iterator it = data.begin<Vec4b>();
Mat_<Vec4b>::iterator itend = data.end<Vec4b>();
(for loop to do some simple pixel assignment)
Done this way, the main pixel copy step takes about 0.013 sec, while the main white balance step takes about 1.192 sec.
Then I tried replacing the "data = " statement with a deep copy:
data = Mat(x.height, x.width, CV_8UC4, ptr).clone();
The main pixel copy step rose to about 0.1 sec, while the main white balance step dropped to about 0.025 sec (which is close to the time without a PBO).
I checked that the contents of “data” were the same in both cases, but I could not get the lower time on both the pixel copy and the white balance sides at once.
I know this may not be an OpenGL problem, but could someone shed some light on why the deep copy and the shallow copy of cv::Mat make such a big difference?
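The difference between the two constructions can be sketched without OpenCV. Mat(rows, cols, type, ptr) wraps the existing pointer without copying or owning it, so every pixel access in the loop goes to the mapped buffer itself, while .clone() makes one bulk copy into ordinary heap memory. One possible explanation, which nothing in this thread confirms, is that the mapped pointer refers to memory that is slow for the CPU to read in small pieces, so the shallow version pays per access and the deep version pays once up front. PixelView and deep_copy below are hypothetical stand-ins, not OpenCV types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Mat built on a user pointer:
// it refers to someone else's memory, with no copy and no ownership.
struct PixelView {
    const std::uint8_t* data;
    std::size_t size;
};

// Hypothetical stand-in for .clone(): one bulk copy into memory that the
// caller owns and that stays valid after the source goes away.
std::vector<std::uint8_t> deep_copy(const PixelView& v) {
    return std::vector<std::uint8_t>(v.data, v.data + v.size);
}
```

A side effect of the shallow form is that the view is only usable while the buffer stays mapped, so the white balance loop would have to finish before glUnmapBuffer is called.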