Fast Readbacks on Intel and NVIDIA

Hello @Dark_Photon,

I have read your post and integrated the doReadbackFAST() function into a Qt Quick QML application. I ran the implementation on the NVIDIA Jetson Xavier NX platform to read back 16-bit RGB colors, as follows:

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::initFastBuffers()
{
    if (!buffCreated)
    {
        pbo_size = mWinHeight * mWinWidth * 2; // 2 bytes per pixel for RGB565 / GL_UNSIGNED_SHORT_5_6_5
        pixels = new unsigned char[pbo_size];
        Readback_buf = (GLchar *) malloc( pbo_size );

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not fouynded!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not fouynded!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not fouynded!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        qDebug() << "Run the optimizatiosn";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;
        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
    }
}

void WaylandEgl::doReadbackFAST()
{
    // Work-around for NVidia driver readback crippling on GeForce.

    initFastBuffers();

    //glFinish();
    Timer t1;
    t1.start();
    // Do a readback of the color buffer to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );

    glReadPixels( 0, 0, mWinWidth, mWinHeight,
                  GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    qDebug() << "Read Time " << readTime;
    qDebug() << "Process Time " << processTime;
}
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Here are the results for CPU consumption, read time, and process time on both the Intel Linux PC and the NVIDIA Jetson Xavier NX:

  • Intel: 11-12% CPU usage, Read Time: 0.039 ms, Process Time: 2.014 ms
  • Nvidia: 28-32% CPU usage, Read Time: 3.118 ms, Process Time: 1.659 ms

I know that reading back RGBA is faster and less CPU-intensive, but my requirement is 16-bit RGB colors. Do you have any idea how I can reduce the CPU consumption on the NVIDIA Jetson Xavier NX platform?

Regards

Hello,

The default blocking glReadPixels, without any PBO and with the following parameters, consumes less CPU and has a similar read time:

glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

  • Nvidia: 24-26% CPU usage, Read Time: 3.029 ms
  • Intel: 9-10% CPU usage, Read Time: 1.906 ms

So, is there any way to reduce the CPU usage with a PBO, or with some other approach, on NVIDIA?

Regards

A few thoughts for you.

I was testing this on a high-end NVIDIA desktop (discrete) GPU, which has dedicated high-speed graphics memory and a fast bus between GPU memory and system memory, but nevertheless separate memory pools. You are testing on two low-end SoC CPU+GPU systems, which use slower system memory for both the GPU memory store and CPU memory. The paths being used here for the readback are totally different, and the drivers are as well. So desktop/discrete perf on NVIDIA desktop drivers in this case shouldn’t really correlate to mobile/embedded GPU perf with embedded drivers.

What resolution are you reporting Read/Process Times and CPU usages for?

Also, you have the glFinish() commented out between the init and the test. Why? You may not be timing what you think you’re timing.

On that note, I would do a full render and readback at least once before you time anything, to verify that all of the resources have actually been fully allocated and backed with memory in the graphics driver. OpenGL defers most of the commands you give it, executing them later.
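Roughly, the measurement pattern I have in mind is something like this (sketch only, reusing the Timer class from your snippets; renderFrame() is just a placeholder for whatever actually draws your QML content):

    // Warm-up: let the driver finish allocating and backing all buffers/FBOs.
    const int kWarmupFrames = 5;
    for (int i = 0; i < kWarmupFrames; ++i)
    {
        renderFrame();        // placeholder for your actual rendering
        doReadbackFAST();     // exercise the readback path once, end to end
    }

    glFinish();               // drain pending GL work so it isn't charged to the timed call
    Timer t;
    t.start();
    doReadbackFAST();
    glFinish();               // make sure the readback really completed before stopping the clock
    t.stop();
    qDebug() << "Readback (ms):" << t.getElapsedTimeInMilliSec();

That way the numbers reflect the steady-state cost of the readback itself, not one-time allocations or previously queued work.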

Re CPU usage…

  1. Your two test cases are running on SoC GPUs, where the CPU and GPU are married into the same silicon. So it could be that your “CPU Usage” is pulling in GPU usage as well. If not, what exactly is being reported for CPU usage?

  2. It also occurs to me that the render target memory you’re reading back from is probably tiled, and the readback of course provides linear texels. So one question that comes to mind is, in your readbacks, where does the driver perform the de-tiling work? On the CPU, the GPU, or some separate component? That may be what’s partly driving your CPU usage.

  3. To ensure there are no expensive texel format conversions going on, I would check with glGetInternalformat*() to verify that 1) the render target(s) internal formats you’re using have full support for framebuffer rendering and 2) you are using the driver-preferred format/type to readback the texel data from those formats. See Image_Format#Image_format_queries and glGetInternalformat*(); in particular, GL_FRAMEBUFFER_RENDERABLE, GL_COLOR_RENDERABLE, GL_READ_PIXELS, GL_READ_PIXELS_FORMAT, GL_READ_PIXELS_TYPE, and possibly a few others. You probably are, but just to double-check. If not, then there could be some expensive CPU-based texel format conversions going on in the driver.

  4. Since your GPU and CPU are sharing the same RAM (LPDDR4X, for the NVIDIA Tegra Xavier NX), I would check into whether there is a faster path to get GPU-rendered content into CPU-accessible memory. Either by a faster transfer path. Or a path which doesn’t even involve a copy. Look at what EGL might provide you here. Possibly EGL images.

  5. And one more note: Look for options to rework your algorithm so that you don’t actually need to get the texel data out of GPU accessible memory in the first place. Then your readback cost just goes away.

See this:

Hello,

Thank you so much for your kind answers. The example code that you provided also reduces the CPU usage on Intel. I also performed another test run for RGBA: the blocking glReadPixels consumes 31-32% CPU and your doReadbackFAST algorithm consumes ~15% CPU, so if the CPU usage is reduced for RGBA, does that suggest the paths being used are correct?

The resolution is 1280 x 720.

  • I disabled the glFinish() call because I did not see big differences in the timings, but I will add it back per your feedback.
  • The timing values were taken after at least 5-6 full render and readback cycles.
  1. For CPU usage I just checked the output of the top command for my executable over a short (~2 min) period. On another application which runs directly on the GPU, I observed no CPU consumption in top.

  2. I really do not know where the driver performs the de-tiling work. How can I check this?

  3. Yes, I tried to check GL_READ_PIXELS_FORMAT and GL_READ_PIXELS_TYPE as follows:

void WaylandEgl::checkTypes()
{
    GLint pref_format[1],pref_type[1],devForm;
    GlGetInternalformativ = (PFNGLGETINTERNALFORMATIVPROC)eglGetProcAddress("glGetInternalformativ");
    if (!GlGetInternalformativ)
    {
        qDebug() << "GlGetInternalformativ not fouynded!";
        return;
    }
    GlGetInternalformativ(GL_RENDERBUFFER,GL_RGB565,GL_INTERNALFORMAT_PREFERRED,1,&devForm);
    GlGetInternalformativ(GL_RENDERBUFFER,GL_RGB565,GL_READ_PIXELS_FORMAT,1,pref_format);
    GlGetInternalformativ(GL_RENDERBUFFER,GL_RGB565,GL_READ_PIXELS_TYPE,1,pref_type);

    qDebug() << "Device format :" << devForm;
    qDebug() << "Format: " << pref_format[0];
    qDebug() << "Pref type" << pref_type[0];
}

So the results are:

Device format: 36194 = GL_RGB565
Format: 6407 = GL_RGB
Pref type: 33635 = GL_UNSIGNED_SHORT_5_6_5

So it seems to me the values are supported. Could you please have a look at it?

  4. Do you have any document or link about a faster path to get GPU-rendered content? My goal is to send 16-bit RGB screen color pixels from one device to another, simpler device which may not have DMA or OpenGL; for this reason I could not use an EGLImageKHR object created from the renderbuffer, and it seems the only way is asynchronous glReadPixels.

  5. I will investigate that.

Hello,

Thank you for the link. I will check it and share my findings with you when they are ready.

Regards

Hello,

The links you provided contain CUDA-specific operations, so I need a more generic solution that does not depend on CUDA libraries.

Please let me know if you have any more ideas.

Regards

Better at least (in terms of CPU usage). But possibly not best.

In OpenGL ES and OpenGL, it’s all internal.

I’d run a CPU profiler (e.g. Callgrind on Linux or VerySleepy on Windows) to collect stats on where your time is being spent during this interval. If it’s largely in the GLES driver, then that gives you something to go on.

Looks good. You might also check the others I suggested as well.
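For example, the remaining checks might look something like this (sketch only; it assumes glGetInternalformativ is resolved the same way as in your checkTypes(), and these pnames come from ARB_internalformat_query2 / desktop GL 4.3, so check that your driver exposes them):

    // Query renderability and readback support for GL_RGB565 renderbuffers.
    GLint fbRenderable = 0, colorRenderable = 0, readSupport = 0;
    glGetInternalformativ(GL_RENDERBUFFER, GL_RGB565, GL_FRAMEBUFFER_RENDERABLE, 1, &fbRenderable);
    glGetInternalformativ(GL_RENDERBUFFER, GL_RGB565, GL_COLOR_RENDERABLE,       1, &colorRenderable);
    glGetInternalformativ(GL_RENDERBUFFER, GL_RGB565, GL_READ_PIXELS,            1, &readSupport);

    // GL_FULL_SUPPORT means no caveats; GL_CAVEAT_SUPPORT hints at a slower driver path.
    qDebug() << "FRAMEBUFFER_RENDERABLE full support:" << (fbRenderable == GL_FULL_SUPPORT);
    qDebug() << "COLOR_RENDERABLE:" << (colorRenderable == GL_TRUE);
    qDebug() << "READ_PIXELS full support:" << (readSupport == GL_FULL_SUPPORT);

If any of these report caveat or no support, that points to a conversion happening somewhere in the driver.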

Not really. I haven’t done any dev on NVIDIA Tegra/Jetson SoCs. I’ve attached a few links below you might check out. They could give you more clues on what you can do to maximize CPU/GPU perf on NVIDIA Jetson Xavier NX systems.

I see. I don’t see an NVENC HW unit on that Jetson Xavier NX. Otherwise I’d suggest that as an option.

A few Tegra/Jetson pages to scan (some with links to docs):

Relevant to your situation…

NVIDIA Jetson Linux Developer Guide:

Also…

Also, if you aren’t already using it, it appears tegrastats will give you some good information on HW unit utilizations:


Hello,

Thanks for sharing. I will check all the links you provided.

Regards

Hello,

A quick update:
I tried to add GL_PACK_ROW_LENGTH support in the initialization functions, as follows:

    nBytesPerLine = 1280; // QML window width; GL_PACK_ROW_LENGTH is specified in pixels, not bytes
    int rowL;
    glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
    qDebug() << "Rowl before" << rowL;

    glPixelStorei( GL_PACK_ALIGNMENT, 1 );
    glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
    qDebug() << "Pixel st" << glGetError();
    glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
    qDebug() << "Rowl after" << rowL;

and it did not reduce the CPU usage for either GL_RGBA or GL_RGB.

I will continue checking the rest.

Regards

Hello,

I realized that I forgot to mention the baseline CPU usage of the QML application. The default Qt application alone consumes nearly 10-11% CPU without running any algorithm to get color pixels. So when I run the algorithm for 16-bit RGB color with type GL_UNSIGNED_SHORT_5_6_5, it causes an additional 18-22% CPU load.

I checked all the links but could not get any benefit from them for reducing the CPU load. Currently I am trying CUDA to see if it helps.

Regards

Sounds good. One other thought for you:

You are rendering to an FBO with an internal format of GL_RGB565 for the color buffer, aren’t you? Re-reading this comment above makes me think you may not be.

If you’re not, then naturally the driver is going to have to not only untile (if it’s tiled) but also convert the texel format for each texel in the color buffer into GL_RGB565 / GL_UNSIGNED_SHORT_5_6_5 format. And this is likely to hit your perf somewhere, probably on the CPU in the GL-ES driver. This may explain your CPU usage.

I don’t know which GL-ES version you’re targeting. But according to the OpenGL ES 3.2 Specification, GL_RGB565 is a color-renderable internal format (see Table 8.10 on pp. 163-164). So you can render to it (if targeting GL-ES 3.2). That would avoid this conversion on readback.

If you’re not rendering to GL_RGB565 color, I would definitely try making this FBO color format change, timing the readback, and looking at CPU usage when there’s no pixel format conversion going on. That will give you good information going forward.

If you’re not rendering to GL_RGB565 color and if for some reason you can’t switch rendering to that format, then you might explore alternate transfer paths in the driver for converting those pixels from whatever format you’re using for your color buffer (e.g. GL_RGBA8) to GL_RGB565 and then doing the readback. For instance, you could:

  1. Create 2 FBOs: FBO1 with GL_RGBA8 color buffer. FBO2 with GL_RGB565 color buffer.
  2. Render to FBO1
  3. glBlitFramebuffer() color from FBO1 to FBO2
  4. glReadPixels() back the color buffer from FBO2

If texel format conversions via Blit are faster than via ReadPixels, then this could be a win. However, this is more mem B/W on your slow system RAM, so even so it may be slower.
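A minimal sketch of that 2-FBO path (variable names are placeholders, error checking omitted; it assumes your context supports both GL_RGBA8 and GL_RGB565 renderbuffers):

    // One-time setup: FBO1 = RGBA8 render target, FBO2 = RGB565 readback staging target.
    GLuint rb[2], fbo[2];
    glGenRenderbuffers(2, rb);
    glGenFramebuffers(2, fbo);

    glBindRenderbuffer(GL_RENDERBUFFER, rb[0]);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, mWinWidth, mWinHeight);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[0]);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rb[0]);

    glBindRenderbuffer(GL_RENDERBUFFER, rb[1]);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[1]);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rb[1]);

    // Per frame: render into FBO1.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[0]);
    // ... render the scene into FBO1 ...

    // Blit (and format-convert) FBO1 -> FBO2.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo[0]);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo[1]);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);

    // Read back the 565 pixels from FBO2.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo[1]);
    glPixelStorei(GL_PACK_ALIGNMENT, 1);
    glReadPixels(0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf);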

Hello,

Currently I am using only a PBO, not a renderbuffer-backed FBO as you described.
My NVIDIA platform's GL version is 4.6.0 NVIDIA 32.4.4, and as you said,
GL_RGB565 should be a valid internal format. Previously I created a renderbuffer with a
GL_RGB565 color attachment and could not get color pixels, only a black screen,
when reading with glReadPixels. I double-checked the implementation, which is
as follows, and the result is again a black screen.

void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth *2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        if (glGetError()==GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK" << glGetError();
        }
        else
        {
            qDebug() << "Render buff storage error is " << glGetError();
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }
}

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;

    createRenderBuffer16();
    
    glFinish();
    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
    //glClearColor(1.0,0.0,0.0,1.0); // debug purpose red color.
    //glClear(GL_COLOR_BUFFER_BIT);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Read Time " << readTime;
}

For test purposes, when I activate glClearColor I can see that glReadPixels gets the correct red color.

I also tested the blit suggestion with the Qt default framebuffer previously, as follows (createRenderBuffer16() is unchanged from above):

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;

    createRenderBuffer16();

    glFinish();
    t1.start();
    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);
    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

Total CPU usage is between 21-24%. Blit Time is 0.161 ms and Read Time is 1.885 ms.
If I comment out the glReadPixels call, the glBlitFramebuffer CPU load is 11-12%, which is quite good since
the default Qt application CPU load is around 10-11%, but when I activate the glReadPixels part the CPU load
reaches 21-24%.

I added pixel store calls for optimization, as follows:

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH, 1280); // row length in pixels

and the total CPU load is the same, 21-24%. As you said, the blit is faster and consumes less CPU, but when I add glReadPixels
again it causes a high CPU load. Is there anything that I am missing, or is there any other way to read back without glReadPixels?

Regards

One more update:

I modified the algorithm as follows with glReadBuffer, and the CPU load is between 19-24%.

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    /*GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "NULL bokk";
    }*/
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

and I also ran the test with a PBO as follows and got a similar CPU load:

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );

    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "NULL bokk";
    }
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

It seems to me this is closer to the blit algorithm that you meant above. If not, could you please let me know how to manage it?

Regards

Hello,

Additionally, I modified your doReadbackFAST algorithm with glBlitFramebuffer, and the total CPU load is
28-32%, similar to when there is no blit:

void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth *2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        if (glGetError()==GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK" << glGetError();
        }
        else
        {
            qDebug() << "Render buff storage error is " << glGetError();
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;

        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }
}


void WaylandEgl::initFastBuffers16()
{
    if (!buffCreated)
    {
        checkTypes();
        createRenderBuffer16();

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not fouynded!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not fouynded!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not fouynded!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        qDebug() << "Run the optimizatiosn16";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;
    }
}

void WaylandEgl::doReadbackFAST16() // perfect on intel..
{
    // Work-around for NVidia driver readback crippling on GeForce.

    initFastBuffers16();

    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works

    // Do a readback of the color buffer to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glReadPixels( 0, 0, mWinWidth, mWinHeight,
                  GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}

Am I missing something?
Regards

I’m not sure what else to tell you. You’re just going to have to profile your app and determine where exactly the extra CPU usage is coming from (from what component). With that information, you then need to figure out what option(s) you have to reduce it, if any.

Just a few notes and suggestions for you to try:

  • Having a glBlitFramebuffer() in all this doesn’t serve any useful purpose unless you are rendering to a format besides GL_RGB565. For the case that you are rendering to GL_RGB565, just remove this needless overhead.

  • Did you ever try export __GL_YIELD=USLEEP, just in case the extra overhead was the GL driver twiddling its thumbs in a busy-wait? (Other possible options to try and compare against: export __GL_YIELD= and export __GL_YIELD=NOTHING.)

  • Try doing 2 glReadPixels() operations back-to-back (for different pixels in the FBO). Time each separately. If the 2nd takes significantly less time, it could be that the 1st glReadPixels() is including pipeline “flush” behavior, which could help explain some of your extra CPU usage.

  • Consider comparing the glReadPixels() timings and CPU usage against glGetTexImage().

  • A tip suggested by @GClements recently (LINK)… See if your NVIDIA Jetson Xavier NX platform supports one of the EGL lock surface extensions (EGL_KHR_lock_surface, EGL_KHR_lock_surface2, EGL_KHR_lock_surface3). From gpuinfo.org, apparently some NVIDIA Tegra platforms do. If so, it appears that you may be able to request that the graphics driver exactly match the RGB565 format you’re targeting with a specified byte order in an EGL surface it creates, as well as provide you a way to access the RGB565 framebuffer data directly without doing a glReadPixels(). If supported, that may be what you need to reduce CPU usage accessing the rendered pixel data.

Hello,

I profiled my application’s CPU usage, and it breaks down as follows per algorithm:

performRenderBuffer16: 6-7% CPU usage from the glMapBufferRange/glUnmapBuffer calls
and 2% from memcpy().

doReadbackFAST16: 5-6% CPU usage from the glGetBufferSubData call.

  1. It seems to me glBlitFramebuffer does not add much CPU load.

  2. I tried export __GL_YIELD=USLEEP, export __GL_YIELD= and export __GL_YIELD=NOTHING
for the performRenderBuffer16() algorithm above; export __GL_YIELD=USLEEP is the best fit,
reducing the total CPU usage to 12-16%.

Note that for the performRenderBuffer16() algorithm, I added the following modifications:

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    //glReadBuffer(frameBuffer16); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

For the doReadbackFAST16 algorithm I noticed that I had forgotten to add

	glBindFramebuffer(GL_FRAMEBUFFER,frameBuffer16);

before glReadPixels, and now the total CPU usage with this algorithm is 13-17%.
When I tried export __GL_YIELD=USLEEP, export __GL_YIELD= and export __GL_YIELD=NOTHING
for the doReadbackFAST16 algorithm, they all give similar results: total CPU usage of 13-16%.

So it seems to me I need to use export __GL_YIELD=USLEEP.

  3. I have implemented a function, as follows, for reading 24-bit and 16-bit colors back to back with blocking glReadPixels, as you suggested:
void WaylandEgl::createDarkRenderBuffers()
{
    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size16 = mWinHeight * mWinWidth * 2;
        pbo_size24 = mWinHeight * mWinWidth * 3;
        nBytesPerLine = mWinWidth ;
        Readback_buf24 = (GLchar *) malloc( pbo_size24 );
        Readback_buf16 = (GLchar *) malloc( pbo_size16 );

        glGenRenderbuffers( 1, &darkRenderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, darkRenderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        if (glGetError()==GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK" << glGetError();
        }
        else
        {
            qDebug() << "Render buff storage error is " << glGetError();
        }

        glGenFramebuffers( 1, &darkFrameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, darkRenderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }

        glGenRenderbuffers( 1, &darkRenderBuffer24 );
        glBindRenderbuffer( GL_RENDERBUFFER, darkRenderBuffer24 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB8, mWinWidth, mWinHeight ); // 24-bit color buffer
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        if (glGetError()==GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK" << glGetError();
        }
        else
        {
            qDebug() << "Render buff storage error is " << glGetError();
        }

        glGenFramebuffers( 1, &darkFrameBuffer24 );
        glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer24);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, darkRenderBuffer24);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;


        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_UNPACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

    }
}

void WaylandEgl::performDarkRenderBuffers()
{
    Timer t1;
    createDarkRenderBuffers();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,darkFrameBuffer24);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer24);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_BYTE, Readback_buf24);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();
    qDebug() << "Blit Time1 is: " << blitTime;
    qDebug() << "Read Time1 is: " << readTime;

    t1.start();
    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,darkFrameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf16);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    qDebug() << "Blit Time2 is: " << blitTime;
    qDebug() << "Read Time2 is: " << readTime;
}

As you said, the 2nd read consistently takes significantly less time, for example:

Read Time1 is: 4.531 ms
Read Time2 is: 1.144 ms

So, should I do anything about this pipeline “flush” behavior?

  4. Previously I did not use glGetTexImage, and I think I need to find out how I can reach Qt’s default GL_TEXTURE_2D. When I try to get the already-bound texture object from Qt as follows:
void WaylandEgl::performTextureStaff()
{
    if (!buffCreated)
    {
        int sizeX, sizeY;
        GLenum format;

        glGetIntegerv ( GL_TEXTURE_BINDING_2D, (int *) &boundTex );

        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_WIDTH, &sizeX);
        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_HEIGHT, &sizeY);
        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_INTERNAL_FORMAT, (GLint*)&format);

        qDebug() << "Size x" << sizeX << "size y" << sizeY << "format" << format;

        qDebug() << "Text id " << boundTex;
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 4;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        buffCreated = true;
    }

    Timer t1;
    glFinish();
    t1.start();

    glBindTexture ( GL_TEXTURE_2D, boundTex );
    qDebug() << "Bind error is " << glGetError();
    glGetTexImage ( GL_TEXTURE_2D, 0, GL_RGBA , GL_UNSIGNED_BYTE, Readback_buf );
    qDebug() << "glGetTexImage error is " << glGetError();
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();
    //qDebug() << "Read Text time" << readTime;
}

I do not get correct results for the width and height, and I see only a black screen. So I need to check this.

  5. I need to check the EGL lock surface extensions on the Xavier and will update here.

Regards

An update for glGetTexImage: I implemented the following running example and tested glReadPixels and glGetTexImage one at a time by uncommenting each in turn:

void WaylandEgl::performTextureStaff()
{
    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glGenFramebuffers(1, &textFrameBuffer);
        glBindFramebuffer(GL_FRAMEBUFFER, textFrameBuffer);

        glGenTextures(1, &boundTex);
        glBindTexture(GL_TEXTURE_2D, boundTex);
        glTexImage2D(GL_TEXTURE_2D, 0,GL_RGB, mWinWidth, mWinHeight, 0,GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);

        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,GL_TEXTURE_2D, boundTex, 0);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Texture Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }

    glFinish();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,textFrameBuffer);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    Timer t1;
    t1.start();

    glBindTexture(GL_TEXTURE_2D, boundTex);
    //glGetTexImage(GL_TEXTURE_2D,0,GL_RGB,GL_UNSIGNED_SHORT_5_6_5,Readback_buf);
    //glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    qDebug() << "Process time is " << processTime;
}

Results:

  • glGetTexImage: 28-35% CPU load without export __GL_YIELD=USLEEP, 21-25% with it; process time is 2.151 ms.
  • glReadPixels: 25-30% CPU load without export __GL_YIELD=USLEEP, 22-26% with it; process time is 2.758 ms.

I will put a bit more focus on glGetTexImage to see if it helps the other algorithms as well.

For the EGL_KHR_lock_surface extension, I checked as follows, and it does not appear in the supported extensions:

    qDebug("EGL Version: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_VERSION));
    qDebug("EGL Vendor: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_VENDOR));
    qDebug("EGL Extensions: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_EXTENSIONS));

EGL Version: “1.5”

EGL Vendor: “NVIDIA”

EGL Extensions: “EGL_ANDROID_native_fence_sync EGL_EXT_buffer_age EGL_EXT_client_sync EGL_EXT_create_context_robustness EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_EXT_output_base EGL_EXT_output_drm EGL_EXT_protected_content EGL_EXT_stream_consumer_egloutput EGL_EXT_stream_acquire_mode EGL_EXT_sync_reuse EGL_IMG_context_priority EGL_KHR_config_attribs EGL_KHR_create_context_no_error EGL_KHR_context_flush_control EGL_KHR_create_context EGL_KHR_display_reference EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_partial_update EGL_KHR_swap_buffers_with_damage EGL_KHR_no_config_context EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_reusable_sync EGL_KHR_stream EGL_KHR_stream_attrib EGL_KHR_stream_consumer_gltexture EGL_KHR_stream_cross_process_fd EGL_KHR_stream_fifo EGL_KHR_stream_producer_eglsurface EGL_KHR_surfaceless_context EGL_KHR_wait_sync EGL_MESA_image_dma_buf_export EGL_NV_context_priority_realtime EGL_NV_cuda_event EGL_NV_nvrm_fence_sync EGL_NV_stream_cross_display EGL_NV_stream_cross_object EGL_NV_stream_cross_process EGL_NV_stream_flush EGL_NV_stream_metadata EGL_NV_stream_remote EGL_NV_stream_reset EGL_NV_stream_socket EGL_NV_stream_socket_unix EGL_NV_stream_sync EGL_NV_stream_fifo_next EGL_NV_stream_consumer_gltexture_yuv EGL_NV_stream_attrib EGL_NV_system_time EGL_NV_output_drm_flip_event EGL_WL_bind_wayland_display EGL_WL_wayland_eglstream”

Thanks for the hint. I will continue with the other optimizations you mentioned above.

Regards