Fast Readbacks on Intel and NVIDIA

Hello,

I noticed that I forgot to mention the default CPU usage of the QML application. The default Qt application consumes roughly 10-11% CPU on its own, without running any algorithm to fetch the color pixels. When I run the algorithm for 16-bit RGB color with type GL_UNSIGNED_SHORT_5_6_5, it adds an extra 18-22% CPU load.

I checked all the links but none of them helped me reduce the CPU load. Currently I am trying CUDA to see if it helps.

Regards

Sounds good. One other thought for you:

You are rendering to an FBO with an internal format of GL_RGB565 for the color buffer, aren’t you? Re-reading this comment above makes me think you may not be.

If you’re not, then naturally the driver is going to have to not only untile the color buffer (if it’s tiled) but also convert each texel into GL_RGB565 / GL_UNSIGNED_SHORT_5_6_5 format. This is likely to hit your perf somewhere, probably on the CPU in the GL-ES driver, and it may explain your CPU usage.
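To make that conversion cost concrete, here is a minimal CPU-side sketch (the function name is mine, not a GL API) of the per-texel repacking the driver has to perform when an 8-bit-per-channel texel is read back as GL_RGB565:

```cpp
#include <cassert>
#include <cstdint>

// Pack one 8-bit-per-channel RGB texel into a 16-bit 5-6-5 value -- the
// per-texel work a driver must do when the FBO format and the requested
// readback format don't match.
static inline uint16_t packRGB565(uint8_t r, uint8_t g, uint8_t b)
{
    return static_cast<uint16_t>(((r >> 3) << 11) |  // 5 bits of red
                                 ((g >> 2) << 5)  |  // 6 bits of green
                                 (b >> 3));          // 5 bits of blue
}
```

A full-frame readback repeats this (plus any untiling) for every pixel, which is why the mismatch tends to show up as CPU time inside the driver.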

I don’t know which GL-ES version you’re targeting. But according to the OpenGL ES 3.2 Specification, GL_RGB565 is a color-renderable internal format (see Table 8.10 on pp. 163-164). So you can render to it (if targeting GL-ES 3.2), which would avoid this conversion on readback.

If you’re not rendering to GL_RGB565 color, I would definitely try making this FBO color format change, timing the readback, and looking at CPU usage when there’s no pixel format conversion going on. That will give you good information going forward.

If you’re not rendering to GL_RGB565 color and if for some reason you can’t switch rendering to that format, then you might explore alternate transfer paths in the driver for converting those pixels from whatever format you’re using for your color buffer (e.g. GL_RGBA8) to GL_RGB565 and then doing the readback. For instance, you could:

  1. Create 2 FBOs: FBO1 with GL_RGBA8 color buffer. FBO2 with GL_RGB565 color buffer.
  2. Render to FBO1
  3. glBlitFramebuffer() color from FBO1 to FBO2
  4. glReadPixels() back the color buffer from FBO2

If texel format conversions via Blit are faster than via ReadPixels, then this could be a win. However, it also costs extra bandwidth on your slow system RAM, so it may still end up slower.

Hello,

Currently I am using only a PBO, not a renderbuffer attached to an FBO as you suggested.
My NVIDIA platform reports GL version 4.6.0 NVIDIA 32.4.4, and as you said,
GL_RGB565 should be a valid internal format. Previously I created a renderbuffer
with GL_RGB565 as the color attachment but got no color pixels, only a black screen,
when reading with glReadPixels. I double-checked the implementation, which is
as follows, and the result is again a black screen.

void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth; // row length in pixels (GL_PACK_ROW_LENGTH is specified in pixels, not bytes)
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        GLenum err = glGetError(); // capture once; a second glGetError() would return GL_NO_ERROR
        if (err == GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK";
        }
        else
        {
            qDebug() << "Render buff storage error is" << err;
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }
}

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;

    createRenderBuffer16();
    
    glFinish();
    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
    //glClearColor(1.0,0.0,0.0,1.0); // debug purpose red color.
    //glClear(GL_COLOR_BUFFER_BIT);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Read Time " << readTime;
}

For test purposes, when I enable the glClearColor/glClear lines, glReadPixels returns the correct red color.

I also previously tested the blit suggestion with the Qt default framebuffer, as follows:

createRenderBuffer16() is unchanged from the version above.

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;

    createRenderBuffer16();

    glFinish();
    t1.start();
    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);
    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

Total CPU usage is between 21-24%, with a Blit Time of 0.161 ms and a Read Time of 1.885 ms.
If I comment out the glReadPixels call, the CPU load with only glBlitFramebuffer is 11-12%, which is quite good,
since the default Qt application's CPU load is around 10-11%. But when I enable the glReadPixels part, the CPU load
reaches 21-24%.
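For a sanity check on those numbers, the implied transfer rate can be estimated on the CPU side; the 1280x720 resolution below is an assumption for illustration (the thread only pins the row length at 1280 pixels), as is the helper name:

```cpp
#include <cassert>
#include <cstddef>

// Rough readback throughput estimate: bytes transferred / elapsed seconds.
// For an assumed 1280x720 surface at 2 bytes per GL_UNSIGNED_SHORT_5_6_5
// texel and the reported 1.885 ms Read Time, this comes out near 1 GB/s.
static double readbackBytesPerSec(std::size_t width, std::size_t height,
                                  std::size_t bytesPerTexel, double elapsedSec)
{
    return static_cast<double>(width * height * bytesPerTexel) / elapsedSec;
}
```

Under those assumptions the 1.885 ms read moves roughly 0.98 GB/s, which suggests the cost is dominated by the transfer itself rather than pure API overhead.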

I added pixel store calls as an optimization, as follows:

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_PACK_ROW_LENGTH, 1280 ); // row length is in pixels, not bytes

and the total CPU load is the same, 21-24%. As you said, the blit is fast and consumes little CPU, but as soon as I add glReadPixels
again it causes a high CPU load. So, is there anything I am missing, or is there another way to read the pixels back without glReadPixels?
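One note on those glPixelStorei calls: GL_PACK_ROW_LENGTH is specified in pixels, not bytes, so 1280 there means 1280 texels per row. The packed row size in bytes then follows from the texel size and GL_PACK_ALIGNMENT, as this small sketch (hypothetical helper name) shows:

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by one packed row under the GL pixel-storage rules:
// rowLengthPixels * bytesPerTexel, rounded up to a multiple of
// GL_PACK_ALIGNMENT.
static std::size_t packedRowBytes(std::size_t rowLengthPixels,
                                  std::size_t bytesPerTexel,
                                  std::size_t alignment)
{
    const std::size_t row = rowLengthPixels * bytesPerTexel;
    return ((row + alignment - 1) / alignment) * alignment;
}
```

With 2-byte GL_UNSIGNED_SHORT_5_6_5 texels a 1280-pixel row is 2560 bytes at any alignment, so pbo_size = width * height * 2 in the code above remains correct.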

Regards

One more update:

I modified the algorithm to use glReadBuffer as follows, and the CPU load is between 19-24%.

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf );

    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    /*GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "glMapBufferRange returned NULL";
    }*/
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

I also ran the test with a PBO as follows and got a similar CPU load:

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );

    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "glMapBufferRange returned NULL";
    }
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    //qDebug() << "Blit Time " << blitTime;
    //qDebug() << "Read Time " << readTime;
}

It seems to me this is closer to the blit algorithm that you meant above. If not, could you please let me know how to arrange it :slight_smile:

Regards

Hello,

Additionally, I modified your doReadbackFAST algorithm to include the blit, and the total CPU load is
28-32%, similar to when there is no blit:

void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth *2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        GLenum err = glGetError(); // capture once; a second glGetError() would return GL_NO_ERROR
        if (err == GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK";
        }
        else
        {
            qDebug() << "Render buff storage error is" << err;
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;

        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Row length before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Row length after" << rowL;

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not found!";
            return;
        }

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }
}


void WaylandEgl::initFastBuffers16()
{
    if (!buffCreated)
    {
        checkTypes();
        createRenderBuffer16();

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not found!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not found!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not found!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not found!";
            return;
        }

        qDebug() << "Running the 16-bit optimizations";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;
    }
}

void WaylandEgl::doReadbackFAST16() // perfect on intel..
{
    // Work-around for NVidia driver readback crippling on GeForce.

    initFastBuffers16();

    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    glReadBuffer(GL_COLOR_ATTACHMENT0); // frameBuffer16 also works

    // Do a color readback to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glReadPixels( 0, 0, mWinWidth, mWinHeight,
                  GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}

Am I missing something?
Regards

I’m not sure what else to tell you. You’re just going to have to profile your app and determine where exactly the extra CPU usage is coming from (from what component). With that information, you then need to figure out what option(s) you have to reduce it, if any.

Just a few notes and suggestions for you to try:

  • Having a glBlitFramebuffer() in all this doesn’t serve any useful purpose unless you are rendering to a format besides GL_RGB565. For the case that you are rendering to GL_RGB565, just remove this needless overhead.

  • Did you ever try export __GL_YIELD=USLEEP, just in case the extra overhead was the GL driver twiddling its thumbs in a busy-wait? (Other possible options to try and compare against: export __GL_YIELD= and export __GL_YIELD=NOTHING.)

  • Try doing 2 glReadPixels() operations back-to-back (for different pixels in the FBO). Time each separately. If the 2nd takes significantly less time, it could be that the 1st glReadPixels() is including pipeline “flush” behavior, which could help explain some of your extra CPU usage.

  • Consider comparing the glReadPixels() timings and CPU usage against glGetTexImage().

  • A tip suggested by @GClements recently (LINK)… See if your NVIDIA Jetson Xavier NX platform supports one of the EGL lock surface extensions (EGL_KHR_lock_surface, EGL_KHR_lock_surface2, EGL_KHR_lock_surface3). From gpuinfo.org, apparently some NVIDIA Tegra platforms do. If so, it appears that you may be able to request that the graphics driver exactly match the RGB565 format you’re targeting with a specified byte order in an EGL surface it creates, as well as provide you a way to access the RGB565 framebuffer data directly without doing a glReadPixels(). If supported, that may be what you need to reduce CPU usage accessing the rendered pixel data.
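When probing for those extensions, note that the string returned by eglQueryString(dpy, EGL_EXTENSIONS) is one space-separated list, and a plain strstr() can false-positive on prefixes (EGL_KHR_lock_surface is a prefix of EGL_KHR_lock_surface2). A whole-token check along these lines is safer (the helper name is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True if `name` appears as a whole space-delimited token in `extList`,
// the format eglQueryString(dpy, EGL_EXTENSIONS) returns.
static bool hasExtension(const char* extList, const char* name)
{
    const std::size_t len = std::strlen(name);
    for (const char* p = extList; (p = std::strstr(p, name)) != nullptr; ++p)
    {
        const bool startsToken = (p == extList) || (p[-1] == ' ');
        const bool endsToken   = (p[len] == ' ') || (p[len] == '\0');
        if (startsToken && endsToken)
            return true;
    }
    return false;
}
```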

Hello,

I profiled my application’s CPU usage, and it breaks down as follows per algorithm:

performRenderBuffer16: 6-7% CPU usage from the glMapBufferRange/glUnmapBuffer calls
and 2% from memcpy().

doReadbackFAST16: 5-6% CPU usage from the glGetBufferSubData call.

1. It seems to me glBlitFramebuffer does not add much CPU load.

2. I tried export __GL_YIELD=USLEEP, export __GL_YIELD= and export __GL_YIELD=NOTHING
with the performRenderBuffer16() algorithm above; export __GL_YIELD=USLEEP is the best fit,
reducing the total CPU usage to 12-16%.

Note that for the performRenderBuffer16() algorithm, I added the following modifications:

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    //glReadBuffer(frameBuffer16); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

For the doReadbackFAST16 algorithm I noticed that I had forgotten to add

	glBindFramebuffer(GL_FRAMEBUFFER,frameBuffer16);

before glReadPixels; with that fixed, the total CPU usage of this algorithm is 13-17%.
With export __GL_YIELD=USLEEP, export __GL_YIELD= and export __GL_YIELD=NOTHING,
the doReadbackFAST16 results are all similar: 13-16% total CPU usage.

So it seems I should use export __GL_YIELD=USLEEP.

  1. As you suggested, I implemented a function that reads 24-bit and 16-bit colors back to back
    with blocking glReadPixels:
void WaylandEgl::createDarkRenderBuffers()
{
    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size16 = mWinHeight * mWinWidth * 2;
        pbo_size24 = mWinHeight * mWinWidth * 3;
        nBytesPerLine = mWinWidth ;
        Readback_buf24 = (GLchar *) malloc( pbo_size24 );
        Readback_buf16 = (GLchar *) malloc( pbo_size16 );

        glGenRenderbuffers( 1, &darkRenderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, darkRenderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        GLenum err16 = glGetError(); // capture once; a second glGetError() would return GL_NO_ERROR
        if (err16 == GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK";
        }
        else
        {
            qDebug() << "Render buff storage error is" << err16;
        }

        glGenFramebuffers( 1, &darkFrameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, darkRenderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }

        glGenRenderbuffers( 1, &darkRenderBuffer24 );
        glBindRenderbuffer( GL_RENDERBUFFER, darkRenderBuffer24 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB8, mWinWidth, mWinHeight ); // 24-bit storage to match the GL_UNSIGNED_BYTE readback
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        GLenum err24 = glGetError(); // capture once; a second glGetError() would return GL_NO_ERROR
        if (err24 == GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK";
        }
        else
        {
            qDebug() << "Render buff storage error is" << err24;
        }

        glGenFramebuffers( 1, &darkFrameBuffer24 );
        glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer24);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, darkRenderBuffer24);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;


        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Row length before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_UNPACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Row length after" << rowL;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

    }
}

void WaylandEgl::performDarkRenderBuffers()
{
    Timer t1;
    createDarkRenderBuffers();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,darkFrameBuffer24);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer24);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_BYTE, Readback_buf24);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();
    qDebug() << "Blit Time1 is: " << blitTime;
    qDebug() << "Read Time1 is: " << readTime;

    t1.start();
    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,darkFrameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindFramebuffer( GL_FRAMEBUFFER, darkFrameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf16);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    qDebug() << "Blit Time2 is: " << blitTime;
    qDebug() << "Read Time2 is: " << readTime;
}

As you predicted, the 2nd read consistently takes significantly less time, e.g.:

Read Time1 is: 4.531 ms
Read Time2 is: 1.144 ms

So, should I do anything about this pipeline “flush” behavior?

  2. I have not used glGetTexImage before, and I think I need to find out how to reach Qt’s default GL_TEXTURE_2D.
    When I try to get the already-bound texture object from Qt as follows:
void WaylandEgl::performTextureStaff()
{
    if (!buffCreated)
    {
        int sizeX, sizeY;
        GLenum format;

        glGetIntegerv ( GL_TEXTURE_BINDING_2D, (int *) &boundTex );

        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_WIDTH, &sizeX);
        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_HEIGHT, &sizeY);
        glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_INTERNAL_FORMAT, (GLint*)&format);

        qDebug() << "Size x" << sizeX << "size y" << sizeY << "format" << format;

        qDebug() << "Text id " << boundTex;
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 4;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        buffCreated = true;
    }

    Timer t1;
    glFinish();
    t1.start();

    glBindTexture ( GL_TEXTURE_2D, boundTex );
    qDebug() << "Bind error is " << glGetError();
    glGetTexImage ( GL_TEXTURE_2D, 0, GL_RGBA , GL_UNSIGNED_BYTE, Readback_buf );
    qDebug() << "glGetTexImage error is " << glGetError();
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();
    //qDebug() << "Read Text time" << readTime;
}

I do not get correct results for the width and height, and I see only a black screen, so I need to check this.

  3. I need to check the EGL lock surface extensions on Xavier and will update here.

Regards

An update on glGetTexImage: I implemented the following running example and tested glReadPixels
and glGetTexImage one at a time by uncommenting them:

void WaylandEgl::performTextureStaff()
{
    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glGenFramebuffers(1, &textFrameBuffer);
        glBindFramebuffer(GL_FRAMEBUFFER, textFrameBuffer);

        glGenTextures(1, &boundTex);
        glBindTexture(GL_TEXTURE_2D, boundTex);
        glTexImage2D(GL_TEXTURE_2D, 0,GL_RGB, mWinWidth, mWinHeight, 0,GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);

        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,GL_TEXTURE_2D, boundTex, 0);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Texture Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }

    glFinish();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,textFrameBuffer);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    Timer t1;
    t1.start();

    glBindTexture(GL_TEXTURE_2D, boundTex);
    //glGetTexImage(GL_TEXTURE_2D,0,GL_RGB,GL_UNSIGNED_SHORT_5_6_5,Readback_buf);
    //glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, Readback_buf);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    qDebug() << "Process time is " << processTime;
}

Results:

glGetTexImage: 28-35% CPU load without export __GL_YIELD=USLEEP, 21-25% with it; process time is 2.151 ms.
glReadPixels: 25-30% CPU load without export __GL_YIELD=USLEEP, 22-26% with it; process time is 2.758 ms.

I will put a bit more focus on glGetTexImage to see whether it helps in the other algorithms as well.

As for the EGL_KHR_lock_surface extension, I checked for it as follows, and it is not among the supported extensions:

    qDebug("EGL Version: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_VERSION));
    qDebug("EGL Vendor: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_VENDOR));
    qDebug("EGL Extensions: \"%s\"\n", eglQueryString(eglGetCurrentDisplay(), EGL_EXTENSIONS));

EGL Version: “1.5”

EGL Vendor: “NVIDIA”

EGL Extensions: “EGL_ANDROID_native_fence_sync EGL_EXT_buffer_age EGL_EXT_client_sync EGL_EXT_create_context_robustness EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_EXT_output_base EGL_EXT_output_drm EGL_EXT_protected_content EGL_EXT_stream_consumer_egloutput EGL_EXT_stream_acquire_mode EGL_EXT_sync_reuse EGL_IMG_context_priority EGL_KHR_config_attribs EGL_KHR_create_context_no_error EGL_KHR_context_flush_control EGL_KHR_create_context EGL_KHR_display_reference EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_partial_update EGL_KHR_swap_buffers_with_damage EGL_KHR_no_config_context EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_reusable_sync EGL_KHR_stream EGL_KHR_stream_attrib EGL_KHR_stream_consumer_gltexture EGL_KHR_stream_cross_process_fd EGL_KHR_stream_fifo EGL_KHR_stream_producer_eglsurface EGL_KHR_surfaceless_context EGL_KHR_wait_sync EGL_MESA_image_dma_buf_export EGL_NV_context_priority_realtime EGL_NV_cuda_event EGL_NV_nvrm_fence_sync EGL_NV_stream_cross_display EGL_NV_stream_cross_object EGL_NV_stream_cross_process EGL_NV_stream_flush EGL_NV_stream_metadata EGL_NV_stream_remote EGL_NV_stream_reset EGL_NV_stream_socket EGL_NV_stream_socket_unix EGL_NV_stream_sync EGL_NV_stream_fifo_next EGL_NV_stream_consumer_gltexture_yuv EGL_NV_stream_attrib EGL_NV_system_time EGL_NV_output_drm_flip_event EGL_WL_bind_wayland_display EGL_WL_wayland_eglstream”

Thanks for the hint; I will continue with the other optimizations you mentioned above.

Regards

One more update: I modified your doReadbackFAST algorithm to use glGetTexImage, as follows:

void WaylandEgl::doFastReadBackTexture() // 12.4  cpu load :)
{
    // Work-around for NVidia driver readback crippling on GeForce.

    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glGenFramebuffers(1, &textFrameBuffer);
        glBindFramebuffer(GL_FRAMEBUFFER, textFrameBuffer);

        glGenTextures(1, &boundTex);
        glBindTexture(GL_TEXTURE_2D, boundTex);
        glTexImage2D(GL_TEXTURE_2D, 0,GL_RGB, mWinWidth, mWinHeight, 0,GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);

        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,GL_TEXTURE_2D, boundTex, 0);

        GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);
        if (status != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer incomplete, status" << status;
        }
        else
        {
            qDebug() << "Texture Framebuffer is OK";
        }
        buffCreated = true;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not found!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not found!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not found!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not found!";
            return;
        }

        qDebug() << "Running the 16-bit optimizations";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;

        int rowL;
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_PACK_ROW_LENGTH, nBytesPerLine ); // GL_PACK_ROW_LENGTH is in pixels, not bytes

        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;
    }


    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,textFrameBuffer);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    // Do a color readback to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glBindTexture(GL_TEXTURE_2D, boundTex);
    //glReadPixels( 0, 0, mWinWidth, mWinHeight,
      //            GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    glGetTexImage(GL_TEXTURE_2D,0,GL_RGB,GL_UNSIGNED_SHORT_5_6_5,0);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}

and glGetTexImage consumes 12-14% CPU, with a Read Time of 0.216 ms and a Process Time of 3.296 ms.
That is similar to the CPU load of the performRenderBuffer16 algorithm, which is a good number. :)

Regards

Hello ,

I have some more questions:

I cannot find GL_BUFFER_GPU_ADDRESS_NV, GLuint64EXT, etc. in the <GLES*> includes. They exist in <GL/glext.h>, which is for desktop OpenGL. Since I want to run the algorithms on an embedded device, what should I do? Should I add the <GL/gl.h> and <GL/glext.h> includes?

Also, glReadPixels returns the pixels in the wrong order for my use (vertically flipped) because of its internal convention. To reorder the pixels I wrote my own algorithm, which adds some CPU load. Do you know of a ready-made OpenGL function that puts the pixels in the correct order before I send them to the other device?

Best Regards

If you want to run code on non-NVIDIA platforms, then don’t use NVIDIA-only extensions, like most of their bindless stuff.

Hello ,

I am sorry, but I could not understand what you mean by “don’t use NVIDIA-only extensions”. If you mean not using GL_BUFFER_GPU_ADDRESS_NV because it does not exist in the <GLES*> includes on the NVIDIA platform, do you have any other suggestions? It seemed helpful for reducing CPU usage on the NVIDIA Xavier NX platform, so I used it. Also, should I add <GL/gl.h> and <GL/glext.h> when running the algorithm on an embedded platform, and is that reasonable?

Also, glReadPixels returns the pixels in the wrong order for my use (vertically flipped) because of its internal convention. To reorder the pixels I wrote my own algorithm, which adds some CPU load. Do you know of a ready-made OpenGL function that puts the pixels in the correct order before I send them to the other device?

Best Regards

I can’t suggest an alternative because I don’t know your code. I don’t know what your algorithm is, nor do I know why you’re using bindless memory and such like this.

I could say to use SSBOs, but that wouldn’t explain how to use them to get equivalent behavior, since I have no idea what that behavior would be. Without knowing anything about your particular use case, I can’t say.

You said that before, but it still doesn’t make sense. What is the “wrong direction” exactly, and what would the right direction be?

Hello,

In fact I described my requirement and algorithms in previous posts, so I will try to summarize them as follows:

I am trying to save a screenshot of a Qt Quick Controls QML application on the NVIDIA Jetson Xavier NX platform (running Qt on Wayland) by using native OpenGL functions. What I need is to get the 16-bit RGB color buffer pixels and send them to another device that has no dma-buf or OpenGL support, so I have to send the color pixels as a byte array. I managed to implement this with a GL_RGB565 renderbuffer and a PBO with asynchronous readback, and I tried to reduce the CPU load with the different optimizations mentioned in previous posts. My two algorithms are as follows:

First Algorithm:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Height" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth; // row length in pixels, despite the name
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        GLenum err = glGetError(); // read once; glGetError() clears the flag
        if (err == GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK";
        }
        else
        {
            qDebug() << "Render buff storage error is" << err;
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);
        if (status != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer incomplete, status" << status;
        }
        else
        {
            qDebug() << "Framebuffer is OK";
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;

        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_UNPACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not found!";
            return;
        }

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        glGenBuffers(PBO_COUNT,pboIds);
        glBindBuffer(GL_PIXEL_PACK_BUFFER,pboIds[0]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER,pboIds[1]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }
}

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    //glReadBuffer(frameBuffer16); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "Mapped pointer is NULL";
    }
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    //eglMakeCurrent(eglGetCurrentDisplay(), eglGetCurrentSurface(EGL_DRAW), eglGetCurrentSurface(EGL_READ), eglGetCurrentContext());
    //qDebug() << "Err"<< eglGetError();

    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    qDebug() << "Blit Time " << blitTime;
    qDebug() << "Process Time " << processTime;
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Second Algorithm:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::initFastBuffers16()
{
    if (!buffCreated)
    {
        checkTypes();
        createRenderBuffer16();

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not found!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not found!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not found!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not found!";
            return;
        }

        qDebug() << "Running the 16-bit optimizations";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;
    }
}

void WaylandEgl::doReadbackFAST16() // perfect on intel..
{
    initFastBuffers16();

    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    // Do a color readback to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glBindFramebuffer(GL_FRAMEBUFFER,frameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight,
                  GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So I think the NVIDIA-specific function calls are reducing the CPU load when I run the algorithm on the
NVIDIA platform, though maybe I am wrong, because I could not find some of the functions in the <GLES*> includes.
For this reason I asked whether adding the <GL/gl.h> and <GL/glext.h> includes is reasonable when running the
algorithm on embedded platforms. Are <GL/gl.h> and <GL/glext.h> only meant for desktop systems, or can they
also be used when running the algorithms on embedded devices?

Since I will transfer the color pixels to another device, the pixels that glReadPixels returns have to be flipped:
it reads rows from the bottom up, which leaves the screenshot reversed. For this reason I have to flip the pixels
before sending them to the other device, and I would like to know if there is a ready-made function in the OpenGL libs for this.
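For reference, a CPU-side flip of this kind is essentially a row swap. A minimal sketch (the function name and the 2-bytes-per-pixel RGB565 layout are illustrative, not the exact code I run):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Flip an image vertically in place by swapping rows top<->bottom.
// For RGB565 data, bytesPerPixel is 2.
void flipVertically(uint8_t *pixels, int width, int height, int bytesPerPixel)
{
    const size_t rowBytes = static_cast<size_t>(width) * bytesPerPixel;
    std::vector<uint8_t> tmp(rowBytes);   // scratch space for one row
    for (int y = 0; y < height / 2; ++y)
    {
        uint8_t *top    = pixels + static_cast<size_t>(y) * rowBytes;
        uint8_t *bottom = pixels + static_cast<size_t>(height - 1 - y) * rowBytes;
        std::memcpy(tmp.data(), top, rowBytes);
        std::memcpy(top, bottom, rowBytes);
        std::memcpy(bottom, tmp.data(), rowBytes);
    }
}
```

The work is one memcpy pass over the buffer, so for a 1080p RGB565 frame it touches about 4 MB of memory per flip.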

Best Regards

For ES, you shouldn’t use any of those headers, only <GLES3/gl3*.h>.

ES extensions are distinct from desktop OpenGL extensions.

glReadPixels always returns data starting at the lower-left corner (you can’t specify a negative value for GL_PACK_ROW_LENGTH). You can use glBlitFramebuffer to flip an image by specifying Y1<Y0 for either the source or destination rectangle. That may or may not be faster than performing the flip in software. You could also read the data a row at a time with multiple glReadPixels calls, but I suspect that would be slow.
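A sketch of the blit-based flip described above (the FBO handles and dimensions are placeholders for whatever your app uses):

```cpp
// Flip vertically during the blit by reversing the destination
// rectangle's Y coordinates (dstY0 = height, dstY1 = 0).
// defaultFbo, captureFbo, winWidth, and winHeight are placeholders.
glBindFramebuffer(GL_READ_FRAMEBUFFER, defaultFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, captureFbo);
glBlitFramebuffer(0, 0, winWidth, winHeight,         // src rect, bottom-up
                  0, winHeight, winWidth, 0,         // dst rect, Y reversed
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);  // 1:1 copy, no filtering needed
```

A subsequent glReadPixels from captureFbo then returns rows in top-down order.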

You could try rendering the image inverted in Y. Then the glReadPixels() should give you the pixel order you want.

Read up on a Y-inverted projection matrix and glFrontFace().
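A Y-inverted projection just negates the Y scale of the matrix. As a sketch, here is a standard column-major orthographic projection where passing bottom > top produces the inversion (this is a generic construction, not code from the thread):

```cpp
#include <array>

// Column-major 4x4 orthographic projection matrix, OpenGL convention.
// Calling this with b > t flips the Y axis: the scene renders upside
// down, so glReadPixels returns rows in the order the app wants.
std::array<float, 16> ortho(float l, float r, float b, float t, float n, float f)
{
    std::array<float, 16> m{};            // zero-initialized
    m[0]  =  2.0f / (r - l);
    m[5]  =  2.0f / (t - b);              // negative when b > t: Y inverted
    m[10] = -2.0f / (f - n);
    m[12] = -(r + l) / (r - l);
    m[13] = -(t + b) / (t - b);
    m[14] = -(f + n) / (f - n);
    m[15] =  1.0f;
    return m;
}
```

Note that inverting Y also reverses the winding order of your triangles, which is why glFrontFace() (e.g. switching to GL_CW) comes into play when face culling is enabled.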

Hello,

As you suggested, I removed the <GL/gl.h> and <GL/glext.h> includes for ES and am using only the <GLES3*> includes, but some functions, such as glGetTexImage, do not exist in the <GLES*> includes, and this function does work on my NVIDIA Xavier platform, since a test I performed with it was successful. For such functions, should I create a header file and put in a declaration such as:

GLAPI void GLAPIENTRY glGetTexImage( GLenum target, GLint level,
                                     GLenum format, GLenum type,
                                     GLvoid *pixels );

or what option do I need to use ?

Also, I used the glBlitFramebuffer flip, which is successful and consumes less CPU, since I already had a glBlitFramebuffer call in my algorithm.

Best Regards

Hello,

I will check Y-inverted projection matrix and glFrontFace().

Best Regards

It’s academic where the header prototypes come from. The issue is which API you are targeting: OpenGL or OpenGL ES. This determines:

  1. which physical library you link with: libGLESv2 or libGL,
  2. which API prototypes you need to match the APIs in that library (whether included or not), and
  3. which type of graphics context you create.

You just have to choose.

According to the NVIDIA Linux Jetson Developer’s Guide (Software Features : Graphics), the platform supports both OpenGL 4.6 and OpenGL ES 3.2. So you have a choice.

Flipping back and forth between them dynamically on one graphics context, as if there were no difference, might work with some vendors. But AFAIK this is not required to work per the specs. Your app could start crashing or misbehaving at any time. To avoid problems, you should choose one and stick with it.

If you’ve decided that you do need glGetTexImage(), then you should choose OpenGL, as this doesn’t exist in OpenGL ES. This then suggests you should be:

  • including the GL includes (e.g. <GL/gl.h> and <GL/glext.h>, not <GLES/*>),
  • linking with the GL library (e.g. -lGL, not -lGLESv2)
  • creating an OpenGL context (not an OpenGL ES context).
  • calling OpenGL APIs (not OpenGL-ES APIs).
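With EGL, that choice is made explicit via eglBindAPI before creating the context. A sketch (error handling omitted; `display` and `config` are assumed to come from the usual eglGetDisplay/eglChooseConfig setup):

```cpp
#include <EGL/egl.h>

// Select desktop OpenGL rather than OpenGL ES for this thread.
// Use EGL_OPENGL_ES_API instead to stay on the ES path.
EGLContext createDesktopGLContext(EGLDisplay display, EGLConfig config)
{
    eglBindAPI(EGL_OPENGL_API);

    // Request a specific OpenGL version (values are illustrative).
    static const EGLint ctxAttribs[] = {
        EGL_CONTEXT_MAJOR_VERSION, 4,
        EGL_CONTEXT_MINOR_VERSION, 6,
        EGL_NONE
    };
    return eglCreateContext(display, config, EGL_NO_CONTEXT, ctxAttribs);
}
```

With a context created this way you link against libGL, include <GL/gl.h> and <GL/glext.h>, and glGetTexImage is available; the <GLES*> headers stay out of the build entirely.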