Fast Readbacks on Intel and NVIDIA

One more update: I modified your doReadbackFAST algorithm to use glGetTexImage, as follows:

void WaylandEgl::doFastReadBackTexture() // ~12.4% CPU load :)
{
    // Work-around for NVidia driver readback crippling on GeForce.

    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glGenFramebuffers(1, &textFrameBuffer);
        glBindFramebuffer(GL_FRAMEBUFFER, textFrameBuffer);

        glGenTextures(1, &boundTex);
        glBindTexture(GL_TEXTURE_2D, boundTex);
        glTexImage2D(GL_TEXTURE_2D, 0,GL_RGB, mWinWidth, mWinHeight, 0,GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);

        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);

        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,GL_TEXTURE_2D, boundTex, 0);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Texture Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not fouynded!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not fouynded!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not fouynded!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        qDebug() << "Run the optimizatiosn16";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;

        int rowL;
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);

        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;
    }


    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,textFrameBuffer);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    // Read the color texture back into BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glBindTexture(GL_TEXTURE_2D, boundTex);
    //glReadPixels( 0, 0, mWinWidth, mWinHeight,
      //            GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    glGetTexImage(GL_TEXTURE_2D,0,GL_RGB,GL_UNSIGNED_SHORT_5_6_5,0);

    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}

With this change glGetTexImage consumes 12-14% CPU; Read Time is 0.216 ms and Process Time is 3.296 ms.
That is similar to the CPU load of the performRenderBuffer16 algorithm, which is a good number :slight_smile:

Regards

Hello,

I have some more questions:

I cannot find GL_BUFFER_GPU_ADDRESS_NV, GLuint64EXT, etc. inside the <GLES*> includes. They exist in #include <GL/glext.h>, which is for desktop. Since I want to run the algorithms on an embedded device, what should I do? Should I add the <GL/gl.h> and <GL/glext.h> includes?

Also, glReadPixels returns the pixels in the wrong order because of the way glReadPixels works internally. To put the pixels in the right order I wrote my own routine, which adds some CPU load of its own. Do you know if there is a ready-made OpenGL function to put the pixels in the correct order before sending them to another device?

Best Regards

If you want to run code on non-NVIDIA platforms, then don’t use NVIDIA-only extensions, like most of their bindless stuff.

Hello,

I am sorry, but I could not understand what you mean by "don't use NVIDIA-only extensions". If you mean not using GL_BUFFER_GPU_ADDRESS_NV because it does not exist in the <GLES*> includes on the NVIDIA platform, do you have any other suggestions? I used it because it seems to help reduce CPU usage on the NVIDIA Xavier NX platform. Also, should I add <GL/gl.h> and <GL/glext.h>, and does that make sense, when running the algorithm on an embedded platform?

Also, glReadPixels returns the pixels in the wrong order because of the way glReadPixels works internally. To put the pixels in the right order I wrote my own routine, which adds some CPU load of its own. Do you know if there is a ready-made OpenGL function to put the pixels in the correct order before sending them to another device?

Best Regards

I can’t suggest an alternative because I don’t know your code. I don’t know what your algorithm is, nor do I know why you’re using bindless memory and such like this.

I could say to use SSBOs, but that wouldn’t explain how to use them to get equivalent behavior, since I have no idea what that behavior would be. Without knowing anything about your particular use case, I can’t say.

You said that before, but it still doesn’t make sense. What is the “wrong direction” exactly, and what would the right direction be?

Hello,

In fact I described my requirement and algorithms in previous posts, so I will try to summarize them as follows:

I am trying to save a screenshot of a QML Quick Controls application on the NVIDIA Jetson Xavier NX platform (running Qt on Wayland) using native OpenGL functions. What I need is to get the 16-bit RGB color buffer pixels and send them to another device which has no DMA-BUF or OpenGL support, so I have to send the color pixels as a byte array. I managed to implement this with a GL_RGB565 renderbuffer and a PBO with asynchronous readback, and I tried to reduce the CPU load with the different optimizations mentioned in previous posts. My two algorithms are the following:

First Algorithm:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::createRenderBuffer16()
{
    if (!buffCreated)
    {
        qDebug() << "Heiht" << mWinHeight << "Width" << mWinWidth;
        pbo_size = mWinHeight * mWinWidth * 2;
        nBytesPerLine = mWinWidth ;
        Readback_buf = (GLchar *) malloc( pbo_size );

        glInfo glInfo;
        glInfo.getInfo();
        glInfo.printSelf();

        glGenRenderbuffers( 1, &renderBuffer16 );
        glBindRenderbuffer( GL_RENDERBUFFER, renderBuffer16 );
        glRenderbufferStorage( GL_RENDERBUFFER, GL_RGB565, mWinWidth, mWinHeight );
        glBindRenderbuffer(GL_RENDERBUFFER, 0);

        if (glGetError()==GL_NO_ERROR)
        {
            qDebug() << "Render buff storage is OK" << glGetError();
        }
        else
        {
            qDebug() << "Render buff storage error is " << glGetError();
        }

        glGenFramebuffers( 1, &frameBuffer16 );
        glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);
        glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, renderBuffer16);

        if( glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        {
            qDebug() << "Framebuffer error is " << glGetError();
        }
        else
        {
            qDebug() << "Framebuffer is OK" << glGetError();
        }
        buffCreated = true;

        GLint format = 0, type = 0;
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_FORMAT, &format);
        glGetIntegerv(GL_IMPLEMENTATION_COLOR_READ_TYPE, &type);

        qDebug() << "Format" << format;
        qDebug() << "Type" << type;

        int rowL;

        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl before" << rowL;

        glPixelStorei( GL_PACK_ALIGNMENT, 1 );
        glPixelStorei( GL_UNPACK_ALIGNMENT, 1 );
        glPixelStorei(GL_PACK_ROW_LENGTH,nBytesPerLine);
        qDebug() << "Pixel st" << glGetError();
        glGetIntegerv(GL_PACK_ROW_LENGTH, &rowL);
        qDebug() << "Rowl after" << rowL;

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        glBindFramebuffer(GL_FRAMEBUFFER, 0);

        glGenBuffers(PBO_COUNT,pboIds);
        glBindBuffer(GL_PIXEL_PACK_BUFFER,pboIds[0]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER,pboIds[1]);
        glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }
}

void WaylandEgl::performRenderBuffer16()
{
    Timer t1;
    createRenderBuffer16();

    glFinish();
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    t1.stop();
    blitTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);
    glBindFramebuffer( GL_FRAMEBUFFER, frameBuffer16);

    //glReadBuffer(frameBuffer16); // frameBuffer16 also works
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    glReadPixels( 0, 0, mWinWidth, mWinHeight, GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0);
    //glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIds[0]);

    GLubyte *ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    if (ptr)
    {
        memcpy(Readback_buf, ptr, pbo_size);

        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    else
    {
        qDebug() << "NULL bokk";
    }
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    //eglMakeCurrent(eglGetCurrentDisplay(), eglGetCurrentSurface(EGL_DRAW), eglGetCurrentSurface(EGL_READ), eglGetCurrentContext());
    //qDebug() << "Err"<< eglGetError();

    //t1.stop();
    //readTime = t1.getElapsedTimeInMilliSec();

    qDebug() << "Blit Time " << blitTime;
    qDebug() << "Read Time " << processTime;
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Second Algorithm:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void WaylandEgl::initFastBuffers16()
{
    if (!buffCreated)
    {
        checkTypes();
        createRenderBuffer16();

        glGenBuffers( PBO_COUNT, pboIds );

        // Buffer #0: glReadPixels target
        GLenum target = GL_PIXEL_PACK_BUFFER;

        glBindBuffer( target, pboIds[0] );
        glBufferData( target, pbo_size, 0, GL_STATIC_COPY );


        glGetBufferParameterui64vNV = (PFNGLGETBUFFERPARAMETERUI64VNVPROC)eglGetProcAddress("glGetBufferParameterui64vNV");
        if (!glGetBufferParameterui64vNV)
        {
            qDebug() << "glGetBufferParameterui64vNV not fouynded!";
            return;
        }

        glMakeBufferResidentNV = (PFNGLMAKEBUFFERRESIDENTNVPROC)eglGetProcAddress("glMakeBufferResidentNV");
        if (!glMakeBufferResidentNV)
        {
            qDebug() << "glMakeBufferResidentNV not fouynded!";
            return;
        }

        glUnmapBufferARB = (PFNGLUNMAPBUFFERARBPROC)eglGetProcAddress("glUnmapBufferARB");
        if (!glUnmapBufferARB)
        {
            qDebug() << "glUnmapBufferARB not fouynded!";
            return;
        }

        glGetBufferSubData = (PFNGLGETBUFFERSUBDATAPROC)eglGetProcAddress("glGetBufferSubData");
        if (!glGetBufferSubData)
        {
            qDebug() << "glGetBufferSubData not fouynded!";
            return;
        }

        qDebug() << "Run the optimizatiosn16";


        GLuint64EXT addr;
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV( target, GL_READ_ONLY );

        // Buffer #1: glCopyBuffer target
        target = GL_COPY_WRITE_BUFFER;
        glBindBuffer( target, pboIds[1] );
        glBufferData( target, pbo_size, 0, GL_STREAM_READ );

        glMapBufferRange( target, 0, 1, GL_MAP_WRITE_BIT);
        glUnmapBufferARB( target );
        glGetBufferParameterui64vNV( target, GL_BUFFER_GPU_ADDRESS_NV, &addr );
        glMakeBufferResidentNV     ( target, GL_READ_ONLY );
        buffCreated = true;
    }
}

void WaylandEgl::doReadbackFAST16() // perfect on Intel
{
    initFastBuffers16();

    glFinish();
    Timer t1;
    t1.start();

    glBindFramebuffer(GL_READ_FRAMEBUFFER,mwindow->openglContext()->defaultFramebufferObject());
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER,frameBuffer16);
    glBlitFramebuffer(0, 0, mWinWidth, mWinHeight, 0, 0, mWinWidth, mWinHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);

    // Do a color readback to BUF OBJ #0
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboIds[0] );
    glBindFramebuffer(GL_FRAMEBUFFER,frameBuffer16);
    glReadPixels( 0, 0, mWinWidth, mWinHeight,
                  GL_RGB, GL_UNSIGNED_SHORT_5_6_5, 0 );
    t1.stop();
    readTime = t1.getElapsedTimeInMilliSec();

    t1.start();
    // Copy from BUF OBJ #0 to BUF OBJ #1
    glBindBuffer( GL_COPY_WRITE_BUFFER, pboIds[1] );
    glCopyBufferSubData( GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0,
                         pbo_size );

    // Do the readback from BUF OBJ #1 to app CPU memory
    glGetBufferSubData( GL_COPY_WRITE_BUFFER, 0, pbo_size,
                        Readback_buf );

    //sendImage((unsigned char*)Readback_buf,pbo_size);
    t1.stop();
    processTime = t1.getElapsedTimeInMilliSec();
    glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );
    //qDebug() << "Read Time " << readTime;
    //qDebug() << "Process Time " << processTime;
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So I am thinking that using the NVIDIA-specific function calls reduces the CPU load when I run the algorithm on the NVIDIA platform, although maybe I am wrong, because I could not find some of those functions in the <GLES*> includes. For this reason I asked whether adding the <GL/gl.h> and <GL/glext.h> includes makes sense when running the algorithm on embedded platforms. Are <GL/gl.h> and <GL/glext.h> meant only for desktop systems, or can they also be used when running the algorithms on embedded devices?

Since I will transfer the color pixels to another device, the pixels read by glReadPixels have to be flipped, because they are read in the opposite row order, which makes the screenshot appear upside down. For this reason I have to flip the pixels before sending them to the other device, and I want to know if there is a ready-made function for this inside the OpenGL libraries.

Best Regards

For ES, you shouldn’t use any of those headers, only <GLES3/gl3*.h>.

ES extensions are distinct from desktop OpenGL extensions.
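
For example, before relying on something like the bindless buffer calls, you can check at run time whether the extension is actually advertised by the context you created. A minimal sketch (hasGLExtension is just a helper name I made up; GL_NV_shader_buffer_load is the extension that provides glMakeBufferResidentNV and GL_BUFFER_GPU_ADDRESS_NV):

#include <GLES3/gl3.h>
#include <cstring>

// Returns true if 'name' appears in the current context's extension list.
static bool hasGLExtension(const char *name)
{
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i)
    {
        const char *ext =
            reinterpret_cast<const char *>(glGetStringi(GL_EXTENSIONS, i));
        if (ext && std::strcmp(ext, name) == 0)
            return true;
    }
    return false;
}

// Usage, with a current context:
//   if (!hasGLExtension("GL_NV_shader_buffer_load"))
//       /* fall back to the plain PBO path */;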

glReadPixels always returns data starting at the lower-left corner (you can’t specify a negative value for GL_PACK_ROW_LENGTH). You can use glBlitFramebuffer to flip an image by specifying Y1<Y0 for either the source or destination rectangle. That may or may not be faster than performing the flip in software. You could also read the data a row at a time with multiple glReadPixels calls, but I suspect that would be slow.
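
For example, your existing blit could be turned into a flipping blit just by swapping the Y coordinates of the destination rectangle. A rough sketch (blitFlippedY is only an illustrative name; it uses nothing beyond standard ES 3.0 calls):

#include <GLES3/gl3.h>

// Blit srcFbo into dstFbo mirrored vertically, so that a subsequent
// glReadPixels from dstFbo returns rows top-to-bottom instead of bottom-to-top.
static void blitFlippedY(GLuint srcFbo, GLuint dstFbo, GLint width, GLint height)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, srcFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, dstFbo);
    // Destination Y1 < Y0 mirrors the image in Y; source and destination
    // sizes are identical, so GL_NEAREST is sufficient.
    glBlitFramebuffer(0, 0, width, height,
                      0, height, width, 0,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}

In doReadbackFAST16 that would replace the existing blit, e.g. blitFlippedY(mwindow->openglContext()->defaultFramebufferObject(), frameBuffer16, mWinWidth, mWinHeight).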

You could try rendering the image inverted in Y. Then the glReadPixels() should give you the pixel order you want.

Read up on a Y-inverted projection matrix and glFrontFace().
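
A rough sketch of that idea in Qt terms (illustrative only; where you can hook the projection depends on how your scene is rendered, which your code above doesn't show):

#include <QMatrix4x4>
#include <GLES3/gl3.h>

// Orthographic projection with bottom/top swapped relative to the usual
// (bottom = 0, top = height) setup; everything rendered with it comes out
// mirrored in Y, so glReadPixels then delivers rows in the order you want.
QMatrix4x4 makeYInvertedProjection(float width, float height)
{
    QMatrix4x4 proj;
    proj.ortho(0.0f, width,      // left, right
               height, 0.0f,     // bottom, top (swapped -> Y-inverted)
               -1.0f, 1.0f);     // near, far
    return proj;
}

void setYInvertedWinding()
{
    // Mirroring in Y reverses triangle winding, so declare clockwise
    // triangles as front-facing to keep face culling correct.
    glFrontFace(GL_CW);
}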

Hello,

As you suggested, I removed #include <GL/gl.h> and #include <GL/glext.h> for ES and I am now only using the <GLES3*> includes. However, some functions such as glGetTexImage do not exist in the <GLES*> includes, and glGetTexImage is not one of the NVIDIA bindless extensions; it does work on my NVIDIA Xavier platform, since a test with it was successful. So for such functions, should I create a header file and put in a declaration such as:

GLAPI void GLAPIENTRY glGetTexImage( GLenum target, GLint level,
                                     GLenum format, GLenum type,
                                     GLvoid *pixels );

or which option should I use?

Also, I used glBlitFramebuffer for the flip, which works and adds little extra CPU load, since my algorithm already contained a glBlitFramebuffer call.

Best Regards

Hello,

I will look into a Y-inverted projection matrix and glFrontFace().

Best Regards

It’s academic where the header prototypes come from. The issue is which API you are targeting: OpenGL or OpenGL ES. This determines:

  1. which physical library you link with: libGLESv2 or libGL,
  2. which API prototypes you need to match the APIs in that library (whether included or not), and
  3. which type of graphics context you create.

You just have to choose.

According to the NVIDIA Linux Jetson Developer’s Guide (Software Features : Graphics), the platform supports both OpenGL 4.6 and OpenGL ES 3.2. So you have a choice.

Flipping back and forth between them dynamically on one graphics context, as if there were no difference, might work with some vendors. But AFAIK this is not required to work per the spec. Your app could start crashing or misbehaving at any time. To avoid problems, you should choose one and stick with it.

If you’ve decided that you do need glGetTexImage(), then you should choose OpenGL, as this doesn’t exist in OpenGL ES. This then suggests you should be:

  • including the GL includes (e.g. <GL/gl.h> and <GL/glext.h>, not <GLES/*>),
  • linking with the GL library (e.g. -lGL, not -lGLESv2),
  • creating an OpenGL context (not an OpenGL ES context), and
  • calling OpenGL APIs (not OpenGL ES APIs).
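
As a sketch of what that choice looks like for a Qt/EGL application like yours (illustrative only; whether the Qt build on the Jetson honours a desktop-GL request depends on how Qt was configured there):

#include <QGuiApplication>
#include <QSurfaceFormat>

int main(int argc, char *argv[])
{
    // Ask Qt for a desktop OpenGL context instead of OpenGL ES before any
    // window or context exists; the application must then link against
    // libGL and include <GL/gl.h> / <GL/glext.h> in the readback code.
    QSurfaceFormat fmt;
    fmt.setRenderableType(QSurfaceFormat::OpenGL);
    QSurfaceFormat::setDefaultFormat(fmt);

    QGuiApplication app(argc, argv);
    // ... create the QQuickWindow and run the scene as before ...
    return app.exec();
}

At the raw EGL level the same choice is made with eglBindAPI(EGL_OPENGL_API) before eglCreateContext(), instead of the default EGL_OPENGL_ES_API.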