Effective / Resourceful Video Streaming

Hi, I’m fairly new to OpenGL and therefore tried to avoid posting here - but I’ve been scratching my head for the past couple of weekends.
I’m trying to display video using OpenGL, eventually running on a Banana Pi (like a Raspberry Pi, but cheaper).
Thus far I read a video, decode it, store every frame in a vector (for now) and display it using glTexImage2D/glTexSubImage2D. Initially, I followed the guide on learnopengl(dot)com to get a grasp of what to do. Then I read the OpenGL wiki about useful techniques like the use of PBOs and buffer orphaning.

My previous (simpler) attempt looked something like this:

while (!glfwWindowShouldClose(window)) {

    process_input(window);

    // clear the colorbuffer
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);

    shader.use();

    uint8_t *frame = frames.at(current_frame);

    current_frame = (current_frame + 1) % frames.size();

    if (initial_frame) {
        // load texture into OpenGL
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, frame);
        // bind texture
        glBindTexture(GL_TEXTURE_2D, texture);
    } else {
        // update texture
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, frame);
    }

    // create mipmap - the texture scaled to different sizes
    glGenerateMipmap(GL_TEXTURE_2D);

    // bind vertices
    glBindVertexArray(VAO);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
    glBindVertexArray(0);

    glfwSwapBuffers(window);
    glfwPollEvents();
}

It ran somewhat well on my main machine but got maybe 1/3 to 1/2 the FPS of the video on the Pi.
After adding a PBO to store multiple frames on the GPU and orphaning the buffer to update it continuously, the video didn’t seem to run any faster on my main machine, and I’m not sure where to go from here.
The following is a minimal test I’m playing around with right now. I copy several frames into the PBO and give glTexSubImage2D the offset for each frame at each iteration:

int current_frame = 0;

unsigned int PBO;
glGenBuffers(1, &PBO);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, PBO);

// round RGBA_FRAME_SIZE up to the next multiple of 64 so each frame starts at an aligned offset
int BUFFER_SIZE = get_next_aligned_number(64);

int FRAMES_IN_BUFFER = 64;

glBufferData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * FRAMES_IN_BUFFER, NULL, GL_STREAM_DRAW);

// bind texture
glBindTexture(GL_TEXTURE_2D, texture);

// allocate memory for texture
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);

bool initial = true;

shader.use();

// render loop
while (!glfwWindowShouldClose(window)) {
    
    process_input(window);
    
    if (initial) {
        // orphan memory (not useful here because memory is only copied once)
        glBufferData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * FRAMES_IN_BUFFER, NULL, GL_STREAM_DRAW);
        
        for (int i=0; i!=FRAMES_IN_BUFFER; i++) {
            // copy frames into PBO with offset
            glBufferSubData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * i, BUFFER_SIZE, frames[(current_frame + i) % frames.size()]);
        }
        initial = false;
        
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    }
    
    // tell opengl to use frame at offset [0,FRAMES_IN_BUFFER) * BUFFER_SIZE
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(intptr_t)(BUFFER_SIZE * (current_frame % FRAMES_IN_BUFFER)));
    glBindVertexArray(VAO);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
    glBindVertexArray(0);
    
    current_frame = (current_frame + 1) % frames.size();
    
    glfwSwapBuffers(window);
    glfwPollEvents();
}

Shouldn’t this run much faster than the code before because I’ve already copied the data?
I’ve also read about the use of glMapBuffer but wanted to try glBufferData / glBufferSubData for now, because the Pi is running OpenGL ES and glMapBuffer isn’t part of core OpenGL ES.

If anyone spots any obvious mistakes or could point me in the right direction I’d be very grateful.
Thanks for reading, have a nice day. :slight_smile:

Made some progress today.
After asking the great oracle named ChatGPT, I disabled VSync on my main machine, which drastically improved performance.
Building the project on the Pi still results in suboptimal performance. Disabling VSync makes no difference (I assume because the video plays below 30 FPS anyway), and reducing the texture resolution to half or a quarter doesn’t seem to impact the FPS. The code uses minimal vertex and fragment shaders.
I’ve just tried to switch them to #version 100 instead of #version 330 core to no avail.

  • This made me wonder how #version 330 core could even compile, but apparently some drivers support newer GLSL versions than the context would suggest.

To summarize, you’re trying to display video using OpenGL on a Banana Pi (which has an ARM Mali 400 MP2 GPU) via OpenGL ES. The resulting frame rate you’re seeing is 1/3 to 1/2 of the video’s frame rate. And you don’t know why.

Let me suggest that finding out the answer to that “why” question is your top priority. That is, what is your primary bottleneck in not being able to render faster on the Pi? There is one. You just don’t know what it is yet.

I’m going to give you some ideas. But ultimately, you’re going to have to root out what that primary bottleneck is. And once you know what it is, then you can determine what options are there for eliminating or reducing its impact on your rendering performance.

Anyway, some ideas…:

Is there a decent ARM Mali GPU profiler you can run on the Pi? That’ll likely point out where your bottleneck is. In the absence of that, you’ll have to try some targeted tests to infer what the bottleneck is.

What’s the target video resolution and frame rate? You didn’t mention what that was.

You did say that cutting the texture resolution back to 1/2 or 1/4 the size didn’t make any difference. So you know you’re not source resolution bound.

What do you know about how tile-based mobile GPUs (like the ARM Mali 400) render? If nothing, you should definitely read-up. Here’s one page from ARM (LINK) but there are others out there that go into more detail. In short, one of the implications of this and the slow memory mobile GPUs make use of is that buffer object and/or texture updates from the CPU draw thread may block your CPU draw thread or “ghost” (generate copies) behind the scenes – neither of which is good for performance. You need to know about this so you can avoid it. One easy way to avoid this is to never change a buffer object or a texture that you’ve told the GPU to render with until at least 3 frames after you’ve told the GPU to read from it. Then changing the resources shouldn’t block your draw thread’s execution or create transient “ghosts” for those resources behind-the-scenes, which can improve your performance.

How this applies to your code is the texture and buffer object (PBO) you’re changing in the above code. Try uploading to the GL texture once up-front and then just render frames with that same texture over-and-over. How’s your frame rate now? For an even simpler case, just render with a full screen quad and don’t even apply a texture to it. How’s the frame rate? Try just clearing the screen each frame and not rendering anything else. How’s the frame rate? Keep making this simpler and simpler until the bottleneck you’re observing is gone. And then start adding things back in to determine “what breaks it”.

If the key bottleneck does end up being buffer object and/or texture resource updating, then there are ways to make it cheaper. For instance, multibuffering resources (e.g. having a ring-buffer of 3 resources, and then rotate around that resource list each frame).
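For illustration, here’s a rough sketch of that ring-buffer idea using 3 textures, reusing your window / width / height / frames variables (NUM_BUFFERED is just a placeholder name; shader and VAO setup as in your code). It’s the shape of the technique, not a drop-in implementation:

const int NUM_BUFFERED = 3;
GLuint ring[NUM_BUFFERED];
glGenTextures(NUM_BUFFERED, ring);
for (int i = 0; i < NUM_BUFFERED; ++i) {
    glBindTexture(GL_TEXTURE_2D, ring[i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    // allocate storage once, with the upload format matching the internal format
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}

size_t frame_index = 0;
while (!glfwWindowShouldClose(window)) {
    // this texture was last sampled NUM_BUFFERED frames ago, so updating it now
    // shouldn't block the draw thread or force the driver to "ghost" a copy of it
    glBindTexture(GL_TEXTURE_2D, ring[frame_index % NUM_BUFFERED]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, frames[frame_index % frames.size()]);

    glClear(GL_COLOR_BUFFER_BIT);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0); // samples the bound texture

    glfwSwapBuffers(window);
    glfwPollEvents();
    ++frame_index;
}

The same rotation scheme works for PBOs or sub-ranges of one big PBO: the point is simply to never rewrite a resource (or region) the GPU may still be reading from.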

Another thought about the cost of resource updates…

Uploading new texel data to a GL texture is easy. Just call glTexSubImage(). Simple. It just copies pixels to the GPU texture in a big memcpy, right? Nope. That’s a common misconception. It doesn’t just copy pixels. Under-the-hood, it does a scattered write to “interleave” the texels in the image so that texels close to each other in X or Y are close to each other in memory (Vulkan calls this “tiling”). Hopefully mentioning “scattered writes” here is sufficient, but this memory reorganization process isn’t free. It takes time. On top of that, there may be implicit texel format conversions going on. Both of these can add latency to your frame, if you’re bottlenecked on it. So what can you do? Besides controlling the resolution, you can read-up in the GPU docs on which texel formats+layouts are the most efficient for the GPU to upload from and operate on, and then try to get your video frame generator to match that format. You might even find there’s a faster path to ship video frames to the GPU for display than 32-bit RGBA8 (e.g. YUV). Which brings me to…

Above you seem to be requesting a GL texture be created that’s GL_RGB (3 components; probably GL_RGB8 under-the-hood). However, you’re using GL_RGBA + GL_UNSIGNED_BYTE (4 components) to upload to it. This is likely to generate an extra cost when updating the texture due to implicit format conversions. That is, I believe here you’re asking the driver to dynamically convert each of your uploaded video frames from GL_RGBA8 (4-byte) to GL_RGB8 (3-byte), which it’s probably going to do with slow CPU code. Instead of this, I would upload using the same format that you’ve asked the driver to allocate internally for the texture. This has the best likelihood of avoiding costly implicit format conversion under-the-hood when uploading new video frames.
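For example, a minimal sketch of a matched allocate/update pair, reusing your width / height / frame variables (on OpenGL ES 2.0 the internal format would be the unsized GL_RGBA):

// allocate once, internal format and upload format both RGBA
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);

// per-frame update with exactly the same format + type,
// so the driver can (ideally) copy without converting
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, frame);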

Also above, I see this:

    glGenerateMipmap(GL_TEXTURE_2D);

Unless you’re going to be rendering your video minified by quite a bit, with a min filter of (e.g.) GL_LINEAR_MIPMAP_LINEAR, you should get rid of this. Generating MIPmaps takes time. And if you don’t need it, it’s a waste of both time and memory.
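That is, drop the glGenerateMipmap() call entirely and make sure your sampling state doesn’t require mipmaps. A minimal sketch:

// no glGenerateMipmap() at all; just don't ask for mipmapped filtering
// (the default MIN filter, GL_NEAREST_MIPMAP_LINEAR, would need a full mip chain)
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);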

Above I also see:

        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

From your code above, you’re not mapping buffer objects. So you shouldn’t be unmapping. That you didn’t see this suggests that you’re not checking for GL errors. You should add this, as it will save you from wasting time shooting yourself in the foot when the driver’s trying to tell you what you’re doing wrong.
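A tiny helper is enough for that (just a sketch; check_gl_error is a made-up name, and it needs <cstdio>):

// call after suspect GL calls; prints and returns true if any error was queued
static bool check_gl_error(const char *where) {
    bool had_error = false;
    for (GLenum err = glGetError(); err != GL_NO_ERROR; err = glGetError()) {
        std::fprintf(stderr, "GL error 0x%04X after %s\n", err, where);
        had_error = true;
    }
    return had_error;
}

// e.g. after the line above:
//   glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
//   check_gl_error("glUnmapBuffer");  // would report GL_INVALID_OPERATION here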


This took me some time to work through, and I wanted to come somewhat prepared with my answer.
I also now see that there are a looooot of intricacies I was not aware of.

But first of all a huge thanks for your detailed reply and also for being active in this community for such a long time. I first read your name in the now infamous ‘Rob Barris thread’ on Buffer Object Streaming about 13 years ago. That’s really cool.

I’ll answer a few questions and add missing notes in no particular order.

Screen Resolution / Frame Rate:
The Banana Pi is plugged into a 1024x768 screen. The video I’d want to play would run at 30 FPS, although 24 would also be acceptable.
The Pi wiki states that 1080p with H.264 encoding should be possible at 30 FPS.
Since I don’t have a keyboard for the Pi and effectively only work through SSH, I thought about switching from a desktop distribution to a CLI-only one. But apparently rendering the desktop shouldn’t hinder my performance that much.

Image Format:
It seems I had confused the internal image format with the format of the uploaded data.
I switched it around and tried a few variants. I also looked for the ‘preferred Mali 400 image format’; it always seemed to come back to RGBA. I tried using GL_RGB as the internal format with packed RGB data and glPixelStorei(GL_UNPACK_ALIGNMENT, 1). The result seemed to be slower - apparently the data is converted to RGBA internally anyway.
I also tried GL_BGRA as it was described as faster in the OpenGL wiki but it’s not available in OpenGL ES.

Profiler:
I looked around a bit and I guess ‘Perfetto’ seemed to be a viable option to run in a CLI environment.
I had problems setting the whole thing up, though, so I decided to run my own tests instead.

Window- / Resolution-Tests:
In my second comment I stated that decreasing the texture resolution didn’t improve performance. This still holds at the moment: I scale the video down to one eighth of its size, shrink the buffer to one eighth, and stretch the resulting frame across the window. (Almost) the same result - I get approximately 17 FPS instead of 12.
What I didn’t test was changing the window size. Changing the window to 64x64 gave me ~40 FPS.

Simplifying The Pipeline:
This is where I honestly got a bit confused.
Stripping call by call naturally improved performance until there was virtually nothing left.
My naked render loop looked something like this:

while (true) {
    // read start time
    // ...

    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glfwSwapBuffers(window);
    glfwPollEvents();
    glFinish();

    // read finish time and calculate fps
    // ... 
}

This gave me a repeated 80 FPS, 30 FPS, 80 FPS, 30 FPS, 80 FPS, …
I couldn’t figure out why it fluctuated that much. I didn’t have anything bound, nor did I render anything.
But I figured it may be glfwSwapBuffers blocking and thought this might clear up when I actually render something.
So I did and stopped there.

I only bound my vertices and rendered them using glDrawElements. I didn’t even display a texture so I still had a black screen… but measured around 18 FPS.
The only thing I do is define two triangles (even using the preferred learnopengl method).

Basically this:

GLfloat vertices[] = {
         // vertices     padding  texture
         1,  1,  0,       0,            1, 1,
         1, -1,  0,       0,            1, 0,
        -1, -1,  0,       0,            0, 0,
        -1,  1,  0,       0,            0, 1
    };
    
    GLuint indices[] = {
        0, 1, 3, // first triangle
        1, 2, 3  // second triangle
    };

    // create vertex buffer object, which is sent to GPU as a whole
    unsigned int VBO;
    glGenBuffers(1, &VBO);
    
    // create element buffer object, which is sent to GPU as a whole
    // EBO uses indices to draw triangles in a given order to avoid overlap
    unsigned int EBO;
    glGenBuffers(1, &EBO);
    
    unsigned int VAO;
    glGenVertexArrays(1, &VAO);
    
    // bind vertex array object
    glBindVertexArray(VAO);
    
    // copy vertices array in a buffer for OpenGL to use
    glBindBuffer(GL_ARRAY_BUFFER, VBO);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
    
    // position attributes
    glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 6 * sizeof(GLfloat), (void*) 0);
    glEnableVertexAttribArray(0);
    
    // texture coordinate attribute (2 floats after the 4 position/padding floats)
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 6 * sizeof(GLfloat), (void*)(4 * sizeof(GLfloat)));
    glEnableVertexAttribArray(1);
    
    // copy index array in element buffer for OpenGL to use
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, EBO);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);

…bind every buffer only once and only call glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0) in the render loop.
Could drawing the vertices block up my render call? Should I orphan them?
Do I need to draw the vertices for every frame? Technically I only have to display the frames - the vertices stay the same.


Ok, this is good info! Will comment some about your perf stats below, but first…

Learn OpenGL is a great tutorial. But keep in mind that it’s geared to OpenGL (desktop GPUs), and not OpenGL ES (for mobile GPUs). The GPU and the driver behave very differently with mobile. And the bandwidths of the pipeline are orders of magnitude less with a mobile GPU. There are special concerns here that don’t matter as much (if at all) for a discrete desktop GPU.

Also… When optimizing realtime rendering, use “frame time” not FPS. FPS is nearly useless for perf work and counterintuitive to work with. This issue has been discussed many times before, so I’ll just link to a few:

Perf Testing Results Analysis
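For what it’s worth, logging frame time directly is trivial (a sketch; glfwGetTime() returns seconds as a double, window is your existing GLFW window, and std::printf needs <cstdio>):

double prev_time = glfwGetTime();
while (!glfwWindowShouldClose(window)) {
    // ... render the frame ...
    glfwSwapBuffers(window);
    glfwPollEvents();

    double now = glfwGetTime();                               // seconds
    std::printf("frame time: %.2f ms\n", (now - prev_time) * 1000.0);
    prev_time = now;
}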

So if we take your FPS data, and convert it to Frame Time, we see some interesting trends:

Frame Time    FPS    Description
----------  -------  ----------------------------------------
            
  33.3 ms     30     Desired Frame Rate

  83.3 ms     12     Starting
  58.8 ms     17     Cut 1024x768 tex/video to 1/8 res; upscale to 1024x768
  25.0 ms     40     Cut 1024x768 tex/video to 1/8 res; scale   to   64x 64

  12.5 ms     80     Clear screen only (Frame 1 of 2)
  33.3 ms     30     Clear screen only (Frame 2 of 2)
  -------            --------------------------------
  45.8 ms    ~20     Clear screen only (2 Frames)
  22.9 ms    ~40     Clear screen only (1 Frame Average)

  55.5 ms     18     Render 2 tris, no texture

First:

As you can see, 12 fps and 17 fps aren’t even close to the same. They differ by ~25 milliseconds (ms), which is over 70% of the total frame time you’re targeting here (33.3 ms = 30 fps)!

Also interesting to note is that you’re getting about the same frame rate (40 fps = ~25.0 ms/frame) when you don’t do anything but clear the screen as when you render your quad to a tiny 64x64 window. Given the huge difference between that 25.0 ms frame time and when you render the same video res to a 1024x768 window (25.0 ms → 58.8 ms = +34 ms), that strongly suggests you’re window fill limited for some reason, and that 40fps may be your peak frame rate (for some reason). More on that in a second.

As to why when you do nothing but clear the screen your frame times alternate between ~12.5 ms and ~33.3 ms, I’m not sure. Could be the driver is triple buffering and blocking you until it’s only got 1 rendered frame in-the-queue, but then letting you run until you fill up the buffer queue (rendering 2 frames back-to-back before blocking you). In any case, it appears your current peak output frame rate is 40fps avg (with VSync enabled presumably), and you currently can’t reach that with a simple clear window with a window size of 1024x768 (but 64x64 works).

Near-term Goal

So ignoring the video rendering piece, it seems the key problem to solve is figuring out how you can render to a 1024x768 window at 40 fps (or at least 30fps) with just a clear screen (glClear()). Currently, you’re at 20fps. One possibility is that the driver is being asked (implicitly or explicitly) to do more work here than is necessary.

Watch out for glFinish() on Mobile GPUs

I missed this the first time through. On mobile GPUs, this is HUGE!

Remove that glFinish(). On mobile GPUs, that’ll cut your peak frame rate by 2X or 3X! Why? On mobile GPUs, the memory is incredibly slow. So the only way the GPU can keep a decent frame rate is to completely restructure how a frame is rendered to minimize main memory bandwidth. It does this, by pre-sorting all of the screen “draw work” by tile, and rendering all the work for each tile together at once time. This “pre-sorting” means buffering up the entire frame of draw commands first (or at least all commands for a specific framebuffer), then sorting the work by screen tile, and then rasterizing the fragments for each screen tile together). What this means is that the GPU really wants to perform the fragment shading as much as 1-2 frames “after” you queue the commands. glFinish() thwarts that. It says to the driver “I don’t care what you want to do! I’m gonna wait right here until you’re 100% through with all of the work I’ve given you, including the fragment work!” So in practice, due to the design of the GPU, this results in a lot of idle time, on the CPU and on the GPU. Instead of you queuing up the draw commands for frame N+1 while the GPU finishes drawing frame N, glFinish() forces your app to twiddle its thumbs a while in frame N+1 while the GPU finishes drawing frame N. Consequence: You may not actually start drawing frame N+1 until the time slice for frame N+2 or frame N+3. Result: You get 1/2 or 1/3 of the frame rate you’d otherwise get if you didn’t “stop to wait” for the back-end GPU driver to finish drawing a frame.

Discarding / Invalidating Buffers

Which buffers did you allocate for your default framebuffer (window)? COLOR obviously. How about DEPTH? STENCIL? Are you clearing them all at the beginning of the frame? If you’re allocating DEPTH and/or STENCIL, are you discarding or invalidating them at the end of the frame? Are you using EGL to allocate your window surface? There are some EGL config hints you can pass that will save the GPU time+effort when rendering to surfaces using those configs.

Aside from that, are you allocating and rendering to any intermediate framebuffers (FBOs) before rendering the result to the window? If so, you need to be clearing those buffers at beginning of frame and invalidating/discarding them at end-of-frame.

So what’s up with these clear and discard/invalidate questions? This is a mobile GPU specific issue. The memory backing the render targets is slooooooooow! It’s stock, cheap, DRAM. So you want to prevent the GPU from performing any main memory reads and writes that it doesn’t need to. Because those will be slow and cost you valuable frame time, which can reduce your max frame rate. So try these tips and see if they get you anything. You could also disable depth and stencil writes and tests for good measure.
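As a concrete sketch of the invalidate/discard part: on ES 3.0+ it’s glInvalidateFramebuffer(); on ES 2.0 the equivalent is glDiscardFramebufferEXT() from the EXT_discard_framebuffer extension, if your driver exposes it.

// at end of frame, before swapping: tell the driver it can throw away the
// depth/stencil contents instead of writing them back to slow main memory
const GLenum to_discard[] = { GL_DEPTH, GL_STENCIL };   // names for the default framebuffer
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, to_discard);

// for an FBO you've rendered to, use the attachment enums instead:
//   const GLenum fbo_discard[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
//   glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, fbo_discard);

glfwSwapBuffers(window);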

Also, what kind of COLOR buffer(s) are you allocating? 1X? MSAA? If MSAA, try reducing to 1X and see if you get a perf++ for it. Anyway, this just goes back to “simplify anything you can” to see if you can save the driver needless work. It might also give some insight as to what’s slowing things down.

Misc

Are you even sure that the Banana Pi should be able to rasterize and display a 1024x768 fullscreen window at 30-40fps? How do you know? What kind of window manager is it running? Is it a compositor? Can you disable that or bypass it, so only your full-screen window is being rendered to the GPU and scanned out by the display hardware directly?

Also, which GPU does your Banana Pi have?:

  • ARM Mali 400 MP2
  • ARM Mali 470 MP4
  • ARM Mali 450 MP4
  • ImgTech PowerVR SGX544MP2

Also, note that the Pi wiki claim you quoted speaks to h.264 “encoding” (raw video -to- h.264 video), whereas what you’re doing is playing back (presumably) raw decoded video. These are apples-and-oranges different tasks. So you can’t use one to infer anything about the other.

TL;DR: I’m abandoning my board.

Let me go in reverse and put the rest of the information I have out there.

Misc

The project I had (and still have) in mind is to connect several Pis to old laptop screens and synchronously display video on each of them using some sort of master/slave architecture.
I failed (or maybe the board failed me) at the first step.
Maybe OpenGL wasn’t the optimal solution since one could probably just write a script which opens up the video files on each Pi and keeps them in sync. But I wanted to learn about OpenGL and better my lacking fluency in C++.

Was I sure that the Pi could rasterize and display at 1024x768? No. Maybe it’s my naivety speaking, but I thought that if the Pi could decode 1920x1080 at 30 FPS, it could also display at that resolution.
Honestly I should have probably just opened the video using the installed video player. I did earlier today and the video played back somewhat smoothly. Maybe 24 FPS?
Perhaps there is some more magic that can be applied to boost that framerate. But considering the time spent and the option to upgrade the board, at this point I’ll go for a slightly better board.

The Board

Weirdly enough I haven’t stated which board I’m doing my experiments with.
I bought a Banana Pi M2 Zero with Mali 400 MP2 a few months ago. I didn’t think too much about it and was honestly just happy there was a cheap alternative to Raspberry Pi scalpers.

Buffers

Since my objective was to only stream a video frame by frame onto a texture, I didn’t (knowingly) allocate anything but the Color Buffer.
I’ve queried OpenGL ES on my Pi and there is a 1X color buffer allocated - which is the default, from what I’ve read.

Code

In the interest of sharing everything I have for potential future readers, my main method:

int main(int argc, const char * argv[]) {
    
    glfwInit();
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE);
    glfwWindowHint(GLFW_DOUBLEBUFFER, GL_TRUE);
    
    // get primary display for fullscreen
    GLFWmonitor *primary = glfwGetPrimaryMonitor();
    // create window
    GLFWwindow* window = glfwCreateWindow(VIEWPORT_WIDTH, VIEWPORT_HEIGHT, "LearnOpenGL", NULL, NULL);
    
    if (window == NULL) {
        std::cout << "Failed to create GLFW window" << std::endl;
        glfwTerminate();
        return -1;
    }
    
    // TODO: WRONG - GET REAL MONITOR RESOLUTION
    const GLFWvidmode *video_mode = glfwGetVideoMode(primary);
    MONITOR_WIDTH = video_mode->width;
    MONITOR_HEIGHT = video_mode->height;

    glfwMakeContextCurrent(window);

    // disable VSync (glfwSwapInterval needs a current context)
    glfwSwapInterval(0);
    
    // hide cursor
    glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_HIDDEN);
    
    #ifdef __APPLE__
        if (!gladLoadGLLoader((GLADloadproc) glfwGetProcAddress)) {
            std::cout << "Failed to initialize GLAD" << std::endl;
            return -1;
        }
    #endif
    
    #ifdef __unix
        if (!gladLoadGLES2Loader((GLADloadproc) glfwGetProcAddress)) {
            std::cout << "Failed to initialize GLAD" << std::endl;
            return -1;
        }
    #endif
    
    // load and compile shaders
    Shader shader("path/to/vertex_shader.vs", "path/to/fragment_shader.fs");
    
    // change viewPort (renderable area) with window size
    glfwSetFramebufferSizeCallback(window, framebuffer_size_callback);
    
    // open video reader
    VideoReaderContext video_ctx;
    if(!open_video_reader("path/to/video.mp4", &video_ctx)) {
        std::cout << "Couldn't read frame" << std::endl;
    }
    
    constexpr int ALIGNMENT = 128;
    VIDEO_WIDTH = video_ctx.width;
    VIDEO_HEIGHT = video_ctx.height;
    
    RGB_FRAME_SIZE = VIDEO_WIDTH * VIDEO_HEIGHT * 4;
    
    float VIEWPORT_WIDTH_RATIO = 1 - (float) VIDEO_WIDTH / VIEWPORT_WIDTH;
    float VIEWPORT_HEIGHT_RATIO = 1 - (float) VIDEO_HEIGHT / VIEWPORT_HEIGHT;
    
    GLfloat vertices[] = {
         // vertices                                                   padding  texture
         1 - VIEWPORT_WIDTH_RATIO,  1 - VIEWPORT_HEIGHT_RATIO,  0,     0,       1, 1,
         1 - VIEWPORT_WIDTH_RATIO, -1 + VIEWPORT_HEIGHT_RATIO,  0,     0,       1, 0,
        -1 + VIEWPORT_WIDTH_RATIO, -1 + VIEWPORT_HEIGHT_RATIO,  0,     0,       0, 0,
        -1 + VIEWPORT_WIDTH_RATIO,  1 - VIEWPORT_HEIGHT_RATIO,  0,     0,       0, 1
    };
    
    GLuint indices[] = {
        0, 1, 3, // first triangle
        1, 2, 3  // second triangle
    };
    
    // create and bind texture
    unsigned int texture;
    glGenTextures(1, &texture);
    glBindTexture(GL_TEXTURE_2D, texture);
    
    // how to handle overscaling
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
    
    // texture filtering
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, 0);
    
    uint8_t *frame_buffer;
    
    if (posix_memalign((void**) &frame_buffer, ALIGNMENT, RGB_FRAME_SIZE) != 0) {
        std::cout << "Couldn't allocate frame buffer" << std::endl;
    }
    
    std::vector<void*> frames;
    std::vector<int64_t> pts_list;
    
    int c = 0;
    
    while (!video_ctx.end_of_file) {
        void *temp;
        int64_t pts;
        
        if (!read_frame(&video_ctx, frame_buffer, &pts)) {
            std::cout << "Failed to load frame" << std::endl;
            return 1;
        }
        
        pts_list.push_back(pts);
        
        if (posix_memalign((void**) &temp, ALIGNMENT, RGB_FRAME_SIZE) != 0) {
            std::cout << "Couldn't allocate frame buffer" << std::endl;
        }
        
        std::memcpy(temp, frame_buffer, RGB_FRAME_SIZE);
        frames.push_back(temp);
    }
    
    free(frame_buffer);
    
    // frame row alignment
    glPixelStorei(GL_UNPACK_ALIGNMENT, 4);
    
    // create vertex buffer object, which is sent to GPU as a whole
    unsigned int VBO;
    glGenBuffers(1, &VBO);
    
    // create element buffer object, which is sent to GPU as a whole
    // EBO uses indices to draw triangles in a given order to avoid overlap
    unsigned int EBO;
    glGenBuffers(1, &EBO);
    
    unsigned int VAO;
    glGenVertexArrays(1, &VAO);
    
    // bind vertex array object
    glBindVertexArray(VAO);
    
    // copy vertices array in a buffer for OpenGL to use
    glBindBuffer(GL_ARRAY_BUFFER, VBO);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
    
    // position attributes
    glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 6 * sizeof(GLfloat), (void*) 0);
    glEnableVertexAttribArray(0);
    
    // texture coordinate attribute (2 floats after the 4 position/padding floats)
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 6 * sizeof(GLfloat), (void*)(4 * sizeof(GLfloat)));
    glEnableVertexAttribArray(1);
    
    // copy index array in element buffer for OpenGL to use
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, EBO);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
    
    int current_frame = 0;
    int counting_frame = 0;
    
    unsigned int PBO;
    glGenBuffers(1, &PBO);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, PBO);
    
    // round RGBA_FRAME_SIZE up to the next multiple of 128 so each frame starts at an aligned offset
    int BUFFER_SIZE = get_next_aligned_number(128);
    
    int FRAMES_IN_BUFFER = 16;
    
    glBufferData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * FRAMES_IN_BUFFER, NULL, GL_STREAM_DRAW);
    
    // bind texture
    glBindTexture(GL_TEXTURE_2D, texture);
    
    // load texture into OpenGL
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, VIDEO_WIDTH, VIDEO_HEIGHT, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    if (get_error("glTexImage")) {
        return -1;
    }
    
    bool initial = true;
    bool clear_ghosts = false;
    
    shader.use();
    
    glBindVertexArray(VAO);
    
    double start_time, end_time;
    
    // render loop
    while (!glfwWindowShouldClose(window)) {
        
        static bool initial_frame;
        
        if (initial_frame) {
            glfwSetTime(0.0);
            initial_frame = false;
        }
        
        start_time = glfwGetTime();
        
        if (counting_frame % frames.size() == 0) {
            initial_frame = true;
        }
        
        process_input(window);
        
        // clear the colorbuffer
        glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT);
        if (get_error("glClear")) {
            return -1;
        }
        
        int64_t *pts = &pts_list.at(current_frame);
        
        double pt_in_seconds = *pts * (double) video_ctx.time_base.num / (double) video_ctx.time_base.den;
        
        /*
        if (pt_in_seconds > glfwGetTime()) {
            glfwWaitEventsTimeout(pt_in_seconds - glfwGetTime());
        }
        */
         
        if (initial) {
            
            glBufferData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * FRAMES_IN_BUFFER, NULL, GL_STREAM_DRAW);
            if (get_error("glBufferData")) {
                return -1;
            }
            
            for (int i=0; i!=FRAMES_IN_BUFFER; i++) {
                glBufferSubData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * i, BUFFER_SIZE, frames[(current_frame + i) % frames.size()]);
                if (get_error("glBufferSubData")) {
                    return -1;
                }
            }
        }
        
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, VIDEO_WIDTH, VIDEO_HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(intptr_t)(BUFFER_SIZE * (counting_frame % FRAMES_IN_BUFFER)));
        if (get_error("glTexSubImage")) {
            return -1;
        }
        
        glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
        if (get_error("glDrawElements")) {
            return -1;
        }
        
        glfwSwapBuffers(window);
        glfwPollEvents();
        
        initial = false;
        
        if (current_frame == 4) {
            clear_ghosts = true;
        }
        
        if (clear_ghosts) {
            glBufferSubData(GL_PIXEL_UNPACK_BUFFER, BUFFER_SIZE * ((FRAMES_IN_BUFFER + (counting_frame - 4)) % FRAMES_IN_BUFFER) , BUFFER_SIZE, frames[(counting_frame + FRAMES_IN_BUFFER - 4) % frames.size()]);
        }
        
        glFinish();
        if(get_error("glFinish")) {
            return -1;
        }
        
        end_time = glfwGetTime();
        
        std::cout << (end_time - start_time) * 1000 << " ms" << std::endl;
        std::cout << 1 / (end_time - start_time) << " FPS" << std::endl << std::endl;
        
        current_frame = (current_frame + 1) % frames.size();
        counting_frame++;
    }
    
    glfwTerminate();
    close_reader(&video_ctx);
    
    for (auto frame : frames) {
        free(frame);
    }
    
    return 0;
}

A few comments:

  • The Shader class is pretty much the same as introduced in the beginning of learnopengl
  • The video reader is based on a few videos by the beautiful Bartholomew and uses FFmpeg to decode the video file (Part One)
  • This implementation copies a number of frames into the PBO and copies new data into the buffer 4 frames after a frame has been rendered (approximately as suggested by Dark_Photon)
  • I would still consider myself to be a C++ newbie so forgive my obvious mistakes if found

Final Remarks (for now)

I must again thank @Dark_Photon for their patience and willingness to explain details I wouldn’t have thought to come across.

I’m looking for a new board now and I’ll try to make sure it can handle the video playback.

So long!
