Updating textures per frame

debinair · August 23, 2016, 10:42pm

I have 3 textures and update one of them every frame, so they form a circular buffer: the 1st frame goes to the 1st texture, the 2nd frame to the 2nd, and so on. The data for these textures comes from the CPU side; I am calling glTexSubImage2D with the data and I am not using PBOs. By the time I bind a texture for update, I am sure that OpenGL has finished reading from it, so would the glTexSubImage2D call still block here? When would it block? If I use PBOs to update these textures, will I get a performance gain? Updating the PBO with glMapBuffer would still be a blocking call, wouldn’t it? Is there any way I could get a performance gain here?

Silence · August 23, 2016, 11:54pm

If possible, try framebuffer objects (FBOs). They should better suit your requirements (render to texture).

mhagain · August 24, 2016, 1:28am

PBOs are unlikely to help if you must draw the textures in the same frame as you update them.

Since the data is coming from the CPU and you’re not actually doing render to texture, framebuffer objects are useless to you: ignore them.

You can get rid of your circular buffer, do this with a single texture and still get good performance. This is a topic that has been discussed up and down these forums, but it comes up often enough that it’s worth going over again.

Typically we see problematic glTexSubImage calls looking like this:

glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGB, GL_UNSIGNED_BYTE, data);

This happens when the user, with good intentions, thinks that because they only need 24-bit data, if they supply their data as 24-bit then they’ll get a double win as (1) they’ll save memory, and (2) they’ll have less data to upload.

But GPUs don’t work like that at all. Unless you’re using some really esoteric hardware, there’s actually no such thing as a 24-bit texture in GPU land. If you ask for a 24-bit texture what you’ll typically get is 32-bit with the extra 8 bits unused.

So if you’re trying to send 24-bit data via glTexSubImage to a 32-bit texture, your driver must first do a format conversion. This typically happens in software, involves allocation of extra temporary buffers, and is slow.

This is better:

glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, data);

Now we’re sending 32-bit data, and at this point some kinds of people will normally yell about “wasting memory”. We’re not “wasting memory”, we’re using it: using it to get better performance.
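To put a number on that trade-off, here is a small sketch of the per-frame upload size. The helper name `frame_bytes` and the 1920x1080 frame size are only illustrative assumptions, and the rows are assumed to be tightly packed:

```c
#include <stddef.h>

/* Bytes in one tightly packed frame (hypothetical helper, no row padding). */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}
```

For a 1920x1080 frame this gives 6,220,800 bytes at 3 bytes per pixel and 8,294,400 bytes at 4 bytes per pixel, so the 32-bit layout costs about a third more client memory per frame.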

Run that and you’ll probably find that you still run slow. This can happen because internally the GPU is more likely to store texture data in BGRA order than in RGBA. So once again we fix it; this is even better:

glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, data);

At this point those same people who yelled last time will yell again: this time about endianness. So just detect the endianness of the system you’re running on and adjust your parameters accordingly: not too difficult.
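A minimal sketch of that detection in plain C, with no GL headers needed. Both function names are hypothetical. Endianness only matters here if you build each pixel as a 32-bit word, so the second helper packs a pixel such that its bytes land in memory as B, G, R, A on either kind of host:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int is_little_endian(void)
{
    const uint32_t probe = 1u;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1); /* look at the lowest-addressed byte */
    return first_byte == 1;
}

/* Pack one pixel into a 32-bit word whose bytes sit in memory in
   B, G, R, A order (hypothetical helper for illustration). */
static uint32_t pack_bgra_bytes(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    if (is_little_endian()) {
        /* lowest-addressed byte is the least significant one */
        return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
               ((uint32_t)g << 8)  |  (uint32_t)b;
    }
    /* lowest-addressed byte is the most significant one */
    return ((uint32_t)b << 24) | ((uint32_t)g << 16) |
           ((uint32_t)r << 8)  |  (uint32_t)a;
}
```

If you write the buffer one byte at a time instead, the byte order is already explicit and no detection is needed.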

Almost there; in most cases you’re going to find that this runs quite well, but there’s one outlier: some GPUs need a further adjustment to get peak performance, so the final change is:

glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);

On certain hardware I’ve measured this combination running up to 40 times as fast as the more typically seen GL_RGB/GL_UNSIGNED_BYTE. The trick is to supply the appropriate parameters so that the driver can detect that it doesn’t need to do any format conversions and can just copy the data across directly.

So maybe your source data is 24-bit RGB ordered and maybe you don’t have any (or much) control over that?

The best thing is to write your own code to convert it to the format that the GPU prefers. It’s 10 lines of C, just make sure that you do the required memory allocation one-time-only at startup (rather than every frame), or convert into a static buffer, and this is one case where your own code can certainly outperform the driver.
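Something along these lines would do it. This is a sketch rather than anything definitive: the function name is hypothetical, the source is assumed to be tightly packed 24-bit RGB, and the destination buffer is assumed to be allocated once by the caller. Each pixel is written as a native 32-bit word with the value 0xAARRGGBB, which, as I understand the packed-pixel rules, is the layout that GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV describes regardless of host endianness:

```c
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 24-bit RGB pixels into 32-bit words of the form
   0xAARRGGBB. dst must have room for pixel_count words; allocate it once
   at startup and reuse it every frame. */
static void rgb_to_bgra8888(const uint8_t *src, uint32_t *dst, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        const uint32_t r = src[0];
        const uint32_t g = src[1];
        const uint32_t b = src[2];
        dst[i] = (0xFFu << 24) | (r << 16) | (g << 8) | b; /* opaque alpha */
        src += 3;
    }
}
```

The converted buffer is then what you pass as the data pointer to glTexSubImage2D with the GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV parameters.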

This even has its own entry on the OpenGL Common Mistakes wiki page: https://www.opengl.org/wiki/Common_Mistakes#Texture_upload_and_pixel_reads - although that doesn’t mention the performance implications (which is appropriate because OpenGL specifies functionality, not performance).

Silence · August 24, 2016, 2:50am

What I meant was filling the texture, attaching it to an FBO and then, depending on the needs, blitting it to the ‘screen’ buffer or doing whatever he needs with the FBO. He can easily keep his 3 textures as a buffer.

With a fullscreen quad and glTexSubImage2D alone, he will always have to ensure that the job is finished on the GL side. But this might depend on what he does with the result afterwards.

Why is the FBO idea so wrong?

debinair · August 24, 2016, 6:53pm

I am drawing with Vulkan and I have another library which renders it on the screen. So I render with Vulkan, read that data back to the CPU and pass it to OpenGL.
Are you saying that passing data to glTexSubImage2D is slower than render-to-texture?

Silence · August 24, 2016, 11:34pm

I am not. glTexSubImage2D is known to be very fast. FBOs are also fast. With FBOs you end up with framebuffer objects that you can do many things with (since you seem to process images). With rendering a quad, you end up with an image displayed on the screen.

That was just what I was thinking about your question. mhagain explained things to you in detail and is sure that FBOs will be useless for you. He has very good knowledge of OpenGL, which I don’t have. This is why I asked why my opinion was wrong: to understand and hopefully gain a better knowledge of OpenGL.
So hopefully someone will explain this to me.

Now I have a question for you: why are you rendering with Vulkan, reading the pixels back, storing them somewhere, and then sending them to OpenGL? This looks pretty strange to me. Why not keep a single rendering API?

debinair · August 25, 2016, 12:41am

I am hoping that I will get better performance by using Vulkan.

mhagain · August 25, 2016, 2:21am

I’m saying that FBOs are useless in this case because they do nothing to solve the problem of getting data from the CPU to the GPU. Once you have your data on the GPU, whether it’s drawn via a fullscreen quad, or blitted via an FBO, or even glDrawPixels from a PBO, the performance is going to be pretty much a wash - covering the screen with pixels, no overdraw, that’s something that your GPU and monitor can already do at least at 60fps while you’re just using your computer for day to day work. It’s not a bottleneck.

So typically with this kind of question the actual bottleneck turns out to be the “getting data from the CPU to the GPU” part, and the common cause is glTexSubImage parameters. The same applies to parameters to glDrawPixels and glReadPixels, which have similar performance characteristics. Some examples:

As you’ll see, it’s quite consistent and it’s not theoretical; these are real results: change the parameters and the bottleneck goes away.

However, post #5 changes everything. Round-tripping data through the CPU like this is, of course, going to be an even bigger problem. You should still examine and fix up your use of glTexSubImage, of course, but you’ve got a much bigger bottleneck which is doing a readback to CPU on your Vulkan side of things.

Silence · August 25, 2016, 2:54am

Thank you mhagain. This was interesting.

To go back to the topic, I must admit that now I’m confused about what the OP really wants to achieve…

mhagain · August 25, 2016, 5:48am

Me too.

As I see it, the workflow goes something like:

1. Draw the scene using Vulkan. This is done as a performance optimization.
2. Read back the drawn scene to the CPU.
3. Upload the bytes from (2) to an OpenGL texture.
4. Draw that texture to screen, then SwapBuffers.

I’m not even sure that (4) is a requirement, but in any case, (2) and (3) will more than completely eliminate any performance gain from (1), so the OP would be better off just using a single API, be that OpenGL or Vulkan, for everything. It would also be much cleaner (and more robust) code, as you wouldn’t need to manage two contexts from different APIs bouncing off the same window.

debinair · August 25, 2016, 9:53am

thanks mhagain and Silence for your inputs…