2D - fast direct pixel manipulation


I have an application that receives a list of unique 32x32 tiles in BGRA format from a special scientific grabber device. Along with these tiles comes damage-region information that tells me which (rectangular) part of which frame contains valid data. This graphical information is displayed on a 1080p screen, where 60 FPS must be achieved.

Currently I’m using mmap to map the Linux framebuffer device into the program’s virtual address space.
This allows me to modify the pixels directly by writing to the correct offsets from that address (a memcpy for each line of a damaged tile’s data).
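
For readers who want to see the row-wise copy spelled out, here is a minimal sketch in C. It is not the original poster's code: the `DamageRect` struct, the `blit_tile` name, and the assumption that the damage rectangle is given in tile-local pixel coordinates are all hypothetical. `fb` stands in for the address returned by mmap, and `stride` for the framebuffer's bytes per scanline.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TILE_SIZE = 32, BYTES_PER_PIXEL = 4 };  /* 32x32 BGRA tiles */

/* Hypothetical damage rectangle, in tile-local pixel coordinates. */
typedef struct { int x, y, w, h; } DamageRect;

/* Copy only the damaged rows of one tile into a linear framebuffer.
 * fb:      base address of the destination (e.g. the mmap'ed framebuffer)
 * stride:  bytes per framebuffer scanline
 * dst_x/y: screen position of the tile's top-left corner, in pixels
 * tile:    tightly packed 32x32 BGRA source pixels */
static void blit_tile(uint8_t *fb, size_t stride, int dst_x, int dst_y,
                      const uint8_t *tile, DamageRect d)
{
    /* The damage rectangle must lie inside the tile. */
    assert(d.x >= 0 && d.y >= 0 && d.w >= 0 && d.h >= 0);
    assert(d.x + d.w <= TILE_SIZE && d.y + d.h <= TILE_SIZE);

    for (int row = 0; row < d.h; ++row) {
        const uint8_t *src = tile
            + ((size_t)(d.y + row) * TILE_SIZE + (size_t)d.x) * BYTES_PER_PIXEL;
        uint8_t *dst = fb
            + (size_t)(dst_y + d.y + row) * stride
            + (size_t)(dst_x + d.x) * BYTES_PER_PIXEL;
        /* One memcpy per damaged scanline. */
        memcpy(dst, src, (size_t)d.w * BYTES_PER_PIXEL);
    }
}
```

The same routine works unchanged on an ordinary heap buffer, so it can also serve as the software compositing step if the finished image is later handed to OpenGL in one piece.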

So this gives me extremely fast “memory to screen blitting”.

Now I’m trying to achieve the same thing using OpenGL for portability (Windows/X11/MacOSX) with a similar performance.
I’ve set up an orthographic projection, disabled the depth test, and tried two approaches: the simplest one, using glRasterPos/glDrawPixels, and a more sophisticated one that creates 32x32 textures, uploads the data using glTexImage2D, and “blits” them by drawing quads according to each tile’s damaged region using glTexCoord/glVertex.
The lifetime of each texture is extremely short because it is only used to transfer data from user to video memory and because there is no reuse for the bitmap data in the texture.

However, the performance compared to the mmap’ed /dev/fbx solution is extremely poor (~30 times slower), and 99.9% of the time is consumed by the glTexImage2D calls.

So what is the fastest way to implement “user memory to screen blitting” with exact 2D pixel placement in OpenGL?

Thanks for any hints!


glTexImage2D reinitializes the texture object; if you only want to update the contents, use glTexSubImage2D (even if all the data changed).

I assume you have already disabled mipmaps (e.g. by using GL_LINEAR or GL_NEAREST as the minification filter)?

The fastest way to do what you’re trying to do on the majority of GPUs is to just do it all in software, and then blit the single final composed image to the screen via GL. Even if you use the correct texture update functions as Carsten suggests, your use of the GPU is the opposite of how GPUs work; rendering lots of small, individual, frequently changing textures one at a time is basically the worst-case performance scenario for the parallel execution engine of the GPU. Mixing software and hardware rendering is not fast, and trying to do both simultaneously will give you the worst performance. If you want to render images in software, just composite the whole scene in software (ideally using optimized SIMD code) and only use OpenGL to blit your final composited image.

A more forward-looking approach would be to do all your rendering on the GPU. Whatever image data you’re generating on the CPU can most likely be generated on the GPU with the correct use of shaders, optimized draw calls, and maybe OpenCL. We’d need more information on your actual use case to advise in more detail.

Thanks Carsten for the hint regarding glTexImage2D. Using glTexSubImage2D resulted in a (tiny) performance gain. I’ve already used GL_NEAREST.

Elanthis: Thanks. However, there is no more information to give. The external device hands me finished 32x32 BGRA tiles for free, without any CPU usage, along with the information about which region of each tile has to be blitted where on the screen.

So all I want to do is to blit these tiles as fast as possible to the screen.

I am reading about the ARB_pixel_buffer_object extension. The glMapBufferARB function seems to give me a mapped address where I can directly modify the pixels.
Is it really a mapped address or just a simple copy of the texture pixels into user memory?
I fear that glUnmapBufferARB() will probably transfer the complete user-space copy of the texture data back to GPU-controlled memory, even if just 1 pixel was changed between glMapBufferARB and glUnmapBufferARB …

glMapBufferRange() may be what you are looking for then.

So is your whole texture just 32x32 pixels? Do you have many of them and how often do you update? I’m not even sure it’s worth bothering to only update a part, given how small the whole thing is - of course measuring is the only way to be certain :wink:

You could try using a bunch of textures in a ring buffer setup: you use the one at the head to render and update the one at the buffer’s tail. That introduces some frames worth of lag, but gives the driver time to do the transfer to the GPU before you access the data for rendering.
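
The index bookkeeping for such a ring can be sketched in a few lines of C. This is only an illustration under assumptions: the `TextureRing` struct, the function names, and the ring depth of 3 are all hypothetical, and the slots would hold texture names obtained from glGenTextures. No OpenGL calls are made here; the sketch only shows which slot to upload into and which to draw from on each frame.

```c
#include <assert.h>

enum { RING_SIZE = 3 };  /* assumed depth; tune by measuring */

typedef struct {
    unsigned slots[RING_SIZE]; /* e.g. texture names from glGenTextures */
    unsigned frame;            /* number of frames completed so far */
} TextureRing;

/* Tail: the slot that receives this frame's new pixel data. */
static unsigned ring_upload_slot(const TextureRing *r)
{
    return r->frame % RING_SIZE;
}

/* Head: the slot to render from. It was last uploaded RING_SIZE - 1
 * frames ago, so the driver has had time to finish the transfer.
 * During the first RING_SIZE - 1 frames it has not been filled yet. */
static unsigned ring_draw_slot(const TextureRing *r)
{
    return (r->frame + 1) % RING_SIZE;
}

/* Call once per frame, after uploading and drawing. */
static void ring_advance(TextureRing *r)
{
    assert(r != 0);
    r->frame++;
}
```

With a depth of 3, what appears on screen lags the newest data by two frames; a smaller ring reduces the lag but gives the driver less time to complete each transfer.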

Is it really a mapped address or just a simple copy of the texture pixels into user memory?

It could be anything. It could be an allocated scratch piece of memory. Or it could be a GPU address. It could be different depending on how you use the function.

PBOs aren’t terribly helpful for cases where you’re uploading texture data. They are primarily there for asynchronous transfer; if the only thing you do after uploading is render with that texture, then you gain pretty much nothing from PBOs.

And please stop using the ARB-suffixed versions. This is core stuff, and has been core for over half a decade.

Or it could be a GPU address … depending on how you use the function.

Nice! How would I use the function to get a mapped GPU address?

Nice! How would I use the function to get a mapped GPU address?

Actually, that was something of a typo/brain fart.

It depends on a combination of how the driver feels like implementing the function, where the driver puts the buffer, and what you’re doing when you map it. Ultimately, there is nothing you can do to guarantee a GPU address, for the simple fact that a buffer object may not have a GPU address.