Direct Memory Access in OpenGL

Hello good people,

I’m trying to reduce latency in modern OpenGL using DMA.

My current pipeline is camera → frame grabber → CPU (+ copying) → GPU. And I was wondering if it’s possible to somehow speed up that data flow in OpenGL using Pixel Buffer Objects or something similar?

OS: Ubuntu 20.04
CPU: AMD Ryzen Threadripper 3960X
GPU: Nvidia GeForce RTX 3080
FrameGrabber: Kaya Komodo
Motherboard: ASUS TRX40

If there is such a solution, please provide me with some guidance on implementing it.

Thank You for your time and have a great day!

Once the frame is on the GPU, what do you do with it? Upload it to a texture? Display it? Crunch on it with compute kernels? That is, what are your needs for it on the GPU?

This may drive which forms/methods you want to use.

If you’re copying from the capture device to the CPU to the GPU, you may be able to reduce the number of copies by allocating a buffer, mapping it, and copying from the capture device to the mapped region.

Binding a buffer to GL_PIXEL_UNPACK_BUFFER causes texture upload functions (glTexImage*, glTexSubImage* etc) to read from the buffer rather than CPU memory.

Thank You for your replies!

I’m sorry for the inconvenience. Let me be more specific.

This is exactly what I am currently doing:

  • allocating uint8_t (unsigned char) buffer and generating texture using glTexImage2D()
  • filling that buffer in a callback function whenever the camera captures a frame
  • then in the render loop: updating the texture with glTexSubImage2D() each time callback gets a new frame

Of course, binding textures and drawing meshes…

Let me just say that this works perfectly, but with a latency of around 70ms… I need it to be in the 30-35ms range.

Sadly, I’m still a student learning these concepts, so I’m not exactly sure what the correct pipeline is. I’ve been told that in order to speed this up I need to bypass the CPU, hence implement DMA.

I have tried to implement such a solution using CUDA, but unsuccessfully.

Please let me know if You need me to provide You with a code example of my current implementation.

Thank You!

Instead of allocating a CPU-side buffer, generate an OpenGL buffer object with glGenBuffers, glBindBuffer and glBufferData (with a null pointer for the data parameter so that it only allocates the storage but doesn’t try to fill it). When capturing a frame, map the buffer with glMapBufferRange and use the pointer to the mapped region as the destination. To upload the captured data to the texture, bind the buffer to the GL_PIXEL_UNPACK_BUFFER target before calling glTexSubImage2D with the offset into the buffer (cast to void*), as the data pointer.

There may be some advantage to having multiple buffer objects which are used as a circular list. If a glTexSubImage2D command has been issued but not completed, attempting to map the buffer will stall until the command completes. Alternatively, you can map the buffer with the GL_MAP_UNSYNCHRONIZED_BIT flag; this will prevent waiting, but if you overwrite the buffer while the contents are being uploaded, you’re likely to get a texture containing a mix of the old and new contents.

You can use fences (glFenceSync) to keep track of which commands have been executed, and use that to decide whether to capture a new frame and which buffer to overwrite.

Ok, that’s a good starting point.

Key here is determining where that latency is coming from and how it’s distributed across stages.

Nsight Systems can help you out a bunch here. With it, you can see the timing/latency of everything from submission to OpenGL through execution on the GPU. If you add some NVTX markup to your app, you can track this back even further through frame acquisition from the frame capture API. You can even see how this timing works out relative to the VSync clocks you’re targeting for display (if running FullScreen exclusive / Flip mode).

First off, I’d tweak your 3D Settings in NVIDIA Control Panel for low latency. Low Latency mode ON, Triple-buffer OFF, Multithreaded driver OFF, etc. Also would suggest running Fullscreen Exclusive (aka Flip Mode) to get DWM as out-of-the-loop as possible (it just adds latency). Then re-measure end-to-end latency and see where you are.

On that frame grabber… How much data are we talking about here? Does it support different formats (e.g. uncompressed, compressed, or encoded with MPEG/AV1/VP9/etc.)? What kind of latency does it add? Reason I ask about formats: the NVIDIA Video Codec SDK has fast paths for uploading pre-encoded video and displaying it on the GPU. It will transparently make use of the NVDEC hardware decode units on the GPU for low-overhead decode+playback. It’s worth considering whether making use of that might help you out here. If nothing else, you could look at how it uploads frame data to the GPU for tips on how to do this with best perf on NVIDIA GPUs/drivers. I don’t have experience using this SDK’s decode path, but I have used its encode path and it works very well.

Another thought that occurs to me: uploading texels to a GL texture has a cost besides upload time. This texel data needs to be “re-organized” by the GPU/driver from the linear list of pixels to the vendor-specific tiled format used inside a texture. This takes time (adds latency) and happens behind-the-scenes. There are also fast paths and slow paths for texel upload here. You can query these from the GL driver using glGetInternalformativ() with queries such as GL_INTERNALFORMAT_PREFERRED, GL_TEXTURE_IMAGE_FORMAT, and GL_TEXTURE_IMAGE_TYPE.

So “if” you continue uploading your frames to the GPU in GL textures (not a given), “then”… you for instance want to make sure that you’re uploading to a texture format that’s natively supported, and want to upload using the format/type that the driver recommends for maximum performance. Other combos are going to potentially run much slower due to run-time format conversions, adding latency.

Also, if you continue uploading to GL textures, you may find that you get a speed-up by not using the same one over-and-over but using a ring-buffer of 3-4X the number you upload to per frame. This is to make it less likely that the driver imposes some implicit synchronization under-the-covers when you try to upload to a texture that it hasn’t finished rendering with.

Also, I’d start by reducing the amount of the data you upload to the GPU to a tiny subset (e.g. one 10x10 texel region instead of the whole frame). Measure timing/latency of that and optimize the heck out of it. Then increase the amount of data. That way, you’ll know if/how much of the added latency you’re seeing is just your basic upload+render pipeline, and how much is just slowdown introduced by increasing the amount of data uploaded per frame.

Also, you might check these out. Search for TexSubImage. Though a bit dated, many of the principles remain the same:

@Dark_Photon @GClements
Thank You very much for the responses!

You have provided me with a lot of useful information. Give me some time to study it and implement your tips. I’ll get back to You as soon as I have something further to discuss!

For the frame grabber: it supports uncompressed raw data. Here is all I know about my frame grabber:

Thanks again! Have a great day!

Looks like it may be this:

According to this datasheet, this supports much more than basic RGB output. For instance:

Camera pixel formats supported: Raw, Monochrome, Bayer, RGB, YUV, YCbCr and RGBA (PFNC names):

• Raw
• Mono8, Mono10, Mono12, Mono14, Mono16
• BayerXX8, BayerXX10, BayerXX12, BayerXX14, BayerXX16 where XX = GR, RG, GB, or BG
• RGB8, RGB10, RGB12, RGB14, RGB16
• YUV411_8, YUV411_10, YUV411_12, YUV411_14, YUV411_16
• YUV422_8, YUV422_10, YUV422_12, YUV422_14, YUV422_16
• YUV444_8, YUV444_10, YUV444_12, YUV444_14, YUV444_16
• YCbCr601_411_8, YCbCr601_411_10, YCbCr601_411_12, YCbCr601_411_14, YCbCr601_411_16
• YCbCr601_422_8, YCbCr601_422_10, YCbCr601_422_12, YCbCr601_422_14, YCbCr601_422_16
• YCbCr601_444_8, YCbCr601_444_10, YCbCr601_444_12, YCbCr601_444_14, YCbCr601_444_16

Also note: RGB8 is typically not a natively supported GPU texture format. So if you use this, your texel uploads will likely be suffering an implicit texel format conversion, to the detriment of performance, and latency in particular. Check the glGetInternalformativ query for details. For starters, it might be better to try RGBA8 or Mono8 first, even though alpha will probably always be 0xFF.


You are absolutely correct! That is the frame grabber I am working with.

Unfortunately, I’m using Kaya’s Vision Point API for C/C++ in order to control the cameras and frame grabber from C code. And, for some unknown reason, I cannot set the frame grabber PixelFormat to a preferred format because the API returns “FGSTATUS_GENICAM_EXCEPTION” when I try to set the frame grabber format to Mono, RGBA, etc.
When the camera format is set to BayerXX8 → the frame grabber format can only be set to RGB8, while when the camera format is set to BayerXX12 → the frame grabber format can be set to RGB12, BayerXX12, and RGB8.
So I decided to always use RGB8 since it can be stored in a uint8_t buffer and loaded to a texture with type GL_UNSIGNED_BYTE.

I’m very restricted in that sense :frowning:

Also, I’m using this code to query the RGB & RGBA internal formats:

GLint format, type, preferred;
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA, GL_GET_TEXTURE_IMAGE_FORMAT, 1, &format);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA, GL_INTERNALFORMAT_PREFERRED, 1, &preferred);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA, GL_GET_TEXTURE_IMAGE_TYPE, 1, &type);
printf("Format info RGBA: format: %d, type: %d, preferred %d\n", format, type, preferred);

This is the output for each format:

Format info RGBA: format: 6408, type: 33639, preferred 6408
Format info RGB: format: 0, type: 5121, preferred 6407

I’m still trying to figure out what those values mean, but it seems alpha channel is preferred!

Have a great day!

If you’re uploading via glTexSubImage, and you must have 24-bit RGB source data, I’ve found in the past that it’s often preferable to expand this to 32-bit yourself before the upload, rather than rely on the driver to do it for you.
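As a minimal sketch of that expansion (the helper name `rgb24_to_rgba32` is made up here; it assumes tightly packed input with no row padding):

```c
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 24-bit RGB into 32-bit RGBA, forcing alpha to 0xFF.
 * dst must have room for pixel_count * 4 bytes. */
static void rgb24_to_rgba32(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        dst[4 * i + 0] = src[3 * i + 0]; /* R */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 2]; /* B */
        dst[4 * i + 3] = 0xFF;           /* opaque alpha */
    }
}
```

The expanded buffer can then be uploaded with format GL_RGBA instead of GL_RGB, avoiding the driver-side conversion on upload.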

These are decimal values for GLenums describing the preferred format and types to use. Convert them to hex, then search for them in your header file. So we have:

6408 = 0x1908 = GL_RGBA
33639 = 0x8367 = GL_UNSIGNED_INT_8_8_8_8_REV
5121 = 0x1401 = GL_UNSIGNED_BYTE
6407 = 0x1907 = GL_RGB

I have tried to implement this, but I’m having some problems…

Could You please provide me with some code?

Currently, I’m doing the following:

  • buffer allocation:
    uint8_t *pFrameMemory1 = malloc(texWidth * texHeight * 3 * sizeof(uint8_t));
  • preparing texture:
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, texWidth, texHeight, 0, GL_RGB, GL_UNSIGNED_BYTE, NULL);

  • In a callback function, I call KYFG_BufferGetInfo() and provide the buffer to be filled with data for a particular frame
    KYFG_BufferGetInfo(streamBufferHandle, KY_STREAM_BUFFER_INFO_BASE, &pFrameMemory1, NULL, NULL);
  • And finally in render loop:
glBindTexture(GL_TEXTURE_2D, textureID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight, GL_RGB, GL_UNSIGNED_BYTE, pFrameMemory1);

Replace this with

GLuint buf;
GLsizeiptr buflen = texWidth * texHeight * 3;

glGenBuffers(1, &buf);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buf);
glBufferData(GL_PIXEL_UNPACK_BUFFER, buflen, NULL, GL_DYNAMIC_DRAW);

For the glTexSubImage2D() call, ensure that the buffer is still bound to GL_PIXEL_UNPACK_BUFFER, and pass (void*)0 as the last argument instead of pFrameMemory1.

I’m assuming that the alignment of the mapped region is sufficient for what the Vision Point API requires. Given that you’re getting away with a plain malloc, that will probably be the case (memory mappings are usually page-aligned, while malloc typically returns 16-byte-aligned memory on modern architectures).

@FilipVuk123 :


The GL_GET_TEXTURE_IMAGE_FORMAT and GL_GET_TEXTURE_IMAGE_TYPE queries are for calls to glGetTexImage(). You’re not calling that AFAICT.

The enums you want to use instead are:

GL_TEXTURE_IMAGE_FORMAT and GL_TEXTURE_IMAGE_TYPE

which apply to calls to glTexSubImage2D().

See Image_Format#Image_format_queries for details.

When using the following enums:

GL_TEXTURE_IMAGE_FORMAT and GL_TEXTURE_IMAGE_TYPE

Output is the same:

Format info RGBA: format: 6408, type: 33639, preferred 6408
Format info RGB: format: 0, type: 5121, preferred 6407

Ok. So for GL_RGB, you’re getting GL_NONE (0) back. Which means GL_RGB isn’t supported for glTexSubImage2D().

I could not make this work with glMapBufferRange(), but I managed to do the following:

GLuint pbo1;
glGenBuffers(1, &pbo1);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
glBufferData(GL_PIXEL_UNPACK_BUFFER, texHeight*texWidth*3, NULL, GL_DYNAMIC_DRAW);

And then doing this in the render loop:

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
void *mappedBuffer = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(mappedBuffer, pFrameMemory, texWidth*texHeight*3);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight, GL_RGB, GL_UNSIGNED_BYTE, (void *)0);

The callback function stayed the same → filling pFrameMemory (because filling mappedBuffer directly did not work). But this implementation adds latency :frowning:

I’m still having problems implementing Your tip into my program.
I’m confused about what to exactly do in the callback function after I initialize GL_PIXEL_UNPACK_BUFFER and call glTexSubImage2D() as You advised.

I can only use KYFG_BufferGetInfo() function from VisionPointAPI in a callback (passing a buffer to be filled with frame data) What buffer should I pass to that function?

When and how do I fill that glMapBufferRange pointer? Also, when do I unmap GL_PIXEL_UNPACK_BUFFER and, if necessary, clear it?

Thank You for Your help!

Oh OK :frowning: I believe what You are trying to say is that I need to somehow get RGBA format from the grabber so that glTexSubImage2D() gets the format it supports?

And that should also reduce latency?

And You were right! I have 3 cameras and on one there is overlapping (textures containing a mix of the old and new frames). Can you show me how I’m supposed to use glFenceSync to prevent overlapping?

Thank You for Your help!


GLsync sync = 0; /* persists across frames */

/* after each glTexSubImage2D call: */
if (sync) glDeleteSync(sync);
sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

When uploading the data to the mapped region:

GLenum result = glClientWaitSync(sync, 0, 0);
if (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED) {
    /* upload data */
} else {
    /* previous upload still pending */
}
Essentially, if the previous glTexSubImage2D command hasn’t completed, you can either call glClientWaitSync again with a non-zero timeout to wait for it to complete, or you can do something else and try again later, possibly with a different frame. If the grabber is supplying frames faster than you can render them, you want to use the most recent frame each time rather than grabbing a frame then waiting until that particular frame can be displayed.

Better still is to have multiple buffers and alternate between them. If you put a (different) fence after each glTexSubImage2D command, each time you get a frame from the grabber you upload it to the buffer used by the last glTexSubImage2D command which executed. The next glTexSubImage2D command to be executed will be using the other buffer so there’s no risk of getting a “mixed” frame from GL reading the buffer while the grabber is writing to it.

But that may be moot because:

If you’re doing an explicit memcpy from a CPU buffer to the GPU buffer, there’s no point in using GPU buffers. If you can’t get the grabber to write the frame directly to the mapped region, it’s possible that the grabber doesn’t support DMA between devices, or there may be something extra that needs to be done in order to enable this.

Also, this is a completely separate issue to using the “preferred” format. Using the preferred format would be faster, but if the grabber can’t supply that then you’ll just have to use whatever format it can supply and accept the conversion cost.