Memcpy from mapped buffer memory (DMA) to user-space memory is very slow

vietpro · December 12, 2022, 3:30am

Hi everyone, after using a PBO to speed up downloading data from the GPU, I copy the pixel data to another memory area, but this copy is very slow:

ByteBuffer pixelsBuffer = (ByteBuffer) GLES30.glMapBufferRange(GLES30.GL_PIXEL_PACK_BUFFER, 0, mViewWidth * mViewHeight * 4, GLES30.GL_MAP_READ_BIT);
/* Send pixelsBuffer to JNI and memcpy() it: copying one frame (1980x1080) takes ~20ms. A memcpy of the same size from user-space to user-space memory takes ~3ms. */

Is there any way to do this faster? Thanks, all.

GClements · December 12, 2022, 10:09am

Try using a fence (glFenceSync, glClientWaitSync) to ensure that the command which fills the buffer (glReadPixels?) has completed before attempting to read from the buffer.

On desktop, typically glMapBuffer (and similar) will block if you try to map a buffer which is a target of pending commands. But it’s possible for an implementation to map the buffer immediately then block if you try to read data before it’s available.
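
A minimal sketch of the fence check, assuming the Android Java bindings. SyncWaiter and FenceGate are hypothetical names introduced only for illustration; the interface has the same shape as GLES30.glClientWaitSync(long, int, long) so the logic can be exercised without a GL context, and the constants are the standard OpenGL ES 3.0 enum values.

```java
/** Hypothetical seam over the GL wait call; not part of any real API. */
interface SyncWaiter {
    // Same parameters and return type as android.opengl.GLES30.glClientWaitSync.
    int clientWaitSync(long sync, int flags, long timeoutNanos);
}

final class FenceGate {
    // Standard OpenGL ES 3.0 enum values.
    static final int GL_ALREADY_SIGNALED = 0x911A;
    static final int GL_TIMEOUT_EXPIRED = 0x911B;
    static final int GL_CONDITION_SATISFIED = 0x911C;
    static final int GL_WAIT_FAILED = 0x911D;
    static final int GL_SYNC_FLUSH_COMMANDS_BIT = 0x1;

    /**
     * Polls the fence with a zero timeout, so the call returns immediately.
     * Returns true only when the commands issued before the fence have completed.
     */
    static boolean isReady(SyncWaiter gl, long fence) {
        int status = gl.clientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0L);
        return status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED;
    }
}
```

On a device, the fence would be created right after the glReadPixels call with GLES30.glFenceSync(GLES30.GL_SYNC_GPU_COMMANDS_COMPLETE, 0), checked with FenceGate.isReady(GLES30::glClientWaitSync, fence) before mapping the buffer, and released with GLES30.glDeleteSync(fence) afterwards.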

Dark_Photon · December 12, 2022, 1:26pm

You mentioned OpenGL ES. Which GPU are you targeting? Assuming a mobile GPU…

Mobile (tile-based) GPUs depend, for good performance, on being able to defer all rasterization work for a framebuffer until after all vertex transform work for that same framebuffer is done (0-2 frames later). If you request the pixels for a frame on the same frame you submitted its draw commands, you’re going to trigger a long stall, whether you use a PBO or not.

The OpenGL ES Programming Guide from your GPU vendor should describe how to optimize these kinds of readbacks. The typical approach is to create a ring buffer of 3-4 separate PBOs, and never read back the results of a frame to the CPU until 3 or 4 frames have elapsed. That is, this frame’s readback should fetch the pixels of the frame rendered 3 or 4 frames ago. So each frame: render the frame, copy its pixels into PBO N, read back the pixels from PBO N-3. The idea is to give the GPU plenty of time to finish rendering and populating the frame at its own pace, without forcing a full pipeline flush and a stall (implicit synchronization).
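
The slot bookkeeping for such a ring can be sketched as below. This is only an illustration under stated assumptions: PboRing is a hypothetical class, the buffer names are assumed to come from GLES30.glGenBuffers, and the GL calls themselves are left out so that only the index arithmetic is shown.

```java
/**
 * Bookkeeping for a ring of PBOs with a fixed readback delay.
 * Hypothetical helper: it only tracks which PBO to pack into and
 * which one is old enough to map; it issues no GL calls itself.
 */
final class PboRing {
    private final int[] pboIds; // buffer names, e.g. from GLES30.glGenBuffers
    private final int delay;    // frames to wait before reading a slot back
    private long frame = 0;     // number of frames completed so far

    PboRing(int[] pboIds, int delay) {
        // The ring must be larger than the delay, otherwise a slot would be
        // overwritten before (or while) it is read back.
        if (delay < 1 || pboIds.length <= delay) {
            throw new IllegalArgumentException("need more PBOs than frames of delay");
        }
        this.pboIds = pboIds.clone();
        this.delay = delay;
    }

    /** PBO that this frame's glReadPixels should pack into. */
    int writePbo() {
        return pboIds[(int) (frame % pboIds.length)];
    }

    /** PBO filled `delay` frames ago, or -1 while the ring is still warming up. */
    int readPbo() {
        if (frame < delay) {
            return -1;
        }
        return pboIds[(int) ((frame - delay) % pboIds.length)];
    }

    /** Call once per rendered frame, after the pack and the readback. */
    void endFrame() {
        frame++;
    }
}
```

Per frame, the draw thread would bind writePbo() to GL_PIXEL_PACK_BUFFER and call glReadPixels into it, then, if readPbo() is not -1, bind that buffer, map it with glMapBufferRange, copy the pixels out, unmap it, and finally call endFrame(). With 4 PBOs and a delay of 3, the first three frames produce no readback.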

Related: Your frame timings yield ~3ms for the pure CPU memcpy cost and ~20ms for the GL-ES readback (the way you’re doing it anyway). I don’t know if you noticed, but 16.66ms + 3ms ~= 20ms. So your CPU draw thread is being forced to block for 1 display frame interval to get that pixel data back (waiting for the queued frame to finish rasterization most likely). Proper use of a ring buffer of PBOs for the readback will let you get rid of that 1 frame of blocked time on your draw thread.
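
The arithmetic behind that estimate can be checked with a small helper, assuming a 60 Hz display (which is what the 16.66ms figure implies) and 4 bytes per pixel as in the mViewWidth * mViewHeight * 4 size from the original post. ReadbackBudget is a hypothetical name used only for this sketch.

```java
/** Hypothetical helper for the timing and size estimates discussed above. */
final class ReadbackBudget {
    /** Duration of one display refresh, in milliseconds. */
    static double frameIntervalMs(double refreshHz) {
        return 1000.0 / refreshHz;
    }

    /** Size in bytes of one frame at 4 bytes per pixel (e.g. RGBA8). */
    static long frameBytes(int width, int height) {
        return (long) width * height * 4;
    }

    /** One blocked refresh interval plus the plain memcpy cost, in milliseconds. */
    static double stalledCopyMs(double refreshHz, double memcpyMs) {
        return frameIntervalMs(refreshHz) + memcpyMs;
    }
}
```

For the numbers in this thread, stalledCopyMs(60.0, 3.0) comes to roughly 19.7ms, which matches the ~20ms measured, and frameBytes(1980, 1080) is 8,553,600 bytes per frame.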

Also, garden-variety MapBuffer in GL-ES is a blocking call in many OpenGL ES drivers, triggering a stall / implicit sync. Again, check the vendor’s GL-ES Programming Guide for details on how to minimize or avoid these stalls for pixel readbacks. Some mobile GPU vendors’ GLES drivers support MAP_UNSYNC to avoid these stalls, but you have to be very careful with that (as on desktop) to avoid stepping on the driver’s toes.