OpenGL ES-CM 1.1 performance issue on i.MX6 with the Etnaviv (Vivante) driver

We have a custom i.MX6 Solo device that runs both WinCE7 and Linux.

We have developed a custom benchmark application using OpenGL ES-CM 1.1.

When I ran the benchmark application on WinCE7 the performance looked good, but a similar application on Linux showed a 50 % reduction in performance.

Our configuration is as follows:

Linux       - Mainline 5.15
GPU driver  - Etnaviv
X Driver    - xf86-video-armada
OpenGL      - OpenGL ES-CM 1.1

WinCE result:

EGL version : 1.4
GL vendor   : Vivante Corporation
GL renderer : Vivante GC880
GL version  : OpenGL ES-CM 1.1 Mesa 22.0.3
run scene 'Floating Frame3D VBO'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 48.075005
run scene 'Floating Frame3D'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 31.975005
run scene 'Frame 3D Fixed'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 31.505484
run scene 'Floating graph'
Average FPS = 238.403093
run scene 'Fixed graph'
Average FPS = 238.187271

Linux result:

EGL version : 1.4
GL vendor   : etnaviv
GL renderer : Vivante GC880 rev 5106
GL version  : OpenGL ES-CM 1.1 Mesa 22.0.3
run scene 'Floating Frame3D VBO'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 26.872623
run scene 'Floating Frame3D'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 24.373300
run scene 'Frame 3D Fixed'
initialize, DATA_SIZE:11264, No of Frames:512 Average FPS = 23.783205
run scene 'Floating graph'
Average FPS = 136.624447
run scene 'Fixed graph'
Average FPS = 136.510016

glmark2, which uses OpenGL (ES) 2.0, gives an 80 % score. Could the issue be with OpenGL ES 1.1, or with the X driver?

At the moment I am not able to figure out where the problem is. Any help is appreciated.

Is it a “similar application” or the “same application”? If “similar”, it may not be fair to compare results.

Are these running on exactly the same hardware? If not, it’s not a fair comparison.

These tests are running on 2 different OSs. So it’s not really a fair comparison.

You appear to be running on 2 different drivers:

  • One on WinCE allegedly written by the GPU vendor (Vivante Corp).
  • Another on Linux (Etnaviv) which appears to be an open-source driver.

so it’s not a fair comparison. Typically you’d expect the vendor driver to outperform the open-source driver, and it appears that this is what you’re seeing.

As to what specifically is performing differently between these configurations, you’ll just have to profile your frames and nail it down. You haven’t given us enough info to really help here, other than to comment on the HW/SW differences between these two setups.


It’s the same application.

It’s the same hardware and the same application, but the OS and driver are different. Do you suggest trying the Freescale BSP with the Vivante driver and the Vivante Xorg driver, xf86-video-imx-vivante?

Here is the application source → GitHub - sahithyen/gl-nightmare
When I run the application, the CPU load is 80-90 %, which is quite high.
I used the apitrace tool to trace the application. A single frame shows 4000+ calls, while some frames show only 28 calls. I cannot upload the log file because it is larger than 100 MB. Below is the snippet for a 28-call frame.

In the application we call glDrawArrays() inside a for loop. Is it OK to call it like this?

    for (int li = 0; li < line_count; li++)
        glDrawArrays(GL_LINE_STRIP, li * point_count, current_count);

That’s a clue. However, you need to run a profiler to see what your app is busy doing. Is it tied up on the application thread? In the GPU back-end driver? Both? If application thread, what specific calls is it busy executing? That’ll give you some idea what you can change to reduce the total time consumption.

Pull out a CPU and GPU profiler for your platform, run your app under it, and see what it tells you!
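Before reaching for a full profiler, a crude first split can already tell you something: time the GL submission part of each frame separately from the buffer swap. The following is only a minimal sketch, not a replacement for a real CPU/GPU profiler. The names `elapsed_ms`, `frame_stats`, `frame_stats_add` and `frame_stats_print` are made up for illustration and are not part of your application or of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Milliseconds elapsed between two monotonic timestamps. */
static double elapsed_ms(const struct timespec *start,
                         const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1e3 +
           (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Running totals for the two halves of a frame (hypothetical helper). */
struct frame_stats {
    double submit_ms;  /* time spent issuing GL calls */
    double swap_ms;    /* time spent inside the buffer swap */
    int frames;        /* number of frames accumulated */
};

/* t0: before the first GL call, t1: after the last GL call, t2: after the swap. */
static void frame_stats_add(struct frame_stats *s,
                            const struct timespec *t0,
                            const struct timespec *t1,
                            const struct timespec *t2)
{
    s->submit_ms += elapsed_ms(t0, t1);
    s->swap_ms += elapsed_ms(t1, t2);
    s->frames++;
}

/* Print the per-frame averages collected so far. */
static void frame_stats_print(const struct frame_stats *s)
{
    assert(s->frames > 0);  /* avoid dividing by zero */
    printf("avg submit: %.3f ms, avg swap: %.3f ms\n",
           s->submit_ms / s->frames, s->swap_ms / s->frames);
}
```

In the render loop you would take the three timestamps with clock_gettime(CLOCK_MONOTONIC, ...) around your draw code and around eglSwapBuffers(). If most of the time lands in the submit half, the application thread or the driver's CPU side is the likely suspect; time in the swap half can also include waiting for the GPU or for vsync, so treat it as a hint only.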

“Ok” as in it works, sure. “Ok” as in it yields the best performance? It’s not the best, if there are a lot of consecutive GL calls. You can combine those all into 1 draw call in a few ways.
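One of those ways, sketched here under assumptions: expand every line strip into independent segments and draw them all with a single indexed GL_LINES call. The helper name `build_line_indices` is hypothetical; `line_count`, `point_count` and `current_count` mean the same as in your loop. OpenGL ES 1.1 core only accepts 8- and 16-bit indices, so this assumes the whole vertex array has at most 65536 vertices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand `line_count` line strips into one GL_LINES index list.
 * Strip `li` starts at vertex li * point_count and uses its first
 * `current_count` vertices, like the glDrawArrays() loop does.
 * Returns the number of indices written to `out`, which must hold
 * at least line_count * 2 * (current_count - 1) entries.
 */
static size_t build_line_indices(int line_count, int point_count,
                                 int current_count, uint16_t *out)
{
    size_t n = 0;

    /* Every vertex index must fit into an unsigned 16-bit value. */
    assert((long)line_count * point_count <= 65536L);

    for (int li = 0; li < line_count; li++) {
        int base = li * point_count;
        for (int i = 0; i + 1 < current_count; i++) {
            out[n++] = (uint16_t)(base + i);      /* segment start */
            out[n++] = (uint16_t)(base + i + 1);  /* segment end */
        }
    }
    return n;
}
```

With the index list built, the loop collapses to one glDrawElements(GL_LINES, n, GL_UNSIGNED_SHORT, indices) call. If `current_count` does not change between frames, the list can be built once and reused. Note that interior vertices are now referenced twice, so the result may differ slightly from a line strip when blending is enabled.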

But first I’d see where you’re primarily bottlenecked. Profile first. Determine why you’re slow. Then make changes to alleviate that bottleneck.

Browsing your code, I see #ifdefs that switch between submitting vertex data using client arrays or using VBOs. Which path are you taking in your tests? And are you sure that you are taking the same one on both platforms?

In your VBO code, I see a very bad usage pattern that typically performs horribly on mobile GPUs:

Here, you’re re-uploading the CPU data in data[] to the GPU in a buffer object vbo before every single draw call. On mobile, this is almost certainly going to trigger a full frame implicit sync which can cost you an entire frame of latency. Do some reading in your GPU vendor’s OpenGL ES Programming Guide. It’ll almost certainly talk about this.

The short why on this is: Mobile GPUs run on top of very slow CPU memory. Therefore, they are built to absolutely depend on being able to queue a full frame of GL commands up-front and then execute (render) the frame 1-2 frame periods later. When you do this hot-mod of a buffer object on-the-fly, you thwart its ability to do this. Effectively, your CPU thread is blocked until the GPU “catches up” on previous uses of this buffer object before allowing the buffer object update request (glBufferData()) to be queued. This is bad, and your frame rate can be cut by 2-3X in the process.
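One way to stay out of the GPU's way, sketched under assumptions: keep a small ring of buffer objects and fill a different one each frame, so the buffer you write is never one the GPU may still be reading from. The names `vbo_ring`, `VBO_RING_MAX` and `vbo_ring_acquire` are made up for illustration. Whether a ring of 3 is enough, and whether this avoids the stall on your particular driver, is something you would have to measure.

```c
#include <assert.h>

#define VBO_RING_MAX 4  /* hypothetical upper bound on ring size */

/* A small ring of buffer object names (hypothetical helper). */
struct vbo_ring {
    unsigned int ids[VBO_RING_MAX];  /* names returned by glGenBuffers() */
    unsigned int count;              /* how many entries of ids[] are in use */
    unsigned int next;               /* slot to hand out on the next acquire */
};

/*
 * Return the buffer to fill for this frame and advance the ring.
 * A buffer comes around again only after `count` acquires, so with
 * one acquire per frame the GPU has that many frames to finish
 * reading it.
 */
static unsigned int vbo_ring_acquire(struct vbo_ring *r)
{
    assert(r->count > 0 && r->count <= VBO_RING_MAX);

    unsigned int id = r->ids[r->next];
    r->next = (r->next + 1) % r->count;
    return id;
}
```

Once per frame, not once per draw call, you would bind the acquired buffer with glBindBuffer(GL_ARRAY_BUFFER, id), upload the whole frame's vertex data with a single glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW), and then issue all draws from it.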

For more reading on the problem, and some solutions that work well on desktop, see this wiki page:

Keep in mind however that, even if your driver supports glMapBufferRange() with the GL_MAP_UNSYNCHRONIZED_BIT, using the Sync Object method of synchronization is problematic on mobile GPUs. The reason is, GL_SYNC_GPU_COMMANDS_COMPLETE synchronizes on “all work” in the entire pipeline, which is bad. This will trigger a full pipeline flush and needless mid-frame rasterization pass, often resulting in a full-screen flash with partially-rendered data and potentially other artifacts (occlusion, MSAA resolve, etc.). You instead want to sync on just the vertex work, not the vertex and fragment work together. Talk to your GPU vendor about how to do that on your driver. On some mobile GPU drivers, querying the result of a dummy Transform Feedback batch can sync on vertex work alone.