Flashes on ARM Mali

Utumno · June 3, 2018, 3:02pm

Hello OpenGL gurus,

I have written an OpenGL ES 3.1 app for mobile devices and I am battling (again!) with problems on one particular platform, ARM Mali GPU. The program appears to run correctly on Adreno and PowerVR GPUs.

One frame is composed of several render passes. The render passes communicate with help of a Shader Storage Buffer Object and atomic counters. The whole thing looks like this:

Pass1_Initialize_SSBO_and_Atomic();
glMemoryBarrier(GL_ALL_BARRIER_BITS);

Pass2_Fill_SSBO_With_initial_Data();
glMemoryBarrier(GL_ALL_BARRIER_BITS);

for(i=0;i<N;i++)
   {
   Pass3_Render_Object(i);
   glMemoryBarrier(GL_ALL_BARRIER_BITS);
   }

Pass4_Compose_Everything();

Now, the problem is that on Mali the screen keeps flashing. I have made many recordings and watched them frame-by-frame. What happens is that about 95% of frames look correct, but every so often an arbitrary subset of Objects disappears, and reappears in the next frame. Sometimes (very rarely) I can also see in Android’s debug facility (logcat) the following:

E/OpenGLRenderer: Error:glFinish::execution failed
E/OpenGLRenderer: GL error: Out of memory!

I’ve seen that a few times before, and so far it has meant that some shader invocations couldn’t finish (due to an infinite loop, for example, as in my previous question here: SSBO: GPU locks up - OpenGL: Basic Coding - Khronos Forums).

########################################################

The problem is, I have no idea what can be causing the disappearing Objects. What I’ve tried so far is to keep removing code to see if the bug is still there - in an attempt to come up with the shortest piece of code that reproduces the problem. This approach fails because the more code I remove, the harder the bug gets to reproduce. Initially it keeps happening about twice per second, after several passes of removing various bits I can only reproduce it once per minute, and ultimately I cannot reproduce it anymore, but I have no idea if this is because I just removed the offending code or because I just passed some threshold and the bug is still there but is now very hard to reproduce.

The second thing I tried is to measure the bug by taking a look at the SSBO. I memory-map it to CPU at certain moments between passes and make sure it really does contain what it should.
Unfortunately as soon as I add a

glBindBuffer(GL_SHADER_STORAGE_BUFFER, mSSBO[0] );
glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, length, GL_MAP_READ_BIT);
(...) // print the buffer
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

pretty much anywhere in the application code, the bug simply disappears.

In particular, when I remove the first glMemoryBarrier() and replace it with the above, the bug disappears completely (I recorded 10 minutes’ worth of screen and watched it frame-by-frame; it’s gone). This happens even if I don’t inspect the buffer on the CPU at all and just map it and unmap it right away (which AFAIK should have the same effect as a glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)??). This raised a suspicion that maybe glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) on Mali is buggy, so I wrote a test program to check, and it showed that glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) works just fine.

I also have a Mali Graphics Debugger and I can connect it to the phone, and when I do the bug still shows. I however have no idea what to look for in the Debugger’s interface.

Would you have any advice how to approach such an issue?

The code is GPL v2 and it is fully available to download - it is 45k lines of Java, XML and GLSL though. If anybody wants to take a look, here it is: How to compile and run the example code - Distorted Android

Dark_Photon · June 3, 2018, 7:30pm

[QUOTE=Utumno;1291672]I have written an OpenGL ES 3.1 app for mobile devices and I am battling (again!) with problems on one particular platform, ARM Mali GPU. The program appears to run correctly on Adreno and PowerVR GPUs.

One frame is composed of several render passes. The render passes communicate with help of a Shader Storage Buffer Object and atomic counters. … glMemoryBarrier …

Now, the problem is that on Mali the screen keeps flashing.[/QUOTE]

Man, this sounds familiar. Not the Mali part. Or the glMemoryBarrier. But a mobile GPU, your symptoms, and what might be triggering it.

[QUOTE]The problem is, I have no idea what can be causing the disappearing Objects. What I’ve tried so far is to keep removing code to see if the bug is still there - in an attempt to come up with the shortest piece of code that reproduces the problem. This approach fails because the more code I remove, the harder the bug gets to reproduce.[/QUOTE]

Yep, that’s familiar too.

[QUOTE]The second thing I tried is to measure the bug by taking a look at the SSBO. I memory-map it to CPU at certain moments between passes and make sure it really does contain what it should.
Unfortunately as soon as I add a … glMapBufferRange … pretty much anywhere in the application code, the bug simply disappears.[/QUOTE]

Now that makes sense.

glMapBufferRange() w/o GL_MAP_UNSYNCHRONIZED_BIT in the flags (which you wouldn’t specify with GL_MAP_READ_BIT) will likely be a synchronizing map. That’s devastating for most mobile GPUs (i.e. tilers; aka GPUs with a sort-middle architecture). If there are batches in-flight referencing that buffer object, it can cause a flush of everything in the pipeline needed to free up that buffer object while your draw thread is completely preempted off the CPU and blocked from continuing until then.

So if you’re having a problem that seems like a pipeline load or race condition related problem (like yours), then this could mask it … to the detriment of your performance. It can slow the whole pipeline way down.

You may already know this but tilers can maintain their framebuffer in slow CPU DRAM (rather than fast VRAM/GDRAM like desktop GPUs) because they do all the vertex work for a render target up-front, binning it into screen tiles, and deferring the rasterization/fragment work until later. Each screen tile is rasterized in super-fast on-chip cache, and then when complete, written out to slow CPU DRAM “once”, greatly cutting the required DRAM memory bandwidth per render target. The rasterization work may happen as much as a frame or two later than your submission of the draw calls that instigated it. So when you do something that causes a partial or full pipeline flush, it can create a very big pipeline bubble (…and sometimes generate rendering artifacts – see below).

[QUOTE]Would you have any advice how to approach such an issue?[/QUOTE]

This sounds so much like something I hit and worked out with the dev support guys for a different mobile GPU vendor. In this case SSBOs and glMemoryBarriers weren’t involved, but other pipeline synchronization primitives were (glFenceSync/glWaitSync – which I learned you don’t want to do on tilers without extreme care). And I’m pretty sure the result looked like flashing, where a frame was rendered but missing some objects in the scene.

What I know was happening there and what I think might be happening in your case is you are doing something to trigger a “full pipeline flush”. Remember I said above that tilers are completely structured so that they pre-execute all the vertex transform work for a render target, bin that transformed work per-screen-tile, and then, per-tile, they perform rasterization? A full pipeline flush completely thwarts that. It says: screw you, GPU. I know you’re not done accumulating primitives to render for this render target. But I’m telling you to take what I’ve given you and shove it all down the pipeline right now!! (i.e. finish all the vertex work, run the rasterizer for all tiles, and spew out all the pixels into DRAM – even though this doesn’t represent what will be the correct contents for those screen tiles). It’s like Dark Helmet in Spaceballs demanding the ship go straight to Ludicrous Speed. You’re just not supposed to do that! The pipeline’s not built for it!

This mid-frame flush can cause rendering artifacts in a number of ways. If memory serves, doing multisample alpha rendering is one case where you can end up with artifacts, though there are more. The net effect is that you have one frame every so often that’s “missing” objects (or pieces of objects), but then it magically clears itself up in subsequent frames (for a while). The frames where objects are missing (or in general: artifacts were seen) are the frames where full pipeline flushes occurred. The “trick” is figuring out what caused the pipeline flush and getting rid of it. This may involve finding an alternative that does what you want without triggering a full pipeline flush.

Another thing that can cause a full pipeline flush is when you run out of “buffer space” between the vertex and fragment stages (recall how tilers work from above). This is called different things by different vendors, but let’s just call this the parameter buffer. If you run out of space in this parameter buffer (e.g. submit more primitives than the driver writers anticipated you ever would), then the driver has little choice but to do a full pipeline flush to free up parameter buffer space so that it can accept more transformed vertex data. And this can generate artifacts. Sometimes there will be a driver configuration setting you can tune to adjust the size of the parameter buffer for subsequent boot-ups and executions. In other cases, the size may be hard-coded in the GPU driver and not user-tunable.

Your mention of “the more code I remove, the harder the bug gets to reproduce” made me think of parameter buffer space. The less you submit, the less likely you are to run out of parameter buffer space given a fixed buffer size, which would instigate less frequent pipeline flushes mid-frame. You might try seeing if you have any control over the amount of parameter buffer space in your Mali GL driver and try tweaking it. And I would check to see if you have any visibility into when the driver is actually “doing” mid-frame pipeline flushes (which it really does not want to do! Triggering these is basically a big usage error on mobile GPUs). Perhaps, using a debug message callback would let you tap into the driver trying to tell you that (for details, see Debug Output (GLWiki) and KHR_debug). Alternatively, look for a driver log you can tap into where the Mali driver may be telling you in more detail what it’s doing (logcat for instance?). There may be a driver setting you can toggle to flip it into a more verbose logging mode. Check with the Mali docs and driver guys for details.

That said, it could be you’re just doing less work which might have to flush due to some of the synchronization, leading to fewer forced pipeline flushes. Check their docs to see if anything you’re doing (particularly synchronization primitives) may be causing a pipeline flush.

Utumno · June 4, 2018, 3:47pm

Thanks for such an in-depth answer! I just read and re-read it 4 times. At least this whole ‘disappearing Objects’ bug starts to make sense!

I’ll take a look at Debug Output and KHR_debug; I’ll post questions in the Mali support forum. There must be a way the Mali Graphics Debugger can help here…

##########################################################

What I was also thinking to do is the following:

what if I added another atomic counter, and incremented it in a fragment shader (99% of complexity is in fragment shaders) on condition I want to probe ATM?
Then, every N frames (we could initially set N to be 100) I’d map this Atomic Counter Buffer to CPU, inspect its value and reset it back to 0. I’d then know how often the condition I want to probe has happened.

Good news is, I can ‘switch off time’ in my app, and keep rendering exactly the same frame over and over, and the bug still shows just the same. In this situation I should be able to manually compute how often the condition should have happened during those N frames, compare this theory to measurable practice, and hopefully detect some anomalies.

Memory-mapping a small Atomic Counter Buffer every N=100 frames shouldn’t affect the reproducibility of the bug, hopefully. We could even see what’s the relationship between N and reproducibility of the bug, and - who knows - maybe get down to a smaller N without affecting the bug.
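The bookkeeping for such a probe could look roughly like the sketch below. It is plain Java with no GL calls. `ProbeWindow`, `expectedPerFrame` and `deviation` are hypothetical names, and the measured value is assumed to come from mapping the atomic counter buffer once per window and resetting it to 0 afterwards.

```java
// Hypothetical bookkeeping for a "probe every N frames" atomic counter.
// No GL calls here: the caller is assumed to map the counter buffer, pass the
// value it read to deviation(), and reset the counter to 0 afterwards.
final class ProbeWindow {
    private final int framesPerWindow;   // N in the text above
    private final long expectedPerFrame; // hand-computed hits for the frozen frame
    private int framesSeen = 0;

    ProbeWindow(int framesPerWindow, long expectedPerFrame) {
        if (framesPerWindow <= 0) {
            throw new IllegalArgumentException("N must be positive");
        }
        this.framesPerWindow = framesPerWindow;
        this.expectedPerFrame = expectedPerFrame;
    }

    /** Call once per rendered frame; returns true when the counter should be read back. */
    boolean endFrame() {
        framesSeen++;
        if (framesSeen == framesPerWindow) {
            framesSeen = 0;
            return true;
        }
        return false;
    }

    /** Measured minus expected count for one full window; 0 means no anomaly. */
    long deviation(long measuredCount) {
        return measuredCount - expectedPerFrame * framesPerWindow;
    }

    public static void main(String[] args) {
        ProbeWindow probe = new ProbeWindow(100, 2500);
        for (int frame = 1; frame <= 100; frame++) {
            if (probe.endFrame()) {
                // 249000 is a made-up reading standing in for the mapped counter value.
                System.out.println("deviation = " + probe.deviation(249000L));
            }
        }
    }
}
```

Because time is "switched off" and every frame is identical, any non-zero deviation would point at fragments that ran more or fewer times than they should have.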

The problem is - if this is really a ‘full pipeline flush’ then the above wouldn’t really tell me anything, would it?

Dark_Photon · June 4, 2018, 7:03pm

[QUOTE=Utumno;1291701]what if I added another atomic counter, and incremented it in a fragment shader (99% of complexity is in fragment shaders) on condition I want to probe ATM?
Then, every N frames (we could initially set N to be 100) I’d map this Atomic Counter Buffer to CPU, inspect its value and reset it back to 0. I’d then know how often the condition I want to probe has happened.[/QUOTE]

I guess it would tell you how many fragments you rendered in that period of time, and effectively fragments per flash. It’s a data point, but it seems unclear whether this flash has to do with fragment counts.

If you wanted to minimize the chance of mapping the buffer triggering a flush, allocate 3 or 4 different buffer objects and write to them in a round-robin fashion, one per frame. Wait 3-4 frames after having the GPU write to a buffer before having the CPU map/read from that buffer. If the Mali driver is tracking use properly, this ideally should result in your application not triggering any pipeline flushes when the CPU tries to read from a buffer.
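The round-robin scheme described above boils down to index bookkeeping, which can be sketched as follows. This is a minimal sketch with no GL calls: `ReadbackRing` and its methods are hypothetical names, and each slot is assumed to stand for one separately allocated buffer object.

```java
// Hypothetical helper: pure index bookkeeping for a ring of buffer objects.
// Slot k is assumed to correspond to the k-th buffer object you allocated.
final class ReadbackRing {
    private final int slots;

    ReadbackRing(int slots) {
        if (slots < 2) {
            throw new IllegalArgumentException("need at least 2 slots");
        }
        this.slots = slots;
    }

    /** Slot the GPU should write into during the given frame (frames count from 0). */
    int writeSlot(long frame) {
        return (int) (frame % slots);
    }

    /**
     * Slot the CPU may map during the given frame, or -1 while the ring is
     * still warming up. It is the slot written (slots - 1) frames earlier,
     * so it is never the slot being written in the same frame.
     */
    int readSlot(long frame) {
        if (frame < slots - 1) {
            return -1;
        }
        return (int) ((frame - (slots - 1)) % slots);
    }

    public static void main(String[] args) {
        ReadbackRing ring = new ReadbackRing(4);
        for (long frame = 0; frame < 8; frame++) {
            System.out.println("frame " + frame
                    + ": write slot " + ring.writeSlot(frame)
                    + ", read slot " + ring.readSlot(frame));
        }
    }
}
```

With 4 slots the CPU always reads data that is 3 frames old, which gives the driver the most room to have finished with that buffer before it is mapped.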

[QUOTE]Good news is, I can ‘switch off time’ in my app, and keep rendering exactly the same frame over and over, and the bug still shows just the same.[/QUOTE]

That’s good. Anything to simplify the test case is a definite plus. If you end up nailing this down to a Mali driver bug, it’ll also give you something to submit as a repro to them which will increase your chances that they’ll actually fix the bug.

[QUOTE]Memory-mapping a small Atomic Counter Buffer every N=100 frames shouldn’t affect the reproducibility of the bug,[/QUOTE]

Probably not. And I’d say that if you use the round-robin trick above to minimize the map triggering a partial pipeline synchronization/flush, then it should reduce the chance of it affecting the reproducibility even further.

[QUOTE]The problem is - if this is really a ‘full pipeline flush’ then the above wouldn’t really tell me anything, would it?[/QUOTE]

Probably not. Besides the suggestions I made above, if it were my problem, I’d be inclined to continue whittling this down to get a minimally reproducible test case. That may end up pointing out the root cause to you … or at least give you a small standalone test program you can post here and to the Mali dev forums to get more ideas as to what might be going on and things to try.

You could also provide more details about what you’re doing which could spark more suggestions.

Speaking of that, why do you have these barriers on GL_ALL_BARRIER_BITS? Have you tried limiting those down to the bare minimum that is required? Do you really need to use SSBOs and/or atomics here? Is there another method of communicating between the render passes you could try? Does the problem go away if you get rid of one or the other, or both? Are you rendering with MSAA?

Utumno · June 5, 2018, 3:26am

In production code I actually use GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT only. I changed it to GL_ALL_BARRIER_BITS just to make sure I am not missing anything.
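For reference, the relationship between the two masks can be sketched in plain Java. The numeric values below are copied from the OpenGL ES 3.1 headers (on Android the same constants are exposed through `android.opengl.GLES31`). The `BarrierBits` class and its helpers are hypothetical and only illustrate the bit arithmetic.

```java
// Bit arithmetic only; no GL calls. Constant values are those of the
// OpenGL ES 3.1 headers. The class and helper names are hypothetical.
final class BarrierBits {
    static final int GL_ATOMIC_COUNTER_BARRIER_BIT = 0x00001000;
    static final int GL_SHADER_STORAGE_BARRIER_BIT = 0x00002000;
    static final int GL_ALL_BARRIER_BITS = 0xFFFFFFFF;

    /** The narrower mask described as the production setting. */
    static int productionMask() {
        return GL_SHADER_STORAGE_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT;
    }

    /** True if every bit of 'required' is also set in 'mask'. */
    static boolean covers(int mask, int required) {
        return (mask & required) == required;
    }

    public static void main(String[] args) {
        System.out.println("production mask = 0x"
                + Integer.toHexString(productionMask()));
        System.out.println("ALL covers production mask: "
                + covers(GL_ALL_BARRIER_BITS, productionMask()));
    }
}
```

Since GL_ALL_BARRIER_BITS is a superset of the production mask, switching to it can only add synchronization, never remove any.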

The code implements the so-called A-buffer for order independent transparency, exactly this: https://hal.archives-ouvertes.fr/hal-01093158

Before I was saying the code looks like this:

Pass1_Initialize_SSBO_and_Atomic();
glMemoryBarrier(GL_ALL_BARRIER_BITS);
 
Pass2_Fill_SSBO_With_initial_Data();
glMemoryBarrier(GL_ALL_BARRIER_BITS);
 
for(i=0;i<N;i++)
   {
   Pass3_Render_Object(i);
   glMemoryBarrier(GL_ALL_BARRIER_BITS);
   }
 
Pass4_Compose_Everything();

- So first we zero out the atomic counter and all the (screenWidth*screenHeight) ‘head pointers’ in the SSBO.
- Then we do a dry run, rendering geometry N% larger than it really is, with empty fragment shaders and with depth and color writes off, writing only stencil.
- Then in a loop we render each Object to a temp FBO:
  - We do a user-defined post-processing operation (in my example, a Gaussian blur) on each Object (using stencil). The blur extends the size of the Object a few pixels in each direction, which is the reason why in the previous step we were marking a larger object in the stencil buffer.
  - We copy the post-processed Object to another FBO, this time rendering opaque pixels directly to the FBO, and transparent ones to the per-pixel linked lists in the SSBO, already sorting them by depth (which involves concurrent inserts to linked lists in the fragment shader!) as described in the paper by Sylvain Lefebvre.
- Then we compress the linked lists to remove the fragments that are occluded by opaque parts of the scene.
- Then we compose the whole scene from the FBOs and the per-pixel linked lists in the SSBO to another FBO.
- Finally we blit the FBO to the screen.
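The per-pixel linked lists used in the steps above can be modelled on the CPU as a rough illustration. This is a single-threaded sketch, not the shader code: `PixelListModel` and its fields are hypothetical names, the plain `counter` field stands in for the atomic counter, and the sort order (increasing depth) is an assumption. A real fragment shader also has to handle concurrent inserts with atomic operations, which this model leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-threaded CPU model of per-pixel linked lists; all names are hypothetical.
final class PixelListModel {
    private static final int NIL = -1;
    private final int[] head;     // one head pointer per pixel
    private final int[] next;     // node pool: index of the next node
    private final float[] depth;  // node pool: fragment depth
    private final int[] color;    // node pool: packed RGBA
    private int counter = 0;      // stands in for the atomic counter

    PixelListModel(int pixels, int maxNodes) {
        head = new int[pixels];
        Arrays.fill(head, NIL);   // the "zero out the head pointers" step
        next = new int[maxNodes];
        depth = new float[maxNodes];
        color = new int[maxNodes];
    }

    /** Inserts a fragment, keeping the pixel's list sorted by increasing depth. */
    boolean insert(int pixel, float z, int rgba) {
        if (counter >= next.length) {
            return false;         // node pool exhausted
        }
        int node = counter++;
        depth[node] = z;
        color[node] = rgba;
        int prev = NIL;
        int cur = head[pixel];
        while (cur != NIL && depth[cur] <= z) {
            prev = cur;
            cur = next[cur];
        }
        next[node] = cur;
        if (prev == NIL) {
            head[pixel] = node;
        } else {
            next[prev] = node;
        }
        return true;
    }

    /** Depths stored for one pixel, in list order. */
    List<Float> depths(int pixel) {
        List<Float> out = new ArrayList<>();
        for (int cur = head[pixel]; cur != NIL; cur = next[cur]) {
            out.add(depth[cur]);
        }
        return out;
    }

    public static void main(String[] args) {
        PixelListModel model = new PixelListModel(4, 16);
        model.insert(0, 0.5f, 0x11223344);
        model.insert(0, 0.2f, 0x55667788);
        System.out.println(model.depths(0));
    }
}
```

Each insert walks and rewrites shared pointers, which is why the real shader version is sensitive to memory ordering between passes.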

The geometry is very simple: in my reproducing example it is actually only 2 cubes (2 * (3*8) = 48 vertices!), both blurred and semi-transparent and partly occluding each other so we can see the algorithm working.

This is my hobby project I’ve been using to teach myself OpenGL - it’s a library for graphics effects for Android. One can define his own effects (like the blur) and render stuff: https://distorted.org.
The idea is that this will be used mostly for 2.5D interfaces or simple games like Candy Crush; it is by no means a 3D game engine.

Dark_Photon · June 5, 2018, 5:35am

[QUOTE=Utumno;1291710]
The code implements the so-called A-buffer for order independent transparency, exactly this: https://hal.archives-ouvertes.fr/hal-01093158[/QUOTE]

Ok. Pretty ambitious for a hobby project

[QUOTE]So first we zero out the atomic counter and all the (screenWidth*screenHeight) ‘head pointers’ in the SSBO.[/QUOTE]

I was reading your description from the standpoint of whether anything else you’re doing might be triggering a pipeline sync/flush.

One thought I had is if you are clearing the SSBO with the CPU rather than the GPU, this might be triggering a synchronization. 3-4 buffers used round-robin can often avoid this, but I would check the Mali docs for details here.

[QUOTE]…Then in a loop we render each Object to a temp FBO:
…We copy the post processed Object to another FBO, this time rendering opaque pixels directly to the FBO, …Then we compose the whole scene from the FBOs and the per-pixel linked list in SSBO to another FBO.
Finally we blit the FBO to the screen.[/QUOTE]

As you indicate, this makes heavy use of FBOs. This is another feature which can cause pipeline synchronization. On another mobile GPU vendor’s GLES drivers, the FBO was basically “the” placeholder for all the draw work submitted to that FBO. If you rerendered to that FBO before all the queued work from previous renders to that FBO completed (and/or reconfigured that FBOs bindings), this would cause a full flush/sync waiting on all that queued work to complete. You could see this flush/stall in the GPU profiler pretty easily. To avoid this inefficiency, at the recommendation of the dev support guys, we added a round robin queue of FBOs which we used for offscreen rendering. This allowed multiple FBOs of render work to pipeline well without a flush.

That’s another thing you might do: Capture profiling data for your application around when a flash is happening. Then pull up the Mali profiling tool and look at that instrumented profiling data to see if you can identify a single prominent problem in the data.

Good luck!

Utumno · June 5, 2018, 9:08am

[QUOTE=Dark Photon;1291713]
One thought I had is if you are clearing the SSBO with the CPU rather than the GPU, this might be triggering a synchronization.[/QUOTE]

First I tried to use the CPU, but memory-mapping the SSBO and setting 2 million integers to 0 in Java proved way too slow. Now I use the GPU like this:

Vert shader: http://distorted.org/redmine/projects/distorted-android/repository/revisions/order-independent-transparency/entry/src/main/res/raw/oit_vertex_shader.glsl
Frag shader: http://distorted.org/redmine/projects/distorted-android/repository/revisions/order-independent-transparency/entry/src/main/res/raw/oit_clear_fragment_shader.glsl

[QUOTE]As you indicate, this makes heavy use of FBOs. This is another feature which can cause pipeline synchronization. On another mobile GPU vendor’s GLES drivers, the FBO was basically “the” placeholder for all the draw work submitted to that FBO. If you rerendered to that FBO before all the queued work from previous renders to that FBO completed (and/or reconfigured that FBOs bindings), this would cause a full flush/sync waiting on all that queued work to complete. You could see this flush/stall in the GPU profiler pretty easily. To avoid this inefficiency, at the recommendation of the dev support guys, we added a round robin queue of FBOs which we used for offscreen rendering. This allowed multiple FBOs of render work to pipeline well without a flush.[/QUOTE]

This is very interesting, because quite recently I went exactly the other way: in an obsessive attempt to save memory, I crammed many FBOs into 1. I thought ‘why have seven 512x512 FBOs, three 765x765 and two 1024x1024 if only 2 are really needed to hold information at any given time - let’s have only two 1024x1024 ones. If they are too large for a particular render, I’ll just adjust Texture Coordinates (actually if you take a look at the Vertex Shader from above again, this is exactly what the ‘u_TexCorr’ uniform does) and the Viewport’. This will save so much memory! (you can see I am a programmer of embedded systems in my day job)

You’re saying this is a no-no?
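The shared-FBO trick described above (one large FBO, with a smaller viewport and corrected texture coordinates) can be sketched as follows. This is only a guess at the arithmetic: `SharedFboRegion` is a hypothetical name, and the assumption that the correction factor equals rendered size divided by FBO size is mine, not taken from the actual ‘u_TexCorr’ shader code.

```java
// Hypothetical arithmetic for rendering into the lower-left corner of a
// larger, shared FBO. Assumes the texture-coordinate correction is simply
// (rendered size / FBO size) per axis.
final class SharedFboRegion {
    final int viewportWidth;
    final int viewportHeight;
    final float texCorrX;
    final float texCorrY;

    SharedFboRegion(int fboWidth, int fboHeight, int renderWidth, int renderHeight) {
        if (renderWidth <= 0 || renderHeight <= 0
                || renderWidth > fboWidth || renderHeight > fboHeight) {
            throw new IllegalArgumentException("render size must fit inside the FBO");
        }
        viewportWidth = renderWidth;    // would be passed to glViewport(0, 0, w, h)
        viewportHeight = renderHeight;
        texCorrX = (float) renderWidth / fboWidth;
        texCorrY = (float) renderHeight / fboHeight;
    }

    public static void main(String[] args) {
        SharedFboRegion region = new SharedFboRegion(1024, 1024, 512, 512);
        System.out.println("viewport " + region.viewportWidth + "x" + region.viewportHeight
                + ", texCorr " + region.texCorrX + ", " + region.texCorrY);
    }
}
```

The memory saving is real, but it also means every render reuses the same two FBOs, which is exactly the reuse pattern the previous post warns about on some drivers.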

[QUOTE]That’s another thing you might do: Capture profiling data for your application around when a flash is happening. Then pull up the Mali profiling tool and look at that instrumented profiling data to see if you can identify a single prominent problem in the data.[/QUOTE]

I’ll need to sniff around how to do that…

Dark_Photon · June 5, 2018, 6:54pm

[QUOTE=Utumno;1291716]This is very interesting, because quite recently I went exactly the other way: in an obsessive attempt to save memory, I crammed many FBOs into 1. I thought ‘why have seven 512x512 FBOs, three 765x765 and two 1024x1024 if only 2 are really needed to hold information at any given time - let’s have only two 1024x1024 ones. If they are too large for a particular render, I’ll just adjust Texture Coordinates (actually if you take a look at the Vertex Shader from above again, this is exactly what the ‘u_TexCorr’ uniform does) and the Viewport’. This will save so much memory! (you can see I am a programmer of embedded systems in my day job)

You’re saying this is a no-no?[/QUOTE]
No. I’m just saying that on a different mobile GPU vendor’s GLES driver, it requires a different FBO usage to pipeline the work well. That doesn’t say anything for sure about ARM Mali GLES drivers though.

So I would look through the Mali GLES developer guides for guidance w.r.t. FBO usage, and/or post a question to the Mali GLES developer forums.

It could be the Mali GLES driver is a bit smarter about pipelining the draw work for different render targets even when a small number of FBOs are used to dispatch it, in which case you might not need to do anything different.

Utumno · June 8, 2018, 2:49am

A string of bad news:

- My question on the Mali dev forum (https://community.arm.com/graphics/f/discussions/10285/opengl-es-3-1-on-mali-t880-flashes) remains unanswered.
- It turns out the bug is not reproducible when tracing with Mali Graphics Debugger:
  - I run the ‘bug reproducing’ app, flashes happen regularly about once or twice per second.
  - I connect MGD, start tracing, flashes completely stop.
  - I press the ‘disable tracing temporarily’ button in MGD - flashes immediately start again.
  - I press the button again to restart tracing - flashes completely disappear…
- I have tried to introduce the ‘round-robin’ strategy with FBOs, with the atomic counter, and with the SSBO (I’ve tried 2, 3, 5 of everything) -> this does not make one bit of difference. Still flashes once or twice per second. (Although with the SSBO, what I tried is only 1 fragment shader with 1 SSBO binding point, and 5 SSBOs being bound there in a round-robin strategy - maybe I should try 5 fragment shaders with the only difference being SSBO binding points, and 5 SSBOs permanently bound 1-to-1?)
- Attempts at removing code have so far only succeeded in removing the part of the 45k lines that simply does not run when this ‘bug-reproducing’ app is being run (recall that the whole thing is a generic library for graphics effects; most of the effects run just fine on Mali, it is just the ‘A-buffer order independent transparency’ that gives me the flashes). So the ‘minimal bug-reproducing app’ is ATM about 5000 lines of code. As soon as I start to remove code that actually runs, reproducibility of the bug decreases, even if the code being removed should, in this particular case of 2 cubes arranged as they are, be a no-op.
- I also tried to restructure my FBOs used for ping-pong: they used to have 2 color attachments + 1 combined depth-stencil attachment. Each time I blurred something I’d keep detaching and re-attaching those 2 color textures (because blur is a 2-pass algorithm: first horizontally, then vertically); now instead I have 2 separate FBOs sharing a depth-stencil texture. Result: looks like this has made the issue slightly worse.
- KHR or debug output does not seem to be possible on Android. In fact, I don’t even know how to create a debug context…

EDIT: debug output is there, but it is only available starting with OpenGL ES 3.2. Fortunately the Mali T880 is 3.2 capable, let’s try!

EDIT2: Well great. KHR debug output seems to be there in OpenGL ES 3.2 headers (at least the function and constant definitions are there) but trying to use them - i.e. calling glDebugMessageCallbackKHR() - results in

java.lang.UnsupportedOperationException: not yet implemented

on Android 7.0. Supposedly Java bindings were added in Android 8.0 (this was claimed in the Stack Overflow question “Is glDebugMessageCallbackKHR implemented in Android 6?”, but it turned out not to be true).

EDIT3: Even better. Now I tried this on another phone running Android 8.1.0 updated today - still ‘not yet implemented’. No wonder there’s absolutely no info online about how to use it…

Utumno · June 8, 2018, 3:09pm

But also one surprising discovery - glFlush() does NOT make this bug go away!

I have added some 20 glFlushes() all over the code now, 1 after each major step. It’s still flashing just the same.

If I add just one synchronised ‘mapBufferRange - unmapBuffer’ at the beginning of my render loop, the bug is completely gone though:

glBindBuffer(GL_SHADER_STORAGE_BUFFER, mSSBO[0] );
glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, length, GL_MAP_READ_BIT);
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

Dark_Photon · June 8, 2018, 5:34pm

[QUOTE=Utumno;1291754]A string of bad news:

my question on Mali dev forum (https://community.arm.com/graphics/f/discussions/10285/opengl-es-3-1-on-mali-t880-flashes ) remains unanswered[/QUOTE]

That stinks. Yeah, I had similar experience on the Qualcomm forums a few years back: LINK Come to find out, their driver sometimes generates GL_OUT_OF_MEMORY when their shader compiler fouls up or fails to detect an error: ANOTHER LINK.

However, you’ve established that you can get rid of your flashing merely by stalling the pipeline a while, and you’re on Mali, so it’s not likely that it has anything to do with a shader error.

You might try posting about your problem on the OpenGL ES Forum on khronos.org. It’s possible you might catch the eye of someone there who has tripped over something similar on their mobile GPU driver.

[QUOTE]I have tried to introduce the ‘round-robin’ strategy with FBOs, and with the Atomic counter, and with the SSBO ( I’ve tried 2, 3, 5 of everything ) -> this does not make one bit of a difference. Still flashes once-twice per second. (although with the SSBO what I tried is only 1 fragment shader with 1 SSBO binding point, and 5 SSBOs being bound there in a round-robin strategy - …

…maybe I should try 5 fragment shaders with the only difference being SSBO binding points - and 5 SSBOs permanently bound 1-to-1?)[/QUOTE]

No, I don’t think that’s likely to produce a different result.

Re your GL_OUT_OF_MEMORY problem, I’d take a look at your buffer and texture usage to make sure there’s absolutely no way you’re ghosting them or blocking on them such that they might trigger a fragment flush. That “can” lead to legitimate GL_OUT_OF_MEMORY on mobile with some drivers, even if it doesn’t look to you like you’re using that much memory. Yes, I realize it’s a different GPU vendor (so buffer and texture usage contention may be handled differently on Mali drivers), but for the general idea, see this link:

[ul]
[li]Why GPUs don’t like to share – a guide to improve your renderer on PowerVR-based platforms [/li][/ul]

[QUOTE]KHR or debug output does not seem to be possible on Android. In fact, I don’t even know how to create a debug context…[/QUOTE]

It’s possible your platform doesn’t support KHR_debug. Have you tried printing the output of glGetString(GL_EXTENSIONS)? If so, look in there.
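Checking the extension string for an exact match can be sketched like this. It is plain Java with no GL calls: `ExtensionCheck.hasExtension` is a hypothetical helper, and the extension list passed in `main` is made up for illustration. On Android the real string would come from `GLES20.glGetString(GLES20.GL_EXTENSIONS)` on a thread with a current GL context.

```java
// Hypothetical helper: exact-token search in a space-separated extension string.
final class ExtensionCheck {
    static boolean hasExtension(String extensions, String name) {
        if (extensions == null || name == null || name.isEmpty()) {
            return false;
        }
        // Compare whole tokens so that "GL_KHR_debug" does not match
        // a longer name that merely starts with the same characters.
        for (String ext : extensions.trim().split("\\s+")) {
            if (ext.equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up extension list, for illustration only.
        String sample = "GL_OES_texture_npot GL_KHR_debug GL_EXT_shader_io_blocks";
        System.out.println(hasExtension(sample, "GL_KHR_debug"));
    }
}
```

A plain substring search would also work in most cases, but matching whole tokens avoids false positives from extensions whose names share a prefix.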

Also, one of the helpful things about KHR_debug is you don’t necessarily have to create a debug GL context. You can just call glEnable( GL_DEBUG_OUTPUT ) and start using it, if supported. That said, if you want to go the “create a GL debug context” route, you can do that with EGL by calling eglCreateContext() and passing the EGL_CONTEXT_OPENGL_DEBUG_BIT_KHR bit within the EGL_CONTEXT_FLAGS_KHR flags. For more detail, see EGL_KHR_create_context

[QUOTE]EDIT: debug output is there, but it is only available starting with OpenGL ES 3.2. Fortunately the Mali T880 is 3.2 capable, let’s try!

EDIT2: Well great. KHR debug output seems to be there in OpenGL ES 3.2 headers (at least the function and constant definitions are there) but trying to use them - i.e. calling glDebugMessageCallbackKHR() - results in

java.lang.UnsupportedOperationException: not yet implemented

on Android 7.0. Supposedly Java bindings were added in Android 8.0 (this was claimed by opengl es - Is glDebugMessageCallbackKHR implemented in Android 6? - Stack Overflow, turned out to be not true)

EDIT3: Even better. Now I tried this on another phone running Android 8.1.0 updated today - still ‘not yet implemented’. No wonder there’s absolutely no info online about how to use it…[/QUOTE]

There’s got to be some way to use it. If not, that’s surprisingly lame.

[QUOTE]But also one surprising discovery - glFlush() does NOT make this bug go away!

I have added some 20 glFlushes() all over the code now, 1 after each major step. It’s still flashing just the same.[/QUOTE]

That’s interesting, though I’m not too surprised about that. Depending on how glFlush() is implemented and where in the frame you do it, it could actually cause “more” flashing. I know on another mobile GPU’s drivers, a:


glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, ... );
...
glWaitSync();

mid-frame will cause flashing, as this generated a full GPU pipeline flush including tile rasterization for the partially-queued work for the currently bound render target. A flush at the beginning of the frame (in between render target renders) is more likely not to induce flashing than one done mid-render-target-render.

[QUOTE]If I add just one synchronised ‘mapBufferRange - unmapBuffer’ at the beginning of my render loop, the bug is completely gone though:

glBindBuffer(GL_SHADER_STORAGE_BUFFER, mSSBO[0] );
glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, length, GL_MAP_READ_BIT);
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);[/QUOTE]

Another good data point. I think if you keep whacking at this, you’ll figure it out.

If your driver blocks the draw thread whenever you map a buffer that’s in-use by the pipeline, that should stall the draw thread at the Map for a while. That could be enough to prevent the circumstances that later cause the flashing.

And the fact that you’re doing it at the beginning of the render loop makes it less likely to cause flashing, if in fact it’s triggering a full fragment flush (which, since you’re using it in the fragment shader, I’d bet that it will).

You really need more visibility as to what’s going on down in that driver.

Dark_Photon · June 8, 2018, 6:33pm

Here’s a Mali dev support guy talking about using KHR_debug under EGL / OpenGL ES on their GPUs 4 years ago.

Easier OpenGL ES debugging on ARM Mali GPUs with GL_KHR_debug

You might see if this helps any.

This (mali_kernel_common.h) among other places talks about the MALI driver having a mali_debug_level that can take values from 0 through 6 (6 being the highest debug level). This might give you more detailed mali driver info in logcat.

Here’s a link related to it on android (LINK) which suggests that you might be able to set this on Android from a shell with:


# mali_debug_level: Disabling it, you can gain some performance improvements.

echo "0" > /sys/module/mali/parameters/mali_debug_level

There are references elsewhere that you might also be able to set this as a parameter to the mali kernel module when it’s loaded into the kernel on boot-up:

insmod /boot/mali.ko mali_debug_level=2

And here’s a post talking about it on the ARM forums: LINK

Utumno · June 9, 2018, 11:51am

Thanks for the answers, again, I really appreciate this!

I have read before the Mali dev guy page about KHR_debug from 4 years ago. Unfortunately he’s talking about native development, i.e. NDK. Like I said, KHR_debug is defined in OpenGL ES 3.2 Java imports (‘headers’), but the methods are ‘not yet implemented’. Maybe I’ll rewrite the app to Native and then I’ll be able to squeeze some more info from the driver - but given the fact that the minimal ‘bug reproducing’ app ATM is about 5k lines - I am not that desperate yet.

I have also rooted my phone and I am looking at the Linux filesystem. Unfortunately the structure is different from the one described in those links from a few years ago (that was about the previous architecture, the so-called Utgard) - now there’s no loadable module anymore, everything is statically compiled into the kernel and the possibility to dynamically load modules is compiled out (probably for security reasons). But I can see there are unofficial ROMs for my device (Samsung Galaxy S7), maybe I’ll try that. There is some info in /sys about the memory Mali allocates when my app is running.

The ‘/sys/module/mali/parameters/mali_debug_level’ unfortunately does not exist any more (not surprising, since there is no module). My question in the Mali dev forum asking how to enable more debugs from the Mali driver ( https://community.arm.com/graphics/f/discussions/10297/android-extra-debugs-from-mali-driver ) of course remains unanswered.

I have also asked in OpenGL ES forum (thanks for the link) : https://forums.khronos.org/showthread.php/13797-OpenGl-ES-3-1-app-in-Android-flashes-on-Mali-T880-GPU . So far not answered.

I had some discussion in StackOverflow with someone who, having read the symptoms, agreed that the only possibility is either a ‘full pipeline flush’ or a bug in the driver.

#####################################
Maybe I’ll show what exactly I mean by ‘flashes’. Here’s how the main ‘bug-reproducing’ app should look (95% of frames look like this):

The above is rendered in the following way:

1. zero out our Atomic counter and SSBO ‘per-pixel head pointers’. (AKA ‘Order Independent Transparency Pass1 - Clear’)
2. take a textured cube
3. render it to a FBO1 with a special fragment shader that takes a color parameter (in this case RED) which ‘pulls’ each pixel color towards itself
4. blur FBO1 ( this takes two passes - first FBO1 is the input, blur horizontally to another FBO2, then back to our initial FBO1 blurring vertically) (blur is masked by stencil for speed)
5. copy this blurred red cube to another FBO3, again with a special fragment shader that copies all opaque (frag.a >0.95 ) fragments directly to FBO3, and the transparent fragments get inserted to the per-pixel linked lists of triplets (pointer to next element of linked list, depth,rgba) - this is done already sorting by depth. (AKA ‘Order Independent Transparency Pass2 - Build’)
6. Repeat steps 2-3-4-5 with another cube, this time giving YELLOW in step 3 (and a slightly different Model View Matrix, of course)
7. Repeat steps 2-3-4-5 with another cube, this time giving GREEN in step 3 (and again a slightly different Model View Matrix)
8. Render a quad, with depth, color and stencil writes off, with fragment shader which goes through the per-pixel linked lists in SSBO and cuts off those that are occluded by opaque pixels from FBO3 (AKA ‘Order Independent Transparency Pass3 - Cut’)
9. Render a quad, color+depth writes on, stencil off, Blending on, this time going through the linked lists in SSBO, blending them in order, and finally blending with color from FBO3 (AKA ‘Order Independent Transparency Pass4 - Render’)
10. blit FBO3 to screen
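
For readers who have not met the technique, here is a minimal single-threaded CPU sketch of the ‘Clear’, ‘Build’ and ‘Cut’ passes on per-pixel linked lists. It is not the shader code from the app. All names (OitBuffer, oit_insert, ...) are hypothetical, index 0 is assumed to mean ‘end of list’, larger depth is assumed to mean ‘farther away’, and lists are assumed to be kept farthest-first. On the GPU the counter and the head pointers would be updated with atomics, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PIXELS   4u
#define MAX_RECORDS 16u
#define END_OF_LIST  0u  /* assumption: index 0 terminates a list */

/* One triplet, as in the post: next pointer, depth, rgba. */
typedef struct {
    uint32_t next;
    float    depth;
    uint32_t rgba;
} Record;

typedef struct {
    uint32_t counter;               /* stand-in for the atomic counter  */
    uint32_t head[NUM_PIXELS];      /* per-pixel head pointers          */
    Record   rec[MAX_RECORDS + 1u]; /* slot 0 unused so 0 can mean end  */
} OitBuffer;

/* 'Clear': reset the counter and every head pointer. */
static void oit_clear(OitBuffer *b)
{
    b->counter = 0u;
    for (uint32_t i = 0u; i < NUM_PIXELS; ++i)
        b->head[i] = END_OF_LIST;
}

/* 'Build': allocate a record and link it in, farthest first.
 * Returns 0 if the pixel is out of range or the pool is full. */
static int oit_insert(OitBuffer *b, uint32_t pixel, float depth, uint32_t rgba)
{
    if (pixel >= NUM_PIXELS || b->counter >= MAX_RECORDS)
        return 0;
    uint32_t idx = ++b->counter;
    b->rec[idx].depth = depth;
    b->rec[idx].rgba  = rgba;

    /* Walk past every record that is farther than the new one. */
    uint32_t *link = &b->head[pixel];
    while (*link != END_OF_LIST && b->rec[*link].depth > depth)
        link = &b->rec[*link].next;
    b->rec[idx].next = *link;
    *link = idx;
    return 1;
}

/* 'Cut': unlink records at or behind the opaque depth for this pixel. */
static void oit_cut(OitBuffer *b, uint32_t pixel, float opaque_depth)
{
    uint32_t *link = &b->head[pixel];
    while (*link != END_OF_LIST) {
        if (b->rec[*link].depth >= opaque_depth)
            *link = b->rec[*link].next;
        else
            link = &b->rec[*link].next;
    }
}

/* Number of records currently linked for a pixel. */
static uint32_t oit_length(const OitBuffer *b, uint32_t pixel)
{
    uint32_t n = 0u;
    for (uint32_t i = b->head[pixel]; i != END_OF_LIST; i = b->rec[i].next)
        ++n;
    return n;
}
```

If the ‘Cut’ pass is skipped, or reads stale head pointers, occluded fragments stay in the lists and get blended by the ‘Render’ pass anyway.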

#####################################
Here are some unusual frames:

Green cube completely missing:
The whole Red cube and the opaque part of the Yellow cube missing:
Both Red and Yellow completely missing:
The whole Red, and transparent parts of Yellow and Green missing:
And IMHO the most interesting example, transparent parts of Red and Yellow show, even though they should be occluded (just as if Order Independent Transparency Pass3 ‘Cut’ pass did not run at all??)

Utumno · June 9, 2018, 1:00pm

And another interesting example. Here’s another app, a unit test of the graphics library. With this app it’s very hard to reproduce this bug, 99.9% of frames look like this:

This is rendered in the following way:

Take a texture of a leaf, render 8 quads textured with this to a FBO1 forming the ‘inner ring’ with the fragment shader pulling pixel colours towards RED
Blur FBO1
Blit FBO1 to the center of another, larger FBO2
Render 8 more quads textured with the very same leaf texture to FBO2 (forming the ‘outer ring’), this time pulling their fragments towards GREEN
Blur FBO2 (so the inner ring gets blurred twice)
Blit FBO2 to the screen.

Of course the above is also done with the Order Independent Transparency way, when copying we put transparent fragments to the SSBO and merge the SSBO in the last step.

I managed to see a frame like this:

So the outer ring is gone, and what’s even more interesting, the inner ring got distorted in some strange way. One can see the bottom of FBO2 (the distorted inner ring is cut in the lower part - that’s because FBO2 ends there).
This kind of got me thinking that maybe the objects do not ‘disappear’ but get distorted and moved; in this particular example we were lucky that the movement was small enough that the new position was still on the screen.

But on the other hand this would not explain the last case from the previous post - the one where parts of the transparent rings around the cubes showed, even though they should have been cut by the ‘Order Independent Transparency Pass3 - Cut’ pass.

Utumno · June 10, 2018, 2:58pm

I have updated my phone from stock Samsung Android 7.0.0-based firmware to an unofficial LineageOS 15.1 (Android 8.1.0-based).

Unfortunately the Mali driver is still the same, version ‘r12p1-03dev0’ and, unsurprisingly, behaves identically.
The aim is to install the latest Mali driver and see - I guess I can spin my own kernel here, compile Mali with highest debug level and see…

Utumno · June 16, 2018, 4:54pm

Summarizing this investigation: mixed, mostly bad, news.

Flashing my Samsung Galaxy S7 with Lineage OS 15.1 (even though the version of the Mali driver in there is still the same) proved partly useful, because the nature of the bug changed slightly. The bug, for some reason, became more predictable. This allowed for the ‘let’s keep removing code part by part and seeing if the bug is still there’ approach, albeit with a long, statistical test (each time 50 runs + script measuring frequency of crashes / flashes).
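
A rough way to judge how much a batch of clean runs proves: if a single run shows the bug with probability p, and runs are independent (an assumption, as is the function name), the chance that n runs in a row all look clean is (1 - p)^n. A small sketch:

```c
#include <assert.h>

/* Probability that n independent runs all look clean when each run
 * shows the bug with probability p. Independence between runs is an
 * assumption of this sketch, not something measured in the thread. */
static double prob_all_clean(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0);
    double q = 1.0;
    for (unsigned i = 0u; i < n; ++i)
        q *= 1.0 - p;
    return q;
}
```

For example, a bug that shows up in 5% of runs would still slip through 50 clean runs about 8% of the time, so the rarer the bug gets, the more runs a ‘bug is gone’ verdict needs.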

So at the end of this process I concluded that the issue undoubtedly has something to do with the SSBO; sometimes reads from the SSBO

layout (std430,binding=1) buffer linkedlist  
  {                                       
  uint u_Records[];           
  };

would either return garbage or block (I can’t tell those cases apart, but something very shady definitely happens when reading, even though the SSBO is filled with values in one pass, followed by a memoryBarrier, and only read in another pass). I’ve tried the ‘coherent’, ‘volatile’ and ‘restrict’ qualifiers there; nothing helps.

Then I did some more testing on different devices and I gave up. Results:

Qualcomm’s Adreno 418: works wonderfully
NVidia’s Tegra K1: works wonderfully
PowerVR GX6240: works, although slowly
PowerVR GE8100: shader fails to compile ( reported here: GE8100 in HTC Desire 12 (Android 7.1.1): fragment shader fails to compile - #2 by utumno - PowerVR Insider - PowerVR Developer Community Forums )
ARM Mali T880: some instability with SSBO, flashes, occasional crashes (with Samsung’s original Android 7.0-based OS mostly flashes)

Looks like in light of this driver situation I’ll have to give up trying to implement A-buffer for Order Independent Transparency for mobile devices and think of something simpler. Hurray for Qualcomm and NVidia, down with ARM and Imagination…

I am also rethinking my choice of platform. Looks like writing any advanced graphics app for Android, with so many possible GPUs, is not going to be easy. In iPhone camp at least they have to deal with only 1 brand of GPU…

Utumno · June 21, 2018, 4:23pm

Wow! 3 weeks into the hunt and I finally solved it. I essentially managed to split the SSBO into two parts and make one of the parts (the smaller one, fortunately) into a circular queue of 5 SSBOs, each one used once every 5 frames.
The test I was talking about in the previous post confirms (10000 runs) that the bug is gone.
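
The ‘circular queue’ part boils down to picking a buffer by frame number. A minimal sketch, assuming hypothetical names (SsboRing, ssbo_ring_slot, ssbo_ring_buffer) and plain unsigned values standing in for the GL buffer names so that it compiles without GL headers:

```c
#include <assert.h>
#include <stddef.h>

#define SSBO_RING_SIZE 5u  /* "a circular queue of 5 SSBOs" from the post */

/* Hypothetical holder for the buffer names. In a real app these would be
 * GLuint values filled in by glGenBuffers. */
typedef struct {
    unsigned buffers[SSBO_RING_SIZE];
} SsboRing;

/* Slot used by a given frame: frame 0 -> 0, ..., frame 5 -> 0 again. */
static size_t ssbo_ring_slot(unsigned long frame)
{
    return (size_t)(frame % SSBO_RING_SIZE);
}

/* Buffer name to bind for this frame. */
static unsigned ssbo_ring_buffer(const SsboRing *ring, unsigned long frame)
{
    assert(ring != NULL);
    return ring->buffers[ssbo_ring_slot(frame)];
}
```

At the start of each frame the chosen name would be bound to the SSBO binding point, e.g. with glBindBufferBase(GL_SHADER_STORAGE_BUFFER, binding, name), so that any one buffer is only touched once every 5 frames.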

Now the main problem is the PowerVR GE8100 compiler bug, fortunately on their forum they are a bit more responsive than ARM…

Dark_Photon · June 22, 2018, 6:02pm

Congrats! Persistence pays off!

Do you have any insight as to why that particular combination seems to have avoided the instability? Or just chalk one up to another crazy driver bug that makes no sense yet.

Now the main problem is the PowerVR GE8100 compiler bug, fortunately on their forum they are a bit more responsive than ARM…

Yes, I was very happy with Imagination Technologies’ dev support team a few years back, in how well they investigated problem reports and responded to forum posts. I’d point you to a few names, but the folks that I worked with there have all moved onto other companies since then. In the last 2 years, there’s been quite a brain drain at ImgTech, with them losing a number of really solid dev tech and driver engineers.

In iPhone camp at least they have to deal with only 1 brand of GPU…

Not quite.

Before 2017, sure: it was Imagination Tech’s PowerVR GPUs. Really solid hardware, drivers, dev tools, and developer support IMO.

Since then, read up on the Apple A11 Bionic GPU, iPhone 8+, and the new St. Albans, UK office Apple set up near ImgTech.

Utumno · June 26, 2018, 9:34am

I need to soften my criticism of ARM here; a very kind ARM engineer has answered (https://community.arm.com/graphics/f/discussions/10285/opengl-es-3-1-on-mali-t880-flashes ) my questions in the forum and confirmed that the observed behaviour was a bug in the ‘r12p1-03dev0’ version of the Mali driver. The bug has since been fixed - on r22 the flashing cannot be replicated any more. So answering your question, yes - another crazy driver bug that makes no sense yet.

However now, for a change, the guys in the Imagination forum went AWOL.