Out var of vertex shader disconnected from in var of fragment shader?

Hello,

OpenGL ES 3.0 here. I’ve got a weird problem. The program I am writing works on some 50 different (mobile) devices I’ve checked, but recently it has been brought to my attention that the program does not work on Samsung Galaxy J4+ (Adreno 308). I’ve got access to this model and indeed, I can reproduce what has been reported.

Using the tactic of the ‘smallest code that reproduces the problem’ I’ve been shaving various pieces of code off and I’ve arrived at a very simple program which essentially displays a textured quad. Here’s what it does:

Vertex shader:

#version 300 es

// …
in vec2 a_TexCoordinate; // Per-vertex texture coordinate.
out vec2 v_TexCoordinate;

// …
void main()
{
// …
v_TexCoordinate = a_TexCoordinate;
// …
}

Fragment Shader:

#version 300 es

precision mediump float;

in vec2 v_TexCoordinate;
out vec4 fragColor;
uniform sampler2D u_Texture;

void main()
{
fragColor = texture(u_Texture,v_TexCoordinate);
}

Using those shaders, when run on any ‘normal’ device, my textured-quad-program displays the following:

Which is exactly what I expect - this is the input texture.

The very same program, when run on the ‘suspicious’ Galaxy J4+ however, displays the following:

Which is the texture moved by vec2(0.5,0.5) [ I’m using GL_CLAMP_TO_EDGE for texture wrapping ]

So it really looks like the J4+ somehow moves the texture coordinates by (0.5,0.5). I’ve then tried the following for the fragment shader:

#version 300 es

precision mediump float;

in vec2 v_TexCoordinate;
out vec4 fragColor;
uniform sampler2D u_Texture;

void main()
{
fragColor = texture(u_Texture, vec2(0.5,0.5) );
}

and that displays (on any device including the J4+)

Which is expected - this is the color of the middle of the texture. Then I’ve tried

Vertex shader:

#version 300 es

// …
in vec2 a_TexCoordinate; // Per-vertex texture coordinate.
out vec2 v_TexCoordinate;

// …
void main()
{
// …
v_TexCoordinate = vec2(0.5,0.5);
// …
}

Fragment Shader:

#version 300 es

precision mediump float;

in vec2 v_TexCoordinate;
out vec4 fragColor;
uniform sampler2D u_Texture;

void main()
{
fragColor = texture(u_Texture,v_TexCoordinate);
}

i.e. hardcoding all the texture coordinates to (0.5,0.5) in the vertex shader and passing those on to the fragment shader - the effect on any ‘normal’ device is expected (blue rectangle just like above), but on the J4+ it displays

i.e. again the input texture moved by (0.5,0.5).

So it would seem like on the J4+ the out ‘v_TexCoordinate’ variable of the vertex shader is completely disconnected from the in ‘v_TexCoordinate’ variable of the fragment shader.

How is that possible?

So if it does not take ‘v_TexCoordinate’ in the fragment shader from the ‘v_TexCoordinate’ in the vertex shader, then where does it take it from? I am pretty sure it takes it from the values of the vertex coordinates themselves. In case of the quad, they are (-0.5,-0.5) [lower left] to (0.5,0.5) [upper right].

Try incorporating the value of v_TexCoordinate itself in the output colour.

With the original vertex shader, what happens if you vary the values being fed to a_TexCoordinate? Do you get a +0.5 shift, or do you get a result which ignores the attribute value? Or something else?

Issue (mostly) worked-around.

So this turned out to be two separate issues, both IMHO in the Adreno 308 driver V@331 ( OpenGL ES 3.0 V@331.0 (GIT@e6de3b7, I22091d40c2) (Date:08/07/19) ).

First, when one uses Transform Feedback, this particular driver messes up the mappings between the corresponding ‘out’ vars of the vertex shader and ‘in’ vars of the fragment shader. This is why the ‘v_TexCoordinate’ in var of my fragment shader was mapped not to the corresponding ‘v_TexCoordinate’ out var of the vertex shader, but to another vertex shader out var declared two rows up.
Removing a call to ‘glTransformFeedbackVaryings()’ completely fixes the mappings.

Second, this driver only supports a UBO of max size 1920 bytes. Anything larger just disappears into the void, even though
glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE) returns 65536, i.e. 64k.

I managed to work around both issues, the first one completely, and the second one at the cost of some functionality. Now I am scratching my head over how to detect this bug programmatically and only switch on the workaround if it is actually needed.

Correction: after some more debugging it turns out that the nature of the 2nd bug is slightly different: it’s not max 1920 bytes per single UBO, but max 6000 bytes for all uniforms in the vertex shader.

Shaders compile, glGetError() keeps returning NO_ERROR, the whole thing runs, but if I allocate more than 6000 bytes for uniform variables in the vertex shader, my objects get drawn only partially or not at all.

Is there a way to query the driver for its limit on number of uniforms in the vertex shader?

Assuming GLSL-ES supports it (and if you haven’t already tried it), you might try specifying the transform feedback varying output specification in the shader:

Thanks DarkPhoton!

Actually the transform feedback part doesn’t bother me too much. I have managed to completely work around that.

A much bigger problem is the second bug: my objects only get partially (or not at all) drawn if I try allocating more than 6000 bytes for uniform variables in the vertex shader, and that includes the UBOs.

ES 3.0 is supposed to support at least 1024 components in the default block plus at least 16384 bytes (4096 components) for each uniform block, with a limit of at least 12 uniform blocks for the vertex shader.

Are you allowing for padding in calculating the size of uniform variables? UBOs align almost everything to the size of a vec4, which can result in a lot of padding for smaller types. In particular, a float[] will align each element to a four-word boundary, meaning that 75% of the space is padding.
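To make the padding arithmetic above concrete, here is a minimal C++ sketch of the std140 array-stride rule for arrays of scalars and vectors. The helper names are made up for illustration, and the rule as coded does not cover matrices or structs.

```cpp
#include <cstddef>

// std140 rule for arrays of scalars or vectors: each element occupies its own
// size rounded up to a multiple of 16 bytes (the size of a vec4).
// Hypothetical helper names; matrices and structs follow other rules.
constexpr std::size_t std140ArrayStride(std::size_t elementBytes) {
    return (elementBytes + 15) / 16 * 16;
}

// Total bytes taken by an array of `count` such elements.
constexpr std::size_t std140ArrayBytes(std::size_t elementBytes, std::size_t count) {
    return std140ArrayStride(elementBytes) * count;
}

// Share of the array that is padding, in whole percent.
constexpr std::size_t std140PaddingPercent(std::size_t elementBytes) {
    return 100 * (std140ArrayStride(elementBytes) - elementBytes)
               / std140ArrayStride(elementBytes);
}
```

With these, a float[] (4-byte elements) comes out at 75% padding, a vec2[] at 50%, and a vec4[] at 0%, which is why an array of vec4 wastes nothing.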

I have three UBOs in the vertex shader, each one composed entirely of one array of vec4’s:

layout (std140) uniform vUniformEffects
{
vec4 vEffects[3*NUM_VERTEX];
};

So no, there is no padding.

And yes, the ‘1024 components in the default block plus 16384 bytes (4096 components) for each uniform block’ thing holds true 99% of the time, except on the Adreno 308 driver V@331 ( OpenGL ES 3.0 V@331.0 (GIT@e6de3b7, I22091d40c2) (Date:08/07/19) ).

Given your finding on Adreno OpenGL ES drivers, I find this gem in the ANGLE project (main page link) test code very curious:

// Test to transfer a uniform block large array member as an actual parameter to a function.
TEST_P(UniformBlockWithOneLargeArrayMemberTest, MemberAsActualParameter)
{
    ANGLE_SKIP_TEST_IF(IsAdreno());

    constexpr char kVS[] = R"(#version 300 es
...
layout(std140) uniform UBO1{
    mat4x4 buf1[90];
} instance;

layout(std140) uniform UBO2{
    mat4x4 buf2[90];
};
...
})";
...
    ANGLE_GL_PROGRAM(program, kVS, kFS);
    EXPECT_GL_NO_ERROR();
}

Notice that this test would pass in UBO storage totaling 11,520 bytes, as well as the complete skip of this test if running on Adreno drivers. It would seem someone may have hit similar “UBO size” troubles with Qualcomm Adreno drivers…

Random thought. You might query the value for these limits:

  • GL_MAX_VERTEX_UNIFORM_COMPONENTS (== 1024? 1536?)
  • GL_MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS (== 197504?)

While only the 2nd is supposed to include total space for UBOs (OpenGL ES 3.0 Spec link), I wonder if the Adreno driver folks might have incorrectly treated the 1st as the limit including UBOs. If the value of GL_MAX_VERTEX_UNIFORM_COMPONENTS were 1536, that’d be pretty close to 6000 bytes (1536*4 = 6144).
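A small C++ sketch of that arithmetic, under the unconfirmed hypothesis that the driver applies GL_MAX_VERTEX_UNIFORM_COMPONENTS to default-block uniforms and UBOs together. The function names are hypothetical; at runtime the component count would come from glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, ...).

```cpp
#include <cstddef>

// One uniform "component" is a 32-bit value, i.e. 4 bytes.
constexpr std::size_t componentsToBytes(std::size_t components) {
    return components * 4;
}

// Conservative check under the hypothesis above (NOT what the spec says):
// treat the per-stage component limit as a budget for the default block
// and all vertex-stage UBOs combined.
constexpr bool fitsVertexUniformBudget(std::size_t defaultBlockBytes,
                                       std::size_t uboBytes,
                                       std::size_t maxVertexUniformComponents) {
    return defaultBlockBytes + uboBytes <= componentsToBytes(maxVertexUniformComponents);
}
```

For a reported limit of 1536 components this gives a 6144-byte budget, which is only close to the observed 6000-byte threshold, so a safety margin would still be needed.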

Also, I’d be remiss if I didn’t suggest you post a short repro for this bug to this Qualcomm Adreno forum:

Responses from knowledgeable Qualcomm folks are hit-and-miss. But you might get something useful from them out of it. Or another Adreno dev might follow up with the workarounds they’ve already found for this bug.

Hmm…

Their Vulkan guide contains the same recommendation (never mind that Vulkan GLSL doesn’t have “default uniform block uniforms”). And further they recommend also preferring UBOs “over push constants on Adreno hardware for performance reasons”:

I wonder how big the “hardware constant RAM” is on Adreno 308 GPUs/graphics drivers, and if it’s configurable…

Wild speculation:

In OpenCL land on Adreno (not necessarily the same)…

Hmm. I wonder if internally their GLSL compiler prefers to locate UBOs in Adreno GPU on-chip constant memory, but there’s a flaw in its falling back to off-chip system RAM when the total UBO space required is too large for constant memory.

Anyway…

If you can’t find some reasonable workaround for the “Max 6000 bytes total vertex uniform space incl UBOs” bug/feature on Adreno graphics drivers, the above at least suggests trying a fallback to using SSBOs instead. SSBOs may not suffer this same bug. And given the above, it sounds like it’s possible that if the UBO space required is > 3KB, the driver may punt the UBO out to off-chip system RAM anyway, which might render it roughly equivalent in access performance to an SSBO on Adreno drivers. That’s something to check anyway…

On the troublesome platform, we have

GL_MAX_VERTEX_UNIFORM_COMPONENTS = 1536
GL_MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS = 230400

So indeed, your suspicion might be true.

I have now managed to just about fit (with - I guess - about 50 bytes to spare) inside the ‘permitted’ 6000 bytes by basically converting things like

layout (std140) uniform vUniformProperties
{
vec4 vProperties[NUM]; // properties of the effect, x=name, y=unused, z=height, w=unused
};

to

layout (packed) uniform vUniformProperties
{
vec2 vProperties[NUM]; // properties of the effect, x=name, y=height
};

Now I need to dynamically figure out the stride in the array, and to make things worse I need the CPU-side of it already created before I compile the shaders, so I need to guess that the stride will be 8, allocate the CPU side, compile the shaders and figure out the stride, and re-allocate the CPU side if that turns out not to be 8.
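For the CPU side of this, a minimal C++ sketch of filling the buffer once the stride is known. It assumes the stride has already been obtained after linking (in ES 3.0 that would be glGetUniformIndices followed by glGetActiveUniformsiv with GL_UNIFORM_ARRAY_STRIDE) and that the array starts at offset 0 in the block. The Vec2 struct and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Vec2 { float x, y; };

// Lay out vec2 elements in a byte buffer using the array stride reported by
// the driver for the packed block. Hypothetical helper; assumes the array
// sits at offset 0 of the uniform block.
std::vector<unsigned char> layoutVec2Array(const std::vector<Vec2>& values,
                                           std::size_t strideBytes) {
    assert(strideBytes >= sizeof(Vec2));  // a stride smaller than a vec2 makes no sense
    std::vector<unsigned char> buffer(values.size() * strideBytes, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        std::memcpy(buffer.data() + i * strideBytes, &values[i], sizeof(Vec2));
    }
    return buffer;
}
```

Keeping the values in a plain Vec2 list and only building the byte buffer after the stride query avoids the guess-then-reallocate step: the same source data can be laid out with stride 8 or 16 as needed.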

But fortunately on the problematic Adreno driver the stride is equal to 8.

Interesting! Thanks for following up! I find that fascinating.

Yeah, sometimes dev’ing on OpenGL and OpenGL ES feels like divination, trying to guess what the heck is going on down there in the driver.

layout (packed) uniform vUniformProperties
{
vec2 vProperties[NUM];  // properties of the effect, x=name, y=height
};

Good idea.

You could also instead use std140, vec4, and div2 / mod2 math to locate the correct property. That’d avoid having to query offsets.
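As a rough illustration of the div2 / mod2 idea, here is a C++ sketch of the CPU-side packing, with a CPU mirror of the lookup the shader would do. The Property struct and function names are hypothetical. In the shader the equivalent lookup might read something like `vec4 p = vProperties[i >> 1];` followed by picking `p.xy` when `(i & 1) == 0` and `p.zw` otherwise (untested).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Property { float name; float height; };

// Pack two (name, height) pairs into each vec4 so that a std140 array of
// vec4 carries no unused components: even indices go to .xy, odd to .zw.
std::vector<std::array<float, 4>> packProperties(const std::vector<Property>& props) {
    std::vector<std::array<float, 4>> packed((props.size() + 1) / 2);  // zero-filled
    for (std::size_t i = 0; i < props.size(); ++i) {
        std::array<float, 4>& slot = packed[i / 2];
        const std::size_t base = (i % 2) * 2;  // 0 selects .xy, 2 selects .zw
        slot[base] = props[i].name;
        slot[base + 1] = props[i].height;
    }
    return packed;
}

// CPU mirror of the shader-side lookup, handy for checking the packing.
Property lookupProperty(const std::vector<std::array<float, 4>>& packed, std::size_t i) {
    assert(i / 2 < packed.size());
    const std::array<float, 4>& slot = packed[i / 2];
    const std::size_t base = (i % 2) * 2;
    return Property{slot[base], slot[base + 1]};
}
```

The packed array needs half as many vec4 slots as the original x=name, z=height layout, and the stride stays the std140-guaranteed 16 bytes, so nothing has to be queried from the driver.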

And it’s too bad UBOs in GL/GL-ES don’t support std430 packing, like they do in Vulkan. That’d be the cleanest solution to this problem.

This is, sadly, not the end of this saga.

The workaround from my last post (using a UBO of vec2s with a packed layout instead of an array of vec4s with std140 layout) got me into even hotter water as this, in turn, bumps into a bug on, again, Adreno 306 (and 308?) driver version ‘OpenGL ES 3.0 V@100.0 AU 005.01.00.115.128 (GIT I55c48cad9a)’ which is an even more popular combination than the last buggy couple, Adreno 308 with driver version V@331.

So after applying this workaround I started getting even more cryptic 1-star reviews from owners of said chipset+driver combination.

I have bought a device with Adreno 308 and driver version 331. Sadly, there appears to be no way I can programmatically detect the presence of this bug - on this device, when I allocate more than 6000 bytes for uniforms, everything appears to work correctly (shader compiles and runs, no errors whatsoever from glGetError() ) but the objects are not drawn correctly - only about half the pixels get drawn.

Things got to the stage where I now, on app startup, send to a central server several pieces of information, including the chipset, the driver version and the number of times the user on this particular device has completed a level. The assumption is: if someone completed a level, then this is pretty good proof that on this particular (chipset,driver) combination things do get rendered correctly. This way I have assembled a list of more than 600 (chipset,driver) combinations out there, and it looks like I now know the (almost complete) list of buggy combinations: Adreno 306 and 308 driver versions 269, 331, 415 appear to suffer from the ‘no more than 6000 bytes for uniforms in vertex shader’ bug, whereas driver version 100 suffers from the ‘buggy packed layout’ problem. Between 100 and 269 there are several other versions out there (140, 145), the status of which is unknown.
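Since the bug cannot be detected directly, one option is a lookup keyed on the driver number embedded in the GL_VERSION string. The following C++ sketch assumes the string carries a "V@<number>" marker as in the examples quoted in this thread, and uses only the version list from the telemetry above. The function and enum names are hypothetical, and versions with unknown status (e.g. 140, 145) are treated as unaffected, which is itself an assumption.

```cpp
enum class Workaround { None, CapVertexUniformBytes, AvoidPackedLayout };

// Extract the number that follows "V@" in a GL_VERSION string such as
// "OpenGL ES 3.0 V@331.0 (GIT@...)". Returns -1 if no such marker exists.
constexpr int adrenoDriverVersion(const char* s) {
    for (; *s != '\0'; ++s) {
        if (s[0] == 'V' && s[1] == '@') {
            s += 2;
            if (*s < '0' || *s > '9') return -1;
            int value = 0;
            while (*s >= '0' && *s <= '9') {
                value = value * 10 + (*s - '0');
                ++s;
            }
            return value;
        }
    }
    return -1;
}

// Map a driver number to a workaround. The list comes from the telemetry
// described above and is not exhaustive; unknown versions map to None.
constexpr Workaround workaroundFor(bool isAdreno30x, int driverVersion) {
    if (!isAdreno30x) return Workaround::None;
    switch (driverVersion) {
        case 269: case 331: case 415: return Workaround::CapVertexUniformBytes;
        case 100: return Workaround::AvoidPackedLayout;
        default:  return Workaround::None;
    }
}
```

The isAdreno30x flag would be derived from GL_RENDERER by the caller; how exactly that string is matched is left open here because its format is not shown in the thread.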

In general, Adreno 30x chipsets are terrible. No problems with PowerVR or Mali devices.

Yuck. What a mess.

Regarding the quality of the EGL / OpenGL ES driver implementations, that was my experience too a few years back. Poor on Qualcomm Adreno, but excellent on Imagination Tech PowerVR (no experience with ARM Mali).

ImgTech / PowerVR driver quality and dev-tech support was awesome, at least before Apple gutted them and the Chinese government bought them out. No clue how they are now. Adreno on the other hand? That’s another story…

Ranting a bit more -

Furthermore, this is not so simple as “there are Adreno drivers v. 100,140,145,269,331,415”.
There are actually many variants of each version of an Adreno driver out there.
Here’s a list of 37 driver variants that various phones with the Adreno 308 chipset come equipped with:

MariaDB [statistics]> select driver from statistics where chipset like '%308%' order by driver;

+--------------------------------------------------------------------------------------------+
| driver                                                                                     |
+--------------------------------------------------------------------------------------------+
| OpenGL ES 3.0 V@145.0 AU@ (GIT%40I83a540a04a)                                          |
| OpenGL ES 3.0 V@145.0 AU@ (GIT%40I762e720a6a)                                         |
| OpenGL ES 3.0 V@145.0 AU@ (GIT%40Ia3ef73d9d4)                                         |
| OpenGL ES 3.0 V@145.0 AU@ (GIT%40Ic4cf336e0a)                                         |
| OpenGL ES 3.0 V@145.0 AU@06.00.01.211.056+(GIT%40I48a9d37399)                          |
| OpenGL ES 3.0 V@145.0 AU@07.00.00.269.016+(GIT%40Id13463b0b3)                          |
| OpenGL ES 3.0 V@145.0 AU@07.01.01.269.023+(GIT%40Ib1167d03fb)                          |
| OpenGL ES 3.0 V@145.0 AU@07.01.01.269.038+(GIT%40Id13463b0b3)                          |
| OpenGL ES 3.0 V@145.0 AU@07.01.02.269.046+(GIT%40I09d312ff84)                          |
| OpenGL ES 3.0 V@145.0 AU@07.01.02.269.051+(GIT%40Ie4790512f3)                          |
| OpenGL ES 3.0 V@251.0 AU@08.00.00.312.030+(GIT%40Ie4790512f3)                          |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40I109c45a694)                                          |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40I26dffed9a4)                                          |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40I4a28d2d249)                                          |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40Ie4790512f3)                                          |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40I109c45a694)                                         |
| OpenGL ES 3.0 V@269.0 AU@ (GIT%40I7663a5f222)                                         |
| OpenGL ES 3.0 V@269.0 AU@08.00.00.312.044+(GIT%40I0b59f3a7cf)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.045+(GIT%40If99a9f7b1f)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.057+(GIT%40I4e9dba44e5)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.063+(GIT%40I4e9dba44e5)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.063+(GIT%40I77d3059488)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.074+(GIT%40I5ed696607a)                          |
| OpenGL ES 3.0 V@269.0 AU@08.01.00.312.083+(GIT%40I77d3059488)                          |
| OpenGL ES 3.0 V@331.0 (GIT@077c1bb%2c+I67e1628f4e) (Date 04/09/19)               |
| OpenGL ES 3.0 V@331.0 (GIT@35e467f%2c+Ice9844a736) (Date 04/15/19)               |
| OpenGL ES 3.0 V@331.0 (GIT@a3700f6%2c+Ia11ce2d146) (Date 11/10/20)               |
| OpenGL ES 3.0 V@331.0 (GIT@d8e170e%2c+I1db8f63a3f) (Date 09/26/19)               |
| OpenGL ES 3.0 V@331.0 (GIT@e6de3b7%2c+I22091d40c2) (Date 08/07/19)               |
| OpenGL ES 3.0 V@331.0 (GIT@f161b04%2c+I0380b38922) (Date 04/05/19)               |
| OpenGL ES 3.0 V@415.0 (GIT@3c77002%2c+I1807e7e5a1%2c+1609249121) (Date 12/29/20) |
| OpenGL ES 3.0 V@415.0 (GIT@5240b29%2c+I000594fe7d%2c+1589394739) (Date 05/13/20) |
| OpenGL ES 3.0 V@415.0 (GIT@7eb20a8%2c+I89e02c0951%2c+1589413919) (Date 05/13/20) |
| OpenGL ES 3.0 V@415.0 (GIT@9672191%2c+If703410f3a%2c+1593771752) (Date 07/03/20) |
| OpenGL ES 3.0 V@415.0 (GIT@d39f783%2c+I79de86aa2c%2c+1591296226) (Date 06/04/20) |
| OpenGL ES 3.0 V@415.0 (GIT@d4595eb%2c+I1f9128a98a%2c+1591023202) (Date 06/01/20) |
| OpenGL ES 3.0 V@415.0 (GIT@f7df46e%2c+Ie3bb699d95%2c+1605115147) (Date 11/11/20) |
+--------------------------------------------------------------------------------------------+

Only now I noticed that there’s also a v. 251! (with only one variant)

Some of those variants have only a few users who have never completed a level, but I have no idea if this is because those few installed the app, took a brief look and immediately uninstalled it because they didn’t like the app (likely) or maybe there’s an actual problem with rendering with this particular variant.

Record holder, Adreno 505, comes with (so far) 52 different driver variants.

In contrast, PowerVR devices tend to come with only a few (2-3) driver variants per chipset; Mali with a bit more, about 3-5. A typical Adreno chipset has 10 to 20 driver variants.

Hmmm. Well, at some point you may just decide it’s just not worth all the dev time/effort trying to band-aid the buggy UBO support in Qualcomm Adreno GL-ES drivers. So despite Qualcomm’s Performance Recommendations to prefer UBOs over SSBOs…

On their GPUs, you either just:

  1. Use SSBO(s), or
  2. Use texture(s) (TBO, 2D, 1D, etc.)

instead to send that uniform data to the shader … assuming they don’t exhibit the bug too! I mean if their driver guys are gonna make it so hard to get a working app with UBOs across many of their driver versions, they can just perform more poorly in app benchmarks against other GPUs!

The switch to SSBOs should be fairly straightforward, at least in terms of obtaining a work-alike. Plus you’ll be able to use std430 packing. And in fact, given the above findings, we have every reason to suspect that you may get the same driver perf with an SSBO > 6000 bytes as you do with a UBO > 6000 bytes (or 3000 bytes) on some/all Adreno GPUs.

If you haven’t already, you might post this as a tentative plan to that Qualcomm Adreno forum and see what feedback you get. Perhaps they’ve got a better band-aid solution for their GPUs/drivers. If nothing else, you’ll save someone else a lot of trouble when they hit this.


Thinking about your problem and UBOs vs. SSBOs gives me another idea what the driver might be botching up here. Are you by chance submitting updates to the contents of the UBO(s) you’re reading in the shader over the course of your frames? Or is the content of the UBOs static across all (or many) frames?

If the former, try making them static – i.e. don’t change the contents. If that “fixes” your problem, then try multibuffering them. That might give you fixed behavior as well.
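A minimal C++ sketch of what multibuffering could look like on the app side: a small ring of buffer object names, with a different one written each frame so the CPU never overwrites a buffer the GPU may still be reading. The class name is hypothetical; the ids would come from glGenBuffers, and each frame the returned id would be bound (e.g. with glBindBufferBase(GL_UNIFORM_BUFFER, ...)) and filled with that frame's data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Cycles through a fixed set of buffer object names, one per frame.
// Hypothetical helper; the ring size (2 or 3 is typical) is the app's choice.
class UniformBufferRing {
public:
    explicit UniformBufferRing(std::vector<unsigned> bufferIds)
        : ids_(std::move(bufferIds)) {
        assert(!ids_.empty());
    }

    // Returns the buffer to write and bind for this frame, then advances.
    unsigned nextBuffer() {
        const unsigned id = ids_[cursor_];
        cursor_ = (cursor_ + 1) % ids_.size();
        return id;
    }

private:
    std::vector<unsigned> ids_;
    std::size_t cursor_ = 0;
};
```

Whether this changes anything on the affected Adreno drivers is exactly the experiment being suggested; the sketch only shows the bookkeeping.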

Why I suggest that: Buffer updates are tricky for an OpenGL/OpenGL ES implementation to handle, because while you have the CPU updating their contents, you have the GPU/back-end driver wanting to render with the “old” contents 1-N updates ago. This is especially an issue on mobile GPUs where the whole architecture pretty much relies on fragment work being done 1-3 frames “after” the CPU submits it. The OpenGL / OpenGL ES interface completely abstracts this as an issue, but the driver has to deal with it somehow.

Some GL-ES drivers just punt the whole issue to the app and block your CPU draw thread if it tries to update a buffer object for which there is an unexecuted GPU read-request in the pipe (ImgTech PowerVR drivers do this, or at least did). But in this case, maybe Qualcomm Adreno drivers should be doing this, but instead are yielding buggy behavior (?) Worth considering.

(Related: Vulkan punts this whole issue off to the app developer, making the app deal explicitly with the multibuffering and synchronization to get properly rendered results. This is really the way it should be, because the driver doesn’t know enough to get the best results in all use cases.)

I am very wary of introducing SSBOs into this code path.

My code is split into two parts: a graphics library and some apps. The library already uses SSBOs for some effects, and there are problems with those (I’ve opened threads in this very forum about them - here’s one: https://community.khronos.org/t/flashes-on-arm-mali/). Generally speaking my impression is that there are even more driver problems with SSBOs than with UBOs.

The app I am talking about in this thread does not use the effects which require an SSBO.

Maybe indeed a 2D texture would be an answer here. Or your original suggestion: i.e. rather than

layout (packed) uniform vUniformProperties
{
vec2 vProperties[NUM];  // properties of the effect, x=name, y=height
};

as a workaround to save uniform var space, keep using

layout (std140) uniform vUniformProperties
{
vec4 vProperties[NUM]; // properties of the effect, x=name, y=unused, z=height, w=unused
};

but with ‘NUM’ halved, and do mod2 math to figure out the proper var in the shader.

Problem is, this computation would be in a rather critical part of the vertex shader, and vertex shading already is the slowest part of the whole thing. Such mod2 math would, I feel, knock off a few digits from the FPS.

If your feeling doesn’t come from actually profiling it, then you should ignore your feeling until you have actual profiling data.