AMD Vulkan drivers have lots of very critical bugs that literally make Vulkan unusable on any AMD hardware

I have reported these bugs to AMD. They break any “not hello-world” application on AMD.

  1. The AMD driver uses 6 GB of RAM to load a 250 KB SPIR-V binary, which I found in this app
  2. AMD loop bug, which breaks this app
  3. AMD bug with any non-linear logic, which I found by chance just by sending 128 bytes of push constants

Spending days debugging some “big corporation” driver is something I absolutely do not like.
My “experience” of trying Vulkan is completely ruined for now.

This is not a complaint about Vulkan as a technology.

Your code is rather impressive. It would appear you managed to use shader logic to build bots that play tetris and whole card games.

However, it is important to note that building card games or tetris bots in shaders is… unusual. There’s a huge gulf between “‘not hello-world’ applications” and the kind of stuff you’re doing. So it’s not entirely unexpected (though obviously unwelcome) to see pathological behavior or other bugs from shader compilers.

Also, I find some of the specific claims you make in your bug reports to be unlikely to be as broadly true as you claim. For example:

using something more complicated then single linear function in AMD SPIR-V shader, with single exit(single return for whole logic) result UB or unpredictable result. Having any code after first single return function will result UB.

It seems unlikely that none of the multitude of applications that ship Vulkan shaders that run just fine on AMD hardware are doing something other than “single linear functions” “with single exit”. It’s pretty difficult to write an ubershader without some kind of conditional branching and probably multiple return statements.

That doesn’t mean these aren’t bugs or that they don’t need to be fixed. But you’re out on the cutting edge of what is possible in shaders; you’re not writing the kind of code that normally gets tested on these drivers. When you start feeding in 64,000-opcode SPIR-V binaries (assuming your 250Kb SPIR-V is stripped of text or other unnecessary decorations), pathological behavior from systems that are routinely tested with shaders with opcode counts an order of magnitude lower is not entirely unexpected (though obviously it should still be looked at and resolved).

Basically, what I’m saying is that you shouldn’t let this “completely ruin” anything for you. Pushing the boundaries of what is possible has its downsides; finding the broken parts of a system that otherwise works is one of them. Indeed, I would consider it a badge of honor to be the first one to find some corner-case in a shader compiler; it means you’re doing something nobody else has done.

Looking at the pattern of some of your bugs, it seems that AMD’s compiler is choking on circumstances that I would consider pathological code. Nested loops that only iterate once, conditional branches to control-flow operations when the condition is based on constant expressions, etc. These are all cases where some kind of peephole optimizer would step in and smooth out the code into something much more clear. In the case of your AMD_loop_bug example, calling the function ideally would be a no-op.

However, the complexity of figuring out that the function can boil down to nothing is non-trivial (well technically it’s quite trivial, since it returns nothing and has no side-effects, but never mind that now). So it stands to reason that AMD’s optimizer is probably the culprit here: it’s trying to untangle the logic of what you’re doing and getting confused, so it outputs garbage code that causes problems.

A good way to test this might be to run your code through an offline SPIR-V optimizer, then hand the results off to AMD’s compiler. A problematic optimizer might also be responsible for the pathological behavior of the compiler in your 250Kb SPIR-V example, though the general code generation algorithm could also be at fault.

Another thing I would be curious about is what happens if you take the GLSL code, compile it to SPIR-V for OpenGL and then run that code under OpenGL. That would tell you if the problem is in the SPIR-V compiler or somewhere deeper in the driver.

Thanks for the comment.

I did provide minimal code that is literally 5 lines long:

void main2()
{
    int i=0;
    if(i==0)return;
    else return;
    for (int i=0; i < 1; i++) {
    }
    return;
}

Add this function to any of your GLSL shaders and launch it on AMD Vulkan: it will crash your application in vkCreateGraphicsPipelines.

And I did provide a 100% valid application, valid_AMD_ub_test.zip; the code is unchanged from SaschaWillems/Vulkan/triangle.cpp.

My shaders trigger this bug “randomly”, with unpredictable results; as I said, this is because of UB in literally any complex logic.

This bug exists only on Windows, and only with Vulkan SPIR-V.
On Linux or in OpenGL, the same shaders on the same hardware work without any problems, as expected and the same as on other vendors’ GPUs.

It is a no-op only in OpenGL GLSL. In Vulkan it depends on the SPIR-V compiler. I use the glslang compiler to generate the SPIR-V code, and this compiler does not do pre-optimization and does not cut “unused blocks”.

Thanks,
but my shaders are not “that complex”.
I have only been writing them since 2018; other people have been writing much, much more complex shaders on Shadertoy since 2013 (I saw these dates myself).
So I mean the bugs I am talking about may have existed for years, and I do not know why they still exist…

All hardware released after 2015 (even integrated GPUs, I have launched it even there)
runs my shaders in OpenGL with no problems.
The oldest hardware I tested on is the Nvidia 8xxx series (I tested on the 8600 and 8800), which can run some of my 1k+ line GLSL shaders.

UE4 shaders or Nvidia filtering shaders are much more complex than my simple logic.

Are you saying that this is not “unusual” code? I certainly would think so, since it does nothing. People do not usually put 6-line no-op statements in their code.

Don’t get me wrong; it’s a great test for the bug. Because it does nothing, the compiler ought not to change a program’s behavior due to its presence. So if there is any apparent behavioral change, that represents a broken compiler. It’s a very easy way to test if the problem exists.

But that doesn’t disprove my point that your code is not normal shader code. That you’re doing things most people aren’t doing.

That’s not what I said.

There are two ways to build a shader in OpenGL: by giving OpenGL GLSL text directly, and by giving OpenGL a SPIR-V binary directly (presumably by pre-compiling it from some other language, like GLSL). You clearly have tried the former way. I was wondering what would happen if you tried the latter way.

I’m not sure what you’re trying to say here. When I said that it is a no-op, what I meant is that executing it should do nothing, and a sufficiently capable optimizer would recognize that and eliminate the code. I was speaking about the nature of the code, not of what a particular compiler might do to it.

You don’t know why they still exist because you rejected the explanation of why they still exist.

If you have a system that’s being used by vendors A, B, and C, and that system is working fine for the programs written by those vendors, and then vendor D comes along with a new program that breaks, that is still the system’s fault. However, the reason why that problem was never found was because vendor D is doing something different from what vendors A, B, and C are doing.

Basically, vendor D exercised some part of the system that vendors A, B, and C never did.

If you start from the assumption that vendor D isn’t doing anything differently from A, B, and C, then yes, it totally doesn’t make sense why those bugs haven’t been fixed. But since we know that vendors A, B, and C are not encountering these bugs (because then their code wouldn’t be working, and… well, just look at any published Vulkan-capable game; they work on AMD), then we know that this assumption is invalid.

And yet, they work on AMD’s Vulkan implementation. Yours don’t. So clearly, there’s something special going on in your code that is not going on in other people’s code.

And I would say that building a game in a shader is more complex than anything the UE4 shaders are doing. They’re just doing vertex T&L, doing some lighting computations, maybe some geometry culling. That’s all very simple stuff compared to what you’re doing.

My point is:
this minimal code uses 1 if, 1 loop and 3 exit points from the function.
If this function ruins a shader, then any “more complicated logic” will ruin a shader.
I extracted this minimal code from my big shaders, which do “logic”.

This is not a “no-op” bug, I think you get it.

And you can add “code that does something” to that logic and it will still be ruined on AMD: change the function type, return anything, and use some calculation in that logic…
I made absolutely minimal code to show that “it is broken at the logic level, not in data reading/writing”.

I understand that. I did not test it, and I won’t test it. Working as a “free beta tester” is not what I want to do.

I’m not an AMD developer, to dig that deep and debug actual hardware.

I made a changed version:

float main2(int a)
{
    int i=a;
    if(i==0)return 0.;
    else return 1.;
    for (int i=0; i < 1; i++) {
        a++;
    }
    return float(a);
}

The full shader is in triangle.frag;
download it with the test exe: valid_AMD_ub_test_1.zip

One more changed example,
the code from the loop bug:

float AMD_loop_bug(float a) {
    for (int k = 0 ; k < 1; k++) {
        for (int i = 0 ; i < 1; i++) {
            
            if(i>1000)break;
            
            for (int j = 0 ; j < 1; j++) {
                if (a<=0.5)return 1.;
            }
        }
        if (true)break;
    }
    return 0.;
}

The obvious breaking part in this code is if (true)break;
Obviously I do not use this in any of my code; I have complex rules that generate this true.
Removing if (true)break; from this example makes it work.

But in more complex logic, where if (from_some_complex_true_or_false)break; has real usage, the shader may work without crashes (like in Bug3) but generate a broken result, as I show in this screenshot,
or, like in Bug2, the shader works for the first 5 sec and then crashes the application, because the complex logic starts working after a 5 sec timer.

Making “minimal code” out of my code that works at first and breaks over time looks very hard, and it would still need lots of lines of GLSL logic that no one will ever read, making for a “too complex bug report”.
I made only minimal examples, that’s all, or full large shaders that show how these bugs occur in large shaders.

Also, I’m 100% sure that the logic in my shaders is correct, without any “known UB” like uninitialized variables or out-of-bounds array indices… or any wrong API usage like clamp(x,0,-1)… I don’t have any of that.
The only “UB that I use” is smoothstep(1.,0.,0.5), which is not allowed, but it does work even on AMD in Vulkan outside this shader, so this is not the “break point” in this code.

Because reporting “complex bugs” comes down to me needing to prove that my shaders do not have any “logical errors”,
and I have no idea how to prove that for 2000+ lines of GLSL (or C) code.

If “minimal 5-line code” is not enough, and no one wants to read and understand something complex, then I do not know how to report bugs.

And my point is that this is not a valid conclusion from the available evidence. Most Vulkan-based game engines have shaders that use “more complicated logic” than your examples here demonstrate. They have lots of loops. They have function calls. They have functions that return in multiple places.

And yet, none of them are having their shaders “ruined” by AMD’s compiler. To my knowledge, there are not a rash of complaints about AMD’s SPIR-V compiler. Therefore, the problem is not about how “complicated” the logic is; the root cause must be something else.

Again, that doesn’t mean that the driver isn’t buggy; your shaders definitely should work. My point is that all evidence points to the problem you have encountered being highly localized, while you have attempted to portray the problem as far more pervasive, as though it’s something that every programmer who uses AMD hardware will encounter.

Your new examples are further evidence that the bug is in the optimizer. Both of your new examples are extremely convoluted ways to do very simplistic algebraic operations. Your main2(int a) function is just a fancy way of writing a == 0 ? 0.0f : 1.0f, since everything after the if/else is unreachable. Your AMD_loop_bug(float a) is just a fancy way of writing a <= 0.5 ? 1.0f : 0.0f. Writing the simple form of both of those works, so why doesn’t the complicated form?

Furthermore, we know that the compiler works just fine with loops in general, with conditional branches in general, with break and return statements in general, etc. It can certainly compile all of the pieces of the code as is. So what is stopping the compiler from correctly compiling this specific set of pieces in this order?

When encountering this code, the compiler has two options. It can follow the logic exactly, or it can use optimization mechanisms to detect that there’s a much simpler way to do what the particular logic says. Since we know the compiler is able to make all of the pieces individually work just fine, it stands to reason that the flaw is that it tries to optimize the code and then produces garbage.

If the bug is in the optimizer, then what triggers the bug is the ability of the code to be optimized, and optimized in a very specific way. And this is probably why other people aren’t encountering the problem.

Most conditional logic in shader code is ultimately based on runtime-determined values. If you’re going to do a conditional break out of some loop, then the condition is probably going to be based on the index of the loop and some uniform value. If you’re going to conditionally call a function in a lighting routine, it’s probably going to be based on what material the current object is, which will be defined by a uniform or value fetched from a texture. And so forth.

All of the examples you’ve shown that exhibit the problem use logic that is almost entirely based on constant expressions. All of your loops start and end at a constant value. Almost all of your conditional break/returns are based on constant expressions.

If there’s a bug in a dead-code elimination optimizer, then it will only be found by code that has a lot of dead code in various places. And code that uses lots of constant expression conditions has a lot of dead code.

That’s why I would really like to see what would happen if you ran your SPIR-V through one of those optimizers before handing it off to AMD to compile.

https://www.shadertoy.com/view/3sXyW2 (this is the shader from Bug3 in the first post): using this code, change uint bits[32] to a push constant,
and on AMD you’ll see that screenshot.
If you keep it as a "shader constant uint bits[32]", everything will work.
For this shader, I think the compiler/optimizer should generate the “same” logic for both variants, whether uint bits[32] is a push constant or a shader constant.

Debugging these bugs is complicated, and I have already spent literally weeks on this. I already did all I can and all I know; continuing to debug it is only possible using some “low level stuff” like a SPIR-V optimizer, or trying another SPIR-V compiler… I can do it, but I do not want to do it.

The original main2() function was in a shader that works with bits and produces bugs on AMD, and I reduced it down to this debug shader.
So here is the original code of the main2() function:

float main2(in vec2 fragCoord)
{
    fragCoord=fragCoord/iResolution.xy;
    fragCoord=floor(vec2(256.,3.)*fragCoord)+0.5;
    ivec2 ipx=ivec2(fragCoord);
    ipx.y=int(vec2(256.,3.).y)-(ipx.y+1);
    uint ival=decodeval16(bits[ipx.x/8],(ipx.x/2)%4);
    int map[8];
    for (int i=7; i >= 0; i--, ival /= 2u) {
        map[i]=int(ival % 2u);
    }
    return float(map[1+ipx.y+(ipx.x%2)*4]);
}

It is almost the same as what I showed before, but it does not crash on “loading shaders”; it produces UB at run time, and the result is a broken image.

[1+ipx.y+(ipx.x%2)*4] is correct: ipx.y is 0 to 2 and (ipx.x%2)*4 is 0 or 4, so the result is:
1 at minimum, up to 1+2+4=7, and map[8] has size 8. The other code is also valid (I think, and I checked many times).

I can update with even more minimal code than the previous one:

float main2(in vec2 fragCoord)
{
    fragCoord=fragCoord/iResolution.xy;
    fragCoord=floor(vec2(256.,2.)*fragCoord)+0.5;
    ivec2 ipx=ivec2(fragCoord);
    if(ipx.x<0)return 0.; //just to be "sure"
    if(ipx.x<128)
    if((bits[ipx.x%2])==0x77777777u)return 1.;else return 0.; //bug
    else
    if(((ipx.x%2==0)?bits[0]:bits[1])==0x77777777u)return 1.;else return 0.; //no bug
    return 0.;
}

This runs with my launcher. My launcher does have validation errors, so you may not trust it (I’d say these validation errors should not have any effect on this bug, but I can be wrong).
Download test_AMD_shader_ub_min.zip; the shader sources are in shaders/src.

This code displays this image; AMD is the window on the right side of the screen, Nvidia is on the left:

This code shows why I wrote this:

using something more complicated then single linear function in AMD SPIR-V shader, with single exit(single return for whole logic) result UB or unpredictable result. Having any code after first single return function will result UB.

This UB is not triggered by conditional logic.
You can use array[pixel_id] in any other shader and it will work fine.
You can also use [pixel_id%2] in any other logic for indexing an array and it will work fine.

This is UB, and literally “any shader logic more complex than 1 return without loops” can be ruined/affected by this UB with an absolutely unpredictable result. There is no way to “fix it from code”, because I do not see any pattern in these bugs.

About the latest code, someone could say that "AMD does not allow multiplying its pixel_ID" and that this is why it does not work,

and that AMD only allows using a “shader-Const” as an index… but this is not true.

My own other shaders that do this (use pixel_index and change it during the code), even in very large code like that CardGame (see Bug1 in the first post), work on AMD, and other code like this shader does work on AMD without bugs.

… I do not know what else I need to write to prove that this bug exists, and that it is not a bug in my shaders or my Vulkan code…

I have updated my code; my launcher does not have validation errors now (earlier this code had just a single error that had no effect, as I said) (these links contain the valid code: test_AMD_shader_ub_min.zip test_AMD_shader_ub.zip).
So the bug still exists.

Also, I have checked compute shaders, and this bug exists in SPIR-V compute shaders on AMD…