NVIDIA releases OpenGL 4.3 beta drivers

Please send Piers or me a private message so we can take care of this and file a bug.

(note beside, our developer forums were recently compromised by a third party which might explain why your account is not working anymore)

[QUOTE=Brandon J. Van Every;1241294]I installed NVIDIA beta driver 304.32 on Lubuntu 12.04 Linux using the xorg-edgers PPA so that I wouldn’t have to fool with .run scripts and turning off X servers. On my laptop I have an old GeForce 8600M GT card, which is OpenGL 3.3 class HW. The NVIDIA OpenGL driver page says, “For OpenGL 3 capable hardware, these new extensions are provided:” and gives a list of 18 extensions. Unfortunately none of them are shown as available by either glxinfo or nvidia-settings.

Anyone have OpenGL 3 class HW that’s showing the new extensions, on either Windows or Linux?[/QUOTE]

I think if you installed a driver with version 304.32 you probably didn’t use the OpenGL 4.3 beta driver, which is version 304.15.00.02 for Linux. See here for the driver location:
http://www.nvidia.com/content/devzone/opengl-driver-4.3.html

Found another small one. I’m not sure when this crept in, but I was previously using 295.49 without a hitch. I have a vertex shader with the following outputs:

flat out ivec4 pickID;
     out float pickZ;

When I call glGetProgramiv( pid, GL_TRANSFORM_FEEDBACK_VARYING_MAX_LENGTH, &max_len) on the parent program of the vertex shader (no other shader stages), it sets max_len = 6. According to the GL spec it should return the length of the largest string including the null terminator, in this case 7. When I then pass max_len to glGetTransformFeedbackVarying() as the bufSize parameter, ‘pickID’ is cut off to ‘pickI’ (which is expected if bufSize==6) and this messes up further rendering.

In the meantime I’ve accounted for the null-terminator by adding one to max_size. I can afford the extra byte :slight_smile:

[QUOTE=malexander;1241428]
When I call glGetProgramiv( pid, GL_TRANSFORM_FEEDBACK_VARYING_MAX_LENGTH, &max_len) on the parent program of the vertex shader (no other shader stages), it sets max_len = 6. According to the GL spec it should return the length of the largest string including the null terminator, in this case 7. [/QUOTE]

I’ve confirmed that this is a recent regression, and might only be in the OpenGL 4.3 beta driver. I’ll try to get this into the next beta driver.

You’re also welcome to send me bug reports. My NVIDIA email ID is the same as my user ID on these forums.

pbrown:

I have an issue with ComputeShader, Buffer and the usage hint of Shader Buffer:

I’m pretty sure I’m doing ugly stuff, but they behavior of the code changes totally depending on the usage hint of the buffer.

Should i send you an email?

[QUOTE=Piers Daniell;1241407]I think if you installed a driver with version 304.32 you probably didn’t use the OpenGL 4.3 beta driver, which is version 304.15.00.02 for Linux. See here for the driver location:

When I installed the exact driver version 304.15.00.02, I did get the 18 new extensions for my 3.x class HW. Unfortunately it also sent me up a driver installation and package update learning curve that destabilized my system. Tried to use Duplicity to restore it and it totally failed, although I didn’t lose any personal data. I don’t recommend the xorg-edgers stuff. They don’t actually have the beta driver, and all their extra package junk is pretty much what destabilized my system. Hopefully I’ll have better luck working from mainstream Lubuntu 12.04. Got a fresh system now. Once I’ve found a more reliable system backup program to use, I’ll try the NVIDIA .run install again, and then hopefully have a viable development system.

[QUOTE=guibou;1241454]pbrown:

I have an issue with ComputeShader, Buffer and the usage hint of Shader Buffer:

Should i send you an email?[/QUOTE]

Someone already spotted that, and Piers root-caused and fixed the bug. Thanks for the helpful report and reproducer. If you have more issues that you suspect are likely driver things, feel free to shoot me an email.

Pat

EDIT: Something I thought was actually related to this driver was actually present in an earlier driver, making a seperate post

First is Nvidia monitoring OpenGL 4.3.0.0 sample pack by g-truc and aware of three samples not working…

I’m concentrating on two I’m more interested first in 430-image_store sample I get I think and error in assembly
line 22, column 5: error: supported only on load, store, and atomic instructions
which is:

IMQ.COH R0.xy, images[R0.x], 2D;

this is related to new function

ivec2 Size = imageSize(Diffuse);

in shader which I fix removing coherent in image definition

layout(binding = 0, rgba8) coherent uniform image2D Diffuse; //dont work

to this:

layout(binding = 0, rgba8) uniform image2D Diffuse; //it works!!

Full details (saying no where it fails) and patch applied:

#version 420 core
#extension GL_ARB_shader_image_size : require

#define FRAG_COLOR        0
#define DIFFUSE            0

in vec4 gl_FragCoord;
//layout(binding = 0, rgba8) coherent uniform image2D Diffuse; //dont work
layout(binding = 0, rgba8) uniform image2D Diffuse; //it works!!


layout(location = FRAG_COLOR, index = 0) out vec4 Color;

const int Border = 8;

void main()
{
    ivec2 Size = imageSize(Diffuse);

    if(gl_FragCoord.x < Border)
        Color = vec4(1.0, 0.0, 0.0, 1.0);
    if(gl_FragCoord.x > Size.x - Border)
        Color = vec4(0.0, 1.0, 0.0, 1.0);
    if(gl_FragCoord.y < Border)
        Color = vec4(1.0, 1.0, 0.0, 1.0);
    if(gl_FragCoord.y > Size.y - Border)
        Color = vec4(0.0, 0.0, 1.0, 1.0);
    else
        Color = imageLoad(Diffuse, ivec2(gl_FragCoord.xy));
}

Fragment info

Internal error: assembly compile error for fragment shader at offset 611:
– error message –
line 22, column 5: error: supported only on load, store, and atomic instructions
– internal assembly text –

!!NVfp5.0
OPTION ARB_shader_image_size;
OPTION NV_shader_atomic_float;
# cgc version 3.1.0001, build date Aug  8 2012
# command line args: 
#vendor NVIDIA Corporation
#version 3.1.0.1
#profile gp5fp
#program main
#semantic Diffuse : IMAGE[0]
#var float4 Color : $vout.COL0 : COL0[0] : -1 : 1
#var int Diffuse.__remap :  : c[0] : -1 : 1
#var float4 gl_FragCoord : $vin.WPOS : WPOS : -1 : 1
PARAM c[1] = { program.local[0] };
TEMP R0;
TEMP RC, HC;
IMAGE images[] = { image[0..7] };
OUTPUT result_color0 = result.color;
MOV.S R0.x, c[0];
SLT.F R0.z, fragment.position.x, {8, 0, 0, 0}.x;
TRUNC.U.CC HC.x, R0.z;
IMQ.COH R0.xy, images[R0.x], 2D;
IF    NE.x;
MOV.F result_color0, {1, 0, 0, 0}.xyyx;
ENDIF;
ADD.S R0.x, R0, -{8, 0, 0, 0};
I2F.S R0.x, R0;
SGT.F R0.x, fragment.position, R0;
TRUNC.U.CC HC.x, R0;
IF    NE.x;
MOV.F result_color0, {0, 1, 0, 0}.xyxy;
ENDIF;
SLT.F R0.x, fragment.position.y, {8, 0, 0, 0};
TRUNC.U.CC HC.x, R0;
IF    NE.x;
MOV.F result_color0, {1, 0, 0, 0}.xxyx;
ENDIF;
ADD.S R0.x, R0.y, -{8, 0, 0, 0};
I2F.S R0.x, R0;
SGT.F R0.x, fragment.position.y, R0;
TRUNC.U.CC HC.x, R0;
IF    NE.x;
MOV.F result_color0, {0, 1, 0, 0}.xxyy;
ELSE;
TRUNC.S R0.xy, fragment.position;
MOV.S R0.z, c[0].x;
LOADIM.U32.COH R0.x, R0, images[R0.z], 2D;
UP4UB.F result_color0, R0.x;
ENDIF;
END
# 31 instructions, 1 R-regs

Also two questions before second bug report…
1.is multi_draw_indirect fully supported in current Nvidia HW (and Fermi) without CPU support in driver or is implemented some part in software I say that because this originated from AMD extension so seems AMD really supports in HW…
2.seems fbo_no_atachments extensions is equivalent to Dx11.1 TIR target independant rasterization if so how it’s possible implementable in Nvidia first gen d3d11 HW (fermi) if I read on some sites that TIR was just one thing current D3D 11 HW doesn’t supported… what are the differences then?
Second sample is in 430-compute shader sample seems I get fixed by removing readonly and writeonly qualifiers from storage buffer and then fixing a bug by g-truc
ie changing

Out.Ouput[gl_GlobalInvocationID].Texcoord = In.Input[gl_GlobalInvocationID].Texcoord;

by

Out.Ouput[gl_GlobalInvocationID.x].Texcoord = In.Input[gl_GlobalInvocationID.x].Texcoord;

so seems
this should work but doesn’t work

layout(binding = BUFFER_INPUT) readonly buffer iBuffer
{
    vertex Input[];
} In;

layout(binding = BUFFER_OUTPUT) writeonly buffer oBuffer
{
    vertex Ouput[];
} Out;

i get:
Compute info

0(22) : error C0000: syntax error, unexpected identifier, expecting "::" at token "iBuffer"
0(27) : error C0000: syntax error, unexpected identifier, expecting "::" at token "oBuffer"
 and fixing it by ex:

//layout(binding = BUFFER_INPUT) readonly buffer iBuffer
layout(binding = BUFFER_INPUT)  buffer iBuffer //works!

At least for now with this fix it works!

One question is compute shader stage being exposed in Nvidia Cg language (where to ask?) similar to tesselation stages where added in Cg 3.0… also if not, really another question is if Nvidia Cg compiler cgc.exe will at least get support for that so we can compile GLSL compute shaders to NV_compute_shader assembly similar to which I get from driver when assembly errors occur (see below)…
So this is really the kind of bug in compute shader that I’m afraid Nvidia will not fix for security/hanging kind of thing:
I was happy to see I could port Nvidia matmul CUDA sample to OGL compute shader without much effort basically doing two things:
*Adding defines:

#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
#define threadIdx gl_GlobalInvocationID
#define __shared__ shared

and changing all int vars in CUDA kernel to uint vars (another solution would have been to cast gl_WorkGroupID and gl_GlobalInvocationID to int)

2.and another one was to put out shared mem definitions out of kernel to be in global scope

__shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];

Then I was happy code to be able to be free of errors but a shame is in compilation:

Internal error: assembly compile error for compute shader at offset 108543:
-- error message --
line 2094, column 6:  error: BAR not allowed inside flow control blocks.
line 2108, column 6:  error: BAR not allowed inside flow control blocks.

First let’s be clear this code works in CUDA and recommended by Nvidia for it’s GPUs… second it’s using a fine detail in GPUs in that barriers can be in control flow assuming all warps in block execute follow same path in control flow…
why not in OGL compute shaders?..
also see the “flow control” is cause by this loop:

for (uint a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)

not a branch and even more is executed same iterations for every thread but it’s not trivial in the form stated to infer by compiler so I thought hey fix it i changed to (in my sample matrix w=h=1024 and block size=32 so iters=1024/32):

uint a = aBegin, b = bBegin;
    for (int i=0; i<32; i++)
    {
..loop body containing barriers

a += aStep, b += bStep
}

even with this I get same error so unique way to fix is to unroll loop manually which is ugly at least… I think even I added #pragma unroll to loop to force programatically to expand the compiler for me but not working … there is a Nvidia way to force unroll like in CUDA with #pragma unroll
Full shader:

#version 430 core
#extension GL_ARB_compute_shader : require
#extension GL_ARB_shader_storage_buffer_object : require
#extension GL_ARB_gpu_shader_fp64: require

#define REAL float
//#define REAL double


#define BUFFER_A    0
#define BUFFER_B    1
#define BUFFER_C    2

#define BLOCK_SIZE 32


layout (local_size_x = BLOCK_SIZE, local_size_y = BLOCK_SIZE) in;

layout(binding = BUFFER_A) buffer iBuffer
{
    REAL A[];
} ;

layout(binding = BUFFER_B) buffer oBuffer
{
    REAL B[];
} ;

layout(binding = BUFFER_C) buffer oBuffer
{
    REAL C[];
} ;


//matrixMul(float *C, float *A, float *B, int wA, int wB)
#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
//memoryBarrierShared();
#define threadIdx gl_GlobalInvocationID
#define __shared__ shared

#define wA 1024
#define wB 1024
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]

        // Declaration of the shared memory array As used to
        // store the sub-matrix of A
        __shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs used to
        // store the sub-matrix of B
        __shared__ REAL Bs[BLOCK_SIZE][BLOCK_SIZE];

void main()
{    

    // Block index
    uint bx = blockIdx.x;
    uint by = blockIdx.y;

    // Thread index
    uint tx = threadIdx.x;
    uint ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    uint aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    uint aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    uint aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    uint bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    uint bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    REAL Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    /*for (uint a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)*/
    uint a = aBegin, b = bBegin;
    for (int i=0; i<32; i++)
    {


        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    
        a += aStep, b += bStep;
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    uint c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

Full assembly:

Compute info
------------
Internal error: assembly compile error for compute shader at offset 108543:
-- error message --
line 2094, column 6:  error: BAR not allowed inside flow control blocks.
line 2108, column 6:  error: BAR not allowed inside flow control blocks.
-- internal assembly text --
!!NVcp5.0
OPTION NV_shader_storage_buffer;
OPTION NV_shader_atomic_float;
GROUP_SIZE 32 32;
# cgc version 3.1.0001, build date Aug  8 2012
# command line args: 
#vendor NVIDIA Corporation
#version 3.1.0.1
#profile gp5cp
#program main
#semantic As : SHARED
#semantic Bs : SHARED
#semantic oBuffer : SBO_BUFFER[2]
#semantic iBuffer : SBO_BUFFER[0]
#semantic oBuffer : SBO_BUFFER[1]
#var float Bs[0][0] :  : shared_mem[8188] : -1 : 1
# lots more
#var float Bs[31][31] :  : shared_mem[8188] : -1 : 1
#var uint3 gl_GlobalInvocationID : $vin.GBLID : GBLID[4] : -1 : 1
#var uint3 gl_WorkGroupID : $vin.CTAID : CTAID[2] : -1 : 1
#var float C[0] :  : sbo_buffer[2][0] : -1 : 1
#var float A[0] :  : sbo_buffer[0][0] : -1 : 1
#var float B[0] :  : sbo_buffer[1][0] : -1 : 1
SHARED_MEMORY 8192;
SHARED shared_mem[] = { program.sharedmem };
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0, R1;
MOV.F R0.y, {0, 0, 0, 0}.x;
MUL.U R0.z, invocation.groupid.y, {32768, 0, 0, 0}.x;
MUL.U R0.w, invocation.groupid.x, {32, 0, 0, 0}.x;
REP.S {32, 0, 0, 0};
MAD.U R1.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.w;
MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.z;
ADD.U R0.x, invocation.globalid, R0;
MUL.S R1.y, R0.x, {4, 0, 0, 0}.x;
ADD.U R1.x, invocation.globalid, R1;
MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
MOV.U R1.z, R1.y;
MAD.S R1.y, invocation.globalid.x, {4, 0, 0, 0}.x, R0.x;
LDB.F32 R0.x, sbo_buf0[R1.z];
MOV.U R1.y, R1;
STS.F32 R0, shared_mem[R1.y];
MUL.S R1.x, R1, {4, 0, 0, 0};
MOV.U R0.x, R1;
LDB.F32 R0.x, sbo_buf1[R0.x];
STS.F32 R0, shared_mem[R1.y + 4096];
BAR  ;
MOV.S R1.y, {0, 0, 0, 0}.x;
REP.S {32, 0, 0, 0};
MUL.S R1.x, R1.y, {128, 0, 0, 0};
MAD.S R1.x, invocation.globalid, {4, 0, 0, 0}, R1;
MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
MAD.S R0.x, R1.y, {4, 0, 0, 0}, R0;
MOV.U R1.z, R1.x;
MOV.U R1.x, R0;
LDS.F32 R0.x, shared_mem[R1.z + 4096];
LDS.F32 R1.x, shared_mem[R1.x];
MAD.F R0.y, R1.x, R0.x, R0;
ADD.S R1.y, R1, {1, 0, 0, 0}.x;
ENDREP;
BAR  ;
ADD.U R0.z, R0, {32, 0, 0, 0}.x;
ADD.U R0.w, R0, {32768, 0, 0, 0}.x;
ENDREP;
MUL.U R0.x, invocation.groupid, {32, 0, 0, 0};
MAD.U R0.x, invocation.groupid.y, {32768, 0, 0, 0}, R0;
MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0;
ADD.U R0.x, R0, invocation.globalid;
MUL.S R0.x, R0, {4, 0, 0, 0};
MOV.U R0.x, R0;
STB.F32 R0.y, sbo_buf2[R0.x];
END
# 44 instructions, 2 R-regs

From post below I want to ask some questions/suggestions:
*Would be good similar to Nvidia exposing NV_CG_shader extension for allowing “some” Cg functions (and syntax?) in GLSL code if exposed some call NV_CUDA_shader extension exposing to relaxation things in compute shaders to allow easier porting of CUDA kernels to GLSL (I’m smoking crack?):
first defines for all CUDA grid, threadid syntax and the likes having GLSL equivalents like:
#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
#define threadIdx gl_GlobalInvocationID
#define shared shared
and defining like int and not uint like intrinsics in GLSL for mentions mentioned in previous post for portability questions
and second allow shared definitions inside compute shader functions which compiler will put out internally if needed…
with that these matrix mul code could be ported immediately!..
*Specification doesn’t inform about any restriction on usage of barriers inside control flow or restrictions in that case similar to CUDA for threads to follow same path reaching barrier… it’s this going to be fixed in spec or is IHV dependant…
if some restrictions apply in spec Nvidia could allow similar relaxation as done in CUDA by an extensions GL_NV_barriers_in_control_flow or similar to inform of that support…
*Also sorry but what is Directcompute restrictions in that case? there are any?
*Finally allow to programatically exposing work group size like by requiring local_size_x variables not to be constant like (new extension like GL_NV_uniform_local_size)

uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;

or better new function
DispatchComputeSetLocalSize(globx,globy,globz,locx,locy,locz)
equivalent of CUDA support
kernelname<<globalsize,localsize>>(…)
I know that DirectCompute doesn’t support exposing local group size but CUDA yes…

Please answer questions!!

In the future, please use [ code ] blocks to format pieces of code in your posts.

I have fixed all using

.. thanks alfonse..

Another bad news altough this would be Nvidia only anyway for now: pointers on compute shaders not working on Nvidia..

I have tried to answer another question towards easy portability of GPGPU codes in GLSL compute shader:
using pointers in compute programs! of course using GL_NV_shader_buffer_load..
if I not enable parser correctly points:

0(47) : error C7531: pointers require “#extension GL_NV_shader_buffer_load : enable” before use


but if I enable I get:

0(13) : warning C7547: extension GL_NV_shader_buffer_load not supported in profile gp5cp


but GL_NV_shader_buffer_load is not supported in every stage? why not in compute?!:mad:
so Nvidia this is a limitation of implementation for now or forever (well more or less :)).. anyway this is supported on CUDA so surely supported in GPU HW for ages..

Hi oscar,
ran into the barrier problem during development as well, currently it seems like an oversight in the spec itself, if you look into 4.3 GLSL :

The barrier() function may only be placed inside the function main() of the tessellation control shader
and may not be called within any control flow. Barriers are also disallowed after a return statement in the
function main(). Any such misplaced barriers result in a compile-time error.

This hasn’t been specced in detail for compute yet. Ideally spec and implementation can be relaxed accordingly!

In general things are in an early stage yet, it will take a couple revisions until the implementation has all the features it should have.

Regarding your twitter on NV_shader_buffer_load, a quick compile test using

#extension GL_NV_gpu_shader5 : enable

did not give any errors when declaring some pointers. Although use of pointers on shared memory currently won’t be possible

Hi oscarbg

Thanks for the bug reports. We have looked at the g-truc OpenGL 4.3 samples and have a few driver fixes pending. There are also a few bugs in the OpenGL 4.3 samples, which I have reported to g-truc.

  1. The issue with calling imageSize() on an image declared with “coherent” has been fixed in the driver. The beta will be updated soon with this fix. Until then you can workaround the issue by removing “coherent” from the declaration.

  2. Yes, multi_draw_indirect is fully hardware accelerated on Fermi and above GPUs.

  3. fbo_no_attachments just allows an FBO to be “complete” without any attachments. I don’t think it’s equivalent to DX11 TIR.

  4. The “readonly” and “writeonly” memory qualifiers on buffer declarations is broken in the driver. We’re working on a fix for this. For now just remove these qualifiers.

  5. Some of your later questions might be better served in the “Official feedback on OpenGL 4.3 thread”.

Thanks for your feedback.

cheers
Piers

Sorry I’m behind on looking at these forum updates.

The error with #extension GL_NV_shader_buffer_load is a compiler bug that I was able to reproduce. The fix was an oversight in the logic handling “what #extension” directives are supported in which shader types/architectures. I’ll work on a fix for the next beta driver update.

As Christoph Kubisch mentions, one workaround for this is to enable the extension GL_NV_gpu_shader5, which also includes the ability to store to pointers. GL_NV_shader_buffer_load only allows pointer reads.

Many thanks for responses Christoph and Pat!
some updates on my findings:
First I think that comment about not currently possible to use codes that have barriers with control flow is correctly to be posted here because if not allowed in spec it’s here the place to motivate Nvidia to relax this restriction…
Reasons to allow it are big first and foremost:
1.D3D11 compute shader allows it!.. sorry to be pedantic but OGL ARB can’t claim OGL 4.3 is a superset of D3D11…
for example I have found in D3D SDK ComputeShaderSort11 sample in kernel BitonicSort
this code

for (unsigned int j = g_iLevel >> 1 ; j > 0 ; j >>= 1)
    {
        unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];
        GroupMemoryBarrierWithGroupSync();
        shared_data[GI] = result;
        GroupMemoryBarrierWithGroupSync();
    }

well this code can’t be without control flow which I verified getting compiled DX IL assembly by fxc which has barriers inside “loop” in DX IL parlance keyworks… even more I compiled down to AMD 5850 ISA using AMD GPU shader analyzer and yes barriers is there between control flow…
2. Even in D3D 11 don’t allowed it, lots of basic GPGPU algorithms are fully optimized in this conditions like sorts, perhaps even scans and many more I think searching CUDA code over web will grasp the importance of this support in CUDA…
3. I have found that AMD is going to allow it! well details follow…
by using some leaked AMD drivers (9.00) and headers about it I found about AMD_compute_shader in fglrx drivers and seeing inside driver binary searching which keywords are implemented from AMD_compute_shader and which ones from ARB_compute_shader I have compiled successfully a compute shader having barrier inside control flow impossible to remove (the control flow) from the compiler… I can post full code and binary to check if you dont’ believe…

*Allow selective unroll of loops similar to D3D [unroll] (i.e. better control over #pragma optionNV(unroll all))
Motivation:
Another question is well matmul code after all has the loop with fixed iters (assuming tuned for a specific matrix dimensions) so we can unroll I have found using #pragma optionNV(unroll all) solves the problem and compiles the kernel BUT produces big binary so question there is a pragma in NV compiler to allow unrolling of selected loops similar to D3D [unroll]… hey that’s another place D3D11 CS are better than OpenGL… with that support I could unroll outer loop of matmul code posted before and the loop between barriers don’t be unrolled…

I have found using GL_NV_gpu_shader5 works perfectly… and allows using without less massage more cuda codes… I found happily that using OpenCL codes (CUDA code also use it) that pass shared mem pointers as function args compile correctly up to assembly compile that founds
something like a LDS.? line and returns with error…
So I have hope with your comment “Although use of pointers on shared memory currently won’t be possible” that this will be possible in the future… Please allow that use case!

First I think that comment about not currently possible to use codes that have barriers with control flow is correctly to be posted here because if not allowed in spec it’s here the place to motivate Nvidia to relax this restriction

Presumably by releasing an extension, right? Because we don’t want anyone, AMD, NVIDIA, Intel, etc just ignoring what the spec says because that’s not what it is that they want. Like your AMD_compute_shader example is a separate specification, thus allowing AMD to expose whatever functionality they choose.