Compute: trouble with multiple workgroups and interleaved data

Short version: when multiple workgroups access data in an interleaved fashion, the output does not match my expectation on Nvidia 20 series cards, but does on the 10 series.

The very simple code below is a contrived example that doesn’t do anything interesting or useful. It serves only to isolate the issue and reproduce the error encountered in the far more intricate original. All it does is calculate a partial prefix sum over the given data, one per workgroup. Dispatch just 2 groups and feed them a buffer of 520 uints, all set to 1, so that the anticipated output is readily recognized.
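Concretely: since every input element is 1, subgroupExclusiveAdd() should leave the value id in the slot written by lane id. For the single-element interleave used below (i = id*2 + grp), and assuming a subgroup size of 32, both groups together should produce:

	// index:  0  1  2  3  4  5  6  7 ...
	// value:  0  0  1  1  2  2  3  3 ...
	// group 0 fills the even slots with 0..31, group 1 the odd slots likewise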

On 10 series cards it works as I anticipate it should; on the 20 series it fails. Output from whichever group first reaches the sequencing point is not visible at the point of capture: it is either never written or, more likely, overwritten by the second group. I do not have access to devices from later generations, nor to anything from AMD. Given that the later Nvidia architectures have ‘relaxed’ subgroup coherence compared with the earlier ones, I anticipated that this was simply a synchronization issue, but I have not been able to coerce the 20 series into performing as expected. Not even with the heavy hammer of full workgroup barrier()s.
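Since the input is all 1s, the missing group’s slots presumably still hold the untouched inputs at capture time. My reading of the failing single-element-interleave capture, taking group 1 to be the one that arrives second and does the capturing, is:

	// expected:  0 0 1 1 2 2 3 3 ...
	// captured:  1 0 1 1 1 2 1 3 ...	(group 0's even slots appear untouched)

For reference, the heaviest variant looked roughly like this around the read-sum-write sequence in the shader below (a sketch, not the verbatim code; the memoryBarrierBuffer() is included for completeness):

	uint avant = data[i].u;
	uint apres = subgroupExclusiveAdd( avant );
	data[i].u = apres;
	memoryBarrierBuffer();	// order the buffer store
	barrier();				// full workgroup execution barrier

It made no difference.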

#version 460

#extension GL_ARB_gpu_shader_int64 : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference2 : require
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_scalar_block_layout : require
#extension GL_KHR_shader_subgroup_ballot: require
#extension GL_KHR_shader_subgroup_arithmetic: require


layout( buffer_reference, std430 ) buffer BufferPtr {
	uint64_t data;
	uint cntr;
};

layout( buffer_reference, buffer_reference_align = 4, std430 ) buffer uintPtr {
	uint u;
};

layout( buffer_reference, std430 ) buffer uvec4Ptr {
	uvec4 v;
};


layout( binding = 0, std430 ) uniform Command {
	uint64_t addr;	// device address of the BufferPtr header ('buffer' itself is a reserved word)
} Cmd;

layout( binding = 1, std430 ) buffer Feedback {
	uvec4 u[256];
} feedback;

const uint Gsz = 32;
layout( local_size_x = Gsz ) in;

shared uintPtr data;
shared uint done;

void main() {

	const uint id = gl_SubgroupInvocationID,
			  grp = gl_WorkGroupID.x;

	if( subgroupElect() ) {
		BufferPtr buf = BufferPtr( Cmd.addr );
		data = uintPtr( buf.data );	// publish the data pointer to the whole subgroup
	}
	subgroupMemoryBarrierShared();

	// uint i = id + grp*Gsz;		// no interleave: fine
	uint i = id * 2 + grp;			// single element interleave: no good
	// uint i = id * 8 + grp * 4;	// fourth element (16 byte step) interleave: still bad
	// uint i = id * 16 + grp * 8;	// eighth element (32 byte step) interleave: fine
	uint avant = data[i].u;						// read this lane's element (1 on entry)
	uint apres = subgroupExclusiveAdd( avant );	// exclusive prefix sum across the subgroup
	data[i].u = apres;							// lane id should now hold the value id

	if( id == 0 ) {
		// cntr lives in the header and is zero when we are dispatched;
		// the first group to arrive gets back 0, the second gets 1
		done = atomicAdd( BufferPtr( Cmd.addr ).cntr, 1 );
	}
	subgroupMemoryBarrierShared();

	if( done == 0 ) return;	// the first group to finish leaves; the last one captures

	// Capture the full width of the 32 byte stride version, covering 512 integers
	uvec4Ptr v = uvec4Ptr( uint64_t(data) );
	for( uint j = 0; j < 4; ++j )
		feedback.u[id*4+j] = v[id*4+j].v;
}

I have done my best to extract this intact from the framework I run it in, but apologize for any omissions and redundancies.

When the access is changed so that each group works on its own contiguous block of data, everything works fine. When the interleave stride is widened to 32 bytes, it also works; 16 bytes is not enough. I suspect this is a caching phenomenon, and 32 bytes happens to match the sector granularity of Nvidia’s cache lines, which would fit. Can this really be the expected behaviour? Is it at all possible to implement finer-grained interaction between workgroups on these (and later?) devices?
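For ease of comparison, the byte separation between the two groups’ elements under each pattern (the same formulas as in the shader above):

	// i = id + grp*Gsz		→ disjoint 128 byte blocks	: works
	// i = id*2 + grp		→ 4 bytes apart				: fails
	// i = id*8 + grp*4		→ 16 bytes apart			: fails
	// i = id*16 + grp*8	→ 32 bytes apart			: works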

It is entirely possible that there is something elementary I am missing, or some vital piece of information I have not been exposed to; that is why I am asking. It is difficult to imagine that no-one else does this kind of thing, so I am hopeful that someone can shed some light.

I appreciate your interest, and your patience.