(1) It implies that all control shader invocations within a given patch do not necessarily share the same program counter, which goes against my intuition. Given the 32-vertex limit on patch size, I assumed each patch is processed within a single warp.
(2) The only reason you’d need to synchronize threads is if the shader had RW access to a shared memory space. The “patch” qualifier can be applied only to “out” variables, not to temporaries. Wouldn’t it be useful to apply “patch” to temporaries as well?
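To illustrate the restriction: in a tessellation control shader, “patch” is only legal on per-patch outputs, so the only patch-shared storage available is an output variable. A minimal sketch (variable names are illustrative):

```glsl
layout(vertices = 4) out;

patch out vec3 patchCenter;   // legal: "patch" on an out variable
// patch vec3 scratch;        // illegal: "patch" cannot qualify a temporary

void main()
{
    // Temporaries are always per-invocation; there is no patch-local scratch space.
    vec3 p = gl_in[gl_InvocationID].gl_Position.xyz;
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;
}
```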
(3) What’s the equivalent to barrier() in D3D hull shaders?
(4) When I tried to write a highly efficient control shader that makes use of patch-level shared memory (i.e., patch out) and barrier(), I ran into driver issues with both major vendors. Has anyone out there had any better luck than me?
So it sounds a lot like what you said and what I’d expect. Similar to barrier( CLK_LOCAL_MEM_FENCE ) in OpenCL, __syncthreads() in CUDA, and memoryBarrier() in ARB_shader_image_load_store/OpenGL 4.2.
When you have some threads potentially operating on the data previously read/written by other threads, you need a way to ensure the data is in place (a sync point) before you let the threads continue and start using the data.
With both AMD and NVIDIA, writing to a “patch out” variable from invocation A and reading it back from invocation B does not work, even with an intervening barrier(). The shader compiles without errors, but I see garbled rendering.
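For reference, this is the sort of pattern that misrenders, even though the GLSL spec permits reading another invocation’s per-patch output after a barrier() (the specific computation here is just illustrative):

```glsl
layout(vertices = 4) out;

patch out float sharedScale;  // intended as patch-level shared memory

void main()
{
    // Invocation 0 writes the shared value...
    if (gl_InvocationID == 0)
        sharedScale = gl_in[0].gl_Position.w;

    barrier();  // should make the write visible to all invocations in the patch

    // ...and every invocation reads it back. On the drivers tested,
    // the other invocations do not see the written value.
    gl_out[gl_InvocationID].gl_Position =
        gl_in[gl_InvocationID].gl_Position * sharedScale;
}
```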
Their compilers are probably (incorrectly) allocating the per-patch variables to per-thread registers, so a write from one invocation is never visible to the others.