I’m wondering about different approaches for offscreen computations on the GPU. In some papers and articles, strategies are described like rendering to a 3D texture that holds, for example, float scalars. Generally, rendering to a texture is often preferred over writing to a shader storage buffer via a compute shader. Why is that? Is the first approach faster? Or is it just that the compute shader approach is relatively new and thus not mentioned as often?
Another question: Can I somehow estimate how expensive it is to switch the pipeline by choosing a new shader program?
For example in Marching Cubes (for those who are familiar with it): in my opinion, the most logical approach for a chunk of scalar values would be to first determine the intersection points on the edges (one compute shader invocation per edge) and then connect the points with triangles (one invocation per cube). In the best case this would mean switching the pipeline only once between the two dispatches. But some implementations instead use a single dispatch that executes the core Marching Cubes directly, where each cube invocation calculates three edge intersections and then waits for its neighbour cells to calculate their respective intersections before it continues. Can you tell me what the general advantages and disadvantages of these two strategies are?
[QUOTE=bene2808;1293699]I’m wondering about different approaches for offscreen computations on the GPU.
In some papers and articles, strategies like rendering to a 3D texture holding for example float scalars are described.
Generally rendering to a texture is often preferred over writing to a shader storage buffer via compute shader.
Why is that?[/QUOTE]
Without details on a specific technique, we can only guess.
Best guess: Memory access patterns.
One pro for 2D or 3D textures vs. SSBOs is improved performance: 1) writing to that memory using 2D rasterization and/or 2) reading from that memory using 2D spatially local lookups in a subsequent render pass’s shader(s) (e.g. for 2D texture filtering, or with adjacent fragments in that pass reading from adjacent texels in the texture) should be fairly fast.
Why faster? Texture memory is stored in what’s called tiled memory organization (in Vulkan). Some GPU vendors call this swizzled memory format instead of tiled. Basically, spatially-local texels are stored close to each other in memory to optimize memory read/write bandwidth. As opposed to “linear” memory organization where, for instance, whole rows of texels are stored adjacent in memory, before winding down and storing texels for the next row.
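To make the locality difference concrete, here’s a toy sketch (purely illustrative; real GPUs use vendor-specific, undocumented tilings, not necessarily Morton order) comparing a row-major “linear” texel address with a Z-order/Morton “tiled” address:

```python
# Illustrative sketch only (not any vendor's actual layout): compare the
# linear (row-major) address of a texel with a Morton/Z-order address, to
# show why spatially close texels can end up close in memory when tiled.

def linear_address(x, y, width):
    """Row-major: the neighbour one row down is a full row away."""
    return y * width + x

def morton_address(x, y, bits=16):
    """Interleave the bits of x and y (Z-order curve)."""
    addr = 0
    for i in range(bits):
        addr |= ((x >> i) & 1) << (2 * i)
        addr |= ((y >> i) & 1) << (2 * i + 1)
    return addr

# In a 1024-wide texture, the texel one row down is 1024 addresses away
# in linear order, but only 22 addresses away in Morton order:
print(linear_address(3, 4, 1024) - linear_address(3, 3, 1024))  # 1024
print(morton_address(3, 4) - morton_address(3, 3))              # 22
```

A fragment quad touching a 2×2 neighbourhood therefore tends to hit one cache line in the tiled layout, where the linear layout would touch two lines a whole row apart.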
Also, texture memory reads are optimized through texture caches on the GPU, and have been forever. Nowadays, main memory reads (such as from SSBOs) are often cached as well to some degree. Profile on your platform to be sure, but in the absence of some thorough benchmarks, I’d generally expect texture caches to be more efficient than main memory read caches for 2D spatial lookups.
Thanks for your response! OK, textures seem to be the better alternative in most cases.
Concerning the Marching Cubes algorithm, here is one performance example I am thinking about. As I stated above, I’m trying to compute the edge intersections in advance; for those who do not know Marching Cubes, I just try to reach every edge of a uniform 3D grid in a compute shader. One possible strategy would be three subsequent dispatch calls with different uniform parameters, so that every edge (along all three axes x, y, z) is reached. Another possibility would be dispatching only once, but holding an index array where each edge is characterized by the indices of the two points it connects. So can you tell me which would be more useful in this case: needing only one dispatch, or not having to look up an index table before computing an edge?
There seem to be two basic choices: one pass for edges and a second pass for cubes, or three passes for edges (x, y, z) and a fourth for cubes.
With one pass for edges, you have to consider the situation at the boundary of the grid. For a grid with X×Y×Z vertices, you have (X-1)×Y×Z X-aligned edges, X×(Y-1)×Z Y-aligned edges, and X×Y×(Z-1) Z-aligned edges. So you can either use (X-1)×(Y-1)×(Z-1) work groups and ignore the far edges of the grid (if you’re tracing a closed surface with an adequate margin, there should be no intersections there), or use X×Y×Z work groups and allow for the fact that some edges might go outside the grid (you’ll never actually access those edges, but if any of them have intersections, those will occupy space in the vertex arrays unless you explicitly ignore them).
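As a quick sanity check on those counts, here’s a small sketch (the chunk size is a made-up example) comparing the exact number of edges with the padded count you get from launching one invocation per vertex with three edges each:

```python
# Sanity-check the edge counts for an X x Y x Z vertex grid, and the
# over-allocation from launching one invocation per vertex that handles
# the three edges leaving it in +x, +y, +z (chunk size is hypothetical).

def exact_edge_count(X, Y, Z):
    # (X-1)YZ x-aligned + X(Y-1)Z y-aligned + XY(Z-1) z-aligned edges.
    return (X - 1) * Y * Z + X * (Y - 1) * Z + X * Y * (Z - 1)

def padded_edge_count(X, Y, Z):
    # One invocation per vertex, 3 edges each; some stick out of the grid.
    return X * Y * Z * 3

X = Y = Z = 33  # e.g. a 32^3-cell chunk has 33^3 scalar samples
print(exact_edge_count(X, Y, Z))                            # 104544
print(padded_edge_count(X, Y, Z))                           # 107811
print(padded_edge_count(X, Y, Z) - exact_edge_count(X, Y, Z))  # 3267 wasted
```

The waste is 3·X² slots, i.e. O(n²) against O(n³) real edges, which is the same asymptotic trade-off discussed below for boundary checks.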
With three passes, you’re reading the source data three times, which could easily outweigh any gains from not needing to deal with the edge case.
Either way, you’d generate the vertex index (via atomic increment) and position each time you find an edge which intersects the surface. I’d store the indices in either one three-component uimage3D or three single-component uimage3Ds. On the face of it, the access pattern in the per-cube pass doesn’t seem to favour using a single three-component texture: each axis needs a different set of 4 values; e.g. the X axis needs (i,j,k), (i,j+1,k), (i,j,k+1) and (i,j+1,k+1). So with a 3-component image you’d end up reading 2×2×2×3 = 24 values for each cube and ignoring half of them. But given that you’d need the ignored values for the adjacent cubes, this may be a non-issue with the way that texture access works (assuming that the local work groups are cubes).
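To spell out that access pattern, here’s a sketch (the edge volumes are modelled as plain Python coordinates; on the GPU they’d be uimage3Ds) of exactly which texels the cube at (i, j, k) would fetch, 4 per axis:

```python
# Sketch of the per-cube gather: which texels of the three edge-index
# volumes the cube at (i, j, k) reads. Each axis contributes the 4 edges
# of that orientation bordering the cube.

def cube_edge_texels(i, j, k):
    """Return the 12 (axis, texel) pairs a cube needs: 4 per axis."""
    x_edges = [(i, j + dj, k + dk) for dj in (0, 1) for dk in (0, 1)]
    y_edges = [(i + di, j, k + dk) for di in (0, 1) for dk in (0, 1)]
    z_edges = [(i + di, j + dj, k) for di in (0, 1) for dj in (0, 1)]
    return ([('x', t) for t in x_edges]
            + [('y', t) for t in y_edges]
            + [('z', t) for t in z_edges])

edges = cube_edge_texels(0, 0, 0)
print(len(edges))            # 12 edge indices actually needed per cube
print(2 * 2 * 2 * 3 - len(edges))  # 12 values fetched but ignored with a
                                   # single 3-component image (24 - 12)
```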
So you’d rather have one single edge pass, where each invocation calculates three edges, and accept the edges outside the chunk, which are computed without being used? Or is there a way to have one pass for the edges and one invocation per edge?
Just for understanding everything you said: Do I really have to read the edges of all of the 8 corners? Theoretically I would not have to read the last corner for x = y = z = 1, because it doesn’t contribute to the edges of the cube. This is not nitpicking, but just finding out if I understand correctly what you’re saying :D.
You said one three-component uimage3D and three single-component uimage3Ds would lead to similar performance. But isn’t one texture, with texels including all the information needed at one point in time, generally a better idea than using three textures with “distributed” memory?
Another question: I read an article where computing the scalars in a uniform grid was done by rendering to a 3D texture. By rendering they mean they constructed several planes of two triangles filling the whole volume, so that after rasterization every texel was reached. But how can this be a good strategy? Rasterization and construction of some virtual triangles sounds like a much too complicated way to fill all the texels of a 3D texture…
Well, you could just triple the size in one dimension, then use n % 3 to determine the axis. But I don’t see any advantage to doing it this way rather than handling the 3 axes in each invocation.
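A sketch of that decoding (the dispatch shape and grid size here are hypothetical): one dispatch over (3·X, Y, Z) invocations, with the tripled coordinate split into an axis selector and a start vertex.

```python
# Sketch of the "triple one dimension" trick: a single dispatch over
# (3*X, Y, Z) invocations, where n % 3 along the tripled axis selects
# which edge direction this invocation computes.

def decode_invocation(gx, gy, gz):
    axis = gx % 3               # 0 -> x-edge, 1 -> y-edge, 2 -> z-edge
    vertex = (gx // 3, gy, gz)  # grid vertex the edge starts from
    return axis, vertex

print(decode_invocation(7, 2, 5))  # (1, (2, 2, 5)): the y-edge at (2, 2, 5)

# Every (axis, start-vertex) pair is produced exactly once:
seen = {decode_invocation(gx, gy, gz)
        for gx in range(3 * 4) for gy in range(4) for gz in range(4)}
print(len(seen))  # 3 * 4 * 4 * 4 = 192 distinct edges
```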
I’m saying that it’s unclear whether it’s worth handling the edge case explicitly. Any checks which go into the shader get executed for every invocation, so the cost of such checks is O(n^3), but the potential saving is only O(n^2).
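A back-of-envelope version of that argument (the per-invocation cost constants are made up purely for illustration, so only the asymptotic shape matters):

```python
# Toy cost model: bounds check paid by all n^3 invocations vs. ~3*n^2
# wasted boundary invocations when the check is skipped. Constants are
# invented; only the growth rates are the point.

def cost_with_check(n, check_cost=1, work_cost=10):
    # Every one of the n^3 invocations pays for the bounds check.
    return n**3 * (check_cost + work_cost)

def cost_without_check(n, work_cost=10):
    # No check, but roughly 3*n^2 boundary invocations do wasted work.
    return (n**3 + 3 * n**2) * work_cost

for n in (2, 64, 512):
    print(n, cost_with_check(n), cost_without_check(n))
```

With these (invented) constants the check wins for tiny grids but loses for large ones, since its O(n³) cost eventually dwarfs the O(n²) saving; where the crossover sits on real hardware is exactly what you’d have to profile.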
Maybe, maybe not.
Prior to the introduction of compute shaders, this was the only option.