How do shader threads execute on the GPU?

Hello everyone:
I want to know how vertex or fragment shader threads execute in parallel on the GPU. I know there’s a concept called a “warp” in CUDA, where 32 GPU threads execute together at one time. Is there a concept like “warp” in GLSL, or something similar? If I render a scene using a fragment shader at 800×800 resolution, will the GPU spawn 800×800 threads and execute them all at once, or spawn fewer threads and loop over them several times? Is there any useful reference on the GPU thread model as it relates to GLSL? Thanks. :)

Everything you’ve asked about is implementation- and hardware-dependent. GLSL does not define how the hardware actually does its job; it only defines what that job is.

You would be better off looking at the available documentation for a particular piece of hardware. But even that doesn’t tell you specifically how each IHV uses that hardware to do rendering.

I’m curious as to what you expect to discover by learning these details.

In general, you can expect the following:

The dispatcher will attempt to group concurrent fragment shader executions into groups of sample areas with a regular grouping. For example, a lot of hardware will use 2x2 “pixel-quads” (a misnomer as they aren’t pixels but samples), even if part of this 2x2 quad will be outside of the triangle’s area and therefore its output will be discarded. So long, thin triangles or single-pixel triangles will be rasterized inefficiently.
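To make the quad overhead concrete, here’s a hypothetical back-of-the-envelope calculation in Python (the `quad_invocations` helper and the grid-aligned-coverage assumption are mine for illustration, not from any spec or vendor documentation):

```python
import math

def quad_invocations(width, height):
    """Fragment shader invocations if the covered region is a
    width x height block aligned to the 2x2 quad grid: every
    overlapped quad runs all 4 lanes, even lanes outside the
    triangle whose results are discarded (illustrative model)."""
    quads = math.ceil(width / 2) * math.ceil(height / 2)
    return quads * 4

# A 1-pixel-wide, 100-pixel-tall sliver: 100 useful samples,
# but 50 quads x 4 lanes = 200 invocations -> half the work is wasted.
useful  = 1 * 100
invoked = quad_invocations(1, 100)
print(useful, invoked)  # 100 200

# A full quad-aligned 800x800 block wastes nothing in this model.
print(quad_invocations(800, 800))  # 640000
```

This is why long, thin triangles rasterize inefficiently: the thinner the coverage, the larger the fraction of each 2×2 quad that runs only to be discarded.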

The commitment (ie: storage) of all data from triangle rasterization will be in the order that the triangles were submitted in.

For DX10-class hardware, the dispatcher will divide processing resources among the different shader stages in various ways, so fragment-shader-heavy sections will dedicate more resources to fragment shading. If you do something like deferred rendering, you can expect the deferred pass to be almost 100% fragment shaders.

Well, thanks for your help. I’m now working on multi-GPU task partitioning, so I just want to find out whether it is worthwhile to divide one fragment shading task into two or more parts. For example, divide an 800×800 resolution screen into four parts and render each 400×400 sub-screen serially. I just want to know whether the time for serially rendering the four sub-screens is approximately equal to rendering the full screen. So, if the GPU groups threads into 2×2 pixel-quads and executes them concurrently, I think the two times will be almost equal.

GLSL shaders execute on the same hardware as CUDA/OpenCL programs, so the same restrictions apply: on DX10+ NVIDIA cards the dispatcher works with groups of 32 threads (warps), and on AMD cards with groups of 64 threads (wavefronts).
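As a rough illustration of what those group sizes mean for your 800×800 case (assuming the idealized situation of one fragment per thread and fully packed groups, which real dispatchers don’t guarantee):

```python
import math

RESOLUTION = 800 * 800  # fragments for a full-screen pass

def groups(total_threads, group_size):
    """Hardware thread groups needed, assuming one fragment per
    thread and fully packed groups (an idealization; real quad
    packing and partial groups make this an underestimate)."""
    return math.ceil(total_threads / group_size)

print(groups(RESOLUTION, 32))  # 32-wide warps (NVIDIA-style)  -> 20000
print(groups(RESOLUTION, 64))  # 64-wide wavefronts (AMD-style) -> 10000
```

Either way, the GPU does not spawn 640,000 threads at once; it cycles many such groups through its shader cores over time.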

If you split the screen into four parts, you will produce (roughly) the same number of fragment shader invocations and raster ops, but the vertex and geometry shaders, clipping, and culling will have to run four times.

If fragment ops are your bottleneck you probably won’t lose much, but as you split your screen into more and more sub-screens, at some point the vertex ops will become the bottleneck.
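That trade-off can be sketched with a toy linear cost model (the `split_cost` function and the example timings are purely illustrative assumptions, not measurements; real costs include caching effects and per-pass overhead that this ignores):

```python
def split_cost(frag_time, vert_time, n_tiles):
    """Very rough cost model for rendering a scene in n_tiles serial
    passes: fragment work stays roughly constant (each fragment is
    still shaded once overall), while vertex/clip/cull work repeats
    once per pass. Hypothetical linear model for illustration only."""
    return frag_time + vert_time * n_tiles

# Fragment-bound scene: fragment work dominates, so 4 tiles cost little extra.
full  = split_cost(frag_time=10.0, vert_time=1.0, n_tiles=1)  # 11.0
tiled = split_cost(frag_time=10.0, vert_time=1.0, n_tiles=4)  # 14.0
print(full, tiled)

# Vertex-bound scene: the same split nearly quadruples the total time.
print(split_cost(frag_time=1.0, vert_time=10.0, n_tiles=4))   # 41.0
```

In this model the tiled total grows linearly with the tile count through the vertex term, which is exactly the “vertex ops become the bottleneck” effect described above.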

Probably not: it sounds like you’re going to short-circuit the GPU’s own mechanisms for doing the same thing. You will lose efficiency in caching and data commonality, and you will incur the full per-pass overhead in each of the four passes, at the very least.

So, not 4 times longer than doing 800x800, but not the same time, by any means.

