Shader Model 5 suggestions

If you’re using a GPU for graphical rendering, a bounded shader execution length is very important. The absolute last thing a rendering user wants to hear is, “Sorry, I can’t render anything; your shader may be in an infinite loop.”
How is this any different from what a CPU does (and what does length have to do with it)? It's up to the programmer to cater for infinite-loop possibilities; you can hang a CPU app the same way.
A recent (~4 months ago) app of mine on the PSP failed because it bailed out early when it shouldn't have (i.e. how can you know whether it's stuck in an infinite loop or just in one that takes a minute because it's iterating over a table of data?)

Ray tracing, while easily parallelizable, is ultimately very branchy code. Is there something in this voxel? If yes, then loop over the contents and do ray-surface intersection. Etc. That’s not well suited to in-order chips, which makes them much slower than a regular core at this kind of work.
change the algorithm

Originally posted by zed:
How is this any different from what a CPU does (and what does length have to do with it)? It's up to the programmer to cater for infinite-loop possibilities; you can hang a CPU app the same way.
A recent (~4 months ago) app of mine on the PSP failed because it bailed out early when it shouldn't have (i.e. how can you know whether it's stuck in an infinite loop or just in one that takes a minute because it's iterating over a table of data?)

I suspect that’s because the GPU does more things than execute shader code (for example clipping, triangle setup, framebuffer memory refresh, vsync, etc.). Shaders represent maybe 70% of the GPU’s time, whereas on a CPU your code is more like 95%. Both can hang, that’s for sure… but a CPU can sit at 99% (with 1% left to manage the mouse pointer and emit strange beeps :stuck_out_tongue:), while if the graphics card goes to 99% you start to see a black screen, which is not very elegant… and you won’t even see a BSoD, because there won’t be enough free resources to display it!

Well, that’s a theory… in practice they just limit it to force us to buy the next card with new features :stuck_out_tongue: (Have you seen the film Tucker? About a man who wanted to build “the perfect car”.)

Originally posted by Korval:
Ray tracing, while easily parallelizable, is ultimately very branchy code…
Originally posted by zed:
change the algorithm
I think with a streaming monster like the GF8800 Ultra you could bypass the uniform-grid DDA and just test the ray against all triangles with a simple kernel. That would save tons of branching (though not completely, because the ray-triangle hit test still requires some IFs) and would be 100% stackless (but if the triangle set is too big, the kernel will be aborted due to the current 64k instruction limit).
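
Roughly what I have in mind, as an untested sketch (assuming the triangle vertices are packed into an RGB32F texture, three texels per triangle, and using a Möller-Trumbore style hit test; the per-pixel ray setup from a fullscreen quad is not shown):

```glsl
#version 120
#extension GL_EXT_gpu_shader4 : require   // texelFetch2D, integer operations

uniform sampler2D triTex;     // RGB32F, three texels per triangle (v0, v1, v2), all in row 0 for simplicity
uniform int       triCount;   // number of triangles stored in the texture
uniform vec3      rayOrig;    // camera position
varying vec3      vRayDir;    // per-pixel ray direction, interpolated from the fullscreen quad (vertex shader not shown)

// Möller-Trumbore ray/triangle test: returns the hit distance, or 1e30 on a miss.
float hitTriangle(vec3 orig, vec3 dir, vec3 v0, vec3 v1, vec3 v2)
{
    vec3  e1  = v1 - v0;
    vec3  e2  = v2 - v0;
    vec3  p   = cross(dir, e2);
    float det = dot(e1, p);
    if (abs(det) < 1e-6) return 1e30;              // ray parallel to triangle
    float invDet = 1.0 / det;
    vec3  tv = orig - v0;
    float u  = dot(tv, p) * invDet;
    if (u < 0.0 || u > 1.0) return 1e30;
    vec3  q = cross(tv, e1);
    float v = dot(dir, q) * invDet;
    if (v < 0.0 || u + v > 1.0) return 1e30;
    float t = dot(e2, q) * invDet;
    return (t > 0.0) ? t : 1e30;
}

void main()
{
    vec3  dir   = normalize(vRayDir);
    float tBest = 1e30;
    for (int i = 0; i < triCount; ++i)             // brute force: every triangle, no grid, no stack
    {
        vec3 v0 = texelFetch2D(triTex, ivec2(3 * i + 0, 0), 0).xyz;
        vec3 v1 = texelFetch2D(triTex, ivec2(3 * i + 1, 0), 0).xyz;
        vec3 v2 = texelFetch2D(triTex, ivec2(3 * i + 2, 0), 0).xyz;
        tBest   = min(tBest, hitTriangle(rayOrig, dir, v0, v1, v2));
    }
    gl_FragColor = (tBest < 1e30) ? vec4(1.0) : vec4(0.0, 0.0, 0.0, 1.0);   // hit = white, miss = black
}
```

Even this kernel still has the IFs inside the hit test, and the loop over the whole triangle set is exactly what blows past the instruction limit once the driver decides to unroll it.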

So it’s a vicious circle…

How is this any different from what a CPU does?
Any form of infinite loop is bad for the user; it ultimately requires them to kill the application via some means.

However, if the GPU (the thing required to render the screen) goes into an infinite loop, the system’s entire ability to render dies. And since GPU shaders aren’t in any way bound to a particular application, the only way to fix this is to restart the system.

A well-behaved shared resource like a GPU should never crash even from pathological use. Doing so makes the system seem fragile.

Incidentally, this is a problem for GPGPU stuff too, one I hadn’t realized. Well, unless they’re using a GPU that isn’t used for screen rendering.

change the algorithm
Considering that the algorithm in question is called “ray tracing”, the only alternative is called “rasterization”, which is what we’ve been doing.

I think with a streaming monster like the GF8800 Ultra you could bypass the uniform-grid DDA and just test the ray against all triangles with a simple kernel. That would save tons of branching (though not completely, because the ray-triangle hit test still requires some IFs) and would be 100% stackless (but if the triangle set is too big, the kernel will be aborted due to the current 64k instruction limit).
Forget the number of instructions; modern meshes, particularly high LODs, can easily top 100,000 polygons. You expect that, for every pixel, you can quickly do 100,000 memory accesses, on top of what you’re already doing with shaders?

You’d probably be better off accepting the branch misprediction performance and just doing the DDA or some other spatial subdivision scheme.
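
For what it’s worth, the per-step cost of a uniform-grid DDA is small; here is a rough, untested sketch of the Amanatides-Woo stepping (assuming cell occupancy is stored in a 3D texture and that no ray direction component is exactly zero). Every conditional in it is, of course, a potential divergence point on the GPU:

```glsl
#version 120
#extension GL_EXT_gpu_shader4 : require   // texelFetch3D, integer vectors

uniform sampler3D gridTex;    // one texel per cell; .r > 0.5 means "this cell has triangles"
uniform vec3  gridMin;        // world-space minimum corner of the grid
uniform vec3  cellSize;       // world-space size of one cell
uniform ivec3 gridDim;        // number of cells along each axis

// Walks a ray through the uniform grid, Amanatides & Woo style.
// Returns the first non-empty cell, or ivec3(-1) if the ray leaves the grid.
// Assumes the ray origin is already inside the grid volume.
ivec3 traverseGrid(vec3 orig, vec3 dir)
{
    ivec3 cell     = ivec3(floor((orig - gridMin) / cellSize));
    ivec3 cellStep = ivec3(sign(dir));
    vec3  nextB    = gridMin + (vec3(cell) + max(vec3(cellStep), vec3(0.0))) * cellSize;
    vec3  tMax     = (nextB - orig) / dir;   // ray distance to the next boundary on each axis
    vec3  tDelta   = abs(cellSize / dir);    // ray distance between boundaries on each axis

    for (int i = 0; i < 256; ++i)            // hard cap keeps us inside instruction/time limits
    {
        if (any(lessThan(cell, ivec3(0))) || any(greaterThanEqual(cell, gridDim)))
            return ivec3(-1);                // walked out of the grid: miss

        if (texelFetch3D(gridTex, cell, 0).r > 0.5)
            return cell;                     // caller then intersects the triangles listed in this cell

        // step along whichever axis has the nearest boundary (the classic three-way branch)
        if (tMax.x < tMax.y && tMax.x < tMax.z) { cell.x += cellStep.x; tMax.x += tDelta.x; }
        else if (tMax.y < tMax.z)               { cell.y += cellStep.y; tMax.y += tDelta.y; }
        else                                    { cell.z += cellStep.z; tMax.z += tDelta.z; }
    }
    return ivec3(-1);
}

void main()
{
    // trivial usage: colour the pixel by the first occupied cell hit by a hard-coded ray
    ivec3 hit = traverseGrid(vec3(0.5), normalize(vec3(1.0, 0.3, 0.2)));
    gl_FragColor = (hit.x < 0) ? vec4(0.0) : vec4(vec3(hit) / vec3(gridDim), 1.0);
}
```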

Perhaps the solution to the BSoD from excessive GPU usage is to force the user to add a second graphics card dedicated exclusively to raytracing and GPGPU… or a card with two GPU chips (like the GX2), one to display and the other for GPGPU/raytracing/physics… or to start making multicore GPUs…

Forget the number of instructions; modern meshes, particularly high LODs, can easily top 100,000 polygons. You expect that, for every pixel, you can quickly do 100,000 memory accesses, on top of what you're already doing with shaders?
You'd probably be better off accepting the branch misprediction performance and just doing the DDA or some other spatial subdivision scheme.

Yep, some kind of spatial tree gives better results, like the demo shows. And I agree… just a terrain, 8 characters, some vegetation and environment objects in an outdoor multiplayer game like Unreal Tournament can easily reach 50-100k polys.

But it doesn’t really matter… both solutions are killed by the ridiculous 64k instruction limit we have currently (which is like 3k scene polygons maximum, even less with the current driver status, because the ray-triangle test uses something like 200 instructions and is always unrolled… and with 3k polygons maximum you cannot raytrace more than a wall and two teapots…). That’s why it’s vital to extend the maximum instruction limit so the test can be performed manually in a shader, or to hardcode the ray test in the silicon.

The situation is even worse… because you will probably need to test 8-16 packed rays in a row for decent soft shadows… so the numbers shrink even more.

Although you could use the DDA to greatly reduce the number of node tests, you won’t be able to process more than 200-300 node tests without reaching the shader limits… and that may not be enough for complex scenes.

Implementing a call stack, plus extending the limit a bit more to 2-8M instructions, would help the DDA a bit.

santyhamer + Korval: yes, you’re correct, you would want there to be no visual update if it got stuck in a loop.

Originally posted by Korval:
Forget the number of instructions; modern meshes, particularly high LODs, can easily top 100,000 polygons. You expect that, for every pixel, you can quickly do 100,000 memory accesses, on top of what you’re already doing with shaders?
Yes it can, if you look at it from another perspective though, one which basically uses modified memory that does all the heavy lifting for you, using lots and lots of tiny processors.
It’s all in a blog post I wrote (too lazy to retype).
And sure, OpenGL is not a raytracing API, but it could be, or at least a bridge to one; it’s not like raster graphics and raytraced graphics have to be opposites of each other. The only difference is that rasterized graphics has a rasterizer (which is today a pretty small part of the whole) and raytracing needs some kind of acceleration for the ray tests, so why not have both.

And to get back on topic about SM5, my guess is the blend shader as the #1 feature, with the rest being things OpenGL already has.

zeoverlord, you forgot about dissociating the registers from the shader.
The second issue is being able to assign uniforms to whichever register I want.
This way, when I change shaders, the registers don’t change.
I don’t know much about the details of LP, but I guess it will be easy to send a buffer_of_uniforms.

Originally posted by zeoverlord:

It’s all in a blog post I wrote (too lazy to retype).

I agree with all those points there.

Originally posted by V-man:
I don’t know much about the details of LP, but I guess it will be easy to send a buffer_of_uniforms.
From what I have heard it’s as easy as VBOs will be, but I guess we will see about that when it’s released and we can actually start writing some stuff with it.

Originally posted by V-man:
I don’t know much about the details of LP, but I guess it will be easy to send a buffer_of_uniforms.
Do you mean a DX10 cbuffer equivalent, like:

A shader writer can group a set of uniform variables into a common block. The storage for the uniform variables in a common block is provided by a buffer object. The application will have to bind a buffer object to the program object to provide that storage. This provides several benefits. First, the available uniform storage will be greatly increased. Second, it provides a method to swap sets of uniforms with one API call. Third, it allows for sharing of uniform values among multiple program objects by binding the same buffer object to different program objects, each with the same common block definition. This is also referred to as “environment uniforms,” something that in OpenGL 2.1 and GLSL 1.20 is only possible by loading values into built-in state variables such as the gl_ModelViewMatrix.

More info on http://www.opengl.org/pipeline/article

So I bet yes… and if not, you can always fill a dynamic floating-point texture and fetch from it (which I bet will be slower, btw).
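
To make it concrete, here is a minimal sketch of what declaring and using such a common block might look like in GLSL. This is essentially the uniform-block syntax that later shipped as ARB_uniform_buffer_object / OpenGL 3.1, used here only to illustrate the idea described in the Pipeline article; the actual Longs Peak spelling may differ:

```glsl
#version 140
// Sketch of a "common block" of uniforms whose storage lives in a buffer object.
// Any program that declares the same block layout can be fed from the same buffer,
// so swapping a whole environment is one buffer bind instead of dozens of glUniform*() calls.

uniform EnvironmentBlock
{
    mat4 viewMatrix;
    mat4 projMatrix;
    vec4 lightPos[4];
    vec4 fogParams;     // e.g. density, start, end, padding
};

in vec4 position;

void main()
{
    gl_Position = projMatrix * viewMatrix * position;
}
```

On the API side, the application fills a buffer object with those values and binds it to the program’s block binding point; the same buffer can stay bound while you switch between programs that declare the same block.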

Personally I would kill those cbuffers and do a better implementation of the proposed shared, device and global thing… because it’s much more powerful and you can also write values into it.

BTW, if you’re looking to step through a large bank of memory:

See Texture Buffer Object: a texture whose max size is approximately 128MB of texels.
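
For example, a minimal sketch using the GLSL 1.40 spelling (samplerBuffer + texelFetch; the EXT_gpu_shader4 / EXT_texture_buffer_object names differ slightly). The triangle fetch from the brute-force kernel above becomes a flat array lookup with no row/column bookkeeping:

```glsl
#version 140
// Reading triangle vertices out of a texture buffer object.
// A TBO exposes a buffer object to the shader as one huge 1D array of texels,
// which is convenient for stepping through a large bank of memory.

uniform samplerBuffer triBuf;   // RGB32F buffer texture: three texels per triangle
out vec4 fragColor;

vec3 fetchVertex(int triIndex, int corner)
{
    return texelFetch(triBuf, 3 * triIndex + corner).xyz;
}

void main()
{
    // trivial usage: visualise the first triangle's centroid as a colour
    vec3 c = (fetchVertex(0, 0) + fetchVertex(0, 1) + fetchVertex(0, 2)) / 3.0;
    fragColor = vec4(abs(fract(c)), 1.0);
}
```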

Btw… it appears there are now three dedicated RPUs:

http://graphics.cs.uni-sb.de/~woop/rpu/rpu.html ( mentioned before )

and also

http://www.artvps.com/page/109/raybox.htm
http://www.artvps.com/uploads/documents/documentation/RayBox_flyer.pdf

which uses 14 dual-core AR500 processors:

http://www.deathfall.com/article.php?sid=5336 ( old CPUs, 66M tris/sec per core )… the AR500 is like 80. The price is high because it includes 14GB of RAM for big scenes.

I asked if there was an SDK available but they told me “nope, not at the moment, sorry”.

Also this Avalon chip:

http://www.schwarzers.de/project_features.php.htm

I was thinking about the blend shader a bit, and it doesn’t have to be a separate shader. All that is required is the ability to read gl_FragColor and/or gl_FragData from the pixel shader. Then you can do whatever you want with it.
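
As a purely hypothetical sketch (nothing like this exists in GLSL today; gl_LastFragColor is an invented name standing in for “the colour already in the framebuffer at this fragment”), classic alpha blending would then just be shader code, and anything fancier would be equally easy:

```glsl
#version 120
// HYPOTHETICAL -- gl_LastFragColor is an invented built-in, for illustration only.
// It stands for the colour currently in the framebuffer at this fragment.

uniform sampler2D diffuseTex;
varying vec2 texCoord;

void main()
{
    vec4 src = texture2D(diffuseTex, texCoord);
    vec4 dst = gl_LastFragColor;                         // <-- does not exist today
    gl_FragColor = src * src.a + dst * (1.0 - src.a);    // GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA
}
```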

Regards
elFarto

The final blending can’t currently be done in the pixel/fragment shader because these shaders can be executed in parallel and out of order, whereas the blending for each (sub)pixel has to be done in the order the triangles were submitted.
If the blending was done in the fragment shader, then the shader execution might have to pause at the first framebuffer access and wait to be run in sequence, after other shaders from earlier primitives at the same pixel coordinates.
Being able to read the previous Z value would be really cool for doing transparent volumes like tinted water or clouds and other things like that.