Real-time raytracing -- Is MIMD req'd?

Hello all,

Excuse me if this question is really basic, but this is my first time delving into GPGPU tech.

I have an HD4850 card (R700 chipset) which I would like to develop on. I’ll be casting a multitude of individual rays which requires a lot of branching/conditionals per ray marched.

What I read about the R700 is that its cores are SIMD. This concerns me, because I’m not sure that OpenCL is going to help me much given that each ray pretty much follows a different execution path.

Can anyone suggest whether or not I would need a newer card for this? If so, just how mainstream are the MIMD cards?



Whilst a newer card might be faster, all GPUs basically work like this, and in particular none of them will execute divergent branches well (what they all do is execute ALL branch paths, and use a hardware bit to ignore the results on un-taken paths). There are no MIMD cards, and AMD’s (now outdated) VLIW parallelism operates only within a thread, not across threads. And given the way GPUs gain their flops/watt, I can’t imagine there ever will be.

One approach might be a state machine based on lookup-tables, so the code paths are the same but the data changes. Although I don’t know if that could be applied to such a problem.

(I just did a search on “state machine ray tracer” and it came up with a lot of hits about accelerated ray tracing, so it’s evidently a well-known approach.)

I was with you through the first part of your answer.

Re “state machine based on lookup tables”, can you roughly outline how that would work? I googled the same terms with little coherent result; there was no clear topic that came to the fore.

I think my question was unclear, so apologies – I should have stated that I’m specifically talking about voxel raymarching, which means there is really very little that can be precalculated. I am not talking about the sort of ray tracing that relies on, e.g., distance fields or other non-discrete mathematical formulae to describe its volumes – those work in a fundamentally different fashion and face different complexity limitations as a result. Raymarching is highly conditional and its complexity depends directly on the grid resolution – and each ray must be marched individually (with a lot of conditional logic per step) rather than in parallel, for cache-coherence reasons. So outside of MIMD, I really don’t think it can be accelerated much…

Voxel rendering, path tracing, etc. are very common applications for GPUs (i.e. SIMD). There are many examples available on the net. For instance, this is something I wrote a couple of years ago; it renders a Julia set by ray marching:

This is an example of path tracing (for off-line rendering) done with GPUs:

And this is another example of path tracing (for real-time rendering):

Divergence is an issue for ray tracing on GPUs, but you can pack enough brute force into a consumer PC with multiple GPUs to still outperform expensive multi-CPU solutions.

OK, but that still doesn’t tell me how 3D DDA raymarching (if that’s what is being used) is parallelised on the GPU! Or is 3D DDA avoided in favour of voxel-plane intersection techniques?

Please, can anyone give me any detail on methods to accelerate this if I am stuck with SIMD? The point is that if every ray takes a different execution path, then… how do you parallelise this given SIMD!?

Coding in OpenCL wasn’t meant to be easy.

Not being familiar with the algorithm you’re talking about, it’s hard to suggest anything; do you have some example code of the inner loop and all these different execution paths?

No one suggested learning OpenCL would be easy. Here’s the algorithm: … t12p1.html

I strongly suspect that ray-plane/triangle intersections are where most real-time raycasters get their speed gain on the GPU. DDA uses a largely arithmetic approach mixed with heavily conditional grid-stepping.

Sorry to double post (I can’t seem to edit my last post).

From “Understanding the Efficiency of Ray Traversal on GPUs”, this seems to confirm my suspicions(?):

“All of the tests in this paper use bounding volume hierarchy (BVH) and Woop’s unit triangle intersection test [Woop 2004]. The ray-box and ray-triangle tests were optimized in CUDA 2.1”

I was hoping there was some way to accelerate the core logic of 3D DDA on the GPU, but looking at the above, maybe there is not?

To me those loops, although they contain branches, are not what one would call ‘branchy’; e.g. “algorithm 3DDDA” only has 3 small cases inside the loop, and I suspect those could be collapsed into 1 case with some fiddly branchless logic. Even if all lines in the branches are executed it isn’t much redundant work: it’s not like each branch option contains loops or, worse, data-dependent loops.

Certainly the first nested if/else of “algorithm 3DDDA” could be implemented fairly easily without branches using standard branchless-logic techniques (i.e. using select(), or ?:). Although the compiler may very well do this for you anyway.

(Note that if you managed to use branchless logic here, you could then use vector types and thus explicitly SIMD algorithms; however, OpenCL/GPU doesn’t need this to be done for good performance.)

But you really just have to code it to see whether it will work well or not; even algorithms that don’t ‘fit’ the GPU topology can run faster because the GPU has so much parallelism and memory bandwidth.

I see… Many thanks for your advice, I shall go ahead and experiment as suggested, then.

On that topic, any good books for learning OpenCL? I get the impression GPGPU tech is definitely something one needs solid material to learn from (particularly given its relative newness and niche application). Something like a beginner-to-pro sort of book that would suit a professional programmer.

I think there are some suggestions in the various OpenCL forums (this one and the vendors’); try searching those.

Personally I’m not much into coding books, and I have the time to stick with the available documentation. The OpenCL specification is really good, and I also like the “AMD Accelerated Parallel Processing OpenCL Programming Guide” – chapter 4 has a good amount on performance tuning, and much of it is applicable to all GPUs, as a lot of the characteristics are very similar. The SDK examples help to get started with the basics, and internet searches help with some quirks.

…fiddly branchless logic.

Absolutely go with the fbl.

I had a large project from years before that was a good candidate for OpenCL; I jumped in and converted it from plain C. Many of the helper routines are much branchier than what you have. I eventually got it running, and it was faster than without OpenCL because of all the cores. Then I started replacing branching with fiddly stuff. A variable can take four paths to its assignment? I used #defines to make one big equation with four subclauses added together, with multipliers on each one so only the appropriate one mattered. Much faster. Someone (may have been notzed) pointed out that I could have used select(), and I tried it, but mine, though uglier, was a tiny bit faster (and a bit more suited to my purpose). I also use ? a lot. And once the fbl was in place I went to float4s, which doubled throughput on the 5870 and helped on Intel also, but not on nVidia.

Make it work first, then make it gooder.

(Also not big on programming books, but have a soft spot for Kernighan and Ritchie…)

(Rest the latter’s soul.) Thanks for the input. These are new concepts to me (I’ve heard vaguely of branchless logic, though I don’t grasp the underlying mechanism yet). Any pointers on that would help.

This is why I mentioned books; I feel there are a lot of blanks I need to fill in.

Sorry to leave this so long without replying. The short answer is: look at the OpenCL spec regarding select() and the ternary operator ( ? : ). (And know that they are performed element-wise for vector types, so when you get to that stage of optimization [if needed for your GPU] they will still work properly.)

What I did was (Just typing this, not real code, may contain errors) something like:

if the original logic structure was:

if ( a < .25 ) z = q^2.;
else if (a < .75 ) z = q^3./2.;
else z = q^4.;

then the new structure is:

z =
(a<.25)            * q^2. +
(a>=.25)*(a<.75)   * q^3./2. +
(a>=.75)           * q^4.;

. . . that’s the basic initial change I made (using copious #defines), but later I had to make changes to allow for the fact that “true” for a subclause like (a<.25) is -1 instead of 1 for vector types (which actually makes sense), but that won’t matter for you because . . . you’re not going to do what I did, or not in that order.

Someone (may well have been notzed) suggested I try select(), which does almost exactly what I was trying to do above. I did try his suggestion, but my much, much uglier attempt, which I already had working by then, turned out to be 1ms faster than select(), and was a tiny bit better suited to my purposes. BUT, you will use select() first, because it’s much much prettier (less confusing to read, therefore less error-prone in the first place), and then when you’ve got that working, if you need 1 more ms, you may dig further. But the OpenCL spec has thought of this, and they do take pretty good care of you.

Good luck!

edit: It might seem that my restructuring of the logic example I gave is doing far too much work. It calculates all three values of the example and uses “logic multipliers” to zero out the non-relevant results. How could this be more efficient than branching and only calculating the necessary value? (I have similar structures that reduce 6 or 7 options to a single result, with much more complex subclauses!) Well, all I can tell you is that it is faster to calculate 3 (or 6 or 7 or 8) results in the way I have described than it is to use branching to get just the one that is needed. Apparently calculations are cheap and branching is very expensive. When you get your code running, if you get this bit figured out, please come back and tell me about it!