tiny change hangs kernel - vectorizing problem

Photovore · October 17, 2011, 4:44am

Hi folks; it’s been a while but I thought I’d post my final problem here and see if it resonates with anyone.

I have a kernel that runs beautifully on nVidia, which I am vectorizing for an AMD 5870.

I’m doing some trig, so to eliminate branching I have turned things like
if ( T < 0.f ) T += 360.f;
into
T += ( T < 0.f ) * 360.f;

( This is of course necessary so as to process all 4 elements of the vector in one go; if the branch were used then it would have to be performed individually for each element of the vector.)

… all cool, and the logic is good; it works for float1s. (And, doesn’t hurt performance???, even if it’s doing 8 complex calculations for 8 different conditions; I am happily surprised…)

-> However, when you’re using float4s, the value of a comparison is different. Instead of getting +1 back from a logical comparison, you get -1. SO, in order to use it in a calculation, as above, it’s necessary to somehow change that -1 to a +1 for the equation to yield what it needs to.

This hangs the 5870.

I have #defined FLOGIC(x) to take e.g. ( T < 0.f ) and change the sign of the result, in a number of ways:
#define FLOGIC(x) (float4) -(x);
#define FLOGIC(x) (float4) (x) * -1.f;
#define FLOGIC(x) (float4) ( x * -1 );
#define FLOGIC(x) (float4) ( abs( x ) );
#define FLOGIC(x) fabs( (float4) x );

. . . if I don’t do this, if I use the result of the logical comparison as originally depicted way up above, then the kernel compiles and runs beautifully except for the fact that the -1 ruins all the calculations it touches and the results are useless.

. . . if I DO do this, if I attempt any of the above-described methods to reverse the sign of the float4 logical comparison, the kernel never comes back. (Same deal if I use a function instead of a #define. I can do anything in that function or #define +except+ change the sign without terminally messing things up.)

It sails through clBuildProgram and clCreateKernel, enqueues fine, and then clFinish hangs the whole machine. [ Mac Pro ]

The cursor still follows the mouse around but the clock is frozen and so is everything else, requiring a hard boot.

Does anybody have any ideas?

Thanks
Dave

p.s. what fun to be here in opencl’s early days, huh?

Photovore · October 17, 2011, 11:53am

Neverrrrrrr miiiiiiiiiiind ! . . .

Yes, it was a tiny change, but re: what I was trying to do, I now see this, in 6.2.2:

Explicit casts between vector types are not legal. The examples below will generate a compilation error.
…
float4 f;
int4 i = (int4) f; <- not allowed

… well, I didn’t get the compilation error, but it is in fact not supposed to work.

to recap what i was trying to do:

The results of a comparison return an integer vector of the same length as what was compared, so, if I want to formularize a float4 comparison, I want to take something like
if (a<b) x=c; else x=d;
and replace it with
x = ( a < b ) * c + ( a >= b ) * d;
where a, b, c, d, and x are all float4s. (the expressions get much more complex than this simple example)

But, the compiler told me it couldn’t implicitly convert the int4 result of the comparison to float4 for the multiplication I wanted. SO, I put in an EXplicit conversion! It seemed to like it at first, until I went back to address the -1 problem, which caused Lion to slink off into the weeds when clFinish is called.

…

Surely there is some way to get the values out of the elements of an int4 and put them into a float4. I suppose I could copy each element individually, casting them when they’re temporarily scalars… maybe there are explicit conversion functions i haven’t stumbled across … I’ll look…

Thanks for your attention!

ajs2 · October 17, 2011, 1:32pm

That’s right, the original expression should produce a compilation error:

float4 T;
…
T += ( T < 0.f ) * 360.f;

(T < 0.f) is a relational containing a vector and a scalar; 0.f will be widened to float4 and then the comparison is performed element-wise, returning an int4.

According to section 6.3(d), vector relational operators return -1 (all bits set) for true. The evaluation so far would be:

(int4)(…) * 360.f

However, according to section 6.2.6, this should be an error because the rank of the scalar type (float) is greater than the rank of the vector type (int).

This expression would be legal if the scalar had the same rank as the vector type, e.g.

(int4)(…) * 360

However, implicit conversions between vector types are not permitted (6.2.1), so

T += (int4)(…)

is illegal, since T is a float4. Casts between vector types are also illegal (6.2.2), so it is necessary to use an explicit conversion. For example:

T += convert_float4(((T < 0.f ) & (int4)(~0x1)) * 360);
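The lane-wise arithmetic discussed here can be checked on the host. The sketch below is plain C, not OpenCL C: it emulates a 4-wide vector with an array, and the helper names (`lanes_less_than_zero`, `add_360_where_negative`, `demo_mask_values`) are made up for illustration. It assumes only what the cited spec sections say, namely that a true lane is -1 and that the int result needs an explicit per-lane conversion to float. Negation is used here as one possible way to turn -1 into +1.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Emulates the OpenCL vector relational (T < 0.f): each lane is
   -1 (all bits set) when true and 0 when false. */
static void lanes_less_than_zero(const float t[LANES], int32_t mask[LANES])
{
    for (int i = 0; i < LANES; ++i)
        mask[i] = (t[i] < 0.0f) ? -1 : 0;
}

/* Emulates T += convert_float4(-(T < 0.f) * 360): negating the mask
   turns -1 into +1, the multiply stays in int, and the conversion to
   float happens once per lane. */
static void add_360_where_negative(float t[LANES])
{
    int32_t mask[LANES];
    lanes_less_than_zero(t, mask);
    for (int i = 0; i < LANES; ++i)
        t[i] += (float)(-mask[i] * 360);
}

/* Self-check of what different bit masks do to a true lane. */
static void demo_mask_values(void)
{
    int32_t true_lane = -1;
    assert((true_lane & 0x1) == 1);    /* keeps only the low bit: +1 */
    assert((true_lane & ~0x1) == -2);  /* clears only the low bit: -2 */
    assert(-true_lane == 1);           /* plain negation also gives +1 */
}
```

One thing this makes visible: in two's complement, masking a true lane with `~0x1` leaves -2, while masking with `0x1` (or negating) leaves +1.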

david.garcia · October 17, 2011, 3:22pm

ajs2:

T += convert_float4(((T < 0.f ) & (int4)(~0x1)) * 360);

Can’t you do something like this?


// Original: if ( T < 0.f ) T += 360.f;
T += as_float4(as_int(360.f) & (T < 0.f));

notzed · October 17, 2011, 5:40pm

david.garcia:

T += convert_float4(((T < 0.f ) & (int4)(~0x1)) * 360);

Can’t you do something like this?
// Original: if ( T < 0.f ) T += 360.f;
T += as_float4(as_int(360.f) & (T < 0.f));

Or perhaps this might be a touch more readable (:

T += select((float4)(0.0f), (float4)(360.0f), T < 0.0f);
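For anyone wanting to convince themselves that the bitwise-AND form and the select() form agree, here is a rough host-side emulation in plain C (not OpenCL C). The helpers `bits_of`, `float_of`, `select_lane`, `add_360_bitwise` and `add_360_select` are hypothetical stand-ins for as_int(), as_float() and one lane of select(); the sketch assumes a 32-bit IEEE-754 float and a -1 mask for true lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Reinterpret bits, like OpenCL's as_int()/as_float(). memcpy avoids
   aliasing problems in standard C. */
static int32_t bits_of(float f)    { int32_t i; memcpy(&i, &f, sizeof i); return i; }
static float   float_of(int32_t i) { float f;   memcpy(&f, &i, sizeof f); return f; }

/* Emulates T += as_float4(as_int(360.f) & (T < 0.f)): an all-ones mask
   keeps the bit pattern of 360.f, a zero mask yields +0.0f. */
static void add_360_bitwise(float t[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        int32_t mask = (t[i] < 0.0f) ? -1 : 0;
        t[i] += float_of(bits_of(360.0f) & mask);
    }
}

/* Emulates one lane of vector select(a, b, c): the lane takes b when
   the most significant bit of c is set, otherwise a. */
static float select_lane(float a, float b, int32_t c)
{
    return (c < 0) ? b : a;
}

static void add_360_select(float t[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        int32_t mask = (t[i] < 0.0f) ? -1 : 0;
        t[i] += select_lane(0.0f, 360.0f, mask);
    }
}
```

Neither form needs an int-to-float conversion or a multiply, which is presumably why true is represented as all bits set in the vector case.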

Photovore · October 18, 2011, 6:18pm

Thanks, folks, for your ideas. I like your thinking…

For illustration purposes only, here is a slightly more complex example:

for float1, #define FLOGIC(a) (a)
for float4, #define FLOGIC(a) convert_float4( abs(a) )


			H = atan( bb / aa ) \
			+   FLOGIC ( aa  < 0.f ) * PI \
			+   FLOGIC ( bb  < 0.f && aa >= 0.f ) * PIx2;

… I have others that go on for 6 lines or more … so you may see that I wish to compute each logical condition as a self-contained unit and then apply it to the rest of the clause, for each clause in each calculation.
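To make the two FLOGIC variants concrete, here is a small host-side sketch in plain C (not OpenCL C) that computes only the offset terms of the expression above, once the scalar way and once the way a single float4 lane would see it. The names `FLOGIC_SCALAR`, `FLOGIC_VECTOR_LANE`, `lane_mask`, `offset_scalar` and `offset_vector_lane`, and the constants `PI_F` and `PIx2_F`, are invented for this illustration; the -1 mask is emulated by hand.

```c
#include <assert.h>
#include <stdlib.h>

#define PI_F   3.14159265f
#define PIx2_F (2.0f * PI_F)

/* Scalar mode: C comparisons already yield 0 or 1. */
#define FLOGIC_SCALAR(cond) ((float)(cond))

/* Vector mode, one lane: the mask is 0 or -1, so abs() makes it 0 or 1
   before the conversion to float (mirrors convert_float4(abs(a))). */
#define FLOGIC_VECTOR_LANE(mask) ((float)abs(mask))

/* Emulated vector relational result for one lane: -1 when true. */
static int lane_mask(int truth) { return truth ? -1 : 0; }

/* Offset added to atan(bb / aa), scalar form. */
static float offset_scalar(float aa, float bb)
{
    return FLOGIC_SCALAR(aa < 0.0f) * PI_F
         + FLOGIC_SCALAR(bb < 0.0f && aa >= 0.0f) * PIx2_F;
}

/* Same offset as one float4 lane would compute it. With 0/-1 masks the
   element-wise && is equivalent to a bitwise AND of the masks. */
static float offset_vector_lane(float aa, float bb)
{
    int a_neg    = lane_mask(aa < 0.0f);
    int b_neg    = lane_mask(bb < 0.0f);
    int a_nonneg = lane_mask(aa >= 0.0f);
    return FLOGIC_VECTOR_LANE(a_neg) * PI_F
         + FLOGIC_VECTOR_LANE(b_neg & a_nonneg) * PIx2_F;
}
```

Both forms give 0, PI, PI and 2*PI for the four quadrants, so each condition really does behave as a self-contained 0-or-1 factor.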

•) ajs2, I like the (~0x1), which should be faster than abs(), but for some reason when I tried it half of the output buffer was vertically inverted. Not so with abs(). Something to figure out later after I attend to other logic problems.

•) david, your idea of using as_type() with bitwise AND should be faster than the multiplications, and I will do that for performance purposes . . . after I get my logic working. (That’s probably why they went with -1 in the first place; why didn’t I think of that? Converting to +1 and float-multiplying each clause is not very smart.)

•) notzed, I thought about select, but it wouldn’t work for any but the simplest two-alternative instances (of which there are plenty, but still). (Also, with #defines, this same 1500 lines of code will work for float4, float1, or straight Xcode C, where select() doesn’t exist … though I could #define one.)

The example I posted earlier with a value of 360 may not have been the best choice; I’m usually using non-integer-value floatns. So, thanks for the ideas which incorporated that calc in integer form, but I’ll want to separate the operations. (For now. Perhaps later I’ll go around seeking out every little performance optimization…)

…

Still a ways off, but perhaps I can see a glimmer of light … here’s what I get:

#define vex (turns on float4s):

Lion, Xcode 4:
CPU (Xeon) – executes; gives me bad data but recognizable as being partially correct…
GPU (5870) – still hangs Lion when clFinish is called…

Snow Leopard, Xcode 3:
CPU ( I7 ) – LLVM compiler has failed to compile a function
GPU (330M) – LLVM compiler has failed to compile a function

//#define vex (float4s turned off):

All devices on both platforms – works well!

what fun…

notzed · October 27, 2011, 6:42am

Sounds like you’re just making work for yourself if i understand you correctly. 1500 lines of code isn’t all that much, and opencl c is different enough to c that trying to wrap it all in macros and create a pseudo custom language that is converted to either at compile time sounds like a lot of hassles for little benefit. Apart from the language the very different hardware requires sometimes radically different approaches. And poor choices can be really really expensive.

Back to select, it can surely be used for any decision logic you can implement any other way (excluding branches in control flow): so i’m not sure what you’re talking about here. It might not be too pretty, but that FLOGIC stuff isn’t either - and that is less general, its choice is either ‘0’ or ‘a number’.

But now your goal is clearer … from the specification:

The ternary selection operator (?:) operates on three expressions (exp1 ? exp2 : exp3). This
operator evaluates the first expression exp1, which can be a scalar or vector result except
float. If the result is a scalar value then it selects to evaluate the second expression if the
result compares unequal to 0, otherwise it selects to evaluate the third expression. If the
result is a vector value, then this is equivalent to calling select(exp3, exp2, exp1). The select
function is described in table 6.14. The second and third expressions can be any type, as
long their types match, or there is a conversion in section 6.2.1

So you could probably just use ?: and it should ‘just work’.

Photovore · October 28, 2011, 7:56pm

notzed:

Sounds like you’re just making work for yourself if i understand you correctly. 1500 lines of code isn’t all that much, and opencl c is different enough to c that trying to wrap it all in macros and create a pseudo custom language that is converted to either at compile time sounds like a lot of hassles for little benefit. Apart from the language the very different hardware requires sometimes radically different approaches. And poor choices can be really really expensive.

Making work for myself? … well, yep, that’s me!! I tend not to do things the easy way at first. However, I code straight plain K&R C so the difference is not too great. I just learned enough Cocoa and Objective-C over a couple of months to bring my project up to the present from the old CodeWarrior days and allow the use of current hardware and OpenCL. I do understand that different GPU substrates may require different approaches for best performance (i.e. later vectorization thread you’ve contributed to) but at base I was taught “Make it work first, make it pretty later”.

notzed:

Back to select, it can surely be used for any decision logic you can implement any other way (excluding branches in control flow): so i’m not sure what you’re talking about here. It might not be too pretty, but that FLOGIC stuff isn’t either - and that is less general, its choice is either ‘0’ or ‘a number’.

OK, now I see what you mean; drew a blank at first. So, given that everything I need is currently either 0 or a number, I could say:

val = \
  ( conditionA ? subcalcA : 0 ) \
+ ( conditionB ? subcalcB : 0 ) \
+ ( conditionC ? subcalcC : 0 ) ...

… or the same thing with select().
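A quick scalar sketch of that sum in plain C, with hypothetical names (`sum_of_terms`, `sum_without_parens`) and made-up arguments. One detail worth spelling out: ?: binds more loosely than +, so each conditional term needs its own parentheses or the terms nest into one another.

```c
#include <assert.h>

/* Each conditional term is parenthesized, so the three terms are
   evaluated independently and then added. */
static float sum_of_terms(int condA, float subA,
                          int condB, float subB,
                          int condC, float subC)
{
    return (condA ? subA : 0.0f)
         + (condB ? subB : 0.0f)
         + (condC ? subC : 0.0f);
}

/* The unparenthesized spelling, kept only to show how it parses:
   condA ? subA : ((0.0f + condB) ? subB : ((0.0f + condC) ? subC : 0.0f)) */
static float sum_without_parens(int condA, float subA,
                                int condB, float subB,
                                int condC, float subC)
{
    return condA ? subA : 0.0f + condB ? subB : 0.0f + condC ? subC : 0.0f;
}
```

With all three conditions true and sub-results 1, 2 and 4, the parenthesized version yields 7 while the unparenthesized one yields 1, because only the first true branch is ever taken.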

I can pop that into my #def and see if it affects performance … later. Because I did actually get that bit to work, and my attention is elsewhere at the moment.

Thanks!