Affine transform, floatnxm

Budric · July 19, 2010, 8:52am

Hi,
I want to apply a transform to a bunch of points. I see OpenCL has floatnxm but I can’t find a mention of any function that takes this data type as argument. Furthermore I’m using ATI Stream SDK and declaring data type float4x4 myMatrix; gives an error “identifier undefined”. I don’t know if I’m using it wrong or if ATI doesn’t support this - even though I don’t see this type defined as optional.

So are there any built in ways to do affine transform? If I have to write my own, what’s a good way to load this matrix into local memory for all threads? i.e. maybe there’s a way to load the matrix for the work group, rather than each thread having to parse the float* argument into a data structure before doing the transform.

Thanks.

david.garcia · July 19, 2010, 2:08pm

> I see OpenCL has floatnxm

No, it doesn’t. These are reserved keywords that are not used for anything yet.

Budric · July 21, 2010, 9:31am

I see, and my other question? If my global work size = numPoints and I pass the matrix as __global float * then each thread will be creating a copy of the matrix and reading from global memory. There will also be memory bank contention.

So is this a wrong way to partition the problem, or the wrong way to pass the argument? What’s a good way?

david.garcia · July 21, 2010, 10:13am

> If my global work size = numPoints and I pass the matrix as __global float * then each thread will be creating a copy of the matrix and reading from global memory.

I don’t understand why you would do that. Instead, write your functions so that they accept a “__global float* matrix” argument rather than requiring the data to be packed into a struct. The only difference [1] between a “float* m” and a “float m[4][4]” is that the latter has a nicer syntax for accessing the matrix elements. If all you have is a “float* m” you will have to compute the index by hand, i.e. write “m[4*row + column]” instead of “m[row][column]”. You could even create a macro to make the code more readable if you want.


#define IDX(row,column) (4*(row) + (column))
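
As a rough illustration of the macro in use, here is a small plain-C sketch (host-side, not kernel code) that fills a flat array of 16 floats as a row-major 4x4 translation matrix. The helper name make_translation is made up for this example, and row-major storage is an assumption that matches the 4*row + column indexing above.

```c
#define IDX(row, column) (4 * (row) + (column))

/* Hypothetical helper: fill m (16 floats, row-major) with a
 * translation by (tx, ty, tz). */
static void make_translation(float *m, float tx, float ty, float tz)
{
    int i;

    /* Start from the identity matrix. */
    for (i = 0; i < 16; ++i)
        m[i] = 0.0f;
    for (i = 0; i < 4; ++i)
        m[IDX(i, i)] = 1.0f;

    /* With row-major storage the translation sits in the last column. */
    m[IDX(0, 3)] = tx;
    m[IDX(1, 3)] = ty;
    m[IDX(2, 3)] = tz;
}
```

The same macro works unchanged inside a kernel, since it only computes an integer offset into the flat array.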

You may also want to take advantage of the lower latency of __constant kernel arguments.

[1] Again, don’t shoot me

Budric · July 21, 2010, 1:10pm

Well what I mean is my multiplication function is something like this


float4 multiply(float4 point, __global float * matrix)
{
    float4 result;
    result.x = matrix[0] * point.x + matrix[1] * point.y + matrix[2] * point.z + matrix[3];
    result.y = matrix[4] * point.x + matrix[5] * point.y + matrix[6] * point.z + matrix[7];

    ...
    return result;
}
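
For reference, a complete version of this computation might look like the following plain-C sketch. It runs on the host rather than as a kernel, uses a stand-in point4 struct in place of OpenCL's float4, and assumes the matrix is row-major with the elided rows following the same pattern as the first two. None of that is confirmed by the post above, and the names point4 and affine_multiply are invented for the example.

```c
/* Stand-in for OpenCL's float4 so that this compiles as plain C. */
typedef struct { float x, y, z, w; } point4;

/* Apply a row-major 4x4 affine matrix to a point.
 * The input w is treated as 1, as in the snippet above, and the
 * bottom row of the matrix is assumed to be (0, 0, 0, 1). */
static point4 affine_multiply(point4 p, const float *m)
{
    point4 r;
    r.x = m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3];
    r.y = m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7];
    r.z = m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11];
    r.w = 1.0f;
    return r;
}
```

In a kernel, the const float *m parameter would carry an address space qualifier instead, for example __global or __constant.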

From what I understand, if I have 32 threads calling multiply(), that’s 32 threads reading the same 16 values from global memory. Maybe the reads aren’t even cached to local memory. I was just thinking if there was a way not to do that many reads at all.

The NVIDIA tutorial on matrix multiplication uses local memory to reduce the number of reads from __global memory: each thread works on a portion of the matrix, copying it from global to local. However, they partition their problem so that the number of threads that run equals the number of blocks required. In my case I don’t know whether a copy from global to local would help, because I would still have 32 threads copying the same 16 values to the local address space.

Anyway I thought there was a magic bullet. Just some way to specify that for this group of work items I’m creating a piece of local read only memory and copying data from global to local address space and that applies to all threads in the group. I will use __constant to get faster reads like you suggested.

david.garcia · July 21, 2010, 2:25pm

Ah, I see. I think you are doing the right thing. You can assume that all devices will have some sort of cache for global memory, and even if it’s very small, sixteen floats will not be a problem.

> Maybe the reads aren’t even cached to local memory

Correct. Unless you explicitly copy the data to local memory I don’t think it’s reasonable to expect the CL to do it “automagically”.

> Just some way to specify that for this group of work items I’m creating a piece of local read only memory and copying data from global to local address space and that applies to all threads in the group.

You could use the work-group to cooperatively move the matrix into a __local variable, but declaring the matrix as __constant is not just going to be faster, it’s easier to code and to understand.