I want to apply a transform to a bunch of points. I see OpenCL has floatnxm, but I can’t find any mention of a function that takes this data type as an argument. Furthermore, I’m using the ATI Stream SDK, and declaring “float4x4 myMatrix;” gives the error “identifier undefined”. I don’t know if I’m using it wrong or if ATI doesn’t support it - even though I don’t see this type documented as optional.
So are there any built-in ways to do an affine transform? If I have to write my own, what’s a good way to load this matrix into local memory for all threads? I.e., maybe there’s a way to load the matrix once for the whole work-group, rather than each thread having to parse the float* argument into a data structure before doing the transform.
I see OpenCL has floatnxm
No, it doesn’t. Those are reserved keywords that are not used for anything yet.
I see, and my other question? If my global work size = numPoints and I pass the matrix as __global float * then each thread will be creating a copy of the matrix and reading from global memory. There will also be memory bank contention.
So is this a wrong way to partition the problem, or the wrong way to pass the argument? What’s a good way?
If my global work size = numPoints and I pass the matrix as __global float * then each thread will be creating a copy of the matrix and reading from global memory.
I don’t understand why you would do that. Instead, you should write your functions in such a way that they accept a “__global float* matrix” argument instead of requiring the data to be packed into a struct. The only difference between a “float* m” and a “float m[4][4]” is that the latter has a nice syntax for accessing the matrix elements. If all you have is a “float* m” you will have to compute the index by hand, i.e. you will need to do “m[4*row + column]” instead of “m[row][column]”. You could even create a macro to make the code more readable if you want:
#define IDX(row,column) (4*(row) + (column))
You may also want to take advantage of the lower latency of __constant kernel arguments.
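To make the indexing concrete, here is a minimal sketch in plain C so the arithmetic can be checked on the host. The float4 struct and the transform name are stand-ins of my own; in the actual kernel, the struct comes from OpenCL and the matrix would be a “__constant float*” argument as suggested above.

```c
#include <assert.h>

/* Row-major 4x4 indexing, as in the macro above. */
#define IDX(row, column) (4*(row) + (column))

/* Host-side stand-in for OpenCL's built-in float4. */
typedef struct { float x, y, z, w; } float4;

/* Affine transform of a point by a row-major 4x4 matrix.
   In the kernel, "m" would be declared __constant float *m. */
float4 transform(float4 p, const float *m)
{
    float4 r;
    r.x = m[IDX(0,0)]*p.x + m[IDX(0,1)]*p.y + m[IDX(0,2)]*p.z + m[IDX(0,3)];
    r.y = m[IDX(1,0)]*p.x + m[IDX(1,1)]*p.y + m[IDX(1,2)]*p.z + m[IDX(1,3)];
    r.z = m[IDX(2,0)]*p.x + m[IDX(2,1)]*p.y + m[IDX(2,2)]*p.z + m[IDX(2,3)];
    r.w = 1.0f; /* points keep w = 1 under an affine transform */
    return r;
}
```

For example, an identity matrix with translation (1, 2, 3) in the last column maps the point (1, 1, 1) to (2, 3, 4).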
 Again, don’t shoot me
Well, what I mean is that my multiplication function is something like this:
float4 multiply(float4 point, __global float *matrix)
{
    float4 result;
    result.x = matrix[0]*point.x + matrix[1]*point.y + matrix[2]*point.z  + matrix[3];
    result.y = matrix[4]*point.x + matrix[5]*point.y + matrix[6]*point.z  + matrix[7];
    // ...and likewise for result.z
    return result;
}
From what I understand, if I have 32 threads calling multiply(), that’s 32 threads reading the same 16 values from global memory. Maybe the reads aren’t even cached to local memory. I was just thinking if there was a way not to do that many reads at all.
The NVIDIA tutorial on matrix multiplication uses local memory to reduce the number of reads from __global memory - they have each thread copy a portion of the matrix from global to local. However, they partition their problem so that the number of threads run equals the number of blocks required. In my case, if I did a copy from global to local, I don’t know whether it would help, because I’d still have 32 threads copying the same 16 values into the local address space.
Anyway I thought there was a magic bullet. Just some way to specify that for this group of work items I’m creating a piece of local read only memory and copying data from global to local address space and that applies to all threads in the group. I will use __constant to get faster reads like you suggested.
Ah, I see. I think you are doing the right thing. You can assume that all devices will have some sort of cache for global memory, and even if it’s very small, sixteen floats will not be a problem.
Maybe the reads aren’t even cached to local memory
Correct. Unless you explicitly copy the data to local memory, I don’t think it’s reasonable to expect OpenCL to do it “automagically”.
Just some way to specify that for this group of work items I’m creating a piece of local read only memory and copying data from global to local address space and that applies to all threads in the group.
You could use the work-group to cooperatively move the matrix into a __local variable, but declaring the matrix as __constant is not just going to be faster, it’s easier to code and to understand.
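For completeness, here is a host-side C simulation of that cooperative copy, with the OpenCL version sketched in the comment. The names (cooperative_copy, LOCAL_SIZE) are my own, and I’m assuming a work-group of 32 items copying the 16 matrix floats; each work-item copies the elements whose index is congruent to its local id, striding by the work-group size.

```c
#define MAT_ELEMS  16
#define LOCAL_SIZE 32

/* Simulates what one work-group would do cooperatively. On the device
   it would look roughly like:
       __local float lmat[16];
       for (int i = get_local_id(0); i < 16; i += get_local_size(0))
           lmat[i] = gmat[i];
       barrier(CLK_LOCAL_MEM_FENCE);
*/
void cooperative_copy(const float *gmat, float *lmat)
{
    for (int lid = 0; lid < LOCAL_SIZE; ++lid)            /* each work-item...   */
        for (int i = lid; i < MAT_ELEMS; i += LOCAL_SIZE) /* ...copies its share */
            lmat[i] = gmat[i];
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here so no thread reads
       lmat before the copy is complete. */
}
```

Since the work-group (32) is larger than the matrix (16 floats), work-items 0 through 15 each copy one element and the rest copy nothing - every element is written exactly once.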