Moving from float to float4: what should be changed in the host code?


My input to the kernel is four 2D matrices, each containing 256x32 floats.
The output has the same size.

So in the host code I call:

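cl_uint dim = 2;
size_t global_offset[] = {0, 0};
size_t global_size[] = {4, 256 * 32};  /* 4 matrices x 8192 floats each */

/* local_work_size is NULL, so the runtime picks the work-group size */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset,
                             global_size, NULL, 0, NULL, &prof_event);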

I decided that each element of the output will be a work item.
I'm not sure that is wise.
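
To make that mapping concrete, here is a minimal sketch of how a 2D work-item id could translate to a position in the flat buffer. It assumes the four matrices are stored back to back; the helper name flat_index, the two macros, and that layout are assumptions for illustration and are not taken from the actual kernel.

```c
#include <stddef.h>

#define N_MATRICES   4
#define MATRIX_ELEMS (256 * 32)   /* floats per matrix */

/* Hypothetical helper: flat buffer index for work item (gid0, gid1),
 * assuming the matrices are stored contiguously one after another.
 * Inside a kernel, gid0 and gid1 would come from get_global_id(0)
 * and get_global_id(1). */
static size_t flat_index(size_t gid0, size_t gid1)
{
    return gid0 * MATRIX_ELEMS + gid1;
}
```

With global_size = {4, 256 * 32}, every one of the 4 * 8192 floats gets exactly one work item under this assumed layout.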

The kernel function is:

__kernel void id_check(__global float *in,
						__global float *out,
						int n_in_matrices,
						int n_out_matrices)

To make it run faster, I changed it to:

__kernel void id_check(__global float4 *in,
						__global float4 *out,
						int n_in_matrices,
						int n_out_matrices)

Of course, I also changed the kernel code so that 4 elements are processed in a single operation.

In both cases I got the same results and the same processing time.
That does not make sense!
The second version should run 4 times faster.

What should I change in the host code?



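I found the problem.
Each work item now handles one float4 (four floats), so the second dimension needs only a quarter as many work items. I should change global_size to:
size_t global_size[] = {4, 256 * 32 / 4};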
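
The same calculation can be written so that the host code follows the vector width automatically. This is only a sketch: the helper name work_items_per_matrix and the macros are made up for illustration, and it assumes the element count is an exact multiple of the vector width.

```c
#include <stddef.h>

#define N_MATRICES   4
#define MATRIX_ELEMS (256 * 32)  /* floats per matrix */
#define VEC_WIDTH    4           /* floats per work item with float4 (assumed name) */

/* Hypothetical helper: work items needed per matrix when each work item
 * handles vec_width floats. Returns 0 if the split is not exact, so the
 * caller can refuse to launch with a mismatched size. */
static size_t work_items_per_matrix(size_t n_floats, size_t vec_width)
{
    if (vec_width == 0 || n_floats % vec_width != 0)
        return 0;
    return n_floats / vec_width;
}
```

With this, global_size would be {N_MATRICES, work_items_per_matrix(MATRIX_ELEMS, VEC_WIDTH)}, which is {4, 2048} for float4 and {4, 8192} when the vector width is 1 (plain float).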