Hello,

The input to my kernel is four 2D matrices, each containing 256x32 floats.

The size of the output is the same.

So in the host code I call:

```
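cl_uint dim = 2;                                 /* work_dim is a cl_uint */
size_t global_offset[] = {0, 0};
size_t global_size[] = {4, 256 * 32};            /* 4 matrices x 8192 elements each */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset,
                             global_size, NULL,  /* NULL local size: the runtime chooses */
                             0, NULL, &prof_event);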
```

I decided that each element of the output would be one work-item.

I am not sure that is wise.
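
To make that mapping concrete, here is a minimal sketch in plain C (not OpenCL C) of how the two global IDs could map to a flat element index. It assumes the four matrices are stored back to back in one buffer, which the post does not state. The names `flat_index`, `N_MATRICES` and `ELEMS_PER_MATRIX` are hypothetical and only for illustration.

```c
#include <stddef.h>

#define N_MATRICES 4
#define ELEMS_PER_MATRIX (256 * 32)

/* Hypothetical helper: g0 plays the role of get_global_id(0) (which matrix),
   g1 plays the role of get_global_id(1) (which element inside that matrix).
   Assumes the matrices are laid out contiguously, one after another. */
static size_t flat_index(size_t g0, size_t g1)
{
    return g0 * ELEMS_PER_MATRIX + g1;
}
```

With `global_size = {4, 256 * 32}` this covers every one of the 4 * 8192 elements exactly once.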

The kernel function is:

```
__kernel void id_check(__global float *in,
__global float *out,
int n_in_matrices,
int n_out_matrices)
```

In order to run faster I changed to:

```
__kernel void id_check(__global float4 *in,
__global float4 *out,
int n_in_matrices,
int n_out_matrices)
```

Of course, I also changed the kernel code so that 4 elements are processed in a single clock.
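
One point that may matter here: if each work-item now handles one `float4`, the number of work-items needed per matrix presumably drops by a factor of 4. Below is a minimal plain-C sketch of that arithmetic, assuming the element count is a multiple of the vector width. The helper name `work_items_per_matrix` is hypothetical.

```c
#include <stddef.h>

#define ELEMS_PER_MATRIX (256 * 32)

/* Hypothetical helper: work-items needed per matrix when each work-item
   handles vec_width floats (1 for float, 4 for float4).
   Assumes ELEMS_PER_MATRIX is a multiple of vec_width. */
static size_t work_items_per_matrix(size_t vec_width)
{
    return ELEMS_PER_MATRIX / vec_width;
}
```

Under that assumption the scalar kernel needs 8192 work-items per matrix and the `float4` kernel needs 2048, so the second entry of `global_size` would be `256 * 32 / 4`.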

In both cases I got the same results and the same processing time.

That does not make sense to me!

I expected the second version to run 4 times faster.

What should I change in the host code?

Thanks,

Zvika