I have two almost identical OpenCL kernels. The only difference is the amount of work in the loop body: one kernel performs 256 FMA operations there, and the other performs 512.

Here is the general structure of the kernel:

```
__kernel void WGSXMAPIXLLXOPS8(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just a private variable
    float MF = (float) XGL;
    float NF = (float) N;
    float PF = (float) P;
    float tempOutTotal = 0;
    // Start of a new level of for loop
    for (int lcdd = 0; lcdd < 2; lcdd++) {
        float temp1 = 1.0;
        temp1 = temp1 * MF + temp1;
        temp1 = temp1 * MF + temp1;
        ...
        temp1 = temp1 * MF + temp1;
        temp1 = temp1 * MF + temp1;
        GOut[XGL] = temp1;
    }
}
```

I compute the GFLOPS of each kernel by dividing the total number of floating-point operations by the kernel's execution time. I launch 141076 work-items on the GPU. The kernel with 256 FMA operations reaches about 1696.5 GFLOPS, and the kernel with 512 FMA operations reaches about 2043.74 GFLOPS.
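To make the arithmetic concrete, here is a minimal sketch of that calculation in plain C. It rests on assumptions that are not stated above: each FMA is counted as 2 floating-point operations, and the loop's 2 iterations both count toward the total. The names `FLOPS_PER_FMA`, `total_flops`, and `gflops` are hypothetical helpers for illustration only.

```c
/* Assumption: one FMA counts as 2 floating-point operations (multiply + add). */
#define FLOPS_PER_FMA 2.0

/* Total floating-point operations for one kernel launch:
 * work-items x loop iterations x FMAs per iteration x flops per FMA. */
double total_flops(double work_items, double loop_iterations, double fma_per_iteration) {
    return work_items * loop_iterations * fma_per_iteration * FLOPS_PER_FMA;
}

/* Throughput in GFLOPS, given the measured kernel time in seconds. */
double gflops(double flops, double elapsed_seconds) {
    return flops / elapsed_seconds / 1e9;
}
```

For example, `gflops(total_flops(141076, 2, 256), t)` would give the throughput of the 256-FMA kernel for a measured time `t` in seconds. If the counting convention differs (for instance, one flop per FMA), the absolute numbers change but the ratio between the two kernels does not.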

As far as I can tell, I have enough parallelism and enough operations in the kernel, unless my assumption is wrong.

My question is: why does having more operations in the kernel improve performance? I understand that parallelism matters and that each kernel should do enough operations. So what exactly causes this specific gap in performance?

To be more specific, is there any relation between occupancy and the number and type of operations a kernel performs?