Is it possible to batch a 2d portion of a grid?

I am trying to overlay a grid (grid1) onto another grid (grid2) in parallel, and edit grid2 based on the values in grid1. Here is my kernel in pseudocode:

__kernel void simple(
	global const float* grid2,
	global float* output,
	constant float* grid1,
	const int grid1Width,
	const int grid1Height,
	const int grid2Width)
{
	// One work-item per output cell; launched with a 2-D NDRange.
	int x = get_global_id(0);
	int y = get_global_id(1);
	float value = 0.0f;
	for (int i = 0; i < grid1Width; i++)
		for (int j = 0; j < grid1Height; j++)
			value += grid1[j * grid1Width + i]
			       * grid2[(y + j) * grid2Width + (x + i)];
	output[y * get_global_size(0) + x] = value;
}

The algorithm I have works, but is it possible to increase its performance by batching the portion of grid2 that I am using into private memory? If so, how would I batch it without writing a for loop that makes the same number of calls to global memory?


Or maybe my problem is here. This is the call that I am executing in C++:

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &sizeIn, NULL, 0, NULL, NULL);
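Since the kernel indexes the grids in two dimensions, one thing to check is that the enqueue uses a 2-D NDRange rather than dimension 1. A hedged fragment (the size names are hypothetical, and this assumes one work-item per output cell):

```
// Hypothetical output dimensions; this is a sketch, not the poster's code.
size_t globalSize[2] = { outputWidth, outputHeight };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, NULL);
```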

My problem is that I have a Quadro 135M GPU and an integrated dual-core CPU, but the GPU takes twice as long as the CPU to compute the function, and if I am right this shouldn't be possible because the GPU has 8 cores.

and if I am right this shouldn't be possible because the GPU has 8 cores

In spite of what the marketers try to sell us, performance cannot be measured in “cores”.

If you are computing a convolution, as it seems you are, you can benefit from using local memory: each work-group loads the tile of grid2 it needs once, and every work-item in the group then reads from that shared copy instead of global memory. See the convolution examples in AMD's or NVIDIA's OpenCL SDK for details on how to do it.
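A minimal sketch of the local-memory approach, in OpenCL C. Everything here is illustrative (the kernel name, the flattened row-major layout, the local buffer sized by the host as (group width + grid1Width - 1) × (group height + grid1Height - 1)), and edge bounds checks are omitted:

```
// Sketch only: each work-group stages the tile of grid2 it needs
// (its own footprint plus the filter halo) into local memory once,
// then all work-items convolve against the shared copy.
__kernel void conv_local(
	global const float* grid2,
	global float* output,
	constant float* grid1,
	const int grid1Width,
	const int grid1Height,
	const int grid2Width,
	local float* tile)   // host allocates tileW * tileH floats
{
	int lx = get_local_id(0),  ly = get_local_id(1);
	int gw = get_local_size(0), gh = get_local_size(1);
	int tileW = gw + grid1Width - 1;
	int tileH = gh + grid1Height - 1;
	int baseX = get_group_id(0) * gw;
	int baseY = get_group_id(1) * gh;

	// Cooperative load: stride the whole tile across the work-group.
	// (Bounds checks for the grid edges are omitted for brevity.)
	for (int t = ly * gw + lx; t < tileW * tileH; t += gw * gh)
		tile[t] = grid2[(baseY + t / tileW) * grid2Width
		              + baseX + t % tileW];
	barrier(CLK_LOCAL_MEM_FENCE);

	float value = 0.0f;
	for (int j = 0; j < grid1Height; j++)
		for (int i = 0; i < grid1Width; i++)
			value += grid1[j * grid1Width + i]
			       * tile[(ly + j) * tileW + (lx + i)];
	output[get_global_id(1) * get_global_size(0)
	     + get_global_id(0)] = value;
}
```

The win is that each grid2 element in the tile is fetched from global memory once per work-group instead of once per work-item that touches it.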