Iterations And GlobalRange Difficulty (Related?)

Hi everyone!

I’m benchmarking a parallelized algorithm and I wanted to run 1000 iterations of it.

The thing is, my global range for a single iteration is 82369. To run those 1000 iterations I decided to multiply 82369 by 1000 and then compute the offset inside the kernel. The problem is this raises CL_OUT_OF_RESOURCES at enqueueReadBuffer.

Then I tried a for loop inside the kernel running 1000 times, keeping the global range at 82369, but the same CL_OUT_OF_RESOURCES came out, also at enqueueReadBuffer.

Then I tried some variations, like 82369*20 with a for loop of 50 iterations, but CL_OUT_OF_RESOURCES always came out.

I couldn’t find any explanation for this, nor any info explaining the relation between the global range and the number of loop iterations inside the kernel.

If someone has an idea why this happens, i would appreciate it (if this is even possible).

This is the kernel code (most of it):

__kernel void square(const unsigned long NBTS,
					 const unsigned long GRID_SIZE_X,
					 const unsigned long GRID_SIZE_Y,
					 const unsigned long radix,
					 __global unsigned long *BTSET,
					 __global long *fitness,
					 __global int *covered_points)
{
	long globalID = get_global_id(0);

	__local long GRID_SIZE;
	if (globalID == 0)
		GRID_SIZE = GRID_SIZE_X; // assignment was cut from the post; presumably something like this

	int contador;
	long x, y;
	long x_k, y_k;
	long x1, y1, rx, ry;

	float cover_rate, fit;
	int check = 0, j, i, m;

	// offset
	long idy = globalID / GRID_SIZE;
	long idx = globalID - idy * GRID_SIZE;

	if (idx == 0)
		*covered_points = 0;

	for (j = 0; j < 1; j++) { // <--- HERE'S WHERE I PUT THE 1000 ITERATIONS

		contador = 0;

		for (m = 0; m < NBTS; m++) {

			if (check != 1) {
				for (i = 0; i < NBTS; i++) {
					// ... (rest of the kernel cut here)

If you would like some more information, please just ask.

Thanks in advance…

Two questions:

  1. what is your total global size? (I believe the Nvidia driver currently has a terribly small limitation of a global size of 65k or something.)
  2. what is GRID_SIZE set to if you’re not work-item 0? It seems like it is uninitialized.
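Regarding question 2, the usual pattern is to have one work-item per work-group (selected with `get_local_id`, not `get_global_id`, so it happens in every group) write the `__local` value, and then synchronize with a barrier before anyone reads it. A sketch, assuming GRID_SIZE is meant to hold GRID_SIZE_X:

```c
__local long GRID_SIZE;

if (get_local_id(0) == 0)
    GRID_SIZE = GRID_SIZE_X;       /* exactly one writer per work-group */

barrier(CLK_LOCAL_MEM_FENCE);      /* every work-item waits for the write */

/* GRID_SIZE is now valid for all work-items in the group */
```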

I’m experiencing the same problem executing two nested for loops. Given the following code (launching 44100 work-items, the size of the addresses array), the card driver eventually crashes. It only allows me to run the outer for loop about 100 times. I don’t know if it is due to memory usage…

		long idx = get_global_id(0);
		long x_p, y_p;
		long x_a, y_a, rx, ry;

		y_p = addresses[idx] / 287;
		x_p = addresses[idx] - y_p * 287;

		for (int i = 0; i < 1000; i++) {

			for (int j = 0; j < 49; j++) {

				y_a = bt[j] / 287;
				x_a = bt[j] - y_a * 287;

				rx = abs_diff(x_a, x_p);
				ry = abs_diff(y_a, y_p);
				if (rx <= 30 && ry <= 30)
					addresses[idx] = 1;
			}
		}


What could be preventing it from running as it is supposed to?


Thanks for the reply!
About question nº2, I did it that way to try to optimize, but now I realize that I only initialize it for the first work-group 🙂

But even if I use it as a global variable, the problem persists.

The global size is at minimum 82369, but with this value I need to do 1000 iterations per kernel. And this is the problem, because it returns CL_OUT_OF_RESOURCES and I can’t find any reason why this happens.

The reason I talked about multiplying 82369 by 1000 was to try to replace the loop inside the kernel with more work-items, if you know what I mean!

Thanks in advance!

Are you using Nvidia’s drivers? If so, check their release notes because I don’t think they support a global size > 65,535.

Also, if your kernel takes too long (say, longer than 5 seconds), the system watchdog timer on most machines will kill your program. This is true on Mac OS X and, I believe, on Windows/Linux, unless you are using a dedicated (i.e., non-display) card for computing.

Yes, Nvidia drivers here…

A global memory size of around 140 MB is allowed. Constant memory is limited to 64 KB, and local memory, depending on the chip, to between 16 and 32 KB.

I had a similar problem when using too-large loops. So my outer loop isn’t in the kernel anymore; instead I use more work-groups (several thousand). If all work-groups use the same data, you should take careful time measurements.

I think there’s some confusion here. I’m not talking about a global memory size when I say there is (was?) a limitation in the Nvidia drivers. I have heard several people say that global_x * global_y * global_z must be <= 65535 to run on Nvidia’s drivers. Can anyone confirm whether this is still the case? I know it’s not a hardware limitation, since the same card will happily run arbitrarily large global sizes under Mac OS X.

Hello everyone!

Just writing to say that my problem was solved.

The problem wasn’t the global range size, but bad management of resources in the kernel. Once I optimized the kernel, by which I mean choosing a proper local range and declaring only the variables I needed, with the types I needed, the error went away.

And that was it.

So, answering dbs2: I believe the Nvidia 65535 limit doesn’t exist, at least with the latest drivers available!

Thank you all for the help!

Take care!