When I run my application on the CPU, no problems occur. I can use 256 threads on the CPU, and the execution completes with all results correct.
But when I use more than 48 threads on the GPU, the application enters an infinite loop.
With more than 40 threads it does run, but the results are not the same.
Could this be a rounding problem? Does the CPU round numbers differently from the GPU?
That is the first thing that went through my mind.
Can you post here the kernel’s source code? Also, when you talk about threads, do you mean work items? Please also show us the arguments you pass to clEnqueueNDRangeKernel(). We need to know the value of global_work_size and local_work_size.
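On the rounding question: this may not be the cause here, but results can differ between devices even when every single operation is rounded the same way, simply because the work is combined in a different order. Floating-point addition is not associative. Below is a minimal sketch in plain C (no OpenCL needed) that sums the same four values in two orders. The function names sum_sequential and sum_pairwise are made up for this illustration and do not come from your code.

```c
#include <stddef.h>

/* Sum left to right, the way a single sequential loop would. */
float sum_sequential(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}

/* Sum in a tree order, closer to how a parallel reduction
   combines partial results from several work-items. */
float sum_pairwise(const float *x, size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return x[0];
    size_t half = n / 2;
    return sum_pairwise(x, half) + sum_pairwise(x + half, n - half);
}
```

Assuming IEEE 754 single precision with no extended intermediate precision, the input {1e8f, 1.0f, -1e8f, 1.0f} gives 1 with the sequential sum and 0 with the pairwise sum. A difference like that can change how many iterations a convergence loop takes, but an infinite loop usually points to something else, which is why the kernel source is needed.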
Sorry for the late reply.
Yes, when I said threads, I meant work-items.
The kernel call is below, and the values of global_work_items and local_work_items are 1 each:
erro_executa_kernel_sivia = clEnqueueNDRangeKernel(queue, sivia, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
Can you post here the kernel’s source code? What you showed us is the API call, not the kernel source code.
Also, what is the value of local_work_size? Have you tried passing NULL instead of local_work_size?
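The reason for asking: in OpenCL 1.x, when local_work_size is given explicitly, global_work_size must be evenly divisible by it, otherwise clEnqueueNDRangeKernel() returns CL_INVALID_WORK_GROUP_SIZE. Passing NULL lets the implementation pick a work-group size itself. If you do want an explicit local size, a common approach is to pad the global size up to the next multiple. Here is a minimal sketch of that arithmetic; round_up_global_size is a hypothetical helper name, not part of the OpenCL API.

```c
#include <stddef.h>

/* Round n up to the next multiple of local.
   Assumes local > 0 (a zero local size is not valid anyway). */
size_t round_up_global_size(size_t n, size_t local) {
    size_t rem = n % local;
    return rem == 0 ? n : n + (local - rem);
}
```

For example, 50 work-items with a local size of 16 would be padded to 64. The kernel then needs a guard so the extra work-items do nothing, for instance by passing the real element count as an extra kernel argument (an assumption, since we have not seen your kernel) and returning early when get_global_id(0) is not below it. It is also worth comparing erro_executa_kernel_sivia against CL_SUCCESS after the call.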