The newest 340.X drivers waste shared (local) memory under OpenCL, and I do not know why. Here is an example, assuming work-groups of 128 threads:
__kernel kernel(float* dummy)
__local double shared_aux[128*10];
for(int i = 0; i < 10; i++)
shared_aux[get_local_id(0)*10+i] = 0.0;
If you run this OpenCL kernel on a device with driver 340.X, the profiler reports
that it uses 128 * 10 * 8 + 8 bytes (taking a double as 8 bytes). That
value is wrong; the correct value is 128 * 10 * 8.
If you run the same kernel on a device with driver 325.X, the profiler reports
128 * 10 * 8 bytes, the correct value.
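To make the two figures concrete, here is a minimal sketch in plain C of the arithmetic. The function names are hypothetical, and the 8-byte overhead is only the difference I observed in the profiler, not a documented constant.

```c
#include <stddef.h>

/* Bytes of __local memory a work-group declares statically:
   one array of threads * elems_per_thread elements. */
size_t declared_local_bytes(size_t threads, size_t elems_per_thread,
                            size_t elem_size)
{
    return threads * elems_per_thread * elem_size;
}

/* Bytes the profiler would report if the driver adds a fixed reservation.
   driver_overhead is an assumption taken from the observed +8 bytes. */
size_t reported_local_bytes(size_t declared, size_t driver_overhead)
{
    return declared + driver_overhead;
}
```

With 128 threads, 10 doubles per thread and 8 bytes per double, declared_local_bytes gives 10240. Adding an overhead of 8 gives 10248, which matches the 340.X figure, and an overhead of 0 gives the 325.X figure.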
This is a performance problem that only happens in OpenCL (in CUDA the bug doesn’t exist) with the newest 340.X drivers.
So, did you know about this behaviour? For now I will
keep my old driver, because this ‘bug/problem/whatever’ in the 340.X driver carries a large penalty in
several applications (mainly those that exhaust the shared memory).
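As a rough illustration of where the penalty comes from, here is a sketch in plain C that counts how many work-groups fit in one compute unit when local memory is the only limit. The 48 KB capacity and the 16 KB per-group kernel are hypothetical numbers chosen for the example, and the function name is mine.

```c
#include <stddef.h>

/* Work-groups that fit in one compute unit when local memory is the only
   limit. Real occupancy also depends on registers and thread limits. */
size_t groups_per_compute_unit(size_t capacity_bytes, size_t bytes_per_group)
{
    if (bytes_per_group == 0)
        return 0; /* not meaningful for this sketch; avoids dividing by zero */
    return capacity_bytes / bytes_per_group;
}
```

With a capacity of 49152 bytes, a kernel that uses exactly 16384 bytes per group fits 3 groups, but with 8 extra bytes (16392) only 2 fit. The small kernel above is not affected in this scenario (10240 and 10248 bytes both give 4 groups), so the cost shows up only when the local memory budget is sized tightly.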
Thank you in advance
PS: I have also posted on the Nvidia forums but have had no answers yet.
I have the impression that OpenCL hasn’t got a significant user base. Apple told developers “you know you should be using OpenCL” when they released the new 2xGPU Mac Pro a year ago, but I feel OpenCL hasn’t got many more users than, say, AltiVec. The fact that NVIDIA has always pushed proprietary tech doesn’t help, agreed.
Are you sure you posted the right kernel? That one doesn’t compile, because kernels must have a return type of void (yours is missing a return type) and the identifier ‘kernel’ is a keyword (actually a synonym for __kernel), so it cannot be used as the name of the kernel.
Nvidia confirmed to me that the newest r340.XX drivers use an extra word of local memory for ‘internal use’. So it is not a bug, but now I do not know whether to update the Nvidia driver, because at least in the OpenCL case it does not seem like a good idea.
I think we can close this thread because it has been solved. Thank you all.
Does the device report slightly less local memory for CL_DEVICE_LOCAL_MEM_SIZE when you’re running the r340.xx driver? It had better!
I did notice a while back that some older NVIDIA OpenCL 1.0 devices (e.g., Quadro FX 3800) now report 16383 bytes of shared local memory instead of 16384, which is interesting since the specification says 16 KB is the minimum (so they’re technically not even OpenCL 1.0 compliant anymore).
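For what it’s worth, the comparison I have in mind can be sketched in plain C like this. On a real system the reported size would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE and the kernel’s usage from clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE; here both are just passed in as numbers, and the helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimum CL_DEVICE_LOCAL_MEM_SIZE for a full-profile OpenCL 1.0 device. */
#define OPENCL_1_0_MIN_LOCAL_MEM ((size_t)16 * 1024)

/* True if the size the device reports meets the 16 KB minimum. */
bool meets_local_mem_minimum(size_t reported_bytes)
{
    return reported_bytes >= OPENCL_1_0_MIN_LOCAL_MEM;
}

/* True if a kernel's static __local usage fits in what the device reports. */
bool kernel_fits(size_t reported_bytes, size_t kernel_local_bytes)
{
    return kernel_local_bytes <= reported_bytes;
}
```

A device reporting 16383 bytes fails the first check, and a kernel written to use the full 16384 bytes no longer fits on it, while one using 10248 bytes still does.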