Problem with Local Work Size and Profiling

Hello!

I wrote an OpenCL application to perform matrix multiplication. I copied most of it from tutorials and online resources because I’m just a beginner with OpenCL.
The application seems to work, but there are some issues.

Here is the code (pay no attention to the comments in Italian, and sorry for my bad English):

The first problem: if I set the matrix dimension below 512 and measure the execution time with the “clGetEventProfilingInfo” function, the result is 0 ns. I don’t understand why.

The second problem concerns the “clEnqueueNDRangeKernel” function. If I pass NULL for the “local_work_size” argument, the application works (the only issue I find in this case is that with a matrix dimension of 1024, Windows tells me that my GPU driver crashed, but the application still produces correct results). If I pass a “localThreads[2]” array as “local_work_size”, the application works when the array is {16, 16}, but when I try different values (for example {32, 32} or {512, 512}) the application crashes, and I don’t understand why.

I’m using Visual Studio 2013 with the Intel SDK on a Surface Pro 3 with an Intel HD 4400.

Can anybody help me?

Two things: 1) The local work-group size area (width * height) cannot be larger than what CL_DEVICE_MAX_WORK_GROUP_SIZE returns (I’ve seen it as small as 128 on older hardware, and 32x32 = 1024 is larger than that). 2) The global size must be a whole-number multiple of the work-group size. For example, if the local size is 32x32, then a 64x64 global size is OK but 80x80 is not.
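Both rules can be checked on the host before calling clEnqueueNDRangeKernel. Below is a minimal sketch in plain C; the helper names `round_up_global` and `local_size_fits` are hypothetical, and `max_wg_size` is assumed to hold the value queried with clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Round a global size up to the next multiple of the local size, so the
   NDRange divides evenly into work-groups. */
static size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Return 1 if the area of a 2D local size (width * height) does not exceed
   the device limit, 0 otherwise. max_wg_size is assumed to come from
   CL_DEVICE_MAX_WORK_GROUP_SIZE. */
static int local_size_fits(const size_t local[2], size_t max_wg_size)
{
    if (local[0] == 0 || local[1] == 0) {
        return 0; /* a zero-sized work-group is never valid */
    }
    return local[0] * local[1] <= max_wg_size;
}
```

If the global size is padded this way (80 becomes 96 for a local size of 32), the kernel needs its own bounds check so that the extra work-items do not read or write outside the matrices.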

Thanks for the reply.
For my GPU the value of CL_DEVICE_MAX_WORK_GROUP_SIZE is 8192, which is larger than the group sizes (width * height) that I tested.
The problem could be related to the “clGetEventProfilingInfo” function: if I remove the portion of code that uses “clGetEventProfilingInfo” to return the execution time, the application works fine (I think…); otherwise the application crashes. I ran it in the Visual Studio debugger and it reports this error: “0xC0000005: Access violation reading location 0xCCCCCCCC.”
Could it be related to “clGetEventProfilingInfo”?

Another “strange” thing: if I set the second parameter of “clGetDeviceIDs” to CL_DEVICE_TYPE_CPU instead of CL_DEVICE_TYPE_GPU, the application works fine and returns the execution time correctly. Why does “clGetEventProfilingInfo” work fine with the CPU and not with the GPU?

OK, now I have solved the previous problem. CL_DEVICE_MAX_WORK_GROUP_SIZE is 512 for my Intel HD 4400, not 8192 as I said.

But now I have another question. I’m trying to use Intel Code Analyzer to measure the performance of my application, but it doesn’t work. So, how can I correctly measure the execution time and the time of memory operations using only OpenCL functions and not a profiler tool?

(the last answer). By the way, if you use CL_MEM_USE_HOST_PTR buffers, the time of memory operations should be basically zero (see the limitations in Intel’s programming guide).

In my code I call “clEnqueueNDRangeKernel”, then I call “clFinish” on my command queue, and then I run the portion of code that measures the execution time. Only after that do I call “clEnqueueReadBuffer” to read the results. Does the execution time that I calculate this way also include the time to read the results from the buffer, or not? (I don’t fully understand what you meant by “time of memory operations should be basically zero”.)
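On the question above: as far as the OpenCL specification goes, the timestamps returned by clGetEventProfilingInfo belong only to the command whose event is queried. The event from clEnqueueNDRangeKernel therefore should not include a later clEnqueueReadBuffer; to time the read, pass a separate event to that call and query it the same way. The queue has to be created with CL_QUEUE_PROFILING_ENABLE, and the event must have completed before it is queried. The sketch below is plain C and only shows the arithmetic on the four timestamps; the struct `profile_times` and the helpers `exec_ms` and `wait_ms` are hypothetical names, and the fields are assumed to be filled from CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.

```c
#include <assert.h>

/* Hypothetical container for the four profiling timestamps of one event.
   All values are device-time counters in nanoseconds. */
typedef struct {
    unsigned long long queued; /* CL_PROFILING_COMMAND_QUEUED */
    unsigned long long submit; /* CL_PROFILING_COMMAND_SUBMIT */
    unsigned long long start;  /* CL_PROFILING_COMMAND_START  */
    unsigned long long end;    /* CL_PROFILING_COMMAND_END    */
} profile_times;

/* Time the command spent executing on the device, in milliseconds. */
static double exec_ms(const profile_times *t)
{
    assert(t->end >= t->start);
    return (double)(t->end - t->start) / 1e6;
}

/* Time between enqueueing the command and its start, in milliseconds. */
static double wait_ms(const profile_times *t)
{
    assert(t->start >= t->queued);
    return (double)(t->start - t->queued) / 1e6;
}
```

Filling one `profile_times` for the kernel event and another for the read-buffer event gives the two durations separately. A result of exactly 0 for very short kernels may simply mean the run was shorter than the device timer resolution, which can be queried with CL_DEVICE_PROFILING_TIMER_RESOLUTION.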