Problem with Local Work Size and Profiling

Bigz92 · September 16, 2015, 6:19am

Hello!

I wrote an OpenCL application to perform matrix multiplication. I copied it largely from tutorials and online resources because I’m just a beginner with OpenCL.
The application seems to work, but there are some issues.

Here is the code: https://www.friendpaste.com/2NIpYvk8R96S01kFD3H3Gl (please ignore the comments in Italian, and sorry for my bad English).

The first problem: if I set the matrix dimension below 512, the execution time I compute with the “clGetEventProfilingInfo” function comes out as 0 ns. I don’t understand why.

The second problem concerns the “clEnqueueNDRangeKernel” function. If I call it with NULL for the “*local_work_size” argument, the application works (the only issue I see in this case is that with a matrix dimension of 1024, Windows reports that my GPU driver crashed, but the application still works correctly). If I pass my “localThreads[2]” array as “*local_work_size”, the application works when the array is {16,16}, but when I try different values (for example {32,32} or {512,512}) the application crashes, and I don’t understand why.

I’m using Visual Studio 2013 with the Intel SDK on a Surface Pro 3 with an Intel HD 4400.

Can anybody help me?

Dithermaster · September 17, 2015, 1:57pm

Two things: 1) The local work-group area (width * height) cannot be larger than the value returned for CL_DEVICE_MAX_WORK_GROUP_SIZE (I’ve seen it as small as 128 on older hardware, which a 32x32 group exceeds). 2) The global size must be a whole-number multiple of the work-group size in each dimension. For example, with a 32x32 local size, a 64x64 global size is OK but 80x80 is not.
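The two rules above can be sketched as a pair of small C helpers. This is only an illustration: the names `round_up_global` and `local_size_ok` are hypothetical, and `max_wg_size` is assumed to be the value queried with clGetDeviceInfo for CL_DEVICE_MAX_WORK_GROUP_SIZE. The sketch does not check the per-dimension limits (CL_DEVICE_MAX_WORK_ITEM_SIZES) or the per-kernel limit (CL_KERNEL_WORK_GROUP_SIZE), which can be stricter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Round a global size up to the next multiple of the local size.
   Assumes local > 0. */
size_t round_up_global(size_t global, size_t local)
{
    size_t rem = global % local;
    return rem == 0 ? global : global + (local - rem);
}

/* Check a 2D local size against the two rules from the post:
   1) local[0] * local[1] must not exceed max_wg_size
      (assumed to come from CL_DEVICE_MAX_WORK_GROUP_SIZE);
   2) each global dimension must be a whole-number multiple
      of the matching local dimension. */
bool local_size_ok(const size_t global[2], const size_t local[2],
                   size_t max_wg_size)
{
    if (local[0] == 0 || local[1] == 0)
        return false;
    if (local[0] * local[1] > max_wg_size)
        return false;
    return global[0] % local[0] == 0 && global[1] % local[1] == 0;
}
```

For instance, with a maximum of 512, a {32,32} local size fails rule 1 (1024 work-items), while {16,16} passes. If the global size is padded with `round_up_global`, the kernel needs a bounds check so the extra work-items do not write outside the matrices.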

Bigz92 · September 18, 2015, 12:54am

Thanks for the reply.
For my GPU the value of CL_DEVICE_MAX_WORK_GROUP_SIZE is 8192, which is larger than any group size (width*height) that I tested.
The problem could be related to the “clGetEventProfilingInfo” function: if I remove the portion of code that uses “clGetEventProfilingInfo” to get the execution time, the application works fine (I think…); otherwise it crashes. I ran it in the Visual Studio debugger and it reports this error: “0xC0000005: Access violation reading location 0xCCCCCCCC.”
Could it be related to “clGetEventProfilingInfo”?

Bigz92 · September 18, 2015, 1:30am

Another “strange” thing: if I set the second parameter of “clGetDeviceIDs” to CL_DEVICE_TYPE_CPU instead of CL_DEVICE_TYPE_GPU, the application works fine and returns the execution time correctly. Why does “clGetEventProfilingInfo” work fine with the CPU and not with the GPU?

Bigz92 · September 21, 2015, 12:14pm

OK, I have now solved the previous problem. CL_DEVICE_MAX_WORK_GROUP_SIZE is 512 for my Intel HD 4400, not 8192 as I said.

But now I have another question. I’m trying to use Intel Code Analyzer to measure the performance of my application, but it doesn’t work. So, how can I correctly measure the execution time and the time of memory operations using only OpenCL functions, without a profiler tool?

Salabar · September 21, 2015, 11:31pm

See the last answer to “Measuring execution time of OpenCL kernels” on Stack Overflow. By the way, if you use CL_MEM_USE_HOST_PTR buffers, the time of memory operations should be basically zero (see the limitations in Intel’s programming guide).
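As a rough sketch of the arithmetic involved: event profiling gives two cl_ulong timestamps in nanoseconds per command, read with clGetEventProfilingInfo using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, provided the queue was created with CL_QUEUE_PROFILING_ENABLE. The helpers below are hypothetical names and take plain 64-bit integers, so they only show the subtraction and unit conversion, not the OpenCL calls themselves.

```c
#include <stdint.h>

/* Elapsed time of one command in nanoseconds, given its start and end
   profiling timestamps (assumed to come from clGetEventProfilingInfo).
   Returns 0 if the timestamps are equal or out of order. */
uint64_t elapsed_ns(uint64_t start_ns, uint64_t end_ns)
{
    return end_ns >= start_ns ? end_ns - start_ns : 0;
}

/* Convert nanoseconds to milliseconds for printing. */
double ns_to_ms(uint64_t ns)
{
    return (double)ns / 1.0e6;
}
```

Each event times only the command it was attached to, so a kernel event does not include a later buffer read. To time the read as well, a separate event can be passed to clEnqueueReadBuffer and measured the same way.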

Bigz92 · September 22, 2015, 3:52am

In my code I call “clEnqueueNDRangeKernel”, then “clFinish” on my command queue, and after that I run the portion of code that measures the execution time. Only then do I call “clEnqueueReadBuffer” to read the results. Does the execution time calculated this way also include the time to read the results from the buffer, or not? (I don’t quite understand what you meant by “time of memory operations should be basically zero”.)