I want to measure the time taken by various modules inside the kernel function.
How can I do it? I am able to measure the total time the NDRangeKernel takes to execute, but
as the kernel has multiple parts, I want to measure the time taken by each of those as well.
Please tell me if there is any way of doing this.
The best way to do this is to split your single kernel into multiple kernels so that you can measure each component separately, the same way you already time the complete NDRangeKernel call. Depending on your OpenCL implementation, you might also consider profiling tools, but with standard OpenCL your only option is to restructure your kernels so that each part can be benchmarked on its own.
AJ’s suggestion is great. What I’ve done is comment out various parts of the kernel, along the lines of “if this part was ‘free’, how fast would it run?”. You can do this separately for reads, compute steps, and writes. It gives you a pretty good idea of where the time is being spent. Of course the results are incorrect (because you’re not doing all of the work), but it is a useful profiling aid because it tells you which parts to concentrate on. For example, if commenting out some compute step only reduced the run time by 5%, then no amount of optimization of that section could save more than about 5%.
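To make the 5% reasoning concrete, here is a minimal sketch in plain C of the arithmetic. The function name and the timings are made up for illustration; the two inputs are the run time of the full kernel and the run time with one section commented out, measured however you already time the kernel.

```c
#include <assert.h>

/*
 * Upper bound, as a percentage of the full run time, on what optimising
 * one section could save. Hypothetical helper, not part of OpenCL.
 *   full_ms            : measured time of the complete kernel
 *   without_section_ms : measured time with that section commented out
 */
double max_gain_percent(double full_ms, double without_section_ms)
{
    assert(full_ms > 0.0);
    assert(without_section_ms >= 0.0 && without_section_ms <= full_ms);
    return 100.0 * (full_ms - without_section_ms) / full_ms;
}
```

For example, with a full run of 20 ms and 19 ms without the compute step, `max_gain_percent(20.0, 19.0)` gives 5, so that step is not where most of the time goes.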
One warning, though: this may sometimes result in the compiler automatically removing sections of your kernel because their results are no longer used. A very simple example: if you comment out a write to global memory, the compiler will also remove every operation that was performed to produce the value being written, since that output is no longer needed. You would then be timing a kernel that skips the compute as well as the write. There’s nothing wrong with the approach suggested by guillona; just keep the compiler’s optimisations in mind.
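One common way around this, sketched below under some assumptions, is to keep the write in the code but guard it with a comparison between the computed value and a run-time argument that never matches in practice. The compiler cannot prove the comparison is false, so it has to keep the computation. The sketch is plain C so that it compiles anywhere; in a real kernel the loop index would come from `get_global_id(0)` and the pointers would be `__global`. `compute_step`, `stage_keep_compute`, and `sentinel` are hypothetical names.

```c
#include <stddef.h>

/* Stand-in for the compute section being measured (hypothetical). */
static float compute_step(float x)
{
    return x * x + 1.0f;
}

/*
 * Timing variant of a stage: the store to `out` happens only when the
 * computed value equals `sentinel`, a value passed in at run time.
 * Because the condition depends on the result, the computation cannot
 * be discarded as dead code. Pass a sentinel the computation can never
 * produce (here, any negative number) to suppress the writes.
 */
void stage_keep_compute(const float *in, float *out, size_t n, float sentinel)
{
    for (size_t i = 0; i < n; ++i) {
        float r = compute_step(in[i]);
        if (r == sentinel) {
            out[i] = r;
        }
    }
}
```

The comparison and branch are not entirely free, so treat the resulting timings as approximate.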