Using NVIDIA's Visual Profiler with OpenCL

Does anyone use the NVIDIA Visual Profiler on their OpenCL code? I’m having a few issues with it.

First, I see strange behavior with an extremely simple kernel from the NVIDIA OpenCL Best Practices Guide, shown below. If I set offset = 0, the profiler only completes the first run and then the second execution fails. If I set offset = 1 (or, it seems, any other value > 0), the profiler works fine. I have also posted a thread about this on the NVIDIA forums, but they don’t seem too active.

__kernel void test_kernel(__global real_t * odata,
                          __global real_t * idata) {
	// offset copy, as found in the NVIDIA OpenCL Best Practices Guide
	// (real_t is my typedef for float or double)
	size_t offset = 0;
	size_t xid = get_global_id(0) + offset;
	odata[xid] = idata[xid];
}

Next, I see some generally strange behavior with the profiler and don’t really trust the output I’m getting. Details are explained in a post at the NVIDIA forums.

Just hoping that someone has been through this and has some insight. Thanks!


I think there is an NVIDIA Visual Profiler bug. I posted details here.

Yeah, it’s pretty naff. If you can get it to work you can get some decent information out of it at least, although the ‘32 sessions’ limit is pretty annoying.

From your NVIDIA forum post: when I get problems with ‘file … does not contain any output’, I narrow down what I’m profiling and turn off as much as possible (for a few months I couldn’t even use it until I discovered this). In the ‘session settings’, on the ‘profile counters’ tab, I typically turn off all the instruction counters, and sometimes most or all of the cache counters, and just stick to memory, which is normally what I’m interested in if it isn’t just the overall time.

As a bonus it runs far fewer iterations of the tests, which makes it easier to use (e.g. 3-5 rather than 12-15). The failures seem to be related to how many runs are taken, so the fewer the better.

As to the coalesced memory reporting: just a few days ago I was trying to track that down with some very simple tests which I knew had to be coalesced, and I came to the conclusion that that particular top-level report is basically meaningless.

I came to this conclusion because I seemed to hit a wall in performance, decided it was ‘good enough’ anyway (GTX 480, array addition: ~100 GB/s read + ~50 GB/s write), and lost patience with trying to work out exactly what it was reporting to me.

Good thing I met a british guy this weekend so I could ask him the meaning of the word ‘naff’ :wink:

Thanks for the reply. It’s good to get confirmation that the profiler is quirky and that I’m not doing something blatantly wrong.

I plan to try a simple array addition example to see if I can get the profiler to tell me I’m getting near-optimal memory throughput. Do you have any feel for how much data needs to be transferred in order to get accurate reporting of that? So far I’ve been working on small-ish data sets and, even though I think the transfers should be coalesced, the reported memory throughput seems quite low. I wonder if that’s because my data sets are too small for accurate reporting? Tips?

Just for sanity, I assume that they calculate throughput from GPU execution time and not total program CPU run time?


If I was British i’d probably have said ‘it’s a bit pants’ :slight_smile:

On the last query: from the profiler output it looks like it’s purely GPU time + GPU memory operations (as counted by the device).

I am just timing stuff using real-world data from video frames, so my array addition test was about 4 million complex numbers. This seemed enough to give it a bit of a work-out. If the job is too small you won’t get peak throughput - but it’s pretty easy to try different amounts.