Why does clGetPlatformInfo get called in every clEnqueue function?

We are profiling an OpenCL application running on an NVidia GPU, on both the host and the device. We were surprised to find (based on gperftools) that the host spends 44% of its time in clGetPlatformInfo, a function we call only once in our own code. It is being called from clEnqueueCopyBuffer_hid, clEnqueueWriteBuffer_hid, and clEnqueueNDRangeKernel_hid (and presumably all the other clEnqueue functions, which are just less common in our code). Since this accounts for so much of our host time, and we currently appear to be host-bound, I need to know whether there's a way to eliminate these extra calls.

Why is this being called by every OpenCL call? (Presumably it’s static information that could be stored in the context?) Did we perhaps initialize our context incorrectly?

EDIT: I was asked for an MWE:

#include <CL/opencl.h>

#include <vector>
using namespace std;


int main ()
{
	cl_uint numPlatforms;
	clGetPlatformIDs (0, nullptr, &numPlatforms);

	vector<cl_platform_id> platformIdArray (numPlatforms);
	clGetPlatformIDs (numPlatforms, platformIdArray.data (), nullptr);

	// Assume the NVidia GPU is the first platform
	cl_platform_id platformId = platformIdArray[0];

	cl_uint numDevices;
	clGetDeviceIDs (platformId, CL_DEVICE_TYPE_GPU, 0, nullptr, &numDevices);

	vector<cl_device_id> deviceArray (numDevices);
	clGetDeviceIDs (platformId, CL_DEVICE_TYPE_GPU, numDevices, deviceArray.data (), nullptr);

	// Assume the NVidia GPU is the first device
	cl_device_id deviceId = deviceArray[0];

	cl_context context = clCreateContext (
		nullptr,
		1,
		&deviceId,
		nullptr,
		nullptr,
		nullptr);

	cl_command_queue commandQueue = clCreateCommandQueue (context, deviceId, 0 /* no properties */, nullptr);

	cl_mem mem = clCreateBuffer (context, CL_MEM_READ_WRITE, sizeof(cl_int),
	                             nullptr, nullptr);

	cl_int i = 0;

	while (true)
	{
		clEnqueueWriteBuffer (
			commandQueue,
			mem,
			CL_TRUE,
			0,
			sizeof (i),
			&i,
			0,
			nullptr,
			nullptr);

		++i;
	}
}

This MWE generates the following profile over the course of several seconds. Note that 99% of the time is spent in clGetPlatformInfo. (See stack overflow question 61663830 for the diagram, since I can’t post links (yet?))

This is a repost from the StackOverflow question 61663830, which is still awaiting an answer.

For the sake of comparison, I ran it on a different computer using an Intel compute device, and its profile looks much the same: 95% of the time is spent calling clGetPlatformInfo. The internals are quite different, however. A lot of the time there is spent in GTPin_Init or in mutex code.

I’m wondering now… is this just where time is spent to wait on events? Or does it actually have to look up the platform info with every call?

Is it possible that your test isn't running long enough to get a reasonable sample size?

The initial call to clGetPlatformIDs can be rather expensive since it requires enumerating which drivers are installed on the system (lots of registry or file system accesses), instantiating the drivers (lots of dynamic library loading), plus any initialization that the drivers do themselves. This should be a one-time cost though. It certainly shouldn’t be called by every OpenCL API.

If the cost of clGetPlatformIDs continues to stay high even as the program runs for longer (and hence doesn’t appear to be a one-time cost), is it possible that your profiler is assigning execution time to it as a default, because it doesn’t know where else to assign it?

I’m more than a little biased :smiley:, but I find the Host Performance Timing capability of the OpenCL Intercept Layer to be a good way to measure both the per-call cost and aggregate cost of OpenCL host APIs.
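For reference, the Intercept Layer's host timing is typically switched on through environment-variable controls along these lines (control names and paths here are from memory and illustrative; verify against the project's controls documentation):

```shell
# Environment-variable controls for the OpenCL Intercept Layer:
export CLI_HostPerformanceTiming=1   # aggregate per-API host timing, reported on exit
export CLI_ReportToStderr=1          # send the report to stderr

# Run the app with the intercept layer's libOpenCL ahead of the real one
# (path is illustrative):
LD_PRELOAD=/path/to/intercept/libOpenCL.so ./my_app
```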

Thank you for your response!

I agree with your statement that it should be a large one-time cost. However, my profile of our production code (which unfortunately contains proprietary info) was based on a five-minute sample and still showed the aforementioned 44% of time spent in clGetPlatformInfo. The profile of the MWE was only a 30-second sample, but I would be very worried if clGetPlatformInfo were really consuming 99% of a 30-second run. :grimacing:

To your second question: gperftools samples the stack 100 times a second to determine where time is being spent, so somehow the stack is ending up in clGetPlatformInfo via multiple routes (clEnqueueCopyBuffer_hid, clEnqueueWriteBuffer_hid, clEnqueueNDRangeKernel_hid, etc.). So it seems unlikely to me that the time is simply unattributed… it would report a raw address if it didn't have a valid stack at that point.

You’d be right to suggest that our current stack isn’t well-suited to serious profiling of OpenCL programs… this is our first, and we weren’t able to find any off-the-shelf OpenCL profilers that advertised NVidia support, and NVidia seems to have dropped OpenCL profiling support in the last few years.

Does the Intercept Layer work for NVidia GPUs or just for Intel processors?

I’m almost positive this is a symbol problem.

When I run a test with gperftools using our GPU driver, I see most of the time attributed to GTPin_Init, as you mentioned. I think this is because an OpenCL ICD is required to export only a handful of symbols; calls into most OpenCL APIs go through the ICD dispatch table instead.

Here are the symbols I see exported from our GPU ICD:

bashbaug@bashbaug-nuc:/usr/local/lib/intel-opencl$ nm -D --defined-only libigdrcl.so 
0000000000042430 T clGetExtensionFunctionAddress
000000000002e2d0 T clGetPlatformInfo
000000000004bec0 T clIcdGetPlatformIDsKHR
000000000004f210 T GTPin_Init

My guess is that the closest exported symbol to a lot of the OpenCL APIs tends to be GTPin_Init, so the calls get billed to it even though it's an entirely different function.

This also explains why other ICDs may see calls attributed to clGetPlatformInfo, since it’s one of the other (few) functions that ICDs are required to export.

Yup, it is designed to work with any OpenCL implementation, and I regularly use it with non-Intel implementations.

Looks like this will be useful. I’ll run it through with this and post back with the results. :slightly_smiling_face:

@bashbaug, thanks for the suggestions earlier. The OpenCL Intercept Layer was very helpful in tracking down our problem, and I think you are probably right about why clGetPlatformInfo is the dominant node on our performance chart.

We were not releasing sub-buffers (we didn't realize we had to, since they're aliases of already-allocated memory). The growing reference count somehow caused our enqueue calls involving that memory to take longer and longer. Now that we've fixed that, our timing information looks quite normal, and we'll go about improving performance in the usual ways (data locality, fewer synchronization points, etc.).
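For anyone hitting the same thing: a sub-buffer created with clCreateSubBuffer is a cl_mem with its own reference count, so it needs a matching clReleaseMemObject even though it aliases the parent buffer's storage. A sketch of the pattern, assuming `parent` is an existing cl_mem (error handling omitted; this fragment needs a live OpenCL context, so it won't run standalone):

```cpp
// Carve a sub-buffer out of an existing buffer, use it, release it.
cl_buffer_region region = {};
region.origin = 0;                // byte offset into the parent buffer
region.size   = sizeof (cl_int);  // size of the aliased range

cl_int err = CL_SUCCESS;
cl_mem sub = clCreateSubBuffer (parent, CL_MEM_READ_WRITE,
                                CL_BUFFER_CREATE_TYPE_REGION,
                                &region, &err);

// ... enqueue work that uses 'sub' ...

// The sub-buffer holds its own reference; without this release the
// reference count just keeps growing with every sub-buffer created.
clReleaseMemObject (sub);
```

Note that `region.origin` must also satisfy the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment requirement, or creation fails with CL_MISALIGNED_SUB_BUFFER_OFFSET.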

Thank you again!