OpenCL/OpenGL Problems: clEnqueue{Acquire|Release}GLObjects

We are working on a rendering prototype that shares 2D and 3D textures with OpenCL. The volume texture is roughly 125 MiB in size. We ran into a problem with the clEnqueueAcquireGLObjects and clEnqueueReleaseGLObjects calls: they take ~15 ms each (~30 ms combined!).

This is unacceptable. We suspect that OpenCL internally duplicates the texture memory and copies the data to and from OpenGL. When we acquire only small 2D OpenGL resources, the calls do not take up much frame time.

This is how we run the kernel:

std::vector<cl::Memory> acq; // filled with the shared GL images below (push_back calls omitted)


int arg_count = 0;
cl_error = _ray_cast_kernel->setArg(arg_count++, *_output_cl_image); assert(!cl_error_string(cl_error).empty());
cl_error = _ray_cast_kernel->setArg(arg_count++, *vdata->volume_image()); assert(!cl_error_string(cl_error).empty());
cl_error = _ray_cast_kernel->setArg(arg_count++, *vdata->color_alpha_image()); assert(!cl_error_string(cl_error).empty());
cl_error = _ray_cast_kernel->setArg(arg_count++, *vdata->volume_uniform_buffer()); assert(!cl_error_string(cl_error).empty());

cl_error = context->cl_command_queue()->enqueueAcquireGLObjects(&acq); assert(!cl_error_string(cl_error).empty());
cl_error = context->cl_command_queue()->enqueueNDRangeKernel(*_ray_cast_kernel, ::cl::NullRange, global_range, local_range, 0, 0); assert(!cl_error_string(cl_error).empty());
cl_error = context->cl_command_queue()->enqueueReleaseGLObjects(&acq); assert(!cl_error_string(cl_error).empty());

Here is the test code with which we measured the acquire and release times:

cl_error = context->cl_command_queue()->enqueueAcquireGLObjects(&acq); assert(!cl_error_string(cl_error).empty());

//cl_error = context->cl_command_queue()->enqueueNDRangeKernel(*_ray_cast_kernel, ::cl::NullRange, global_range, local_range, 0, 0); assert(!cl_error_string(cl_error).empty());

cl_error = context->cl_command_queue()->enqueueReleaseGLObjects(&acq); assert(!cl_error_string(cl_error).empty());
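One way to take host-side wall-clock timing out of the picture is to read the device-side timestamps via OpenCL event profiling. A minimal sketch of that approach, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE (the names `queue` and `acq` stand in for the objects above):

```cpp
glFinish(); // make sure GL has finished touching the shared textures first

cl::Event acquire_evt, release_evt;
queue.enqueueAcquireGLObjects(&acq, NULL, &acquire_evt);
queue.enqueueReleaseGLObjects(&acq, NULL, &release_evt);
queue.finish(); // both commands have completed on the device at this point

// Profiling counters are in nanoseconds.
cl_ulong a0 = acquire_evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong a1 = acquire_evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
double acquire_ms = (a1 - a0) * 1e-6;
```

If `acquire_ms` still comes out around 15 ms, the time really is spent executing the acquire on the device, not in host-side queueing overhead.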

These are the two read-only images used by the kernel:

    _volume_image.reset(new cl::Image3DGL(*device->cl_context(), CL_MEM_READ_ONLY,
                                            voldata->volume_raw()->object_target(), 0,
                                            voldata->volume_raw()->object_id(), &cl_error));
    _color_alpha_image.reset(new cl::Image2DGL(*device->cl_context(), CL_MEM_READ_ONLY,
                                               voldata->color_alpha_map()->object_target(), 0,
                                               voldata->color_alpha_map()->object_id(), &cl_error));

This is the single write-only image used by the kernel:

        _output_cl_image.reset(new cl::Image2DGL(*device->cl_context(), CL_MEM_WRITE_ONLY,
                                                 _output_texture->object_target(), 0,
                                                 _output_texture->object_id(), &cl_error));

As said, we suspect that the OpenCL implementation copies the OpenGL resources into its own memory. Can anyone tell us whether this is really happening, and whether it can or will be solved in future implementations? As it stands today it is sadly not usable for us…

We are trying this on Nvidia GeForce 480/580 hardware using r285 drivers.

I’m assuming you created the CL images from GL textures using the clCreateFromGLTexture{2D|3D} API. clEnqueueAcquireGLObjects / clEnqueueReleaseGLObjects should not perform a copy if both CL and GL are on the same GPU. Is there only one GPU in the system where you are encountering this issue? If so, I recommend reporting the problem to the vendor.

15 ms is way too long for 128 MB of data anyway, so it can’t just be a redundant copy.

Also try a clFinish() before timing the release: otherwise you’re timing the kernel run time too.

Yes, it is just a single GPU; that is why I am so shocked by this issue. I was under the impression that OpenCL simply shares the resources.

The kernel was commented out for the measurements. I also ran the test with the kernel included and an additional clFinish() after it; the results were the same.

With the code you have provided, you are timing the kernel execution and the context switch. enqueueNDRangeKernel returns almost immediately and the kernel runs later. If you want to actually time the release, attach a callback (or an event) to the enqueueNDRangeKernel so you are notified when it completes, and only then start the timer.

Edit: I see you mentioned that you commented out the kernel call. You might try using events to wait for the acquire and release of the GL objects. clFinish() seems to do more than simply wait for the outstanding events (in my experience, waiting on the events directly can give a relatively big performance advantage). Also, have you called glFinish() before you start timing, to make sure you are not waiting for OpenGL to finish up?
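A sketch of that suggestion, with assumed names (`queue` and `acq` as in the original post): flush GL first, then wait on the release event directly instead of calling finish() on the whole queue:

```cpp
glFinish(); // don't let pending GL work leak into the CL measurement

cl::Event release_evt;
queue.enqueueAcquireGLObjects(&acq);
queue.enqueueReleaseGLObjects(&acq, NULL, &release_evt);
release_evt.wait(); // blocks only until the release command has completed
```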

Another thing: take your timing after running a warm-up pass, to make sure the card has turned off all power-saving measures. Fermi cards, power beasts though they are, try to shut down transistors any chance they get.
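For example (a hedged sketch; `queue` and `acq` are again assumed to be the objects from the original post), a few untimed iterations before the measured run:

```cpp
// Warm-up: exercise the acquire/release path so clock gating and other
// power-saving states have settled before the measured run.
for (int i = 0; i < 10; ++i) {
    queue.enqueueAcquireGLObjects(&acq);
    queue.enqueueReleaseGLObjects(&acq);
}
queue.finish(); // drain the warm-up work, then start timing
```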

Hope that helps.

Ahh right, sorry I thought that was a paste-o.

Ask the vendor I guess.

It is probably hitting some limit and swapping device memory back and forth with the host. 15 ms pretty much suggests it’s crossing the PCIe bus. If you have time (to waste!) you could see whether there’s a point at which this limit kicks in and rearrange the code to stay within it, or buy different hardware.

OK, some additions.

We see the mentioned behavior under Windows 7; the exact same program running under Linux does not show the high acquire and release times.

I contacted Nvidia and their answer was along the lines of: “Try CUDA and we will help you solve your performance problems.” No thank you; it was a conscious decision to use OpenCL over CUDA.

Hah! How nice of them, although hardly surprising. We recently ditched them for AMD, but AMD has its own troubles and fairly different performance characteristics to deal with…

I had some trouble with Windows and OpenCL from Java, but it was due to Swing using Direct2D in places (I presume) and requiring very slow context switches.