Kernel output not being saved in host RAM


I am trying out OpenCL on my nVidia GTX660. It looks great, but I am having an issue with CL_MEM_USE_HOST_PTR buffers. I use this flag on all my buffers : it works on input buffers but my kernel cannot write the output data in RAM.

I tried to use clEnqueueReadBuffer() after kernel execution and it works so what’s going on ? Does this mean that the buffer is in the GPU RAM ? Why ?
Here are my relevant functions.

void OCLInterface::LoadKernel(string file, size_t size)
        char* source;
        size_t fileSize;

        source = LoadSource(file.c_str(), &fileSize);
        program = clCreateProgramWithSource(context, 1, (const char **)&source, &fileSize, &err);

        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
        kernel = clCreateKernel(program, file.c_str(), &err);

        currentSize = size;
        currentArgCount = 0;

void OCLInterface::AddParam(float* data, bool bInput)
        cl_mem buffer = clCreateBuffer(
                VSIZE * sizeof(float),
        err = clSetKernelArg(kernel, currentArgCount, sizeof(cl_mem), (void*)&buffer);

void OCLInterface::Exec(int size)
        size_t localsize = size;
        size_t globalsize = ceil(VSIZE / (float)localsize) * localsize;
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalsize, &localsize, 0, NULL, NULL);
        err = clFinish(queue);

// My code

float* A = (float*)malloc(VSIZE * sizeof(float));
float* B = (float*)malloc(VSIZE * sizeof(float));
float* C = (float*)malloc(VSIZE * sizeof(float));
// snip : array init (a loop)

openCL.LoadKernel("add", VSIZE);
openCL.AddParam(A, true);
openCL.AddParam(B, true);
openCL.AddParam(C, false);

At this point C[] is still at its initialization value (0 in my case but depending on the array init code).
And now my kernel, which should do a simple vector addition.

__kernel void add(
	__global const float* input1,
	__global const float* input2,
	__global float* output)
    unsigned int index = get_global_id(0);
    output[index] = input1[index] + input2[index];

Thanks !

You always have to use clEnqueueReadBuffer(), even with CL_MEM_USE_HOST_PTR buffers.

As stated in the specification: “OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.”

There is no automatic synchronization between host memory and device memory with CL_MEM_USE_HOST_PTR buffers, so a clEnqueueReadBuffer() or clEnqueueMapBuffer() is necessary to flush the device cache to the host buffer.

Thank you for this explanation !
If I read your quote right, the host buffer is cached onto the GPU RAM, then (if used as an output) written but not transferred back.

If the host buffer is modified, will its contents be cached again on the GPU or not ?

If you write directly in the host memory of the buffer, you’ll then have to use clEnqueueWriteBuffer() to update the buffer (or you can instead write to host memory inside a clEnqueueMapBuffer() … clEnqueueUnmapMemObject() sequence).

I went with memory pinning as you said, it works great. Thank you !