I have been puzzled by a bug in my OpenCL program for a week, and I have now narrowed it down to a problem with sub buffer creation.
Basically, in my program the main buffer is calculated and filled in by a kernel, and the program is correct up to this point. However, the subsequent kernel needs to read parts of this main buffer, and I decided to do that with a set of sub buffers referring to the main one. This is where the problem appears. It seems that when I create a sub buffer, the part of the memory covered by that sub buffer is reset (the numbers inside become random).
Is this the intended behavior? The problem only occurs on NVIDIA hardware (every device I tested), though. I tested the program on AMD, where it runs fine and generates correct results. I also spent 10 hours running the program through Oclgrind; no errors were reported, and the program again yielded correct results.
For now I have replaced the sub buffer creation with a regular buffer and a buffer-to-buffer copy, and the program runs fine. But I would love not to have to do that.
So instead of using clCreateBuffer() alone, you’ve used clCreateSubBuffer() to represent a piece of a buffer previously created with clCreateBuffer().
To gather responses from someone more versed in OpenCL, you might post a short snippet of the code related to this sub-buffer and its parent buffer object, including creation, setup before the kernel launch and (importantly) how you’re handling synchronization.
You might also double-check that you are following the rules for accessing this sub-buffer object. From the OpenCL 3.0 spec:
Thanks for the advice. I made a minimal test example with sub buffers and figured out what was going on.
Basically, when I call clReleaseMemObject() on the sub buffer, NVIDIA’s driver also appears to release the actual memory allocated on the GPU, which I believe is not the correct behavior. I figured this out because if I don’t release the sub buffer, the program generates correct results in the small test cases as well as in the actual program.
What made this even more difficult for me to debug is that I was using the OpenCL C++ wrapper, which does the release automatically behind the scenes. Now I am considering switching to the C API; I have to accept that OpenCL is a C standard.
I just want to make sure: am I doing the right thing by calling clReleaseMemObject() on the sub buffers?