I am trying to debug a program that throws a CL_MEM_OBJECT_ALLOCATION_FAILURE (-4) error on a GPU. The program takes images as input; it works on the GPU for images up to a certain dimension, but above that dimension it fails with that error in a clEnqueueWriteBuffer() call. I believe the program uses only device global memory, because every kernel declares its cl_mem-backed pointer arguments as __global, and because every cl_mem buffer is created with only the CL_MEM_READ_WRITE flag and a NULL host_ptr. I therefore suspect the program is running out of global memory.
The GPU it runs on has 1,073,741,824 bytes (1 GiB) of global device memory. I made a memory estimate by summing the sizes of the cl_mem buffers that had been filled with data at the time of the crash. By that estimate, for one set of input images the program crashed while trying to grow from 905,969,663 bytes to 973,078,528 bytes; for another set of images with different dimensions, it crashed while trying to grow from 887,500,800 bytes to 920,678,400 bytes.
We figured there might be a way in OpenCL to fall back to host memory when there is not enough GPU global memory, so that this crash does not occur. Among the clCreateBuffer() flags, I tried CL_MEM_USE_HOST_PTR for some of the cl_mem objects, and I tried CL_MEM_ALLOC_HOST_PTR for all of the cl_mem objects, but the program still crashed in the same place.
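In case it helps diagnosis, this is a sketch of how I understand the two relevant device limits can be queried; `device` is assumed to be a valid cl_device_id obtained elsewhere, and the code needs an OpenCL runtime to actually run:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: print total capacity and the per-buffer ceiling for a device.
   `device` is assumed to be a valid cl_device_id obtained elsewhere. */
void print_mem_limits(cl_device_id device)
{
    cl_ulong global_size = 0, max_alloc = 0;

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_size), &global_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    printf("global memory:     %llu bytes\n", (unsigned long long)global_size);
    printf("max single buffer: %llu bytes\n", (unsigned long long)max_alloc);
}
```

If any single buffer in the program exceeds the reported CL_DEVICE_MAX_MEM_ALLOC_SIZE, that could explain a failure even when total usage is below capacity.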
The program uses OpenCL 1.2. Would the shared virtual memory (SVM) introduced in OpenCL 2.0 be a solution to this problem?
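For reference, my understanding of what the coarse-grained SVM path would look like is sketched below; this is not a drop-in fix (`context`, `queue`, and `kernel` are assumed to exist, and as far as I know the runtime may still back the allocation with device memory, so it would not necessarily raise the capacity ceiling):

```c
#include <string.h>
#include <CL/cl.h>

/* Sketch of coarse-grained SVM usage (OpenCL 2.0). context, queue, and
   kernel are assumed valid; bytes is the size of one image buffer. */
int run_with_svm(cl_context context, cl_command_queue queue,
                 cl_kernel kernel, const void *host_data, size_t bytes)
{
    /* Allocate SVM memory visible to both host and device. */
    void *svm = clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0);
    if (svm == NULL)
        return -1;                      /* allocation can still fail */

    /* Coarse-grained SVM: map for host access, unmap before kernels run. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm, bytes, 0, NULL, NULL);
    memcpy(svm, host_data, bytes);      /* replaces clEnqueueWriteBuffer() */
    clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);

    /* Pass the SVM pointer directly instead of a cl_mem object. */
    clSetKernelArgSVMPointer(kernel, 0, svm);
    /* ... clEnqueueNDRangeKernel, result readback, etc. ... */

    clFinish(queue);
    clSVMFree(context, svm);
    return 0;
}
```

Would switching to something like this actually let the allocations spill into host memory, or does SVM hit the same device limits?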