Buffer with USE_HOST_PTR doesn't work

I am trying the following simple code, but it doesn’t work. It gives an access violation when running the kernel (which could mean the kernel arguments are not initialized).

Can you point out what is wrong with the code below?

Host Code:

int *bs = new int[2000 * 4];
for (int i = 0; i < 8000; i++) {
	bs[i] = 10;
}
cl_int error;
cl_mem tem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, (2000 * 4) * sizeof(cl_int), bs, &error);

cl_int   status;
cl_event event;

status = clSetKernelArg(
			scaleKernel, 0, 
			sizeof(cl_mem),  &tem);

size_t globalThreadsStep1[1];
size_t localThreadsStep1[1];
globalThreadsStep1[0] = 2064;
localThreadsStep1[0] = 64;

status = clEnqueueNDRangeKernel(commandQueue, testKernel,
            1, NULL, globalThreadsStep1,
            localThreadsStep1, 0, NULL,
            &event);
/* wait for the kernel call to finish execution */
status = clWaitForEvents(1, &event);

Kernel Code:

#pragma OPENCL EXTENSION cl_amd_printf : enable
kernel void
test_code(global int4 *data) {
	int id = get_global_id(0);
	int pc  = data[id].s0;
	printf("id: %d pc:%d", id, pc);
}

Most probably you get the access violation when calling clWaitForEvents, for two reasons:

  1. You don’t check the error codes (in the ‘status’ variable).
  2. 2064 is not evenly divisible by 64.
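
To illustrate the second point: in OpenCL 1.x, when a local work size is passed, the global work size has to be a multiple of it, otherwise clEnqueueNDRangeKernel is expected to fail with CL_INVALID_WORK_GROUP_SIZE. A minimal sketch of the usual rounding, assuming a hypothetical helper name (round_up_global_size is not part of the OpenCL API):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL call): round `global` up to the
// nearest multiple of `local` so the NDRange divides into whole work-groups.
std::size_t round_up_global_size(std::size_t global, std::size_t local) {
    assert(local > 0);
    std::size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

// Note: rounding up launches extra work-items, so the kernel would need a
// bounds guard such as `if (id >= n) return;` (n passed as an extra argument)
// to avoid reading past the end of the buffer.
```

With the numbers from the post, round_up_global_size(2064, 64) gives 2112, which is more than the 2000 int4 elements in the buffer, so the guard matters here.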

I’ve checked the status and error codes by debugging through the host code.
I left the checks out to keep the code to a minimum.

Also, I’ve changed the global size to be divisible by 64, but it still doesn’t work.

size_t globalThreadsStep1[1];
size_t localThreadsStep1[1];
globalThreadsStep1[0] = 1920;
localThreadsStep1[0] = 64;

I get CL_SUCCESS in both status and error all the way through the code, except for clWaitForEvents.

The error: it tries to read some uninitialized memory when executing the kernel, and the printf shows that it occurs on the first OpenCL thread.
Access Violation read at 0Xffffff

BTW, it is the AMD OpenCL implementation running on the CPU.

Well, your kernel is not initialized indeed.

You call clSetKernelArg for scaleKernel, while you are executing testKernel.

It is a copy-and-paste error from when I was removing the implementation details.
Sorry for that.

But even if you change that to testKernel, it still happens.
Now I am executing this test code rather than the actual code.

Well, I don’t see any other problems in this code. Maybe you removed the faulty code when removing the implementation details.

No, it happens.
I am able to reproduce it.

I can attach a template project (built on top of the AMD sample template project).

Please download the file from http://www.assembla.com/spaces/opencl_public/documents
Please overwrite the templateC project in the AMD OpenCL samples with it.

I know this is a trivial one, but somehow it happens on my machine.

It definitely has something to do with int4 in the kernel; otherwise it would work.

My code is wrong.

It seems we need aligned memory. I had only gone through the OpenCL basics and assumed I could allocate memory for 4 ints and it would work.

Oh, yeah. You need to create bs as “new cl_int4[2000]”.

No, new cl_int4[2000] doesn’t help.

The memory needs to be aligned. You need to use
_aligned_malloc(2000 * sizeof(cl_uint4), 16)
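
Note that _aligned_malloc comes from the Microsoft C runtime. For other toolchains, a rough portable sketch is to over-allocate and use std::align from <memory> to find an aligned address inside the block. The names AlignedBuffer and allocate_aligned below are made up for illustration, and the alignment is assumed to be a power of two:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical holder: owns the raw block and exposes an aligned address in it.
struct AlignedBuffer {
    std::unique_ptr<unsigned char[]> storage; // owns the over-allocated block
    void* ptr = nullptr;                      // aligned address inside storage
};

// Allocate `bytes` bytes aligned to `alignment` (assumed to be a power of two).
AlignedBuffer allocate_aligned(std::size_t bytes, std::size_t alignment) {
    AlignedBuffer buf;
    std::size_t space = bytes + alignment;    // room for worst-case padding
    buf.storage.reset(new unsigned char[space]);
    void* p = buf.storage.get();
    // std::align bumps p forward to the next aligned address that still
    // leaves `bytes` bytes available, and returns that address.
    buf.ptr = std::align(alignment, bytes, p, space);
    assert(buf.ptr != nullptr);
    assert(reinterpret_cast<std::uintptr_t>(buf.ptr) % alignment == 0);
    return buf;
}
```

For the buffer in this thread the call would be allocate_aligned(2000 * 4 * sizeof(int), 16), with buf.ptr then passed as the host pointer to clCreateBuffer. The holder has to outlive the cl_mem object, since CL_MEM_USE_HOST_PTR keeps using that memory.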

Does the OpenCL spec mandate that the memory be aligned, or is it because I am executing on the CPU (which may use SSE instructions)?

Does the code run on a GPU?
It is weird that searching for this issue brings up no results here or on Google.

It wasted too much time.

I checked Intel’s OpenCL Optimization tutorial. It says:

//min alignment query returns value in bits
cl_uint min_align = 0;
clGetDeviceInfo(g_dev, CL_DEVICE_MEM_BASE_ADDR_ALIGN…, &min_align,…);
//here alignment should be in bytes
cl_float* g_pfInput = (cl_float*)_aligned_malloc(data_size, min_align/8);
const cl_mem_flags flags = CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY;
//this code places the buffer in the mem-region pointed by g_pfInput
cl_mem g_buf = clCreateBuffer(g_context, flags, data_size, g_pfInput,…);
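
The unit conversion in that snippet is easy to get wrong: CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits, while _aligned_malloc takes bytes. A tiny sketch of the conversion, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: convert an alignment reported in bits (as the
// CL_DEVICE_MEM_BASE_ADDR_ALIGN query does) into bytes for the allocator.
std::size_t alignment_bits_to_bytes(unsigned int alignment_bits) {
    assert(alignment_bits % 8 == 0); // must be a whole number of bytes
    return alignment_bits / 8;
}
```

For example, a reported value of 1024 bits corresponds to 128 bytes, and 128 bits corresponds to the 16 bytes of an int4. The value a given device reports is not something this thread states.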

It seems we need aligned memory. I had only gone through the OpenCL basics and assumed I could allocate memory for 4 ints and it would work.

The memory doesn’t need to be aligned in order for the code to work correctly. The only way in which memory alignment matters is in the case where you are passing CL_MEM_USE_HOST_PTR when you create the buffer. In that case passing aligned memory can help the OpenCL driver avoid memory copies.

Again, the code would work even if the memory was unaligned. It must be something else.

No, the problem goes away as soon as I use aligned memory.
The code is really too simple for there to be another problem.

You can check the attachment URL I gave.

No, the problem goes away as soon as I use aligned memory.

That’s a bug in the OpenCL driver. The spec does not require the user pointer to be aligned.

Thinking a bit more about it, I think I was wrong. Section 6.1.5 says that “A data item declared to be a data type in memory is always aligned to the size of the data type in bytes”. Also, section 6.2.5 says that pointer casting “represents an unchecked assertion that the address is correctly aligned”.

Since your kernel argument is a pointer to int4 it must be aligned to sizeof(int4).

In other words, you were right and your OpenCL implementation is fine.
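
To make that requirement checkable on the host side, here is a minimal sketch that tests a pointer before handing it to clCreateBuffer with CL_MEM_USE_HOST_PTR. Int4 is a stand-in for cl_int4 so the snippet compiles without the OpenCL headers (assumed: four 32-bit ints, 16 bytes), and both function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for cl_int4 (assumption: four 32-bit ints, 16 bytes in total).
struct Int4 { std::int32_t s[4]; };

// True if `ptr` is a multiple of `alignment` bytes.
bool is_aligned_to(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// A host pointer backing a `global int4 *` kernel argument should be
// aligned to sizeof(int4), i.e. 16 bytes.
bool usable_as_int4_buffer(const void* host_ptr) {
    return is_aligned_to(host_ptr, sizeof(Int4));
}
```

Calling usable_as_int4_buffer(bs) right after the allocation would show whether a given allocator happened to return a suitably aligned block. A plain new int[...] is only guaranteed to be aligned for int, so the result can differ between platforms.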