cast between the scalar and vector types

sajis997 · February 5, 2014, 9:14pm

Hello forum,

I created a buffer of cl_float2 on the host and passed it to the kernel as float2. Is it possible to cast it to float inside the kernel, as follows?


float *converted = (float*)(original_float2);

Thanks

utnapishtim · February 6, 2014, 6:58am

Yes, it is possible. See section 6.2.5, “Pointer Casting”, of the OpenCL specification for the details.
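
To illustrate what such a cast does to indexing, here is a minimal host-side C sketch. It assumes a float2 is two tightly packed floats; `float2_t`, `as_floats` and `float_count` are hypothetical names for illustration, not part of the OpenCL API. Note that inside an OpenCL 1.x kernel the cast should also keep the address space qualifier, for example `(__global float *)vx`, because casting a pointer from one address space to another is not allowed there.

```c
#include <stddef.h>

/* Hypothetical stand-in for float2 / cl_float2: two packed floats. */
typedef struct { float x; float y; } float2_t;

/* Reinterpret an array of float2 elements as a flat array of floats.
 * Element i of the float2 array maps to floats 2*i (x) and 2*i + 1 (y).
 * In an OpenCL 1.x kernel the equivalent cast keeps the qualifier:
 *     __global float *f = (__global float *)vx;
 */
static const float *as_floats(const float2_t *v)
{
    return (const float *)v;
}

/* Number of floats available behind a buffer of n_float2 elements. */
static size_t float_count(size_t n_float2)
{
    return 2 * n_float2;
}
```

In other words, index k of the float view addresses component k % 2 of float2 element k / 2, and a buffer of n float2 elements holds 2 * n floats.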

sajis997 · February 6, 2014, 9:47am

Thanks for the hint. I am getting CL_INVALID_COMMAND_QUEUE with the following kernel, where I cast a float2 pointer to a float pointer.


__kernel void velocity(read_only image2d_t image, // OpenCL image
                       __global float2 *vx,       // X-component of the velocity field
                       __global float2 *vy,       // Y-component of the velocity field
                       int dx,                    // size of the domain in the X dimension
                       int pdx,                   // padded width for the in-place FFT
                       int dy,                    // size of the domain in the Y dimension
                       float dt,                  // delta time
                       int lb,
                       sampler_t sample)
{
	//cast between the pointer types
	float *vx_aligned = (float*)vx;
	float *vy_aligned = (float*)vy;


	int gtidx = get_global_id(0);
	int gtidy = get_group_id(1) * (lb * get_local_size(1)) + get_local_id(1) * lb;
	int p;

	float2 ploc;
	float4 vterm;

	//variables to store the x-component and y-component
	//of the velocity field
	float vxterm,vyterm;


	//gtidx is the domain location in x for this thread
	if(gtidx < dx)
	{
		for(p = 0; p < lb; p++)
		{
			//fi is the domain location in y for this thread
			int fi = gtidy + p;

			if(fi < dy)
			{
				int fj = fi * pdx + gtidx;

				//define the 2D coordinate
				float2 coord = (float2)(gtidx,fi);

				vterm = read_imagef(image,sample,coord);

				ploc.x = (gtidx + 0.5f) - (dt * vterm.x * dx);
				ploc.y = (fi + 0.5f) - (dt * vterm.y * dy);

				vterm = read_imagef(image,sample, ploc);

				vxterm = vterm.x;
				vyterm = vterm.y;

				//only the real component of the velocity field is updated
				vx_aligned[fj] = vxterm;
				vy_aligned[fj] = vyterm;
			}
		}
	}
}

I am not sure what I am doing wrong here. Being new to OpenCL, I need some hints on how to debug a kernel. I read somewhere in the forum that this type of error usually indicates a “page fault”, “segmentation fault” or “invalid memory access”, but the compiler does not say anything about it.

The computation domain is 2D and its size is 512 by 512 cells. The domain is divided into tiles of 64 by 64 cells, and a work-group of 64 by 4 work-items is responsible for computing each tile. In other words, the 256 work-items are arranged logically as 64 work-items in the x-direction by 4 work-items in the y-direction. The work-items are distributed over the tile such that each work-item computes the results for a vertical column of 16 cells.

The local and global work sizes are defined as follows:


#define TILEX 64 // Tile width
#define TILEY 64 // Tile height
#define TIDSX 64 // Tids in X
#define TIDSY 4  // Tids in Y
   ..............................................
  ...............................................

   localWorkSize[0] =  TIDSX;  // work group # of work items
   localWorkSize[1] =  TIDSY;
   
   globalWorkSize[0] = ((dx/TILEX)+(!(dx%TILEX)?0:1)) * TIDSX; // global # of work items
   globalWorkSize[1] = ((dy/TILEY)+(!(dy%TILEY)?0:1)) * TIDSY * (TIDSX/TIDSY);


   //and execute the kernel
   errNum = clEnqueueNDRangeKernel(commandQueue,
				   advectVelocityKernel,
				   2,
				   NULL,
				   globalWorkSize,
				   localWorkSize,
				   0,
				   NULL,
				   NULL);
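
For reference, the work-size formulas above can be evaluated on the host with a small C sketch. The helper names `tiles_for` and `global_work_size` are hypothetical, and the ceiling division is just an equivalent way of writing the `(n / tile) + (!(n % tile) ? 0 : 1)` expression from the post.

```c
#include <stddef.h>

#define TILEX 64 /* tile width */
#define TILEY 64 /* tile height */
#define TIDSX 64 /* work-items per group in X */
#define TIDSY 4  /* work-items per group in Y */

/* Number of tiles needed to cover n cells (ceiling division). */
static size_t tiles_for(size_t n, size_t tile)
{
    return (n + tile - 1) / tile;
}

/* Same global work-size formulas as in the post. */
static void global_work_size(size_t dx, size_t dy, size_t gws[2])
{
    gws[0] = tiles_for(dx, TILEX) * TIDSX;
    gws[1] = tiles_for(dy, TILEY) * TIDSY * (TIDSX / TIDSY);
}
```

For the 512 by 512 domain this yields a global size of 512 by 512, that is 8 work-groups in X and 128 in Y. Assuming lb = 16 (taken from the "column of 16 cells" description), group row 8 already starts at gtidy = 8 * 64 = 512, so only the first 8 group rows would pass the `fi < dy` test in the kernel.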

Any ideas on how to address the issue?

Thanks

utnapishtim · February 7, 2014, 12:52am

I agree that CL_INVALID_COMMAND_QUEUE is often caused by a memory access violation by a kernel.

If your sampler uses some kind of address clamping (i.e. it does not use the CLK_ADDRESS_NONE mode), the image reads cannot go out of bounds, so the only memory accesses left to check are the writes to vx_aligned[fj] and vy_aligned[fj]. You should check that the index fj does not run outside the array bounds (it probably does).

Note that get_group_id(1) * get_local_size(1) + get_local_id(1) is simply get_global_id(1), so gtidy = get_global_id(1) * lb. Because gtidx < dx and fi < dy, the maximum value of fj is then (dy - 1) * pdx + (dx - 1). Check that this is consistent with your allocations for vx and vy.
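
One way to run this check is to emulate the kernel's index arithmetic on the host. The C sketch below is only an illustration under stated assumptions: the function names are hypothetical, and it assumes the float view of a buffer of n float2 elements holds 2 * n floats. The largest fj the kernel writes is (dy - 1) * pdx + (dx - 1), because gtidx stops at dx - 1 and fi stops at dy - 1.

```c
#include <stddef.h>

/* Closed form: largest fj written by the kernel (gtidx < dx, fi < dy). */
static size_t max_fj_closed_form(size_t dx, size_t pdx, size_t dy)
{
    return (dy - 1) * pdx + (dx - 1);
}

/* Emulate the kernel's index arithmetic for every work-item and
 * return the largest fj it would write (0 if nothing is written). */
static size_t max_fj_emulated(size_t dx, size_t pdx, size_t dy, size_t lb,
                              const size_t gws[2], const size_t lws[2])
{
    size_t max_fj = 0;
    for (size_t gy = 0; gy < gws[1]; gy++) {
        size_t group_y = gy / lws[1];               /* get_group_id(1) */
        size_t local_y = gy % lws[1];               /* get_local_id(1) */
        size_t gtidy = group_y * (lb * lws[1]) + local_y * lb; /* == gy * lb */
        for (size_t gtidx = 0; gtidx < gws[0] && gtidx < dx; gtidx++) {
            for (size_t p = 0; p < lb; p++) {
                size_t fi = gtidy + p;
                if (fi < dy) {
                    size_t fj = fi * pdx + gtidx;
                    if (fj > max_fj)
                        max_fj = fj;
                }
            }
        }
    }
    return max_fj;
}

/* True when every write through the float view stays inside a buffer
 * of n_float2 float2 elements, i.e. 2 * n_float2 floats. */
static int writes_in_bounds(size_t max_fj, size_t n_float2)
{
    return max_fj < 2 * n_float2;
}
```

Calling `writes_in_bounds(max_fj_closed_form(dx, pdx, dy), n)` with the actual values of dx, pdx, dy and the number n of float2 elements allocated for vx and vy shows whether the writes can overrun the buffers.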