Memory corruption using clEnqueueWriteBuffer

I’m working in some code that sends large amounts of data from host to device, and it behaves erratically.

In the following piece of the code, I’m trying to send from host to device an array.
The array size is incrementing on each iteration, gradually increasing the amount of memory sent to the device.
The first element in the array is filled with a nonzero value, and it’s read from inside the kernel and printed to console.
The value should be the same when it’s read from the host and from the device, but in some iterations it’s not.

Here’s the code:


	int SizeArray = 0;
	for(int j=1;j<100;j++){
		
		//Array memory allocation, starting with 4MB in first iteration to 400MB in last one
		SizeArray = j * 1000000 * sizeof(float);
		Array = (float*)malloc(SizeArray);
		memset(Array, 0, SizeArray);
		
		//Give the array's first element some nonzero value
		//This is the value that is expected to be printed by the kernel execution
		Array[0] = j;

		memArray = clCreateBuffer(context, CL_MEM_READ_WRITE, SizeArray, NULL, &ret);
	
		//Write the array contents into the buffer inside the device
		ret = clEnqueueWriteBuffer(command_queue, memArray, CL_TRUE, 0, SizeArray, Array, 0, NULL, NULL);
		ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memArray);

		getchar();

		//Execute the kernel where the content of the first element of the array will be printed
		ret = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, mGlobalWorkSizePtr, mLocalWorkSizePtr, 0, NULL,NULL);
		ret = clFinish(command_queue);
		
		/****** FAIL! Kernel prints correct value of Array's first element ONLY IN SOME ITERATIONS (when it fails zero values are printed)! Depending on SizeArray :?? ******/
		
		free(Array);
		ret = clReleaseMemObject(memArray);
	}

The device where this code was tested has the following features:
[ul]
[li]Name: Intel® HD Graphics 4000[/li][li]DeviceVersion: OpenCL 1.1 [/li][li]DriverVersion: 8.15.10.2696[/li][li]MaxMemoryAllocationSize: 425721856[/li][li]GlobalMemoryCacheSize: 2097152[/li][li]GlobalMemorySize: 1702887424[/li][li]MaxConstantBufferSize: 65536[/li][li]LocalMemorySize: 65536[/li][/ul]

Kernel returns incorrect values or not depending on the buffer size sent to the device.
Here’s the output:


Array GPU: 1.000000
Array GPU: 2.000000
Array GPU: 3.000000
Array GPU: 4.000000
Array GPU: 5.000000
Array GPU: 6.000000
Array GPU: 7.000000
Array GPU: 8.000000
Array GPU: 9.000000
Array GPU: 10.000000
Array GPU: 11.000000
Array GPU: 12.000000
Array GPU: 13.000000
Array GPU: 14.000000
Array GPU: 15.000000
Array GPU: 16.000000
Array GPU: 17.000000
Array GPU: 18.000000
Array GPU: 19.000000
Array GPU: 20.000000
Array GPU: 21.000000
Array GPU: 22.000000
Array GPU: 23.000000
Array GPU: 24.000000
Array GPU: 25.000000
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 34.000000
Array GPU: 35.000000
Array GPU: 36.000000
Array GPU: 37.000000
Array GPU: 38.000000
Array GPU: 39.000000
Array GPU: 40.000000
Array GPU: 41.000000
Array GPU: 42.000000
Array GPU: 43.000000
Array GPU: 44.000000
Array GPU: 45.000000
Array GPU: 46.000000
Array GPU: 47.000000
Array GPU: 48.000000
Array GPU: 49.000000
Array GPU: 50.000000
Array GPU: 51.000000
Array GPU: 52.000000
Array GPU: 53.000000
Array GPU: 54.000000
Array GPU: 55.000000
Array GPU: 56.000000
Array GPU: 57.000000
Array GPU: 58.000000
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 0.000000     <-------- INCORRECT VALUE, kernel is receiving corrupted memory
Array GPU: 68.000000
Array GPU: 69.000000
...

As you can see, incorrect values are received by the device with no apparent pattern, and no error code is returned by clEnqueueWriteBuffer function.

To summarize: A memory block is sent to the kernel, but kernel receives zero’ed memory depending on the total block size sent.

The same code tested on different computers behaves differently (incorrect values in different iterations).

How can be memory corruption avoided? Am I missing something?

Thanks in advance.

=================================================================================

Here’s the complete code (sorry but i couldn’t upload the .cpp file):


#include <iostream>
#include <vector>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define MAX_SOURCE_SIZE (512)

int main()
{
	std::vector<cl_device_id> deviceVector;
	cl_context context = NULL;
	cl_command_queue command_queue = NULL;
	cl_program program = NULL;
	cl_kernel kernel = NULL;
	cl_platform_id platforms[8];
	cl_uint ret_num_devices;
	cl_uint ret_num_platforms;
	cl_int ret;
	cl_mem memArray = NULL;
	float *Array;
	
	//--------------------------KERNEL---------------------------------
	char *source_str;
	source_str = (char*)malloc(MAX_SOURCE_SIZE);

	source_str =	"#if defined cl_intel_printf 
" \
					"	#pragma OPENCL EXTENSION cl_intel_printf :enable 
" \
					"#else 
" \
					"	#pragma OPENCL EXTENSION cl_amd_printf :enable 
" \
					"#endif 
" \
					" 
" \
					"__kernel void KernelTest(__global const float * Array) 
" \
					"	{ 
" \
					"	const int i = get_global_id(0); 
" \
					"	const int j = get_global_id(1); 
" \
					"	const int k = get_global_id(2); 
" \
					" 
" \
					"	if(i==0 && j==0 && k==0) 
" \
					"		printf(\"\\r\
Array GPU: %f \", Array[0]); 
" \
					"	} 
" \
					"
";

	size_t source_size = strlen(source_str);
	//------------------------------------------------------------------	
	

	std::cout << "
Initializing" << std::endl;
	//--------------------------INITIALIZE---------------------------------
	ret = clGetPlatformIDs(8, platforms, &ret_num_platforms);
	std::cout << " clGetPlatformIDs " << ret << std::endl;
	std::cout << " Number of Platforms:  " << ret_num_platforms << std::endl;

	for (int i=0; i<ret_num_platforms; i++)
	{
		cl_device_id devices[8];
		ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &ret_num_devices);
		std::cout << " clGetDeviceIDs " << ret << std::endl << std::endl;

		for(int j=0; j<ret_num_devices; j++){
			if(std::find(deviceVector.begin(), deviceVector.end(), devices[j]) == deviceVector.end())
				deviceVector.push_back(devices[j]);
		}
	}

	for (int i=0; i<deviceVector.size(); i++)
	{
		char buffer[1024];
		cl_uint buf_uint;
		cl_ulong buf_ulong;
		clGetDeviceInfo(deviceVector[i], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
		std::cout << " -- "<< i << " -- DEVICE_NAME: " << buffer << std::endl;
	}
		
	int option = 0;
	do{
		std::cout << "
Option: ";
		std::cin >> option;
	}while(option<0 || option>deviceVector.size()-1);

	context = clCreateContext(NULL, 1, &deviceVector[option], NULL, NULL, &ret);
	std::cout << " clCreateContext " << ret << std::endl;

	command_queue = clCreateCommandQueue(context, deviceVector[option], 0, &ret);
	std::cout << " clCreateCommandQueue " << ret << std::endl;

	program = clCreateProgramWithSource(context, 1, (const char **)&source_str,	(const size_t *)&source_size, &ret);
	std::cout << " clCreateProgramWithSource " << ret << std::endl;

	ret = clBuildProgram(program, 1, &deviceVector[option], NULL, NULL, NULL);
	std::cout << " clBuildProgram " << ret << std::endl;

	if(ret){
		size_t log_size;
		clGetProgramBuildInfo(program, deviceVector[option], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
		char *log = (char *) malloc(log_size);
		clGetProgramBuildInfo(program, deviceVector[option], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
		std::cout << log << std::endl;

		getchar();getchar();
		return ret;
	}

	kernel = clCreateKernel(program, "KernelTest", &ret);
	std::cout << " clCreateKernel " << ret << std::endl;
	//------------------------------------------------------------------


	std::cout << "
Kerneling" << std::endl;
	//--------------------------EXECUTE---------------------------------
	size_t mGlobalWorkSizePtr[] = { 100, 100, 62 };
    size_t mLocalWorkSizePtr[] = { 5, 5, 1 };

	int SizeArray = 0;
	for(int j=1;j<100;j++){

		//Array memory allocation, starting with 4MB in first iteration to 400MB in last one
		SizeArray = j * 1000000 * sizeof(float);
		Array = (float*)malloc(SizeArray);
		memset(Array, 0, SizeArray);

		//Give the array's first element some nonzero value
		//This is the value that is expected to be printed by the kernel execution
		Array[0] = j;
		std::cout << "

 Size: " << SizeArray/sizeof(float) << std::endl << std::endl;

		//Create the buffer where the content of the array will be stored
		memArray = clCreateBuffer(context, CL_MEM_READ_WRITE, SizeArray, NULL, &ret);
		std::cout << " clCreateBuffer Array " << ret << std::endl;
	
		//Write the array contents into the buffer inside the device
		ret = clEnqueueWriteBuffer(command_queue, memArray, CL_TRUE, 0, SizeArray, Array, 0, NULL, NULL);
		std::cout << " clEnqueueWriteBuffer Array " << ret << std::endl;
		ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memArray);
		std::cout << " clSetKernelArg memArray " << ret << std::endl;

		std::cout << "
Array CPU : " << Array[0] << std::endl;
		getchar();

		//Execute the kernel where the content of the first element of the array will be printed
		ret = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, mGlobalWorkSizePtr, mLocalWorkSizePtr, 0, NULL,NULL);
		ret = clFinish(command_queue);

/****** FAIL! Kernel prints correct value of Array's first element ONLY IN SOME ITERATIONS (when it fails zero values are printed)! Depending on SizeArray :?? ******/

		free(Array);
		ret = clReleaseMemObject(memArray);
	}

	ret = clFlush(command_queue);
	ret = clFinish(command_queue);
	ret = clReleaseKernel(kernel);
	ret = clReleaseProgram(program);
	ret = clReleaseCommandQueue(command_queue);
	ret = clReleaseContext(context);

	free(source_str);
	//------------------------------------------------------------------

	return 0;
}