OpenCL Bandwidth Test Kernel


At the moment I am trying to measure the bandwidth of global read/write operations of
my GPU. I use this kernel:

__kernel void bandwidth(__global float *idata,
                        __global float *odata,
                        int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

I just copy the buffer idata to odata, one element per work item. offset is normally set to zero; it is only there to observe the impact of uncoalesced memory access.
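To see why the offset matters: on hardware of this generation, a warp's loads are served in aligned memory segments, so shifting every thread by one element can make a warp straddle a segment boundary. The sketch below is a rough model only (a hypothetical `segmentsTouched` helper, assuming 128-byte segments, 32 threads per warp, and 4-byte floats); the real coalescing rules depend on the compute capability.

```cpp
#include <cstddef>
#include <set>

// Rough model of coalescing: count how many aligned 128-byte segments
// one 32-thread warp touches when each thread reads one float at
// index (tid + offset). Assumptions: 128-byte segments, 32 threads,
// 4-byte elements -- a simplification of the real hardware rules.
std::size_t segmentsTouched(std::size_t offset) {
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < 32; ++tid) {
        std::size_t byteAddress = (tid + offset) * sizeof(float);
        segments.insert(byteAddress / 128); // segment index
    }
    return segments.size();
}
```

With offset 0 the warp's 128 bytes fit exactly in one segment; with offset 1 it straddles two, so the hardware issues twice the memory transactions for the same data.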

My graphics card, an NVIDIA GeForce GT 330, has a theoretical bandwidth of 24 GB/s, but with my little test kernel I only reach a maximum bandwidth of around 1.3 GB/s. I calculate the bandwidth with the formula from the OpenCL bandwidth manual: ((br + bw) * datasize / 1024^3) / seconds, where br and bw are the numbers of global memory reads and writes per element (here one of each). The time is measured with the GPU timer (a cl_event, with profiling enabled on the queue). The input has a size of 400 MB (an array of 104857600 floats). I launch 104857600 work items with a workgroup size of 256. Here is my host code:

//Set up context, queue and compile program
ipf::parser::cl::CLContext context;
// (error handling omitted; prints "CommandQueue Failed" if queue creation fails)

//Compiling program and creating kernel
std::string path =
   ipf::util::config.get<std::string>("ipf.shader.path") + "/";
path += ""; // kernel file name omitted

ipf::parser::cl::CLProgram prog = context.createProgram(path);
ipf::parser::cl::CLKernel bw = prog.createKernel("bandwidth");

//Set workitems and groups
int WGX = 256;
int elements = 10240*10240;
int datasize = sizeof(float) * elements;

std::vector<size_t> local;
std::vector<size_t> global;
global.push_back(elements); // 104857600 work items
local.push_back(WGX);       // workgroup size 256

//Create buffers on host and device
std::vector<float> a(elements,1.0f);
std::vector<float> b(elements,0.0f);

ipf::parser::cl::CLBuffer a_dev = /* buffer creation call omitted */;
ipf::parser::cl::CLBuffer b_dev = /* buffer creation call omitted */;

//Copy buffer on device (blocking mode)
//Set kernel args and run kernel
bw.setArg(0, a_dev);
bw.setArg(1, b_dev);
bw.setArg(2, 0);
ipf::parser::cl::CLEvent r = /* kernel launch call omitted */;
//Wait for finish
//Read data back (call truncated: ...[0],datasize);)

//Calculate bandwidth and print out results
cl_ulong start = r.runTime();
cl_ulong end = r.finishTime();
double mili = ((double)(end) - (double)(start)) * 10e-6;
double second = ((double)(end) - (double)(start)) * 10e-9;

cl_ulong result_nano = end - start;
std::cout<<"Milliseconds: "<<mili<<std::endl;
std::cout<<"Seconds     : "<<second<<std::endl;
std::cout<<"Data (in MB): "<<(elements*4*2) / 1024 / 1024 <<std::endl;
std::cout<<"Bandwidth   : "<<ipf::parser::cl::bandWidth(elements,r)<<std::endl;
std::cout<<"***********END RESULT**********"<<std::endl;

Am I doing something wrong, or is my graphics card just damn slow?

Epic fail:

double second= ((double)(end) - (double)(start))* 10e-9;

must be:

double second= ((double)(end) - (double)(start))* 1e-9;

Damn it!