clEnqueueReadBuffer() takes msec to complete???


I am using the CPU SDK on a machine with 16 GB of RAM, running openSUSE. I did some profiling on the command queue, and either I messed up something huge, or this is less-than-acceptable performance. In summary, I am measuring how long it takes to read a buffer back after the kernel executes, for int buf[16] (just 16 ints!). Here’s what I got:

        //HOST SIDE (nthreads = 16)
        // (flags/size arguments restored from context: the kernel writes
        //  nthreads ints into this buffer)
	d_calc2_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
				nthreads * sizeof(int), NULL, &err);
	checkResult((err == CL_SUCCESS), "clCreateBuffer failed");

        //Pass the pointer to the kernel
	err = clSetKernelArg(calc2_kernel, 10, sizeof(cl_mem),
				static_cast<void *>(&d_calc2_res));
	checkResult((err == CL_SUCCESS), "clSetKernelArg failed");


        //Fill it up with values in kernel (verified correct kernel execution)


        //Read the result back
        // (CL_TRUE makes this a blocking read, so the clWaitForEvents that
        //  follows is redundant; the event is kept for profiling)
	err = clEnqueueReadBuffer(cmdQueue, d_calc2_res, CL_TRUE,
                              0, nthreads*sizeof(int),
                              static_cast<void *>(calc2_res),
                              0, NULL, &eventh);
	clWaitForEvents(1, &eventh);

        //Read profiling info (submitTime / execTime stand for the msec values
        // computed from the event's clGetEventProfilingInfo timestamps)
	printf("Read buffer time for submit (1 pass):\t%f msec\n", submitTime);
	printf("Read buffer time for execute (1 pass):\t%f msec\n", execTime);

	checkResult((err == CL_SUCCESS), "clEnqueueReadBuffer failed");

The timer has nanosecond resolution, and it agrees closely with the accurate timer I used before OpenCL; both report about the same numbers:

    Read buffer time for submit (1 pass):   0.007054 msec
    Read buffer time for execute (1 pass):  0.298222 msec

So, 0.3 msec to copy 16 ints??? I tried both the blocking and non-blocking options; same thing. Is this to be expected, and if so, what workarounds do we have to get decent performance?

Performance is completely dependent on the implementation. You’ll have to talk to whoever provided you with the SDK about the particulars.

With that said, when you read back data with clEnqueueReadBuffer, there is the overhead of enqueueing the read and executing it. You will never get good performance with small chunks of data, as this overhead will swamp the transfer. Try transferring 8–64 MB and see what performance you get.

Am I the only one who thinks this is extremely slow? On the same note, I understand that many of these performance numbers depend on the specific vendor implementation, but can we ask for certain base metrics to be part of the core requirements for certification?

The premise of OpenCL is platform independence, given that implementations perform on par with some expectations; have any such expectations been set for all qualifying (certified) vendors? Can anybody post vendor comparisons on key metrics?

The OpenCL conformance tests are used to verify compliance of an implementation. However, I’m not sure that we can require performance expectations that implementations must meet or exceed as part of compliance.

IMO the best way to solve this is to work with the vendor, via the vendor's forum or directly, to get these kinds of issues resolved.