I have a large file containing blocks of data. At present we open the file, sequentially read each block (in a for loop), apply an algorithm to each block, and collect the result. This loop can be executed in parallel. I tried porting it to OpenCL and got so many unexpected errors that I'm now in doubt: can OpenCL be used here?
The OpenCL samples I have seen only do parallel arithmetic operations. In my case there are these additional challenges -
Passing a large amount of data to CL program
Executing a lengthy algorithm on this data
I am not trying to use any standard library or third-party library functions in my CL program. The algorithm is pure C (or the subset supported by OpenCL)
I would like to know whether OpenCL (or GPGPU programming in general) can be used for this kind of problem.
Thanks. The errors are only due to my lack of knowledge of OpenCL; I hope I can learn and fix them.
My suspicion was also about efficiency, as my present code does a large amount of memory reallocation. As I mentioned earlier, there is a large file; I have memory-mapped this file in the host (CPU) program (only file reading is needed). To pass this data (memory buffer) to OpenCL I then created OpenCL memory buffers and copied the data…
This doesn't feel efficient. There must be a better mechanism to share data between the host and the GPU without reallocation, isn't there?
I’m not sure why you’d need to ‘reallocate’, unless it’s part of the algorithm.
No matter what you do, the data still has to get to the CPU or the GPU. mmap still needs to read the disk, but it has no idea how the application will use the data, so it has to guess (read-ahead). If you're just streaming data, then you know the precise access pattern, so you can do the read-ahead yourself, both accurately and trivially: it should be possible to beat (or at least equal) mmap, since the latency comes from the disk access and not from the memcpys.
Since the CPU is just shunting data around and the GPU is doing the work, it's not as if you're saving CPU cycles for processing either.
For streaming in OpenCL you'd just allocate a few buffers and use them cyclically ('multi-buffering'), loading the next one while the current one is being sent to the GPU, and so on.
This multi-buffer approach is very efficient and can hide the I/O and bus latencies. Assuming the processing takes longer than the PCIe transfers, the CPU should just be queuing work and spending most of its time waiting for the GPU to finish the oldest buffer queued. And if processing takes more time than the disk I/O, the disk reads should be finished by the time the GPU is ready for the data too.
Obviously you have to 'copy' the data to the GPU device, since it uses different memory (unless you're using an APU, in which case the 'copy' functions do nothing).
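On the earlier question of avoiding the extra allocation+copy from the mmap'd region: OpenCL can wrap existing host memory via the `CL_MEM_USE_HOST_PTR` flag to `clCreateBuffer` (or allocate pinned memory with `CL_MEM_ALLOC_HOST_PTR` and map it with `clEnqueueMapBuffer`). A hedged fragment, not runnable standalone since it needs an OpenCL context and device; whether it is truly zero-copy is up to the vendor's runtime:

```c
/* Sketch: wrap an existing host buffer (e.g. your mmap'd region) in a
 * cl_mem object instead of allocating a second buffer and copying.
 * Error handling elided. On a discrete GPU the data still crosses the
 * bus when a kernel uses it; on an APU the runtime may avoid the copy. */
#include <CL/cl.h>
#include <stddef.h>

cl_mem wrap_host_block(cl_context ctx, void *host_ptr, size_t bytes)
{
    cl_int err;
    /* CL_MEM_USE_HOST_PTR: the buffer uses host_ptr as its backing
     * store, so host_ptr must stay valid (and ideally untouched by the
     * host) while kernels that read it are in flight. */
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                bytes, host_ptr, &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}
```

Even so, this only removes the host-side duplicate buffer; the multi-buffer streaming above is still what hides the transfer latency.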