I have a large 1-D array of size >= 2^32.
I am using OpenCL on the GPU and/or CPU to test out a system that applies a simple function to each element of this enormous array (for simplicity, assume the function is XOR with 0xFF).
I have a device whose maximum workgroup size is 1024. (I am writing code that can also use a CPU OpenCL implementation if one is available, so the GPU-vs-CPU choice is immaterial.)
Currently I divide the array into blocks of 1024 * 4 = 2^12 elements and then make 2^20 kernel calls, one per block.
This obviously is not efficient.
I have also tried using a work size of 2^32, but that is very slow and makes the system run hot. I have tried some other combinations, but I am looking for a more generic method that works even when the transformation is not a simple XOR but a complex function involving multiple arithmetic operations.
How can I solve this problem by making fewer kernel calls and streaming the array to the compute device without having to wait for each kernel to complete?
Assume that my kernel is just running a custom transform function on each element of the 1-D array.
Unless you can keep all the data on the card, or unless you're doing something much more complex per element, the transfer overhead will likely make this slower than just using a CPU. A modern CPU can do quite a lot of this kind of work before memory bandwidth becomes the bottleneck.
But some ideas:
It sounds like you're using the workgroup size to size your buffers; that isn't how it's done. The workgroup size is how many work-items execute concurrently on a given CU: they can share LDS, cooperate to access memory in a coalesced way, and so on. If you're not using LDS, set it to a device-efficient size (e.g. a multiple of 64) and then basically ignore it. The buffers you send should be many times this amount, where "many" is some multiple of the device's CU count, remembering that each CU can run multiple workgroups concurrently depending on resource requirements (registers, LDS, etc.).
You’ll probably have to experiment with the sizes of the transfer buffers but start at least in the mega-items range. The only way to hide any waiting is to use multiple queues and multiple transfer buffers.
e.g. each queue has its own input/output buffer, you cycle through the queues in turn, and you only need to wait for a queue's input buffer to be free before reusing it.
It should then ideally run at the speed of the slowest component (i.e. the PCI bus) rather than at the sum of all the stages.