I am trying to process a huge amount of data. Essentially it’s one buffer, several GB in size. Each of my kernel executions only needs a small fraction of it (a few hundred kB), and each execution needs a different part. I’d like to enqueue everything with one command and a global work size of a few tens of thousands, as this seems to be much faster than enqueueing smaller fractions with a smaller global work size.
Within my kernel execution I then access the right part of the data by using the global id to calculate the index range.
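To make the indexing concrete, here is a minimal host-side sketch in plain C of the calculation described above. It assumes every execution owns one contiguous, fixed-size slice; the names `slice_range` and `slice_for_global_id` are hypothetical and not part of OpenCL. Inside a real kernel the `global_id` argument would come from `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open element range [begin, end) owned by one execution. */
typedef struct {
    size_t begin;
    size_t end;
} slice_range;

/* Hypothetical helper: map a global id to its slice of the big buffer.
 * Assumption: every execution owns exactly slice_elems contiguous
 * elements, and slices are laid out back to back. */
slice_range slice_for_global_id(size_t global_id, size_t slice_elems)
{
    slice_range r;
    assert(slice_elems > 0);
    r.begin = global_id * slice_elems;
    r.end = r.begin + slice_elems;
    return r;
}
```

With this layout the slice for global id `g` starts at `g * slice_elems`, so the part of the buffer a given execution touches is known before the kernel is launched.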
Now my question: is there a way to transfer the data gradually, just in time, so that I never exceed CL_DEVICE_MAX_MEM_ALLOC_SIZE?
Thanks for your help in advance!
You want an OpenCL kernel to run on data that is continuously streaming into a smaller static buffer? A cool idea, but I don’t think it’s possible. You’d need the kernel to indicate what has already been processed so that the application could send in more data, and there is no clean way to do that (you could hack it, but it would end up slower and more convoluted).
I’ve run into the same issue of processing a chunk of data that is simply too large. My workaround was to split the buffer and run the kernel on smaller chunks until the entire thing was computed, which is what you’re trying to avoid…
The bottleneck here is mostly the transfer of each “fraction” of data, so your best bet is probably to split it into the LARGEST possible chunks and run the kernel on each one (fewer transfers and better parallelism). If you have multiple OpenCL devices, you can also send chunks to different devices and run them concurrently (which is what I ended up doing), but the tradeoff between load balancing and throughput depends on the chunk size. Read-only buffers, pinned memory, and the other usual bandwidth optimization tricks are also useful here.
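As a rough illustration of the "largest possible chunks" approach, the sketch below plans the split in plain C. It makes several assumptions: the data consists of `total_items` slices of `slice_bytes` each, `max_alloc_bytes` is the value queried for CL_DEVICE_MAX_MEM_ALLOC_SIZE, and `size_t` is 64 bits wide so that multi-GB offsets do not overflow. The type `chunk` and the three functions are hypothetical names, not OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* One chunk of the large host buffer, sized to fit a single allocation. */
typedef struct {
    size_t first_item;   /* index of the first slice in this chunk */
    size_t item_count;   /* number of slices: the global work size */
    size_t byte_offset;  /* where the chunk starts in the host buffer */
    size_t byte_size;    /* number of bytes to upload for this chunk */
} chunk;

/* Largest number of whole slices that fit into one allocation. */
size_t items_per_chunk(size_t slice_bytes, size_t max_alloc_bytes)
{
    assert(slice_bytes > 0 && slice_bytes <= max_alloc_bytes);
    return max_alloc_bytes / slice_bytes;
}

/* Number of chunks needed to cover total_items slices (rounded up). */
size_t chunk_count(size_t total_items, size_t per_chunk)
{
    assert(per_chunk > 0);
    return (total_items + per_chunk - 1) / per_chunk;
}

/* Describe chunk number chunk_index; the last chunk may be shorter. */
chunk describe_chunk(size_t chunk_index, size_t total_items,
                     size_t per_chunk, size_t slice_bytes)
{
    chunk c;
    assert(chunk_index < chunk_count(total_items, per_chunk));
    c.first_item = chunk_index * per_chunk;
    c.item_count = total_items - c.first_item;
    if (c.item_count > per_chunk)
        c.item_count = per_chunk;
    c.byte_offset = c.first_item * slice_bytes;
    c.byte_size = c.item_count * slice_bytes;
    return c;
}
```

For each chunk the host would then upload `byte_size` bytes starting at `byte_offset` (for example with `clEnqueueWriteBuffer`) and launch the kernel with `item_count` as the global work size (`clEnqueueNDRangeKernel`). Inside the kernel, global ids then index relative to the start of the chunk rather than the whole data set. In practice the usable chunk size may be smaller than the maximum allocation, since other buffers have to fit on the device as well.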
Thanks for your detailed and fast reply. I was indeed hoping for some kind of elegant streaming solution.