Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, with the goal of speeding up computation-heavy programs by moving the computation to an FPGA. I managed to run the vector addition example provided by Altera, and it seems to work fine. I’ve looked at the documentation for Altera OpenCL and learned that it uses pipelined parallelism to improve performance.

I was wondering whether Altera OpenCL can also give me parallel execution on the FPGA similar to multiple VHDL processes running in parallel, for example by launching multiple kernels that execute at the same time. Is that possible? How do I check whether it is supported? Any help would be appreciated.


In terms of the OpenCL execution model, I guess this could be expressed as creating a subdevice, but I couldn’t find anything on device fission in Altera’s documentation. A few workarounds come to mind, but I can’t judge whether any of them are sane in the context of an FPGA. Maybe Altera’s runtime has this feature already, though. If this works the way I think it does, nothing stops the runtime from running two different kernel programs simultaneously, given that they don’t share any transistors.

Each kernel in your binary file created with the offline compiler (e.g., aoc -o foo.aocx) may run concurrently. Simply create a separate command queue for each kernel you want to execute concurrently and enqueue them accordingly. You can even use the Altera channels extension to stream data directly between kernels without going through global memory; you can find more details in the Altera SDK for OpenCL Programming and Best Practices Guides.
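To make the "one command queue per kernel" idea concrete, here is a minimal host-side sketch in C. It is only an illustration under a few assumptions: the program object was already created from the .aocx (for example with clCreateProgramWithBinary) and built, the function name run_kernels_concurrently and the MAX_CONCURRENT cap are hypothetical, and kernel arguments are left as a comment. Whether the kernels actually overlap is up to the runtime and the hardware.

```c
#include <stddef.h>
#include <CL/cl.h>

/* Arbitrary cap for this sketch; it is NOT a documented Altera limit. */
#define MAX_CONCURRENT 16

/* Hypothetical helper: gives every named kernel its own in-order command
 * queue, enqueues all of them, then waits for all of them to finish.
 * Returns CL_SUCCESS or the first OpenCL error encountered. */
cl_int run_kernels_concurrently(cl_context ctx, cl_device_id dev,
                                cl_program prog,
                                const char *const names[], size_t count,
                                size_t global_size)
{
    cl_command_queue queues[MAX_CONCURRENT] = {0};
    cl_kernel kernels[MAX_CONCURRENT] = {0};
    cl_int err = CL_SUCCESS;
    size_t i;

    if (names == NULL || count == 0 || count > MAX_CONCURRENT)
        return CL_INVALID_VALUE;

    /* One queue and one kernel object per kernel name.
     * clCreateCommandQueue is the OpenCL 1.x call (deprecated in 2.0). */
    for (i = 0; i < count; ++i) {
        queues[i] = clCreateCommandQueue(ctx, dev, 0, &err);
        if (err != CL_SUCCESS) goto cleanup;
        kernels[i] = clCreateKernel(prog, names[i], &err);
        if (err != CL_SUCCESS) goto cleanup;
        /* clSetKernelArg(...) calls for kernels[i] would go here. */
    }

    /* Enqueue everything first, flushing each queue so the commands are
     * submitted without blocking the host on any single kernel. */
    for (i = 0; i < count; ++i) {
        err = clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
                                     &global_size, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) goto cleanup;
        err = clFlush(queues[i]);
        if (err != CL_SUCCESS) goto cleanup;
    }

    /* Only now wait for completion, one queue at a time. */
    for (i = 0; i < count; ++i) {
        err = clFinish(queues[i]);
        if (err != CL_SUCCESS) goto cleanup;
    }

cleanup:
    for (i = 0; i < count; ++i) {
        if (kernels[i]) clReleaseKernel(kernels[i]);
        if (queues[i]) clReleaseCommandQueue(queues[i]);
    }
    return err;
}
```

The point of the structure is that the host enqueues on every queue before waiting on any of them; calling clFinish right after each enqueue would serialize the kernels again.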

Thanks for the reply! Is it possible to do this on a single device that reports only one compute unit for CL_DEVICE_MAX_COMPUTE_UNITS? Or do I need either a device that reports multiple compute units or multiple devices in order to launch multiple concurrent kernels? Please let me know. Thanks again!

Yes, it is possible, because the architecture of Altera FPGAs is such that when you compile your kernels, the compiler automatically creates one or more custom compute units for each kernel. You can think of it as something like device fission, but for compute units: one large parent compute unit is split into one or more child compute units, which together are no larger than the parent.

Thanks! I’ll try it out. Earlier, you mentioned that each kernel in my compiled binary file may run concurrently. What if I have only one kernel in a single binary file? Can I run multiple instances of the same kernel concurrently?

You have two methods depending on what you really want to achieve.

A) You can always enqueue a larger NDRange so that many work-items execute that one kernel concurrently. You can even scale the number of work-items and work-groups that run concurrently using a few simple kernel attributes described in the Altera SDK for OpenCL Programming Guide; e.g., adding __attribute__((num_compute_units(4))) to the kernel would replicate the custom compute unit for your kernel 4 times. However, when using these attributes you cannot manage the replicated custom compute units independently in the way I mentioned above.

B) If that isn’t what you had in mind, you could copy and paste (or use a code generator) to create 4 independent kernels that are identical except in name. These can be enqueued in any order into their own command queues to run concurrently.
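As a rough illustration of the "code generator" route in method B, here is a small C sketch that writes several copies of the same kernel source, differing only in name. Everything in it is an assumption for the sake of the example: the function name generate_kernel_copies, the <base>_<index> naming scheme, and the vector-addition kernel body are hypothetical, not something from Altera’s tooling. The generated text would be saved to a .cl file and compiled with the offline compiler as usual.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical generator: writes `copies` identical OpenCL kernels named
 * <base>_0, <base>_1, ... into `out` (capacity `cap` bytes).
 * Returns the number of characters written (excluding the terminating
 * '\0'), or -1 on bad arguments or if the buffer is too small. */
int generate_kernel_copies(char *out, size_t cap, const char *base, int copies)
{
    size_t used = 0;
    int i;

    if (out == NULL || cap == 0 || base == NULL || copies < 1)
        return -1;
    out[0] = '\0';

    for (i = 0; i < copies; ++i) {
        /* Example kernel body (vector addition); only the name changes. */
        int n = snprintf(out + used, cap - used,
                         "__kernel void %s_%d(\n"
                         "    __global const float *a,\n"
                         "    __global const float *b,\n"
                         "    __global float *c)\n"
                         "{\n"
                         "    size_t gid = get_global_id(0);\n"
                         "    c[gid] = a[gid] + b[gid];\n"
                         "}\n\n",
                         base, i);
        /* snprintf reports the length it needed; treat truncation as failure. */
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling generate_kernel_copies(buf, sizeof buf, "vec_add", 4) fills buf with kernels vec_add_0 through vec_add_3, each of which could then get its own kernel object and command queue on the host.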

Just for clarification, does method A or B sound more like what you want to achieve?

Thanks! Yes, method B seems to be more like what I would like to achieve eventually.

Is there a limit on the number of kernels that can run concurrently? I’m assuming it can be queried with clGetDeviceInfo(), but I’m not sure which parameter to use.

A couple of pages from the end of the Altera SDK for OpenCL Programming Guide, you can find the current limitations. They are a bit artificial and can be increased or eliminated as needed if you submit a service request to Altera.