Serializing 3 kernels


My input data is an N-row by M-column matrix. Each cell is a float.

The first stage:
Subtract each row from its previous one.
The output data is (N-1) rows X M columns.
For the subtraction, I think (but I am not sure) that I have to keep the input matrix and write the output to a new matrix.
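
As a plain-C reference for this first stage, a minimal sketch could look like the one below. It assumes row-major storage and reads "subtract each row from its previous one" as out[r] = in[r] - in[r+1]; swap the operands if the opposite sign is intended. The name `row_diff` and its parameters are hypothetical. Writing to a separate output matrix is the safe choice once each element is computed by its own work item, because an in-place update of row r could overwrite values that another work item still needs to read.

```c
#include <stddef.h>

/* Hypothetical reference helper (not from the original post).
 * in  : n_rows x n_cols matrix, row-major
 * out : (n_rows - 1) x n_cols matrix, row-major, separate from in
 * Computes out[r][c] = in[r][c] - in[r + 1][c].
 * Assumption: "previous minus current"; swap operands for the other sign. */
void row_diff(const float *in, float *out, size_t n_rows, size_t n_cols)
{
    if (n_rows < 2) {
        return; /* fewer than two rows: no differences to compute */
    }
    for (size_t r = 0; r + 1 < n_rows; ++r) {
        for (size_t c = 0; c < n_cols; ++c) {
            /* Each output element depends only on the untouched input,
             * so every (r, c) pair could be handled by its own work item. */
            out[r * n_cols + c] = in[r * n_cols + c] - in[(r + 1) * n_cols + c];
        }
    }
}
```

In a kernel version, the body of the inner loop would become the work done by one work item over an (N-1) x M global range.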

Second stage:
FFT on each row. The output is (N-1) rows X M columns.
For the FFT process, the work item is a butterfly. For M items in a row I have M/4 butterflies.

Is it possible to run the 2 operations without going back to the host after the first stage?

Best regards,

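You know the amount of work up front, so you don't have to return to the host: you can simply enqueue the two kernels one after the other. On an in-order command queue the second kernel starts only after the first has finished, and the intermediate (N-1) x M matrix can stay in a device buffer between the two. This adds the launch latency of a second kernel, but with any decent global size that cost is marginal.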