New Vulkan FFT library - VkFFT (open-source, crossplatform, faster than cuFFT)

Hello, I would like to share my take on Fast Fourier Transform library for Vulkan. Due to the low level nature of Vulkan, I was able to match Nvidia’s cuFFT speeds and in many cases outperform it, while making VkFFT crossplatform - it works on Nvidia, AMD and Intel GPUs. It also has support for many useful features, such as R2C/C2R transforms, convolutions and native zero padding, which opens up possibilities for Vulkan-based scientific applications (I will make a post on an example case of magnetism simulation software later).

github repository: https://github.com/DTolm/VkFFT

Some of the features, that can give an insight of how VkFFT works and why it is extremely fast:

Some of the features, that can give an insight of how VkFFT works and why it is extremely fast:

  1. Each multidimensional FFT is decomposed into the set of 1D FFTs. This is a common approach to the problem. Each 1D sequence from the set is then separately uploaded to shared memory and FFT is performed there fully, hence the current 4096 dimension limit (4096xFP32 complex = 32KB, which is a common shared memory size). This is the reason why VkFFT only needs one read/write to the on-chip memory per axis to do FFT. It also allows to perform FFT in-place.
  2. All memory accesses are non-strided. This is a very important part, as GPU can upload 32 nearest floats at once. For this, to perform FFT in strided directions (y or z), we have to transpose the data, which takes time roughly equal to one read + one write. As there is only power of two sizes, the transposition after some permutations on the sequence elements can be done in-place with no performance loss.
  3. If sequence has a length of <=256, we don’t have to transpose the matrix. The non-strided access can be acheved by grouping 16 nearby complex numbers or performing 16 FFTs at once. This is the reason why small FFTs are so fast in VkFFT. However, this optimization messes up with the memory layout (see picture on the main page) but if you plan on doing convolutions, output data will return to the way it was before.
  4. Convolutions. They are embedded in the last axis FFT, which reduces total memory read/writes by 1.
  5. As the register file is usually bigger than shared memory (256KB) I have an implementation that relies on it for FFT, using shared memory as a communication buffer. This allows to do 8K and 16K sequences in one go, which is planned for a future update.
  6. To summarize, most of the time is taken by data transfers from graphics card memory to the on-chip memory, and not by the FFT computation itself (I think it is close to 80% of time). This is why it is so advantageous to have explicit memory control provided by Vulkan API. For 256x256x256 FFT there are only 3 reads and 3 writes, which is an absolute possible minimum right now. The convolution for this system will only take 5 reads/writes (+kernel upload).

Thank you for the read! Feedback is welcome!

3 Likes