GPU Bank conflicts in FFT implementation

Hello guys :slight_smile:

I’m trying to write a compute shader based on this article “High Performance Discrete Fourier Transforms on Graphics Processors”:

It presents three algorithms to implement a FFT using Stockham formulation, and I’m working in the second algorithm, which takes advantage of shared memory. The article has a small section about Bank Conflicts, and it concludes that for Ns (subsequences) smaller than the amount of memory banks (currently 32, when the article was written, 16), there are bank conflicts. I wrote a small program in CPP that tries to simulate a workgroup of 64 threads, computing a shared memory index lookup per thread to check the bank conflicts, but instead, I got bank conflicts not when Ns < 32, but when Ns >= 32. Maybe there’s something I don’t understand about Shared Memory Banks, or I wrote something wrong in the program. My idea is that if there are bank conflicts when Ns < 32, we can access shared memory with a different stride when Ns < 32, and another different for the rest of the subsequences.

Maybe can I get some help? :slight_smile:
Thanks in advance!

This is the cpp snippet:

    constexpr std::uint16_t TEST_N{ 128u };
    constexpr std::uint16_t RADIX{ 2u };
    constexpr std::uint16_t BANK_SIZE{ 32u };
    constexpr std::uint16_t WORKGROUP_SIZE{ TEST_N / RADIX };

    //float fakeLDS[RADIX * TEST_N];

    for (std::uint16_t Ns = 1; Ns < TEST_N; Ns <<= 1)
    {
        std::cout << "----------------------------------" << "\n";
        std::cout << "STRIDE : " << Ns << "\n";
        std::cout << "----------------------------------" << "\n";

       for (std::uint16_t thread = 0, waveCounter = 0u; thread < WORKGROUP_SIZE; ++thread)
       {
           // expand(idxL, N1, N2) return (idxL / N1) * N1 * N2 + (idxL % N1);
           //const std::uint16_t idxS{ expand(thread, N / RADIX, RADIX) }; 
           //idxS is constant because thread = [0, WORKGROUP_SIZE - 1], N / RADIX = WORKGROUP_SIZE, so first part is always zero, second part is always equal to thread.
           const std::uint16_t idxS{ thread }; 
           const std::uint16_t idxD{ expand(thread, Ns, RADIX) };

           for (std::uint16_t r = 0u; r < RADIX; ++r)
           {
               const std::uint16_t lookupIdx{ static_cast<std::uint16_t>(idxD + r * Ns) };

               const std::uint16_t bank{ static_cast<std::uint16_t>( (lookupIdx) % BANK_SIZE ) };

               std::cout << "ThreadID: " << thread << "  bank: " << bank << "  Lookup: "<< lookupIdx << "\n";
           }
           ++waveCounter;
           if (waveCounter == 16u)
           {
               std::cout << "<<<<<<<<<<<HALF_WAVE COMPLETED>>>>>>>>>>>\n "; //well, quarter wave for amd?
               waveCounter = 0;
           }
       }
       
    }

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.