2 times sin() vs one local memory access


I need to implement an access function for a discrete, circular object in a kernel. I have two options:

  1. Calculate each access with floor(n*sin(t)+0.5f)
  2. Calculate all access indices once and read them from local memory

Does anyone know how many cycles a local memory access takes? For CUDA I found figures of ~16 cycles for the sin() function and ~400 cycles for a global memory access.


The AMD APP programming guide has some fairly detailed numbers on global memory, L1, LDS, and constant memory throughout chapter 4 - although it’s not the same hardware, it should be roughly comparable.

You might consider constant memory for these indices, as that is exactly what it is for (provided the values stay constant between runs).

Personally, I would just try them all and see if it makes any difference, because in a running kernel there are many considerations which affect performance (register usage, LDS usage, concurrency, etc.), and this is a fairly simple thing to try out.

Thanks a lot for the hint.

I will have a look and keep you up to date on this problem.

One problem with local and constant memory could be their size, because I might have more data than fits in either. But I will look into that.
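A rough way to check the size question up front, as a sketch only: the helpers `table_bytes` and `table_fits` are hypothetical, and I am assuming 32-bit indices. The limit itself is device specific; in OpenCL it can be queried with clGetDeviceInfo using CL_DEVICE_LOCAL_MEM_SIZE or CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a table of `count` 32-bit indices (assumed element type). */
static size_t table_bytes(size_t count)
{
    return count * sizeof(int32_t);
}

/* Returns 1 if the table fits into `limit_bytes`, 0 otherwise.
 * `limit_bytes` is whatever the device reports for local or constant memory. */
static int table_fits(size_t count, size_t limit_bytes)
{
    assert(limit_bytes > 0);
    return count <= limit_bytes / sizeof(int32_t);
}
```

Keep in mind that local memory is shared with everything else the work-group stores there, so the usable budget is likely smaller than the reported limit.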