Why does copying data from global to private memory make my code slower?

Why does copying data from global to private memory, and then using the private copy, make my code slower?
It happens on all 3 of my GPUs.
My code accesses every element of this array, and some elements more than once. I made a very similar ‘optimization’ in another program of mine and the speed increased.
I’m wondering why the speed decreases here. This code has ~1k lines and many functions; can that be the cause of the slowdown?
The other code, where something similar worked, also has ~1k lines, but there I split one kernel into a few.
Global memory is accessed in whole blocks, which are not big, and the index of the block that gets accessed is data dependent (it is a memory-hard hashing function).
When every kernel has the same input, the copying makes the code faster.
When every kernel has a different input, the copying makes the code slower.

Shared local memory pays off when many work items in a work group need to access the same memory at different times (for example, a matrix multiply). If each of your work items accesses different global memory, there is no benefit to copying it to shared local memory (in fact, adding the copy will slow it down).
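To make the reuse argument concrete, here is a rough counting model in Python (not OpenCL, and not a benchmark). It only counts how many element reads have to go to global memory for one work group, with and without staging the data in local memory first. The work-group size, the number of elements per work item, and the helper name `global_reads` are all assumptions picked for illustration, and the model ignores hardware caches entirely.

```python
def global_reads(access_lists, stage_in_local):
    """Count global-memory element reads for one work group.

    access_lists[i] is the list of global indices that work item i reads.
    With staging, each distinct index is fetched from global memory once
    and every later access is assumed to be served from local memory.
    """
    if stage_in_local:
        staged = set()
        for indices in access_lists:
            staged.update(indices)
        return len(staged)
    return sum(len(indices) for indices in access_lists)


WORK_ITEMS = 64  # assumed work-group size
TILE = 16        # assumed number of elements each work item reads

# Case 1: every work item reads the same 16 elements (matrix-multiply-like reuse).
shared = [list(range(TILE)) for _ in range(WORK_ITEMS)]

# Case 2: every work item reads its own 16 elements, with no overlap.
disjoint = [list(range(i * TILE, (i + 1) * TILE)) for i in range(WORK_ITEMS)]

for name, pattern in (("shared", shared), ("disjoint", disjoint)):
    direct = global_reads(pattern, stage_in_local=False)
    staged = global_reads(pattern, stage_in_local=True)
    print(f"{name}: direct={direct} staged={staged}")
# prints:
# shared: direct=1024 staged=16
# disjoint: direct=1024 staged=1024
```

In the disjoint case the staged version saves no global reads at all, so the copy itself is pure extra work. That is the situation described above where adding the copy slows things down.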

The key to getting fast global memory access is to coalesce your reads, which (in the simplest form) means adjacent work items access adjacent global memory locations.
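A simplified way to picture coalescing is to count how many aligned memory segments a group of work items touches in one read. The Python sketch below does just that. The 64-byte segment size, the group of 16 work items, and the helper name `segments_touched` are assumptions for illustration only; real hardware uses its own sizes and rules, so check the vendor's optimization guide for your GPU.

```python
def segments_touched(addresses, segment_bytes=64):
    """Number of distinct aligned memory segments covering the byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


GROUP = 16  # assumed number of work items whose reads are served together
ELEM = 4    # bytes per element (for example a 32-bit int)

# Coalesced: work item i reads element i, so neighbours are 4 bytes apart.
coalesced = [i * ELEM for i in range(GROUP)]

# Strided: work item i reads element i * 32, so neighbours are 128 bytes apart.
strided = [i * 32 * ELEM for i in range(GROUP)]

print("coalesced:", segments_touched(coalesced))  # prints "coalesced: 1"
print("strided:", segments_touched(strided))      # prints "strided: 16"
```

Under these assumptions the strided pattern needs sixteen segment fetches to deliver the same amount of useful data that the coalesced pattern gets from one. A data-dependent block index, as in a memory-hard hash, tends to look more like the strided case across work items.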

Guessing games won’t lead you anywhere. A profiler will tell you exactly what’s wrong. You can try out this (find a newer version on the AMD site and read it, it will be useful) if the profiler says it is a memory access issue. Or maybe your kernel is indeed too big and you ran out of registers.