When I moved from an old GPU with a maximum work-item size of 256 to a new GPU with a maximum of 512, I saw no performance improvement.
I also changed the local work-group size from 8 to 16, since the new GPU allows a local work-group size of 16, but the performance is still the same as on the old one.
Why is there no performance improvement even though the local work-group size is larger on the new GPU?
Maybe your code is limited by the number of registers or by memory bandwidth. In that case you would not benefit from the higher work-item limit, because your problem is still split into the same low number of work-items that it used on the old GPU.
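
To illustrate the point about the number of work-items, here is a minimal Python sketch (plain arithmetic, not OpenCL host code) of how a 1-D launch is divided into work-groups. The helper `launch_shape` and the global size of 4096 are hypothetical, chosen only for illustration, and the sketch assumes the global size is a multiple of the local size. Raising the local size from 8 to 16 halves the number of work-groups, but the total number of work-items stays the same unless the global size is also increased.

```python
def launch_shape(global_size, local_size):
    """Return (work_groups, total_work_items) for a 1-D launch.

    Assumption: the global size is an exact multiple of the local size.
    """
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global size must be a positive multiple of local size")
    work_groups = global_size // local_size
    return work_groups, work_groups * local_size


GLOBAL_SIZE = 4096  # hypothetical problem size, not taken from the question

for local_size in (8, 16):
    groups, items = launch_shape(GLOBAL_SIZE, local_size)
    print(f"local={local_size:2d}  work-groups={groups:4d}  work-items={items}")
```

If the kernel is already bound by registers or memory bandwidth, regrouping the same work-items this way is unlikely to change the run time on its own.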