Can't re-use shared memory on ATI 5870?

I know this sounds ridiculous, and I believe I did something wrong, because I can’t find anything related on the internet. But still, I just can’t re-use a shared (local) memory array on an ATI 5870, while the same program runs fine on an nVidia GPU.

Say I have a kernel that looks like this (my code is not as simple as this, but the relevant part is the same):

__local float tmp1[16];
__local float tmp2[16];
uint localIdX = get_local_id(0);
float a,b;
// I first define tmp1 and use it for a

// then if I re-use tmp1 for a later calculation, the result goes wrong on the ATI 5870, while nVidia’s result is good
// but if I use tmp2 instead, then ati is also good.
// example as below

// if I re-use tmp1:
tmp1[localIdX] = 1; // goes wrong on ATI, while nVidia is good

// if a new tmp2 is used instead:
tmp2[localIdX] = 2; // then OK on ATI too

I made sure there is synchronization before the re-use of shared memory. This re-use problem only happens on the ATI 5870; the nVidia GTX260 handles the re-use fine with the same code…

I think maybe there is a problem with how I build the program, or something related to my card, but I really have no clue now.

Any thoughts will be appreciated! Thanks.

If each work-item is accessing only its own location in local memory (as it appears to be), you should not need a memory barrier at all, but you might need a memory fence to make sure that the write happens before the read. If you do need that, you would put it after the tmp1[localIdX]=1 and before the a=tmp1[localIdX]. I know that the Nvidia card is in-order within one work-item, so this should work (today) without any barriers/fences, but I do not know whether the ATI card is. Try putting the fence after each write and before the read and see if that fixes it.
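To make the placement concrete, here is a minimal sketch of the write / synchronize / read pattern with tmp1 used twice. It is not the original poster's kernel and is not claimed to fix the ATI behaviour. The names reuse_local, out_a, out_b and shim_local_id are made up for illustration, and a single work-group of 16 work-items is assumed. barrier(CLK_LOCAL_MEM_FENCE) is the standard OpenCL call; mem_fence(CLK_LOCAL_MEM_FENCE) is the fence-only variant if you want ordering within one work-item without a work-group barrier. The #ifndef block at the top is a host-side shim (also an assumption) so the same file can be built with a plain C compiler and stepped through one work-item at a time. That works only because every work-item touches its own slot, and it does not model real parallel execution.

```c
/* Host-side shim (hypothetical, for illustration only). An OpenCL compiler
   defines __OPENCL_VERSION__ and skips this whole block. */
#ifndef __OPENCL_VERSION__
#include <assert.h>   /* for host-side checks */
#include <stddef.h>   /* size_t */
#define __kernel
#define __global
#define __local static              /* one shared array across all calls */
#define CLK_LOCAL_MEM_FENCE 1
static size_t shim_local_id;        /* set by the host loop before each call */
static size_t get_local_id(unsigned int dim) { (void)dim; return shim_local_id; }
static void barrier(int flags) { (void)flags; }  /* no-op when run serially */
#endif

/* Hypothetical kernel: tmp1 is written and read, then re-used. */
__kernel void reuse_local(__global float *out_a, __global float *out_b)
{
    __local float tmp1[16];
    size_t lid = get_local_id(0);

    tmp1[lid] = 1.0f;               /* first use: write */
    barrier(CLK_LOCAL_MEM_FENCE);   /* write is visible before any read */
    float a = tmp1[lid];            /* first use: read */

    barrier(CLK_LOCAL_MEM_FENCE);   /* all reads finish before the overwrite */
    tmp1[lid] = 2.0f;               /* re-use: write */
    barrier(CLK_LOCAL_MEM_FENCE);   /* second write is visible before the read */
    float b = tmp1[lid];            /* re-use: read */

    out_a[lid] = a;
    out_b[lid] = b;
}
```

One thing worth checking in the real kernel: every work-item in the work-group has to reach each barrier, so a barrier inside a branch that only some work-items take is not allowed.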

Thanks dbs2.

Actually, I had a typo in my original post. The barrier is between those two lines, which is the same as what you mean.

BTW, I still can’t solve this weird thing yet.

But thanks for your reply!