Inconsistent Results

I have a kernel that runs fine on the CPU with the Intel platform but returns some bizarre results on an Nvidia 2000M GPU. The odd thing is that the results are correct about 50% of the time with the same input, which leads me to believe my kernel is correct and something else is going on. My code is fairly complicated, otherwise I would post it. Has anyone else experienced anything like this? Thanks.

Your kernel could still be incorrect if you’re using local (shared) memory, barriers, or atomics. CPU implementations do not execute work-items within a work-group concurrently (they just loop over them), so incorrect barrier or local-memory usage that breaks on a GPU will not show up on the CPU.
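As a hypothetical sketch (not the poster’s code), a kernel like this relies on the barrier to order local-memory traffic between work-items; a CPU driver that loops over work-items one at a time can appear correct even if the barrier is missing or misplaced:

```
// Hypothetical OpenCL kernel: each work-item writes its value to
// local memory, then reads its neighbour's slot.
__kernel void neighbour_read(__global const float *in,
                             __global float *out,
                             __local float *tmp)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    size_t gid = get_global_id(0);

    tmp[lid] = in[gid];
    // Without this barrier the read below races on a real GPU, but a
    // serial CPU implementation may still produce the "right" answer.
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = tmp[(lid + 1) % lsz];
}
```

This is a device-side fragment only; it needs the usual host-side setup to run.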

The other area to check is memory: reads/writes from the host, uninitialised variables (whose contents will differ between devices), and so on.

If it is not something to do with memory transfers or memory initialisation, the much higher concurrency of a GPU will expose bugs a CPU never hits.

Thanks for the reply. I did notice that the errors seem to occur at work-group boundaries. If I privately sent you my kernel code, would you have a minute to take a look at it? I’ve got hours and hours into this and would love to move on with my project. Thanks.

Here are images of the results on 2 consecutive runs of the kernel. … 8173fb.jpg … 5bde53.jpg

I’m getting closer. What happens if the total work-group count is not a multiple of the wave-front or warp size? My kernel depends on local data exchange between a cluster of work-items within the work-group; this cluster size varies from 2 to 100+, and I have been trying to use it as one of the dimensions of my work-group. The problem seems to arise when the clusters in a work-group are not a multiple of the wave-front size. Does this make any sense? I thought that if I was using local barriers correctly, the work-group dimensions and total count shouldn’t cause these problems to arise.

When the code is executed on the device, each local work-group is assigned to one compute unit and always occupies a whole number of wave-fronts. e.g. lws=2 still takes 64 threads on an AMD part (I don’t know much about Nvidia any more, but it’s either 32 or 64), lws=65 takes 128 threads, etc.

If you have problems with your barrier code, then a work size > wave-front size will expose them, as the whole work-group no longer executes atomically (in lock-step) - which is effectively what happens when the work size <= wave-front size. This can show up as non-deterministic results.

With barriers, make sure you bracket the whole transaction - it’s easy to forget one. e.g. if you do a store and then a load in a loop, you need 2 barriers, not just the one after the store: one so the loads see the stores, and one so nobody stores again before everyone has loaded.

The barriers in loops were the problem. Thanks a lot for your help.