I have a fragment shader in which I need to guarantee one thread exclusive access to some memory locations. As this involves more than a single operation that an atomic could cover, I need to lock a part of my code. Currently I tried the following pattern (simplified):
My problem with this is the following: as doWork() needs some time (basically a couple of image store operations), waiting threads that run on another warp/wavefront can burn through their tries very quickly and just give up. If I increase MAX_TRY to counter this, performance drops drastically, as each try needs one expensive atomic memory access. If all threads fighting for the same lock ran on the same warp I wouldn’t have this problem; sadly, this is not always the case.
Now my question is: is there a better-suited pattern for this?
It’d be interesting to know what kind of work you do in doWork() - maybe there’s a way to avoid syncing altogether. Also, why do you need the first barrier? Do you have the same problems when using a buffer object and non-image atomic functions?
[QUOTE]If all threads fighting for the same lock would run on the same warp I wouldn’t have this problem, sadly, this is not always the case.[/QUOTE]
Can you elaborate on that?
[QUOTE=thokra;1254862]It’d be interesting to know what kind of work you do in doWork() - maybe there’s a way to avoid syncing altogether. Also, why do you need the first barrier? Do you have the same problems when using a buffer object and non-image atomic functions?
Can you elaborate on that?[/QUOTE]
doWork() contains write operations to 3 or 4 images and a few read operations (~8), but not many other operations. Sadly, I can’t avoid the sync. The first barrier is a relic of an earlier test; it can be removed.
I have not tried to do the same thing with buffer objects as I need one mutex per pixel (read: a lot).
If all 32 threads of a warp want to perform doWork(), one of them gets the mutex and blocks the other 31 while it executes the if-branch. Only when the first thread finishes do the other 31 run through the else-branch and increase the counter. In the next loop iteration one of them gets the mutex, and so on. In a SIMD processor the threads in one warp are not independent, so they can’t run through the else-branch a couple of times while one thread is in the if-branch. But if the threads that want this specific mutex run on different warps, that can happen…
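A minimal sketch of the loop described above, reconstructed from the description in this thread rather than taken from the original listing. The names lockImage and dataImage, the bindings and formats, the value of MAX_TRY and the body of doWork() are all assumptions.

```glsl
#version 420 core
// Sketch only: lockImage, dataImage, the MAX_TRY value and the doWork() body
// are placeholders, not the original poster's code.
layout(binding = 0, r32ui)   coherent uniform uimage2D lockImage; // 0 = free, 1 = held
layout(binding = 1, rgba32f) coherent uniform image2D  dataImage;

const int MAX_TRY = 64; // assumed value

void doWork(ivec2 p)
{
    // Stand-in for the real read-modify-write on several images.
    vec4 v = imageLoad(dataImage, p);
    imageStore(dataImage, p, v + vec4(1.0));
}

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    bool done = false;
    int tries = 0;
    while (!done && tries < MAX_TRY) {
        // Acquire: only the thread that reads back 0 owns the pixel.
        if (imageAtomicExchange(lockImage, p, 1u) == 0u) {
            doWork(p);
            memoryBarrier();                       // publish the writes before unlocking
            imageAtomicExchange(lockImage, p, 0u); // release inside the same branch
            done = true;
        } else {
            ++tries; // one atomic memory access spent per failed try
        }
    }
}
```

Within one warp the else-branch can only run after the if-branch has finished, so at most one try is spent per holder. A thread on another warp is not held back this way and can use up all MAX_TRY iterations while doWork() is still running.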
So to clarify, this is in a fragment shader. You have a per-pixel mutex and this contention occurs between fragments writing to the same pixel. You say all threads in a warp must converge at an if-statement (maybe designed like this to help thread/later operation coherency).
I assume it’s quite likely for two threads in contention for a pixel to be running in separate warps, since threads in a warp are likely from the same polygon and won’t be overlapping.
Just thinking out loud but the continuous atomicExchanges may be overshadowing the memory operations in doWork(), depending on the amount of contention. Maybe introducing some form of sleep as well as ++try would actually improve performance, reducing the load on memory transfer.
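GLSL has no sleep call, so one possible stand-in (a different technique from a real sleep) is test-and-test-and-set: poll the lock with a plain imageLoad and only issue the atomic exchange when the lock looks free. The sketch below assumes hypothetical names lockImage and dataImage, an assumed MAX_TRY, and a placeholder doWork(). Whether the plain load is actually cheaper than the atomic depends on the hardware and would need measuring.

```glsl
#version 420 core
// Sketch only: all names, bindings and the MAX_TRY value are assumptions.
// 'volatile' is added so the polled value is re-read on every iteration.
layout(binding = 0, r32ui)   coherent volatile uniform uimage2D lockImage; // 0 = free
layout(binding = 1, rgba32f) coherent uniform image2D dataImage;

const int MAX_TRY = 256; // assumed value; polls are meant to be cheaper than exchanges

void doWork(ivec2 p)
{
    // Stand-in for the real image reads and writes.
    imageStore(dataImage, p, imageLoad(dataImage, p) + vec4(1.0));
}

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    bool done = false;
    int tries = 0;
    while (!done && tries < MAX_TRY) {
        // && short-circuits, so the atomic is skipped while the lock reads as held.
        if (imageLoad(lockImage, p).x == 0u &&
            imageAtomicExchange(lockImage, p, 1u) == 0u) {
            doWork(p);
            memoryBarrier();
            imageAtomicExchange(lockImage, p, 0u);
            done = true;
        } else {
            ++tries;
        }
    }
}
```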
Can the second imageAtomicExchange just be an imageStore (assuming the image unit is ‘coherent’)? (Possibly not, if another thread’s atomic exchange can overwrite it before reading it.)
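For reference, the variant in question would look roughly like the sketch below. It only shows the shape of the change and does not settle whether a plain store is a safe release on a given implementation. The names lockImage and dataImage, the MAX_TRY value and the doWork() body are assumptions.

```glsl
#version 420 core
// Sketch only: all names, bindings and the MAX_TRY value are assumptions.
layout(binding = 0, r32ui)   coherent uniform uimage2D lockImage; // 0 = free, 1 = held
layout(binding = 1, rgba32f) coherent uniform image2D  dataImage;

const int MAX_TRY = 64; // assumed value

void doWork(ivec2 p)
{
    // Stand-in for the real image reads and writes.
    imageStore(dataImage, p, imageLoad(dataImage, p) + vec4(1.0));
}

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    bool done = false;
    int tries = 0;
    while (!done && tries < MAX_TRY) {
        if (imageAtomicExchange(lockImage, p, 1u) == 0u) {
            doWork(p);
            memoryBarrier();                     // order the data writes before the unlock
            imageStore(lockImage, p, uvec4(0u)); // release with a plain store, no atomic
            done = true;
        } else {
            ++tries;
        }
    }
}
```

Only the lock holder ever writes 0, while every competing exchange writes 1, so the open point is how the plain store interacts with concurrent atomics on the same texel.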
Is there anything expensive in doWork() that can be moved outside the lock? For example, in the above link the lock quickly gets a memory location and is released again before the thread goes on to do some work and write to that memory location.