Trying to optimize vertex computations with local memory

outofsync · July 23, 2015, 6:00am

I’m having a really hard time trying to achieve maximum performance with OpenCL using GPU devices. The best I got for the moment is about 2x the performance of 1 CPU core.

After really hackish/magic profiling (because AFAIK there’s no GPU OpenCL time profiler for the Mac, which is the platform I use -and this makes OpenCL a very difficult API to optimize for, IMHO- ), I got to the conclusion that my bottleneck is GPU global memory access: each work item reads 6 floats and 3 ints from global memory, and writes back 1 int to global memory (a total of 10 accesses to global memory, each one a 4-byte access). Apart from the memory access, the computations performed by each work item are as follows: 6 dot products, 8 float subtracts, 4 int subtracts, and 5 int bitwise ANDs.

Reading the NVIDIA OpenCL optimization docs, it’s obvious that 10 accesses to global memory per work item are a severe impact in performance: If I’m reading the docs correctly, there’s a latency of about 400 to 600 cycles (woah!!!) when accessing global memory. I suppose that my accesses are coalesced because they’re done as indexed arrays, where the index is taken from the work item global IDs, so I believe this meets coalesced accesses, but, as I said, there’s no GPU OpenCL profiler for Mac, so I cannot check if it’s actually the case or not.

My first try was to use constant memory for the kernel input data: It didn’t help. Performance is the same.

The last resource I have is to try local memory.

If I’m understanding local memory correctly, the 16KB of local memory that my NVIDIA GPUs reports isn’t per compute unit, but total. So, I understand that if I have 6 compute units, each workgroup will have, in the best case, 16KB/6 which is about 2.6KB. It’s small, but some of the input data can fit there, reducing such 10 accesses to global memory to almost half of them.

Now, my main question:

How can I transfer the global memory to local memory? I think it’s done with async_work_group_copy(), but didn’t find any understandable code snippet.

For example, imagine I want to transfer an array of 30 ints from global memory to local memory. I want that this happens only in the first workgroup of each compute unit, so that all other workgroups can access the previously initialized local memory.

How would I code the starting lines of my kernel, so that async_work_group_copy() is executed only by the first group in each compute unit?

I chose OpenCL, about a year ago, because I prefer multiplatform APIs and compatibility. But I’m missing so much a convenient path for proper optimization, that I think I’m going to give CUDA a try. I prefer the OpenCL concept, but I feel really lost here.

Thanks.

Salabar · July 23, 2015, 12:56pm

If you access the global memory once per value, you cannot optimize it further. Global -> local transition is used when a value is going to be used more than once. If I understood you correctly, you are trying to make a software-ish render. Try to implement triangle fans so each new triangle would require only a single vertex. Here local memory will actually help. If you can calculate something in place (i.e. normal vector for a plane), don’t transfer it from global memory.

I chose OpenCL, about a year ago, because I prefer multiplatform APIs and compatibility. But I’m missing so much a convenient path for proper optimization, that I think I’m going to give CUDA a try. I prefer the OpenCL concept, but I feel really lost here.

I don’t believe NVIDIA supports OpenCL profiling in their Windows instruments either, so I’d say it is a vendor’s quirk.

UPD. https://www.khronos.org/message_boards/showthread.php/11489-Nvidia-GPU-OpenCL-Profiling Yep, it is.

outofsync · July 28, 2015, 5:29am

No, it’s not a renderer, it’s an algorithm that makes a computation from the dot products of n^2 combinations of pairs of vertices from a buffer of n vertices. I’ve tried everything. Today I tried with images (one of the parts in the computation could be moved outside of the workitems, because it’s “quite constant”, so I created two kernels: the first precomputes all the different values for this part, and stores them in an image which is accessed by the second kernel, thus saving redundant computing cost).

But the image read access from the second kernel is so slow, that it’s even slower than with my original approach. The first kernel is almost instant though.

Then I tried to modify it, using a buffer (flattening the image), but no way. Still slower than my original approach.

OpenCL has to be the most discouraging and unpleasant API I’ve worked with: you use it by trial and error, no way to diagnose your kernel with proper tools, no way to have a clue if some change will speed or slow things until you try it. So the approach is to spend a morning writing a new version, only to realize you lost performance. The next day you repeat the routine, and with a bit of luck you’ll get the same performance you already had. And so on for the rest of the week. Always guessing, always depending on your driver and your operating system (because the same GPU has fast paths on one OS which are slow on another… this is really discouraging).

Yes, I’m very disappointed. I’ve wasted more than a month fighting with this

Salabar · July 28, 2015, 11:50am

As I’ve said, it is Nvidia’s decision not to ship OpenCL profiler with their SDK. Or even, I believe it is Apple who supports end-user drivers for Mac OS X, so lack of tools is their fault as well, probably. AMD has an extensive profiler for Win32 and Linux, for one. Thus it is clearly not the API itself to attack. I guess, I can only suggest to use Intel’s GPU and instruments. I’m not sure if they have those for OpenCL on Mac, though. Intel Developer Zone

outofsync · July 29, 2015, 11:04am

Yes, it’s very disappointing. For example, one of the computations in the kernel has a constant value for workitems with the same get_global_id(0), and another one is constant for the same get_global_id(1). If I precompute them and store them in buffers so that no redundant work is done, the cost of reading the buffers (even with coalesced access) is more or less the same as doing the redundant work without the buffers. There’s no way to predict this behaviour without proper tools. And then, when you think memory access is the bottleneck, I change the order of operators in a computation, and the kernel time rises from 120 secs to 160 secs… it’s really, really disappointing to work like this, trial and error.

I agree the API isn’t the culprit. But, in the end, it does pay the consequences of nobody implementing it in an adequate manner. I mean, if I choose OpenCL because I want a multiplatform standard, but then one of the major GPU vendors won’t provide a profiler on any OS, and the other major GPU vendor won’t provide it for all operating systems, then OpenCL loses most of the interest for me. I didn’t want to take a look at CUDA, but I see no other solution right now. If CUDA is better for NVIDIA GPUs, and Metal works well on the next OSX version, maybe I’ll optimize my code for them, and then provide a less optimized OpenCL version. Btw, this is what Adobe seems to be doing, because Adobe users recommend to first try CUDA acceleration and if your GPU doesn’t support it, OpenCL.

Sad, but I don’t see any other way.