Memory synchronization for the host using clEnqueueMapBuffer and 'querying command completion'

The OpenCL 1.1 standard says (5.2.3):

If blocking_map is CL_FALSE i.e. map operation is non-blocking, the pointer to the mapped region returned by clEnqueueMapBuffer cannot be used until the map command has completed. The event argument returns an event object which can be used to query the execution status of the map command. When the map command is completed, the application can access the contents of the mapped region using the pointer returned by clEnqueueMapBuffer.

But in (5.9, immediately after Table 5.15) there is the following statement:

Using clGetEventInfo to determine if a command identified by event has finished execution (i.e. CL_EVENT_COMMAND_EXECUTION_STATUS returns CL_COMPLETE) is not a synchronization point. There are no guarantees that the memory objects being modified by command associated with event will be visible to other enqueued commands.

Q1: So, I’m wondering whether there is some other way to “query the execution status of the map command”, and whether memory consistency is guaranteed when a query has returned ‘CL_COMPLETE’?
Q2: Am I missing something?
Q3: What are the typical use-cases for that situation?

1-2. Memory consistency is guaranteed after a clWaitForEvents call, which is a synchronization point.
3. It can be used for polling.

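cl_event e[N];
while (true) {
  bool done = true;
  for (int i = 0; i < N && done; ++i) {  // stop at the first unfinished event
    cl_int s;
    clGetEventInfo(e[i], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(s), &s, NULL);
    done = (s == CL_COMPLETE);
  }
  if (done) { clWaitForEvents(N, e); /* synchronization point, then do stuff */ break; }
  // Do some other stuff instead of stalling entirely
}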

In theory, this can be helpful for realtime applications where every ms counts.
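One detail the polling loop above does not cover: the execution status reported by clGetEventInfo can also be a negative error code, which means the command terminated abnormally, and a loop that only tests for CL_COMPLETE would then spin forever. Below is a minimal sketch of one polling pass that separates the three outcomes. The names `classify_poll` and `poll_result` are made up for illustration, and the status constants are repeated from CL/cl.h only so that the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Execution-status values as defined in CL/cl.h; repeated here so the
   sketch compiles without the OpenCL headers. */
enum { ST_COMPLETE = 0, ST_RUNNING = 1, ST_SUBMITTED = 2, ST_QUEUED = 3 };

/* Hypothetical result type for one polling pass (not part of OpenCL). */
typedef enum { POLL_PENDING, POLL_READY, POLL_FAILED } poll_result;

/* status[i] is assumed to hold the value that
   clGetEventInfo(e[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                  sizeof(cl_int), &status[i], NULL)
   reported for event i. */
poll_result classify_poll(const int *status, size_t n)
{
    assert(status != NULL || n == 0);
    poll_result result = POLL_READY;
    for (size_t i = 0; i < n; ++i) {
        if (status[i] < 0)
            return POLL_FAILED;      /* negative value: abnormal termination */
        if (status[i] != ST_COMPLETE)
            result = POLL_PENDING;   /* queued, submitted or running */
    }
    return result;
}
```

On POLL_READY the caller would still call clWaitForEvents before touching the mapped pointers; on POLL_PENDING it does other work and polls again; on POLL_FAILED it stops polling and handles the error.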

Thanks for the code snippet! Correct me if I’m wrong: does this idiom ensure that control flow goes through a ‘quick path’ inside clWaitForEvents?

I’m positive that N non-blocking maps followed by clWaitForEvents(N) can be faster than N consecutive blocking maps, if that is your question. This is an idiom AMD recommends (CTRL-F EventInfo), but clWaitForEvents is required for the sake of portability. It would make sense for a driver not to put the calling thread to sleep when there is no need to.

Also, there is no real quick path when it comes to synchronization. In the ideal case you’re supposed to copy your data to the GPU, do all of the computations and read the results back, so the way you do steps 1 and 3 isn’t really important. That’s not always possible, but such cases don’t benefit from GPU acceleration as much.

Actually, I thought a clWaitForEvents call would cost less time when all events are already in a ‘complete’ state; it looks like your comment about ‘thread sleeping’ is about that. Special thanks for such a useful link! By the way, I’ve seen a paper with benchmark results showing that ‘mapping’ is actually slower than ‘direct data transfer’, whatever that meant, so it seems there are some peculiarities in the way steps 1 and 3 are done. What is more, there may be two independent queues, one for data transfer and one for computation, so you may end up pipelining and overlapping communication and computation. I am also not really sure I understand your last remark, starting from ‘it’s not always possible’.
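To put a rough number on the overlap idea, here is a back-of-the-envelope model, not a measurement. It assumes the data is split into n equal tiles, that transferring and computing one tile take fixed times, and that the hardware and driver really can run a transfer and a kernel concurrently on two queues; result read-back and launch overhead are ignored. The function names are made up for illustration.

```c
#include <assert.h>

/* Total time for n tiles when each tile is transferred and then computed
   with no overlap (for example, one in-order queue). */
double serial_time(int n, double transfer, double compute)
{
    assert(n >= 0 && transfer >= 0.0 && compute >= 0.0);
    return n * (transfer + compute);
}

/* Total time when the transfer of tile i+1 overlaps the computation of
   tile i (two queues, double buffering). After the pipeline fills, the
   slower of the two stages sets the rate. */
double overlapped_time(int n, double transfer, double compute)
{
    assert(n >= 0 && transfer >= 0.0 && compute >= 0.0);
    if (n == 0)
        return 0.0;
    double slower = transfer > compute ? transfer : compute;
    return transfer + compute + (n - 1) * slower;
}
```

With 4 tiles, 2 ms per transfer and 3 ms per computation, the model gives 20 ms without overlap and 14 ms with it. Whether a given platform gets close to that is exactly the kind of thing only a benchmark can tell.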

Both clWaitForEvents and memory mapping are platform-specific. You may guess that some things are faster than others, but you can’t say anything for certain without benchmarks for every piece of hardware.

benchmark results that show that ‘mapping’ is actually slower than ‘direct data transfer’

Quoting AMD’s documentation, mapping simply gives you a pointer to a driver-allocated chunk of memory you can write into, while a write command first copies the contents of your pointer into that very same chunk of memory. Perhaps CL_MAP_WRITE_INVALIDATE_REGION wasn’t a thing back then.

‘it’s not always possible’.

The ideal case for a GPU would be something like this: you feed it the configuration of, e.g., a liquid equation, run a solver a few hundred times, read the results back and be done with it. This keeps the GPU happy and fed, and lets the driver make the most efficient decisions. It falls apart when you add datasets that don’t fit into VRAM, want to use the results of a previous computation, etc. That makes it extremely hard to utilize the hardware well. The whole evolution of 3D APIs was about finding new ways to send a few megabytes of 4x4 matrices over PCI-E even faster.

Sure, this is a somewhat outdated resource, but I think it is still a useful one: (about uCLBench); that’s the paper I mentioned before (look at Fig. 4a for the Radeon GPU). Looking at the benchmark code in buffer_bandwidth.cpp, I see they use memcopy after mapping; maybe it was not properly optimized back then… So, does your experience tell you that ‘out-of-core’ algorithms (I mean, when the dataset doesn’t fit and you need to tile it somehow) on a GPU are not the right thing to do? Is that related specifically to discrete GPUs? What about heterogeneous SoCs with GPU cores inside? (I have a strong interest in embedded computing.)
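For concreteness, this is roughly what I mean by tiling: a minimal sketch, assuming a one-dimensional dataset of `total` elements cut into tiles of at most `tile_elems` elements each. The `tile` struct and both helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one tile of a larger dataset. */
typedef struct { size_t offset; size_t count; } tile;

/* Number of tiles needed to cover `total` elements with tiles of at most
   `tile_elems` elements each. */
size_t tile_count(size_t total, size_t tile_elems)
{
    assert(tile_elems > 0);
    return total / tile_elems + (total % tile_elems != 0);
}

/* Range of elements covered by tile `index`; the last tile may be shorter. */
tile tile_at(size_t total, size_t tile_elems, size_t index)
{
    assert(tile_elems > 0 && index < tile_count(total, tile_elems));
    tile t;
    t.offset = index * tile_elems;
    t.count = total - t.offset < tile_elems ? total - t.offset : tile_elems;
    return t;
}
```

Each tile's offset and count, multiplied by the element size, would become the byte offset and size passed to clEnqueueWriteBuffer or clEnqueueMapBuffer for that tile.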

Not “not the right thing to do”, but rather “not the most pleasant thing to do”. Why bother with the complications GPGPU brings when you can buy a dual 24-core Xeon server with hundreds of GBs of RAM and brute-force everything you want to solve? It may run 5 or 10 times slower than a proper GPGPU solution, but who cares if you get good profiling and debugging tools and a rich infrastructure? The “time to market” metric is important. It will be exciting to see what AMD’s high bandwidth cache brings to the table.

Heterogeneous computing is a beautiful idea that will not work en masse, for the same reason multithreading is generally considered a last resort and not a tool for building scalable and maintainable software: a ton of legacy code that is impossible to rewrite in any foreseeable time frame. Embedded, sure, is a case where you have no option for getting a certain amount of GFlops/W other than using a GPU, but this (just like supercomputing) is a very niche field of software engineering.

Understood. Thanks for your time!