Shared Memory questions


I’m a bit confused about the shared memory in OpenCL.
It has 16 KBytes and shares data between the work-items in the same work group.
So, are there several shared memories (one for each work group), each with 16 KBytes, or is one shared memory split between the work groups?
Furthermore, I don’t understand how it exists on the hardware: the number of work groups is variable, so how could that be managed?
I think I didn’t understand the main concept of this memory, but I hope someone can explain what I am not understanding.

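Each work-group has its own shared memory (OpenCL calls it local memory; “shared memory” is the CUDA name for the same thing). That memory is only available to the work-items inside the work-group, and its contents do not persist across kernel invocations.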

The implementation details vary by hardware platform. For example, on Nvidia GPUs there is a physical memory associated with each streaming multiprocessor (compute unit) on the card, and while a work-group is running on that compute unit, all its work-items have access to that local memory. On CPU implementations the local memory is just emulated by malloc’ing a region of memory and pointing the correct work-items to it.
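To make the CPU case concrete, here is a minimal sketch in plain C of how a runtime might emulate local memory. It is not taken from any real OpenCL implementation: the function name run_groups and its parameters are made up for illustration, and the work-items are simulated with simple loops. The point is only that each work-group gets its own buffer, which is gone once the group finishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical emulation of one kernel launch on a CPU.
   Each work-group gets its own scratch buffer ("local memory").
   The buffer is freed when the group finishes, so nothing carries
   over to other work-groups or to later kernel launches. */
int run_groups(const int *in, int *group_sums,
               size_t num_groups, size_t group_size)
{
    assert(group_size > 0);

    for (size_t g = 0; g < num_groups; ++g) {
        /* One private region per work-group. */
        int *local_mem = malloc(group_size * sizeof *local_mem);
        if (local_mem == NULL)
            return -1;

        /* Phase 1: each work-item copies one element from "global" memory. */
        for (size_t lid = 0; lid < group_size; ++lid)
            local_mem[lid] = in[g * group_size + lid];

        /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here,
           so that all copies are visible before anyone reads them. */

        /* Phase 2: work-item 0 sums the group's local buffer. */
        int sum = 0;
        for (size_t lid = 0; lid < group_size; ++lid)
            sum += local_mem[lid];
        group_sums[g] = sum;

        free(local_mem);
    }
    return 0;
}
```

In this sketch the work-groups simply run one after another, which is one way a variable number of work-groups can be handled with a fixed amount of memory: only the groups that are currently running need a buffer.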

This memory allows a work-group to load data from global memory once and then (potentially) access it much faster. If you have a known set of data that you are going to reuse a lot, then you can get tremendous (10x) bandwidth improvements by using the local memory on some architectures. Remember that you need enough reuse to amortize the cost of copying the data to local memory in the first place, and that this is a software-managed memory, so it’s generally a pain to program.
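The reuse argument can be illustrated with a small plain-C sketch that just counts reads from "global" memory for a 3-point sum. Everything here is an assumption for illustration: the names stencil_naive and stencil_tiled, the tile size of 4, and the stack array standing in for local memory. It says nothing about real speedups, only about how many global reads the staging step saves.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical work-group size */

/* Naive version: out[i] = in[i] + in[i+1] + in[i+2], with every operand
   read straight from "global" memory. `in` must hold n + 2 elements.
   Returns the number of global reads. */
size_t stencil_naive(const int *in, int *out, size_t n)
{
    size_t global_reads = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] + in[i + 1] + in[i + 2];
        global_reads += 3;
    }
    return global_reads;
}

/* Tiled version: each "work-group" first stages TILE + 2 elements into a
   small buffer standing in for local memory, then computes from that buffer.
   n must be a multiple of TILE. Returns the number of global reads. */
size_t stencil_tiled(const int *in, int *out, size_t n)
{
    assert(n % TILE == 0);
    size_t global_reads = 0;
    for (size_t base = 0; base < n; base += TILE) {
        int local_mem[TILE + 2];

        /* Copy the tile plus a 2-element halo; each element is read once. */
        for (size_t k = 0; k < TILE + 2; ++k) {
            local_mem[k] = in[base + k];
            ++global_reads;
        }

        /* All further reads hit the local buffer, not global memory. */
        for (size_t lid = 0; lid < TILE; ++lid)
            out[base + lid] = local_mem[lid] + local_mem[lid + 1]
                            + local_mem[lid + 2];
    }
    return global_reads;
}
```

For 8 outputs the naive version does 24 global reads and the tiled one does 12, so each staged element is reused about twice. With a larger tile the ratio approaches 3, which is the most this particular stencil can offer; a kernel that touches each element only once would gain nothing from the copy.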