Each work-group has its own local memory. That memory is visible only to the work-items inside the work-group, and its contents cannot be preserved across kernel invocations.
The implementation details vary by hardware platform. On Nvidia, for example, there is a block of physical memory attached to each streaming multiprocessor (compute unit) on the card, and while a work-group is running on that compute unit all of its work-items have access to that local memory. On CPU implementations, local memory is simply emulated by malloc’ing a region of memory and pointing the appropriate work-items at it.
This memory allows a work-group to load data from global memory and (potentially) access it much faster. If you have a known set of data that you are going to reuse heavily, you can get tremendous (10x) bandwidth improvements by using local memory on some architectures. Remember that you need enough reuse to amortize the cost of copying the data into local memory in the first place, and since this is a software-managed memory it is generally a pain to program.
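The usual load-once, reuse-many-times pattern looks something like the following OpenCL C device-code fragment (a sketch only: the kernel name is made up, edge handling is omitted, and the host is assumed to size the `__local` argument via `clSetKernelArg`). Each work-group stages a tile of global data into local memory once, synchronizes at a barrier, and then every work-item reads the tile repeatedly at local-memory speed.

```c
__kernel void blur3(__global const float *in,
                    __global float *out,
                    __local float *tile)     /* sized by the host */
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];              /* one global load per work-item */
    barrier(CLK_LOCAL_MEM_FENCE);     /* wait until the tile is filled */

    /* Reuse: each work-item now makes three local reads instead of
     * three global-memory loads. (Work-group edges are ignored here.) */
    if (lid > 0 && lid < get_local_size(0) - 1)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

With only three reads per element the reuse here is modest; the big (10x-class) wins come from kernels like tiled matrix multiply, where each staged value is read tens of times.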