I have a question about USM shared malloc arguments.
In my program, I use a USM shared pointer with size 2*N*N:
std::vector<float> A(2 * N * N);
auto A_acc = (Array<float, 2, N, N> *)malloc_shared( sizeof(Array<float, 2, N, N>),
deviceQueue.get_device(), deviceQueue.get_context());
new (A_acc) Array<float, 2, N, N>(A.data());
Inside the kernel I reference the shared memory via A_acc[0][0][i][j] and A_acc[0][1][i][j].
My program works when N is 8, 16 or 24, but it fails when N is 32.
Are there any limitations on using USM pointers inside kernels?
Can you post the full example, or at least a minimal test case showing the problem, and say which compiler and system you are using and which accelerator you are targeting? Otherwise it is difficult to help. For example, I have no idea what “Array” is.
Thanks for your response. This is how I am defining Array:
template<typename T, int N, int... Rest>
struct Array : std::array<Array<T, Rest...> , N>{
using std::array<Array<T, Rest...> , N>::operator[];
};
template<typename T, int N>
struct Array<T, N> : std::array<T, N>{
using std::array<T, N>::operator[];
};
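A host-only sketch can check the layout of this Array without any SYCL device involved. It assumes std::malloc as a stand-in for malloc_shared and a plain memcpy in place of the constructor taking A.data(), which is not shown in the thread. The names Grid and round_trip_matches are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <type_traits>
#include <vector>

template <typename T, int N, int... Rest>
struct Array : std::array<Array<T, Rest...>, N> {
    using std::array<Array<T, Rest...>, N>::operator[];
};

template <typename T, int N>
struct Array<T, N> : std::array<T, N> {
    using std::array<T, N>::operator[];
};

constexpr int N = 32;                // the size reported as failing
using Grid = Array<float, 2, N, N>;  // hypothetical alias for brevity

// A byte-wise copy is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<Grid>::value,
              "Grid must be trivially copyable for memcpy");

// Copies a flat 2*N*N buffer into a Grid and checks that
// g[0][p][i][j] matches flat[(p * N + i) * N + j].
inline bool round_trip_matches() {
    std::vector<float> flat(2 * N * N);
    for (std::size_t k = 0; k < flat.size(); ++k) {
        flat[k] = static_cast<float>(k);
    }
    // The nested type should add no padding; bail out if it does.
    if (sizeof(Grid) != flat.size() * sizeof(float)) {
        return false;
    }
    void* raw = std::malloc(sizeof(Grid));  // stand-in for malloc_shared
    if (raw == nullptr) {
        return false;
    }
    Grid* g = new (raw) Grid;  // placement new, as in the question
    std::memcpy(static_cast<void*>(g), flat.data(), sizeof(Grid));

    bool ok = true;
    for (int p = 0; p < 2; ++p) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                ok = ok && (g[0][p][i][j] == flat[(p * N + i) * N + j]);
            }
        }
    }
    std::free(raw);
    return ok;
}
```

If round_trip_matches() returns true on the host, the type itself is laid out as a flat 2*N*N float buffer, which would point toward the allocation or the backend rather than the Array definition.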
I am using the CUDA backend of DPC++, where malloc_shared calls cuda_piextUSMSharedAlloc in the CUDA plugin API, which in turn calls cuMemAllocManaged.
I see, your Array is a multidimensional array defined recursively in terms of std::array.
I cannot see any obvious reason for the failure.
Perhaps some alignment constraints?
For N = 32, the array requires 2 * 32 * 32 * 4 = 8192 bytes, which might hit a bug when an allocation spans more than one 4K page?
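The page arithmetic for the sizes mentioned in the thread can be checked with a small sketch. It assumes 4-byte floats and 4096-byte pages; the names kPageBytes, bytes_for and pages_for are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 4 KiB pages, as suggested in the reply above.
constexpr std::size_t kPageBytes = 4096;

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Bytes occupied by a 2 x n x n array of float.
constexpr std::size_t bytes_for(std::size_t n) {
    return 2 * n * n * sizeof(float);
}

// Number of 4 KiB pages needed to hold that many bytes (rounded up).
constexpr std::size_t pages_for(std::size_t n) {
    return (bytes_for(n) + kPageBytes - 1) / kPageBytes;
}

// N = 8  ->  512 bytes, 1 page
// N = 16 -> 2048 bytes, 1 page
// N = 24 -> 4608 bytes, 2 pages
// N = 32 -> 8192 bytes, 2 pages (exactly two full pages)
```

Note that N = 24 already needs 4608 bytes, so it spans two pages as well. If that size really works, crossing a single 4K boundary would not by itself explain the failure, although N = 32 is the first size that fills two pages exactly.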
Anyway, it looks related to a specific implementation with a specific back-end, so I suggest you open an issue on the intel/llvm GitHub repository, with a complete example that compiles and exhibits the bug at run time, so they can try the code directly.