OpenCL + OpenMPI on Apple Silicon returns invalid buffer data

Hello all,

I have an MPI CFD code that uses some OpenCL kernels to speed up meshing (signed distance functions).
I am trying to run it on an M1 MacBook Pro, but I’m having a weird issue: the OpenCL kernels fail to return valid data from the buffers to the MPI processes when several processes run them simultaneously.

For example, this simple test computes 65536 SDFs from sampled grid points against the Stanford bunny surface. Each MPI process runs an identical copy of the test (no domain partitioning), so all processes should return the same result.
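The "errors" field in the logs below counts grid points where the OpenCL result disagrees with the CPU reference, conceptually something like this illustrative line:

     ! Illustrative only: mismatch count between the OpenCL and CPU results
     nerrors = count(abs(sdf_cl(1:npoints) - sdf_cpu(1:npoints)) > tol)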

  1. If I run a 1-process test, everything runs smoothly:
cpu=0 time (kpoints/s): CL=  36206.3, CPU=    29.56, errors=0/65536 speedup=  1224.76x
  2. If I run an MPI test with >=4 processes, it usually fails on one or more processes:
cpu=2/4 time (kpoints/s): CL=  29192.6, CPU=    29.57, **errors=51200/65536** speedup=   987.40x
cpu=0/4 time (kpoints/s): CL=  27046.9, CPU=    29.57, errors=0/65536 speedup=   914.67x
cpu=1/4 time (kpoints/s): CL=  30895.6, CPU=    29.51, errors=0/65536 speedup=  1047.05x
cpu=3/4 time (kpoints/s): CL=  31225.5, CPU=    29.52, errors=0/65536 speedup=  1057.69x

  3. If I use just 2 processes, everything works fine again:

cpu=1/2 time (kpoints/s): CL=  31522.7, CPU=    29.52, errors=0/65536 speedup=  1067.78x
cpu=0/2 time (kpoints/s): CL=  38013.8, CPU=    29.57, errors=0/65536 speedup=  1285.54x

All buffers are created with CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR.
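For context, the allocation boils down to a plain clCreateBuffer call with those flags. Here is a self-contained sketch with a hand-written ISO_C_BINDING interface (the routine and handle names are placeholders, not my actual wrapper code):

     ! Illustrative sketch: allocate a device buffer with
     ! CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR via a hand-written interface.
     subroutine create_device_buffer(context, npoints, cl_mem_handle, ierr)
        use iso_c_binding
        implicit none
        integer(c_intptr_t), intent(in)  :: context        ! valid cl_context handle
        integer,             intent(in)  :: npoints        ! number of SDF samples
        integer(c_intptr_t), intent(out) :: cl_mem_handle  ! returned cl_mem
        integer(c_int32_t),  intent(out) :: ierr

        ! Standard OpenCL bitfield constants
        integer(c_int64_t), parameter :: CL_MEM_READ_WRITE     = 1_c_int64_t  ! (1 << 0)
        integer(c_int64_t), parameter :: CL_MEM_ALLOC_HOST_PTR = 16_c_int64_t ! (1 << 4)
        integer(c_size_t) :: nbytes

        interface
           ! cl_mem clCreateBuffer(cl_context, cl_mem_flags, size_t, void *, cl_int *)
           function clCreateBuffer(ctx, flags, sizeb, host_ptr, errcode) &
                    bind(C, name='clCreateBuffer') result(buf)
              use iso_c_binding
              integer(c_intptr_t), value :: ctx
              integer(c_int64_t),  value :: flags
              integer(c_size_t),   value :: sizeb
              type(c_ptr),         value :: host_ptr
              integer(c_int32_t)         :: errcode
              integer(c_intptr_t)        :: buf
           end function clCreateBuffer
        end interface

        nbytes        = int(npoints,c_size_t)*c_sizeof(real(0.0,c_float))
        cl_mem_handle = clCreateBuffer(context, &
                                       ior(CL_MEM_READ_WRITE,CL_MEM_ALLOC_HOST_PTR), &
                                       nbytes, c_null_ptr, ierr)
     end subroutine create_device_buffer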

Though this is a Fortran program, the CL call syntax is pretty standard and uses synchronous (blocking) calls:

     ! So far, only BLOCKING version is supported
     blocking_read = CL_TRUE
     offset = 0
     hostBufferPtr = c_loc(data)
     event_ptr = c_loc(event%addr)

     ! Bytes to read (the buffer must be at least this large)
     readBytes = product(shape(data))*c_sizeof(real(0.0,c_float))

     ierr0 = clEnqueueReadBuffer(queue%addr, &
                                 buffer%cl_mem, &
                                 blocking_read, &
                                 offset, &
                                 readBytes, &
                                 hostBufferPtr, &
                                 queue%n_wait_events, queue%event_handle, &
                                 event_ptr)

Now, checking what’s returned from the buffer read, most of the time it’s all zeros, sometimes garbage. I suspect there may be conflicting calls to the GPU from the different MPI processes, but how can that happen? The processes are completely separate memory-wise, and I have checked that the CL contexts, queues, and buffers all have different addresses.
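For what it’s worth, the status returned by the blocking read can also be checked explicitly; a minimal sketch (CL_SUCCESS is 0 in the OpenCL headers, and the write/error-stop handling here is just illustrative):

     integer(c_int32_t), parameter :: CL_SUCCESS = 0

     ! Illustrative status check on the blocking read above
     if (ierr0 /= CL_SUCCESS) then
        write(*,'(a,i0)') 'clEnqueueReadBuffer failed with status ', ierr0
        error stop 'OpenCL buffer read failed'
     end if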

PS: If I put barriers in the MPI code so that the CL calls do not overlap in time, the code is no longer parallel, but the buffers now return the correct answers on all processes:

        do icpu=1,ncpu
           call mpi%sync_all() ! sync all MPI processes
           if (mpi%icpu==icpu) call gsurf%sdf_tree(points(1:npoints),gsdf(1:npoints)) ! run SDF only on rank icpu
        end do

returns

cpu=2/4 time (kpoints/s): CL=     52.9, CPU=    29.44, errors=0/65536 speedup=     1.80x
cpu=1/4 time (kpoints/s): CL=     54.8, CPU=    29.11, errors=0/65536 speedup=     1.88x
cpu=0/4 time (kpoints/s): CL=     55.5, CPU=    29.25, errors=0/65536 speedup=     1.90x
cpu=3/4 time (kpoints/s): CL=     37.1, CPU=    29.12, errors=0/65536 speedup=     1.28x

FYI, I solved this issue by avoiding having many MPI processes simultaneously running command queues on the same GPU.
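Roughly, the gating looks like this (illustrative sketch only: the _cl/_cpu routine names are placeholders for the OpenCL and CPU versions of the SDF evaluation, and mpi%icpu uses the same 1-based rank index as in the barrier loop above):

        integer, parameter :: MAX_GPU_RANKS = 2  ! at most 2 ranks drive the GPU at once

        if (mpi%icpu <= MAX_GPU_RANKS) then
           call gsurf%sdf_tree_cl (points(1:npoints),gsdf(1:npoints)) ! OpenCL path
        else
           call gsurf%sdf_tree_cpu(points(1:npoints),gsdf(1:npoints)) ! CPU fallback
        end if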