Different OpenCL kernel result with CPU and GPU

have been trying to figure out why I am getting differing results when running an OpenCL kernel on my CPU and my GPU.

The basic idea is that I launch this kernel teamwork on the device, then interact with it on the host by periodically writing to device memory and causing the kernel to advance as device memory is updated.

Here is the kernel code:

__kernel void teamwork(__global volatile unsigned int* hb,
					   __global volatile unsigned int* tl_buf,
					   const int nsamples) {
	volatile unsigned int tl0 = 0xdeadbeef;	/* always run one dummy loop */
	volatile unsigned int tl1;
	unsigned int lhb = hb[0];
	const unsigned int tid = get_global_id(0);
	for (int i = 0; i < nsamples; i++) {
		while((tl1 = hb[0]) == tl0) {
			if (tl1 == 0xff)
				break;
		}
		if (hb[0] != 0xff)
			lhb = hb[0];
		tl_buf[i] = lhb;
		tl0 = tl1;
	}
}

I am using a python script using pyOpenCL to launch the kernel and communicate with it while it is running:

# teamwork.py
import pyopencl as cl, numpy as np, os, time
print("PID: %d" % os.getpid())

ctx = cl.create_some_context()
cq0 = cl.CommandQueue(ctx)
cq1 = cl.CommandQueue(ctx)

# compile kernel
krnl_src = open("teamwork.cl", "r")
prg = cl.Program(ctx, krnl_src.read()).build("-I./")
krnl_src.close()

hb = np.array([0xdeadbeef], dtype=np.uint32)
nsamples=np.int32(os.getenv("NSAMPLES"))
dt=np.uint32(os.getenv("DT"))
print("DT = %dus | nsamples = %d" % (dt, nsamples))

tl_buf=np.zeros([nsamples], dtype=np.uint32)

d_hb = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, hb.nbytes)
d_tl_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, tl_buf.nbytes)

cl.enqueue_copy(cq0, d_hb, hb)
print("d_hb={} (before kernel launch)".format(hb))

# run kernel
event = prg.teamwork(cq0, (1,), None, d_hb, d_tl_buf, nsamples)
time.sleep(2)

# tlatch sim
for i in range(nsamples):
    t_actul = np.uint32((i+1)*dt)
    # copy to GPU memory
    event = cl.enqueue_copy(cq1, d_hb, t_actul)
    event.wait()
t_actul = np.uint32(0xff) # write 0xff so the kernel doesn't hang
event = cl.enqueue_copy(cq1, d_hb, t_actul)
event.wait()

print("tlatch_sim() complete...")
event = cl.enqueue_copy(cq0, hb, d_hb)
event.wait()
event = cl.enqueue_copy(cq0, tl_buf, d_tl_buf)
event.wait()

print("d_hb={} (after kernel launch)".format(hb))
print("d_tl_buf={}".format(tl_buf))

Some further detail on how this works:

  1. on the host I setup two command queues: cq0 and cq1,
  2. cq0 initializes device memory and launches the kernel, which has 3 arguments: d_hb, d_tl_buf and nsamples,
  3. cq1 is used in a for loop (of nsamples iterations) to periodically write new values to d_hb[0], which are then saved in the device array d_tl_buf.

The expected result of this kernel is that after the kernel launch, the kernel also enters a for loop of nsamples iterations, then enters a while loop that waits for a new value to be written to d_hb[0]. Once a new value is written to d_hb[0] by the host script, the kernel will break out of the while loop, and record this new value in d_tl_buf, complete the for loop iteration, then re-enter the while loop and wait for new data.

Using my CPU, I get the expected result:

$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 6997
/home/mitchell/.local/lib/python3.11/site-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[ 10  20  30  40  50  60  70  80  90 100]
$ clinfo | grep DEVICE_TYPE
  Device Type:					 CL_DEVICE_TYPE_CPU

But when I run this on my AMD Firepro W5100 (amdgpu-pro driver 21.10-1263777), I get this unexpected result:

$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 2432
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[255 255 255 255 255 255 255 255 255 255]
$ clinfo | grep DEVICE_TYPE
  Device Type:					 CL_DEVICE_TYPE_GPU

I have been racking my brain trying to figure out why this kernel doesn’t run as expected on the GPU.

I have gotten kernels like this to run on NVIDIA GPUs using CUDA/pyCUDA and using two streams in the same manner that I use the two command queues.

The fact that d_tl_buf is all 0xff (255) when run on the GPU, implies that the entire host for loop is run before the kernel even launches, which is why the time.sleep() is there to ensure adequate time elapses to wait for the kernel to launch on the device, but it seems to have no affect.

It is also worth noting is that if I remove the if (tl1 == 0xff) break clause from the kernel code, the kernel hangs on the GPU.

Does anyone know why I’m getting different results?

I don’t know much about OpenCL, but coming from graphics API (especially Vulkan) perspective this looks highly suspect. In those APIs modifications to a buffer object contents without careful synchronization and flushing of caches is undefined behavior. And even then the synchronization is to my knowledge at best between dispatches of GPU work, not while a single dispatch is still executing.
From a quick read of the section on Shared Virtual Memory in the OpenCL spec where it talks about synchronization points I would suspect you would also need something that ensures consistency of the memory contents.

@carsten_neumann, thanks for your insight. Some further details about my use case: I would like to use DirectGMA (aka bus addressable memory) to have real-time data sent over the PCIe bus to d_hb. Every time new data is sent to the memory buffer, a new timestamp would show up in d_hb[0], which is what this code is trying to emulate. The change to d_hb[0] is how the kernel knows new data is in the buffer to process. To my knowledge, applications like this are the whole point of DirectGMA, so do you know how those applications are supposed to work? I know OpenGL is supposed to work with DirectGMA as well.

Hmm, searching for “OpenCL DirectGMA” leads to a thread on the AMD Forums that seems relevant. Note what it says about having to use Shared Virtual Memory for concurrent access to the buffer from the host and the device and using the cl_amd_bus_addressable_memory extension for utilizing DirectGMA.

There is also an example for using DirectGMA with OpenCL in this repo.

@carsten_neumann, thank you for the link, it was super helpful. Even though AMD advertised the FirePro W5100 as an OpenCL 2.0 device, my clinfo reveals that it is in fact using OpenCL 1.2, has no SVM capabilities and thus atomic operations will not work. I guess I’ll have to start a new post on how to install the amdgpu driver to get OpenCL 2.0 capability.

If someone with an OpenCL 2.0+ capable GPU would be so kind as to run this code and tell me results, that would be greatly appreciated.