I have been trying to figure out why I am getting different results when running an OpenCL kernel on my CPU and my GPU.
The basic idea is that I launch this kernel, teamwork, on the device, then interact with it from the host by periodically writing to device memory, causing the kernel to advance as device memory is updated.
Here is the kernel code:
__kernel void teamwork(__global volatile unsigned int* hb,
                       __global volatile unsigned int* tl_buf,
                       const int nsamples) {
    volatile unsigned int tl0 = 0xdeadbeef; /* always run one dummy loop */
    volatile unsigned int tl1;
    unsigned int lhb = hb[0];
    const unsigned int tid = get_global_id(0);
    for (int i = 0; i < nsamples; i++) {
        while ((tl1 = hb[0]) == tl0) {
            if (tl1 == 0xff)
                break;
        }
        if (hb[0] != 0xff)
            lhb = hb[0];
        tl_buf[i] = lhb;
        tl0 = tl1;
    }
}
I am using a Python script with pyOpenCL to launch the kernel and communicate with it while it is running:
# teamwork.py
import pyopencl as cl, numpy as np, os, time
print("PID: %d" % os.getpid())
ctx = cl.create_some_context()
cq0 = cl.CommandQueue(ctx)
cq1 = cl.CommandQueue(ctx)
# compile kernel
krnl_src = open("teamwork.cl", "r")
prg = cl.Program(ctx, krnl_src.read()).build("-I./")
krnl_src.close()
hb = np.array([0xdeadbeef], dtype=np.uint32)
nsamples=np.int32(os.getenv("NSAMPLES"))
dt=np.uint32(os.getenv("DT"))
print("DT = %dus | nsamples = %d" % (dt, nsamples))
tl_buf=np.zeros([nsamples], dtype=np.uint32)
d_hb = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, hb.nbytes)
d_tl_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, tl_buf.nbytes)
cl.enqueue_copy(cq0, d_hb, hb)
print("d_hb={} (before kernel launch)".format(hb))
# run kernel
event = prg.teamwork(cq0, (1,), None, d_hb, d_tl_buf, nsamples)
time.sleep(2)
# tlatch sim
for i in range(nsamples):
    t_actul = np.uint32((i+1)*dt)
    # copy to GPU memory
    event = cl.enqueue_copy(cq1, d_hb, t_actul)
    event.wait()
t_actul = np.uint32(0xff) # write 0xff so the kernel doesn't hang
event = cl.enqueue_copy(cq1, d_hb, t_actul)
event.wait()
print("tlatch_sim() complete...")
event = cl.enqueue_copy(cq0, hb, d_hb)
event.wait()
event = cl.enqueue_copy(cq0, tl_buf, d_tl_buf)
event.wait()
print("d_hb={} (after kernel launch)".format(hb))
print("d_tl_buf={}".format(tl_buf))
Some further detail on how this works:
- on the host I set up two command queues: cq0 and cq1
- cq0 initializes device memory and launches the kernel, which has 3 arguments: d_hb, d_tl_buf and nsamples
- cq1 is used in a for loop (of nsamples iterations) to periodically write new values to d_hb[0], which are then saved in the device array d_tl_buf
The expected result is that after launch, the kernel enters its own for loop of nsamples iterations and, in each iteration, enters a while loop that waits for a new value to be written to d_hb[0]. Once the host script writes a new value to d_hb[0], the kernel breaks out of the while loop, records the new value in d_tl_buf, completes the for loop iteration, then re-enters the while loop and waits for new data.
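To make the handshake I am expecting concrete, here is a minimal pure-Python sketch of the same pattern, with a thread standing in for the kernel and a plain list standing in for d_hb. The names fake_kernel, hb and tl_buf below are illustrative stand-ins, not part of the real code:

# Minimal pure-Python sketch of the intended handshake (illustrative only).
# A thread stands in for the OpenCL kernel; hb[0] stands in for d_hb[0].
import threading, time

NSAMPLES = 10
DT_S = 0.005                    # 5 ms between host writes so the spinner rarely misses one
hb = [0xdeadbeef]               # stand-in for d_hb
tl_buf = [0] * NSAMPLES         # stand-in for d_tl_buf

def fake_kernel():
    tl0 = 0xdeadbeef
    lhb = hb[0]
    for i in range(NSAMPLES):
        while True:             # spin until hb[0] changes (or the 0xff sentinel shows up)
            tl1 = hb[0]
            if tl1 != tl0 or tl1 == 0xff:
                break
        if hb[0] != 0xff:
            lhb = hb[0]
        tl_buf[i] = lhb
        tl0 = tl1

t = threading.Thread(target=fake_kernel)
t.start()
for i in range(NSAMPLES):       # "host" side: periodically publish new values
    hb[0] = (i + 1) * 10
    time.sleep(DT_S)
hb[0] = 0xff                    # sentinel so the fake kernel cannot hang
t.join()
print(tl_buf)                   # typically [10, 20, ..., 100], like the CPU run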
Using my CPU, I get the expected result:
$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 6997
/home/mitchell/.local/lib/python3.11/site-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
_create_built_program_from_source_cached(
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[ 10 20 30 40 50 60 70 80 90 100]
$ clinfo | grep DEVICE_TYPE
Device Type: CL_DEVICE_TYPE_CPU
But when I run this on my AMD Firepro W5100 (amdgpu-pro driver 21.10-1263777), I get this unexpected result:
$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 2432
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[255 255 255 255 255 255 255 255 255 255]
$ clinfo | grep DEVICE_TYPE
Device Type: CL_DEVICE_TYPE_GPU
I have been racking my brain trying to figure out why this kernel doesn’t run as expected on the GPU.
I have gotten kernels like this to run on NVIDIA GPUs with CUDA/pyCUDA, using two streams in the same manner that I use the two command queues here.
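The pyCUDA version followed the same shape, roughly like the sketch below, where s0 plays the role of cq0 (the launch) and s1 plays the role of cq1 (the host-side writes). The kernel body is omitted and the names are placeholders, not my actual CUDA code:

# Rough pyCUDA analogue of the two-queue pattern above (placeholder kernel body;
# names are illustrative, not my actual CUDA code).
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule(r"""
__global__ void teamwork(volatile unsigned int *hb,
                         volatile unsigned int *tl_buf,
                         int nsamples) {
    /* same spin-and-record logic as the OpenCL kernel */
}
""")
teamwork = mod.get_function("teamwork")

nsamples = 10
hb = np.array([0xdeadbeef], dtype=np.uint32)
tl_buf = np.zeros(nsamples, dtype=np.uint32)
d_hb = cuda.mem_alloc(hb.nbytes)
d_tl_buf = cuda.mem_alloc(tl_buf.nbytes)
cuda.memcpy_htod(d_hb, hb)

s0, s1 = cuda.Stream(), cuda.Stream()          # s0 ~ cq0 (launch), s1 ~ cq1 (host writes)
teamwork(d_hb, d_tl_buf, np.int32(nsamples),
         block=(1, 1, 1), grid=(1, 1), stream=s0)
for i in range(nsamples):
    val = np.array([(i + 1) * 10], dtype=np.uint32)
    cuda.memcpy_htod_async(d_hb, val, s1)      # feed the running kernel via the second stream
    s1.synchronize()                           # pinned host memory would be needed for a truly async copy
cuda.memcpy_htod_async(d_hb, np.array([0xff], dtype=np.uint32), s1)  # sentinel
s1.synchronize()
cuda.memcpy_dtoh(tl_buf, d_tl_buf)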
The fact that d_tl_buf is all 0xff (255) when run on the GPU implies that the entire host for loop runs before the kernel even launches. That is why the time.sleep() call is there, to ensure adequate time elapses for the kernel to launch on the device, but it seems to have no effect.
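For reference, this is the kind of explicit flush and status check I could add right after the launch to test that theory; I am assuming pyopencl's CommandQueue.flush() and Event.get_info() here (clFlush / clGetEventInfo), and it is not part of the script above:

# Possible check (not in the script above): flush cq0 right after the launch so the
# enqueue is actually submitted, then query the launch event's execution status.
event = prg.teamwork(cq0, (1,), None, d_hb, d_tl_buf, nsamples)
cq0.flush()   # clFlush: push the queued launch to the device
time.sleep(2)
print(event.get_info(cl.event_info.COMMAND_EXECUTION_STATUS))  # QUEUED/SUBMITTED/RUNNING/COMPLETE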
It is also worth noting that if I remove the if (tl1 == 0xff) break clause from the kernel code, the kernel hangs on the GPU.
Does anyone know why I’m getting different results?