Beginner Question (PyOpenCL + OpenCL): Large bound on loop causes GPU to freeze?

Hi there!

I’m experimenting with (Py)OpenCL in an effort to learn more about it. Recently I’ve been running the following piece of code, whose purpose is just to find some way to tell all work-items to stop as soon as one of them has achieved the goal, rather than waiting for every work-item to complete the task. The code is intentionally trivial and doesn’t do anything useful. It sends a vector of length 5 to the device, initialized to [1,2,3,4,5], and then repeatedly increments each entry by its initial value (so 1 becomes 2, 2 becomes 4, etc.) until one of them crosses a predefined threshold, lim. Once one of them (the 5th entry, since it counts by 5’s) crosses that threshold, a variable end[0], which every work-item can write to, is overwritten with a large value, and that causes the while loop in each work-item to exit. The code is:

from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
import time

os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
os.environ['PYOPENCL_CTX'] = '0:1'    ###### NOTE: this is my CPU; 0:0 is my GPU

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8

# b_np is only printed; it is never sent to the device.
a_np = np.array([1,2,3,4,5]).astype(np.float32)
b_np = np.array([1,2,3,4,5]).astype(np.float32)
lim = np.array([10000000.0]).astype(np.float32)
end = np.array([-10.0]).astype(np.float32)
print("a_np = " + str(a_np))
print("b_np = " + str(b_np))

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
lim_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=lim)
end_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=end)

print("READ WRITE")
prg = cl.Program(ctx, """
__kernel void sum(global float *end, global float *lim, global float *a_g, global float *res_g) {
  int gid = get_global_id(0);
  float start = a_g[gid];

  // Keep adding until this work-item passes lim[0] or the shared flag is raised.
  while(start <= lim[0] && end[0] < 20.0f) {
    start = start + a_g[gid];
    if(start > lim[0])
      end[0] = 40.0f;  // signal all other work-items to stop
  }

  res_g[gid] = start;
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
prg.sum(queue, a_np.shape, None, end_g, lim_g, a_g, res_g)

# Copy the result back to the host (blocks until the kernel has finished).
res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

end_time = time.perf_counter()

print(res_np)
print("Run-time = " + str(end_time - start_time))

with open("my_gpu_hates_me.txt", 'w') as output:
    output.write("Run-time = " + str(end_time - start_time))

The code runs as expected (and hoped!) for lim values of 10,000,000.0 or less; for these values, the output is:

[ 2000001. 4000002. 6000003. 8000004. 10000005.]

Run-time = 1.0527003762980927

seemingly verifying that the 5th work-item won the (trivial) horse-race (as you’d hope, since it’s counting by 5’s while the others count by 1’s, 2’s, 3’s, or 4’s), and that the computation then ended.

For these same values of lim, commenting out the lines

	if(start > lim[0])
		end[0] = 40.0f;

yields the output

[ 10000001. 10000002. 10000002. 10000004. 10000005.]

Run-time = 3.6010642782902567

which is also as anticipated. Having removed the if statement that was used to prematurely abort the losing work-items’ while loops, now all work-items must count to lim[0], which of course takes longer than just having the fastest work-item count to lim[0]. All’s great so far!
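As a sanity check on both outputs, here is a minimal host-side sketch of the arithmetic. It assumes the work-items advance in lock-step (one addition each per iteration) and that the flag is seen immediately; OpenCL guarantees neither, and the function names are made up for illustration.

```python
def expected_with_flag(steps, lim):
    """Values when every work-item stops as soon as the fastest one passes lim."""
    fastest = max(steps)
    # Multiple of its own step at which the fastest work-item first exceeds lim.
    k = lim // fastest + 1
    # Under the lock-step assumption, every other work-item is at the same multiple.
    return [s * k for s in steps]


def expected_without_flag(steps, lim):
    """Values when each work-item runs until its own value exceeds lim."""
    return [s * (lim // s + 1) for s in steps]


steps = [1, 2, 3, 4, 5]
print(expected_with_flag(steps, 10_000_000))
print(expected_without_flag(steps, 10_000_000))
```

For lim = 10,000,000 these reproduce the two result vectors shown above, which is at least consistent with the flag being seen by the other work-items in that run.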

However, if I set lim to 100,000,000.0 (no particular significance; I just happened to plug this value in while testing), my screen eventually freezes. At first I thought the computation was just taking longer and that I’d get the display back once it finished, so I waited about 2 minutes of wall-clock time, far longer than the roughly order-of-magnitude increase in run-time (i.e. 10-40 s) that I was expecting. The screen simply stayed frozen until I gave up and hard-rebooted my PC. I’ve tried this several times and the behavior is quite consistent.

Oddly, the same problem doesn’t occur if I manually copy/paste the loop (re-initializing the values of start and end[0] in between, of course) 10 times. In this case, my run-time is 8 seconds rather than 1 second (on the GPU), or a little less than an order of magnitude longer, which is about what I would expect.

Is there any reason this code should behave so very differently for lim[0] set to one-hundred million versus setting it to ten million and writing out the loop 10 times, freezing in the former case while scaling as expected in the latter?
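One thing that may be worth checking here is float32 precision. The sketch below is a host-side NumPy check, not a diagnosis. It assumes the device's float rounds the same way as np.float32 (IEEE 754 single precision, round to nearest even), and first_stall is a hypothetical helper. It looks for the first power of two below lim at which adding a given step no longer changes the value.

```python
import numpy as np


def first_stall(step, limit=1e8):
    """Return the first power of two below `limit` where x + step == x in float32,
    or None if no such point exists below `limit`."""
    step = np.float32(step)
    limit = np.float32(limit)
    x = np.float32(1.0)
    while x < limit:
        if x + step == x:      # the addition is rounded away entirely
            return float(x)
        x = x * np.float32(2.0)
    return None


for s in [1, 2, 3, 4, 5]:
    print("step", s, "-> stalls at", first_stall(s))
```

On the host, steps 1 to 4 all stall somewhere between 2^24 (16,777,216) and 2^26 (67,108,864), while step 5 does not stall below 100,000,000. If the same holds on the device, the work-items counting by 1 to 4 could never pass lim = 100,000,000 on their own and would depend entirely on seeing end[0] change, whereas below 10,000,000 every addition is still exact. This check alone does not establish that this is what freezes the GPU.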

Relevant (?) Additional information:

I’m running Windows 7 on a dual-core 2.41 GHz Intel Pentium J2900 processor with its built-in Intel® HD Graphics GPU. (Don’t laugh! I was on a serious budget and in a hurry! :p)

I’ve added registry keys to switch TdrLevel off and to set TdrDelay to an extremely large number; I couldn’t get the latter to work by itself for some reason, so I ended up doing both. Without this, Windows automatically kills any computation that runs for more than 2 seconds on the GPU and resets the GPU.