Writing a Kernel for a Explicit Method Heat Equation

Hello all,

The greatest difficulties I encounter are those with kernel writing. I am learning OpenCL for my job and as an exercise I am attempting to parallelize the explicit method for a 1D heat equation. The serial C implementation is actually very simple:

	int i,j;		// Discrete Mesh Indices
	double s;		// Numerical stability factor
	double x;		// Spatial dimension variable x			
	double StepSize_x;	// Step size for x;			
	int NumMeshPoints_x;	// Number of mesh points for x		
	double t;		// Time dimension variable t
	double StepSize_t;	// Step size for t 			
	int NumMeshPoints_t;	// Max steps for t			

	NumMeshPoints_x = 9;
	StepSize_t = 0.0025;
	NumMeshPoints_t = 10;
	StepSize_x = 1.0/(NumMeshPoints_x+1); 	
	s = StepSize_t/pow(StepSize_x, 2 );	// Satisfy s < 0.25 for stability
	double w[NumMeshPoints_x+1];		// Level Down Array
	double v[NumMeshPoints_x+1];		// Level Array

	for( i = 0; i < NumMeshPoints_x+1; i++ )
		x = i*StepSize_x;
		w[i] = 100;

	t = 0;
	for( j = 1; j <= NumMeshPoints_t; j++ )
		v[0] = 10;
		v[NumMeshPoints_x+1] = 10;

		for( i = 1; i < NumMeshPoints_x; i++ )
			v[i] = s*w[i-1] + (1-2*s)*w[i] + s*w[i+1];

		t = j*StepSize_t;	
		for( i = 0; i < NumMeshPoints_x+1; i++ )
			w[i] = v[i];

So I am trying to make the meat of this program, the double for loop, into my kernel. So as it stands, each layer of the array v[] is calculated at a given time step t, so that stepping in the t direction expands the grid, layer by layer. However, to calculate a layer requires the previous layer calculated. So I had two thoughts about this and I wasn’t sure how to really go about this:

  1. Write a kernel that handles the two for loops, prototyped as such:

	__kernel void heat( 	__global double *prevLayer,
				__global double *currLayer,
				int NumMeshPoints_x,
				int NumMeshPoints_y );

but I just don’t understand what would happen here. So I understand that I can access the arrays through the get_global_id(0) OpenCL command but can have each thread compute one point in the layer, then wrap that in the for loop to time step it? Otherwise my second though was to

  1. Have the kernel only handle a single layer, then enqueue it in a for loop for the time steps.

I am having difficulties even realizing a problem space. In MPI there is the probelm that threads would have to communicate continuously to compute the end points of their respective segments of the layer. I don’t know if this is a problem in OpenCL. OpenCL programmers, I need your help! Please any advice as I am lost!

Well how you solve this really depends on the problem size. If you’re only doing a pile of 9x10 problems you might approach it differently to if you were solving 100x1000, or 1000x100 …

But there are a couple of general approaches:
a) convert inner loops to multiple work-items within a work-group
b) convert outer loops to individual work-items
c) leave the code more or less the same and get parallelism by doing many at once in a batch

I’l just consider a) here, and that the problem size will be much bigger than 9x10 in practice.

For a, the cross work-item communication (which is required here) can be performed in two ways:
a) using local memory. Everything stays on the individual compute unit which has it’s own local memory. This is extremely fast. One kernel invocation can do any number of cross-work item communications, so for example you could put the time step loop inside the kernel. The problem is local memory is restricted in size, and so is the size of the work-group.
b) using global memory. Here memory has to go to/from global memory. If the information must be communicated outside of the work-group, then the only (practical) way to synchronise the data is with separate kernel calls. In this case time step loop is on the cpu host. If there is no cpu-device synchronisation within each iteration, the overhead of the calls isn’t too bad.

(this is why the size of the problem matters).

I’ll only look at b) here since it’s simpler and more general, and it equates to your case 2. If you want to do a) then you’re limited to the number of work-items that can run in a single work-group that run on a single compute unit - so you will not get effective parallelism unless you need to do many such problems separately. (there are other ways to communicate integer/float values between work groups for the overlapping cases, but they can be very expensive and will complicate the code and will probably be much slower).

Basically you convert the inner loop into a single bit of code which is run concurrently.

// implements this loop in parallel:
// v[0] = 10;
//      v[NumMeshPoints_x+1] = 10;
//      for( i = 1; i < NumMeshPoints_x; i++ )
//         v[i] = s*w[i-1] + (1-2*s)*w[i] + s*w[i+1];

kernel void
populate_layer(const global double *w, global double *v, double s, int NumMeshPoints_x) {
	int i = get_global_id(0);
	double result;

	// Check range
	if (i <= NumMeshPoints_x) {
		// Boundary condition
		result = (i == 0 || i == NumMeshPoints_x) ? 10
			: result = s*w[i-1] + (1-2*s)*w[i] + s*w[i+1];
		v[i] = result;

So most of this is a fairly straight-forward realisation of the inner loop, but a couple of specific points which might not be clear:

  1. The range is checked because when you launch on a gpu you will get optimum parallelism if you give it a global work size which is a multiple of the device width. This should be at least 64 for GPUs. You then have to do your own range checking.
  2. I include the boundary conditions ‘in the loop’.
  3. I assume the buffer pointers are swapped/double buffered in the host code, so there’s no need to copy the results back to w (due to the concurrency this wont work anyway, not with one kernel).

Having that, you then write host code that calls that kernel repeatedly for the outer loop, and then just read the result at the end. It will go something like (pseudocode):

// wcl = cl buffer of NumMeshPoints_x +1 doubles
// vcl = cl buffer of NumMeshPoints_x +1 doubles

 ... some init code or call another kernel to do it ...
	t = 0;
	for( j = 1; j <= NumMeshPoints_t; j++ )
k = populate_layer
k.setArg(0, wcl)
k.setArg(1, vcl);
k.setArg(2, s);
k.setArg(3, NumMeshPoints_x);
clEnqueueNDRangeKernel( globalworksize=roundUp(NumMeshPoints_x, 64))
	... swap wcl and vcl values ...
// read the results from wcl (it was the last vcl after swap)
clEnqueueReadBuffer(wcl, CL_TRUE);

If you don’t understand the memory stuff and kernel invocation you’d better look at some examples and read the specifications and vendor sdk stuff.

Perhaps I should have been a little more confident, because I actually set this guy up almost exactly as prescribed. My problem was with the NDRange. But I had previously attempted to implement the same exact kernel as you had again with the NDRange problem. Now it is all fixed thanks to your excellent explanation of Approach (a) using global memory.

What I would like to do now is attempt it using local memory (Approach a), after I get the global memory implementation timed for comparison with the serial implementation. I will keep you posted notzed if you don’t mind helping some more later on. Thanks again. You are good at what you do. :smiley:

I haven’t touched this code in quite a while, but after coming back to it two months later, I just noticed that while the code looks and acts how it seems it should, it is actually as precise in its calculation as the serial version. I mean its really really close, but the OpenCL code is ultimately not as precise.

All my types are correct and the programs are running with the exact same parameters. What could be going on?

Double precision vs Single Precision? What GPU are you using?

You can also test this in two ways:

  1. Write a version of your CPU code to use floats instead of doubles. Compare your differences then.
  2. Run your OpenCL code on your CPU using the Intel OpenCL SDK or the AMD SDK. Make sure you enable double-precision on your kernel, as it needs a flag.

Compare your results. If they are still different by a noticeable margin, we can dig deeper.

The opencl specification lists the numerical precision required for the standard, and some vendor material lists any deviations from that. Maybe it isn’t the same as IEEE demands.

And apart from that, GPU’s sometimes use less accurate versions of some instructions, e.g. divide is usually implemented via reciprocal, the faster version of fused-multiply-add is used, etc.