The host provides a 3d vector field, i.e. a 4d-float-matrix:
The first three dimensions represent a lattice and the fourth dimension of length 3 stores the three vector components x,y,z at a given lattice point. Before passing this structure to the kernel, it is flattened to a 1d array of length 3NxNy*Nz. Inside the kernel an iteration for each lattice point (i.e. each vector) has to be done for let’s say 10 steps. BUT: For each iteration step the values of adjacent lattice points (6 for each lattice point) have to be considered. Without this restriction I can just let each worker do all 10 iteration steps for each lattice point as they are all independent. But with this restriction I have to wait for each lattice point to reach the current iteration step before the next step for any lattice point can be done.
Is there a way to cope with this? I’m not very experienced with OpenCL.
The only way to synchornise global memory writes across multiple work groups is to run another kernel. i.e. do one step at a time. The best way to think about it is that global memory is either read-only, write-only, or only read-write to the same address range (per work group).
There are many reasons, including that kernels might not even be running physically because they can’t fit on a specific device, and it allows the hardware to run faster as it doesn’t need to worry about coherency across 20+ devices.
Atomics are no solution here, they will be too slow and are not designed for it.
If the problem had only a local requirement then some synchronisation could occur inside the kernel using local memory instead. e.g. depending on what you do with the adjacent values, you could just over-calculate overlapping regions so you can do synchronisation in-kernel. But I doubt this is the case.
I also thought about this possibility. Does the data have to be moved between host and device every kenrel call or can it stay in device memory until the last timestep/call is done and then fetched only once to the host? (I’m not that familiar with the OpenCL memory model.)
You should probably read up on the relevant parts of the spec, section 3.3 is about the memory model. Chapter 3 overall is a fairly light read and introduction to the architecture and you should at least read that.
OK, I’ve read section 3. Now I have a question about the synchronization. Currently I’m using AMD APP SDK and a Intel Core 2 but in near future I will switch to a Nvidia GTX 560. The device info method says:
MAX_WORK_ITEM_SIZES: [1024, 1024, 1024]
So, if my vector field matrix doesn’t exceed this dimensions, I can synchronize the work-items inside the kernel, right? Would it be more efficient than synchronizing by recalling the kernel?
You should read the NVIDIA OpenCL programming guide and the OpenCL best practices from here http://developer.nvidia.com/nvidia-gpu-computing-documentation. There are many ways you can organise your workgroups and workitems. For the GPU in particular, yes you can put them all in one workgroup but you won’t get very good performance as this workgroup will only use a single SM (since you cant synchronise across SMs).
You want to allocate your workitems in multiples of 32 (a warp) and then you make however many workgroups you need based on the multiple you use for best performance. As I say, the programming guides explain it very well. But yes, if you want to synchronise across all workitems you need one big workgroup.
Not especially. Just minimise the amount of data transfer you do at each iteration and do as few kernel calls as possible [/quote]
Yeah, I was expecting an answer like this.
Is anybody here who has used Python and C and can compare the performance? I’m wondering about how much overhead causes a kernel call in C compared to one in Python.
Yes, I have. However, the code is not finished and it may be true that further changes will affect the problem in such a way, that the overhead becomes negligible or even more significant.
What I’ve done so far was programming a kernel for numerical integration based on heuns method for many equations with different initial conditions. At some point I had to remove the iteration loop from the kernel and place it in the host in order to synchronize all work items. This has lead to a dramatic slowdown which depends on how much equations there are to integrate and how much timesteps to do. Assuming that there will be enough equations to utilize a GPU well, the remaining factor is the amount of timesteps per equation, which corresponds to the length of the iteration loop and thus to the amount of kernel calls.
for nt in range(amountOfTimesteps):
This loop iterates for a certaint amountOfTimesteps in 16.7 seconds. When I add a dummy kernel (without any instruction inside and without passing any data) into the loop the execution grows to 31.3 seconds in total. A second dummy kernel per iteration results in 45.6 seconds. Hence, calling the dummy kernel takes 45.6 - 31.3 = 14.3 seconds. Assuming that the call of the more complex “real” kernel takes at least the time of calling the dummy kernel there are only 16.7 - 14.3 = 2.4 seconds left for the actual calculations inside the kernel. The calculations take only 7% of the whole process.
Of course this effect becomes negligible when the kernel becomes very complex or just time expensive (this was the case when the iteration loop was inside the kernel). But for now I can’t say how complex my kernel will be in future. Therefore I’m wondering about this overhead in C.
Try some empty function calls in python with your secret number of iterations and see how slow it is, compare to C (but make sure the compiler doesn’t optimise it away - put the function in another .o file).
You’d have to ask the python binding guys how much overhead they expect in that sort of use-case, as who else knows what it’s really doing under the bonnet.
You absolutely cannot obtain any ‘relative’ performance numbers by comparing broken ‘fast’ code that doesn’t work with ‘slower’ code that does!
The numbers are totally meaningless if it doesn’t work.
Any sort of global synchronisation is extremely expensive to perform without special functional units, and to do it how you are asking amounts to effectively turning off the caches - so orders of magnitude slower results should be expected.
Hm, okay. But have I understood correctly, that inside a kernel a work group of not more than 1024x1024x64 (this is MAX_WORK_ITEM_SIZES) work items can be synchronized?
edit: The code is not broken, it is only be limited to 1024x1024x64 matrices. I have to check whether I can live with this restriction or whether I have to find a way around this. And if the solution I found (moving the iteration loop to host) is too slow, then I will have to find another way.