Question about "Definitions", Basic Ideas for MD-Simulation


I want to use my own MD simulation C code with OpenCL, and for that purpose I have a few questions.

The number of particles I want to use is 4096, so my global work size would be 4096 (?), and each force calculation and integration of the movement could be a single work-item (?).

What is the purpose of a work-group? My GPU can only handle a certain number of work-items with shared memory, so I group my work-items to make it easier for my GPU to parallelize? Or do I not need this parameter in my example, but it becomes handy once I increase the number of particles? In other words, can I launch all 4096 calculations at once, and will they be solved at once? Furthermore, my data would be stored in a structure, so the "data dimension" is 1?

What is a reasonable work-group size?

Is there a list of the error codes available on the website? I have a problem with passing arguments to the kernel and I get error number 38.

Is there a list available for the clGetDeviceInfo command which tells me which parameter I need for a certain quantity (for example, CL_DEVICE_MAX_WORK_ITEM_SIZES)?

I hope someone can answer these questions for me, because I'm really excited about this new piece of technology.
As a remark, I found the simulation of stars by Apple, but it's too complicated for me to extract all the answers to my questions from it.

best regards

Your global work size should be 4096 if you want to calculate each particle in parallel. That may not be optimal for performance on different architectures, but it’s a reasonable start.

The work-group is the number of work-items run together on a compute-unit that can share a local memory and synchronize. The size of your work-group should be optimized for best utilization of the hardware or, if you need local memory or synchronization, to optimize those. For example, to do an n-body simulation efficiently you generally want to block some of the particles into the local memory to reduce memory traffic. (E.g., if you are iterating over all particles for each particle you can roughly halve your memory traffic by doing so.)
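The tiling idea described above can be sketched as an OpenCL C kernel. This is a minimal illustration, not the OP's code: the kernel name, the gravitational-style force law, and the softening constant are all assumptions.

```c
// Hypothetical n-body force kernel: each work-group stages a "tile" of
// particles in local memory so every work-item reads it from fast storage
// instead of fetching all positions from global memory.
__kernel void forces(__global const float4 *pos,   // xyz = position, w = mass
                     __global float4 *acc,
                     __local  float4 *tile,        // sized to one work-group
                     const int nBodies)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    float4 myPos = pos[gid];
    float4 a = (float4)(0.0f);

    for (int base = 0; base < nBodies; base += lsz) {
        // Each work-item loads one particle of the tile...
        tile[lid] = pos[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // ...then every work-item accumulates forces from the whole tile.
        for (int j = 0; j < lsz; ++j) {
            float4 d = tile[j] - myPos;
            float distSqr = d.x*d.x + d.y*d.y + d.z*d.z + 1e-6f; // softening
            float invDist = rsqrt(distSqr);
            a += d * (tile[j].w * invDist * invDist * invDist);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    acc[gid] = a;
}
```

Note the `__local` buffer is passed as a kernel argument (set with `clSetKernelArg(kernel, 2, localSize * sizeof(cl_float4), NULL)`), so its size follows the work-group size.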

A reasonable work-group size, assuming your use of local memory or synchronization is not work-group dependent, is between 64 and 256, but you need to check both the device's maximum supported work-group size (in total and in each dimension) with clGetDeviceInfo and the per-kernel limit with clGetKernelWorkGroupInfo.
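A sketch of those queries, assuming you already have a valid `device` and a built `kernel` (error checking omitted for brevity):

```c
/* Query device-wide and per-kernel work-group limits, then clamp
 * the local size you intend to use. */
size_t devMaxWG, kernMaxWG, itemSizes[3];

clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(devMaxWG), &devMaxWG, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(itemSizes), itemSizes, NULL);

/* The per-kernel limit can be lower than the device limit, e.g. because
 * of register or local-memory usage in this particular kernel. */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernMaxWG), &kernMaxWG, NULL);

size_t localSize = 64;
if (localSize > kernMaxWG) localSize = kernMaxWG;
```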

The error codes are all listed and defined in the OpenCL header (cl.h). Note that the codes are negative, so "error 38" is presumably -38, which is CL_INVALID_MEM_OBJECT: the cl_mem object you passed to clSetKernelArg is not valid (for example, it was never created successfully or was already released).

clGetDeviceInfo() is documented in the spec document which you can find on the Khronos website.

Thx for your answer.

I now have the problem that my code is a factor of 2 slower than the CPU version (not OpenCL, plain C with -O3 and auto-vectorization). Of course, I won't post my whole code here to fix this, but can someone help me guess where the bottleneck may be? It's a straightforward MD simulation with a brute-force calculation technique using 3 kernels (particle move 1, force calculation, and particle move 2). After each kernel execution I use a clFinish command to wait until the calculation is finished before the next one starts. I use shared (local) memory for the force calculation and don't copy memory between host and device during the simulation. Global work size is 4096 and local work size is 64.

I tried to spot the possible "error" and found that the clFinish command makes a huge difference in run-time. Of course, I need something like that for a correct calculation, but is there a workaround for it? Or does the error lie somewhere else?

You shouldn’t need to use clFinish. You should either use events to make sure that each kernel waits for the one before it, or just use an in-order queue so they will execute in-order. clFinish is extremely expensive as you have noted. In general you want to enqueue as much work as possible to the device (multiple kernels, double-buffer if possible to overlap computation) and use events to determine when data is ready to be read back.
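A sketch of the event-based chaining for the three kernels described above (hypothetical variable names, error checks omitted). With an in-order queue the wait lists are redundant, but they make the dependencies explicit and also work on an out-of-order queue:

```c
/* Chain move1 -> force -> move2 per step without ever calling clFinish. */
cl_event evMove1, evForce, evMove2;

for (int step = 0; step < nSteps; ++step) {
    clEnqueueNDRangeKernel(queue, kMove1, 1, NULL, &globalSize, &localSize,
                           0, NULL, &evMove1);
    clEnqueueNDRangeKernel(queue, kForce, 1, NULL, &globalSize, &localSize,
                           1, &evMove1, &evForce);
    clEnqueueNDRangeKernel(queue, kMove2, 1, NULL, &globalSize, &localSize,
                           1, &evForce, &evMove2);
    clReleaseEvent(evMove1);
    clReleaseEvent(evForce);
    clReleaseEvent(evMove2);
}

/* Block only when the host actually needs the data back. */
clEnqueueReadBuffer(queue, posBuf, CL_TRUE, 0, bytes, hostPos, 0, NULL, NULL);
```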