Reliable distributed computing w/ OpenCL, anyone doing this?

I have been working on finding a topic for my master’s thesis. My background is mostly on the hardware side, covering topics like computer architecture, fault tolerance and reliability; the latter two are of particular interest to me. For the past 2 years, I have been working as a software engineer for a company that uses Macs exclusively, and I got curious about OpenCL.
I started looking into it and it seems great for serious computation. Eventually, I imagine people will want to use it as the “horsepower” for distributed computing systems.
I don’t want to bore anyone to death, and would appreciate any info that might get me going. If anyone knows of any projects in the area of reliable distributed computing (preferably using OpenCL), please let me know. For example, the OpenCL platform model doesn’t seem to be well suited to recovery/reliability. It would be great to find some more info about this.

I don’t think you are going to find much out there about this with OpenCL for two reasons. The first is that OpenCL is new and the second is that high-end GPGPU work is also relatively new.

You could certainly build a framework around OpenCL where you checkpoint results by calling clFinish() periodically and reading back all cl_mem objects from the device into host memory. After a transient failure you could then reload the last checkpoint and resume from there, but I don’t think anyone has done this.
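To make the idea a bit more concrete, here is a minimal sketch of the host-side bookkeeping in C. It is only an illustration under assumptions, not an existing framework: tracked_buffer, checkpoint_save, checkpoint_restore and the two callback types are hypothetical names made up for this example. The device transfers are hidden behind callbacks so the sketch runs without a GPU. In a real OpenCL host program, the read-back callback would wrap a blocking clEnqueueReadBuffer and the upload callback a blocking clEnqueueWriteBuffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical transfer callbacks (not part of OpenCL). In a real host
 * program, readback would wrap a blocking clEnqueueReadBuffer and upload
 * a blocking clEnqueueWriteBuffer on the cl_mem stored in `handle`. */
typedef int (*readback_fn)(void *handle, void *dst, size_t size);
typedef int (*upload_fn)(void *handle, const void *src, size_t size);

/* One tracked device buffer; `handle` stands in for a cl_mem. */
typedef struct {
    void  *handle;
    size_t size;
} tracked_buffer;

/* Write the step counter and every tracked buffer to `path`.
 * Data goes to a temporary file first, so a crash in the middle of a
 * write does not destroy the previous checkpoint (this relies on
 * POSIX rename() replacing the target). Returns 0 on success, -1 on error. */
int checkpoint_save(const char *path, unsigned long step,
                    const tracked_buffer *bufs, size_t count,
                    readback_fn readback)
{
    assert(path && bufs && readback);
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(&step, sizeof step, 1, f) == 1;
    for (size_t i = 0; ok && i < count; i++) {
        if (bufs[i].size == 0) continue;      /* nothing to store */
        void *host = malloc(bufs[i].size);    /* staging copy on the host */
        ok = host != NULL
          && readback(bufs[i].handle, host, bufs[i].size) == 0
          && fwrite(host, 1, bufs[i].size, f) == bufs[i].size;
        free(host);
    }
    if (fclose(f) != 0) ok = 0;
    if (!ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a checkpoint written by checkpoint_save and push each buffer
 * back to the device. Returns 0 on success, -1 on error. */
int checkpoint_restore(const char *path, unsigned long *step,
                       const tracked_buffer *bufs, size_t count,
                       upload_fn upload)
{
    assert(path && step && bufs && upload);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(step, sizeof *step, 1, f) == 1;
    for (size_t i = 0; ok && i < count; i++) {
        if (bufs[i].size == 0) continue;
        void *host = malloc(bufs[i].size);
        ok = host != NULL
          && fread(host, 1, bufs[i].size, f) == bufs[i].size
          && upload(bufs[i].handle, host, bufs[i].size) == 0;
        free(host);
    }
    fclose(f);
    return ok ? 0 : -1;
}

/* Host-memory stand-ins for the device, so the sketch runs without a GPU. */
int mem_readback(void *handle, void *dst, size_t size)
{
    memcpy(dst, handle, size);
    return 0;
}

int mem_upload(void *handle, const void *src, size_t size)
{
    memcpy(handle, src, size);
    return 0;
}
```

In a real program you would call clFinish() on the command queue right before checkpoint_save, so that every enqueued kernel has completed and the buffers are in a consistent state. The buffers must be listed in the same order and with the same sizes when restoring, since this sketch stores no per-buffer metadata.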

I imagined there wouldn’t be much out there. That’s actually great, since I can work on anything without worrying that it has already been done. How about any similar projects in CUDA or anything of that sort? It might give me a starting point.
Thank you for the tip about clFinish(). I have been leaning toward that exact topic of creating a “reliability framework”. :smiley:

Indeed, it’s hard to find something about distributed OpenCL. But it exists!

I began a project a year ago to create an open source distributed OpenCL framework. It’s in its early stages, but I already have a lot of code done. Progress is slow since I’m working alone, but I’m trying to get some collaboration.

If you want to take a look:

There’s also another project being talked about here: viewtopic.php?f=40&t=2536&start=0

Now two years later, is there anyone working on a checkpoint/restart facility for OpenCL?

How about adding a checkpoint/restart feature to the OpenCL standard?

The idea is a function call that saves everything about a given OpenCL context to memory: the program can then relinquish the GPU(s) and continue later from where it left off. This is a standard feature for CPU programs, but to my knowledge it currently has no GPU equivalent.