CPU code for converting from floats to halfs?

It would have been really useful if the OpenCL standard included a CPU-side converter from float to half. To work around this I’ve created a kernel specifically to convert floats to halfs, but it is now a bottleneck in my application, I imagine because it has to touch the device for such a small amount of work.

Does anyone have some CPU code that will convert from a float to a cl_half without having to touch the device? The inverse (half to float) would be useful for completeness.


I would suggest writing a simple kernel that does this and just run it on the CPU. That way you’ll get it fully accelerated and portable.

Is there a generic CPU implementation available yet? Currently, the NVidia implementation on Linux doesn’t support a CPU device.


Furthermore, for program abstraction purposes, it is cleaner to be able to convert small chunks (~100) of floats at a time. The overhead of launching a kernel, even a CPU kernel, seems too great. The abstraction is as follows, and doesn’t even require the API user to know OpenCL is being used in the background.

class Database {
public:
  void AddRecord(float *data, size_t size);
  size_t Query(float *query, size_t size);
};

So the user just adds floats (they get converted by the AddRecord method to halfs). And then the user queries however often they want.

Therefore I would still prefer a pure C solution. I’ve found the following article and code describing how to do it: http://www.mathworks.com/matlabcentral/ … ange/23173 It says it will convert according to the IEEE standard, so can I then reliably load that memory to GPU memory and be assured it will work cross platform?
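For reference, here is a minimal sketch of what such a device-free conversion might look like. It assumes the host `float` is IEEE 754 binary32 and that the target is the IEEE 754-2008 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits), which is the storage format OpenCL specifies for half. The function name `float_to_half` is made up for illustration, and the result is returned as a plain 16-bit unsigned integer on the assumption that `cl_half` is a typedef of that width. It rounds to nearest even; it is not the code from the linked article.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert a binary32 float to binary16 bits,
// rounding to nearest, ties to even.
std::uint16_t float_to_half(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // read the bit pattern safely

    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;  // biased by 127
    std::uint32_t mantissa = bits & 0x007FFFFFu;

    if (exponent == 0xFFu) {
        // Infinity stays infinity; any NaN becomes a quiet NaN.
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
    }
    if (exponent >= 143u) {
        // Magnitude is at least 2^16, beyond the half range: infinity.
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }
    if (exponent < 102u) {
        // Magnitude is below 2^-25 and rounds to (signed) zero.
        return static_cast<std::uint16_t>(sign);
    }

    std::uint32_t half;
    std::uint32_t shift;
    if (exponent >= 113u) {
        // Normal half: re-bias the exponent (127 -> 15), keep the top 10 bits.
        shift = 13u;
        half = ((exponent - 112u) << 10) | (mantissa >> 13);
    } else {
        // Subnormal half: restore the implicit leading 1, then shift so the
        // result is expressed in units of 2^-24.
        mantissa |= 0x00800000u;
        shift = 126u - exponent;  // between 14 and 24
        half = mantissa >> shift;
    }

    // Round to nearest even using the bits that were shifted out. A carry
    // out of the mantissa correctly bumps the exponent (up to infinity).
    const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
    const std::uint32_t halfway = 1u << (shift - 1u);
    if (remainder > halfway || (remainder == halfway && (half & 1u))) {
        ++half;
    }
    return static_cast<std::uint16_t>(sign | half);
}
```

Something like this could be called in a loop inside AddRecord for each incoming float. Byte order on the host and device would still need to match for the buffer to be uploaded as-is.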


The glm library http://glm.g-truc.net/ has an implementation of CPU half … but I don’t know if it’s completely cross-platform.

I agree that CL will have a large overhead for converting only a small amount (<1-8MB) of floats. I’m curious as to why you are converting only a few hundred floats at a time? If you’re processing that data you’ll have a similar overhead in the kernel itself. I’m guessing you want to use half values to save storage space on the device, but most devices should have >128MB of memory. Are you doing this to save space in local memory?

Yup, I can effectively double my occupancy (and therefore performance, for a latency-bound kernel) by keeping the values in local memory as halfs. Really useful trick, thanks for getting halfs into the standard. I’m also storing the values in host memory, so it doubles the amount I can hold there (and therefore don’t have to touch the disk to get more database records).

The glm library worked fantastic. I just lifted the code straight out of it and it works like a champ. I love MIT licensed code.
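For completeness, the inverse direction asked about at the top of the thread (half to float) can be sketched along the same lines. This is not the glm code, just a minimal illustration under the same assumptions: binary16 input held in a 16-bit unsigned integer, binary32 `float` on the host, and a made-up function name `half_to_float`. Every half value is exactly representable as a float, so no rounding is needed.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: expand binary16 bits into a binary32 float.
float half_to_float(std::uint16_t half) {
    const std::uint32_t sign = static_cast<std::uint32_t>(half & 0x8000u) << 16;
    std::uint32_t exponent = (half >> 10) & 0x1Fu;  // biased by 15
    std::uint32_t mantissa = half & 0x03FFu;
    std::uint32_t bits;

    if (exponent == 0x1Fu) {
        // Infinity or NaN: all-ones exponent, keep the mantissa payload.
        bits = sign | 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0u) {
        // Normal number: re-bias the exponent (15 -> 127), widen the mantissa.
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    } else if (mantissa != 0u) {
        // Subnormal half: shift left until the implicit leading 1 appears,
        // lowering the float exponent by one for each shift.
        exponent = 113u;
        while ((mantissa & 0x0400u) == 0u) {
            mantissa <<= 1;
            --exponent;
        }
        mantissa &= 0x03FFu;  // drop the now-implicit leading 1
        bits = sign | (exponent << 23) | (mantissa << 13);
    } else {
        // Signed zero.
        bits = sign;
    }

    float value;
    std::memcpy(&value, &bits, sizeof value);  // write the bit pattern safely
    return value;
}
```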