[XF] Float16 vs 16 float

Hello everybody,

I’m new to OpenCL. I tried to illustrate the power of float16, but I failed.
I built a program which adds two arrays of 1024*1024*16 floats. On the GPU, the computation takes 0.03 seconds with float16 and 0.006 seconds with 16 * float. On the CPU, it takes 2 seconds. But why is it slower with float16 than with 16 * float?

Thanks for your help.

Part of my code:

File Main.cpp:

// Define an index space (global work size) of threads for execution.
// A workgroup size (local work size) is not required, but can be used.
size_t globalWorkSize[1];
size_t localWorkSize[1];
// There are nbKernel threads
globalWorkSize[0] = nbKernel/16;
localWorkSize[0] = 512;

// Execute the kernel.
// 'globalWorkSize' is the 1D dimension of the work-items
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);


The .cl file:

__kernel void vecadd(__global float16 const * const A,
                     __global float16 const * const B,
                     __global float16 * const C)
{
    unsigned int const i = get_global_id(0);

    C[i] = A[i] + B[i];
}


Xavier Faure

Can you show us the kernel code for 16 floats – you only showed the float16 code.

What is the value of nbKernel in both cases? Have you tried passing NULL instead of localWorkSize?

There are all sorts of reasons the float16 case might run slower:

A) The biggest problem is the memory accesses will not be coalesced. The float16 load will be serialised into a number of smaller loads (float4?), which are spread across the address space. With a float type, every thread will access a single float in a packed range which can be serviced by a single memory load for each wavefront/warp.
B) You get less parallelisation of the ALU, i.e. each processor needs to do more than one operation to implement the arithmetic. (Whether this is faster or slower depends on the problem; extra parallelism isn’t always a win.)
C) You will have 1/16th as many threads running, and that might not be enough to hide the memory latencies depending on the problem size.
D) Probably not a problem in this case, but if you did something more complex, you are using many more registers - this limits how many threads can run concurrently on a given multi-processor.
E) Maybe … if the compiler does a full load, then a full alu op, then a full store, you don’t get such good interleaving of memory + alu ops, which may prevent the memory latency from being hidden.

I’ve never used a float16, and I can’t imagine where they’d be particularly useful for GPU, or CPU performance - they might be worth it on CELL though, where instruction-level data pipelining is critical to performance, and there are lots of registers.

GPUs seem to be optimised for float4 (not surprisingly: RGBA, XYZW, etc.).

There are just two differences:

  1. The value of nbKernel:
    nbKernel = 1024*1024*16, the size of the array, in the case of 16 floats
    nbKernel = 1024*1024, in the case of float16

  2. The declaration of the function:
    __kernel void vecadd(__global float const * const A, __global float const * const B, __global float * const C) in case of 16 floats
    __kernel void vecadd(__global float16 const * const A, __global float16 const * const B, __global float16 * const C) in case of float16

I tried passing NULL but it gave the same result.

Thanks for your help

I tried a lot of configurations with an 8388608-element float array.
The two parameters I tried to optimize are the local size (16, 32, 64, 128, …) and the float vector width (float, float2, float4, …).
And look at the results:
So, the wider float types seem to have a bad impact on the computation time.

Does it mean my kernel doesn’t have enough work?

I tried with a lot of sizes and every time it’s the same problem.

You are right: here, it’s not my problem.

I don’t understand this answer.

OK, thanks a lot for your answer, and sorry I took a long time to reply.

Have a nice day.


Well … yes.

Anyway - you’ve demonstrated that there is no benefit from using float16 but a negative impact on performance. This is a result.

The reasons I listed are possible ones - some of the actual hardware details are proprietary so some of them are only guesses.

I suspect the main one here is that the memory reads aren’t coalesced properly. See the nvidia or amd documentation (the ‘programming guide’ ones); they cover this pretty well with nice diagrams.