I'm currently working with OpenCL and I'm running into problems with a high number of registers per thread in my main kernel.
The main kernel uses quite a lot of float4 variables, but most of the time they could actually be float3. I know cl_float3 is a typedef of cl_float4 on the host, and I also know that float3 on the device side occupies 16 bytes.
Am I right in thinking that the extra unused float wastes a register?
If so, is there a trick to work around this problem?
Sorry for my bad English.
You are correct: float3 takes the size and alignment of a float4. There are ways to use tightly packed 3-component data in kernels; have a look at the vloadn functions (vload3 in this case). But it is much slower than using cl_float3 (which is a cl_float4). As for the register question, I'm not really sure. The best approach is to use the 4th component for data you will need somewhere else in your computation (the index of the vector in host memory, etc.).
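For reference, a rough sketch of what the vload3 route could look like. The kernel name, the buffer name and the scaling operation are made up for illustration; only vload3 and vstore3 are actual OpenCL C built-ins. The buffer is assumed to hold tightly packed floats (x0 y0 z0 x1 y1 z1 ...), not float4 elements.

```c
// Hypothetical kernel: scales packed 3-component vectors in place.
// Assumes `data` is a plain float buffer with 3 floats per element.
__kernel void scale_packed(__global float *data, const float s)
{
    const size_t i = get_global_id(0);

    // vload3 reads data[3*i], data[3*i + 1], data[3*i + 2];
    // only float alignment is required, not 16-byte alignment.
    float3 v = vload3(i, data);

    v *= s;

    // vstore3 writes the three components back to the same packed slot.
    vstore3(v, i, data);
}
```

This saves a quarter of the memory per element, but the loads are no longer 16-byte aligned, which is the likely reason for the slowdown mentioned above.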
OK, it seems the NVIDIA compiler is able to save registers when I use float3 instead of float4.
And I can keep float4 in global memory to preserve the alignment, converting with as_float3.
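A minimal sketch of that pattern, in case it helps someone else. The kernel name, the buffer name and the scaling step are placeholders, not my real code. Whether this actually lowers the register count depends on the compiler, so it is worth checking the build log for your own kernel.

```c
// Hypothetical kernel: the data stays float4 in global memory,
// but the arithmetic is done on a float3.
__kernel void scale_positions(__global float4 *pos, const float s)
{
    const size_t i = get_global_id(0);

    // Aligned 16-byte load from global memory.
    const float4 p4 = pos[i];

    // Reinterpret the float4 as a float3. For a 4-component operand and
    // a 3-component result, as_float3 returns the bits unchanged.
    // Writing p4.xyz should give the same three components.
    float3 p = as_float3(p4);

    // Work on three components only.
    p *= s;

    // Store as float4 again, keeping the original 4th component.
    pos[i] = (float4)(p, p4.w);
}
```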