How to represent a 4x4 matrix?

I am trying to pass a set of 4x4 matrices to an OpenCL kernel. I tried two representations:

  1. float16
  2. struct { float4 columns[4]; }

I load the matrices into an OpenCL buffer from the host side:

#ifndef __OPENCL_VERSION__
// host side (C++)
#include <core/math.h>
typedef math::vec4f    float4;
typedef math::mat4f    float4x4;
#else
// device side (OpenCL C)
#if 1
struct mat4f {
    float4 col[4];
};
typedef struct mat4f float4x4;
#else
typedef float16 float4x4;
#endif
#endif

struct volume_uniform_data {
    float4      _volume_extends;     // w unused
    float4      _scale_obj_to_tex;   // w unused
    float4      _sampling_distance;  // yzw unused
    float4      _os_camera_position;
    float4      _value_range;

    float4x4     _m_matrix;
    float4x4     _m_matrix_inverse;
    float4x4     _m_matrix_inverse_transpose;

    float4x4     _mv_matrix;
    float4x4     _mv_matrix_inverse;
    float4x4     _mv_matrix_inverse_transpose;

    float4x4     _mvp_matrix;
    float4x4     _mvp_matrix_inverse;
}; // struct volume_uniform_data

The kernel interface looks as follows:

main_vrc(__write_only image2d_t                        output_image,
         __read_only  image3d_t                        volume_image,
         __read_only  image2d_t                        colormap_image,
         __constant   struct volume_uniform_data*      volume_data)

The problem I am facing is that the first version, using float16 for the matrices, fails: the float16 variables contain wrong data. The struct version, on the other hand, works perfectly.

Why is that? As I understand it, everything in the volume_uniform_data struct should be 16-byte aligned, and that should hold for both solutions.

I am trying this on Nvidia GeForce 480/580 hardware using r285 drivers.

You probably need to consider that the four float4s in your structure are not columns but rows.

This snippet should help you understand:

float4 mat1[4];
float16 mat2;

// Assume you could do mat2 = mat1; here. It would look like:
mat2.s0123 = mat1[0].s0123; 
mat2.s4567 = mat1[1].s0123;
mat2.s89AB = mat1[2].s0123;
mat2.sCDEF = mat1[3].s0123;

// (.s0123 here on the float4 is the default swizzle, used for demo purposes only)

// Your col[4] struct would need the following transpose swizzle to fix your notion of how the data was stored:
mat2 = mat2.s048C159D26AE37BF;
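
To make the effect of that swizzle concrete, here is a small host-side C sketch that emulates it on a plain array of 16 floats. The function name transpose_swizzle and the index table are made up for illustration and are not part of OpenCL; the table simply spells out the hex digits of .s048C159D26AE37BF.

```c
#include <stddef.h>

/* Emulates  mat2 = mat2.s048C159D26AE37BF;  on a plain float[16].
 * Hypothetical helper for illustration only. */
void transpose_swizzle(const float in[16], float out[16])
{
    /* Each entry is one hex digit of the swizzle, in order. */
    static const int idx[16] = {
        0x0, 0x4, 0x8, 0xC,
        0x1, 0x5, 0x9, 0xD,
        0x2, 0x6, 0xA, 0xE,
        0x3, 0x7, 0xB, 0xF
    };
    for (size_t i = 0; i < 16; ++i)
        out[i] = in[idx[i]];
}
```

Reading the result as four groups of four, group r now holds elements r, r+4, r+8 and r+12 of the input, which is exactly a 4x4 transpose.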

Thanks for the suggestion, but I had originally also tested whether the matrix is stored transposed in the float16. My matrix on the host side is column major, so a simple copy to the struct does what I expect. But why would a float16 not be stored linearly?
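
One thing that may be worth checking here is alignment rather than element order. As far as I know, OpenCL C aligns every vector type to its own size, so a float16 is 64-byte aligned, while a struct of four float4 is only 16-byte aligned. The C11 sketch below mimics both layouts on the host with made-up stand-in types (vec4, mat_struct, mat_vec16 and the two uniforms_* structs are hypothetical names, not the types from the question). It assumes the host packs the first matrix directly after the five float4 members, at byte offset 80.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical host-side stand-ins for the OpenCL C types. */
typedef struct { alignas(16) float s[4]; }  vec4;       /* like float4: 16-byte aligned        */
typedef struct { vec4 col[4]; }             mat_struct; /* four float4: still 16-byte aligned  */
typedef struct { alignas(64) float s[16]; } mat_vec16;  /* like float16: 64-byte aligned       */

static_assert(sizeof(vec4) == 16, "vec4 should be 16 bytes");
static_assert(sizeof(mat_struct) == 64, "mat_struct should be 64 bytes");
static_assert(sizeof(mat_vec16) == 64, "mat_vec16 should be 64 bytes");

/* Five float4 members (80 bytes) followed by the first matrix. */
struct uniforms_struct { vec4 header[5]; mat_struct m_matrix; };
struct uniforms_vec16  { vec4 header[5]; mat_vec16  m_matrix; };

/* Byte offset of the first matrix in each layout. */
size_t first_matrix_offset_struct(void)
{
    return offsetof(struct uniforms_struct, m_matrix);
}

size_t first_matrix_offset_vec16(void)
{
    return offsetof(struct uniforms_vec16, m_matrix);
}
```

With these stand-ins the struct variant puts the first matrix at offset 80, while the 64-byte-aligned variant pads it out to offset 128. If the device compiler lays out float16 the same way, the kernel would read every matrix from a different offset than the one the host wrote to, which would look like wrong data even though the elements themselves are stored linearly.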