Most efficient way to write vec3s to an SSBO from a compute shader?

I’ve written a compute shader that generates mesh data. I have two output buffers that I need to fill with the point and normal data I’ve computed. In the current version of my shader, I’m declaring them as

layout(r32f, set = 0, binding = 4) writeonly restrict uniform image1D result_points;
layout(r32f, set = 0, binding = 5) writeonly restrict uniform image1D result_normals;

and writing to them as

		imageStore(result_points, write_pos + i * 3, local_point_pos.xxxx);
		imageStore(result_points, write_pos + i * 3 + 1, local_point_pos.yyyy);
		imageStore(result_points, write_pos + i * 3 + 2, local_point_pos.zzzz);
		
		imageStore(result_normals, write_pos + i * 3, grad.xxxx);
		imageStore(result_normals, write_pos + i * 3 + 1, grad.yyyy);
		imageStore(result_normals, write_pos + i * 3 + 2, grad.zzzz);

imageStore only accepts vec4 as a datatype so I’m passing redundant data to it.

I noticed during testing that if I switch the image format from r32f to rgba32f, this shader runs much faster:

layout(rgba32f, set = 0, binding = 4) writeonly restrict uniform image1D result_points;
layout(rgba32f, set = 0, binding = 5) writeonly restrict uniform image1D result_normals;
...
		imageStore(result_points, write_pos + i, local_point_pos);
		imageStore(result_normals, write_pos + i, grad);

This is probably because there are fewer imageStore calls. Unfortunately, it also means I’m returning 4 floats per point instead of 3, and I need to return a buffer of vec3s since that is what the next stage of my pipeline requires.

Is there an efficient way to write vec3s to my buffer, or should I write vec4s and then postprocess the buffer on the CPU?

You can just write to a storage buffer (SSBO) that’s declared as an array of floats. Simply write 3 floats per vector instead of one vec3. The std430 layout allows an array of floats to have a 4-byte stride, so the output is tightly packed (note that an array of vec3s will still have a 16-byte stride under std430).
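As a minimal sketch of that approach, assuming the same set and binding numbers as in the question: the block names, the workgroup size, and the placeholder values computed in main() are hypothetical stand-ins for whatever the real shader does. It also assumes the dispatch size matches the buffer size, and that the descriptors on the host side are changed from storage images to storage buffers.

```glsl
#version 450

layout(local_size_x = 64) in;  // workgroup size is an assumption

// Under std430 a float array has a 4-byte stride, so three consecutive
// floats per vector end up tightly packed (12 bytes per point).
layout(std430, set = 0, binding = 4) writeonly restrict buffer ResultPoints {
    float result_points[];
};
layout(std430, set = 0, binding = 5) writeonly restrict buffer ResultNormals {
    float result_normals[];
};

void main() {
    // Hypothetical stand-ins for the values the real shader computes.
    uint index = gl_GlobalInvocationID.x;
    vec3 local_point_pos = vec3(float(index), 0.0, 0.0);
    vec3 grad = vec3(0.0, 1.0, 0.0);

    // Each vector occupies three consecutive floats.
    uint base = index * 3u;

    result_points[base + 0u] = local_point_pos.x;
    result_points[base + 1u] = local_point_pos.y;
    result_points[base + 2u] = local_point_pos.z;

    result_normals[base + 0u] = grad.x;
    result_normals[base + 1u] = grad.y;
    result_normals[base + 2u] = grad.z;
}
```

The resulting buffer can be read by the next stage as tightly packed xyz triples with a 12-byte stride.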

But if you want something more direct, where you can have an actual array of vec3s, what you’re looking for is the scalar block layout extension (GL_EXT_scalar_block_layout in GLSL, VK_EXT_scalar_block_layout on the Vulkan side). It allows all arrays and structs to have alignments and array strides based on the alignment of the individual scalars that make them up. Something is only 8-byte aligned if it contains a double, for example. Basically, it works like C alignment.

It’s implemented on a fairly broad set of hardware, but it’s far from universal.
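A sketch of the scalar-layout variant, under the same assumptions as before (hypothetical block names, workgroup size, and placeholder values). It only works if the device supports the scalarBlockLayout feature and the application enables it at device creation.

```glsl
#version 450
#extension GL_EXT_scalar_block_layout : require

layout(local_size_x = 64) in;  // workgroup size is an assumption

// With the scalar layout, a vec3 array has a 12-byte stride instead of
// the 16-byte stride that std430 would give it.
layout(scalar, set = 0, binding = 4) writeonly restrict buffer ResultPoints {
    vec3 result_points[];
};
layout(scalar, set = 0, binding = 5) writeonly restrict buffer ResultNormals {
    vec3 result_normals[];
};

void main() {
    // Hypothetical stand-ins for the values the real shader computes.
    uint index = gl_GlobalInvocationID.x;
    vec3 local_point_pos = vec3(float(index), 0.0, 0.0);
    vec3 grad = vec3(0.0, 1.0, 0.0);

    // One assignment per vector; no manual component indexing needed.
    result_points[index] = local_point_pos;
    result_normals[index] = grad;
}
```

The memory contents are the same as in the float-array version; the difference is only that the shader can index whole vec3s.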

“Should I write vec4s and then postprocess the buffer on the CPU?”

You should never do that. Even if neither of the above is palatable, you could just change “the next stage of my pipeline” to use vec4s.