The new stuff

Hi,
I haven’t been following the development of DX11 GPUs closely. I can see that GL 4.0 GLSL has some functions like determinant(), transpose(), inverse(), fma() and some bit manipulation for reinterpreting a float as an int or an int as a float.
Are all these GPU accelerated now?

I think determinant() and transpose() might have been there since GL 2.0.

Define “GPU-accelerated”.

transpose and inverse may be able to use bits of hardware that implementations don’t expose directly to users. So in that sense, it is likely that they will be faster than anything you could write.

You can probably assume that fma is directly supported by hardware.

These are shader functions, so of course they’re running on HW and are therefore ‘accelerated’. Whether the compiler simply expands a macro or there’s some more targeted hardware support is going to vary, and is the stuff of GPU wars.

It would be trivial to implement transpose in HW; it’s a single specific case of a 16-register swizzle (on a mat4). The question is whether a designer would look at this and think it’s worth the effort when you can just do a few copies. There’s also a transpose flag when sending in uniforms, which is where an app should set this if possible. A robust full inverse would not be so straightforward and is either a macro or some hybrid.
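To make the "swizzle" point concrete, here is a minimal CPU-side sketch in C++ (not GLSL, and not how any particular driver does it). The `Mat4` alias and the `transpose` helper are hypothetical names for illustration; the only assumption is column-major storage, as GLSL uses for mat4. The function does no arithmetic at all, just 16 copies to permuted positions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 matrix type: 16 floats in column-major order,
// so element (row r, column c) lives at index c * 4 + r.
using Mat4 = std::array<float, 16>;

// Transpose as a pure permutation: element (r, c) of the input
// becomes element (c, r) of the output. No multiplies, no adds.
Mat4 transpose(const Mat4& m) {
    Mat4 t{};
    for (std::size_t c = 0; c < 4; ++c) {
        for (std::size_t r = 0; r < 4; ++r) {
            t[r * 4 + c] = m[c * 4 + r];
        }
    }
    return t;
}
```

When the matrix comes from the application as a uniform, the `transpose` parameter of `glUniformMatrix4fv` requests the same reordering at upload time, so the shader never has to do it.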

MAD is common enough and useful enough that it’s sure to be in there as a single instruction. It probably was already, and optimizing compilers would have been emitting this instruction already.
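As an illustration of why multiply-add is so common, here is a small C++ sketch using the standard `std::fma` from `<cmath>`, which computes `a * b + c` with a single rounding. The `horner3` helper is a hypothetical name; it evaluates a cubic polynomial with exactly one multiply-add per step, the kind of pattern a shader compiler can map onto a MAD or FMA instruction when the hardware has one.

```cpp
#include <cmath>

// Evaluate c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3 with Horner's rule.
// Each step is one fused multiply-add: acc = acc * x + c[i].
float horner3(const float c[4], float x) {
    float acc = c[3];
    acc = std::fma(acc, x, c[2]);
    acc = std::fma(acc, x, c[1]);
    acc = std::fma(acc, x, c[0]);
    return acc;
}
```

Whether a given GPU executes GLSL's fma() as one fused instruction is up to the implementation; the sketch only shows the shape of the computation.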

So I think it’s ALL going to be hardware accelerated where the vendor supports the API. The real issue is how many instructions each function costs in hardware. It’d be nice to call inverse on a 4x4 in a shader and get a one-clock solution, but it ain’t gonna happen. It will still run on the GPU and it will be HW accelerated. Of course, if you throw that kind of code into a shader under the wrong circumstances, you end up doing a lot of potentially redundant matrix inversions, so make sure it’s justified.

Well, that’s the thing. Let’s take the case of the transpose. If there aren’t any GPUs that have an assembler transpose instruction, then it shouldn’t exist.
And if Khronos decided to include it in GLSL, there should be a doc that explains the reasoning. “We included this because… and for performance reason X”

“We included so and so because the following generation X GPUs now support it”.

Perhaps they should just offer functionality that exposes hw features and no more convenience functions.

Well, I would say convenience is convenient, and in the inverse() case it is or can be hardware optimized.
Both points make it a worthy addition.

Is there really a need for lengthy, hazy explanations like “maybe some vendors can do this or that to optimize better, blah blah blah”?

Let’s take the case of the transpose. If there aren’t any GPUs that have an assembler transpose instruction, then it shouldn’t exist.
And if Khronos decided to include it in GLSL, there should be a doc that explains the reasoning. “We included this because… and for performance reason X”

I don’t buy that logic at all. Just because there is no GPU opcode for it doesn’t mean that the implementation can’t implement it more efficiently than the user.

The point of the standard functions is not to say that each one is a single opcode. It is to allow implementations the freedom to optimize what they would otherwise not easily be able to.

Remember: these functions will be at least as fast as a user-implemented one.

Alfonse just formulated exactly my thoughts.

This is nonsense. There are many instructions that do not produce single opcodes. Do you think a matrix multiply is an atomic operation at the hardware level? Or even a vector transform?
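To show what sits behind a single `m * v` in a shader, here is a minimal C++ sketch of a mat4-by-vec4 transform written out on the CPU. The `Vec4` and `Mat4` aliases and the `transform` function are hypothetical names for illustration, and column-major storage is assumed. The one-operator expression expands into 16 multiply-adds; how those are scheduled on real hardware is up to the compiler and the GPU.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;
// Hypothetical 4x4 matrix type, column-major: element (r, c) is at c * 4 + r.
using Mat4 = std::array<float, 16>;

// m * v: the result is the sum of each matrix column scaled by the
// matching component of v. That is one multiply-add per matrix element.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};  // zero-initialized accumulator
    for (std::size_t c = 0; c < 4; ++c) {
        for (std::size_t r = 0; r < 4; ++r) {
            out[r] += m[c * 4 + r] * v[c];
        }
    }
    return out;
}
```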

Hardware and its associated compilers have the potential to optimize this. Regardless of how many hardware and compiler resources are dedicated to it, it IS hardware accelerated.

Feel free to write your own and don’t use the intrinsic function.