[QUOTE=l_belev;1260278]To summarize your post, nvidia and amd are 100% scalar-friendly while intel are semi-scalar-friendly (only fragment shaders).
But what I know from intel’s documentation is that the issue with vertex shaders is not actually hardware but software-related, that is, it is their driver that puts the hardware in vectored mode for vertex shaders and scalar mode for fragment shaders.
In other words the hardware can actually work in scalar mode for vertex shaders too, it’s up to the driver. They could change this behavior by driver update, which would be needed anyway in order to support the hypothetical new binary shader format.
When the vectored mode is left unused, they could clean up their hardware by removing this redundant “flexibility”, which would save power and die area. That’s what all the other GPU vendors figured out already, some of them a long time ago.
[/QUOTE]
For Intel’s Gen7 and before, handling only 2 vertices per invocation is a hardware limit. So for vertex and geometry shading, and tessellation too, the compiler backend should do everything it can to vectorize the code to vec4s. If a vertex shader cannot be vectorized at all, then on Gen7 and before ALU utilization sits at 25% (see the sketch below). However, most programs are almost never vertex-bottlenecked.

To do 8 vertices at a time requires sand (die area) and changes to the logic for pushing vertices into the pipeline, to the post-vertex cache, and so on. On the subject of tessellation, 8-wide dispatch would be great, but I suspect more sand would then be needed on the GPU to buffer the tessellation output. Geometry shader 8-wide dispatch is quite icky: lots of room is needed, the GS has to execute -every- time (there is no real caching for GS), and its output must be fed to the triangle-setup-rasterizer unit in API order, which again requires even more buffering, namely 4 times as much as 2-wide dispatch. I hate geometry shaders.
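To make that 25% figure concrete, here is a back-of-envelope sketch in C. It assumes the usual description of Gen7-era vertex dispatch as SIMD4x2 (2 vertices per invocation, one vec4 lane set per vertex); the numbers are illustrative, not pulled from any vendor tool:

[code]
#include <stdio.h>

int main(void)
{
    const int vertices   = 2;  /* Gen7-era limit: 2 vertices per invocation */
    const int components = 4;  /* vec4 ALU layout per vertex (SIMD4x2)      */

    const int lanes       = vertices * components; /* 8 ALU lanes in total        */
    const int busy_scalar = vertices * 1;          /* fully scalar code keeps only
                                                      1 of 4 components busy      */

    printf("scalar ALU utilization: %d/%d = %d%%\n",
           busy_scalar, lanes, 100 * busy_scalar / lanes);
    return 0;
}
[/code]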
For the case of fragment shading, with its SIMD8, SIMD16 and SIMD32 modes, it is actually wise to have multiple fragment dispatch modes. The reasoning is register space: a hardware thread has 4KB of register space, so SIMD8 gives one 512B of registers per fragment, SIMD16 gives 256B, and SIMD32 gives 128B. If a shader is too complicated it needs more register space per fragment, hence the different dispatch modes for fragment shading.
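The arithmetic is just the 4KB register file divided by the number of fragments in flight; a trivial sketch:

[code]
#include <stdio.h>

int main(void)
{
    const int reg_file_bytes = 4096;      /* register space per hardware thread */
    const int widths[]       = {8, 16, 32};  /* SIMD8 / SIMD16 / SIMD32 dispatch */

    for (int i = 0; i < 3; i++)
        printf("SIMD%-2d -> %4d bytes of registers per fragment\n",
               widths[i], reg_file_bytes / widths[i]);
    return 0;
}
[/code]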
[QUOTE=l_belev;1260278]That would also ease the job of their driver team.
One would expect that they should have learned the lesson from their long history as a chip-maker that over-engineering stuff does not result in more powerful hardware but in weaker hardware (remember itanium?).
Nvidia also learned a hard lesson with geforce 5 when they made it too “flexible” for supporting multiple precisions.[/QUOTE]
The reason the GeForce FX had such a hard time was that it was really designed for lower precision in the fragment shader (fixed-point and half-float). Then DirectX9 came around and said: you need fp32 support in the fragment shader. So although the FX could do it, it was not optimized for it at all. Worse, the word on the street was that NVIDIA was left a little in the dark about fp32 in DX9 because MS and NVIDIA were having a spat over the price of the Xbox GPUs. That is rumor, though, and I cannot find hard confirmation of it.
As for implementing fp16 alongside fp32, the gist is this: let’s say you have an ALU that is N-wide SIMD for fp32. It turns out that adding fp16 to that ALU is not that bad, and then the ALU can do 2N-wide SIMD for fp16. That is a big deal, as one literally doubles the FLOPS if the code can be entirely fp16. So the “easier” thing to do is to have the compiler pair fp16 operations into vec2 ops. The easiest thing to do would be to support only fp16 ops and then double the fragment shader dispatch width, but there are plenty of situations where fp32 is really needed, so pure-fp16 fragment shaders are not going to happen. The case for vertex shading is similar but stronger: fp16 is nowhere near good enough for vertex shading.
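A minimal sketch of that vec2 pairing, in C for illustration only. It assumes a compiler that provides the _Float16 type and GCC/Clang vector extensions; it is not any vendor’s real ISA or intrinsics:

[code]
/* Pairing two independent fp16 ops into one 2-wide op (illustrative). */
typedef _Float16 half2 __attribute__((vector_size(4))); /* 2 x 16-bit floats */

/* Before pairing: two separate scalar fp16 adds, two issue slots. */
_Float16 add_lo(_Float16 a, _Float16 b) { return a + b; }
_Float16 add_hi(_Float16 a, _Float16 b) { return a + b; }

/* After pairing: one vec2 fp16 add; the same register bits that hold
   N fp32 lanes now hold 2N fp16 lanes, hence the doubled FLOPS. */
half2 add_paired(half2 x, half2 y) { return x + y; }
[/code]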
But back to a shader binary format that is pre-compiled (i.e. all the high-level work already done). The main ickiness is vector vs. scalar. When optimizing at the scalar level it is a heck of a lot easier to optimize for fewer ops. However, what ops should this thing have? It would need much more than what GLSL has, because there are all sorts of gfx commands, like say ClampAdd and so on. Making it worse is the scalar vs. vector issue. On that side my thoughts are pretty simple: the hardware vendors get together and create several “modes” for such a thing, for example (see the sketch after the list):
[ol]
[li]All ops are scalars[/li]
[li]fp16 ops are vec2-loving, all others are scalars[/li]
[li]vec4-loving and vec2-loving in general[/li]
[/ol]
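To make the idea concrete, here is a hypothetical sketch of how such a mode flag might sit in the header of a pre-compiled shader binary; every name and field here is invented for illustration:

[code]
/* Hypothetical header for a pre-compiled shader binary (invented). */
enum shader_ir_mode {
    IR_MODE_ALL_SCALAR     = 0, /* every op scalarized                     */
    IR_MODE_FP16_VEC2      = 1, /* fp16 ops paired into vec2, rest scalar  */
    IR_MODE_VEC4_PREFERRED = 2, /* vec4/vec2 ops kept in general           */
};

struct shader_binary_header {
    unsigned            magic;           /* identifies the format                  */
    unsigned            version;         /* format revision                        */
    enum shader_ir_mode mode;            /* which vectorization flavor the IR uses */
    unsigned            code_size_bytes; /* size of the IR blob that follows       */
};
[/code]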
Naturally it gets really messy when one considers that for some hypothetical hardware that likes vec ops for VS and supports fp16, one can imagine a nightmare like fp16-vec8. Ugly. So the hardware vendors would need to get together and make something that is worthwhile for each of their hardware. On PC that just means AMD, Intel and NVIDIA. I shudder when I think about mobile, though.