Profiling / optimizing a fragment shader in Linux

Hi, I’m writing fragment shaders in GLSL for artistic purposes: I do livecoding performances as a VJ. I recently became interested in optimizing my GLSL code. A big inspiration is the talk “Low level thinking” by Emil Persson at GDC.

I am working with Nvidia hardware under Linux, and I wanted to ask the community what tools I could use to get relevant data on shader performance. I am familiar with Nvidia Nsight, which runs under Linux, but the tool doesn’t fulfill my needs: the most relevant metric I get from it is the frame render time.

I would be more interested in seeing how the driver optimizes and compiles my GLSL code into assembly - specifically to get an idea how to write faster code, using the methods that Persson demonstrated in his talk (basically checking instruction counts for any given GLSL command).

Do you have any idea how to proceed? What tools could I use to see how my GLSL is compiled? Thanks a lot for any pointers!

Here are a few methods for getting that assembly.

Since you’re running on NVidia drivers, there are several options you can use to get the NV Assembly (NV ASM) for a GLSL shader:

  1. Bring your OpenGL program up in Nsight Graphics, OR
  2. Use the Cg command-line compiler (cgc) to compile GLSL -> NV Assembly from the command-line, OR
  3. Compile and link your shader programs with OpenGL, and then read back the shader binary to get the NV Assembly from the binary.

Nsight Graphics -> NV Shader Assembly

Just bring up your OpenGL program in Nsight Graphics and capture a frame in the Frame Debugger. This is pretty easy, once you get your program running in this tool.

I’ll have to check my notes and update this, but IIRC you can see the NV ASM for GLSL shaders that you provide to OpenGL in Nsight Graphics. Note that this assembly is not the lowest-level assembly that runs on the GPU (sometimes called SASS). It’s a higher-level, cross-GPU-architecture assembly language that is well documented in NVidia OpenGL extensions (see all the extensions named NV_…_program… in the OpenGL Registry, such as NV_gpu_program5).

I’ve also heard from NV folk that, if you get a special-access version of Nsight Graphics from NVidia, you can also view the SASS (which if I understand correctly is the lower-level assembly that’s closer to what runs on the GPU cores).

Also note that Nsight Graphics gives you a number of profiling tools, such as the Range Profiler, which can help with optimizing your OpenGL programs.

Compile GLSL -> NV Assembly using Cg compiler

This option is very easy to use once you’ve installed the Cg RPM. First, download the latest Cg distribution and install the RPM. NVidia stopped developing Cg 8 years ago, but this still runs fine (I just re-ran it here on my Linux box). Once you’ve done that, you’ll have (among other things) 3 new binaries in /usr/bin. We only care about cgc, the command-line Cg compiler.

cgc was primarily used to compile Cg shaders, but it can also be used to compile GLSL shaders (with #version 330 or lower; that was the latest GLSL version released when they stopped developing Cg). You can compile shaders for all shader stages, but here (for example) is how you can compile GLSL vertex, geometry, and fragment shaders down into NV Assembly with cgc:

cgc -oglsl -strict -glslWerror -profile gp5vp tst1.vert
cgc -oglsl -strict -glslWerror -profile gp5gp tst1.geom
cgc -oglsl -strict -glslWerror -profile gp5fp tst1.frag

Just put your GLSL source for a shader stage in a text file, compile it with one of the above, and out pops the NV Assembly on stdout. At the end of the assembly dump, it outputs a 1-line comment that notes the number of assembly instructions generated for the shader and the number of temp registers it uses (each register being space for a vec4). For instance:

...
TEMP R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11;
...
# 1125 instructions, 12 R-regs
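
If you compare many shader variants, collecting that summary line by hand gets tedious. Here is a minimal Python sketch that builds the cgc command for a stage and parses the trailing summary comment. The helper names (cgc_command, parse_cgc_summary, compile_and_count) and the file-extension mapping are my own assumptions for illustration; the regex only assumes the comment format shown above.

```python
import re
import subprocess
from pathlib import Path

# Assumed mapping from file extension to the cgc profiles used above.
PROFILES = {".vert": "gp5vp", ".geom": "gp5gp", ".frag": "gp5fp"}

# Matches the trailing summary comment, e.g. "# 1125 instructions, 12 R-regs".
SUMMARY_RE = re.compile(r"#\s*(\d+)\s+instructions,\s*(\d+)\s+R-regs")


def cgc_command(shader_path):
    """Build the cgc command line for one GLSL shader stage file."""
    profile = PROFILES[Path(shader_path).suffix]
    return ["cgc", "-oglsl", "-strict", "-glslWerror",
            "-profile", profile, str(shader_path)]


def parse_cgc_summary(asm_text):
    """Return (instruction_count, register_count), or None if not found."""
    matches = SUMMARY_RE.findall(asm_text)
    if not matches:
        return None
    instructions, regs = matches[-1]  # the summary is the last such comment
    return int(instructions), int(regs)


def compile_and_count(shader_path):
    """Run cgc (must be installed and on PATH) and parse its stdout."""
    result = subprocess.run(cgc_command(shader_path),
                            capture_output=True, text=True, check=True)
    return parse_cgc_summary(result.stdout)
```

With cgc installed, compile_and_count("tst1.frag") would return the two numbers from the last line of the dump, which makes it easy to log them per shader version.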

The number of registers is important for performance. Fewer registers allow more shader invocations to run in parallel, which makes it easier for the GPU to hide memory access latency (due to tex lookups, etc.) in shaders. Here’s a cool realtime demo of this concept, courtesy of ShaderToy (LINK).

As an alternative to GLSL (and SPIR-V), you can feed NV assembly shaders directly to OpenGL on NVidia graphics drivers if desired.

Using OpenGL to Compile GLSL -> NV Assembly

This is a bit more trouble, but I just mention it because it should support GLSL shaders with #version > 330.

Basically, just feed your GLSL source code for all shader stages to OpenGL (by writing an OpenGL program), compile all of them, link them into a shader program, and then read back the shader binary. You’ll get a binary blob back. Embedded in the binary blob should be the NV Assembly source code text for each of your shader stages. Picking through the blob to extract the text is a little trouble, but it’s not too bad because you’re extracting text blocks out of the binary blob. Also, those text blocks all start with !NV because that’s the beginning of the header lines for NV assembly source code.
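
The text-extraction step can be sketched in a few lines of Python. This is only a rough sketch under some assumptions of mine: the blob has already been saved to disk or is in memory as bytes, the assembly text is plain ASCII, and each text block is followed by at least one non-text byte. The function name extract_nv_asm is made up for illustration.

```python
import string

# Bytes we treat as "text": printable ASCII plus common whitespace.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def extract_nv_asm(blob, marker=b"!NV"):
    """Return the text blocks in `blob` that contain the NV assembly marker."""
    blocks = []
    pos = 0
    while True:
        hit = blob.find(marker, pos)
        if hit == -1:
            break
        # Back up over any preceding '!' so the whole header line is kept.
        start = hit
        while start > pos and blob[start - 1:start] == b"!":
            start -= 1
        # Extend forward until the first non-text byte (or end of blob).
        end = hit
        while end < len(blob) and blob[end] in TEXT_BYTES:
            end += 1
        blocks.append(blob[start:end].decode("ascii"))
        pos = end
    return blocks
```

Since this relies on a heuristic ("text ends at the first non-text byte"), a stray printable byte right after a block would be appended to it, so eyeball the tail of each block before trusting it.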

IIRC, for really, really recent OpenGL extensions (e.g. Turing task and mesh shaders), NVidia has switched from using NV Assembly to using SPIR-V for the intermediate rep. So for those newer shader types, you might see SPIR-V instead of NV ASM (though I haven’t checked). But the last few times I used this trick, I got NV Assembly output.

Again note, this is not the lowest-level shader assembly (which is GPU arch specific IIRC), but rather the higher-level, cross-GPU-arch NV Assembly clearly documented in the NV_…_program extensions.

Hi @Dark_Photon, thanks a lot for your tips!

As I mentioned before, I did try to use Nsight (on Linux, my version is 2019.6.0.0), but I can’t find the NV assembly of the fragment shader anywhere in the software. It seems like that is only an option for CUDA-related computing in Nsight on Windows.

This worked flawlessly. I have spent the whole day trying to lower the total number of instructions, and I think I have reached a point where I can’t squeeze it down anymore :wink: I have tested various versions of the same shader against each other using Nsight: I run each version, profile 10 consecutive frames, compute the average, and compare the versions against each other. It’s not very ‘scientific’ and I have to do it by hand, but I was able to go down from ~90ms (@1280x720) to ~45ms, so that’s quite an improvement, and the image quality is almost the same.
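
The by-hand comparison could be scripted roughly like this in Python. The frame times below are hypothetical numbers for illustration only, not real measurements, and the helper names are made up; you would paste in the 10 per-frame times read from the profiler for each version.

```python
from statistics import mean, stdev


def summarize(frame_times_ms):
    """Return (mean, sample standard deviation) of frame times in ms."""
    return mean(frame_times_ms), stdev(frame_times_ms)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is, based on mean frame time."""
    return mean(baseline_ms) / mean(candidate_ms)


# Hypothetical numbers for illustration only, not real measurements.
baseline = [90.1, 89.7, 90.4, 91.0, 89.9, 90.2, 90.6, 89.8, 90.3, 90.0]
optimized = [45.2, 44.9, 45.1, 45.4, 44.8, 45.0, 45.3, 45.1, 44.7, 45.5]

base_mean, base_sd = summarize(baseline)
opt_mean, opt_sd = summarize(optimized)
print(f"baseline  {base_mean:.1f} ms (+/- {base_sd:.2f})")
print(f"optimized {opt_mean:.1f} ms (+/- {opt_sd:.2f})")
print(f"speedup   {speedup(baseline, optimized):.2f}x")
```

Printing the standard deviation next to the mean gives a quick sense of whether a small difference between two versions is larger than the frame-to-frame noise.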

I have noticed some counter-intuitive results: a lower instruction count does not always mean faster execution, and unrolling loops doesn’t speed up the shader. It seems like the current driver is OK with loops (contrary to what I read somewhere online). In the NV assembly they look like this:

REP.S {5};
MUL.F R6.xyz, R3.w, R5;
DIV.F R7.xyz, R6, c[3].x;
FLR.F R7.xyz, R7;
MOV.F R4.w, {0.5}.x;
MAD.F R6.xyz, -R7, c[3].x, R6;
MAD.F R6.xyz, R4.w, -c[3].x, R6;
MUL.F R6.xyz, -|R6|, {3}.x;
ADD.F R6.xyz, R6, c[4].x;
MAX.F R4.w, R6.x, R6.z;
MAX.F R5.w, R6.x, R6.y;
MIN.F R5.w, R4, R5;
MAX.F R4.w, R6.y, R6.z;
MIN.F R4.w, R4, R5;
MUL.F R3.w, R3, {3}.x;
ADD.F R4.w, R4, {-0.22}.x;
DIV.F R4.w, R4, R3.w;
MAX.F R2.w, R2, -R4;
ENDREP;

When I unroll this, I get a lot more instructions and a much slower result.
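
One caveat when comparing the two counts: the instruction count in the listing appears to be a static one. The loop body above has 17 instructions and is listed once, even though REP.S {5} runs it five times, so roughly 85 loop instructions execute. A small Python sketch for estimating the executed count, assuming only literal trip counts like {5}. The function name is made up for illustration, and every non-comment line other than REP/ENDREP is counted as one instruction, so feed it instruction lines only (no header or TEMP declarations).

```python
import re

# Matches a REP with a literal trip count, e.g. "REP.S {5};".
REP_RE = re.compile(r"^REP(?:\.\w+)?\s+\{(\d+)\}")


def dynamic_instruction_count(asm_text):
    """Estimate executed instructions, multiplying loop bodies by trip count.

    Only handles REP loops whose count is a literal like {5}; the REP and
    ENDREP lines themselves are not counted.
    """
    stack = [0]   # running count per nesting level
    trips = []    # trip count of each open REP
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        rep = REP_RE.match(line)
        if rep:
            trips.append(int(rep.group(1)))
            stack.append(0)
        elif line.startswith("ENDREP"):
            body = stack.pop()
            stack[-1] += body * trips.pop()
        else:
            stack[-1] += 1
    return stack[0]
```

This says nothing about why the unrolled version is slower; it only puts the looped and unrolled listings on the same footing before comparing their counts.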

Do you have any resources that describe how Cg assigns the registers? Initially I had 9. When I moved some computation from the shader to the CPU, I got 8, but I can’t get lower. I’m also thinking that a difference between 9 and 5 registers (for example) probably wouldn’t even be noticeable?

Thanks a lot again!

I just re-checked this, and it looks like you’re right. If you provide GLSL to the GL driver, you get GLSL in Nsight Graphics. It’s only if you provide NV ASM to the GL driver that you get NV ASM in Nsight (…at least with the settings I’m using for this test).

UPDATE: However, I just noticed something I hadn’t seen before. If you pull up a specific GLSL Linked Program in Frame Debugger -> Linked Programs and expand the program, it shows you the “# Reg” used for each of the component shader stages (VS, FS, etc.). So at least there’s that.

Great!

As I understand it, register count is most important for when your shader is memory access latency limited. A specific shader may or may not be latency limited (or even shader limited, depending on the app). Nsight Graphics’ Range Profiler can provide clues here.

Also IIRC, even when your shader perf is limited by the register count, there’s not a continuous relationship between the number of registers and the number of parallel threads/warps running on the cores. It’s a step function, quantized to various numbers of registers. For an example, see the chart in a GDC 2018 Nsight Graphics tutorial video at around time offset 40:10.
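
To get a feel for that step function, here is a rough Python sketch. All the hardware numbers (register file size, warp cap, allocation granularity) are illustrative assumptions of mine that vary by GPU architecture, and the register count here means low-level hardware registers per thread, not the vec4 R-regs reported for NV ASM.

```python
def warps_per_sm(regs_per_thread,
                 regfile_size=65536,    # assumed 32-bit registers per SM
                 warp_size=32,          # threads per warp
                 max_warps=64,          # assumed hardware cap per SM
                 alloc_granularity=8):  # assumed per-thread allocation unit
    """Rough estimate of resident warps per SM limited by register use."""
    # Round the per-thread register count up to the allocation granularity.
    units = -(-regs_per_thread // alloc_granularity)
    allocated = units * alloc_granularity
    regs_per_warp = allocated * warp_size
    return min(max_warps, regfile_size // regs_per_warp)


for regs in (16, 32, 33, 40, 41, 64, 128):
    print(regs, warps_per_sm(regs))
```

Under these assumptions, 33 and 40 registers give the same warp count, while going from 40 to 41 drops it, which is the kind of plateau-and-cliff behavior described above.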

No, sorry. IIRC though, I’d read somewhere that (at the time at least) NVidia used the same compiler core in their GL driver and in the Cg compiler. That was probably why you could use some of the same “Cg-isms” in GLSL code that you provide to the GL driver, and why you could compile GLSL code in the Cg compiler. If true, then what cgc produces is (or at least was) the same as what the GL driver produces.

You can compare the NV ASM you get from cgc with the NV ASM you get from the “Using OpenGL to Compile GLSL -> NV Assembly” method to see if that is still the case.
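
A quick way to do that comparison is a line diff that ignores comments and blank lines. Here is a minimal Python sketch using the standard difflib module; the function name asm_diff and the "cgc"/"driver" labels are just placeholders I picked.

```python
import difflib


def asm_diff(cgc_asm, driver_asm):
    """Return a unified diff of two NV assembly listings, ignoring comments."""
    def clean(text):
        # Drop blank lines and '#' comment lines, strip surrounding whitespace.
        return [ln.strip() for ln in text.splitlines()
                if ln.strip() and not ln.strip().startswith("#")]

    return list(difflib.unified_diff(clean(cgc_asm), clean(driver_asm),
                                     fromfile="cgc", tofile="driver",
                                     lineterm=""))
```

An empty result means the two listings match instruction for instruction; otherwise print("\n".join(asm_diff(a, b))) shows where they differ.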

And again, keep in mind that this isn’t the low-level ASM that runs on the GPU. As I understand it, it’s an intermediate rep. How NVidia splits register optimization and instruction ordering between this higher-level GLSL -> NV ASM step and the NV ASM -> low-level ASM (SASS?) step, I have no idea.

Yes. Loops are fine so long as they’re not divergent (and don’t take too long to execute, of course).

Nsight Graphics profiling tools (e.g. Range Profiler) provide other perf statistics which can indicate whether you are shader limited, and if so, why. For example, here are a few of them:

  • SM Issue Utilization - % of SM active cycles in which an SM scheduler issued at least 1 instruction. If < 70%, there may not be enough active warps to hide the latency of a warp. If low, see “SM Warp Stall Barrier” and “SM Warp Stall Long Scoreboard”.

  • SM Warp Stall Long Scoreboard - % of active warps stalled because they need the result of a TEX lookup that hasn’t finished yet. Can lead to high L2 and low SM. “SM Issue Utilization” < 60% and “SM Warp Stall Long Scoreboard” > 20% --> SM perf is TEX_latency limited.
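
Those rules of thumb are simple enough to write down as code. A tiny Python sketch, with the thresholds taken from the two bullets above and the function names invented for illustration; you would type in the percentages read from the profiler by hand.

```python
def may_lack_active_warps(sm_issue_util_pct):
    """Issue utilization below 70% hints at too few active warps."""
    return sm_issue_util_pct < 70.0


def looks_tex_latency_limited(sm_issue_util_pct, long_scoreboard_stall_pct):
    """Issue utilization < 60% together with long-scoreboard stalls > 20%
    suggests that TEX latency is the limiter."""
    return sm_issue_util_pct < 60.0 and long_scoreboard_stall_pct > 20.0
```

These are only heuristics for deciding where to look next, not a replacement for reading the full set of counters.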

For more on this, see these helpful getting-started tutorials: