NV_command_list

Dark_Photon · January 1, 2015, 6:38pm

So, it’s been about a month since SIGGRAPH Asia and the release of NV_command_list. Looks pretty interesting. Anyone taken it for a spin yet? Thoughts?

Also, has anyone seen Linux drivers supporting this new extension out there yet? Or an extension spec? I was going to give it a whirl, but only have Linux installed here right now.

In the absense of an extension spec, looks like gl_nv_command_list.{h,cpp} in the sample code (+ the presentation) is a decent kickstart.

Refs:

[ul]
[li] OpenGL NVIDIA Command-List (Presentation @ SIGGRAPH Asia, nVidia, Tristan Lorach) [/li][li] Windows Pre-release drivers (347.09) [/li][li] Sample Video [/li][li] Sample Code: https://github.com/nvpro-samples/gl_commandlist_bk3d_models [/li][/ul]

Aleksandar · January 3, 2015, 5:42am

It is too early for this extension. It is not included in the latest Windows drivers. Why did you refer to 347.09? There is no trace of NV_command_list. At least, the extension is not exposed.
Also, there is no reference on the official NVIDIA OpenGL Spec site. So, we have to wait…

Alfonse_Reinheart · January 3, 2015, 9:09am

Is there a version of that presentation that isn’t locked away in some horrible web application? Or at the very least, a way to download it that doesn’t require me to register an account to be able to download it?

Dark_Photon · January 3, 2015, 4:37pm

Really!? That’s interesting. I mentioned it because that was the pre-release driver version/URL that Tristan Lorach (NVidia author of the presentation describing NV_command_list) said was the “driver for trying” (see first link). Guess not. Sorry about that.

Yeah, I’d prefer an open downoad from nVidia’s web site too. But It’s not so bad. Create a throw-away slideshare account and you can download it.

malexander · January 5, 2015, 2:59pm

It seems like an interesting proof of concept. But the fact that it requires bindless everything - textures, vertex arrays, uniform buffers - makes it unsuitable for folding into our existing renderer. And if I’m going to put effort into a new renderer, it’ll be based on the next-gen APIs. Still, I’m glad to see any work done on draw call bottlenecks.

Dark_Photon · January 7, 2015, 12:51pm

Re 347.09, here we go. New NV_command_list example, with info on why the extension isn’t visible and how to get to it:

github.com

nvpro-samples/gl_commandlist_basic/blob/master/README.md

# gl commandlist basic

In this sample the **NV_command_list** extension is used to render a basic scene and texturing is performed via **ARB_bindless_texture**.

> **Note:** The NV_command_list extension is officially shipping with 347.88. The appropriate functions used in this sample can also be found in some older drivers (for example 347.09 and higher), however the performance for all driver/hardware combinations may not be representative there. The [spec](http://www.opengl.org/registry/specs/NV/command_list.txt) is available, and feedback is welcome and should be sent to Christoph Kubisch <ckubisch@nvidia.com>, Tristan Lorach <tlorach@nvidia.com>, or Pierre Boudier <pboudier@nvidia.com>. Additional information can be found in this [slide deck](http://www.slideshare.net/tlorach/opengl-nvidia-commandlistapproaching-zerodriveroverhead) from SIGGRAPH Asia 2014, as well as the latest GTC 2015 [presentation](http://on-demand.gputechconf.com/gtc/2015/video/S5135.html).

This new extension is built around bindless GPU pointers/handles and three more technologies, which allow rendering scenes with many state changes and hundreds of thousands of drawcalls with extremely low CPU time:

- *Tokenized Rendering*:
  - Evolution of the "MultiDrawIndirect" mechanism in OpenGL
   - Commands are encoded into binary data (tokens), instead of issuing classic gl calls. This allows the driver or the GPU to efficiently iterate over a stream of many commands in one or multiple sequences: **glDrawCommands( ...tokenbuffer, offsets[], sizes[], numSequences)**
   - The tokens are stored in regular OpenGL buffers and can be re-used across frames, or manipulated by the GPU itself. Latency-free occlusion culling can be implemented this way (a special _terminate sequence_ token exists).
   - Next to draw calls, the tokens cover the most frequent state changes (vertex, index, uniform-buffers) and a few basic scalar changes (blend color, polygonoffset, stencil ref...).
   - As tokens only reference data (for example uniform buffers), their content is still free to change - you can change vertex positions or matrices freely (which is different from classic display lists).
   - To get an idea of what is currently possible check the **nvtoken.cpp/hpp** files, which also showcase how the tokenstream could be decoded into classic OpenGL calls.

```cpp
// The tokens are tightly-packed structs and most common tokens are 16 bytes.
// Below you will find the token definition to update a UBO binding. Compared 
// to standard UBOs, tokens update the binding per stage.

This file has been truncated. show original

alexbetts · March 27, 2015, 6:28pm

Has anyone had success with this extension yet? I’m getting started with it, posted the following question to an NVIDIA forum but maybe someone here’s got some insights:

(Development environment is Linux Mint with beta driver 349.12; GPU is a GTX 760).

Having some trouble using the new NV_command_list extension. I’ve got a draw function that looks a bit like this:

glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);

// clear:
glClearColor(1, 0, 0, 0);
glClearDepth(1.0);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

// draw:
glDrawCommandsStatesAddressNV(…);

// blit:
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);

glBlitFramebuffer(0, 0, width, height,
0, 0, width, height, GL_COLOR_BUFFER_BIT, GL_NEAREST);

Having trouble getting it to do much. I can’t get the commands for the geometry that I want to draw anything but a blank frame, but even if I fill a command buffer with a single NOP command, then I see the screen clear to red for the first frame, and all subsequent frames are black (same thing happens with the ‘real’ commands I want to try, or commands that just set the uniform buffer addresses but without any draw commands). I’d expected that with a single NOP command, the glDraw() command would ‘do nothing’ and I’d always see the frame cleared to red, but that seems to only happen for the first frame only.

Anyone know what might be going on? Does a command buffer always need an actual draw command or a terminate or something somewhere in it? I am only trying to use the glDrawCommandsStatesAddressNV() and not the compiled command lists (which really wouldn’t be appropriate for this application). I have added debugging callbacks with glDebugMessageCallback but that doesn’t tell my I’m doing anything wrong (it does tell me things about the buffers I’ve created for storing geometry, I and I can do incorrect things deliberately, like not setting a framebuffer when capturing the state, and the debug extension reports errors). The gl_commandlist_* samples posted on github do work for me. I’m using SDL as the windowing environment.

– Alex

Dark_Photon · March 28, 2015, 10:31am

If you haven’t already done this, you could download, run, trace through in the debugger, and then pair down one of nVidia’s NV_command_list examples (see above) to a basic, working shell. Then incrementally add your code, testing periodically to ensure you haven’t broken anything.

For instance, starting with the gl_commandlist_basic example, do this:

[ul]
[li] Make sure CMake v2.8+ (2.8 or higher) is installed. For example:[/li][LIST]
[li] cmake-3.0.0~rc5-139.3 [/li][/ul]

[li] Make sure you’ve got the GL Framework v3 include and lib packages installed. For example:[/li][ul]
[li] libglfw-devel-3.1-8.1 [/li][li] libglfw3-3.1-8.1 [/li][/ul]

[li] git clone GitHub - nvpro-samples/gl_commandlist_basic: OpenGL sample for NV_command_list [/li][li] git clone GitHub - nvpro-samples/nvpro_core: shared source code and resources needed for the samples to run [/li][li] git clone GitHub - nvpro-samples/shared_external: external libraries, needed for the samples (AntTweakBar; ZLib...) [/li][li] git clone GitHub - nvpro-samples/build_all: GO HERE FIRST: nvpro-samples overview [/li][li] cd build_all/ [/li][li] cmake . [/li][li] Make sure configuration completes successfully. For instance, you should see something like:[/li]


...
-- Configuring done
-- Generating done
-- Build files have been written to: .../stuff/build_all

[li] make -j8 [/li][li] Ensure it builds successfully:[/li]


...
Linking CXX executable .../stuff/bin_x64/gl_commandlist_basic
[100%] Built target gl_commandlist_basic

[li] cd …/bin_x64 [/li][li] You should now have a “gl_commandlist_basic” executable in this directory. Run it: [/li][li] gl_commandlist_basic [/li][/LIST]

Now trace through it and tweak to-taste!

Dark_Photon · March 28, 2015, 1:06pm

By the way, another tip: Be sure to check out the gl_commandlist_bk3d_models sample. Just:

[ul]
[li]git clone https://github.com/nvpro-samples/gl_commandlist_bk3d_models [/li][/ul]
and config/build just like the above.

Inside this sample, there’s a basic “NV_command_list to GL 4.3 converter” in the source code (see emulate_commandlist.h), and it’s very instructive to enable emulation and step through this emulator. It’ll quickly show you how this works (and possibly let you test your own code w/ and w/o emulation). It’s like DrawIndirect on steriods! To step through it in the debugger:

[ul]
[li] At the top of gl_commandlist_bk3d_models.cpp, change g_bUseEmulation to true [/li][li] Build for debug (instead of cmake . as above, use cmake -DCMAKE_BUILD_TYPE=Debug; then build with make -j8) [/li][li] Load the executable, gl_commandlist_bk3d_models, into your favorite debugger [/li][li] Set a breakpoint on emucmdlist::nvtokenRenderStatesSW (for example) [/li][li] Run [/li][/ul]
(I’m glad you asked your question, because it got me back into playing with NV_command_list

Incidentally, looking at your above code, you know you can have [b]glDrawCommandsStatesAddressNV/b bind the framebuffer for you, right?

alexbetts · March 28, 2015, 2:11pm

So I pursued your suggestion to hack away at nvidia’s examples (which have otherwise been working fine on my Linux system), and I’m glad I did… if I comment-out the line in basic-nvcommandlist.cpp that draws the user interface (the call to TwDraw()), I pretty much observe the same problems I saw with my own code - the first frame renders as it should; the contents of subsequent frames are just whatever color the default framebuffer was cleared to before the blit was supposed to take place. It behaves like the blit just doesn’t happen after the first frame.

The AntTweakBar stuff that renders the UI appears to be fairly legacy OpenGL (no sign of glUseProgram() in the sources), so maybe there’s a bug in the Linux driver such that some minimal non-NV_command_list rendering is needed to get anything to display. I’m going to try adding various trivial drawing operations to my own program & see if that improves anything.

alexbetts · March 28, 2015, 2:29pm

Yep, if I insert some real old-school GL calls either before or after the blit (in the nv demos, or in my own programs), things work as expected:

   {
      NV_PROFILE_SECTION("Blit");
      glBindFramebuffer(GL_READ_FRAMEBUFFER, fbos.scene);
      glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);

      glUseProgram(0);
      glBegin(GL_POINTS);
      glVertex3f(0, 0, 0);
      glEnd();

      glBlitFramebuffer(0,0,width,height,
        0,0,width,height,GL_COLOR_BUFFER_BIT, GL_NEAREST);
    }

– Alex

alexbetts · March 28, 2015, 2:55pm

Dark Photon, thank you for your help. The rest of the command lists I’ve been generating seem to work - I can now see real transformed geometry (albeit with unimpressive shading) using this extension/

Dark_Photon · March 28, 2015, 5:41pm

[QUOTE=alexbetts;1265204]…if I comment-out the line in basic-nvcommandlist.cpp that draws the user interface (the call to TwDraw()), I pretty much observe the same problems I saw with my own code - the first frame renders as it should; the contents of subsequent frames are just whatever color the default framebuffer was cleared to before the blit was supposed to take place. It behaves like the blit just doesn’t happen after the first frame.

…

Yep, if I insert some real old-school GL calls either before or after the blit (in the nv demos, or in my own programs), things work as expected:[/QUOTE]

That’s really interesting. It’s particularly interesting because here it works properly even after I comment out the TwDraw() – in either the gl_commandlist_basic or gl_commandlist_bk3d_models examples. My display still renders and updates with the mouse just fine. I’m also running the same 349.12 beta drivers you are there, just on OpenSuSE 13.1 with KDE 4 window manager. I’m also running a GTX760 just as you are (4GB, though that’s probably irrelevent).

Can you get it to work by putting a glFlush() or glFinish() before and after the Blit? Wonder if there’s a work queue that’s not being flushed properly.

Doubt it, but perhaps you’ve corrupted your driver state. Try a reboot. Then try an NVidia example without mods. Then with TwDraw() commented out.

Might be a GLFW-related issue (running GLFW 3.1 here).

Which window manager do you have? Shot in the dark: try disabling the compositor, if it’s enabled.

alexbetts · March 30, 2015, 12:34pm

Interesting you haven’t seen these problems on a Linux system yourself. I can observe the same issues on a different machine (MacBook Pro with I think an 680M GPU). That system is running Linux Mint, too, but with XFCE instead of Cinnamon (which the first machine I debugged is using). XFCE does let you disable compositing (Cinnamon apparently can’t disable it completely), but that doesn’t seem to change things on the MacBook Pro.

I’ve sprinkled glFinish()'s around a bit, but I should make a more methodical effort with that and see what happens.

My own application uses libSDL; so it seems to happen with both GLFW and SDL

alexbetts · April 2, 2015, 1:06pm

Looks like the nvidia’s devs found this on Linux & are patching it:

Dark_Photon · April 2, 2015, 6:44pm

Good deal!

(Looking forward to Vulkan when this kind of thing will be available across all GPUs!)

codepilot · April 13, 2015, 10:46pm

glGetCommandHeaderNV(GL_NOP_COMMAND_NV, 4)

always seems to return ZERO, but no error.
I take this to mean that I can just pad all of my command structs with ZEROED-UINT32s for alignment.
This helps, because I like to use streaming stores when building the command lists in mapped buffers.
I use streaming stores for the write-combined memory that opengl likes to give for write-only buffer mapping.
But, streaming stores might help for normal memory too, IDK?

I initial the xmm/ymm register with:

__m128i _mm_setzero_si128()

__m256i _mm256_setzero_si256()




I build the command with:

__m128i _mm_insert_epi16 (__m128i a, int i, int imm8)
__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)
__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)
__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)



I store using:

void _mm_stream_si32 (int* mem_addr, int a)
void _mm_stream_si64 (__int64* mem_addr, __int64 a)
void _mm_stream_si128 (__m128i* mem_addr, __m128i a)
void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)




I was using the 32 bit and 64 bit streaming stores at first, but I found that I could use the 128bit and 256bit aligned streaming stores by first zeroing the xmm/ymm register and then setting the needed fields. I know that this can end up leaving a lot of NOP commands in the list, but that doesn’t seem to impact performance noticeably for my tests. It appears that a few NOPs here and there for padding is ok performance-wise, but I would imagine that huge thousands or millions of them might be a bad thing.

I have not noticed any difference when using

glCompileCommandListNV

 to compile a tightly packet command list versus a command list where commands are 256bit aligned and NOP/ZERO padded.

Also, since NOPs appear to be ZERO, I believe one can safely initialize a command memory buffer with zeros and it would just be lots of NOPs. And since lots of memory allocating methods (HeapAlloc(HEAP_ZERO_MEMORY), VirtualAlloc, OpenGL buffers(using nullptr for data)) can do this, it seems super convenient, almost like someone planned it that way ;).

Can someone officially comment on whether NOPs are ZERO, and if they will be ZERO forever? Also, please let me know if this is a fragile abuse of some corner-case, or if it is ok for standard practice.

Thanks

codepilot · April 16, 2015, 2:32pm

I have been using NV_command_list on a 750ti, 980, 960, but when I tried a laptop with an NVS 4200M I can’t find a way to use glCallCommandListNV. That is because glCallCommandListNV requires the FBO’s texture to be made resident, and the NVS 4200M is missing NV_bindless_texture and ARB_bindless_texture because the GPU is too old. glDrawCommands works because it doesn’t directly reference any FBO, and so I don’t need to make it resident.

Is there some way to make textures resident without the bindless extensions, or could an extension be made to add direct residency management for older gpus that lack the bindless abilities?

Alfonse_Reinheart · April 16, 2015, 3:02pm

Texture residency is defined by bindless texturing. So if your card can’t handle bindless texturing, it can’t really handle NV_command_list.

I’m surprised that NVIDIA exposed NV_command_list on that hardware at all. It’s possibly a driver bug.

Oh, and it’s OK to make a new thread to ask questions about NV_command_list. You don’t have to post them all here.

codepilot · April 16, 2015, 4:14pm

I thought exposing NV_command_list without bindless textures was kinda pointless too. But, since the hardware does support NV_uniform_buffer_memory, NV_vertex_buffer_unified, and NV_shader_buffer_load the shaders can still get a lot of data into them, but just not textures. There have been times where I used buffer reads instead of real textures because the picture format was so strange that it didn’t really fit well into the concept of opengl textures. And, so then something like that should still work on hardware lacking bindless textures. Using the old hardware was more of curiosity then a necessity though.