Official Bindless Graphics feedback thread

Hey, this looks really interesting - hats off to NVIDIA.

A couple of questions:

With Shader Buffer Load, the addresses are virtual and get mapped into the GPU address space at Init() - presumably it doesn’t matter from a functionality perspective whether the data is in host memory or VRAM here, right? Is the idea to use the standard GL buffer API to upload the data to VRAM prior to the draw call for speed? (Presumably multiple buffers of dependent fetches that criss-cross host<->VRAM wouldn’t have the fastest access patterns.)
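
To check my understanding, the basic setup would be something like this (a minimal sketch based on the spec; buf, data and dataSize are placeholders):

GLuint buf;
GLuint64EXT bufAddr;

// Upload through the ordinary buffer API...
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferData(GL_ARRAY_BUFFER, dataSize, data, GL_STATIC_DRAW);

// ...then make the buffer resident and query its GPU virtual address.
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &bufAddr);

// bufAddr can now be handed to a shader and dereferenced there,
// regardless of where the driver actually placed the storage.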

Secondly when are you likely to be adding support for this within Cg?

Finally - I’m thinking of this primarily in terms of iterating through a light-parameter buffer via a buffer of indirect pointers to the affecting lights. So, a general but related question: are there any profiles yet that support dynamic loop iteration counts (supplied via a constant at run time rather than fixed at compile time)?

Thanks.

Does the 185.85 WHQL driver support these extensions, or should we use a beta for now?

[EDIT] Sorry, stupid question. I just installed these new drivers and saw that these extensions are supported. Thanks for such on-the-fly WHQL drivers!

About the loop iteration counters: do you mean in Cg or in GLSL? GLSL works fine if you mean counts supplied through uniforms.
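
For example (a sketch; prog and numVisibleLights are made up):

// GLSL side (sketch):
//     uniform int lightCount;
//     for (int i = 0; i < lightCount; ++i) { /* accumulate light i */ }
//
// App side: update the iteration count at run time, no recompile needed.
GLint loc = glGetUniformLocation(prog, "lightCount");
glUniform1i(loc, numVisibleLights);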

Yep I was referring to Cg there, but thanks - I didn’t know that. I’m a bit locked in to CgFX/GL with my code so I don’t use GLSL.

So am I correct in thinking it’s entirely uniform-based, and not a recompilation job in the driver prior to shader upload?

If so, that’s pretty interesting - though surely only some hardware (presumably post-G80-class) supports that, right?

G80 inclusive.

Right - got you, that’s what I meant, should’ve been clearer.

Thanks.

Kickass!

This is bringing back the old nvidia I liked - taking the lead in experimentation and actual innovation. If this is a sign of a “new” (or simply reborn) spirit - for all that is dear, don’t let it be a single drive-by bullseye!

Some input on the vertex buffer spec:

If I’m to use all (remaining) space in the buffer anyway (as in the example), could perhaps -1 as “size” (last) argument to BufferAddressRangeNV work (using the size from the previous BufferData call)? I just found having to manually adjust the buffer size (“vboSizes[i]-4”) to be… inelegant, and possibly also error-prone. Comments?

Would it perhaps make more sense to rename GetBufferParameterui64vNV to simply GetBufferParameterAddr, and have it expect a holder of size void*? That way it could satisfy requirements for both 32- and 64-bit platforms, without wasting the upper 32 bits for 32-bit platforms.

Are the *FormatNV functions just working names (not wanting to interleave the working code path too much with experimental stuff)? I’m not thinking of the “NV” moniker, I’m thinking “Hmmm, haven’t I already seen this, even if in another incarnation, in plain VBO?”.

Are there any scenarios where one could actually want to modify the stride for an interleaved attribute in a single array? In the example, “20” (4*sizeof(float)+4*sizeof(UBYTE)) is used both in the ColorFormatNV call and in the immediately following VertexFormatNV. Could this be simplified in some way? For example, a few client-side-only functions: a stride-setting call first, then a few calls to set each buffer’s offset? (Just brainstorming here.)
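
To illustrate, roughly (a sketch; I’m assuming a 20-byte interleaved vertex with position first, and vboAddr/vboSize coming from GetBufferParameterui64vNV and the BufferData size):

// The same stride, 20, has to be repeated in every *FormatNV call:
glVertexFormatNV(4, GL_FLOAT, 20);        // 4 floats at offset 0
glColorFormatNV(4, GL_UNSIGNED_BYTE, 20); // 4 ubytes at offset 16
glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, vboAddr, vboSize);
glBufferAddressRangeNV(GL_COLOR_ARRAY_ADDRESS_NV, 0, vboAddr + 16, vboSize - 16);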

Anyway, while I just noticed this announcement and haven’t had time to play with it; bloody good work!

tamlin,

> could perhaps -1 as “size” (last) argument to BufferAddressRangeNV work

A GLsizeiptr is signed, but you could use INT_MAX.
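
E.g. (a sketch; vboAddr and offset are placeholders, INT_MAX from <climits>):

// Deliberately over-large length if you don't want to track the exact
// remaining size - at the cost of the range no longer clamping fetches
// to the real end of the buffer.
glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, vboAddr + offset, INT_MAX);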

> GetBufferParameterAddr, and have it expect a holder of size void*

This is a GPU address, not a CPU address, and may be 64-bit even if the CPU address space is only 32 bits.

Sorry, I didn’t follow the question about the *Format functions.

There are useful scenarios where you can mix interleaved and non-interleaved attributes in one VBO, or have a mixture of formats spread across separate VBOs. I would not restrict this to either interleaved or linear, EXCEPT for real performance advantages (can one of the HW guys comment on this?).
I’d suggest vertex format objects instead (or putting the format specification into a display list). This would allow the driver to detect “all attributes have the same stride” or “all attributes are linear” etc. and then handle these cases specially. But that only pays off if the driver can take advantage of this knowledge - and as I’m not a HW guy, I don’t know…

I’ve finally had time to go through these specs, and it really looks cool.

I just have a question:

From the examples of NV_shader_buffer_load:
in vec4 **ptr;

glVertexAttribI2iEXT(8, (unsigned int)pointerBufferAddr, (unsigned int)(pointerBufferAddr>>32));

It seems like there is some implicit packing/unpacking going on.

Why not introduce a function like this:
glVertexAttribui64NV(8, pointerBufferAddr); (plus corresponding type specifiers for glVertexAttribFormat)
You did it with glUniformui64NV, so why not for attributes/varyings? It would make the code a lot more explicit and less confusing.
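
In the meantime, a tiny wrapper can at least hide the packing (a sketch; the helper name is made up):

// Split a 64-bit GPU address into low/high 32-bit halves for an ivec2
// attribute - low word first, as in the spec example above.
static void vertexAttribAddressNV(GLuint index, GLuint64EXT addr)
{
	glVertexAttribI2iEXT(index, (GLint)(addr & 0xFFFFFFFF), (GLint)(addr >> 32));
}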

jeff,

I don’t get these:

  • “A GLsizeiptr is signed”. Uh, yeah, that’s why I suggested -1 instead of ~0 (a negative size makes no sense, which is why I thought -1 would fit perfectly).

  • “This is a GPU address, not a CPU address”. While true, the driver would still have to verify it on every use by the application (else it’d open a whole factory of worms and system crashes), right? Would it then be too large an overhead to not only verify, but also internally perform the “32-bit user address space -> 64-bit PCI address space” translation (on 32-bit processes/operating systems)? Also, as you can’t (normally) address anything outside “your” address space, how would a 64-bit space help a 32-bit app?

Point taken. Format objects would likely be a better long-term solution for what I had in mind.

" “This is a GPU address, not a CPU address”. While true, the driver would still have to verify it on every use by the application (else it’d open a whole factory of worms and system crashes)."

As far as I can see, this extension DOES open up a huge can of worms. I think crashing your favorite OS will become easy again.

Jan.

The worst that can happen is a GPU reset, imho. A GPU reset is about as ‘damaging’ and slow as changing the screen resolution.

I’ve started to implement bindless graphics for my app, but I cannot find the appropriate header files or other support for the new function calls. I’m using the latest OpenGL SDK (10.52), and the latest drivers (185.85).

Do I need to go back to a beta driver (that presumably includes headers, etc)?

Thanks!
-mike

You’ll need to code up and grab your own extension procs for the time being.

Quick and dirty loader for tinkering…


//
// Declare stuff (just paste and comma delimit from spec)
//

#define GLDECL(ret, name, ...) \
	typedef ret (APIENTRYP PFN##name##PROC)(__VA_ARGS__); \
	PFN##name##PROC name = NULL;

// ------------------------------------------------------------------------------------------------
// NV_shader_buffer_load
// ------------------------------------------------------------------------------------------------

// Buffer operations
GLDECL(void, glMakeBufferResidentNV, GLenum target, GLenum access);
GLDECL(void, glMakeNamedBufferResidentNV, GLuint buffer, GLenum access); // Not in beta
GLDECL(void, glMakeBufferNonResidentNV, GLenum target);
GLDECL(void, glMakeNamedBufferNonResidentNV, GLuint buffer); // Not in beta
GLDECL(GLboolean, glIsBufferResidentNV, GLenum target);
GLDECL(GLboolean, glIsNamedBufferResidentNV, GLuint buffer);
GLDECL(void, glGetBufferParameterui64vNV, GLenum target, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetNamedBufferParameterui64vNV, GLuint buffer, GLenum pname, GLuint64EXT *params);
// New Get flavor
GLDECL(void, glGetIntegerui64vNV, GLenum value, GLuint64EXT *result);
// (Named) program uniform get/set
GLDECL(void, glUniformui64NV, GLint location, GLuint64EXT value);
GLDECL(void, glUniformui64vNV, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glGetUniformui64vNV, GLuint program, GLint location, GLuint64EXT *params);
GLDECL(void, glProgramUniformui64NV, GLuint program, GLint location, GLuint64EXT value);
GLDECL(void, glProgramUniformui64vNV, GLuint program, GLint location, GLsizei count, GLuint64EXT *value);

enum NV_shader_buffer_load
{
	// Accepted by the <pname> parameter of GetBufferParameterui64vNV,
	// GetNamedBufferParameterui64vNV:
	GL_BUFFER_GPU_ADDRESS_NV		= 0x8F1D,

	// Returned by the <type> parameter of GetActiveUniform:
	GL_GPU_ADDRESS_NV			= 0x8F34,

	// Accepted by the <value> parameter of GetIntegerui64vNV:
	GL_MAX_SHADER_BUFFER_ADDRESS_NV		= 0x8F35,
};

// ------------------------------------------------------------------------------------------------
// NV_vertex_buffer_unified_memory
// ------------------------------------------------------------------------------------------------

GLDECL(void, glBufferAddressRangeNV, GLenum pname, GLuint index, GLuint64EXT address, GLsizeiptr length);
GLDECL(void, glVertexFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glNormalFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glIndexFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glTexCoordFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glEdgeFlagFormatNV, GLsizei stride);
GLDECL(void, glSecondaryColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glFogCoordFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glVertexAttribFormatNV, GLuint index, GLint size, GLenum type, GLboolean normalized, GLsizei stride);
GLDECL(void, glVertexAttribIFormatNV, GLuint index, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glGetIntegerui64i_vNV, GLenum value, GLuint index, GLuint64EXT result[]);

enum NV_vertex_buffer_unified_memory
{
	// Accepted by the <cap> parameter of DisableClientState, 
	// EnableClientState, IsEnabled:
	GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV		= 0x8F1E,
	GL_ELEMENT_ARRAY_UNIFIED_NV				= 0x8F1F,
	// Accepted by the <pname> parameter of BufferAddressRangeNV 
	// and the <value> parameter of GetIntegerui64i_vNV: 
	GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV		= 0x8F20,
	GL_TEXTURE_COORD_ARRAY_ADDRESS_NV		= 0x8F25,
	// Accepted by the <pname> parameter of BufferAddressRangeNV 
	// and the <value> parameter of GetIntegerui64vNV: 
	GL_VERTEX_ARRAY_ADDRESS_NV				= 0x8F21,
	GL_NORMAL_ARRAY_ADDRESS_NV				= 0x8F22,
	GL_COLOR_ARRAY_ADDRESS_NV				= 0x8F23,
	GL_INDEX_ARRAY_ADDRESS_NV				= 0x8F24,
	GL_EDGE_FLAG_ARRAY_ADDRESS_NV			= 0x8F26,
	GL_SECONDARY_COLOR_ARRAY_ADDRESS_NV		= 0x8F27,
	GL_FOG_COORD_ARRAY_ADDRESS_NV			= 0x8F28,
	GL_ELEMENT_ARRAY_ADDRESS_NV				= 0x8F29,
	// Accepted by the <target> parameter of GetIntegeri_vNV:    
	GL_VERTEX_ATTRIB_ARRAY_LENGTH_NV		= 0x8F2A,
	GL_TEXTURE_COORD_ARRAY_LENGTH_NV		= 0x8F2F,
	// Accepted by the <value> parameter of GetIntegerv:
	GL_VERTEX_ARRAY_LENGTH_NV				= 0x8F2B,
	GL_NORMAL_ARRAY_LENGTH_NV				= 0x8F2C,
	GL_COLOR_ARRAY_LENGTH_NV				= 0x8F2D,
	GL_INDEX_ARRAY_LENGTH_NV				= 0x8F2E,
	GL_EDGE_FLAG_ARRAY_LENGTH_NV			= 0x8F30,
	GL_SECONDARY_COLOR_ARRAY_LENGTH_NV		= 0x8F31,
	GL_FOG_COORD_ARRAY_LENGTH_NV			= 0x8F32,
	GL_ELEMENT_ARRAY_LENGTH_NV				= 0x8F33,
};

//
// Grab procs...
//

#undef GLDECL
// Redefine GLDECL to fetch the entry points (needs <iostream> for std::cerr).
#define GLDECL(ret, name, ...) \
	name = (PFN##name##PROC)wglGetProcAddress(#name); \
	if (name == 0) std::cerr << "Missing extension: " << #name << std::endl;



// Add these to your init function
GLDECL(void, glMakeBufferResidentNV, GLenum target, GLenum access);
GLDECL(void, glMakeBufferNonResidentNV, GLenum target);
GLDECL(GLboolean, glIsBufferResidentNV, GLenum target);
GLDECL(void, glMakeNamedBufferResidentNV, GLuint buffer, GLenum access);
GLDECL(void, glMakeNamedBufferNonResidentNV, GLuint buffer);
GLDECL(GLboolean, glIsNamedBufferResidentNV, GLuint buffer);
GLDECL(void, glGetBufferParameterui64vNV, GLenum target, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetNamedBufferParameterui64vNV, GLuint buffer, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetIntegerui64vNV, GLenum value, GLuint64EXT *result);
GLDECL(void, glUniformui64NV, GLint location, GLuint64EXT value);
GLDECL(void, glUniformui64vNV, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glGetUniformui64vNV, GLuint program, GLint location, GLuint64EXT *params);
GLDECL(void, glProgramUniformui64NV, GLuint program, GLint location, GLuint64EXT value);
GLDECL(void, glProgramUniformui64vNV, GLuint program, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glBufferAddressRangeNV, GLenum pname, GLuint index, GLuint64EXT address,  GLsizeiptr length);
GLDECL(void, glVertexFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glNormalFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glIndexFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glTexCoordFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glEdgeFlagFormatNV, GLsizei stride);
GLDECL(void, glSecondaryColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glFogCoordFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glVertexAttribFormatNV, GLuint index, GLint size, GLenum type, GLboolean normalized, GLsizei stride);
GLDECL(void, glVertexAttribIFormatNV, GLuint index, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glGetIntegerui64i_vNV, GLenum value, GLuint index, GLuint64EXT result[]);


Thanks a lot! I’m having some compile issues with it right now, but I need to put more time into it.

-mike

Thanks again for the code snips, Brolingstanz.

I have it working now. Yes, you can crash your card pretty easily (when you’re doing things wrong, of course), and yes, the card resets pretty much instantly (at least my GTX 280 does).

I thought that I had gotten a 50% speed increase, but when I updated my old VBO code to match the test case’s simplifications, I got the same performance in the end. My bottlenecks may be elsewhere.

Thanks!
-Mike

Just a question.
Does making a buffer resident mean it’s now, in effect, GPU-located? I mean, doesn’t that imply we can’t exceed VRAM with resident VBOs?
And the second reason I’m asking: is there now any possibility to create a VBO that lives entirely in VRAM? Yes, entirely at my own responsibility, and so on and so on… But can we avoid having a driver-side copy of the VBO contents?

Congratulations to the NVIDIA team. Bindless rendering really improved our rendering speed, especially on systems with a slow CPU and a fast 3D graphics card. However, so far we only use the glBufferAddressRangeNV functionality to speed up VBO submission to the graphics card.

What we do not understand yet is the new way of submitting uniform variables to a shader program. Can bindless rendering be used in the following scenario:

We have one shader program, and we submit transform matrices and other float or vec uniforms to this shader each time we render a triangle strip. Will bindless rendering be able to speed up this case, e.g. can we replace the glUniform calls with bindless calls?

The bindless tutorial from NVIDIA is confusing in this respect. I am aware of the sample code using

loc = GetAttribLocation(pgm, "mat");
VertexAttribI2iEXT(loc, buf1Addr, buf1Addr>>32);

but I have no clue on how to use this in our case.
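
The furthest I get is something like the following sketch (struct layout and names made up), which still sets up a whole buffer per strip instead of simply replacing the glUniform calls:

// Per-strip parameters in a resident buffer, address passed via attribute.
struct StripParams { GLfloat mvp[16]; GLfloat color[4]; };
StripParams params; // filled in elsewhere

GLuint pbuf;
GLuint64EXT pbufAddr;
glGenBuffers(1, &pbuf);
glBindBuffer(GL_ARRAY_BUFFER, pbuf);
glBufferData(GL_ARRAY_BUFFER, sizeof(params), &params, GL_DYNAMIC_DRAW);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &pbufAddr);

GLint loc = glGetAttribLocation(pgm, "paramsPtr");
glVertexAttribI2iEXT((GLuint)loc, (GLint)(pbufAddr & 0xFFFFFFFF), (GLint)(pbufAddr >> 32));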

Can somebody help us with this issue?