Buffer Shaders

Good evening everybody.

To introduce myself quickly: I'm a French student and I want to become a 3D developer, like you I suppose! :wink: I have been programming in OpenGL for three years, so I'm quite comfortable with all the programmable shaders and VBOs. I'm writing this message to submit my suggestion for the next release. Since OpenGL 4.0, the graphics shader pipeline is:

  • Vertex Shader
  • Tessellation Control Shader
  • Tessellation Evaluation Shader
  • Geometry Shader
  • Fragment Shader

And I enjoy the Tessellation Shaders: they are a very nice new technology! Well done!
My suggestion is to add a new shader type, like this:

  • Buffer Shader <- the new shader
  • Vertex Shader
  • Tessellation Control Shader
  • Tessellation Evaluation Shader
  • Geometry Shader
  • Fragment Shader

This mysterious shader would let us program our own memory accesses to GL_ARRAY_BUFFERs and GL_ELEMENT_ARRAY_BUFFERs. It should be specialized in unsigned integer computation; I think you will understand why.

Example code for GL_PATCH_VERTICES = 3:
// ------------------ BUFFER SHADER


#version 420 // (I suppose? =D)

uniform layout(location = 0) arrayBuffer VertexBuffer ;         // an array of 3-float vertices stored in GPU memory
uniform layout(location = 1) arrayBuffer TextureCoordBuffer ;   // an array of 2-float texture coordinates stored in GPU memory
uniform layout(location = 2) arrayBuffer TriangleNormalBuffer ; // an array of 3-float normals stored in GPU memory
uniform layout(location = 3) arrayBuffer IdBuffer ;             // an array of unsigned ints stored in GPU memory

out vec4 Vertex ;
out vec2 TextureCoord ;
out vec3 TriangleNormal ;

void main ()
{
// ---------- COMPUTE VERTEX MEMORY POSITION
    uint VertexPosition = getUint(IdBuffer, (gl_PatchId * 3 + gl_VertexId) * 4);   // 4 = sizeof(uint), to stay consistent with the byte offsets below

// ---------- LOAD VERTEX VECTORS
    Vertex = vec4(getVec3(VertexBuffer, VertexPosition * 12), 1.0);        // 12 = 3 * sizeof(float) = 3 * 4
    TextureCoord = getVec2(TextureCoordBuffer, VertexPosition * 8);    // 8 = 2 * sizeof(float) = 2 * 4

// ---------- LOAD FACE VECTORS
    TriangleNormal = getVec3(TriangleNormalBuffer, gl_PatchId * 12);     // 12 = 3 * sizeof(float) = 3 * 4
    // the compiler should detect that the previous line doesn't change per vertex and execute it only once per patch
}

// gl_PatchId is the index of the face (patch) currently being drawn
// gl_VertexId is the index of the vertex within the patch, ranging from 0 to (GL_PATCH_VERTICES - 1)

And a vertex shader to show the inputs:
// ------------------ VERTEX SHADER


#version 420 // (I suppose? =D)

// ---------- VERTEX SHADER

uniform [...]

in vec4 Vertex ;
in vec2 TextureCoord ;
in vec3 TriangleNormal ;

out [...]

void main ()
{
    // a classic vertex shader, with whatever differences from GLSL 4.1 are needed
}

And now two new functions for the OpenGL API:
// ------------------ NEW FUNCTIONS


/*
 function to bind each buffer to a location (a little like textures)
*/
glBindBufferAtLocation (GLenum target, GLuint buffer, GLuint location);

/*
 function to draw with varying gl_PatchId in the buffer shader, like:
 for ( gl_PatchId = start ; gl_PatchId < (start + count) ; gl_PatchId ++ )
*/
glDrawBufferElements (GLenum mode, GLsizei start, GLsizei count);
or // glDrawBufferElements (GLenum mode, GLuint start, GLuint count);

// ------------------ C PROGRAM


glUseProgram (BufferShaderId);
glPatchParameteri (GL_PATCH_VERTICES, 3);

glBindBufferAtLocation (GL_ARRAY_BUFFER, VertexBufferId, 0);
glBindBufferAtLocation (GL_ARRAY_BUFFER, TextureCoordBufferId, 1);
glBindBufferAtLocation (GL_ARRAY_BUFFER, TriangleNormalBufferId, 2);
glBindBufferAtLocation (GL_ELEMENT_ARRAY_BUFFER, TriangleIndicesBufferId, 3);

glDrawBufferElements (GL_PATCHES, TriangleStartId, TriangleCount);
// maybe a GL_PATCH_STRIP?

Of course, all the new names I have invented here are only there to show how it could work, with the intent as explicit as possible.
I think this new kind of shader could bring a lot of memory size and bandwidth optimizations, because in this example we can separate per-vertex access from per-face access, and do a lot of other things!
What do you think about it?

I will be very excited if you like my idea and add it to OpenGL! Thank you for reading, and please excuse my English.

This buffer shader sounds a bit like a generic OpenCL kernel. Not sure it would be useful/possible to support that on the GL side.

This is actually a programmable vertex puller, which might be doable on AMD or NVIDIA GPUs but is unlikely on tile-based GPUs, which have dedicated and very optimised hardware for the purpose of pulling vertices.

In any case this is more likely to be an OpenGL 5 (or 6) kind of feature, as it would require new hardware.

This shader would destroy any possible hardware acceleration of attrib-fetch (which remains on GeForces AFAIK), and complicate things a bit; for what purpose?
(You know there's transform feedback, which can do what you're after, right?)

PS: besides, you can already do all of that inside the vertex shader; there was a recent discussion about drawing without attrib data.
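To illustrate, here is a minimal sketch (the buffer/uniform names are hypothetical, and it assumes the application exposes the position and UV buffers as GL_RGB32F / GL_RG32F buffer textures and issues an attributeless draw call) of pulling the vertex data manually with gl_VertexID:

#version 330 core

// Hypothetical buffer textures bound by the application; there are no
// vertex attribute arrays at all.
uniform samplerBuffer PositionBuffer;      // one RGB32F texel per vertex
uniform samplerBuffer TextureCoordBuffer;  // one RG32F texel per vertex
uniform mat4 ModelViewProjection;

out vec2 TexCoord;

void main()
{
    // Pull the data ourselves instead of declaring any "in" attributes.
    vec3 position = texelFetch(PositionBuffer, gl_VertexID).xyz;
    TexCoord      = texelFetch(TextureCoordBuffer, gl_VertexID).xy;
    gl_Position   = ModelViewProjection * vec4(position, 1.0);
}

On the application side you just bind the buffer textures and call glDrawArrays with no attribute arrays enabled.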

I am actually curious to see the performance cost!

But while I can see how this approach belongs to a certain direction of GPU architectures, I am not sure it is the right direction.

It does mean that you need to handle the vertex caching yourself (good luck with that), but this is probably possible with EXT_shader_image_load_store.

Damn, I wish I had some time to experiment on this!

Maybe we could also program very optimized vertex loading ourselves. Like the tessellation evaluation shader, it could bring a lot of optimizations. Surely not as fast as the current vertex loading, but the memory size optimization could make the difference. We would only need all the information about hardware bandwidth. I also understand that OpenGL ES may not get such a shader, to keep maximum speed for vertex loading. :wink:
Maybe the current buffer shader structure needs to change into something more sophisticated, perhaps with multiple callbacks, the way tessellation is divided into control and evaluation. And what about PATCHes? I think they don't have a memory bandwidth optimization like GL_TRIANGLE_STRIP, so maybe we could program our custom loading to recreate something GL_TRIANGLE_STRIP-like.

Can you explain to me why this kind of shader needs new OpenGL 5 (or 6) hardware? I don't understand, because it's like an OpenCL kernel, which is supported on current hardware, and a lot of that hardware supports OpenGL 3.

This shader would destroy any possible hardware acceleration of attrib-fetch (which remains on GeForces AFAIK), and complicate things a bit; for what purpose?

Well, this shader would be optional of course, like the tessellation shaders and the geometry shader. But nowadays, with tessellation, displacement maps are more and more common, so graphics memory is becoming very limited for ARRAY_BUFFERs. Personally, I'm not a student who is excited by a simple tessellated 3D cube on a black background. Surely my experience isn't as long as yours. :smiley: I recently programmed fractal tree generation, and I wanted each tree to be unique. Therefore, on a big map, the trees' ARRAY_BUFFERs cost a lot of GPU memory. I have optimized a lot with less vertex information for far trees and other techniques, but I'm sure I could save 40% of the current buffer size with such a shader. And I'm sure there are many other renderers that need such a technique to optimize memory.

Things are probably not as flexible in a GPU as you might expect, especially in terms of wiring.

Adding this as a proper separate stage would require the “scheduler” to be modified to add this new stage.

However, I don’t believe it’s necessary to have a new stage for that; it would work perfectly with EXT_shader_image_load_store, probably more naturally on NVIDIA with GL_NV_shader_buffer_load (and NV_shader_buffer_store?), and using rendering without data (OpenGL Samples Pack / None tools).

Performance-wise, I am getting really curious, but this might be an opportunity to get rid of the VAO (and actually the FBO in some cases).

As Groovounet said, this can be solved already with EXT_shader_image_load_store.

You still get good cache support from both pre- and post-transform caches. Why?

Nowadays texturing and buffer access (including vertex attrib fetch) share the same L1/L2 cache hierarchy, and post-transform cache caches varyings anyway, so there’s no difference there.

What you lose is that hardware tends to pre-fetch data for vertex attrib fetches; as you do it manually, you can expect some latency as a result (this can be frustrating, especially if you use indexed primitives or some custom multilevel fetching, as each additional indirection imposes some latency).
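As a hypothetical illustration of that last point (the names and buffer layout are mine), a manual indexed fetch needs two dependent reads per vertex, and the second one cannot start until the first returns:

#version 330 core

// Hypothetical buffer textures: IndexBuffer holds element indices (R32UI),
// PositionBuffer holds positions (RGB32F).
uniform usamplerBuffer IndexBuffer;
uniform samplerBuffer  PositionBuffer;
uniform mat4           ModelViewProjection;

void main()
{
    // First indirection: fetch the index for this vertex.
    uint index = texelFetch(IndexBuffer, gl_VertexID).x;

    // Second, dependent indirection: fetch the position at that index.
    vec3 position = texelFetch(PositionBuffer, int(index)).xyz;

    gl_Position = ModelViewProjection * vec4(position, 1.0);
}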

Ilian, can you point me to the discussion you mentioned? I’m interested in what type of attrib-less drawing the discussion was about.

I can’t find the post anymore :stuck_out_tongue:

It was just about binding a VAO with no attrib set in it.
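Roughly this kind of thing, if I remember right (a minimal sketch, not the exact code from that thread): bind a VAO with no attribute arrays enabled, call glDrawArrays(GL_TRIANGLES, 0, 3), and let the vertex shader synthesize the positions:

#version 330 core

// No "in" attributes at all; the bound VAO is empty.
void main()
{
    // One large triangle covering the screen, generated from the vertex index.
    const vec2 positions[3] = vec2[3](vec2(-1.0, -1.0),
                                      vec2( 3.0, -1.0),
                                      vec2(-1.0,  3.0));
    gl_Position = vec4(positions[gl_VertexID], 0.0, 1.0);
}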

As you’ve already said, that’s one more reason not to use VAOs. :wink:
I really appreciate NVIDIA’s approach of not making VAOs mandatory, as well as enabling attributeless rendering in both the core and compatibility profiles. Furthermore, bindless removes the need for a VAO even in cases where a significant number of attributes are enabled. It would be nice to make a benchmark comparing the efficiency of VAOs and bindless side by side.

I don’t know if Ilian meant the “Rendering without Data: Bugs for AMD and NVIDIA” thread, but it was the most recent discussion (about 3 months ago) that I remember.

Surely not as fast as the current vertex loading, but the memory size optimization could make the difference.

Do you have any evidence for this position?

Can you explain to me why this kind of shader needs new OpenGL 5 (or 6) hardware?

It’s the same reason why Geometry shaders that magnified vertex data had terrible performance until DX11/GL4 hardware: because, even though the hardware can do it, it isn’t designed to do so efficiently. So if you want this to actually be worthwhile, you would need hardware that’s designed to make it worthwhile. With optimal caches and such.

Nowadays texturing and buffer access (including vertex attrib fetch) share the same L1/L2 cache hierarchy, and post-transform cache caches varyings anyway, so there’s no difference there.

For which hardware is this true? Furthermore, textures generally do not support indexed accessing. You could implement it as shader logic, but it wouldn’t be nearly as fast.

Also, I seriously doubt the post-T&L cache can even work without an index, since that’s what it uses to decide whether to pull from the cache: if the index for a vertex is in the cache, it pulls from there. That index is the only thing that makes two vertices equal. The post-T&L cache can only work when pulling vertex data normally, with actual indices.

This is one of the reasons why a general solution would require new hardware to be reasonably fast.

I really appreciate NVIDIA’s approach of not making VAOs mandatory

Yes, I too appreciate how NVIDIA completely ignores the OpenGL specification whenever it suits them. It’s great for a cross-platform API to have one of its chief members treat it as a suggestion rather than a requirement. It certainly helps make OpenGL development that much easier for everyone involved.

:rolleyes:

I really like your sarcasm. :slight_smile:
It is bitter but true. Although, I don’t think it is such a critical thing. Everything that works on AMD also works on NVIDIA. The opposite, of course, is not true.

Yes. I can tell you that I’ve benched them side-by-side on our large-scale app with a lot of real-world data. In those tests:

  • VAOs alone helped some,
  • Bindless alone helped more, and
  • Bindless+VAOs was slower than bindless alone.

So, I’m not using VAOs at all now but am using bindless (on NVidia GPUs). No sense in wasting CPU time in the driver when you can use it to push more content to your users.

Fispear, what exactly is the current problem that your proposal intends to solve?

You mentioned optimizing memory size and bandwidth, but vertex buffer bandwidth/size is seldom a problem; most of the time the textures far overshadow the vertex buffers.
Furthermore, even now you can do custom data pulling by using vertex textures and/or doing your own unpacking/decoding of the vertex data in the vertex shader. What more would the new shader let you do that can’t be done now?
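For example (purely a sketch with made-up names), positions could be stored as packed half floats and decoded in the vertex shader with the GLSL 4.20 packing functions, cutting the position buffer from 12 bytes to 8 bytes per vertex:

#version 420 core

// Hypothetical packed layout: three half-float position components packed
// into two uints per vertex (an RG32UI buffer texture).
uniform usamplerBuffer PackedPositionBuffer;
uniform mat4           ModelViewProjection;

void main()
{
    uvec2 packedPos = texelFetch(PackedPositionBuffer, gl_VertexID).xy;
    vec2  xy        = unpackHalf2x16(packedPos.x);   // halves 0 and 1 -> x, y
    vec2  zw        = unpackHalf2x16(packedPos.y);   // half 2 -> z (half 3 unused)

    gl_Position = ModelViewProjection * vec4(xy, zw.x, 1.0);
}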

Also, what is the actual difference between the “buffer shader” and the vertex shader? Remember that the purpose of the vertex shader is to process the vertex data, and your “buffer shader” appears to have the same purpose. Why don’t you propose just extending the vertex shader instead, if you think it has some deficiency?

Excuse me for the late reply.

Well, I only learned about custom unpacking/decoding of the vertex data in the vertex shader a few days ago. I tried it on my fractal trees renderer. So you’re right, this new shader isn’t necessary. :slight_smile:

But something that could be very nice is one (or maybe more) “index” shader to pass more integers to the vertex shader, or to have OpenGL load the vertex data itself for each input of the vertex shader. Because if we could have custom inputs like gl_VertexID, generated by a custom algorithm, we wouldn’t need an index buffer any more for fractal or other generation techniques! gl_VertexID is nice, but I’m sure we can do better with custom integer generation!
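Some of that is already possible today by computing the integers from gl_VertexID. For example (a hypothetical sketch; the uniform names and layout are mine), a regular grid of quads can be drawn with glDrawArrays and no index buffer at all by deriving the quad and corner from gl_VertexID:

#version 330 core

// Hypothetical grid parameters set by the application; drawn with
// glDrawArrays(GL_TRIANGLES, 0, 6 * quadCount) and no index buffer.
uniform int  GridWidth;   // quads per row
uniform vec2 CellSize;    // size of one cell in object space
uniform mat4 ModelViewProjection;

void main()
{
    int quad   = gl_VertexID / 6;   // which quad
    int corner = gl_VertexID % 6;   // which of its 6 triangle vertices

    // Corner offsets of the two triangles making up a quad.
    const ivec2 offsets[6] = ivec2[6](ivec2(0, 0), ivec2(1, 0), ivec2(1, 1),
                                      ivec2(0, 0), ivec2(1, 1), ivec2(0, 1));

    ivec2 cell = ivec2(quad % GridWidth, quad / GridWidth);
    vec2  pos  = vec2(cell + offsets[corner]) * CellSize;

    gl_Position = ModelViewProjection * vec4(pos, 0.0, 1.0);
}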

Surely such a shader isn’t necessary for classic objects loaded from files (3ds, obj, …).

Definitely not without an index, but you can still do some nice stuff even if you use the fixed-function indexing. Anyway, I take back my words about the post-T&L cache: the more I think about it, the more I realize that the hardware cannot figure out when it can reuse previously computed data (this is true even when simply using the built-in gl_VertexID, as using it may change the result of the vertex shader even with the exact same input attribute data).

On this, however, I still hold my word. This is true for many SM4 GPUs and generally true for SM5 GPUs. Obviously indexing or any other indirection introduces additional overhead, but this is not the only case where we have to sacrifice some performance in favor of flexibility.

Why? gl_VertexID is the index, so it can safely cache the output of a vertex shader that uses it.
In fact the HW probably can’t cache shaders that use gl_InstanceID (at least not that well) or atomics (at all). Perhaps some other cases elude me, but they are not many AFAIK.

I also can’t see how having a programmable vertex puller would prevent caching post-transform data. It could be specified in a way that makes it impossible to write a shader that pulls differently for the same index (i.e. make it logically evaluate only once for each index).

Wait. GPUs aren’t memcmp()'ing attributes to see if a processed vertex exists in the cache; they’re probably using gl_VertexID, which is quite coherent in its meaning. It’s the index used with glDrawElements, or the automatically generated index when drawing with glDrawArrays (in the latter case the vertex cache can never be used).
Right?

image_load_store :slight_smile:
But anyway, the greatest argument was that it destroys the HW-accelerated attrib-fetch. And there’s the problem of attribute formats, almost 60 of them :slight_smile: ; lots of built-in GLSL functions would need to be added to decompress them. And if certain GPUs don’t have HW-accelerated attrib-fetch (modern Radeons IIRC), the custom fetch will definitely have more overhead than what the drivers do for the “fixed-func” path (e.g. grouping DMA requests for interleaved attribs nicely, calculating addresses more efficiently).
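To give an idea of what such built-ins would have to cover, here is a rough, unoptimized sketch (entirely my own, not from any spec) of manually decoding just one of those formats, a signed-normalized 2_10_10_10 packed normal:

#version 330 core

// Decode one GL_INT_2_10_10_10_REV signed-normalized word by hand, the way
// the fixed-function attrib fetch would do it.
vec3 decodeSnorm10(uint word)
{
    // Extract the three 10-bit fields and sign-extend them.
    ivec3 raw = ivec3(int(word << 22) >> 22,
                      int(word << 12) >> 22,
                      int(word <<  2) >> 22);
    // Signed-normalized mapping of [-512, 511] to [-1, 1].
    return max(vec3(raw) / 511.0, vec3(-1.0));
}

uniform usamplerBuffer PackedNormalBuffer;   // hypothetical R32UI buffer texture
out vec3 Normal;

void main()
{
    Normal      = decodeSnorm10(texelFetch(PackedNormalBuffer, gl_VertexID).x);
    gl_Position = vec4(0.0);   // placeholder; a real shader would also fetch
                               // and transform the position
}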