Road to Single Render Pass: custom depth test?

Hello everybody.
(My English is not very good, but it's not my native language ("mother tongue" is the correct expression, I think).)

I am still working on my goal: a program with a « not so bad » API that renders everything in one pass [1].

So, here is the problem.
Imagine an FPS (first-person shooter).
You have a starfield, behind everything.
Then you have a planet A. Planet A is closer to us than the starfield, but behind everything else.
Then you have a planet B, on a lower orbit. B is closer to us than A (and the starfield), but behind everything else.
Then you have clouds (etc.).
Then you have the scene (etc.).
Then you have the hands/gun of the player (etc. They are drawn closer to us than the rest of the scene because we don't want them to clip through walls and so on).
Then you have the HUD: health bar, ammunition. Again, we don't want the hands/gun to clip through it.

You can solve this kind of thing by doing a render pass (or more) for each of these elements. Between passes you clear the depth buffer, and voilà.
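For reference, a minimal sketch of that multi-pass baseline in C (the draw_* helpers are hypothetical stand-ins for whatever draw calls each layer needs):

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    draw_starfield();
    glClear(GL_DEPTH_BUFFER_BIT);   /* planets may now overdraw the starfield */
    draw_planets();
    glClear(GL_DEPTH_BUFFER_BIT);
    draw_clouds_and_scene();
    glClear(GL_DEPTH_BUFFER_BIT);   /* hands/gun can no longer clip into walls */
    draw_hands();
    glClear(GL_DEPTH_BUFFER_BIT);
    draw_hud();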

So, here is my question (at last):

=> is it possible to add a « macro-depth » to the depth test? <=

This one would be an integer, not a float interpolated between vertices. The depth test would then be something like:


if (macro_depth_A < macro_depth_B)
    discard;
else if (macro_depth_A > macro_depth_B)
    accept;
else
    do_the_old_depth_test(A, B);

This is pseudo-code, of course.
I already did some research on this and, to be honest, I used to believe that the depth test was performed after the fragment shader. It seems that's not the case (for performance reasons; I approve :slight_smile: ).
However, I am pretty sure that it's done after the vertex shader. I don't see how it could be done otherwise, because the vertex shader decides the final positions of the vertices.
So I don't see any technical reason why this wouldn't be doable.
Is this part of the fixed pipeline accessible?
If yes, in which version? (3.x? I hope.)
And if so, how?

Any help would be welcome.

P.S.: with that macro-depth, the solution to the problem I gave is, of course, to give a different macro-depth to the starfield, planet A, planet B, clouds, scene, hands and HUD:
Starfield=7, PlanetA=6, PlanetB=5, Clouds=4, Scene=3, Hands=2, HUD=1
Or these values multiplied by 100, to leave room to insert elements between them later.
The macro-depth would be given as a per-vertex integer value and passed through as-is to the test.

[1]: it may seem a bit weird, but actually « do two separate render passes » is a (wrong) solution to a lot of problems. For this reason, the naïve approach uses it all the time, ending up with programs that do thousands of render passes and have terrible performance.
Reducing the number of render passes means solving (again) problems that were already solved with the wrong solution.
Problems solved by the wrong solution include:

  • « I have two objects with different transformation matrices » => « do two render passes » (solved)
  • « I have two different materials in my object » => « do two render passes » (solved)
  • « I have a skybox behind my scene / a HUD in front of my scene » => « do two render passes »

And somehow it became personal. Yes, I perfectly understand that, in some cases, doing two render passes may be the best solution, but … I have had too many problems with that solution. I want, I need, a program with one single pass. It's between me and that « solution », a personal feud XD
Also, once I have this one-pass approach, I can still split it into several passes, artificially, the way you can use multiple threads to do one job. But then the number of passes will be a personal choice, not something forced on me.
Like s**: it's better when it's free.
(I said it: it became personal XD)

The closest thing is to split depth into disjoint ranges, simulating the combination of changing the projection matrix's near/far planes and calling glDepthRange(). To do this, you would need to modify gl_Position.z (and if there's a possibility of near/far plane clipping, write to gl_ClipDistance; the appropriate user clip planes must be enabled). Alternatively, you could just modify gl_FragDepth in the fragment shader, but that disables early depth optimisation.
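For illustration, a minimal vertex-shader sketch of the z-remapping part (mvp and layer_range are hypothetical names; layer_range is shown as a per-draw uniform for clarity, but in a true single-draw-call setup it would come from a per-vertex attribute):

    #version 330 core
    layout(location = 0) in vec3 position;
    uniform mat4 mvp;
    uniform vec2 layer_range;   /* this layer's slice of [0,1] depth, e.g. (0.8, 0.9) */

    void main()
    {
        vec4 clip = mvp * vec4(position, 1.0);

        /* Compress NDC z from [-1,1] into the layer's depth slice. */
        float t = (clip.z / clip.w) * 0.5 + 0.5;          /* [0,1]     */
        float d = mix(layer_range.x, layer_range.y, t);   /* [min,max] */
        clip.z  = (d * 2.0 - 1.0) * clip.w;               /* back to clip space */

        /* Built-in clipping still happens at z/w = ±1, which the remapped z
         * no longer reaches; hence the gl_ClipDistance point made above. */
        gl_Position = clip;
    }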

Yet another possibility is to render the various layers into distinct layers of a layered framebuffer, then composite them afterwards, but that's likely to have significant overhead compared to the other options.

Ultimately, I suspect that separate draw calls and glDepthRange() would be the most efficient solution, unless you can avoid the need for the additional clip planes.
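A sketch of that separate-draw-calls variant in C, again with hypothetical draw_* helpers: each layer owns a disjoint slice of the [0,1] window depth range, so with the default GL_LESS test a nearer layer always wins without any mid-frame depth clear (each layer's geometry must still fit inside its own near/far planes, which is the clip-plane caveat above):

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glDepthRange(0.8, 1.0);  draw_starfield_and_planets();
    glDepthRange(0.4, 0.8);  draw_clouds_and_scene();
    glDepthRange(0.2, 0.4);  draw_hands();
    glDepthRange(0.0, 0.2);  draw_hud();
    glDepthRange(0.0, 1.0);  /* restore the default */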

As GClements said, I think glDepthRange() is probably your best bet here … if in fact you can't just clear the depth buffer between Z-slice render passes.

Keep in mind that clearing depth can actually be a pretty fast operation, and in some cases even faster than not clearing it. In some drivers, particularly those for mobile GPUs (which use slow system RAM for their framebuffer), a clear does not actually perform a clear of the depth buffer in RAM. It sets internal framebuffer flags that tell the driver: the next time you need to read in a tile of the depth buffer, just clear your own internal tile depth buffer cache to implement the depth clear, rather than reading the depth values from RAM. Then, after a tile has been rasterized to the internal buffer cache, whatever modified depth tiles exist in the internal tile depth buffer cache are written out to RAM (assuming depth test+write is enabled and you don't call glDiscardFramebuffer or glInvalidateFramebuffer on the depth buffer, which prevents the write-out).

So this basically suppresses a potentially full-framebuffer write of the depth buffer (num pixels * num samples * sizeof(depth buffer sample) bytes). Clearing depth also clears the early Z data structure (the early depth test optimization you refer to, which can be performed before the frag shader to prevent needless frag shader executions).
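For illustration, a minimal sketch of that pattern in C, assuming a GL 4.3 / ARB_invalidate_subdata context and rendering to the default framebuffer (for an FBO the attachment name would be GL_DEPTH_ATTACHMENT instead of GL_DEPTH):

    /* End of frame: tell the driver the depth values will never be read,
     * so tiled GPUs can skip writing them back to RAM. */
    const GLenum discard_depth[] = { GL_DEPTH };
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 1, discard_depth);

    /* Start of next frame: this clear can then be a cheap, flag-only
     * operation rather than a full-framebuffer write. */
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);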

I will reply to Dark Photon first, and I also reply to GClements below.

[QUOTE=Dark Photon;1292595]As GClements said, I think glDepthRange() is probably your best bet here … if in fact you can't just clear the depth buffer between Z-slice render passes.

Keep in mind that clearing depth can actually be a pretty fast operation, and in some cases even faster than not clearing it.
[/QUOTE]

The problem is not the cost of clearing the depth buffer. The problem is the cost of doing multiple render passes, i.e. calling glDrawArrays (or the like) multiple times. I am aiming to render the whole thing with one single call to this function per frame. Hence the name: road to single render pass.

I had to tackle the fact that different objects in the scene can have different transformation matrices, so I used a texture to store one transformation matrix per object (per logical object, I mean). On each vertex I store the index of its matrix, and voilà (a sketch of this idea is below).
For different materials, I came up with a solution (in the form of a format: SHU, "Shader Homogeneous Unit", which is a backronym) which allows the definition of one fragment shader code and multiple materials using it. It's done with an array of structs containing each material, a texture atlas (and several small things).
Initially I planned to use a switch in the fragment shader, believing that the cost of a switch would be the MAX of every "case". Nope, it's not; it's closer to the SUM.
I also had to solve adding and removing objects from the buffer while keeping it compact.
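For illustration, a minimal vertex-shader sketch of that matrix-texture idea, assuming the matrices are packed as four RGBA32F texels per object in a buffer texture (matrix_tbo and object_index are hypothetical names):

    #version 330 core
    layout(location = 0) in vec3 position;
    layout(location = 1) in float object_index;   /* which matrix this vertex uses */
    uniform samplerBuffer matrix_tbo;             /* one mat4 = 4 texels per object */
    uniform mat4 view_projection;

    mat4 fetch_matrix(int i)
    {
        return mat4(texelFetch(matrix_tbo, i * 4 + 0),
                    texelFetch(matrix_tbo, i * 4 + 1),
                    texelFetch(matrix_tbo, i * 4 + 2),
                    texelFetch(matrix_tbo, i * 4 + 3));
    }

    void main()
    {
        mat4 model  = fetch_matrix(int(object_index));
        gl_Position = view_projection * model * vec4(position, 1.0);
    }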

So the problem is not the cost of clearing the depth buffer. It's the mere existence of multiple calls to glDrawArrays.
(BTW, I always thought clearing was done by "unplugging" the portion of memory containing the buffer. It's electricity: unplug it and everything goes back to 0. So, no matter the size of the buffer, it would have a constant cost of almost nothing. Well, personal thoughts, sorry.)

(the early depth test optimization you refer to, which can be performed before the frag shader to prevent needless frag shader executions).

Yes, yes, I got it already. But you agree that it has to be done AFTER the VERTEX shader?
(After the vertex and before the fragment.)

So, when I am in the vertex shader, running it, the depth test has not been done yet. So I can still hope to tweak it, because it's not done yet; it will come after.

And there is something bothering me a bit. I don't know how to explain it.
I read, some time ago, that OpenGL was going in a direction where you would not be talking about vertices any more, but only about n-dimensional data and sampling/cursors into it, whatever. Like: going deeper, getting access to the lower levels of the hardware, etc.
I don't know what parts of this are true. I don't know where OpenGL is going, or how far it is in that direction.
But I know that somewhere between the vertex shader and the fragment shader lies the treasure I am looking for. If I have to operate at a lower level to do what I am trying to do, so be it. I understand that it's a very specific requirement; a macro-depth is not a common idea. If I have to go Rambo-style and manually redo (ok, mostly copy-paste, likely) the path between the vertex shader and the fragment shader, I can do it, or at least I can try.
But I don't see the entry point for all of this. It's like there is an elephant in the room, and it's only getting bigger and bigger (I think I remember Vulkan is going in that direction too), and I can't find it and nobody talks about it. I can find OpenGL 2 explanations, a lot of "how to" shader code, but not the least piece of information on how to go deeper.
And there is no information about this in the OpenGL PDFs I have, either.
As I said: I don't know how to explain. Maybe nobody understood what I am talking about, and I am sorry if that's the case.

I thought about it. It may be a solution, but a bit tricky for the user. If the planet is in front of you, or more to your right, its near and far distances will change. I … need to think more about it.

To do this, you would need to modify gl_Position.z (and if there’s a possibility of near/far plane clipping, write to gl_ClipDistance; the appropriate user clip planes must be enabled). Alternatively, you could just modify gl_FragDepth in the fragment shader, but that disables early depth optimisation.

I will look at this (I never used gl_ClipDistance).

Yet another possibility is to render the various layers into distinct layers of a layered framebuffer, then composite them afterwards, but that's likely to have significant overhead compared to the other options.

No ! :slight_smile: :slight_smile:

If I render to distinct layers, it means I am doing several render passes.
A significant overhead? In execution time, or in complexity of the program?

Ultimately, I suspect that separate draw calls and glDepthRange() would be the most efficient solution, unless you can avoid the need for the additional clip planes.

Hmm, maybe I wasn't clear enough. I want to do only one call to glDrawArrays per frame. :slight_smile:
Having a scenegraph, a skybox (or multiple ones), a HUD, different rotations/translations for objects, different materials (wood, metal, etc.) with different normal/diffuse/specular maps …
in one draw call.
Mwahahaha

It's a bit off topic, but I think that optimization is like the core plus-value of an engine. And it's something a lot of "free/open-source game engines" tend to forget. At best they have a batching option, but then you cannot move objects (because they are batched), and every time you want to modify one you have to re-send the whole buffer. AND they let the user define a shader per object (or per part of an object), so in the end they can't batch anything and collapse as soon as there are more than 3 objects in the scene.
If you ask your GPU 1 thousand times to render one triangle => it's slow.
If you ask your GPU 1 time to render one thousand triangles => it's fast.

If you order 1 pen 1 thousand times, it's slow.
If you order 1 thousand pens 1 time, it's fast.

For different materials, I came up with a solution (in the form of a format: SHU, "Shader Homogeneous Unit", which is a backronym) which allows the definition of one fragment shader code and multiple materials using it.

You should know that there’s already a term for that: “ubershaders”.

And while we’re on the subject of terminology, “render pass” and “draw call” are not the same thing. “Render pass” typically refers to a pass over the objects in a scene. “Draw call” is… a call to draw something.

If you ask your GPU 1 thousand times to render one triangle => it's slow.
If you ask your GPU 1 time to render one thousand triangles => it's fast.

If you order 1 pen 1 thousand times, it's slow.
If you order 1 thousand pens 1 time, it's fast.

You have absorbed the wrong lessons from the AZDO presentation. The AZDO presentation says (among other things) “minimizing the number of draw calls can improve performance”. You have taken that to mean “well, if minimizing the number of draw calls is fast, and 1 is the smallest number, then 1 draw call must be the fastest!”.

Oversimplification is not a path to performance.

You yourself discovered a part of this:

Initially I planned to use a switch in the fragment shader, believing that the cost of a switch would be the MAX of every "case". Nope, it's not; it's closer to the SUM.

See, there are often costs associated with reducing the number of draw calls. You have to force all meshes to use the same vertex format. You have to force all objects to use the same shader, thus increasing the number of potential branches within the shader. Any per-object expression in the shader which would have been uniform if you had drawn the object separately stops being dynamically uniform inside your multi-object draw call (except for multi-draw calls). And so forth.
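To illustrate that multi-draw exception, a sketch of a vertex shader fed by glMultiDrawArraysIndirect, assuming GL 4.6 (or ARB_shader_draw_parameters) and hypothetical PerObject/objects names; gl_DrawID identifies the sub-draw and is dynamically uniform, so the per-object fetch stays on the fast path:

    #version 460 core
    struct PerObject { mat4 model; vec4 color; };
    layout(std140, binding = 0) uniform ObjectBlock { PerObject objects[128]; };
    layout(location = 0) in vec3 position;
    uniform mat4 view_projection;
    out vec4 v_color;

    void main()
    {
        PerObject obj = objects[gl_DrawID];   /* dynamically uniform index */
        v_color       = obj.color;
        gl_Position   = view_projection * obj.model * vec4(position, 1.0);
    }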

AZDO lays out the cases where the performance gain you get from removing additional draw calls is greater than the costs you have to pay to remove them. But as with any optimization, you eventually hit diminishing returns: the point where reducing the number of draw calls further forces you to do non-optimal things that slow you down far more than you would have gained.

Like messing with the depth test.

You’re right: “optimization is like the core plus-value of an engine”. But any “optimization” must actually be something that makes things faster. And the goal you’re working towards does not and will not.

Optimization is not easy. Optimization is never as simple as just making a few rules and following them. It is a long, winding path of contention and trade-offs. And "personal feud"s are not a good way to achieve an optimal solution.

It should also be noted that AZDO doesn't really care about the number of draw calls. Indeed, they specifically mention that the overhead of a second draw call is minimal… so long as you didn't change any state between the two calls. So it isn't the cost of an extra draw call that AZDO preaches against; it's the cost of state changes.

I read, some time ago, that OpenGL was going in a direction where you would not be talking about vertices any more, but only about n-dimensional data and sampling/cursors into it, whatever. Like: going deeper, getting access to the lower levels of the hardware, etc.
I don't know what parts of this are true.

Well, considering that OpenGL has not moved in that direction, nor has GPU hardware, I’m guessing that none of it was true. This however sounds like the statements of someone who wanted it to be true, and decided that GPU makers should probably be moving in that direction. But there’s a difference between what people want and what’s actually happening. And maybe it is “actually happening” somewhere deep in the bowels of NVIDIA and/or AMD.

But it isn't in real hardware right now. Vertex shaders, fragment shaders, rasterization, depth tests: all of those things represent real, actual pieces of hardware. There's nothing in between them. There is no "deeper", no "lower level of the hardware". At least, not of the form you're talking about.

Of course. It’s part of the rasterisation step, which can’t begin until the primitive’s vertex positions are known.

Right. Nothing you do in the vertex shader will affect early depth tests.

The part between the vertex shader and fragment shader is fixed-function, i.e. you can’t program it. NVidia’s recent “mesh shaders” extension changes this somewhat, but not in any way that would be useful to you.

And AFAICT, it’s a different fixed-function process which is applicable here, i.e. the depth test. That can be performed before or after the fragment shader; preferably before, because failing the depth test before the fragment shader avoids the need to execute the fragment shader, which is where most of the rendering cost usually lies.

A depth test is inherently a read-modify-write operation (along with blending and stencilling), and those aren’t programmable. You could probably fudge something using atomic image operations, but that means foregoing the early depth test, and that’s going to cost you far more than you’ll gain from reducing the number of draw calls.
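To sketch what such a fudge might look like (purely illustrative, and it forgoes the early depth test as noted): pack the integer macro-depth into the high bits of a 32-bit key, the fragment depth into the low bits, and keep the per-pixel minimum with imageAtomicMin. Resolving the surviving fragment's colour would still require further work; depth_image and macro_depth are hypothetical names, and this needs GL 4.2 image load/store:

    #version 420 core
    layout(r32ui, binding = 0) uniform uimage2D depth_image;
    flat in uint macro_depth;   /* per-object layer index, 0 = nearest */

    void main()
    {
        uint z   = uint(gl_FragCoord.z * 16777215.0);   /* 24-bit depth */
        uint key = (macro_depth << 24) | z;             /* macro-depth compared first */
        imageAtomicMin(depth_image, ivec2(gl_FragCoord.xy), key);
    }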

If you’re desperate to do everything in a single draw call, the least expensive solution is likely to be to emulate the effect of glDepthRange() in the vertex shader. There are two parts to that: one is transforming the clip-space Z values which are used to calculate depth, the other is transforming the near/far clip planes to match (the built-in near/far clipping is fixed to z/w=±1).
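A sketch of both parts together, extending the earlier z-remapping example (same hypothetical mvp/layer_range names); the two gl_ClipDistance values reproduce the original near/far clipping, and the matching GL_CLIP_DISTANCE0/GL_CLIP_DISTANCE1 planes must be enabled on the C side:

    #version 330 core
    layout(location = 0) in vec3 position;
    uniform mat4 mvp;
    uniform vec2 layer_range;

    void main()
    {
        vec4 clip = mvp * vec4(position, 1.0);

        /* Original near/far planes as user clip distances:
         * z/w >= -1  <=>  z + w >= 0,  and  z/w <= 1  <=>  w - z >= 0. */
        gl_ClipDistance[0] = clip.w + clip.z;
        gl_ClipDistance[1] = clip.w - clip.z;

        /* Then compress z into the layer's depth slice, as before. */
        float t = (clip.z / clip.w) * 0.5 + 0.5;
        float d = mix(layer_range.x, layer_range.y, t);
        clip.z  = (d * 2.0 - 1.0) * clip.w;

        gl_Position = clip;
    }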

If you have a layered framebuffer (a framebuffer to which a 3D texture or a 2D texture array is bound), you can direct the rendering of a specific primitive to a specific layer by setting gl_Layer in the geometry shader. But geometry shaders are expensive (except on Intel, apparently), the memory required for the framebuffer increases in proportion to the number of layers, and you’d need another pass to composite the layers.
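For illustration, a minimal pass-through geometry shader that does this routing (v_layer is a hypothetical per-vertex value forwarded from the vertex shader):

    #version 330 core
    layout(triangles) in;
    layout(triangle_strip, max_vertices = 3) out;
    in float v_layer[];   /* layer index, written by the vertex shader */

    void main()
    {
        for (int i = 0; i < 3; ++i) {
            gl_Layer    = int(v_layer[0]);   /* must be set per emitted vertex */
            gl_Position = gl_in[i].gl_Position;
            EmitVertex();
        }
        EndPrimitive();
    }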

True, but as Alfonse points out, going from a “handful” of draw calls to a single draw call will at best produce negligible gains and will quite possibly make everything slower.

Thanks. I didn't know about it; I will check.
Minor point: I don't have access to the internet during the day, only in the evening, when I try to pick out documents and things to read at home the next day. So I am not in a "quick-problem-ask-question-quick-answer" loop. I am more of the old-fashioned kind, planning a trip to the library for the next day to try to get a book covering the problem at hand.

And while we’re on the subject of terminology, “render pass” and “draw call” are not the same thing. “Render pass” typically refers to a pass over the objects in a scene. “Draw call” is… a call to draw something.

Ok. So I am looking for a single draw call.
(And I am not doing a pass over the objects of the scene.)

You have absorbed the wrong lessons from the AZDO presentation.

What I will say now is the key point of my answer tonight: I haven't seen any AZDO presentation. I don't even know what AZDO is (I will look it up right after this message).

I come from JMonkeyEngine. I didn't want to name names, but it seems I have to.
And they have a terrible, terrible approach to game engines.
(The full list of reasons is way too long to write tonight.)
After that, I started to work on this question and slowly built up answers, one by one.

If you want to imagine how I arrived at this kind of idea, just picture a guy with the GL function references, a window displaying the number of frames per second and … a lot of time.
Add to this a bunch of random forum discussions saved on a USB key because they had an interesting sentence or piece of code showing how to use a function, etc.
So I think. Hard, a lot.

For the "switch", for example, it's because I read "every fragment shader has to finish at the same time; that's why both the if and the else are executed on old graphics cards".
It was the only limitation I knew of: finish at the same time. So imagining a compilation with no-ops added to make every case the same length was … plausible.
Then I read that it was a SIMD system, with a possibility to redistribute the "MD" between threads (but at a cost, potentially high).
You see? A bit of information, imagination, tests, corrections. But I can go far with that.

See, there are often costs associated with reducing the number of draw calls. You have to force all meshes to use the same vertex format. You have to force all objects to use the same shader, thus increasing the number of potential branches within the shader. Any per-object expression in the shader which would have been uniform if you had drawn the object separately stops being dynamically uniform inside your multi-object draw call (except for multi-draw calls). And so forth.

I know. I am a computer scientist (Master 2 degree), but not in OpenGL, and I don't have access to the internet (for years now). It's a very specific situation. But I am not stupid, and I have imagination.

You’re right: “optimization is like the core plus-value of an engine”. But any “optimization” must actually be something that makes things faster. And the goal you’re working towards does not and will not.

It would be, if it were possible to modify that function. It's fixed; it stays fixed.
I couldn't know. And that's why I am here, asking the question: is it possible to use a custom depth test?
The answer is "no".
Ok, I couldn't know.

The answer could have been:
"Yes, dummy, and you should be doing that already; the fixed-function depth test only exists in pre-OpenGL-5 compatibility mode, and now you should define yourself a depth shader like this, blablabla."

Optimization is not easy. Optimization is never as simple as just making a few rules and following them. It is a long, winding path of contention and trade-offs. And "personal feud"s are not a good way to achieve an optimal solution.

And here you are not teaching me anything new. Optimization also involves things like culling lights and objects based on occlusion and the field of view (the big building in the middle of the map in GTA 5 seems to have this role: being a big source of culling).
Optimization is an endless quest. Well, ok, but I am trying to address the first big, obvious problem first.

But it isn't in real hardware right now. Vertex shaders, fragment shaders, rasterization, depth tests: all of those things represent real, actual pieces of hardware. There's nothing in between them. There is no "deeper", no "lower level of the hardware". At least, not of the form you're talking about.

Hmm, ok. GPUs can be used to do things other than rendering (heavy parallel computation).
I know that for sure; an IRL friend works in a lab where they have a very big graphics card for this (one that couldn't be plugged into any motherboard).
That, and CUDA.
And now that I am talking about it, I remember a name: OpenCL.
Well, it's … pieces of information. Garbage.
Sorry, I can't be more precise tonight, I am already late.

Thanks a lot for the help. I really appreciate every bit of information I get, like every drop of water for a man in the desert.

(I can't answer the other person tonight; please take no offence.)

[QUOTE=GClements;1292620]
The part between the vertex shader and fragment shader is fixed-function, i.e. you can't program it.
[/QUOTE]

It's like the end of the road for me. Well, ok.
(But you have only won a battle, Multiple-Draw-Call, not the war. I'll be back.)

[quote]A depth test is inherently a read-modify-write operation (along with blending and stencilling), and those aren't programmable.[/quote]

Believe it or not, some time ago I was wondering "how can I replace the built-in blending functions with my own?"

If you’re desperate to do everything in a single draw call, the least expensive solution is likely to be to emulate the effect of glDepthRange() in the vertex shader. There are two parts to that: one is transforming the clip-space Z values which are used to calculate depth, the other is transforming the near/far clip planes to match (the built-in near/far clipping is fixed to z/w=±1).

And all of it would cost more than I would gain from it.
I am not saying that against you, not at all. Thank you, really.
I am tempering my initial ambition. It's bitter, but it's better than sweet dreams never turning into anything real.

If more than one draw call without state changes between them doesn't cost much, and if clearing the depth buffer is cheap … I'll have to accept it.

For the moment, at least.
I have an old PC (graphics card: AMD Radeon 6570); I can't rely on new features.

Thanks a lot anyway.
There are still other aspects of the engine I can work on.