imageStore with correct depth ordering

nostalgic · October 12, 2015, 7:16am

Hello everybody,

I am trying to write geometry normals to an image in a fragment shader using imageStore. I have enabled early-z and want each texel of the image to store the normal corresponding to the color stored in the render target’s texel. I thought this would be a trivial task, but the ordering of stores seems to differ from the ordering of the depth test. Look at this result:

[Attached image: the stored normals rendered as colors]

This is an image of the resulting normals converted to colors. The steps are made from separate, instanced boxes and you can see the red fragments where different surfaces of the boxes overlap. The red fragments are different for every frame, so it’s flickering. I know this is a synchronization issue, but I don’t know how I can fix it.

I render all boxes with a single glDrawElementsInstanced call. The fragment shader that I used for the image above is:

#version 450
#extension GL_ARB_bindless_texture : enable

layout(early_fragment_tests) in;

in vec3 gsViewSpaceNormal;

layout(bindless_image, rgba32f) uniform coherent writeonly image2D mainNormalImage;

void main()
{
    memoryBarrier();

    imageStore(mainNormalImage, ivec2(gl_FragCoord.xy), vec4(normalize(gsViewSpaceNormal), 0.0));

    memoryBarrier();
}

The memoryBarrier calls are only for show; they do not change the result. So I imagine that two fragments of different triangles, F1 and F2, both pass the early depth test: first F1 passes and writes its depth, then F2 passes and writes its depth. Then F2 stores its normal, and F1 stores its normal afterwards, leaving the wrong normal in the image. How do I prevent that?

Cheers,
nostalgic
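
The F1/F2 scenario described above can be reproduced on the CPU. This is only a toy model in C++, not anything the GPU actually runs: the Fragment and Pixel structs, the shadePixel function and the integer normalId are all made up for illustration. It assumes a GL_LESS depth test performed in primitive order, while the image stores of the surviving fragments land in an arbitrary order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One fragment covering a given pixel; a smaller depth is closer to the camera.
struct Fragment {
    float depth;
    int normalId;  // stands in for the normal this fragment would store
};

struct Pixel {
    float depth = 1.0f;     // cleared depth buffer
    int colorOwner = -1;    // fragment whose color ends up in the render target
    int storedNormal = -1;  // value left behind by the image stores
};

// Depth tests and framebuffer writes run in primitive order (the order of
// `frags`). The image stores of the fragments that passed run in `storeOrder`,
// which models the freedom the hardware has without an ordering mechanism.
Pixel shadePixel(const std::vector<Fragment>& frags,
                 const std::vector<int>& storeOrder) {
    Pixel px;
    std::vector<bool> passed(frags.size(), false);
    for (std::size_t i = 0; i < frags.size(); ++i) {
        if (frags[i].depth < px.depth) {  // early depth test, GL_LESS
            px.depth = frags[i].depth;
            px.colorOwner = frags[i].normalId;
            passed[i] = true;
        }
    }
    for (int i : storeOrder) {
        assert(i >= 0 && static_cast<std::size_t>(i) < frags.size());
        if (passed[i]) px.storedNormal = frags[i].normalId;
    }
    return px;
}
```

With F1 = {0.8, 1} and F2 = {0.4, 2}, the store order {0, 1} leaves normal 2 in the image, which matches the visible color. The store order {1, 0} leaves normal 1, which is the mismatch that shows up as red fragments.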

Alfonse_Reinheart · October 12, 2015, 1:45pm

I know this is a synchronization issue, but I don’t know how I can fix it.

You are correct that this is a synchronization issue. But unless you are willing to render each cube with a glMemoryBarrier call in between them (at which point, kiss performance goodbye), there is nothing you can do about this.

You’re really using the wrong tool for the job. Any time you do imageStore(..., ivec2(gl_FragCoord.xy), ...) in a fragment shader, you need to answer a question: why am I not rendering that to the framebuffer? And if you cannot come up with a compelling reason why not, then you should be.

Also, RGBA32F is way overkill for storing normal data. GL_RGB10_A2 is sufficient (though you’ll have to convert them to the [0, 1] range manually).
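
The manual [0, 1] conversion mentioned above amounts to n * 0.5 + 0.5 per component before the write and the inverse after the read. The following C++ sketch shows the same math on the CPU side, for example when decoding read-back texels. The helper names (Vec3, toUnorm10, packNormal and so on) are invented for this example, and the bit layout assumes the data is read back as GL_UNSIGNED_INT_2_10_10_10_REV (red in the lowest 10 bits, alpha in the top 2).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Map one normal component from [-1, 1] to a 10-bit unsigned normalized value.
inline std::uint32_t toUnorm10(float c) {
    assert(std::isfinite(c));
    float clamped = std::max(-1.0f, std::min(1.0f, c));
    float unit = clamped * 0.5f + 0.5f;  // [-1, 1] -> [0, 1]
    return static_cast<std::uint32_t>(std::lround(unit * 1023.0f));
}

// Inverse mapping: 10-bit value back to [-1, 1].
inline float fromUnorm10(std::uint32_t bits) {
    return (static_cast<float>(bits & 0x3FFu) / 1023.0f) * 2.0f - 1.0f;
}

// Assumed layout: x in bits 0-9, y in bits 10-19, z in bits 20-29, alpha unused.
inline std::uint32_t packNormal(Vec3 n) {
    return toUnorm10(n.x) | (toUnorm10(n.y) << 10) | (toUnorm10(n.z) << 20);
}

inline Vec3 unpackNormal(std::uint32_t p) {
    return Vec3{fromUnorm10(p), fromUnorm10(p >> 10), fromUnorm10(p >> 20)};
}
```

Each component ends up quantized in steps of 2/1023, roughly 0.002, so whether this is "sufficient" depends on what the normals are used for afterwards.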

nostalgic · October 12, 2015, 2:25pm

Thank you for your quick answer! It’s quite a bummer, though. The reason for my imageStore endeavour is precisely to avoid rendering to the framebuffer. The main render target is multisampled 8x for good quality, which means I have to store 8 normal samples (and object id samples and other per-fragment data) per pixel, although I only need one. But OpenGL still restricts all render targets of a framebuffer to the same number of samples. I am aware of the Nvidia-exclusive extension GL_NV_framebuffer_multisample_coverage, which allows coverage-sampled anti-aliasing, but only with renderbuffers (not multisample textures, so you cannot access the individual samples for post-processing in a shader) and only for very limited color/coverage sample combinations, i.e. (2, 2), (4, 4), (4, 8), (4, 16), (8, 16). While 4 color/normal/depth/id/… samples is an improvement, it is still much more memory consumption than needed for my particular application. I am also aware of the GL_EXT_raster_multisample and GL_NV_framebuffer_mixed_samples extensions, which seem to revive the idea of CSAA, but neither extension is supported on Kepler cards. I also experimented with sparse multisample textures in the hope of committing only the “first sample layer” of the texture, but since the implementation is free to store samples as it pleases, including interleaved, this is not going to happen either.

The source of my problem is that I have large data that I need to render, about 30 million partially semi-translucent polygons, and I cannot afford to render everything multiple times. 8x MSAA is still faster than rendering the additional non-multisampled data in a second render pass. I think in most applications this would be solved by rendering everything once and applying a filter such as FXAA, but this simply doesn’t cut it because temporal aliasing is horrible with filtered “AA” methods and some of the shaders require/make good use of per-sample shading.

Is there another common way to tackle this problem? Multisampling color (or just coverage), single-sampling everything else, in a single pass? I cannot possibly be the only person with these requirements, and I don’t feel very stubborn about my approach, but honestly don’t see any alternative that does not come with severe disadvantages. I either have a massive waste of video memory, flickering, performance loss or need to buy a Maxwell card.

Thanks also for your note about the normals: I am aware of that, and for lighting and post-effects I use a format much more reasonable for normals, but in this case I need precise normals because they are read back to the CPU at certain locations and used for further transformation. In this case, half precision was not precise enough. But even with RGB8 this would be a complete waste of video memory in vanilla multisampling.
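
To put rough numbers on the memory argument in this post, here is a back-of-the-envelope calculation in C++. The 1920x1080 resolution is purely an assumption (the thread never states one), the attachmentBytes helper is made up for this example, and real drivers add padding, alignment and possibly compression on top.

```cpp
#include <cassert>
#include <cstdint>

// Raw storage for one framebuffer attachment, ignoring driver overhead.
inline std::uint64_t attachmentBytes(std::uint64_t width,
                                     std::uint64_t height,
                                     std::uint64_t bytesPerTexel,
                                     std::uint64_t samples) {
    assert(samples >= 1);
    return width * height * bytesPerTexel * samples;
}

// RGBA32F is 4 channels x 4 bytes = 16 bytes per texel.
inline std::uint64_t normalTargetBytes(std::uint64_t samples) {
    return attachmentBytes(1920, 1080, 16, samples);  // assumed resolution
}
```

Under these assumptions, normalTargetBytes(8) is 265,420,800 bytes (about 253 MiB), while normalTargetBytes(1) is 33,177,600 bytes (about 32 MiB), and the same factor of 8 applies to every additional per-fragment attachment.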

Alfonse_Reinheart · October 13, 2015, 5:57am

Is there another common way to tackle this problem? Multisampling color (or just coverage), single-sampling everything else, in a single pass? I cannot possibly be the only person with these requirements, and I don’t feel very stubborn about my approach

Yes, you are the only one with those requirements. As you pointed out, most other people just use an “I can’t believe it’s not antialiasing” method and move on. Or they use lower multisampling. Or whatever.

Performance is the only absolute “requirement” most people have. Thus, the “common way to tackle this problem” is to give up quality. There is no other general solution.

OK, there is possibly one: ARB_fragment_shader_interlock, if that’s available to you. Its pixel_interlock_ordered layout qualifier sounds like it should ensure that Image Load/Store writes are ordered the same way as framebuffer writes.
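
For reference, this is roughly what the original fragment shader might look like with the interlock added, written as a C++ string constant of the kind you would hand to glShaderSource. It is an untested sketch: the constant name kOrderedNormalStoreFS is made up, the image declaration is copied from the first post, and whether the ordering guarantee really covers this use case on a given driver is an assumption that would need to be verified on hardware that exposes the extension.

```cpp
#include <string>

// Fragment shader source (GLSL) with the image store placed inside an ordered
// critical section. Overlapping fragments of the same pixel should then enter
// the section in primitive order.
const std::string kOrderedNormalStoreFS = R"glsl(
#version 450
#extension GL_ARB_bindless_texture : enable
#extension GL_ARB_fragment_shader_interlock : require

layout(early_fragment_tests) in;
layout(pixel_interlock_ordered) in;

in vec3 gsViewSpaceNormal;

layout(bindless_image, rgba32f) uniform coherent writeonly image2D mainNormalImage;

void main()
{
    // The begin/end pair must be called from main, outside of flow control.
    beginInvocationInterlockARB();
    imageStore(mainNormalImage, ivec2(gl_FragCoord.xy),
               vec4(normalize(gsViewSpaceNormal), 0.0));
    endInvocationInterlockARB();
}
)glsl";
```

The memoryBarrier calls from the original shader are dropped here, since they did not change the result. The critical section serializes overlapping fragments, so some performance cost is to be expected in areas with heavy overdraw.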

nostalgic · October 13, 2015, 1:03pm

I once again thank you for your help. I was hoping that maybe I was just unaware of a neat extension or other trick that could help me here. But it also helps to know that this is not the case and that I simply have to make a compromise now and prioritize between runtime performance, memory consumption and quality. Unfortunately, shader interlock is also not supported on Kepler cards. I will certainly revisit this problem once I get my hands on a Maxwell card. In the meantime, I will probably stick with 4x MSAA and maybe additional FXAA.