100% GPU usage when reading SSBO in first frame

Hello,
I’m trying to implement OpenGL-based occlusion culling. The idea is to render an object’s bounding box, and whenever the fragment shader is invoked it sets one bit in my SSBO to 1. Then I read the SSBO back to check which objects should be rendered.

This is my code: https://github.com/Meldexun/OpenGL-Test

It does not actually render bounding boxes, and there are also no objects to render. Basically it just passes an objid, the geometry shader renders a triangle on the screen, and the fragment shader sets the bit corresponding to the objid to 1.
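For reference, the fragment shader side looks roughly like this (a minimal sketch, not the exact shader from the repo; the std430 layout and the objid varying are my assumptions), embedded here as a Java text block:

// Hypothetical fragment shader: one bit per object, packed into uints.
String fragmentSource = """
        #version 430
        layout(std430, binding = 0) buffer Visibility {
            uint bits[];
        };
        flat in uint objid; // passed through from the geometry shader

        void main() {
            // set the bit corresponding to this object's id to 1
            atomicOr(bits[objid >> 5], 1u << (objid & 31u));
        }
        """;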

The problem arises when I’m reading the SSBO in the first frame.
For that I change this line in OpenGL-Test/src/main/java/opengltest/Main.java (commit c17126903ad6c806db8730dde598679587eeba19):

if (counter < 1 || counter >= 12) {

to this

if (counter < 0 || counter >= 12) {

When I do that the GPU usage jumps to 100% (I have an RTX 2060 and I’m just rendering 1 triangle). The GPU usage also stays at 100% after the 12th frame, even though I’m not reading the SSBO after that point.

So what is happening here?

I also want to note that I’m not sure if my usage of glMemoryBarrier and atomicOr is correct.
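For reference, the per-frame readback pattern I’m aiming for is roughly this (a simplified sketch, buffer names and sizes are placeholders; GL_BUFFER_UPDATE_BARRIER_BIT is what the spec lists for making shader writes visible to a subsequent glGetBufferSubData):

import static org.lwjgl.opengl.GL43.*;
import java.nio.IntBuffer;
import org.lwjgl.BufferUtils;

int ssbo = 0; // handle of the SSBO the fragment shader writes (placeholder)
IntBuffer visibilityWords = BufferUtils.createIntBuffer(12); // placeholder size

// ... draw the triangle; the fragment shader performs atomicOr into the SSBO ...
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT); // writes must be visible to glGetBufferSubData
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, visibilityWords); // blocks until the GPU is done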

Thank you for your help.

Sorry, I’m not really sure what is going on with your program, or even whether it is expected behavior. Are you aware of OpenGL’s query objects and in particular how they can be used for occlusion queries?

In general it sounds to me like there are a couple of things you are doing that are not ideal from a performance perspective:

  • If I understand correctly, all fragments generated for the bounding box of one object perform an atomicOr on the same location of your SSBO? That’s a lot of fragment shader invocations contending for a single storage location, and they are all forced to execute the atomicOr sequentially.
  • You are reading back to main memory the results of writing to the SSBO and want to use the contents to make culling decisions on the CPU side? That introduces a GPU stall: everything writing to the SSBO has to complete on the GPU side before you can copy the data to CPU memory and then feed new draw commands into the GPU. In general, reading data back to CPU memory should be avoided. If it is necessary, use multiple buffers (M) where in frame N you update buffer N % M and use buffer (N+1) % M to make your culling decisions (see the sketch after this list) - i.e. you use culling information from a few frames ago to determine visibility in this frame. That of course introduces a few frames of latency between the culling information being computed and the effect being visible on screen.
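Something along these lines, untested and with all names made up for illustration:

import static org.lwjgl.opengl.GL43.*;
import java.nio.IntBuffer;
import org.lwjgl.BufferUtils;

// Round-robin over M buffers: write this frame's results into one buffer,
// read culling decisions from the oldest one (M-1 frames of latency).
static final int M = 3;
int[] ssbos = new int[M]; // created and sized via glGenBuffers/glBufferData elsewhere
IntBuffer visibility = BufferUtils.createIntBuffer(12); // placeholder size
long frame = 0;

void cullAndDraw() {
    int write = (int) (frame % M);       // shaders fill this buffer now
    int read  = (int) ((frame + 1) % M); // oldest buffer, most likely finished by now
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbos[write]);
    // ... draw the bounding boxes; fragment shaders write into ssbos[write] ...
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbos[read]);
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, visibility);
    // ... use `visibility` (results from M-1 frames ago) to cull this frame's draws ...
    frame++;
}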

I tried using queries in the first place, but they introduced a huge latency and the GPU usage was also significantly higher (from 20% to over 50%).

int passed = glGetQueryObjecti(query, GL_QUERY_RESULT); // read last frame's result; the first argument must be the query object, not the target
glBeginQuery(GL_ANY_SAMPLES_PASSED_CONSERVATIVE, query);
// draw bb
glEndQuery(GL_ANY_SAMPLES_PASSED_CONSERVATIVE);

In case you are wondering, this was the code, so I did request the query result in the next frame (the glGetQueryObjecti call at the top reads the previous frame’s result).

If I understand correctly, all fragments generated for the bounding box of one object perform an atomicOr on the same location of your SSBO? That’s a lot of fragment shader invocations contending for a single storage location, and they are all forced to execute the atomicOr sequentially.

Yes, you are correct. Because the results have to be sent to the CPU, I’m trying to minimize the bandwidth; otherwise I would have to use an integer (32 bits) for every object instead of just 1 bit. I haven’t tested this in a worst-case scenario yet, but if you think you have a solution for that problem I would be happy to hear it. (The upper limit of objects should be something like 128^3.)
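For scale: 128^3 = 2,097,152 objects, so one readback is 2,097,152 / 8 = 256 KiB at 1 bit per object, but 2,097,152 * 4 bytes = 8 MiB at 32 bits per object.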

You are reading back to main memory the results of writing to the SSBO and want to use the contents to make culling decisions on the CPU side? That introduces a GPU stall: everything writing to the SSBO has to complete on the GPU side before you can copy the data to CPU memory and then feed new draw commands into the GPU. In general, reading data back to CPU memory should be avoided. If it is necessary, use multiple buffers (M) where in frame N you update buffer N % M and use buffer (N+1) % M to make your culling decisions - i.e. you use culling information from a few frames ago to determine visibility in this frame. That of course introduces a few frames of latency between the culling information being computed and the effect being visible on screen.

Yeah, I also thought and read about this. In the future I would probably add some triple buffering to the SSBO with asynchronous buffer reading and clearing.

The same thing about not using the result of a GPU operation in the same frame that it is generated in applies to queries. You introduce a GPU stall by waiting for the query result; use results from previous frames to avoid it.
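For example, with a pair of queries per object you can consume last frame’s result only when it is already available (untested sketch; the double-buffered query pair is an assumption):

import static org.lwjgl.opengl.GL43.*;

int curQuery;  // query issued this frame (created with glGenQueries elsewhere)
int prevQuery; // query issued last frame

boolean updateVisibility(boolean lastVisible) {
    boolean visible = lastVisible; // fall back to the previous answer
    if (glGetQueryObjecti(prevQuery, GL_QUERY_RESULT_AVAILABLE) != 0) {
        visible = glGetQueryObjecti(prevQuery, GL_QUERY_RESULT) != 0; // no stall: result is ready
    }
    glBeginQuery(GL_ANY_SAMPLES_PASSED_CONSERVATIVE, curQuery);
    // ... draw the bounding box ...
    glEndQuery(GL_ANY_SAMPLES_PASSED_CONSERVATIVE);
    int tmp = curQuery; curQuery = prevQuery; prevQuery = tmp; // swap for next frame
    return visible;
}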

I don’t really understand what you are saying about using a full integer vs just a single bit, but I don’t think I have to. The issue remains that all fragment shader invocations generated for one bounding box are forced to execute sequentially at the atomicOr, since they all contend for a single memory location.

Also, 128^3 is > 2 million. You cannot issue that many draw calls in a single frame and expect any reasonable performance. You should aim to have the total number of draw calls per frame in the low thousands.

I thought I made it clear that I am doing that. Or do you mean that the code I posted does not work like that?

I tested it and also realized that using atomicOr ruins the performance. Which is sad, because now I have to use one full integer (32 bits) for each object, so sending the culling information from the GPU to the CPU transfers 32 times as much data.
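The shader side then no longer needs an atomic at all, since every invocation for a given object stores the same value (simplified sketch, names are placeholders):

String fragmentSource = """
        #version 430
        layout(std430, binding = 0) buffer Visibility {
            uint visible[];
        };
        flat in uint objid;

        void main() {
            visible[objid] = 1u; // plain store; all writers agree on the value
        }
        """;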

And the upper limit of 128^3 objects is just the theoretical limit. With frustum culling that number is a lot lower.

You could reduce that to 8 bits by using imageStore with a GL_R8UI buffer texture (or a 2D texture if you need to exceed GL_MAX_TEXTURE_BUFFER_SIZE, which is only required to be at least 65536). There isn’t going to be any solution using single bits which doesn’t have a penalty related to concurrent read-modify-write operations.
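A rough sketch of that approach (untested; the binding point and names are made up):

import static org.lwjgl.opengl.GL43.*;

// Host side: back a GL_R8UI buffer texture with one byte per object and
// bind it as a write-only image. Note that 128^3 texels exceeds the minimum
// GL_MAX_TEXTURE_BUFFER_SIZE, so check the actual limit or fall back to a 2D texture.
long numObjects = 128L * 128L * 128L;
int buf = glGenBuffers();
glBindBuffer(GL_TEXTURE_BUFFER, buf);
glBufferData(GL_TEXTURE_BUFFER, numObjects, GL_DYNAMIC_READ); // one byte per object
int tex = glGenTextures();
glBindTexture(GL_TEXTURE_BUFFER, tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R8UI, buf);
glBindImageTexture(0, tex, 0, false, 0, GL_WRITE_ONLY, GL_R8UI);

// Shader side, embedded as a Java text block:
String fragmentSource = """
        #version 430
        layout(binding = 0, r8ui) uniform writeonly uimageBuffer visibility;
        flat in uint objid;

        void main() {
            imageStore(visibility, int(objid), uvec4(1u)); // one byte per object
        }
        """;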

@GClements Thank you for the tip. I might look into that in the future. But first I want to solve another issue I’m having now: Improving my occlusion culling code