Increased glClear processing time when STENCIL is included

Hi All,

When I perform a glClear that includes the stencil buffer, the processing time increases dramatically.

	// ...
	/*Frame(52)*/
	eglSwapBuffers(egl_Display, egl_Surface);
	eglGetError();
	glGetError();
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // no problem! (takes 0.6ms)
	glBindTexture(GL_TEXTURE_2D, TO[28]);
	glUseProgram(PO[4]);
	{
		const GLfloat v[]={1.000000};
		glProgramUniform1fv(PO[4], UF7, 1, v);
	}
	{
		const GLfloat v[]={1280.000000, 0.000000, 0.000000, 0.000000, 324.000000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glBindTexture(GL_TEXTURE_2D, TO[29]);
	{
		const GLfloat v[]={648.000000, 0.000000, 317.000000, 0.000000, 324.000000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glBindTexture(GL_TEXTURE_2D, TO[30]);
	{
		const GLfloat v[]={0.000000, 0.713942, 1.000000, 0.713942, 0.000000, 1.000000, 1.000000, 1.000000};
		glProgramUniform2fv(PO[4], UF3, 4, v);
	}
	{
		const GLfloat v[]={1280.000000, 0.000000, 0.000000, 0.000000, 297.500000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glEnable(GL_CULL_FACE);
	glFrontFace(GL_CCW);
	glCullFace(GL_BACK);
	glEnable(GL_DEPTH_TEST);
	glDepthFunc(GL_LESS);
	glDepthRangef(0.000000, 1.000000);
	glViewport(0, 367, 1280, 480);
	glEnable(GL_SCISSOR_TEST);
	glScissor(0, 367, 1280, 480);
	glClear(GL_STENCIL_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // problem! (takes 65 ms). If I remove GL_STENCIL_BUFFER_BIT, there is no problem.
	GetTimeClear(1);
	glDepthFunc(GL_ALWAYS);
	glViewport(0, 0, 1920, 720);
	glDisable(GL_SCISSOR_TEST);
	glDisable(GL_DEPTH_TEST);
	glDisable(GL_CULL_FACE);
	glBindFramebuffer(GL_FRAMEBUFFER, FBO[3]);
	glViewport(0, -240, 1920, 720);
	glStencilMask(0x000000ff);
	glClearStencil(0);
	GetTimeClear(2);
	glEnable(GL_CULL_FACE);
	glEnable(GL_DEPTH_TEST);
	glDepthFunc(GL_LESS);
	glViewport(0, 367, 1280, 480);
	glEnable(GL_SCISSOR_TEST);
	// ...

In the code above, the second glClear call takes 65 ms.
Conversely, if I remove that glClear with depth and stencil, the subsequent glClear takes only 0.6 ms.
In other words, after rendering multiple frames, only the first glClear that includes the stencil bit is slow (even if subsequent glClears include stencil, their time does not increase).

I’m not sure whether this is a bug in the GPU driver or whether I’m not following the OpenGL programming rules properly. Does anyone know anything about this phenomenon?

I’m using the Mali-G51 GPU.

Thanks.

What 2 cases are you comparing? Above you show COLOR and DEPTH. But then you talk about removing DEPTH and STENCIL. (??) Also, are you saying that only queuing the 1st glClear() call with a permutation takes a while or all of them do?

At any rate, I think you might be confused. When you call GL and EGL routines, the work is not performed immediately. It is “queued” to be performed later. This is especially important on mobile GPUs (like yours), where the raster work should be performed 1-2 frames later. And if it’s not (e.g. if your app doesn’t use GL-ES and EGL properly), then your rendering performance will be very bad.

So I think what you’re saying is that the time needed to “queue” glClear() seems to increase.

Keep in mind that 65 ms is suspiciously close to 4 frames at 60 Hz (i.e. 15 Hz), which tends to suggest that something you’re doing is triggering 1 or more full pipeline flushes, which is very bad for performance.

You do not want that. Your goal is for your STENCIL and DEPTH render targets (all of them besides COLOR) to reside only in the GPU’s fast on-chip tile memory and never get written out to CPU memory or read back from it. CPU memory on a mobile GPU is dog slow, and reading/writing DEPTH and STENCIL in CPU memory is needless memory bandwidth.

  • Always clear all render targets (including DEPTH and STENCIL buffers) without masking when you start rendering to that framebuffer,
  • Always invalidate the entire DEPTH and STENCIL buffers at the end of rendering to that framebuffer, and
  • Follow all of the GPU vendor’s rules to ensure that your DEPTH and STENCIL buffers exist only in fast on-chip tile memory where the cost of reading and writing them is nearly zero.
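Concretely, the per-frame pattern those three bullets describe might look like the following sketch (assuming a GLES 3.0 context rendering to the default framebuffer; `egl_Display` / `egl_Surface` reuse the names from the trace above):

```c
/* Start of the render pass: clear everything, unmasked and unscissored,
 * so the driver can do a fast clear of tile memory. */
glDisable(GL_SCISSOR_TEST);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
glStencilMask(0xFF);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

/* ... draw the frame ... */

/* End of the render pass: DEPTH and STENCIL are not needed after the frame,
 * so tell the driver not to write them out to memory. For the default
 * framebuffer the attachment enums are GL_DEPTH / GL_STENCIL. */
const GLenum discard[] = { GL_DEPTH, GL_STENCIL };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discard);

eglSwapBuffers(egl_Display, egl_Surface);
```

With this pattern the driver never has to read DEPTH/STENCIL in at the start of a frame nor write them out at the end; they live entirely in tile memory.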

It sounds like the latter. Re the GPU vendor’s rules, search for all mentions of “stencil” in this document and follow those rules:

My guess is that whenever you dynamically add or remove stencil from your clear mask, the driver has to completely reorganize how it reads from and writes to your render target. That may very well cause 1-N full pipeline flushes, which you then happen to observe on your CPU draw thread at the point where you measure.

What 2 cases are you comparing? Above you show COLOR and DEPTH. But then you talk about removing DEPTH and STENCIL. (??) Also, are you saying that only queuing the 1st glClear() call with a permutation takes a while or all of them do?

My explanation was not very clear. To be more specific, my program calls glClear a total of 74 times.

1 frame glClear(COLOR|DEPTH)
2 frame glClear(COLOR|DEPTH)
... 
54 frame glClear(COLOR|DEPTH|STENCIL) // 65ms!
55 frame glClear(COLOR|DEPTH|STENCIL) // 0.6ms
...
74 frame glClear(COLOR|DEPTH|STENCIL) // 0.6ms

In the 54th of the 74 glClear() calls, I added the STENCIL bit to the clear mask for the first time (because I didn’t use the stencil buffer until the 53rd glClear).
Under these conditions, if I remove the STENCIL bit from the 54th glClear, the 55th glClear takes longer instead; the stall keeps shifting to whichever glClear first includes STENCIL.

What I’m comparing is if I remove all of the STENCIL masks from the above code:

1 frame glClear(COLOR|DEPTH)
2 frame glClear(COLOR|DEPTH)
...
54 frame glClear(COLOR|DEPTH) // 0.6ms
55 frame glClear(COLOR|DEPTH) // 0.6ms
...
74 frame glClear(COLOR|DEPTH) // 0.6ms

This does not clear the STENCIL buffer, but it avoids the 66 ms clear.

Thank you for your advice. Following it, I changed every glClear to clear COLOR|DEPTH|STENCIL and tested again, but it still takes 65 ms on the same 54th frame. So I suspect the cause lies in the code between the 53rd and 54th frames, but I don’t know where to look.

The code between frame 53 and frame 54 doesn’t do much, or am I missing something?

	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // 53rd glClear
	glBindTexture(GL_TEXTURE_2D, TO[28]);
	glUseProgram(PO[4]);
	{
		const GLfloat v[]={1.000000};
		glProgramUniform1fv(PO[4], UF7, 1, v);
	}
	{
		const GLfloat v[]={1280.000000, 0.000000, 0.000000, 0.000000, 324.000000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glBindTexture(GL_TEXTURE_2D, TO[29]);
	{
		const GLfloat v[]={648.000000, 0.000000, 317.000000, 0.000000, 324.000000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glBindTexture(GL_TEXTURE_2D, TO[30]);
	{
		const GLfloat v[]={0.000000, 0.713942, 1.000000, 0.713942, 0.000000, 1.000000, 1.000000, 1.000000};
		glProgramUniform2fv(PO[4], UF3, 4, v);
	}
	{
		const GLfloat v[]={1280.000000, 0.000000, 0.000000, 0.000000, 297.500000, 0.000000, 0.000000, 0.000000, 1.000000};
		glProgramUniformMatrix3fv(PO[4], UF1, 1, 1, v);
	}
	GetTimeDrawElementst(0);
	glEnable(GL_CULL_FACE);
	glFrontFace(GL_CCW);
	glCullFace(GL_BACK);
	glEnable(GL_DEPTH_TEST);
	glDepthFunc(GL_LESS);
	glDepthRangef(0.000000, 1.000000);
	glViewport(0, 367, 1280, 480);
	glEnable(GL_SCISSOR_TEST);
	glScissor(0, 367, 1280, 480);
	glClear(GL_STENCIL_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // 54th glClear

Thanks for pointing out the possible suspicions, I’ll check with the vendor as well.

Ok. Yeah, it does sound like there’s something else your app is doing around frame 54 that’s making the driver mad.

Try removing (#if 0 … #endif) more and more of your GL code that’s being triggered to see when that stall goes away. That will likely point to what your app is doing that the GL driver doesn’t like.

Well, one obvious thing you’re doing is masking the clear. Don’t do that. Disable GL_SCISSOR_TEST before the glClear(). And if you touch the buffer write masks, make sure they’re set before the glClear() so that all bits of all buffers get cleared. (If you don’t touch them at all, that’s fine.)

Also, you’re not invalidating DEPTH and STENCIL at the end of rendering. You want to do that.

  • A proper glClear() prevents needlessly reading back the content of one or more buffers from CPU memory.
  • A proper glInvalidateFramebuffer() / glDiscardFramebufferEXT() prevents needlessly writing out the content of one or more buffers to CPU memory.
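As a sketch, the invalidate for an application FBO (e.g. the FBO[3] bound in the trace) might look like this. Note that the attachment enums differ between user FBOs and the default framebuffer:

```c
glBindFramebuffer(GL_FRAMEBUFFER, FBO[3]);

/* ... render to the FBO ... */

/* For a user FBO, invalidate takes GL_*_ATTACHMENT enums;
 * the default framebuffer takes GL_DEPTH / GL_STENCIL instead. */
const GLenum attachments[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, attachments);   /* GLES 3.0+ */
/* On GLES 2.0, the equivalent comes from the EXT_discard_framebuffer extension:
 * glDiscardFramebufferEXT(GL_FRAMEBUFFER, 2, attachments); */
```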

And again, see all the tips related to “stencil” in the Arm Mali GPUs: Best Practices Developer Guide. Follow those.

Finally, use ARM’s OpenGL ES profiling tools. This may very well point out exactly what you’re doing wrong that’s causing this stall, and where you’re doing it!

I just noticed that these clear calls have different clear masks. I thought you said:

For this framebuffer, you should always clear COLOR, DEPTH and STENCIL. Every single frame. And of course allocate COLOR, DEPTH, and STENCIL buffers for this framebuffer.

If you vary the clear mask, you are very likely going to trigger nasty pipeline reconfig behavior and at least one full pipeline flush … if not more.

As you’ve got it written above, I could easily see how the driver at the beginning of Frame 54, has to read in the entire COLOR buffer from slow CPU memory. Moreover, at the end of Frame 53, since you didn’t invalidate STENCIL, it may very well have to write out the entire STENCIL buffer to slow CPU memory. You probably don’t want either of these costly behaviors.

Following your kind help, I disabled GL_SCISSOR_TEST before the glClear call in all frames, and the 65 ms problem disappeared. Thanks a lot.

Based on this result, I’m guessing that the part you mentioned was indeed the cause. I would like to understand this part in more detail (so that I don’t repeat the same mistake). Is there a way to learn exactly why leaving GL_SCISSOR_TEST enabled forces the driver to preserve the whole STENCIL buffer in slow CPU memory? Are there any documents I can study about it?

That’s great news! Sure thing!

If you look at that guide:

on pg. 28 under “Fragment Shading” “Do”:

Do
  • Clear or invalidate every attachment when you start a render pass, unless you really need to preserve the content of a render target to use as the starting point for rendering.
  • Ensure that color/depth/stencil writes are not masked when clearing; you must clear the entire content of an attachment to get a fast clear of the tile memory.
  • Invalidate attachments which are not needed outside of a render pass at the end of the pass before changing the framebuffer binding to the next FBO.
…

it tells you that you shouldn’t mask clears, but it’s not very explicit about why, mentioning only a “fast clear of the tile memory”. It could additionally have said, “avoiding a complete read of the entire render target from CPU memory before the clear operation is performed.”

If the driver “knows” the previous contents of the render target won’t be needed, it doesn’t need to waste the time reading it from CPU memory.

The guide also talks about invalidating buffers you no longer need at the end. That avoids writing out buffers to CPU memory that you don’t need to keep, typically DEPTH and STENCIL.


Here are a few links on this from various Mobile GPU vendors. The same principles apply, as mobile GPUs all have the same basic rasterization rendering architecture (sort-middle with slow DRAM):


Mobile GPUs typically use tiled rendering, i.e. they split the screen into tiles and render each one separately. Essentially the same technique used to be used to render high-resolution images on desktop PCs when video cards only had a few megabytes of VRAM. Tiled rendering requires rendering the entire scene for each tile (any geometry which doesn’t intersect the tile will be discarded by clipping).

The GPU only has enough VRAM for one tile, not for the whole screen. This isn’t a problem so long as the VRAM state can be reset at the start of each tile and discarded at the end of it. But if any part of the VRAM contents has to be preserved from one frame to the next, the VRAM has to be copied out to system RAM then restored before rendering the next frame for the same tile.

This is all because OpenGL doesn’t care about “tiles” or “frames” or “the scene”. It’s just a sequence of rendering commands which are (or at least appear to be) executed in sequence. It’s up to the driver to implement the API in a way that’s efficient on the hardware in question.

But it can only do so much. In particular, if you have a mobile GPU with only enough VRAM to store a fraction of the framebuffer, anything which requires the framebuffer contents to be preserved between frames is going to be slow. The OpenGL API was designed around a persistent framebuffer, so efficiently implementing that API on a tile-based GPU involves some … tricks which mostly work for “typical” code but fall down on atypical code. And even something which doesn’t actually require the framebuffer contents to be preserved will be slow if the implementation can’t detect that preservation isn’t actually required.

So e.g. setting a scissor rectangle so that only part of the framebuffer gets cleared means that the rest of the framebuffer has to be preserved, or at least it looks that way to the driver. The driver isn’t going to look into this particularly deeply. Certain operations (e.g. glClear for all attachments with no masks or scissor rectangle) will flag the entire framebuffer as being “cleared”. The absence of such an operation may result in it being preserved. Better still, glInvalidateFramebuffer explicitly tells the driver it doesn’t need to preserve contents; that’s its entire point.
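To make the scissor point concrete, a sketch contrasting the two cases from this thread (scissor values taken from the trace above; whether a given driver actually takes the fast path is implementation-specific):

```c
/* Slow path: a scissored clear touches only part of each attachment, so the
 * driver must assume the rest is preserved and may reload tile contents from
 * system RAM before clearing. */
glEnable(GL_SCISSOR_TEST);
glScissor(0, 367, 1280, 480);
glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

/* Fast path: a full, unmasked clear lets the driver flag the whole attachment
 * as "cleared" and start each tile from an on-chip clear instead. */
glDisable(GL_SCISSOR_TEST);
glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
```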
