SSBO and VBO help

I did as much performance testing as Visual Studio allows me. It claims that most of my frame time is spent in nvoglv64.dll and gdi32.dll, which is everything OpenGL related. I would like to know what other methods I could use to find the issue.

Rendering is still a big part of it though: if I disable texturing, it jumps from 2-3 FPS to 23-27 FPS for 10000 objects. I should also probably mention that I run it on a 2012 laptop.

If you really can’t transfer more than a few hundred kilobytes per second to your GPU, then your card has serious problems.

Actually, it’s almost 4.7 megabytes: 108 bytes of object data × 10000 objects for the SSBO (≈1.08 MB), plus 60 bytes of vertex data × 6 vertices × 10000 objects for the VBO (≈3.6 MB).

As far as I can tell, you are using the mapped buffer pointer to update the buffer in a for-loop, which means the buffer is mapped the whole time. Alternatively, you can try to build a local “CPU-side” buffer (std::vector, ::reserve(buffersize)), set the data in that buffer, and then upload it in one go, either with glBufferSubData() or glMapBufferRange() (or via buffer streaming if the previous data isn’t relevant anymore).
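A rough sketch of that idea (graphics2DObjectData and objectSSBO stand in for whatever struct and buffer you use; buildObjectData() and objectCount are likewise placeholders):

	// build the whole frame’s object data in a CPU-side buffer first
	std::vector<graphics2DObjectData> staging;
	staging.reserve(objectCount);
	for (size_t i = 0; i < objectCount; ++i)
		staging.push_back(buildObjectData(i));

	// then upload it in a single call instead of many small writes
	// through a mapped pointer
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, objectSSBO);
	glBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
	                staging.size() * sizeof(graphics2DObjectData), staging.data());
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);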

Batch Rendering using SSBO

The SSBO will contain an array of structs that contains all data for each object:

	struct objectVarsData {
		float posVec[3];
		float rotVec[3];
		// ...
	};

SSBO + VBO only provides a 2x performance increase over calling draw for each object separately. For 10000 sprites, with texturing and alpha channels, I get around 3 FPS (up from 1-1.5 FPS).

…I thought the performance remained low because I keep re-writing the SSBO/VBO with the same object data, even though it remains constant.

…Also, I technically draw things twice, due to the nature of my renderer, and stage 4 requires its own FBO since OpenGL doesn’t handle feedback loops (rendering to a texture while sampling from it).

…Also I redraw things a couple of times after that, which I can remove completely if I use my brain and move things around.

This whole thread is wandering around in the weeds.

You took something that sounds really simple, you made it a lot more complex, and you still have very poor performance to show for it.

Rather than try to optimize your much-more-complex tech approach…

I’d suggest you ignore your tech approach for a second, pop back up to the top level, and tell us what you are trying to accomplish. What’s the big picture? Are you just drawing a bunch of point sprites (quads) with texturing and alpha? Is it more complicated than that? If so, how? Then sketch out your original (non-SSBO/non-VBO/etc.) implementation for us (show some code snippets). Also, tell us what GPU/driver/OS you are targeting, the number of sprites you’re aiming to render, and at what target frame time. You’re more likely to get good performance in the end with this route.

I’d recommend that you first understand clearly why your original (non-SSBO/VBO/etc.) implementation is slow, and what you need to change (minimally) to remove its primary bottlenecks and net you good performance. Folks here can help you with that.

I did as much performance testing as Visual Studio allows me.
It claims that most of my frame time is spent in nvoglv64.dll and gdi32.dll, which is everything OpenGL related.

Ok, so you’re GL driver (CPU) and/or GPU performance limited. Which means to get better performance, you need to change how you’re using OpenGL to drive the GPU.

There are other ways to profile GPU-based apps than running the MSVS Profiler on them. For instance, having “feature toggles” in your app where you can switch on/off various pieces of your draw loop for debugging can be useful for isolating how much frame time each feature takes.
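For example, something as simple as this can tell you a lot (the flags and helper functions below are made-up placeholders, not your code):

	// debug toggles, flipped at runtime (e.g. from key presses)
	bool dbgTexturing = true;
	bool dbgAlphaPass = true;
	bool dbgDeferred  = true;

	// in the draw loop, skip whole features and compare frame times
	drawOpaqueObjects(dbgTexturing);
	if (dbgDeferred)  runDeferredShadingPass();
	if (dbgAlphaPass) drawTransparentObjects();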

Would instancing be possible for different geometries within the same draw call?

Please explain what’s different about the geometries. Do these sprites you’re rendering have different numbers of vertices (e.g. != 4)?

I’d suggest you ignore your tech approach for a second, pop back up to the top level, and tell us what you are trying to accomplish. What’s the big picture? Are you just drawing a bunch of point sprites (quads) with texturing and alpha? Is it more complicated than that? If so, how? Then sketch out your original (non-SSBO/non-VBO/etc.) implementation for us (show some code snippets). Also, tell us what GPU/driver/OS you are targeting, the number of sprites you’re aiming to render, and at what target frame time. You’re more likely to get good performance in the end with this route.

For educational purposes, I am building a multi-purpose engine. The idea is to be able to support geometry (triangle-based), lines, and points. Right now I am doing the 2D rendering pipeline, where I assume all triangle-based shapes are flat and ordered in some way (since there’s usually some sort of hierarchy to 2D graphics, with the most important objects on top). In both the 2D and 3D pipelines, I will be using (I already have it, but it’s disabled) the deferred shading technique as a way to optimize shading operations. Since deferred shading inherently does not work well with transparent objects, I had to separate operations into 4 stages:

Stage 1: Render solids (alpha == 1) in FBO1
Stage 2: Do deferred shading, save to FBO2
Stage 3: Render alphas, using pre-rendered depth buffer from stage 1 to discard all fragments covered by non-transparent objects. Each alpha fragment is rendered with shading applied.
Stage 4: Render the layer’s output to SceneFBO

This is done for each layer, with the results from each layer rendered on top of each other in stage 4. Also, for each object I render control geometry to its own output, where each object has its own unique application-wide control value. After all layers are rendered, the mouse position is extracted to trigger a flag in whatever object the mouse is pointing at.

I can’t really show you the code snippets because I wouldn’t know where to start. It’s just about 5000 lines of object-oriented code right now.

Most of the things I mentioned, I implemented by issuing a separate draw call for each object. So right now I am trying to learn something new while fixing the performance issue. I am targeting Windows, but I try to use cross-platform libraries in case I need to run my engine on a Linux machine.

Ok, so you’re GL driver (CPU) and/or GPU performance limited. Which means to get better performance, you need to change how you’re using OpenGL to drive the GPU.

There are other ways to profile GPU-based apps than running the MSVS Profiler on them. For instance, having “feature toggles” in your app where you can switch on/off various pieces of your draw loop for debugging can be useful for isolating how much frame time each feature takes.

I absolutely understand that drawing 10000+ objects by constantly uploading object data into buffers is insanity. It has bad design written all over it, which is why I will be implementing batch rendering next.

Please explain what’s different about the geometries. Do these sprites you’re rendering have different numbers of vertices (e.g. != 4)?

Yes exactly.

Right now I am thinking about making prototypes of each object type that contain their own geometry data and VBO location for batch rendering. That way, when I make an object, I can use instanced rendering to draw from the predefined VBO. That should remove any need to update the vertex data at all, which is 77% of the data I upload to the video card every frame right now.
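Roughly what I have in mind (prototype and instanceCount are placeholders; none of this is implemented yet):

	// per prototype: static geometry, uploaded once at load time
	glBindBuffer(GL_ARRAY_BUFFER, prototype.vbo);
	glBufferData(GL_ARRAY_BUFFER,
	             prototype.vertices.size() * sizeof(graphics2DObjectVertexData),
	             prototype.vertices.data(), GL_STATIC_DRAW);

	// per frame: only per-object data goes into the SSBO, and one call
	// draws every instance of this prototype; the vertex shader can use
	// gl_InstanceID plus a per-prototype offset to index the SSBO
	glBindVertexArray(prototype.vao);
	glDrawArraysInstanced(GL_TRIANGLES, 0,
	                      (GLsizei)prototype.vertices.size(), instanceCount);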

I suggest you start reading about rendering techniques (“OpenGL Superbible”, “OpenGL Programming Guide”, and other books/articles by NVIDIA and so on).

I am building a multi-purpose engine.

It should be noted that “performance” and “multi-purpose” don’t go together. Imposing limitations on your scene is what allows you to be able to make optimizations. The more options you give to the user, the fewer options you leave for optimization.

In both the 2D and 3D pipelines, I will be using (I already have it, but it’s disabled) the deferred shading technique as a way to optimize shading operations.

… why would you need to use deferred shading for 2D rendering? I could understand needing deferred shading if you’re rendering billboards or something, but most 2D sprite rendering doesn’t even use lighting.

Right now I am thinking about making prototypes of each object type that contain their own geometry data and VBO location for batch rendering. That way, when I make an object, I can use instanced rendering to draw from the predefined VBO. That should remove any need to update the vertex data at all, which is 77% of the data I upload to the video card every frame right now.

Until you have positively identified the bottleneck, you should not be making those kinds of decisions. After all, what good does it do to reduce your data uploads by 77% if data uploading is not what’s causing your performance problem?

The best way to figure this out is to reduce everything down to just the OpenGL stuff. Rip out your entire engine (or just open up a new OpenGL project), and rebuild just the sequence of operations needed to produce the output. The best way to do that is to get an OpenGL trace tool, have it spit out a log of OpenGL commands, and then put those commands in your new application.

From there, start profiling. Use timer queries to figure out how long operations on the GPU are taking. Pull things out and see if it improves performance. Start figuring out what is causing your problem.
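For instance, a bare-bones GL_TIME_ELAPSED query around a block of draw calls looks roughly like this (error handling and query re-use omitted):

	GLuint query;
	glGenQueries(1, &query);

	glBeginQuery(GL_TIME_ELAPSED, query);
	// ... the draw calls you want to measure ...
	glEndQuery(GL_TIME_ELAPSED);

	// read the result later (ideally a frame or two afterwards, so the
	// read-back doesn’t stall the pipeline)
	GLuint64 elapsedNs = 0;
	glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
	double elapsedMs = elapsedNs / 1.0e6;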

Only when you know what the problem is can you actually solve it.

It should be noted that “performance” and “multi-purpose” don’t go together. Imposing limitations on your scene is what allows you to be able to make optimizations. The more options you give to the user, the fewer options you leave for optimization.

I know that, but I was hoping to get something more than 3 FPS for something basic like 10k sprites. The question is how to achieve that, and that’s why I wanted to try batch rendering.

… why would you need to use deferred shading for 2D rendering? I could understand needing deferred shading if you’re rendering billboards or something, but most 2D sprite rendering doesn’t even use lighting.

I must be nuts, but I want to try something not a lot of people do.

Until you have positively identified the bottleneck, you should not be making those kinds of decisions. After all, what good does it do to reduce your data uploads by 77% if data uploading is not what’s causing your performance problem?

I played around with the “comment out” tool, and it appears that most of my performance loss is in the fragment shaders, which use too many if statements. Rendering a plain color with the alpha pass disabled gives me 15+ FPS for 10000 objects. For 1000 objects, the framerate goes from 30 FPS to 130 FPS if I disable everything. So the main reason for my slowdown is the uber shader, but due to the nature of what I want to do, I guess I cannot change that.

Still, though, that does not mean I should not be looking into other things. Drawing just 2000 triangles at 130 FPS (with all effects off) is still not good enough.

Oh, and stop reporting performance as “FPS”. Performance is best measured in actual frame time (e.g. 3 FPS ≈ 333 ms/frame, 130 FPS ≈ 7.7 ms/frame), since the same FPS difference corresponds to very different amounts of frame time depending on where you start.

Hello again,

I have stumbled upon a Steam game-dev conference that featured a presentation on modern techniques for vertex data streaming, in particular a method that utilizes persistently mapped buffers. I implemented the solution the presenter proposed, replacing my SSBO and VBO buffers with persistent ones.

Initialization:

	// Object SSBO
			glGenBuffers(1, &(this->objectSSBO));
			glBindBuffer(GL_SHADER_STORAGE_BUFFER, this->objectSSBO);

			glBufferStorage(GL_SHADER_STORAGE_BUFFER, graphics2DMaximumSSBOSize_Byte * 3, NULL, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

		this->objectSSBOAddrStart = (graphics2DObjectData *) glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, graphics2DMaximumSSBOSize_Byte * 3, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

			glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

	// Object VBO
			glGenBuffers(1, &(this->objectVBO));
			glBindBuffer(GL_ARRAY_BUFFER, this->objectVBO);

		glGenVertexArrays(1, &(this->objectVAO));
		glBindVertexArray(this->objectVAO);
		glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(graphics2DObjectVertexData), (GLvoid*)offsetof(graphics2DObjectVertexData, position));
		glEnableVertexAttribArray(0);
		glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(graphics2DObjectVertexData), (GLvoid*)offsetof(graphics2DObjectVertexData, uvCoordinates));
		glEnableVertexAttribArray(1);
		glVertexAttribIPointer(2, 1, GL_UNSIGNED_INT, sizeof(graphics2DObjectVertexData), (GLvoid*)offsetof(graphics2DObjectVertexData, objectIndex));
		glEnableVertexAttribArray(2);

			glBufferStorage(GL_ARRAY_BUFFER, graphics2DMaximumVBOSize_Byte*3, NULL, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

		this->objectVBOAddrStart = (graphics2DObjectVertexData *) glMapBufferRange(GL_ARRAY_BUFFER, 0, graphics2DMaximumVBOSize_Byte * 3, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
		
			glBindBuffer(GL_ARRAY_BUFFER, 0);

Synchronization:


		// Waiting for buffer
			GLenum waitStatus = GL_UNSIGNALED;
			if (this->subSceneSync) {
				while ((waitStatus != GL_ALREADY_SIGNALED) && (waitStatus != GL_CONDITION_SATISFIED))
				{
					waitStatus = glClientWaitSync(this->subSceneSync, GL_SYNC_FLUSH_COMMANDS_BIT, 1);
				}
			}

		this->objectVBOAddr = this->objectVBOAddrStart + this->currentBuffer*graphics2DMaximumVBOSize_Byte;
		this->objectSSBOAddr = this->objectSSBOAddrStart + this->currentBuffer*graphics2DMaximumSSBOSize_Byte;		

		/////////////////////////////////////
		//    FETCH AND RENDER HERE
		/////////////////////////////////////

		this->currentBuffer = (this->currentBuffer + 1) % 3;

		// Locking the buffer
			if (this->subSceneSync) glDeleteSync(this->subSceneSync);

			this->subSceneSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

Rendering:


			glBindFramebuffer(GL_FRAMEBUFFER, this->subSceneFBO1);
			glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
			glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

			glUseProgram(graphics2DStage1ObjectShader);

				glActiveTexture(GL_TEXTURE0);
				glBindTexture(GL_TEXTURE_2D, this->textureAsset->colorMapID);
				glUniform1i(graphics2DStage1ObjectColorMapLocation, 0);

				glActiveTexture(GL_TEXTURE1);
				glBindTexture(GL_TEXTURE_2D, this->textureAsset->normalMapID);
				glUniform1i(graphics2DStage1ObjectNormalMapLocation, 1);

				glActiveTexture(GL_TEXTURE2);
				glBindTexture(GL_TEXTURE_2D, this->textureAsset->specularMapID);
				glUniform1i(graphics2DStage1ObjectSpecularMapLocation, 2);

				glActiveTexture(GL_TEXTURE3);
				glBindTexture(GL_TEXTURE_2D, this->textureAsset->lightMapID);
				glUniform1i(graphics2DStage1ObjectLightMapLocation, 3);

			glEnable(GL_DEPTH_TEST);
			glDepthMask(GL_TRUE);
			glDisable(GL_BLEND);

			// Binding SSBO
				glBindBuffer(GL_SHADER_STORAGE_BUFFER, this->objectSSBO);
				glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, this->objectSSBO);

			// Binding VBO
				glBindVertexArray(this->objectVAO);
				glBindBuffer(GL_ARRAY_BUFFER, this->objectVBO);
				glDrawArrays(GL_TRIANGLES, graphics2DMaximumVerteces*this->currentBuffer, vertexIndex);

These are the only major changes from the last working version of my thing.

However, nvoglv64.dll crashes during SSBO data filling for the very first object. The addresses seem to be good, and all the buffer switching is proper as well. My video card does support the ARB_buffer_storage extension. Is there anything else I can check to ensure everything is in working order?

The program also crashes with just the VBO being persistent, but in that case it actually makes it past frame 1, so I don’t think having two persistent buffers is the issue.

Suggestions are appreciated.