Performance problem: VBO vs Immediate-mode


I started using OpenGL about a year ago and did some tests with immediate-mode rendering. After a few months I started to work with VBOs. Now I wanted to see how much performance I gained by switching from immediate mode to VBO rendering, but to my surprise the performance didn't increase at all. In fact, most of the time it was slightly worse?!

For my test I create only a single VBO and I want to render it 20000 times per frame (with some different translation on the modelview-matrix).

My guess is that I did something wrong when setting up the VBO. Here are the important parts of the code:
(Please note, I use Java and LWJGL to access the native OpenGL functions.)

Initializing the VBO:

		private void load_vbo() {
			int FLOAT_BYTE_SIZE = 4;

			// Define vertices and indices used by the VBO:
			float[] vertices = new float[] {
					0, 0, 0,
					0, 1, 0,
					1, 1, 0,
					1, 0, 0
			};
			byte[] indices = new byte[] {
					0, 1, 2, 2, 3, 0
			};

			// Create the vertex buffer; upload vertices into buffer.
			vbo_id = GL15.glGenBuffers();
			FloatBuffer vbo_buf = ByteBuffer.allocateDirect(vertices.length * FLOAT_BYTE_SIZE).order(ByteOrder.nativeOrder()).asFloatBuffer();
			vbo_buf.put(vertices);	// copy the vertex data into the direct buffer
			vbo_buf.flip();		// rewind so the upload reads from the start
			GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vbo_id);
			GL15.glBufferData(GL15.GL_ARRAY_BUFFER, vbo_buf, GL15.GL_STATIC_DRAW);

			// Create the index buffer; upload indices into buffer.
			ibo_id = GL15.glGenBuffers();
			ByteBuffer ibo_buf = ByteBuffer.allocateDirect(indices.length).order(ByteOrder.nativeOrder());
			ibo_buf.put(indices);	// copy the index data into the direct buffer
			ibo_buf.flip();
			GL15.glBindBuffer(GL15.GL_ELEMENT_ARRAY_BUFFER, ibo_id);
			GL15.glBufferData(GL15.GL_ELEMENT_ARRAY_BUFFER, ibo_buf, GL15.GL_STATIC_DRAW);

			// Make vbo ready for drawing:
			int POS_COUNT = 3;
			int TEXCOORD_COUNT = 0;
			GL11.glVertexPointer(POS_COUNT, GL11.GL_FLOAT, stride, 0);
			//GL11.glTexCoordPointer(TEXCOORD_COUNT, GL11.GL_FLOAT, stride, POS_COUNT * FLOAT_BYTE_SIZE);
		}

For drawing the VBO:

		for (int i = 0; i < IMAGE_COUNT; i++) {
			GL11.glTranslatef(someX, someY, 0);
			// One draw call per quad (two triangles).
			GL11.glDrawElements(GL11.GL_TRIANGLES, 6, GL11.GL_UNSIGNED_BYTE, 0);
		}

For drawing Immediate-Mode:

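		// All quads are submitted inside a single glBegin/glEnd pair.
		GL11.glBegin(GL11.GL_QUADS);
		for (int i = 0; i < IMAGE_COUNT; i++) {
			GL11.glVertex3f(someX, someY, 0);
			GL11.glVertex3f(someX, someY + 1, 0);
			GL11.glVertex3f(someX + 1, someY + 1, 0);
			GL11.glVertex3f(someX + 1, someY, 0);
		}
		GL11.glEnd();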

Everything else works; it's just the performance that isn't that good.
Thanks for your help.

The difference is that in immediate mode you draw all the quads in one batch (one glBegin/glEnd pair), while the VBO version makes a draw call for every two triangles. Try to make one vertex buffer holding all 20000 quads and one index buffer containing 20000 copies of the 6 vertex indices (each copy offset to its own quad's vertices), and render them in one call. That should be much quicker.
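
To make the batching idea concrete, here is a minimal sketch in plain Java (no LWJGL calls) of how the arrays for such a batch could be built. The class and method names are made up for illustration. Two assumptions: each quad gets its own four vertices, so every copy of the index pattern has to be offset by that quad's first vertex, and the indices are ints because 20000 quads x 4 vertices = 80000 does not fit in an unsigned short.

```java
// Sketch: build one vertex array and one index array covering many quads,
// so the whole batch could be drawn with a single glDrawElements call.
final class QuadBatch {
    static final int VERTS_PER_QUAD = 4;
    static final int INDICES_PER_QUAD = 6;

    /** Writes the four corners of a unit quad at each (xs[i], ys[i]); 3 floats per vertex. */
    static float[] buildVertices(float[] xs, float[] ys) {
        float[] v = new float[xs.length * VERTS_PER_QUAD * 3];
        int p = 0;
        for (int i = 0; i < xs.length; i++) {
            float x = xs[i], y = ys[i];
            // Same corner order as the single-quad VBO: (0,0), (0,1), (1,1), (1,0).
            v[p++] = x;     v[p++] = y;     v[p++] = 0;
            v[p++] = x;     v[p++] = y + 1; v[p++] = 0;
            v[p++] = x + 1; v[p++] = y + 1; v[p++] = 0;
            v[p++] = x + 1; v[p++] = y;     v[p++] = 0;
        }
        return v;
    }

    /** Repeats the pattern 0,1,2,2,3,0 once per quad, offset by that quad's first vertex. */
    static int[] buildIndices(int quadCount) {
        int[] idx = new int[quadCount * INDICES_PER_QUAD];
        int p = 0;
        for (int q = 0; q < quadCount; q++) {
            int base = q * VERTS_PER_QUAD;
            idx[p++] = base;     idx[p++] = base + 1; idx[p++] = base + 2;
            idx[p++] = base + 2; idx[p++] = base + 3; idx[p++] = base;
        }
        return idx;
    }
}
```

After uploading both arrays once (the index array never changes, only the vertex positions do), the whole batch would presumably be drawn with something like glDrawElements(GL_TRIANGLES, 6 * quadCount, GL_UNSIGNED_INT, 0) instead of 20000 separate calls.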

That makes sense.
But I would like to draw the exact same image multiple times; like sprites in a video game. Their positions change and the number changes too. What should I do, have a gigantic buffer and fill it as necessary?
Should I move the sprites on the screen using glTranslatef or should I update their positions directly in the buffer?

Thank you for your advice.

This is what instanced rendering is for. Feed in your x,y translates as an instanced attribute, and draw the same object 20000 times in one draw call. No need to replicate any indices.

Also watch out for those byte indices; they may seem attractive on the surface because they use less memory, but byte indices are very unlikely to be supported in hardware meaning that your entire vertex pipeline could drop back to software emulation. Use unsigned shorts instead if you’re drawing a small amount of data.
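
As a small sketch of that change, the byte indices from the original post could be widened into a ShortBuffer before upload. The helper name below is made up; the buffer calls are standard java.nio.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

final class IndexWidening {
    /** Copies byte indices into a direct, native-order ShortBuffer, treating each byte as unsigned. */
    static ShortBuffer toUnsignedShorts(byte[] indices) {
        ShortBuffer buf = ByteBuffer.allocateDirect(indices.length * Short.BYTES)
                .order(ByteOrder.nativeOrder())
                .asShortBuffer();
        for (byte b : indices) {
            buf.put((short) (b & 0xFF)); // mask so values 128..255 stay positive
        }
        buf.flip(); // make the written range readable for the upload
        return buf;
    }
}
```

The resulting buffer would then go to glBufferData for GL_ELEMENT_ARRAY_BUFFER, and the draw call would use GL_UNSIGNED_SHORT in place of GL_UNSIGNED_BYTE.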

You don’t have to use one buffer for all of the sprites, but you should try to minimise the total number of draw calls.

Instanced rendering is ideal, as that means that you only have to update one x,y pair per sprite, not 4. But even using glDrawElements() with client-side vertex arrays should provide some improvement over glBegin/glEnd (assigning array elements directly is more efficient than calling a function).

Matrix operations apply to each draw call. You can’t apply different matrices to different quads within a single draw call. But you can do essentially the same thing with instanced rendering.

A key consideration for efficiency is separating what is constant (e.g. a quad’s vertex coordinates relative to its origin, and its texture coordinates) from what is varying (e.g. each quad’s origin), and only updating the parts which vary. The constant parts are supplied once for all instances, the varying parts are supplied once per instance. The vertex shader combines the two (e.g. adding the origin to the relative vertex position).
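
The split between constant and per-instance data can be sketched on the CPU side like this. It is plain Java with made-up names, and shadeVertex is only a CPU stand-in for what a vertex shader would compute, not actual shader code. In a real setup the origins array would presumably live in its own buffer, be marked as per-instance with glVertexAttribDivisor, and be drawn with glDrawElementsInstanced, which requires shaders and a GL version (or extension) that supports instancing.

```java
final class InstancedQuads {
    // Constant data, supplied once for all instances:
    // corner positions relative to the sprite origin, 2 floats per corner.
    static final float[] CORNERS = { 0, 0,  0, 1,  1, 1,  1, 0 };

    /** Packs one (x, y) origin per sprite: the only data that varies between sprites. */
    static float[] packOrigins(float[] xs, float[] ys) {
        float[] out = new float[xs.length * 2];
        for (int i = 0; i < xs.length; i++) {
            out[2 * i] = xs[i];
            out[2 * i + 1] = ys[i];
        }
        return out;
    }

    /** CPU reference for the vertex shader's job: instance origin + relative corner. */
    static float[] shadeVertex(float[] origins, int instance, int corner) {
        return new float[] {
            origins[2 * instance] + CORNERS[2 * corner],
            origins[2 * instance + 1] + CORNERS[2 * corner + 1]
        };
    }
}
```

With this layout, moving a sprite means rewriting 2 floats in the origins array rather than 4 vertices, and the corner data is never touched again.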

The thing is, I want to draw sprites for a 2D game. Now the sprites obviously move but they can also change their size (which I currently do with glScalef) and rotate sometimes (which I do with glRotatef). However, they all use the same texcoords.

So what I have right now is one VBO with fixed positions and texcoords, and for each sprite I call glTranslate, glScale and glRotate, then draw the VBO and pop the MV matrix. This is, as stated above, not much faster than when I was doing all of this with immediate mode.

So should I rather:

  1. Have one VBO for every sprite and not do any matrix multiplication but instead manipulate the buffer data.
  2. Have one gigantic VBO and put the values of all sprites into this VBO (possibly having lots of free, unused space in the VBO at some points in time)
  3. Go with the instanced rendering (although I currently don't know anything about it and will probably have to read some documentation first)

Thank you all for the quick responses.

Here I’m going to go against the modern trend and suggest that you just continue using immediate mode.

The questions you’re asking are certainly answerable with a VBO setup, but why bother? For sprite-based 2D work you’re going to bottleneck far more on fillrate and blending than you will on draw calls or vertex submission, and switching to a VBO setup is going to do absolutely nothing about those fillrate and blending bottlenecks. That’s not the problem that VBOs were designed to solve, and even if you do get an optimal VBO setup you’re still going to be disappointed because your primary bottlenecks will still be there.

Hm, I understand.
But wouldn't using VBOs save a lot of work for the CPU? Even if it doesn't do me much good with the rendering, I can still store all the vertex data on the graphics card and draw it much more easily, right?

Maybe, maybe not. If you’re doing translation/scale/rotate of each sprite then those translations/scales/rotates will still need to be calculated on the CPU (and most drivers will implement the GL matrix stack on the CPU too so that’s not going to help). If you’re updating a dynamic buffer for each frame, or each sprite, (e.g. to do instancing) you need to be very careful about how you do the updating or you’ll introduce stalls. Finally, consider that a 2D sprite is typically something in the order of 16 floats - 4 sets of positions at 2 floats each (x and y) and 4 sets of texcoords at 2 floats each (s and t). A matrix is going to also be 16 floats. So even if you do implement instancing you’re still going to be pushing 16 floats per-sprite; it just doesn’t seem worth it in your case.
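
To illustrate the 16-floats-per-sprite arithmetic, here is a rough sketch of the CPU-side work for one sprite: scale, rotate and translate a unit quad, then write 4 x (x, y, s, t) into an array. The names are hypothetical, the scale is assumed to be uniform, and the texcoords are assumed to cover the whole texture.

```java
final class SpriteVertices {
    static final int FLOATS_PER_SPRITE = 16; // 4 corners x (x, y, s, t)

    /**
     * Writes one sprite's 16 floats into dst starting at offset.
     * Each unit-quad corner is scaled, rotated (angle in radians), then translated.
     */
    static void write(float[] dst, int offset, float x, float y, float scale, float angle) {
        float cos = (float) Math.cos(angle);
        float sin = (float) Math.sin(angle);
        float[] corners = { 0, 0,  0, 1,  1, 1,  1, 0 };
        int p = offset;
        for (int c = 0; c < 4; c++) {
            float cx = corners[2 * c], cy = corners[2 * c + 1];
            float sx = cx * scale, sy = cy * scale;
            dst[p++] = x + sx * cos - sy * sin; // position x
            dst[p++] = y + sx * sin + sy * cos; // position y
            dst[p++] = cx;                      // texcoord s (assumed full texture)
            dst[p++] = cy;                      // texcoord t
        }
    }
}
```

For n sprites this fills an array of n * FLOATS_PER_SPRITE floats each frame, which is the per-sprite cost being compared against a 16-float matrix above.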