Texture Switching

Warning: I haven’t tried this yet. I already reserve a 4-bit field in an integer attribute of each vertex to select which texture in the texture array to sample. This lets me render many objects within each batch with indexed vertices (an element array).

However, this typically doesn’t work within any contiguous smooth surface of a single object, because a given vertex is shared between multiple triangles on a smooth surface.

However, I have not yet tried the new “restart” capability. With that, it should be possible to duplicate vertices within smooth surfaces where you want the texture to switch. Of course the “duplicate” vertex has a different value in that 4-bit “texture array index” field, but that’s not so terrible.
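
A minimal sketch of what that “restart” usage could look like (the reserved index value and the GL 3.1-style calls are my assumption, not necessarily the poster’s setup):

    /* Enable primitive restart and pick a reserved index value (assumed here). */
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(0xFFFFFFFFu);

    /* Wherever that index appears in the element array, the current strip ends and
       a new one begins, so duplicated vertices carrying a different 4-bit
       texture-index field can start a fresh strip within one draw call. */
    glDrawElements(GL_TRIANGLE_STRIP, index_count, GL_UNSIGNED_INT, 0);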

Therefore, with “texture arrays” and “restart” I believe you can achieve the result you want… and quite efficiently too.

However, this typically doesn’t work within any contiguous smooth surface of a single object, because a given vertex is shared between multiple triangles on a smooth surface.

The 4-bit field is a vertex attribute, yes (though how you create a 4-bit attribute is beyond me)? It’s effectively part of the texture coordinate. So you simply do what you already do when the same position uses different texture coordinates or different normals: you duplicate the position in a new vertex.

There’s no need for “restart” (I assume you’re talking about primitive restart); this has been done since vertex arrays of any kind were first introduced.

One of my vertex attributes is a 32-bit integer. It is a simple matter to apply shift (>> and <<) and mask (&) operators to extract 1-, 2-, 3-, 4-, 5-, 6-bit and larger bit fields to specify:

  • one of many transformation matrices
  • one of many textures in a texture array
  • whether to apply the texture or not
  • whether to normal-map or not
  • whether to emit light, compute lighting, or otherwise
  • and so forth

This is what I do, so I can submit mammoth batches in a single call of glDrawElements().
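
A minimal sketch of the packing side (the field widths, positions, and helper name are illustrative assumptions, not the poster’s actual layout); the attribute has to be uploaded as an integer attribute, e.g. with glVertexAttribIPointer, and the vertex shader undoes the same shifts and masks:

    #include <stdint.h>

    /* Pack several per-vertex control fields into one 32-bit attribute value. */
    static uint32_t pack_vertex_flags(uint32_t matrix_index,   /* assumed 8-bit field */
                                      uint32_t texture_layer,  /* assumed 4-bit field: texture array slice */
                                      int apply_texture,       /* 1 bit */
                                      int apply_normal_map)    /* 1 bit */
    {
        return  (matrix_index     & 0xFFu)
             | ((texture_layer    & 0x0Fu) << 8)
             | ((apply_texture    ? 1u : 0u) << 12)
             | ((apply_normal_map ? 1u : 0u) << 13);
    }

    /* In the vertex shader, the same fields come back out with e.g.
       (flags >> 8) & 0xFu for the texture array slice. */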

PS: Just curious. What’s so strange about a “4-bit field in an integer attribute”? There’s no need to define any “4-bit attributes”! :-) In fact, that would be terribly inefficient, which is why I pack many individual bits and 4-bit fields into a single 32-bit integer attribute.


You are entirely correct about adding duplicate vertices to accomplish these kinds of things.

However, this doesn’t work if the desired process needs to be done dynamically AKA “on demand” (in response to something that happened in the application, like adding a gunshot hole or other damage).

Also, I tend not to think about adding duplicate vertices because the focus of my engine is “procedurally generated content”. In practice that means objects tend to be composed of standard “fundamental shapes” and derivative shapes based upon them. Since these shapes are standardized (created by standard routines), I usually don’t even think in terms of specialized “jiggered” approaches like extra [duplicate] vertices to achieve results like this. The other reason not to think that way is that it is not a general approach: it doesn’t work for dynamic, “on demand” situations like I mentioned above.

However, if the nature of the application is such that he’ll never run into these dynamic/on-demand situations, then making specialized extra/duplicate vertices is perfectly good.

This is what I do, so I can submit mammoth batches in a single call of glDrawElements().

While simultaneously having your shader be exceedingly large and branchy. I’m not a fan of monolithic shaders myself; I’m not convinced of the performance of this technique.

However, this doesn’t work if the desired process needs to be done dynamically AKA “on demand” (in response to something that happened in the application, like adding a gunshot hole or other damage).

Then you need to decide whether constantly updating a buffer object is going to get you the same performance that just rendering “normally” will, without the uber-shader, mega-batch approach. I don’t know what the penalty is for state changes, but I doubt it’s more severe than PCIe bus transfers.

This feature would be extremely useful to me; in fact, the increased number of batches due to texture switching is the last real bottleneck in my renderer.

It was suggested that this is not needed because we have texture arrays, or could just use an atlas. That’s true for a lot of applications but not mine.

Texture Atlases won’t work since I don’t know until draw time which textures are needed together in the same atlas. I could generate an atlas every frame, but that would be slower than just doing the switching.

Texture Array does not work for the same reason, and additionally my textures are all different sizes.

OP’s suggestion though, would let me collapse almost everything into 1 draw call - that’s perfection!

This feature would be extremely useful to me; in fact, the increased number of batches due to texture switching is the last real bottleneck in my renderer.

So how do you know that this is a bottleneck? Namely, how do you know that you aren’t simply bumping into the fastest your hardware will go?

OP’s suggestion though, would let me collapse almost everything into 1 draw call - that’s perfection!

I think people have taken this “minimize batch count” thing a bit too far, to be honest. Actual high-end games, products that cost millions of dollars to produce and whose success partially depends on getting as much performance as possible from the hardware, do not take some of the steps that people often talk about.

Taking one draw call to render is not “perfection”. It’s simply taking one draw call to render. Everything has costs, and nothing is free.

Texture atlases have costs. You have to make textures bigger, mipmaps are more difficult to make workable, you may waste texture space, etc.

Texture arrays have costs. The entire texture must be small enough to fit into GPU memory all at once. So you can’t have a working set of textures that are in GPU memory; it’s all or nothing for any particular array.

I don’t know what you have done that removes all state changes except texture changes. But I highly doubt that it was “free”. Your particular applications might be able to live within its limitations, but that doesn’t make it free.

The OP’s suggestion might allow you, the user, to have only one draw call. But that doesn’t mean any of the state change overhead has vanished. What the OP suggests is not possible on current hardware, and unless there is a pretty fundamental change in how textures are implemented, it will not be available on hardware in the near future.

Changing textures requires state changes. Either you are going to ask OpenGL to change that state, or OpenGL is going to change that state internally. But the state change, and all of the associated performance issues therein, will still be there.

On our horribly overpowered desktops, you can easily do over 1000 draw calls per frame and the CPU won’t even break a sweat. Seriously, putting everything into just one draw call is just plain silly. The place where you need to look is “how many draw calls are you doing?” If you are under 1000 on a desktop, that is not likely your bottleneck (unless every draw call is accompanied by a texture and/or GLSL program change). Lastly, not having everything in one draw call allows one to cull large chunks of non-visible geometry without forcing them down the GPU (ahh… but how finely to cull, such joys!).

That’s just it actually: in the worst case scenes we are easily doing 5000+ draw calls/frame, each one being just one quad, and each requiring a texture switch.

Imagine a particle system with 1000s of particles, where every particle has a different texture, that’s basically the use case.

We also have a software rasterizer and when running at a small resolution like 640x480, it’s actually faster than OpenGL in this worst case… that’s kind of sad imo :(

Imagine a particle system with 1000s of particles, where every particle has a different texture, that’s basically the use case.

And there’s no way you can build appropriate texture atlases for these particular cases? This is a common technique used by many high-performance rendering applications. You don’t need to know precisely which textures will be used. You know the set of images that could possibly be used, and that is generally enough. Bundle all the particle-system textures together, and you’re fine.

Which brings to mind another question: how exactly would you render particle systems in the same draw call as, for example, terrain that might have diffuse maps, a bump map, and possibly one or two other textures? Not to mention using a much more complicated shader.

We also have a software rasterizer and when running at a small resolution like 640x480, it’s actually faster than OpenGL in this worst case… that’s kind of sad imo

That’s not sad at all; it’s expected. GPUs have, and always will have, some form of overhead for their use. That’s why it is important to draw something suitably significant that allows the basic rendering performance gain to exceed the overhead.

These days, if everything you’re drawing is just single-textured, multiply texture by color, Quake-1-era stuff, you’re really wasting your GPU.

This doesn’t work in our case because the set of “all possible” textures can be far too large to fit into one texture, or potentially, into VRAM. (We use a LRU cache system instead of loading everything at startup.)

Which brings to mind another question: how exactly would you render particle systems in the same draw call as, for example, terrain that might have diffuse maps, a bump map, and possibly one or two other textures? Not to mention using a much more complicated shader.

Well, you wouldn’t. When I said one draw call I was just talking about the particle-like systems. The rest of the stuff we need to draw fits very well into the OpenGL paradigm and there are no performance issues.

This doesn’t work in our case because the set of “all possible” textures can be far too large to fit into one texture, or potentially, into VRAM. (We use a LRU cache system instead of loading everything at startup.)

Then break it up into several smaller atlases. Each texture can represent particles for related effects. Impose limits on your artists if you have to. You’ll still have texture changes, but not as many.

Yeah, I think that’s probably the best that can be done.

The point though, is this a limitation of the hardware, or of the API? I don’t see why it shouldn’t be possible to (efficiently) use a different texture for each polygon.

I don’t see why it shouldn’t be possible to (efficiently) use a different texture for each polygon.

Because texture accessing is built into the hardware. It’s not just passing a pointer to the shader and having it fetch values from memory. There is dedicated texturing hardware associated with each cluster of shading processors. This texturing hardware needs to know specific information, not just about the texture (pointers to memory, etc), but about how to access it (sampler state, format, etc).

For any texture you use, this information must be passed to the texture unit hardware before it can access that texture. That’s part of what happens when you bind a new texture and render with it. In a texture object (and sampler object), there is a block of GPU setup commands that gets put into the GPU’s command buffer when you render with new textures.

Plus there are API issues. GLuint texture names have no relationship to the actual texture data on the GPU side. That translation is done on the CPU when you call glBindTexture. So having the GPU read, for example, “5” from a buffer object would be meaningless; it wouldn’t know what to do with that value.

Coupled with that is whether or not texture “5” is in GPU memory currently or not. This is something that the CPU normally takes care of when you render with new textures. Again, it is part of the setup for new textures.

In short: not gonna happen.

Imagine a particle system with 1000s of particles, where every particle has a different texture, that’s basically the use case.

Here are my thoughts:

  • Does each particle have a unique texture?
  • What are the texture resolutions?
  • How many pixels are “most of” the particles taking up on the screen?
  • Do these textures have a fast-to-compute procedural nature?

Another thought. If there are, say, 1024 particles on a screen running at a resolution of 1024x1024, then each particle on average takes up 1024 pixels, which means each particle is about 32x32 pixels in size (on average).

So: putting the textures used by the particles into a texture atlas is definitely the way to go. Moreover, “calculating what textures to use each frame” does not sound likely. The more likely case is that from one frame to the next most of the particles are using the same texture as last frame. A texture atlas will give you flexibility: you build the atlas as you need texture “room”, and once the atlas gets full you take a gander at which images are needed and which are not. I freely confess that having the image applied to each particle change from frame to frame does not make sense to me. I can imagine a system where you “allocate particles each frame”, with the allocation done frame by frame, per particle or per group. In that case, you’ve got some refactoring to do in order to take advantage of the fact that for most particles, from one frame to the next, the image data is the same.

Since you are talking about particles, the images are likely pretty small, so the wasted slack space in an atlas is not such a big deal. The most obvious thing to do is make all the images the exact same size, and a power of 2 at that; then the texture atlas business is heck-a-easy.

Keep in mind that at the end of the day you typically (only, snickers) have 1GB of VRAM to fit geometry, textures, and framebuffers.

@Alfonse:

I defer to your expertise on the hardware part. But it still seems like this scenario could be a lot more efficient if it was done in the driver instead of making 1000s of bind and drawArrays calls. Especially if it could be guaranteed that all the textures are the same format (but not same size).

@kRogue:

Interesting. That actually gives me an idea. I could allocate a large texture array, say 1024x1024x64, and put the LRU inside that. Under certain (realistic, I think) assumptions about the rate of things entering/leaving the LRU, this may be more efficient in both the typical and worst case scenes.
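
A rough sketch of that idea (the sizes, GL_RGBA8 format, and function name are assumptions for illustration): one GL_TEXTURE_2D_ARRAY acts as the cache, and when the LRU evicts an entry its slice is overwritten with glTexSubImage3D:

    /* Allocate the cache once: a 1024x1024 texture array with 64 slices. */
    GLuint cache;
    glGenTextures(1, &cache);
    glBindTexture(GL_TEXTURE_2D_ARRAY, cache);
    glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8,
                 1024, 1024, 64,          /* width, height, slice count */
                 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

    /* When the LRU loads a new image into an evicted slice: */
    void cache_upload(GLuint cache_tex, int slice, const void *pixels)
    {
        glBindTexture(GL_TEXTURE_2D_ARRAY, cache_tex);
        glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                        0, 0, slice,       /* x, y, layer offset */
                        1024, 1024, 1,     /* one full slice */
                        GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    }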

Just one problem, how would it work with mipmaps? I need mipmaps on all the textures, and if everything is in an atlas I can no longer use glGenerateMipmap. (not directly anyways)

No prob. Slices of texture arrays can have MIPmaps.

Texture Atlases won’t work since I don’t know until draw time which textures are needed together in the same atlas. I could generate an atlas every frame, but that would be slower than just doing the switching.

You can put a lot of slices in your atlases. And then either dynamically rewrite your slice index (texcoord.z) for your batch verts, or use a helper texture to translate your virtual texcoord.z to the actual texture array slice index (think virtual memory).

Depending on your program’s constraints, applying this may be trivial or hard. For instance: which texture formats do you need to support? Which resolutions? That yields the number of texture array permutations. Then you need to figure out the max number of slices in each array (i.e. the largest working set for each fmt+res). Trivial, hard, or in between… depends on your app.

You could also calculate the mipmaps yourself rather than using glGenerateMipmap (which usually just applies a simple box filter).
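
For what it’s worth, a box-filter downsample is only a few lines; here is a minimal sketch for RGBA8 data (my own illustration, not the poster’s or the white paper’s code) that produces one mip level from the level above it:

    #include <stdint.h>

    /* 2:1 box-filter downsample of an RGBA8 image; src is sw x sh, dst is (sw/2) x (sh/2). */
    static void box_downsample_rgba8(const uint8_t *src, int sw, int sh, uint8_t *dst)
    {
        int dw = sw / 2, dh = sh / 2;
        for (int y = 0; y < dh; ++y) {
            for (int x = 0; x < dw; ++x) {
                for (int c = 0; c < 4; ++c) {
                    /* Average the 2x2 block of source texels for each channel. */
                    int a = src[((2*y    ) * sw + 2*x    ) * 4 + c];
                    int b = src[((2*y    ) * sw + 2*x + 1) * 4 + c];
                    int d = src[((2*y + 1) * sw + 2*x    ) * 4 + c];
                    int e = src[((2*y + 1) * sw + 2*x + 1) * 4 + c];
                    dst[(y * dw + x) * 4 + c] = (uint8_t)((a + b + d + e + 2) / 4);
                }
            }
        }
    }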

Assuming the image data in the atlas have power-of-2 dimensions, you can set the mipmap data of the images yourself. This makes a great deal of sense if the images are loaded from disk (i.e. the loaded image data already includes the mipmaps). If you are generating the image data procedurally at run time, then generating the mipmap data yourself is quite likely to be much faster anyway. Take a look at nvidia.developer.com for a texture atlas white paper. Essentially it says “make the image data powers of 2, or make sure all image data is at power-of-2 boundaries.”

If the power-of-2 restriction is too great, then you can mess with GL_TEXTURE_MAX_LOD to specify the highest mipmap level that GL can use (for example, if the image data is aligned to a power of 2 or a multiple of 2^k, whichever is smaller, setting GL_TEXTURE_MAX_LOD to k will work too).
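
In API terms that is just a texture parameter (a sketch with an assumed level count k; note GL also has GL_TEXTURE_MAX_LEVEL, which restricts which mip levels are used at all):

    /* Clamp sampling to the first k+1 mip levels of the atlas texture (k assumed known). */
    glBindTexture(GL_TEXTURE_2D, atlas_tex);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_LOD, (float)k);
    /* Alternatively (or in addition), limit which mip levels the texture exposes: */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, k);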