Optimizing the vertex pipeline

I am doing some pretty extreme performance tests with 100 million+ polys, and I want to optimize the speed at which my vertex shaders are running, primarily for shadow rendering.

This is my vertex structure:

struct Vertex
{
	Vec4 position;
	std::array<signed char, 3> normal;
	signed char displacement;
	std::array<short, 4> texcoords;
	std::array<signed char, 4> tangent;
	std::array<unsigned char, 4> color;
	std::array<unsigned char, 4> boneweights;
	std::array<unsigned char, 4> boneindices;
	uint32_t index;
};

On the shader side, vertices are defined as follows:

//Vertex layout
layout(location = 0) in vec4 inPosition;
layout(location = 1) in vec4 inNormal;
layout(location = 2) in vec4 inTexCoords;
layout(location = 3) in vec4 inTangent;
layout(location = 4) in vec4 inColor;
layout(location = 5) in vec4 inBoneWeights;
layout(location = 6) in uvec4 inBoneIndices;
layout(location = 7) in uint inVertexID;

Do you see any problems here that would be non-optimal for common PC hardware?

I know this was the case a few years ago, but can we still expect unsigned 32-bit vertex indices to be slower than unsigned shorts on modern hardware?

Currently I am using the same shader layout for render and shadow polygons. Is this a mistake? Can I make shadow polys faster by omitting everything but the vertex position, or by copying the vertex positions into a second tightly packed shadow mesh? Or is that a waste of time?

Any tips you can offer are appreciated.

Hey there,
even if there were a significant difference, wouldn’t it depend on the specific hardware anyway?
Since Vulkan supports so many GPUs, there is a big variety of hardware to consider.

Best regards,
Johannes

I’m going to assume that you are not using this vertex for un-skinned meshes.

In order to do shadow mapping with a skinned mesh, you need to provide all of the information needed to compute the vertex position. I don’t know what displacement or index are, but what you need for positions obviously includes the weights and indices. I’ll throw displacement in there too.

So if you were able to isolate just the values needed for position rendering, you’d have a structure 28 bytes in size (always round up to a multiple of 4). That’s well under your current 48 bytes, a bit more than half the size. That’s good.

What’s not good is that it doesn’t matter. Your position data is interleaved with non-position data, and memory fetching is ultimately done in whole cache lines. So even if your shader technically uses less data, it still has to fetch the same amount of memory. The cost of using that memory to fill in attributes will almost certainly be negligible (and you could remove even that just by not specifying those attributes in the vertex format for the shadow pipeline).
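
To put a rough number on the cache-line point, here is a small back-of-the-envelope model in C++ (the 64-byte line size is a typical assumption, not something from your hardware specifically): sequential vertices at a given stride pull in whole lines no matter how few bytes per vertex the shader actually reads.

```cpp
#include <cassert>
#include <cstddef>

// Typical cache-line size on current PC hardware (an assumption here).
constexpr size_t cacheLine = 64;

// Bytes actually transferred to stream `count` sequential vertices of
// `stride` bytes each, regardless of how few bytes the shader consumes.
constexpr size_t bytesFetched(size_t stride, size_t count)
{
    size_t span = stride * count;
    // Round up to whole cache lines.
    return ((span + cacheLine - 1) / cacheLine) * cacheLine;
}
```

With this model, 1000 interleaved 48-byte vertices cost 48000 bytes of traffic, while a tight 16-byte position stream costs 16000 bytes for the same vertices, a 3x difference in bandwidth for the shadow pass.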

The only way to get a performance benefit out of all of this is to have two buffers of data: one that stores the position-related data and one that stores everything else. Of course, this could cause other problems, such as putting more pressure on the pre-T&L cache.
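
If it helps to visualize the split, here is a rough C++ sketch of what the two streams might look like. The names and exact field grouping are my own assumptions (and I’m assuming Vec4 is four floats), not something from your code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical split: everything the shadow pass needs to compute a
// skinned position lives in one stream (binding 0)...
struct ShadowVertex
{
    float   position[4];     // assuming Vec4 = 4 floats
    uint8_t boneweights[4];
    uint8_t boneindices[4];
    int8_t  displacement;
    int8_t  pad[3];          // round up to a multiple of 4
};
static_assert(sizeof(ShadowVertex) == 28, "matches the 28 bytes above");

// ...and the shading-only attributes live in a second stream (binding 1)
// that is simply never bound for the shadow pipeline.
struct ShadingVertex
{
    int8_t   normal[3];
    int8_t   pad;
    int16_t  texcoords[4];
    int8_t   tangent[4];
    uint8_t  color[4];
    uint32_t index;
};
static_assert(sizeof(ShadingVertex) == 24, "shading-only stream size");
```

The shadow pipeline then fetches only the 28-byte stream, while the main pipeline binds both buffers.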

So you’d have to profile it to know whether it’s worth doing. Personally, I’d guess that it’s probably worth doing (especially since, when doing shadow rendering, vertex processing is probably going to be your biggest performance bottleneck), but profiling is really the only way to know.

That being said, there are a few things in your vertex struct that are… questionable. The most notable is that position is a vec4. Is it really worth a whole 4 bytes per vertex to be able to pass homogeneous coordinates? That’s usually not what you pass in; most meshes will just have 1.0 there, so it seems like a waste of space.

Also, there’s something that won’t save you space, but it could help improve visual quality. The normal and tangents (ignoring displacement) probably ought to be VK_FORMAT_A2R10G10B10_SNORM_PACK32, which is a higher-precision way to store 3 useful components and one useless one. It’s pretty widely available for use as vertex input data. And I’d say that, if you’re going to steal a byte for this displacement, take it from tangent and let normal use the higher precision data.
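
In case it’s useful, here is a sketch of packing a normal into that format on the CPU side. The helper names are mine; the bit layout follows the Vulkan convention that PACK32 formats list components from most- to least-significant bits (A in bits 30–31, R in 20–29, G in 10–19, B in 0–9):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Encode one float in [-1, 1] into a 10-bit signed-normalized field.
uint32_t packSnorm10(float v)
{
    v = std::clamp(v, -1.0f, 1.0f);
    int32_t i = static_cast<int32_t>(std::lround(v * 511.0f));
    return static_cast<uint32_t>(i) & 0x3FFu;   // keep low 10 bits
}

// Pack (x, y, z) into the R, G, B fields of A2R10G10B10_SNORM_PACK32.
// The 2-bit A field (bits 30-31) is left zero here.
uint32_t packNormalA2R10G10B10(float x, float y, float z)
{
    return packSnorm10(z) | (packSnorm10(y) << 10) | (packSnorm10(x) << 20);
}

// Decode one 10-bit field back to float (what the GPU does on fetch).
float unpackSnorm10(uint32_t bits)
{
    int32_t i = static_cast<int32_t>(bits << 22) >> 22;  // sign-extend
    return std::max(i / 511.0f, -1.0f);
}
```

A round trip loses at most about 1/511 per component, versus 1/127 for your current 8-bit snorm normals.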

Lastly, here’s a useful GLSL trick that’s sadly not widely known:

layout(location = 1, component = 0) in vec3 inNormal;
layout(location = 1, component = 3) in float inDisplacement;

You can specify named variables that use different components of the same attribute location.


I always knew most of this stuff, but I thought the difference it would make would be marginal. Instead I am seeing a massive difference in performance in vertex-limited scenes. I’ve already decided to eliminate vertex colors and the second UV set based on what I am seeing. (They can be stored in a texture and accessed in the vertex shader if they are really needed.)

I’ll post a write-up of my findings once this is complete. Thank you for the tips.

Well, I have my results.

Interleaved tightly packed data is best. I reduced the size of my vertex structure to 32 bytes and I am now using a second copy of the data in a tightly packed position array for shadow map rendering. (I saw no difference between 12 and 16 bytes on a GeForce 1070M. I suspect the fetch gets rounded up to 16 bytes internally somehow, so I might as well pack the texcoords into the last four bytes.)
New vertex structure:

struct Vertex
{
    Vec3 position;
    short texcoords[2];
    signed char normal[3];
    signed char displacement;
    signed char tangent[4];
    unsigned char boneweights[4];
    unsigned char boneindices[4];
};
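
One small addition that might be worth it: since the shader-side attribute offsets now depend on this exact layout, a few compile-time checks catch accidental padding or reordering immediately. A sketch with explicit fixed-width types, assuming Vec3 is three floats:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same 32-byte layout as above, with fixed-width types so the
// static_asserts below are meaningful across compilers.
struct PackedVertex
{
    float    position[3];
    int16_t  texcoords[2];
    int8_t   normal[3];
    int8_t   displacement;
    int8_t   tangent[4];
    uint8_t  boneweights[4];
    uint8_t  boneindices[4];
};

// Guard the layout: a stray member or padding change shows up as a
// compile error instead of as garbage in the shader.
static_assert(sizeof(PackedVertex) == 32, "vertex must stay 32 bytes");
static_assert(offsetof(PackedVertex, texcoords) == 12, "texcoords offset");
static_assert(offsetof(PackedVertex, boneindices) == 28, "boneindices offset");
```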

The most surprising thing was that optimizing for the vertex cache made rendering twice as fast.

I also tried AMD’s Tootle, but it gave the exact same result. The vertex fetch optimization produced no change in performance, which matches what recent research has indicated.

The overdraw optimization produced no change, but my scene wasn’t really the type of thing I think would benefit anyways.

Also, unsigned short indices are 11% faster than unsigned 32-bit integers.
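
Given that, a simple policy might be to pick the index width per mesh. A minimal sketch (the enum and function names are mine, not from any API):

```cpp
#include <cassert>
#include <cstddef>

enum class IndexType { U16, U32 };

// Use 16-bit indices whenever the mesh allows it: they can address
// vertices 0..65535 and halve the index buffer size and fetch traffic.
constexpr IndexType chooseIndexType(size_t vertexCount)
{
    return vertexCount <= 65536 ? IndexType::U16 : IndexType::U32;
}
```

Large meshes can also be split into sub-meshes of at most 65536 vertices so that everything stays on 16-bit indices.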
