Optimize grid rendering for post T&L cache

Yandersen · August 6, 2014, 6:21pm

Hi!
I am rendering a large terrain (which is stored as heightmap texture) using triangle strips for the full length of the map. The optimization I am going to make is the viewing frustum culling. So instead of drawing the map as a whole, I will draw it piece-by-piece, selecting only the parts trapped inside the viewing volume.

Those pieces are going to be a standard rectangular grids of relatively small size. The question is what would be the most optimal dimensionality of those batches (9x9 points, 17x17 points…) and the best drawing algorithm to ensure that each vertex of the batch is processed only once with the help of post-transform cache? Are the optimizations for post T&L cache still relevant at all or am I just outdated?..

As I understand, the best way is to draw the grid using the strips progressing in diagonal direction (so the first one would actually be a single triangle, the next wil have 3 triangles and so on). In that case the required capacity of post-T&L cache will be equal to the number of points in a grid, right? But how to determine the size of post T&L cache and what is it in general and what would it be in near future?

Dark_Photon · August 7, 2014, 5:48pm

It depends on your hardware. You’re trading culling cost (typically CPU) against large batch sizes (typically GPU, though batch dispatch overhead matters if the batch is too small). Best bet: test and see.

Are the optimizations for post T&L cache still relevant at all or am I just outdated?..

Very relevant.

As I understand, the best way is to draw the grid using the strips progressing in diagonal direction (so the first one would actually be a single triangle, the next wil have 3 triangles and so on). In that case the required capacity of post-T&L cache will be equal to the number of points in a grid, right? But how to determine the size of post T&L cache and what is it in general and what would it be in near future?

Read these:

This page is a humorous but informative introduction.
Tom Forsyth’s Linear-Speed Vertex Cache Optimisation.
Ignacio Castaño’s Optimal Grid Rendering. More on that here and Forsyth’s comments on that here.
Check out the ACMR (average cache miss ratio) stats on Ignacio’s page. Also see his spiel on why ACMR might not be the absolute best metric to use (but is better than most), suggesting average transform to vertex ratio (ATVR) instead.

mbentrup · August 8, 2014, 4:07am

This answer cannot be answered, because there is no simple FIFO post T&L cache in modern desktop GPUs. A FIFO cache is inherently serial, GPUs like to work in parallell, so you can’t say the size of the cache is X.

I think the common approach used today is to split the primitive stream into multiple “packets”, so that the number of distinct vertices within every packet is less than or equal to the GPU’s workgroup size, and then those packets are transformed independently. So you get full sharing of vertices within a packet, but no sharing between packets.

Yandersen · August 9, 2014, 1:45pm

[QUOTE=Dark Photon;1260877]Read these:

This page is a humorous but informative introduction.
Tom Forsyth’s Linear-Speed Vertex Cache Optimisation.
Ignacio Castaño’s Optimal Grid Rendering. More on that here and Forsyth’s comments on that here.
Check out the ACMR (average cache miss ratio) stats on Ignacio’s page. Also see his spiel on why ACMR might not be the absolute best metric to use (but is better than most), suggesting average transform to vertex ratio (ATVR) instead.[/QUOTE]Thanks, Photon, I’ve been to all of these Google-top search materials already. I was just wondering this info might be outdated, so posted the q checking if anything changed since the publication days.
The Forsyth’s general optimization techniques can’t do better then the algorithm developed strictly for a particular well-defined situation (regular grid in my case), would you agree? And Castello’s proposal of strips’ direction reversing is not as good as diagonal drawing in case if the the post T&L cache can fit N vertices, where N is the width of the grid, vertex-wise. In other words, the width of the diagonally-drawn grid should be equal to the quantity of transformed vertices the cache can fit, so the only question left - what is that size and how to determine that.
However, the question becomes blurry considering that:

Have to admit, wasn’t that deep in the new functionality yet… Does it in some way means that the effective size of the cache may vary depending on the user’s configuration of the rendering pipeline?..

Dark_Photon · August 9, 2014, 4:36pm

Vertex cache optimization is still very relevant, but the main thing that’s changed (AFAIK) is there is no longer dedicated memory for the post-T&L (post-vertex shader) vertex cache on the GPU. Circa 2007 when the GPU world converted to generic stream processors, dedicated shared memory was put on the cards to store/load intermediate computations including transformed vertices. The more varyings (interpolators) output by the vertex shader, the fewer vertices you could have in your post-vertex shader (transformed vertices) cache.

However, even back-in-the-day, you had a whole 16KB of space per GPU multiprocessor (now it’s up to 64KB), so space for storing transformed vertices is not really a problem. And even before GPUs flipped to generic stream processors, the fixed function GPUs at the time had a post-vertex shader cache size of 45 vertices. That’s “plenty” big enough to get near optimal performance. With a good triangle indexer, the incremental benefit of increasing the cache size is very small.

The Forsyth’s general optimization techniques can’t do better then the algorithm developed strictly for a particular well-defined situation (regular grid in my case), would you agree?

It’s been a while since I’ve implemented Forsyth’s algorithm, but I think you’re probably right because you’re employing information his algorithm doesn’t know about (and doesn’t want to assume) to maximize reuse. However, the whole point is moot if your rendering is not vertex transform bound. I suspect even if you are, your performance won’t be that far from what you can get applying his technique, but try both and see! It’ll be fun to check this.

And Castello’s proposal of strips’ direction reversing is not as good as diagonal drawing in case if the the post T&L cache can fit N vertices, where N is the width of the grid, vertex-wise.

Do you really know what the size of the cache is? Or do you want to impose requirements on what at must at least be?

If not, that’s the benefit of a cache size oblivious method. It’s not going to just fall on its face if you happen to run on the wrong GPU.

Does it in some way means that the effective size of the cache may vary depending on the user’s configuration of the rendering pipeline?..

Yes. Related to what I was talking about about, it depends on which GPU you’re running on (how much GPU shared memory per SM), potentially which driver you’re running (i.e. how it choses to divvy up that shared memory), and how much varying data your vertex shader outputs (which you have complete control over).

mbentrup · August 10, 2014, 2:11am

Most DX10+ GPUs have mutliple processing units, thus the draw batches are split into smaller batches and each is processed on a different unit. There is caching/sharing of vertices within a processing unit, but not between different processing units. So if you render a batch of 100 triangles and it gets split to 50/50, then triangle 50 would have all vertices of triangle 1-49 cached, but triangle 51 would have none of the vertices accessible in the cache.

So you can no longer say that the last “x” vertices are cached. However, in general it is still useful to optimize the vertices, because you benefit from the caching within the split batches.

Yandersen · August 10, 2014, 6:40am

Thank you, Dark Photon! Still, IMO the Forsyth’s general algorithm is better then the Castello’s approach. Probably will use it, because my diagonal drawing will work only if all segment’s vertices are processed by a single unit, and if any vertices are sent to another one, it will break the chain.

[QUOTE=mbentrup;1260907]Most DX10+ GPUs have mutliple processing units, thus the draw batches are split into smaller batches and each is processed on a different unit. There is caching/sharing of vertices within a processing unit, but not between different processing units. So if you render a batch of 100 triangles and it gets split to 50/50, then triangle 50 would have all vertices of triangle 1-49 cached, but triangle 51 would have none of the vertices accessible in the cache.

So you can no longer say that the last “x” vertices are cached. However, in general it is still useful to optimize the vertices, because you benefit from the caching within the split batches.[/QUOTE]I can see the possibility of splitting and sharing the triangle processing in case of the triangle array, but I don’t understand how is it possible at all in case of the index buffer with primitive restart indices - splitting triangles in that case would require scanning and mapping the buffer. Do I need to move from strips with PRI to the indexed triangles and disable the PRI in order to get my triangles automatically shared between the processing units? Or I understand it wrong and that functionality must be configured manually in some way?

Yandersen · August 14, 2014, 1:32pm

I was going to implement Tom Forsyth’s optimization, but found the problem to be too interesting for code’s copy-pasting - want to “reinvent” that on my own.
But before I start, I want to understand how the vertex cache works exactly.

Say, I rendered a triangle with indices {1,2,3}, so they are all in the cache now.
Q1: does the cache look like {3,2,1}?
Then another triangle with vertices {4,2,3} comes in.
Q2: how the cache look like now? Is it {4, 3,2,1} or {3,2,4, 1}?
In other words, what happens with the reused vertices which reside in the cache already - are those being relocated to the top of the cache or they maintain their previous positions and only the new vertices are added to the top of the cache?

So far I assume that the vertex cache buffer corresponds to simple FIFO model. If so, then the class emulating it should look like this:


//The class emulating the GPU's vertex cache
class CVertexCache{

 unsigned int Size; //Quantity of indices the cache can fit
 int* Indices;      //FIFO buffer storing the vertex indices

public:

 int VerticesProcessed;   //Stats: total amount of entered indices
 int VerticesTransformed; //Stats: amount of indices pushed into the buffer

private:

 //Do not allow copying as there is a dynamic array inside
 void operator = (CVertexCache){};
 CVertexCache(const CVertexCache&){};

public:

 //Destructor frees the dynamic array
 inline ~CVertexCache(){ free(Indices); memset(this,0,sizeof(CVertexCache)); }

 //Check the index-capacity of the cache
 inline int GetSize()const{ return Size; }
 //Set the desired cache-capacity; returned value is true unless malloc fails
 bool SetSize(unsigned int QuantityOfIndexSlots){
  this->~CVertexCache();
  if(!QuantityOfIndexSlots)return true;
  if((Indices=(int*)malloc(QuantityOfIndexSlots*4))==NULL)return false;
  Size = QuantityOfIndexSlots;
  for(unsigned i=0;i<Size;i++)Indices[i]=-1;
  return true;
 }//SetSize---------------------------------------------------------------------

 //Default constructor
 inline CVertexCache(){ memset(this,0,sizeof(CVertexCache)); }
 //Constructor that allows to specify the cache size at initialization
 inline CVertexCache(unsigned int QuantityOfIndexSlots){
  memset(this,0,sizeof(CVertexCache));
  SetSize(QuantityOfIndexSlots);
 }//Constructor-----------------------------------------------------------------
 
 //Empty the cache buffer and reset the statistics
 void Clear(){
  VerticesProcessed=0; VerticesTransformed=0;
  for(unsigned i=0;i<Size;i++)Indices[i]=-1;
 }//Clear-----------------------------------------------------------------------

 //Return the index value stored in the specified slot; -1 returned for
 //out-of-bounds requests or if the specified slot is empty
 inline int operator[](unsigned int SlotID)const{
  if(SlotID<0||SlotID>=Size)return -1;
  else return Indices[SlotID];
 }//Operator[]------------------------------------------------------------------
 
 //Return the number in range [1.0...0.0] representing the relative position of
 //the given vertex in the cache; the value 1.0 corresponds to the first
 //position, the value (1.f/Position) is for the last one and 0.0 returned if
 //the given index is not in the cache
 double GetPosition(int Index)const{
  for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)
   return double(Size-i)/double(Size);
  return 0.0;
 }//GetPosition-----------------------------------------------------------------

 //Introduce a new index; if it is not in the cache already, it will be pushed
 //into the FIFO buffer shifting all indices by one position; if the index is
 //the restart index, it is silently ignored
 void InputVertex(int Index, int PrimitiveRestartIndexValue=0xffffffff){
  if(Index==PrimitiveRestartIndexValue)return;
  VerticesProcessed++;
  if(!Size)return;
  for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)return;
  VerticesTransformed++;
  for(unsigned i=Size-1;i>0;i--)Indices[i]=Indices[i-1];
  Indices[0]=Index;
 }//InputVertex-----------------------------------------------------------------

 //Process all the indices of the given buffer; refer to the stats afterwards;
 //the returned value indicates how many times in average each different vertex
 //was transformed
 double Process(int* IndexBuffer, unsigned int IndexCount,
                int PrimitiveRestartIndexValue=0xffffffff){
  Clear();
  if(!IndexBuffer||!IndexCount)return 0.0;
  unsigned VertexCount=0;
  for(unsigned i=0;i<IndexCount;i++){
   int Index = IndexBuffer[i];
   if(Index==PrimitiveRestartIndexValue)continue;
   InputVertex(Index,PrimitiveRestartIndexValue);
   unsigned s = i;
   for(unsigned v=i+1;v<IndexCount;v++)if(IndexBuffer[v]==Index)s=v;
   if(s==i)VertexCount++;
  }
  if(!VertexCount)return 0.0;
  return double(VerticesTransformed)/double(VertexCount);
 }//Process---------------------------------------------------------------------

};//CVertexCache________________________________________________________________

As I understand the post TnL-targeted mesh optimization, the goal is to have each vertex transformed no more than once. Therefore, my class CVertexCache has a function Process() which emulates the GPU’s behavior dealing with the given index buffer and returns the ratio of the transformation operations to the total quantity of vertices the indices of the buffer have referenced to. According to my checking of a regular grid with row-progressing triangles, the average transformation ratio is close to 2 per vertex even with the cache size as small as 4. That means that ideal optimizing of such mesh will result in at most 2 times faster rendering. It does not seem to be a really big benefit, IMHO…

Yandersen · August 14, 2014, 8:42pm

Oh, it seems like I understood the grid-rendering problem enough to come to the conclusion. The most optimal grid rendering could be achieved only if the quantity of points in the grid is not higher than the quantity of vertices the post-transformation cache may hold. If so, then the drawing must be performed in diagonal direction, starting from one of the corners, progressing to the opposite corner of the mesh. As the size of the cache is typically very small (~24, right?), then the solution is to divide the grid onto small batches and render them individually.

The goal is to have each vertex transformed only once. For each of the grid batch this is true for all points except for the border ones: each edge is shared between two neighbouring batches, so those vertices processed twice in total; each of the 4 corner points is shared between 4 batches, so processed 4 times. Let’s calculate how many transformations are made in total and divide it by the quantity of vertices in a batch - this will tell us how many transformation operations are done for each individual vertex (in average). In best theoretical case (if the size of TnL cache is unlimited), this value is 1 (each vertex processed only once). Let’s see how fast the number of redundant transformations grows along with the reduction of the cache capacity:

GridWidthMax = CacheSize;
TotalVertices = GridWidthMaxGridWidthMax = CacheSizeCacheSize;
VerticesTransformed = { 1 * (GridWidthMax-2)^2 } + { 2 * (GridWidthMax-2) * 4 } + { 4 * 4 } = (CacheSize+2)^2
TransformationsPerVertexAvg = VerticesTransformed / TotalVertices = { (CacheSize+2)^2 } / { CacheSize^2 } = (1+2/CacheSize) ^ 2

Now let’s apply that formula to some cache sizes:


Cache size: | TransformationsPerVertex:
4           |  2.25
8           |  1.56
12          |  1.36
16          |  1.27
20          |  1.21
24          |  1.17
28          |  1.15
32          |  1.13
36          |  1.11
40          |  1.10
1024        |  1.004

Looking at the table I see almost no use for caches larger than 16 vertices. Taken this is a bit lower than the average card has, the answer for the topic, IMO, should be: divide the grid onto smaller batches 16x16 points and render them continuously in diagonal fashion.

I think, the specific grid-optimizing algorithm could take cachesize==16 as a predefined constant and go well with it.

Aleksandar · August 15, 2014, 2:53am

I’m sorry if I’m rude, but I think you are wasting your time.

The post-transform cache size vary from GPU to GPU, from generation to generation. There is no purpose to make something that works well only on your machine (or maybe there is?). Aside the fact that a 16 vertices batch is ridiculously small.

All algorithms used in last decade are cache oblivious. They are usually based on the assumption of (modified) LRU policy. Try to find Mark Kilgard’s “Modern OpenGL Usage: Using Vertex Buffer Objects Well” from 2008. It is also an old document, but it will serve the purpose to understand basic concepts.

Forsyth’s algorithm is quite old. GPU architecture changed a lot since those days. Many years ago I tried to implement several algorithms for optimizing vertex post-transform cache, and while they have some impact on G80 GPUs, there was no visible improvements on GF100 (Fermi). What I’m using since then is a regular strip-like drawing with primitive restart at the end of each strip and it serves me well.

Yandersen · August 15, 2014, 6:55am

a 16 vertices batch is ridiculously small
No doubt it is. Sry, I wasn’t clear about this - I didn’t mean that those batches have to be rendered using individual draw calls. I mean that the rendering path have to be clustered into such areas.
Using the strips along the full length of the grid will require to process each vertex twice, while the proposed 16x16 clustering will reduce the transformations count up to ~37%.

Hm, there is more to it. Actually, when rendering grid in diagonal direction, the batches could be as long as you want - 16xN, not necessarily 16x16. In such case, only the top and bottom line of the batch’ vertices have to be processed twice. In such case the total transformations to vertex counts ratio is:

TpV = 1.0 + 2/CacheSize

CS:  | TpV:
4    | 1.50
8    | 1.25
12   | 1.17
16   | 1.13
20   | 1.10
24   | 1.08
28   | 1.07
32   | 1.06
36   | 1.06
40   | 1.05
1024 | 1.002

In other words, my proposal is to draw the grid not strip-by-strip for the full length of the grid, but draw 16 lines at once, diagonally. This will reduce the vertex transformations by 44% (the theoretical maximum is 50%, in case one goes from two TpV to one). And this is achieved even on hardware with vertex cache size as small as 16 vertices.

The benefit of having the cache size 32 will be reduction of transformations for 47%, which is a negligible improvement comparing to the 44% achieved with CS==16.