I was going to implement Tom Forsyth’s optimization, but found the problem to be too interesting for code’s copy-pasting - want to “reinvent” that on my own. 
But before I start, I want to understand how the vertex cache works exactly.
Say, I rendered a triangle with indices {1,2,3}, so they are all in the cache now.
Q1: does the cache look like {3,2,1}?
Then another triangle with vertices {4,2,3} comes in.
Q2: how the cache look like now? Is it {4, 3,2,1} or {3,2,4, 1}?
In other words, what happens with the reused vertices which reside in the cache already - are those being relocated to the top of the cache or they maintain their previous positions and only the new vertices are added to the top of the cache?
So far I assume that the vertex cache buffer corresponds to simple FIFO model. If so, then the class emulating it should look like this:
//The class emulating the GPU's vertex cache
class CVertexCache{
unsigned int Size; //Quantity of indices the cache can fit
int* Indices; //FIFO buffer storing the vertex indices
public:
int VerticesProcessed; //Stats: total amount of entered indices
int VerticesTransformed; //Stats: amount of indices pushed into the buffer
private:
//Do not allow copying as there is a dynamic array inside
void operator = (CVertexCache){};
CVertexCache(const CVertexCache&){};
public:
//Destructor frees the dynamic array
inline ~CVertexCache(){ free(Indices); memset(this,0,sizeof(CVertexCache)); }
//Check the index-capacity of the cache
inline int GetSize()const{ return Size; }
//Set the desired cache-capacity; returned value is true unless malloc fails
bool SetSize(unsigned int QuantityOfIndexSlots){
this->~CVertexCache();
if(!QuantityOfIndexSlots)return true;
if((Indices=(int*)malloc(QuantityOfIndexSlots*4))==NULL)return false;
Size = QuantityOfIndexSlots;
for(unsigned i=0;i<Size;i++)Indices[i]=-1;
return true;
}//SetSize---------------------------------------------------------------------
//Default constructor
inline CVertexCache(){ memset(this,0,sizeof(CVertexCache)); }
//Constructor that allows to specify the cache size at initialization
inline CVertexCache(unsigned int QuantityOfIndexSlots){
memset(this,0,sizeof(CVertexCache));
SetSize(QuantityOfIndexSlots);
}//Constructor-----------------------------------------------------------------
//Empty the cache buffer and reset the statistics
void Clear(){
VerticesProcessed=0; VerticesTransformed=0;
for(unsigned i=0;i<Size;i++)Indices[i]=-1;
}//Clear-----------------------------------------------------------------------
//Return the index value stored in the specified slot; -1 returned for
//out-of-bounds requests or if the specified slot is empty
inline int operator[](unsigned int SlotID)const{
if(SlotID<0||SlotID>=Size)return -1;
else return Indices[SlotID];
}//Operator[]------------------------------------------------------------------
//Return the number in range [1.0...0.0] representing the relative position of
//the given vertex in the cache; the value 1.0 corresponds to the first
//position, the value (1.f/Position) is for the last one and 0.0 returned if
//the given index is not in the cache
double GetPosition(int Index)const{
for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)
return double(Size-i)/double(Size);
return 0.0;
}//GetPosition-----------------------------------------------------------------
//Introduce a new index; if it is not in the cache already, it will be pushed
//into the FIFO buffer shifting all indices by one position; if the index is
//the restart index, it is silently ignored
void InputVertex(int Index, int PrimitiveRestartIndexValue=0xffffffff){
if(Index==PrimitiveRestartIndexValue)return;
VerticesProcessed++;
if(!Size)return;
for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)return;
VerticesTransformed++;
for(unsigned i=Size-1;i>0;i--)Indices[i]=Indices[i-1];
Indices[0]=Index;
}//InputVertex-----------------------------------------------------------------
//Process all the indices of the given buffer; refer to the stats afterwards;
//the returned value indicates how many times in average each different vertex
//was transformed
double Process(int* IndexBuffer, unsigned int IndexCount,
int PrimitiveRestartIndexValue=0xffffffff){
Clear();
if(!IndexBuffer||!IndexCount)return 0.0;
unsigned VertexCount=0;
for(unsigned i=0;i<IndexCount;i++){
int Index = IndexBuffer[i];
if(Index==PrimitiveRestartIndexValue)continue;
InputVertex(Index,PrimitiveRestartIndexValue);
unsigned s = i;
for(unsigned v=i+1;v<IndexCount;v++)if(IndexBuffer[v]==Index)s=v;
if(s==i)VertexCount++;
}
if(!VertexCount)return 0.0;
return double(VerticesTransformed)/double(VertexCount);
}//Process---------------------------------------------------------------------
};//CVertexCache________________________________________________________________
As I understand the post TnL-targeted mesh optimization, the goal is to have each vertex transformed no more than once. Therefore, my class CVertexCache has a function Process() which emulates the GPU’s behavior dealing with the given index buffer and returns the ratio of the transformation operations to the total quantity of vertices the indices of the buffer have referenced to. According to my checking of a regular grid with row-progressing triangles, the average transformation ratio is close to 2 per vertex even with the cache size as small as 4. That means that ideal optimizing of such mesh will result in at most 2 times faster rendering. It does not seem to be a really big benefit, IMHO…