Good to hear you got it sorted.
Out of interest, does aligning the one-byte color index values on 4-byte boundaries, while still keeping them in a separate buffer object, give any improvement over having them tightly packed? Interleaving them with the other vertex attributes is usually better still, if that is possible.
Have you remembered to switch back-face culling on too? (Never mind, I've already seen you mention this in another post.)
Maybe now that the major bottleneck is gone, the other optimizations such as rendering front to back will have an effect, although that one matters more when rendering fragments is expensive.
If matrix multiplication becomes a bottleneck (although 256 matrix multiplications shouldn’t be), then one trick you could do if the chunks are always aligned with the world is to simplify the matrix multiplication.
If you have a constant view-projection matrix across all chunks:
[a e i m]
[b f j n]
[c g k o]
[d h l p]
And you are multiplying it by the chunk model matrix that is simply a translation from the origin:
[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]
Then the matrix multiplication could be simplified to:
[a e i m] [1 0 0 x]   [a e i ax+ey+iz+m]
[b f j n] [0 1 0 y] = [b f j bx+fy+jz+n]
[c g k o] [0 0 1 z]   [c g k cx+gy+kz+o]
[d h l p] [0 0 0 1]   [d h l dx+hy+lz+p]
Only the last column varies from chunk to chunk, so each chunk costs 12 multiplies and 12 adds instead of a full 4x4 matrix product.
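In case it helps, here is a rough C sketch of that shortcut. It assumes the matrices are stored as 16 floats in column-major order (the usual OpenGL layout, so a, b, c, d are elements 0 to 3 and m, n, o, p are elements 12 to 15). The function name chunk_mvp is made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: combine a view-projection matrix with a pure
 * translation (x, y, z). Both matrices are 4x4, column-major, so the
 * element at row r, column c lives at index c * 4 + r. */
void chunk_mvp(const float vp[16], float x, float y, float z, float out[16])
{
    assert(vp != 0 && out != 0);

    /* The first three columns of the product equal those of vp. */
    for (int i = 0; i < 12; ++i) {
        out[i] = vp[i];
    }

    /* Last column: row r becomes a*x + e*y + i*z + m (and likewise for
     * the b, c and d rows). */
    for (int r = 0; r < 4; ++r) {
        out[12 + r] = vp[r] * x + vp[4 + r] * y + vp[8 + r] * z + vp[12 + r];
    }
}
```

You would call this once per visible chunk with the chunk's world position and upload out as the combined matrix.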
Does removing the glBindVertexArray call have much of an impact on performance? If it does, then putting everything into one buffer object, as other people have mentioned, might help: use glBufferSubData to stream in new chunks and glDrawRangeElements to draw each visible chunk. If you can use more recent extensions, glMapBufferRange (to write to a range of the buffer) and glDrawElementsBaseVertex (to allow use of a shared index buffer) could also be useful.
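One possible way to lay out such a shared buffer is to give every chunk a fixed-size slot, so that upload offsets and base vertices are simple arithmetic. The sketch below is only that arithmetic; the constants and function names are assumptions for illustration, not anything from your code, and the GL calls appear only in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout constants; tune these to the real chunk format. */
enum {
    MAX_CHUNKS          = 256,  /* assumed number of resident chunks */
    MAX_VERTS_PER_CHUNK = 4096, /* assumed upper bound per chunk */
    VERTEX_SIZE_BYTES   = 16    /* assumed size of one vertex */
};

/* First vertex of a slot. This is the value that would be passed as the
 * basevertex argument, e.g.
 *   glDrawElementsBaseVertex(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT,
 *                            0, slot_base_vertex(slot)); */
int slot_base_vertex(int slot)
{
    assert(slot >= 0 && slot < MAX_CHUNKS);
    return slot * MAX_VERTS_PER_CHUNK;
}

/* Byte offset of a slot inside the shared vertex buffer. This is the
 * offset that would be used when streaming a chunk in, e.g.
 *   glBufferSubData(GL_ARRAY_BUFFER, (GLintptr)slot_byte_offset(slot),
 *                   chunk_bytes, chunk_vertices); */
size_t slot_byte_offset(int slot)
{
    return (size_t)slot_base_vertex(slot) * VERTEX_SIZE_BYTES;
}

/* Total size to allocate once for the shared vertex buffer. */
size_t shared_buffer_bytes(void)
{
    return (size_t)MAX_CHUNKS * MAX_VERTS_PER_CHUNK * VERTEX_SIZE_BYTES;
}
```

Fixed slots waste some memory on sparse chunks, but they keep every chunk's indices relative to zero, which is what lets a single index buffer be shared.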