glMultiDrawElements or glPrimitiveRestartIndex?

You have scratched the surface of a pretty complex topic, and I do wonder why nobody dares to answer these questions. :slight_smile:

  1. In order to preserve locality, it is sometimes better to traverse triangles in a pseudo-circular manner, using closed space-filling curves (like the Sierpinski curve). Triangle strips can rapidly progress in a certain direction and thus easily violate locality. Furthermore, long strips might also thrash the input (pre-transform) cache.

The size of the vertex post-transform cache differs from card to card. Each processing unit has its own cache, whose size is usually expressed in number of entries (transformed vertices). But even on the same GPU, the number of entries may vary depending on the number and sizes of the vertex output attributes. Furthermore, in order to parallelize vertex processing, GPUs divide a single VBO across multiple processing units and duplicate shared vertices. So even with a perfect vertex ordering there are still many redundant calculations.

In short:

  • the size of the vertex post-transform cache is usually unknown,
  • it depends on the GPU architecture and on the number and size of the vertex shader output attributes (because the cache size is expressed in number of entries, not in bytes),
  • how a VBO will be split across multiple execution units is also unknown,
  • whether the cache uses a FIFO or an LRU policy is also unknown.
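Since the real parameters are unknown, the best you can do offline is simulate a cache. Here is a minimal sketch (the function name and the assumed FIFO policy are mine) that replays an index buffer through a simulated FIFO post-transform cache and reports the ACMR, i.e. the number of transformed vertices per triangle:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Replay an index buffer through a simulated FIFO post-transform cache
// and return the ACMR (transformed vertices per triangle). The cache
// size is an assumed parameter, since the real one is usually unknown.
double acmr_fifo(const std::vector<unsigned>& indices, std::size_t cacheSize)
{
    std::deque<unsigned> cache;              // front = oldest entry
    std::size_t misses = 0;
    for (unsigned idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            ++misses;                        // vertex must be (re)transformed
            cache.push_back(idx);
            if (cache.size() > cacheSize)
                cache.pop_front();           // FIFO eviction
        }
    }
    return static_cast<double>(misses) / (indices.size() / 3);
}
```

On a large regular grid a good ordering approaches an ACMR of 0.5; the worst case is 3.0, when every index misses.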

You could run some benchmarks (even at run time) to determine all those parameters, or do it off-line and optimize your application for a certain architecture.

But there are much better methods. Many cache-optimization algorithms are cache-oblivious: assuming an LRU strategy, they try to exploit vertex locality as much as possible. They are not optimal, but they are good enough for a wide variety of hardware.

There are many algorithms for optimizing vertex post-transform cache usage. Some of them achieve a better ACMR (average cache miss ratio), while others are extremely fast and can be used for real-time optimization.
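To illustrate the cache-oblivious idea with a toy sketch (this is *not* Tipsify or Forsyth's algorithm, and all names are mine): simulate a small LRU cache and, at each step, greedily emit whichever remaining triangle has the most vertices already in the cache. It is quadratic in the triangle count, so real optimizers use much cleverer bookkeeping:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Greedy cache-oblivious reorder (toy sketch): simulate an LRU cache
// and always emit the remaining triangle with the best cache overlap.
std::vector<unsigned> reorder_greedy(const std::vector<unsigned>& in,
                                     std::size_t cacheSize)
{
    const std::size_t triCount = in.size() / 3;
    std::vector<bool> emitted(triCount, false);
    std::deque<unsigned> cache;              // front = most recently used
    std::vector<unsigned> out;
    out.reserve(in.size());

    auto touch = [&](unsigned v) {
        auto it = std::find(cache.begin(), cache.end(), v);
        if (it != cache.end()) cache.erase(it);
        cache.push_front(v);                 // LRU update
        if (cache.size() > cacheSize) cache.pop_back();
    };

    for (std::size_t n = 0; n < triCount; ++n) {
        std::size_t best = 0;
        int bestScore = -1;
        for (std::size_t t = 0; t < triCount; ++t) {
            if (emitted[t]) continue;
            int score = 0;                   // vertices already in cache
            for (int k = 0; k < 3; ++k)
                if (std::find(cache.begin(), cache.end(), in[3 * t + k]) != cache.end())
                    ++score;
            if (score > bestScore) { bestScore = score; best = t; }
        }
        emitted[best] = true;
        for (int k = 0; k < 3; ++k) {
            out.push_back(in[3 * best + k]);
            touch(in[3 * best + k]);
        }
    }
    return out;
}
```

The output contains exactly the same triangles, only in a different, more cache-friendly order, so it can be uploaded as a drop-in replacement index buffer.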

To make things even more complicated, efficiently utilizing early-Z for very complex objects with expensive fragment shaders requires calculating and storing several index buffers per object, selected according to the viewing angle.
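The per-frame selection itself can be cheap. As a sketch (the buffer layout and names are my assumptions): store one roughly front-to-back-sorted index buffer per principal axis and pick the one whose axis best matches the current view direction:

```cpp
#include <array>
#include <cstddef>

// Given a view direction, pick which of six precomputed index buffers
// (one per principal axis, each sorted roughly front to back for that
// axis) to bind before drawing. Returns the buffer slot index.
std::size_t pick_index_buffer(float vx, float vy, float vz)
{
    const std::array<std::array<float, 3>, 6> axes = {{
        {1, 0, 0}, {-1, 0, 0}, {0, 1, 0}, {0, -1, 0}, {0, 0, 1}, {0, 0, -1}
    }};
    std::size_t best = 0;
    float bestDot = -1e30f;
    for (std::size_t i = 0; i < axes.size(); ++i) {
        // Larger dot product = axis closer to the view direction.
        float d = axes[i][0] * vx + axes[i][1] * vy + axes[i][2] * vz;
        if (d > bestDot) { bestDot = d; best = i; }
    }
    return best;
}
```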

  1. Yes and no! Yes, it is possible, but it is usually not reasonable because of the wide variety of hardware. I think I already answered this question with the previous one. :wink:

The Sierpinski gasket is not space-filling, it is more Emmental-filling …
Try a Hilbert curve instead, as detailed on page 556 of “Real-Time Rendering” by Tomas Möller, Eric Haines, and Naty Hoffman.
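For reference, here is the classic bit-manipulation conversion from grid coordinates to a distance along the Hilbert curve (a sketch with my own naming). Quantize each triangle's centroid onto a power-of-two grid, convert it with this function, and sort triangles by the result to get a pseudo-circular, locality-preserving traversal:

```cpp
#include <cstdint>
#include <utility>

// Convert (x, y) on an n x n grid (n a power of two) to its distance
// along the Hilbert curve. Sorting triangles by the Hilbert index of
// their quantized centroid preserves locality far better than a long
// strip marching in one direction.
uint64_t xy2d(uint32_t n, uint32_t x, uint32_t y)
{
    uint64_t d = 0;
    for (uint32_t s = n / 2; s > 0; s /= 2) {
        uint32_t rx = (x & s) ? 1 : 0;
        uint32_t ry = (y & s) ? 1 : 0;
        d += static_cast<uint64_t>(s) * s * ((3 * rx) ^ ry);
        // Rotate/flip the quadrant into canonical position.
        if (ry == 0) {
            if (rx == 1) {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::swap(x, y);
        }
    }
    return d;
}
```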

Yeah, I can imagine that. As I said, I didn’t want to touch it, it’s way too advanced for me, but the discussion got started. Hell, after all it’s my thread :slight_smile:

I guess nobody is answering because I’m too annoying :smiley:

Damn. Where do I get all this information? Looks like important stuff to me.

What kind of benchmarks? Like progressively increasing the size of a buffer and seeing when the performance goes down?

uhmmm… does that mean that, in the end, triangle strips are not so useful?

I am not sure what you mean.
What I’m proposing is to get information about the cache on the specific hardware, for example when the application is installed, or when it is loaded. Then, that single time, you can run your optimization and save the results. So you will have a layout that is optimal for that specific machine.
Something similar to what you said about the benchmarks, just without the benchmarks: you get the information in a more efficient way, if that’s possible.

I’m on it :slight_smile:

Well, testing capabilities at startup time, or whenever the hardware (or driver) changes, is a feasible solution. Just keep in mind that different shaders have different output attributes. It would probably be better to use just a single strategy based on some cache-oblivious approach. If you do decide to discover the cache size, it is better to guess a good starting size and then try to adjust it. In my application, 32 entries has proved to be a good guess. :slight_smile:
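The “guess then adjust” step can be as simple as a coarse-to-fine probe around the starting guess. This is only a sketch with names of my own invention: `measure` is a hypothetical callback that, in a real application, would time rendering with an index buffer optimized for the candidate cache size, and smaller returned cost means better:

```cpp
#include <cstddef>
#include <functional>
#include <initializer_list>

// Start from an assumed cache size (32 entries, as suggested above)
// and probe neighbouring sizes with shrinking steps, keeping whichever
// assumed size gives the smallest measured cost.
std::size_t tune_cache_size(const std::function<double(std::size_t)>& measure,
                            std::size_t guess = 32)
{
    double best = measure(guess);
    for (std::size_t step = guess / 2; step > 0; step /= 2) {
        for (std::size_t cand : {guess > step ? guess - step : 1,
                                 guess + step}) {
            double cost = measure(cand);
            if (cost < best) { best = cost; guess = cand; }
        }
    }
    return guess;
}
```

In practice the measurement is noisy, so each candidate would be timed over many frames and averaged before comparing.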