Bindless Stuff

I get results like this:

Please use code tags to make tables like that more legible.

Thanks for the tests, but this test hardware and method make me suspect that you are very likely going to be GPU limited much of the time (large triangles = lots of fill, and this is a slow card).

Where you are going to see the most benefit from bindless is where you’re “not” waiting on your GPU to get the work done. You’re waiting on your CPU to pump the batches. That is, in cases where your GPU is fairly fast and your CPU/CPU memory is relatively slow, such that you just can’t keep the GPU fed.

Also, as Alfonse pointed out, for the maximum benefit you need to be rendering a lot of “different” batches from different buffers. This maximizes your chance of cache misses, which is where bindless shines. And don’t render super-large triangles: to maximize the bindless benefit, the goal is not to be GPU limited here.

VAOs were in many cases the same speed as just using VBOs, possibly when limited by something else…

Which strongly suggests your test program is not CPU/batch submit limited for those cases, which is where you’re going to get the max speed-up from bindless.

It was mostly to see if I could reach anywhere near the 7x speedup that was achieved in NVidia’s test-case, and where/when bindless started to have an effect.

To maximize the bindless benefit, you want a fast GPU and a relatively slow CPU/CPU memory (e.g. slow memory clock, smaller memory caches, etc.) and batches that aren’t super-huge (so CPU batch-setup overhead is a bigger share of the work). The benefit is going to be different for different hardware, but it shouldn’t ever net you a loss.

To make that test setup even uglier, you could run other threads on other cores that share the same CPU caches, pushing data out of the cache and causing more cache misses. But just running enough different batches through one thread should do that too.

Bindless + VAO combined seem to be faster than VBOs, but not as fast as bindless by itself

That is my experience too. Don’t stack VAOs on top of bindless – you lose perf. Bindless gives you everything VAOs give you and more.

This may be due to the expense of having a bazillion little VAOs floating around in the driver, each of which can cause cache misses when accessed. Dunno. But bindless apparently avoids this overhead by letting you store nearly all of the VAO state on your side, in the data structures you store your batches in, which are already in the cache anyway while you’re submitting draw calls.
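For the record, here’s roughly what that looks like with NVidia’s bindless extensions (just a sketch, assuming GL_NV_vertex_buffer_unified_memory and GL_NV_shader_buffer_load are available; the Batch struct, the vbo/ibo handles, and the batches container are made up for illustration):

// Per-batch state you keep on your side -- no VAO, no buffer binds at draw time.
struct Batch {
    GLuint64EXT vboAddr;            // GPU address of the vertex buffer
    GLuint64EXT iboAddr;            // GPU address of the index buffer
    GLsizeiptr  vboSize, iboSize;
    GLsizei     indexCount;
};

// Once, at load time: make each buffer resident and cache its GPU address.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &batch.vboAddr);

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glMakeBufferResidentNV(GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &batch.iboAddr);

// Per frame: feed GPU addresses straight from your own batch structs.
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(float) * 3);

for (const Batch& b : batches) {    // batches: e.g. a std::vector<Batch> built at load time
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, b.vboAddr, b.vboSize);
    glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, 0, b.iboAddr, b.iboSize);
    glDrawElements(GL_TRIANGLES, b.indexCount, GL_UNSIGNED_INT, 0);
}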

Right (emphasis mine):

Besides including irrelevant “cruft”, FPS is the inverse of time, and thus varies non-linearly with time (which is one reason it’s fairly useless). For instance, the performance difference between 80 and 90 fps is actually “greater than” (i.e. more impressive than) the performance difference between 125 fps and 150 fps. Why? Well, invert to seconds/frame and see. And if you have to invert to make sense out of this nonsense anyway, why use FPS at all? Just use milliseconds (ms).
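To spell out the arithmetic in that quote (my numbers, not part of the quoted post): 1/80 = 12.5 ms and 1/90 ≈ 11.1 ms per frame, an improvement of about 1.4 ms, whereas 1/125 = 8.0 ms and 1/150 ≈ 6.7 ms, an improvement of only about 1.3 ms. So the 80→90 fps jump really is the bigger win.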

Thing is, sometimes you want to draw 1000 little boxes, or 5000 little balls, all photocopies of each other (or slight munges). In those cases, instancing shines. (…if you don’t care about culling efficiency.)

But sometimes you really do want lots of varied content, and instancing is like hammering in a screw. It’s not the right solution.

You want cheaper batches. And that’s what bindless gives you.

It also avoids some of the contortions you end up doing to efficiently cull instances. Instances can really kill your perf through loss of frustum-culling granularity if you’re not careful. Faster batches mean you can tolerate smaller instance groups, which means better frustum culling from the get-go.
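Roughly what I mean (a hypothetical sketch; InstanceGroup, AABB, Matrix4, the frustum test, and UploadTransforms() are made-up helpers, not any real API): split the instances into smaller groups, cull each group’s bounding volume, and issue one instanced draw per visible group.

#include <vector>

struct InstanceGroup {
    AABB                 bounds;       // bounding box around every instance in the group
    std::vector<Matrix4> transforms;   // per-instance transforms
    GLsizei              indexCount;
};

for (const InstanceGroup& g : groups) {
    if (!FrustumIntersects(frustum, g.bounds))
        continue;                                  // the whole group is culled with one test
    UploadTransforms(g.transforms);                // e.g. into a UBO or texture buffer
    glDrawElementsInstanced(GL_TRIANGLES, g.indexCount, GL_UNSIGNED_INT,
                            0, (GLsizei)g.transforms.size());
}

The cheaper each draw call is, the smaller you can make those groups without becoming CPU limited, and the tighter your culling gets.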

Instances can really kill your perf through loss of frustum-culling granularity if you’re not careful.

If by “careful” you mean “I frustum cull my instances”, then yes.

Instancing has nothing to do with frustum culling, unless you’re only thinking of static instancing. In which case, you should say that.

I’m glad that bindless has finally achieved such attention (after one year of existence). :slight_smile:

Well, I don’t like generic tests because they show nothing. If someone reports a 2x speed boost in a real application, then that commands respect. Bindless can achieve that if there are thousands of VBOs, even on fast CPUs with enough cache.

Before going deeper into analysis, it would be useful to clarify some facts about the test.

First, is there a glFinish() call at the end of the drawing method? If there is no such call, then the results are not valid. I have a lot of experience with NVIDIA drivers on Windows, and my early tests (a few years ago) were not valid because of that.

Second, what method (function) is used to measure the time? On Windows I suggest using QueryPerformanceCounter(). (Don’t laugh at me; I know about the bugs on some motherboards, but those are in the past, and even if you still have such a board, measuring small intervals excludes the error.)

Third, it can be useful to find the bottleneck of the application to justify the frame-rate. Currently I’m investigating debuggers/profilers for OpenGL in order to purify my code. Those tools can really be useful. (By the way I’m a little bit disappointed by Nexus. :frowning: Or maybe I expected too much…)

And as for the units of the measured values, my opinion is that the pseudo-frame-rate is much better for most readers than ms. The pseudo-frame-rate is just the inverse of the time something takes, but the measured interval is terminated before SwapBuffers (or a similar frame terminator), and any screen-synchronization routine should be eliminated.
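Just to illustrate what I mean (a rough sketch; wglSwapIntervalEXT comes from WGL_EXT_swap_control and has to be loaded via wglGetProcAddress, and hdc / DrawScene() are assumed to exist):

wglSwapIntervalEXT(0);              // disable vsync so it cannot skew the numbers

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
DrawScene();
glFinish();                         // stop the clock when the drawing is really done...
QueryPerformanceCounter(&t1);
SwapBuffers(hdc);                   // ...before the frame terminator

double seconds   = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
double pseudoFps = 1.0 / seconds;   // the "pseudo-frame-rate"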

Alfonse, what is your problem? You appear to be attacking someone who is trying to help and, unless I missed a paypal debit, he’s doing it for free. Measure your language or bugger off. I’m finding this useful. To everyone else, thank you for your efforts. I shall continue reading until I have something to contribute.

No, but since performance is measured over many frames, it shouldn’t be needed, should it? (Well, maybe one glFinish() after the final frame, but it keeps a running total of framerate, so doing glFinish() after every frame would hurt performance.)

It’s using the built in GLScene performance monitoring code, the relevant parts look like this:


// Stripped-down render loop
if FFrameCount = 0 then
  QueryPerformanceCounter(FFirstPerfCounter);   // start of the measured interval

// ... render the scene here ...

if not (roNoSwapBuffers in ContextOptions) then
  RenderingContext.SwapBuffers;
Inc(FFrameCount);
QueryPerformanceCounter(perfCounter);
Dec(perfCounter, FFirstPerfCounter);            // ticks elapsed since the first frame

// average FPS = frames / (elapsed ticks / counter frequency)
if perfCounter > 0 then
  FFramesPerSecond := (FFrameCount * vCounterFrequency) / perfCounter;


TGLSceneBuffer = class(TGLUpdateAbleObject)
  ...
  public
    {: Current FramesPerSecond rendering speed.
       You must keep the renderer busy to get accurate figures from this
       property.
       This is an average value; to reset the counter, call
       ResetPerformanceMonitor. }
    property FramesPerSecond: Single read FFramesPerSecond;
    {: Resets the performance monitor and begins a new statistics set.
       See FramesPerSecond. }
    procedure ResetPerformanceMonitor;
end;


procedure TGLSceneBuffer.ResetPerformanceMonitor;
begin
  FFramesPerSecond := 0;
  FFrameCount := 0;
  FFirstPerfCounter := 0;
end;

It’s basically just counting the number of frames rendered since you reset the performance monitor, and dividing by the total time between just before the first frame after the reset and just after the last frame rendered.

Typically you’d query FramesPerSecond every couple of seconds and reset the performance monitor straight after with ResetPerformanceMonitor(), so the displayed framerate stays responsive.


FPS := GLSceneViewer1.FramesPerSecond;
GLSceneViewer1.ResetPerformanceMonitor();

Shouldn’t you use timer query?
http://www.opengl.org/registry/specs/ARB/timer_query.txt

Great job Dan anyway, it’s complicated to have this sort of thing sorted but it’s good to have some numbers.

(I am still wondering how VAOs ended up in OpenGL 3. Who could ever see any good in a feature designed this way? Oo)

I am still wondering how VAOs ended up in OpenGL 3. Who could ever see any good in a feature designed this way?

Apple has had VAOs around for years. It seemed to work for them.

Define “work”? :stuck_out_tongue:

Define “work”?

It does what they wanted it to.

Well, measuring the time and dividing it by the number of frames drawn in that period is not a very accurate way to discover how much time is actually spent in the drawing itself. At the very least, I don’t want screen synchronization to take its part, and in real apps there can be a piece of code executing between two consecutive draws. That’s why many on this forum insist on ms and not on fps. So the idea is to measure the time taken for each frame, not a time-span interval across many frames.


// E.g. (q1, q2 are LARGE_INTEGERs; CalcTime() converts the counter delta to ms)
QueryPerformanceCounter(&q1);
DrawScene();
glFinish();                      // wait until the GPU has actually finished drawing
QueryPerformanceCounter(&q2);
time = CalcTime(q1, q2);
SwapBuffers();
// Etc...

You might want to supplement the CPU time with GPU timings, as retrieved by the EXT/ARB_timer_query extension.

So you reckon this would provide more useful results?


glFinish();
glBeginQuery(GL_TIME_ELAPSED, timerQuery);
DrawScene();
glFinish();
glEndQuery(GL_TIME_ELAPSED);
glGetQueryObjectiv(timerQuery, GL_QUERY_RESULT, @timeElapsed);
// Calc average time elapsed

I assume it also requires an initial glFinish() before starting the timer query?
I’m not convinced running in a completely clean pipeline would give the most realistic results, but it will help eliminate cases where the fastest numbers are limited by something else while the slower numbers are limited by what you are actually measuring.

The first glFinish is not necessary, especially if there is only one thread drawing (and the previous frame’s glFinish already committed the drawing). But if you like, you can include even that. :slight_smile:

The purpose of timer query is to AVOID glFinish.

But I’m not sure it’s a key point for your test.

It is crucial! Because we want to know the exact moment when the driver finishes the drawing, not when it accepts the command.

That’s what the timer query does…

The timer query sits in the command queue. It takes its start and end times from when it is processed by the thread that processes the command queue, so you never have to stall to get accurate timing. Well, unless you call glGetQueryObject* too soon, but it can have a frame+1 latency with no trouble.
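Something like this (just a sketch; the frameIndex bookkeeping and DrawScene() are placeholders):

// Created once: two query objects, ping-ponged so the result is read a frame late.
GLuint timerQueries[2];
glGenQueries(2, timerQueries);

// Each frame (frameIndex is your own frame counter):
int cur  = frameIndex & 1;
int prev = cur ^ 1;

glBeginQuery(GL_TIME_ELAPSED, timerQueries[cur]);
DrawScene();
glEndQuery(GL_TIME_ELAPSED);

if (frameIndex > 0) {
    // The previous frame's result should be ready by now, so this never stalls.
    GLint available = 0;
    glGetQueryObjectiv(timerQueries[prev], GL_QUERY_RESULT_AVAILABLE, &available);
    if (available) {
        GLuint64 gpuTimeNs = 0;
        glGetQueryObjectui64v(timerQueries[prev], GL_QUERY_RESULT, &gpuTimeNs);
        // gpuTimeNs = GPU time spent inside DrawScene(), in nanoseconds
    }
}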

Oh, you are talking about the new GL 3.3 extension – GL_ARB_timer_query. Sorry, I didn’t understand!
Well, I haven’t tried it yet. And I cannot rely on it, because most cards/drivers do not implement GL 3.3.

Although it would be interesting to compare results of GL_ARB_timer_query with “the old method”. :wink:
Thanks for the suggestion!

I should have quoted the extension. It didn’t occur to me that Windows calls this a timer query too! Fun :stuck_out_tongue:

Timer query is actually pretty old; it has been supported through the
GL_EXT_timer_query extension on NVIDIA hardware going back to the GeForce 6.
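So on such hardware you can fall back to the EXT entry points. A rough sketch (the extension check is simplified, DrawScene() is assumed, and GL_TIME_ELAPSED_EXT / glGetQueryObjectui64vEXT come from GL_EXT_timer_query):

#include <cstring>

if (strstr((const char*)glGetString(GL_EXTENSIONS), "GL_EXT_timer_query")) {
    GLuint q;
    glGenQueries(1, &q);

    glBeginQuery(GL_TIME_ELAPSED_EXT, q);
    DrawScene();
    glEndQuery(GL_TIME_ELAPSED_EXT);

    GLuint64EXT ns = 0;
    glGetQueryObjectui64vEXT(q, GL_QUERY_RESULT, &ns);   // blocks until the GPU is done
    glDeleteQueries(1, &q);
}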