NVidia 5x slower than ATI

I’ve been bitten by OpenGL’s VBO model before and really wish that the D3D model had just been copied straight. You get exactly what you ask for and if it’s not supported it crashes. I do understand though that such an approach is contrary to the overall ethos of OpenGL, so it was never going to happen.

But at the very least there’s room for either a significant improvement in the documentation or a FAQ entry listing the common usage scenarios and saying: “this is what you need to do” for each. Behaviour that’s not clearly defined is never a good idea once you start getting down ‘n’ dirty with the hardware.

With VAs, or with VBOs using the DYNAMIC hint, I get slow NVidia performance, typically 5x or more slower than ATI.
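For anyone reading along, here is roughly what the two paths being compared look like. This is just a sketch with my own names and a plain float position array, not the actual application code, and extension-loading boilerplate is omitted:

```c
#include <GL/gl.h>

/* Path A: classic client-side vertex array (VA).  The driver has to pull the
   vertex data out of application memory every time the draw call is issued. */
void draw_with_vertex_array(const GLfloat *verts, const GLuint *indices, GLsizei indexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, 0);            /* make sure no VBO is bound */
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);      /* pointer into CPU memory */
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, indices);
}

/* Path B: the same data placed in a VBO created with the DYNAMIC hint.
   At draw time the pointer argument becomes a byte offset into the buffer,
   e.g. glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0). */
GLuint create_dynamic_vbo(const GLfloat *verts, GLsizeiptr bytes)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_DYNAMIC_DRAW);
    return vbo;
}
```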

What are you doing with the buffer objects? Specifically, what is your usage pattern? Are you respecifying the data every frame, or are you respecifying it rarely?

What I find irritating is that ATI’s OpenGL driver is significantly better optimized than NVidia’s.

No, they are not. ATI simply ignores your hints; they’ve said so themselves. Programmers who never understood what the hints actually mean would constantly misuse hints. Eventually, ATI got fed up with it and just did things solely based on your usage patterns.

NVIDIA actually cares about what hints you’re using. Which, quite frankly, is the way it should be.

Both.

  1. If the user is simply navigating the scene (moving the camera), then the buffer data contents are not being changed.
  2. If the user is modifying one of the object’s attributes (e.g. vertices or color), then the buffer data contents are being changed each frame while the user is performing the modifications, so that they get visual feedback of the changes.

According to the hints, if I only had case #1, where the buffer data contents are not changed very often, I should be able to use DYNAMIC; but I can’t, because it’s too slow on NVidia.
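For case #2 the per-frame update would look something like the sketch below; the function and parameter names are mine, and it assumes the edit touches a single contiguous range of the buffer:

```c
#include <GL/gl.h>

/* Case #2: the user is dragging a vertex or changing a colour, so the modified
   range of the buffer is re-specified every frame while the edit is in
   progress.  Case #1 (camera movement only) never reaches this code at all. */
void upload_edited_range(GLuint vbo, GLintptr offsetBytes, GLsizeiptr sizeBytes,
                         const void *updatedData)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Overwrite only the range that changed; the rest of the buffer stays put. */
    glBufferSubData(GL_ARRAY_BUFFER, offsetBytes, sizeBytes, updatedData);
}
```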

The hints have nothing at all to do with my comment in my previous post. I am referring to whether each vendor is using DMA to transfer the buffer data or not.

I said that ATI is apparently DMA’ing for VAs (Vertex Arrays) and for all VBO hints – NVidia is not; they appear to be doing memcpys for VAs and DYNAMIC VBOs.

So NVidia’s performance on VAs and DYNAMIC VBOs is terrible compared to ATI.
This is on every system that I have tested.

The only way I can get any performance on NVidia, whether I change the buffer data on each frame or never change it at all, is to use STREAM VBOs. I can’t use VAs or DYNAMIC VBOs with NVidia because they are too slow, 5x to 10x slower than ATI.

I am referring to whether each vendor is using DMA to transfer the buffer data or not.

This may be a semantic point, but there’s no way to get data to the GPU without a DMA. So the question isn’t whether a DMA happens or not. It’s how long the operation makes the CPU/GPU wait.

I’d guess that your usage pattern tells the ATI implementation to keep a second instance of your buffer around, so that when you re-upload the whole thing, it will go to an already allocated piece of memory.

The NVIDIA implementation, under the assumption that changes to the buffer will be relatively uncommon (and therefore that you don’t care much about upload performance, since a single slow frame isn’t a problem), will have to either wait until the rendering is done to start a DMA, or allocate a piece of GPU memory right then and there. Either way, not a fast operation.
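The usual way to dodge that wait on an implementation that respects the hints is to orphan the buffer before refilling it, so the driver can hand back fresh storage instead of stalling on the copy the GPU is still reading. A rough sketch, not anyone’s actual code here:

```c
#include <GL/gl.h>

/* Re-specify a whole VBO without making the CPU wait for the GPU to finish
   reading the old contents: the NULL allocation "orphans" the previous
   storage, and the driver can return a fresh block immediately. */
void respecify_buffer(GLuint vbo, GLsizeiptr bytes, const void *data, GLenum usageHint)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, usageHint);  /* orphan old storage */
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);       /* fill the new storage */
}
```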

Thanks for the information.
I don’t agree with it, but that is ok. :slight_smile:

By the way, you may wish to re-check your information, because it is in fact possible to use memcpy to perform CPU-GPU, GPU-CPU, and GPU-GPU buffer copying into either video memory or the AGP aperture.

Just a couple of additional notes for anyone who has been following this thread.

On the NVidia systems that I have tested, I can use either DYNAMIC or STREAM and get the same VBO performance if I use an interleaved VNC array. Separate VNC arrays cause DYNAMIC to render slower than STREAM. Odd, but of course interleaved is usually best.
Separate or interleaved arrays perform the same on ATI.
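For reference, an interleaved VNC layout of the kind discussed here looks roughly like this; the struct and function names are mine, not the application’s:

```c
#include <stddef.h>   /* offsetof */
#include <GL/gl.h>

/* One interleaved VNC vertex: position, normal, and a packed 4-byte colour. */
typedef struct {
    GLfloat pos[3];
    GLfloat normal[3];
    GLubyte color[4];
} VertexVNC;

/* Point the fixed-function arrays at an interleaved VBO.  The stride is the
   size of one whole vertex, and each pointer is a byte offset into the VBO. */
void bind_interleaved_vnc(GLuint vbo)
{
    const GLsizei stride = (GLsizei)sizeof(VertexVNC);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glVertexPointer(3, GL_FLOAT, stride, (const GLvoid *)offsetof(VertexVNC, pos));
    glNormalPointer(GL_FLOAT, stride, (const GLvoid *)offsetof(VertexVNC, normal));
    glColorPointer(4, GL_UNSIGNED_BYTE, stride, (const GLvoid *)offsetof(VertexVNC, color));
}
```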

The reason I don’t believe ATI is double buffering is that I can create a large scene of over 800 MB of I+VNC data on a 1 GB card, and there is no second 800 MB allocation occurring in CPU/AGP memory.

So to recap what I have found:
The results are from current C2D/C2Q hardware, tested on more than 6 different setups: NV 8300 family, 8800 family, 275, Quadro 200 series, and ATI 3870, 4870, 6870, 6970.

  • ATI VA is performing as fast as VBO.
  • NVidia VA is performing 5x to 10x slower (memcpy vs. DMA?).
  • ATI DYNAMIC or STREAM VBO is performing well.
  • NVidia DYNAMIC or STREAM VBO is performing well (DYNAMIC with interleaved VNC).

Your results may of course vary based on hardware and driver.
I decided on a user-selectable render-path option that defaults to VBO STREAM.

Interleaving gives about 5% better performance, but I have never noticed any difference in rendering speed with different hints on either NV or ATI.

:slight_smile:
What do you mean by double buffering?
Just the frame-buffer is duplicated, not the entire scene! :slight_smile:

This is bad news for ATI if it is true. You didn’t mention the size of the VA/VBO, the way they are drawn, etc.

Interleaving gives about 5% better performance, but I have never noticed any difference in rendering speed with different hints on either NV or ATI.

Yes, I also noticed about 5%-10% faster rendering with interleaved over separate arrays.
With my specific rendering setup, which was a set of frustum-culled 4-to-65536-vertex I+VNC buffers, on NVidia with VBO DYNAMIC I was getting poor performance with separate arrays and fast performance with interleaved arrays; this was on I+VNC data that was not changing between frames.
It was acting as if the DYNAMIC hint was causing separate arrays to be stored in CPU memory instead of on the GPU.

What do you mean by double buffering?

I meant two alternating copies of the VBO in this case, not the double-buffered framebuffer. :slight_smile:
This was in response to Alfonse’s comment above regarding ATI possibly keeping a second copy of the buffer around. They are not.

This is the bad news for ATI if it is true. You didn’t mention the size of VA/VBO, the way they are drawn etc.

Why would this be bad news? Good performance is always better.
If NVidia is using a slow memcpy for VAs whereas ATI is using DMA, then ATI’s approach is the better one, as it will render faster because it can push the data from CPU to GPU quicker.

The sizes I tested for the VA/VBO ranged from 1024 vertices to 16.7M vertices, split into blocks of at most 65536 vertices with culling, using float vertices, float normals, uint colors, and ushort and uint indices; both separate individual I, V, N, C arrays and interleaved I+VNC arrays; drawn using glVertexPointer, glNormalPointer, glColorPointer and glDrawElements.
As I mentioned though, I tested on over 6 newer computers, so people with other hardware may get different results than I did.
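The per-block draw described above is nothing unusual; roughly like this sketch (simplified, with my own names, and assuming the index data sits in its own element-array VBO):

```c
#include <GL/gl.h>

/* Draw one frustum-culled block.  The vertex arrays are assumed to be bound
   already (separate or interleaved); the indices live in their own VBO. */
void draw_block(GLuint indexVbo, GLsizei indexCount, GLenum indexType)
{
    /* indexType is GL_UNSIGNED_SHORT or GL_UNSIGNED_INT depending on block size. */
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVbo);
    /* The last argument is a byte offset into the bound index buffer. */
    glDrawElements(GL_TRIANGLES, indexCount, indexType, (const GLvoid *)0);
}
```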

Static VBOs should be cached in graphics card memory, so if VAs and VBOs have the same rendering speed, that means either the VBOs are not in video memory or the VAs are cached in video memory. I really doubt that either of the vendors caches VAs “on the server side”. That’s why I said what I said. But if AMD does cache VAs in video memory, then the performance would be high (for example, if the content of a VA is not changed between consecutive draws, then “cache” the content), but in that case there would be no difference between the semantics of a VA and a VBO.

I don’t have any AMD card at my disposal, but if you post your benchmark I’ll run it on various NVIDIA cards and various versions of MS Windows, and reply with the results.
I have just changed the VBO hints in my application from STATIC_DRAW to STREAM_COPY, and I REALLY GET THE SAME RESULTS for both on an 8600M GT/Vista. I’ll also try on a GTX470/Win7.

I have just changed the VBO hints in my application from STATIC_DRAW to STREAM_COPY

You do know what the difference between “DRAW” and “COPY” is, yes?

DRAW means that you, the user, will fill it with data directly, either with glBuffer(Sub)Data or mapping with write flags. It means that you, the user, will not read from it, either with glGetBufferSubData or mapping with read flags.

COPY means that you, the user, will neither read from nor write to it. COPY is useful for doing things like transform feedback, where you render to some buffer object that you then use as source for another rendering operation.

So COPY is the wrong thing.
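In code, the distinction is roughly this (just a sketch, not from either application in this thread):

```c
#include <GL/gl.h>

/* DRAW: the application itself supplies the contents. */
void fill_draw_buffer(GLuint vbo, GLsizeiptr bytes, const void *vertexData)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, vertexData, GL_STATIC_DRAW);
}

/* COPY: the GL both writes the data (e.g. transform feedback captured into
   this buffer via glBindBufferBase) and later reads it back as a vertex
   source; the application never touches the contents directly. */
void allocate_copy_buffer(GLuint vbo, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STATIC_COPY);
}
```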

:slight_smile:
Thank you for clarification, Alfonse!

The following chart depicts what I wanted to say. Next time I’ll be more careful when posting. :wink:

VBO hints chart

Yes, you are correct; data with the STATIC hint should be located in GPU memory.

To clarify my previous posts, in case I have confused anyone: the ATI VA is not rendering at exactly the same speed as the ATI STREAM VBO, but as I stated, it is rendering significantly faster than the NVidia VA.
Since you typically don’t need to display the render output faster than the screen refresh rate, the ATI VA can visually appear to be almost as fast as the VBO for screen updates such as camera movement. This is what I meant previously.
For example, with 2M triangles of I+VNC data the results I am getting are along these lines; these are typical results on every system I have tested so far:

ATI VA = 56 fps
NVidia VA = 9 fps
ATI VBO STREAM = 200+ fps
NVidia VBO STREAM = 200+ fps

So ATI VA performance will visually look close to VBO performance in this case for my application (at a 60 Hz refresh).
The better way to time this would of course be in milliseconds instead of fps, but this is just a quick example.
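If anyone wants to reproduce the numbers in milliseconds, something along these lines works; it assumes a POSIX clock_gettime, the drawScene callback is hypothetical, and the glFinish makes it a profiling aid only:

```c
#include <GL/gl.h>
#include <time.h>

/* Rough per-frame timing in milliseconds rather than fps.  glFinish() forces
   the GPU work to complete inside the measured interval, at the cost of a
   pipeline stall, so use this only while profiling. */
double time_frame_ms(void (*drawScene)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    drawScene();
    glFinish();                               /* wait for the GPU to finish */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec  - t0.tv_sec)  * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
}
```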

Sorry, this isn’t a benchmark program; it is an actual CAD-like program being developed.

Can you try turning off threaded optimizations on the NVidia cards (via Riva Tuner or similar) and see if it improves the performance of your VAs?

I recently got a new computer with an AMD card. I noticed that AMD is more sensitive to memory usage. I can run out of memory on AMD quite quickly if I forget to free framebuffers on resize, for example. I did not encounter this on NVIDIA – it seemed to tolerate careless memory management better.
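“Freeing framebuffers on resize” means something along these lines; a rough sketch with hypothetical names, assuming renderbuffer attachments and that the framebuffer-object entry points are available:

```c
#include <GL/gl.h>

/* On a window resize, delete the old renderbuffer storage before allocating
   the new one; otherwise the stale allocations pile up in video memory. */
void resize_offscreen_target(GLuint *colorRb, GLuint *depthRb, GLsizei w, GLsizei h)
{
    /* glDeleteRenderbuffers silently ignores names that are 0. */
    glDeleteRenderbuffers(1, colorRb);
    glDeleteRenderbuffers(1, depthRb);

    glGenRenderbuffers(1, colorRb);
    glBindRenderbuffer(GL_RENDERBUFFER, *colorRb);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, w, h);

    glGenRenderbuffers(1, depthRb);
    glBindRenderbuffer(GL_RENDERBUFFER, *depthRb);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, w, h);
}
```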

Different memory management strategies might be playing a part in these performance differences. But this is just guessing.
