Correct VAO usage

Hi,

Recently I’ve tried to implement support for VAO, but it is terribly slow (about 1 fps and 200k tri/sec with indexed strips). The example app from ati.com is completely useless: it renders two triangles at 800 fps, and my code can render two triangles at 800 fps as well. Has anyone implemented fast geometry transfers with VAO? And how?
Here is what I tried:

create 32 buffers (100 kb each) with DYNAMIC_ATI flag
for each triangles batch
{
update buffer data with DISCARD_ATI flag
render triangles with glDrawElements
}

Yeah, it’s slow (about the same speed as Vertex Arrays) if you update the buffer every frame. Not sure if it’s a driver thing, or if it’s just the way the extension works.

You need to use a separate extension that lets you write directly into the buffer, else you get an extra copy step induced by the driver. 200 ktri/second doesn’t sound right, though – sounds as if you’re getting lots of partial evictions into AGP or some similar nonsense.

NitroGL

Yeah, it’s slow (about the same speed as Vertex Arrays) if you update the buffer every frame. Not sure if it’s a driver thing, or if it’s just the way the extension works.

It’s actually much slower than VA. With VA I get about 2M tri/sec.

jwatte

You need to use a separate extension that lets you write directly into the buffer

What extension do you mean? I believe that VAO should work fast by itself, without any additional extensions.

Has anyone else tested this?
Is it true that updating the vertex array object every frame is slower than normal vertex arrays?
I have read the spec and this whitepaper (http://www.ati.com/na/pages/resource_centre/dev_rel/ATIVertexArrayObject.pdf) and it seems that you can do it using GL_DYNAMIC_ATI when creating the object and glUpdateObjectBufferATI() every frame.

(I presume that you are creating the objects just one time at the beginning of the program and not every frame…)

I will test it next weekend…

What extension do you mean? I believe that VAO should work fast by itself, without any additional extensions.

The VAO extension only lets you point the driver at data you already have in system memory. Thus, the driver will have to copy the data into the VAO. If you change your data every frame, you get a lot of unnecessary memory traffic:

  • read source data
  • transmogrify your data
  • write destination buffer in system memory
  • read system memory buffer
  • write AGP memory

Meanwhile, NVIDIA’s VAR extension lets you do this:

  • read source data
  • transmogrify your data
  • write AGP memory

Note that the extra overhead doesn’t matter for static data (data you only update once, and then leave in the buffer), only dynamic data.
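
As a rough back-of-the-envelope sketch of the two lists above, in C: assuming the transformed output is the same size as the source and ignoring cache effects, the driver-copy path moves every byte four times and the direct-write path moves it twice. The function names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes moved per update when the driver copies from an application
   buffer: read source, write system-memory buffer, read it back,
   write AGP memory. Assumes output size == input size. */
size_t traffic_with_driver_copy(size_t payload_bytes)
{
    assert(payload_bytes <= SIZE_MAX / 4);  /* avoid overflow */
    return 4 * payload_bytes;
}

/* Bytes moved per update when the application writes straight into
   AGP memory: read source, write AGP memory. */
size_t traffic_direct_write(size_t payload_bytes)
{
    assert(payload_bytes <= SIZE_MAX / 2);  /* avoid overflow */
    return 2 * payload_bytes;
}
```

For one of the 100 kB buffers mentioned at the top of the thread, that is 400 kB of traffic per update versus 200 kB under these assumptions.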

I believe there’s another ATI extension which allows you to get a pointer into a VAO, but I can’t find it among the public specifications.

Perhaps that’s what that new GL_ATI_map_object_buffer extension does…

Zak McKrakem

[i]I presume that you are creating the objects just one time at the beginning of the program and not every frame…[/i]

I will test it next weekend…

Then please post your results here.

jwatte

[i]
  • read source data
  • transmogrify your data
  • write AGP memory
[/i]

I believe this scheme is true only if you are not doing any writes to other memory areas, because if you do, the CPU write combiners will be flushed. And what about writing to some intermediate variables? Does that cause a write combiner flush?

My point of using VAO is to avoid redundant data copies when doing multipass on dynamic data.

Originally posted by h2:

Then please post your results here.

I ran a quick test this morning. I modified the SimpleVAO example from ATI to draw a sphere with 5120 triangles (just positions and normals, no texturing) and 15360 vertices (it is a quick test, so there is no vertex sharing).
Using a static object (glNewObjectBufferATI(objectSize, verts, GL_STATIC_ATI)) works fine: it returns 1 as the handle and the app runs at ~600 FPS.
Using a dynamic object (glNewObjectBufferATI(objectSize, verts, GL_DYNAMIC_ATI) or glNewObjectBufferATI(objectSize, NULL, GL_DYNAMIC_ATI)) returns 0, which means that the buffer could not be allocated!

System: P4 1.7 GHz; 256 MB; Radeon 8500; W2000; driver 5.13.01.6015

Does anyone know which driver is the latest one to test?

(In this system with a GF3 I can allocate at least 8Mb using VAR extension and asking for AGP memory. So AGP is working properly…)

Thanks.

h2,

> I believe this scheme is true only if you
> are not doing any writes to other memory
> areas, because if you do, the CPU write
> combiners will be flushed. And what about
> writing to some intermediate variables?
> Does that cause a write combiner flush?

You clearly have to manage your caches and write combiners (nee “line fetch buffers”) correctly. If you spit out an aligned block of 32 bytes (or better yet, 64 bytes) at a time, then that will write combine correctly no matter what the other memory traffic is.

Then you can make sure to manage your caches correctly. As there’s 8 kB of Dcache on a Pentium IV, you probably don’t want to use more than 4 kB of “auxiliary data” (stack + coefficients + whatever), and process your vertices in 4 kB input data chunks, doing a full pre-read (not just pre-fetch) of the data, so you know it’s going to sit in L1 and not confuse your LFBs.
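
The advice about emitting whole aligned blocks can be sketched in C as follows. This is only an illustration under stated assumptions: `scale_to_wc_buffer` is a hypothetical helper, the "transform" is a plain scale, and the destination here is ordinary memory standing in for a write-combined AGP pointer. Whether the stores actually combine depends on the compiler and the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WC_BLOCK = 64 };  /* bytes written per contiguous burst */

/* Scale `count` floats from src into dst, emitting whole 64-byte blocks.
   dst must be 64-byte aligned and count a multiple of 16 floats. */
void scale_to_wc_buffer(float *dst, const float *src, size_t count, float k)
{
    enum { FLOATS_PER_BLOCK = WC_BLOCK / sizeof(float) };

    assert(((uintptr_t)dst % WC_BLOCK) == 0);
    assert(count % FLOATS_PER_BLOCK == 0);

    for (size_t i = 0; i < count; i += FLOATS_PER_BLOCK) {
        float block[FLOATS_PER_BLOCK];          /* staged in cacheable memory */
        for (size_t j = 0; j < FLOATS_PER_BLOCK; ++j)
            block[j] = src[i + j] * k;
        memcpy(dst + i, block, sizeof block);   /* one contiguous 64-byte write */
    }
}
```

All intermediate work goes to the small local block, so the only writes that reach the destination are full, aligned 64-byte runs.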

If your data is very scattered, you should probably look into a way to make it less so :)

> My point of using VAO is to avoid
> redundant data copies when doing multipass
> on dynamic data.

You could get the same win with LockArraysEXT(), assuming the Radeon driver copies data to AGP memory when you lock and/or re-set the array pointers (which it should). However, in both VAO and LockArrays cases, you’ll have one redundant copy into system memory where you are generating your data.

I think you are running into a synchronization issue that has since been fixed. If this is the issue, I believe you may be able to double buffer the VAOs on your own to avoid the sync. Just allocate 64 buffers instead of 32 and alternate between the two halves from frame to frame. Would it be possible to get a copy of the app to confirm that this is fixed?
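
A minimal C sketch of that double-buffering idea, showing only the slot selection. `vao_slot` and the constants are hypothetical names; the actual buffer handles would come from glNewObjectBufferATI and are not shown here.

```c
#include <assert.h>

enum { BUFFERS_PER_FRAME = 32, FRAMES_IN_FLIGHT = 2,
       POOL_SIZE = BUFFERS_PER_FRAME * FRAMES_IN_FLIGHT };

/* Pick a slot in a pool of 64 buffers. Even frames cycle through
   slots 0..31, odd frames through slots 32..63, so a buffer updated
   this frame is never one the previous frame may still be drawing from. */
unsigned vao_slot(unsigned frame, unsigned batch)
{
    unsigned slot = (frame % FRAMES_IN_FLIGHT) * BUFFERS_PER_FRAME
                  + (batch % BUFFERS_PER_FRAME);
    assert(slot < POOL_SIZE);
    return slot;
}
```

The update-and-draw loop would then index its array of buffer handles with `vao_slot(frame, batch)` instead of `batch % 32`.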

As for the question about MapObjectBuffer, it is implemented. I think we held off publishing the spec publicly because of the possibility of a minor 11th hour revision to the interface.

-Evan

Oh, and as for the demo, it was intended to be as simple as possible to avoid confusing people who just wanted to learn VAO. If you think we went too far on that one, that sort of feedback is good to help us decide what type of content to include.

[This message has been edited by ehart (edited 02-20-2002).]

Zak McKrakem

I’ve written a test app as well. Results are exactly the same:

Intel Celeron/MMX 334 Mhz
OpenGL ICD on ATI Technologies Inc. Radeon 8500 DDR x86 [1.3.2475 Win2000 Release]

       Arrays :  8.6 FPS,  2252360 TPS,  4.18+4.05 Mb/s
          CVA :  7.3 FPS,  2006648 TPS,  3.53+3.43 Mb/s
        Lists : 34.6 FPS,  8681824 TPS, 16.74+16.24 Mb/s
    Lists N/L : 73.1 FPS, 17977928 TPS, 35.29+34.24 Mb/s
Streaming VAO :  3.2 FPS,  1023800 TPS,  1.56+1.52 Mb/s
   Static VAO : 35.1 FPS,  8681824 TPS, 16.93+16.43 Mb/s

Sources of test are here http://www.cc.jyu.fi/~pturchy/vaotest.zip

By the way, I’ve just tested my GF2 MX. Outstanding results for display lists.

Intel PentiumIII, PentiumIII Xeon or Celeron/MMX/SSE 701 Mhz
OpenGL ICD on NVIDIA Corporation GeForce2 MX/AGP/SSE [1.2.2]

       Arrays : 8.6 FPS, 2252360 TPS, 4.15+4.03 Mb/s
          CVA : 8.5 FPS, 2252360 TPS, 4.11+3.99 Mb/s
        Lists : 8.2 FPS, 2170456 TPS, 3.97+3.85 Mb/s
    Lists N/L : 8.3 FPS, 2252360 TPS, 4.03+3.91 Mb/s
Streaming VAO : unable to initialize
   Static VAO : unable to initialize

jwatte

You clearly have to manage your caches

Thanks for explanation. Now I understand.

You could get the same win with LockArraysEXT()

Probably not. There is no way to tell how a particular ICD manages CVA. Just look at my results for CVA; they are actually slower than VA. Unfortunately, all the IHVs care about is making Q3 run fast, nothing else.

h2

I ran your app on my machine and here is what I get :

AMD K7/MMX/3DNOW 800 Mhz
OpenGL ICD on ATI Technologies Inc. Radeon 8500 DDR x86/MMX/3DNow! [1.3.2483 Win2000 Release]

       Arrays :  12.0 FPS,  2989496 TPS,  5.80+5.63 Mb/s
          CVA :  16.8 FPS,  4218056 TPS,  8.09+7.85 Mb/s
        Lists :  37.4 FPS,  9255152 TPS, 18.08+17.54 Mb/s
    Lists N/L : 145.3 FPS, 35751096 TPS, 70.20+68.11 Mb/s
Streaming VAO :  31.6 FPS,  7903736 TPS, 15.27+14.81 Mb/s
   Static VAO :  38.6 FPS,  9582768 TPS, 18.63+18.07 Mb/s

Note that the driver I have is more recent than the one you used (1.3.2483 vs 1.3.2475). Streaming VAO has pretty good performance.

ehart

Would it be possible to get a copy of the app to confirm that this is fixed?

Ok, looks like it was a driver bug, and it’s fixed already.

BTW, I have a swarm of bugs with pbuffer, render_texture and envmap_bumpmap on R200 (on R100 everything is fine). Mayhap you can handle them as well?

[This message has been edited by h2 (edited 02-21-2002).]

kehziah

Ok, looks like this bug was fixed. Thanks for testing.

Originally posted by kehziah:

Note that the driver I have is more recent than the one you used (1.3.2483 vs 1.3.2475). Streaming VAO has pretty good performance.

Kehziah,
How can I find out the driver version? I looked under Display Properties -> Settings -> Advanced and got 5.13.01.6015, which does not look like yours…

Thank you.

[This message has been edited by Zak McKrakem (edited 02-21-2002).]

The version numbers mentioned are those returned by glGetString(GL_VERSION).
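
As a small C sketch of comparing those numbers in code: the helper below pulls the three numeric fields out of a string shaped like the ones quoted in this thread ("1.3.2483 Win2000 Release"). `parse_gl_version` is a hypothetical name, and treating the third field as a driver build number is an assumption based on the strings above; the format is vendor specific. In a real program the input would come from glGetString(GL_VERSION) with a current GL context.

```c
#include <assert.h>
#include <stdio.h>

/* Extract major.minor.build from a version string such as
   "1.3.2483 Win2000 Release". Returns 1 on success, 0 if fewer
   than three numeric fields are present. */
int parse_gl_version(const char *s, int *major, int *minor, int *build)
{
    assert(s && major && minor && build);
    return sscanf(s, "%d.%d.%d", major, minor, build) == 3;
}
```

With that, an application could for instance warn when the third field is below 2483, the value reported on the machine where streaming VAO performed well.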