ARBvbo posted

I don’t see why it wouldn’t be supported on those… It’s almost the same as ATI’s VAO extension.

Originally posted by Cab:
Is VBO going to be supported on Radeon 7500, 8500?

I’d expect all cards that support VAO and VAR will support ARB_vbo.
http://www.delphi3d.net/hardware/extsupport.php?extension=GL_ATI_vertex_array_object
http://www.delphi3d.net/hardware/extsupport.php?extension=GL_NV_vertex_array_range

Originally posted by NitroGL:
I don’t see why it wouldn’t be supported on those… It’s almost the same as ATI’s VAO extension.

Maybe because mapping a buffer was not available via the ATI extension. There was another ATI extension (ATI_map_object_buffer) for doing it, but I don’t know if it was available on the Radeon 7500.

Thanks.

@Cab - This shows that all (I think) Radeons support that extension: http://www.delphi3d.net/hardware/extsupport.php?extension=GL_ATI_map_object_buffer

And for those interested, I’ve made a simple (fairly simple) demo of the new extension: http://www.area3d.net/file.php?filename=nitrogl/ARBvbo.zip
The important parts are commented. I’m pretty sure it’s all done correctly, though on my 9700 it doesn’t work quite right (mem unmap fails), but I think that’s just a driver thing.
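For what it’s worth, the spec gives a way to detect that failure: glUnmapBufferARB returns FALSE when the buffer contents became undefined while it was mapped. A minimal call-sequence sketch (not a standalone program; it assumes the ARB entry points were already fetched via wglGetProcAddress, and `fill_vertices` is a hypothetical app function):

```c
GLuint buf;
glGenBuffersARB(1, &buf);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW_ARB);

void *ptr = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (ptr) {
    fill_vertices(ptr);  /* hypothetical: write vertex data in place */
    if (!glUnmapBufferARB(GL_ARRAY_BUFFER_ARB)) {
        /* FALSE: the buffer contents became undefined while mapped
         * (e.g. a screen-mode switch); respecify the data and retry. */
    }
}
```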

How expensive is a VBO bind, actually? Especially in cases where you have each array in a separate VBO (one for vertices, one for normals, one for colors, …)?

I can’t think of any reason why calling BufferData would be any worse off than doing the map / unmap for each update. In fact, I can think of a lot of reasons why it might be better.

Suppose I still do skinning in software. Then I can either:

  1. Skin into cached memory (which will thus cut my available L1 cache size in half, which REALLY hurts on the Pentium IV). Then use BufferData to copy the data into the buffer, which means a second copy pass.

  2. Map the buffer, and skin directly into the buffer, which presumably lives in un-cached memory. This avoids an extra copy pass, AND it gives me more L1 cache for my bone matrices.

On a Pentium IV and a reasonable-size skeleton, the size of the bone matrices really starts hurting if you’re going cached-to-cached, as there’s only 8 kB of L1 cache, and (rule of thumb) half of that disappears if you’re writing to cached memory (maybe only a quarter disappears if you write with MOVNTPS because the cache is 4-way (IIRC), but I wouldn’t bet on it).

Anyway, it seems to me that mapping is The Right Thing To Do for any streaming data which you rewrite every frame, and completely rewrite as part of generating the data.

Originally posted by Korval:

What’s the ETA on the full ARB_“Uber_Buffer” extension (or extensions)?

Korval,

The superbuffers group is working hard to get a finalized spec. As we near completion of the spec, we will strive to get it into public drivers.

I’m really happy about these two extensions. They will fix a lot of outstanding problems with writing portable OpenGL.

Thanks -
Cass

Originally posted by jwatte:
Suppose I still do skinning in software. Then I can either:

  1. Skin into cached memory (which will thus cut my available L1 cache size in half, which REALLY hurts on the Pentium IV). Then use BufferData to copy the data into the buffer, which means a second copy pass.
  2. Map the buffer, and skin directly into the buffer, which presumably lives in un-cached memory. This avoids an extra copy pass, AND it gives me more L1 cache for my bone matrices.

On a Pentium IV and a reasonable-size skeleton, the size of the bone matrices really starts hurting if you’re going cached-to-cached, as there’s only 8 kB of L1 cache, and (rule of thumb) half of that disappears if you’re writing to cached memory (maybe only a quarter disappears if you write with MOVNTPS because the cache is 4-way (IIRC), but I wouldn’t bet on it).

Anyway, it seems to me that mapping is The Right Thing To Do for any streaming data which you rewrite every frame, and completely rewrite as part of generating the data.

Your point is very valid, but … doesn’t the P4 bypass its L1 D-Cache for FP-Data?

ARB_vbo isn’t any more or less restrictive than conventional OpenGL vertex arrays. That is, you can’t specify different components of a vertex attribute from different arrays, but you certainly can specify different vertex attributes in different buffer objects. This would be useful to draw the same model with different colors or texture coordinates or whatever. Admittedly a contrived example, but it illustrates the point.
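That kind of mixing could look like the following sketch (call sequence only, not a standalone program; `posBuf`, `colorBufA`, `colorBufB`, and `nVerts` are hypothetical). The buffer binding is latched by each gl*Pointer call, so each attribute can source a different buffer object:

```c
/* One position buffer shared by both draws, two alternative color buffers. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, posBuf);
glVertexPointer(3, GL_FLOAT, 0, (void *)0);   /* offset into posBuf */

glBindBufferARB(GL_ARRAY_BUFFER_ARB, colorBufA);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, (void *)0);
glDrawArrays(GL_TRIANGLES, 0, nVerts);

/* Rebind only the color attribute; the vertex pointer still sources posBuf. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, colorBufB);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, (void *)0);
glDrawArrays(GL_TRIANGLES, 0, nVerts);
```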

The spec, and the examples, made it seem like calling a second glBindBufferARB is not a possibility once you’ve attached a particular attribute array to a buffer. On page 22 of the PowerPoint presentation, it says that glBindBufferARB “must precede pointer calls”. That seems to rule out your code. Also, the PowerPoint slides seem to indicate that each gl*Pointer call is bound to the same buffer.

[Edit]

On the other hand, I consulted the actual spec, and it says, “It is acceptable for vertex, variant, or attrib arrays to be sourced from any combination of client memory and various buffer objects during a single rendering operation.”

So, I guess you’re right. Good; I can still do what I wanted to.

The superbuffers group is working hard to get a finalized spec. As we near completion of the spec, we will strive to get it into public drivers.

Excellent.

[This message has been edited by Korval (edited 03-20-2003).]

Originally posted by jwatte:
Anyway, it seems to me that mapping is The Right Thing To Do for any streaming data which you rewrite every frame, and completely rewrite as part of generating the data.

You’re right. I was thinking of KRONOS’s example where he already has his data in cacheable memory, and is simply doing a memcpy into the buffer.

I probably should have said, “If you’re completely respecifying the contents of a buffer object and you’re not streaming data, you’re better off with BufferData.”

Nice catch.

– Ben

I probably should have said, “If you’re completely respecifying the contents of a buffer object and you’re not streaming data, you’re better off with BufferData.”

I came to that conclusion shortly after I used the 43.03 drivers. They expose the extension and the issue I had is gone. Maybe the 43.30 don’t synchronize…

I haven’t benchmarked this, but I don’t know what to do: mapping or using BufferSubData. Both work and I can’t see a difference. But I guess BufferSubData should be faster, since the driver takes care of the access and it is the only one that knows where the memory truly is…
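The two update paths being compared might look like this sketch (call sequence only, not a standalone program; `buf`, `offset`, `bytes`, and `cpuData` are hypothetical):

```c
/* Path 1: let the driver do the copy from client memory. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);
glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, offset, bytes, cpuData);

/* Path 2: map the buffer and write into it in place. */
void *p = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (p) {
    memcpy((char *)p + offset, cpuData, bytes);
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
}
```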

Originally posted by evanGLizr:
Have you talked to MS about this? AFAIK OpenGL cannot survive mode changes in XP because of XP’s design (the OS invalidates the WNDOBJ handler when not in the same resolution the WNDOBJ was created in), and MS acknowledges that shortcoming.

No sensible app should keep an OpenGL context alive across a mode change (unfortunately, there are some non-sensible apps floating around). That statement in the issues part of the spec encourages faulty programming.

Hey, thanks for the info! I’m going to add that one to our “bug” tracker at work. I didn’t know that! Although our users are very unlikely to change screen mode while the program is running, it is a possible problem they might run into.

So as far as I can see, this extension is supported by every NVIDIA and ATI card with the newest drivers, which will be released in the near future.

Should there be a fallback if this superb extension is not supported? Then maybe it is more work.

A fallback would probably be wise to have, but it’s not much code needed to support both VBO and standard system memory vertex arrays.
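A sketch of such a dual path (not a standalone program; `hasVBO`, `vertexBuf`, `cpuVertices`, and `nVerts` are hypothetical). The only real difference is whether the last gl*Pointer argument is a byte offset into the bound buffer or a plain client-memory pointer:

```c
/* hasVBO would come from an extension-string check, e.g. looking for
 * "GL_ARB_vertex_buffer_object" in glGetString(GL_EXTENSIONS). */
if (hasVBO) {
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertexBuf);
    glVertexPointer(3, GL_FLOAT, 0, (void *)0);  /* offset into the VBO */
} else {
    glVertexPointer(3, GL_FLOAT, 0, cpuVertices); /* plain vertex array */
}
glDrawArrays(GL_TRIANGLES, 0, nVerts);
```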

> doesn’t the P4 bypass its L1 D-Cache for FP-Data?

That’s the first I heard of that. That would be very bad, as it would make latencies for reading the bone matrices very high.

Are you sure you’re not thinking of the MOVNTPS SSE instruction, which allows you to manually bypass the cache write?

If this is a special mode in the P4, do you have a reference I could go look-see at?

No hard references, sorry. That’s only second hand info I picked up on the forums, but it should be officially documented somewhere. The P4 supposedly ignores L1 for FP data and instead falls back to its L2 cache. Which is not all that bad, we’re definitely not talking about uncached memory access here.

I’m not sure atm whether this applies to x87 only, SSE2 only, or both. I don’t have a P4, so I can’t test it myself. But I’m pretty certain that it’s true for at least one of these two.
The basic idea is that:
1) FP data often comes in huge batches - a potential cache-thrashing hazard.
2) Typical FP code doesn’t suffer as much from increased latency, as long as there’s enough bandwidth.
L2 would then be the natural choice, seeing how scarce L1 cache is on the P4.

Please take all of this with the mandatory grain of salt until someone with first hand knowledge clarifies.

Originally posted by cass:

Early implementations of VBO were just for API correctness (so that apps could begin porting). In current internal builds, VBO is as fast as VAR.

When can we expect to see that driver?
I love this extension (thanks), but it is very slow as it stands now.
And will I be guaranteed to get the fastest possible video memory if there is sufficient on board?

From the spec:
“- Applications may still access high-performance memory, but this is optional, and such access is more restricted.”

I did a simple test: untextured 2D patch, 257 x 257 vertices as 2D floats, unsigned short indices, drawing tristrips.
VAR = 128 fps.
Std GL = 60 fps.
VBO with indices in RAM = 60 fps.
VBO with indices in buffer = 60 fps.

Cheers.

[Edit]
Shouldn’t VBO then, with optimal drivers, potentially be faster than VAR for static geometry, since the indices can be in video RAM?
In this example VBO > 128fps?
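For static geometry, putting the indices in a buffer object looks like this sketch (call sequence only, not a standalone program; `vertBuf`, `idxBuf`, `verts`, `indices`, and the size/count variables are hypothetical):

```c
/* Static geometry: both vertices and indices live in buffer objects. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertBuf);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertBytes, verts, GL_STATIC_DRAW_ARB);
glVertexPointer(2, GL_FLOAT, 0, (void *)0);

glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, idxBuf);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, idxBytes, indices,
                GL_STATIC_DRAW_ARB);

/* With an element array buffer bound, the last argument is an offset
 * into that buffer, not a client-memory pointer. */
glDrawElements(GL_TRIANGLE_STRIP, idxCount, GL_UNSIGNED_SHORT, (void *)0);
```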

[This message has been edited by fritzlang (edited 03-22-2003).]


Lars, I saw your docs and it gave me an idea. Download my .chm docs and tell me what you think. It’s not complete, just a format to give an idea of what I had in mind. I’m not sure where I will go with this because it looks like a lot of work. This format is useful for vendor-specific extensions. I feel SGI should create a new doc format allowing IHVs to plug into it. The user would then download docs from SGI and have it all in one place.

Download from http://forged3d.tripod.com

It’s at the bottom of the pic on main page. Btw, that’s my editor done in d3d9, just recently I thought of moving to gl for flexibility purposes. I’m still undecided though. Take care.

Originally posted by JD:
Download my .chm docs and tell me what you think.

Any way to get that in a format that is readable where IE is not available?


This format is useful for vendor-specific extensions. I feel SGI should create a new doc format allowing IHVs to plug into it. The user would then download docs from SGI and have it all in one place.

How about DocBook, or LaTeX, or straight-up HTML with some CSS?

I have Perl on my “to learn” list. Maybe I’ll put together a script to parse the txt files and spit them into LaTeX.

[This message has been edited by PK (edited 03-22-2003).]

Originally posted by zeckensack:
The basic idea is that:
1) FP data often comes in huge batches - a potential cache-thrashing hazard.
2) Typical FP code doesn’t suffer as much from increased latency, as long as there’s enough bandwidth.
L2 would then be the natural choice, seeing how scarce L1 cache is on the P4.

Or maybe Intel made a bad choice when reducing the L1 cache size to 8KB (data) and they noticed bypassing the L1 for FP often improved performance.

Kind of stupid considering how much the performance improved with the 32K on the PMMX.

Intel seems to be bent on clock rate, thinking GHz sells chips. These guys are going backwards. Remember the cacheless Celeron? What a joke.
And what about RAMBUS? Gimme a break.