R350 finally! It's faster than the FX, but is it as flexible? Apparently yes!

Most of the 9800’s speed improvements were in the areas of antialiasing support and Hyper-Z. If you don’t turn on antialiasing (and, to be honest, given that a 9800 will run Unreal 2003 at around 90fps with good antialiasing and aniso, why wouldn’t you?), you’re probably not going to see much of a speed improvement.

What I really want to know is:

#1: Did they match nVidia’s new fragment program instructions?
#2: Did they match nVidia’s vertex/fragment program parameter/constant/etc counts?

Originally posted by MZ:
If this is only instruction count, as reviews seem to suggest, then this is nothing to be excited about.

Don’t know if this is true or not, but infinite instruction count alone is enough to get me excited. I don’t think I’ve ever hit any of the other limits, but I sure have hit the instruction limit a few times.

Anyone know what’s happened to the M10 mobile part? I’d assumed it would be announced at the same time.

Correct me if I’m wrong, but doesn’t the f-buffer give us order-independent-transparency, basically for free? There’s something to be excited about.

FYI, for those that wanted OGL Extensions:
http://www.beyond3d.com/reviews/ati/r350/index.php?p=apisupp

Originally posted by deshfrudu:
Correct me if I’m wrong, but doesn’t the f-buffer give us order-independent-transparency, basically for free? There’s something to be excited about.

No, f-buffer renders fragments in the order you send them, so you’re still responsible for your own sorting.

I’m interested to hear details on the f-buffer implementation as well.

Cass
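As a rough illustration of the sorting cass is talking about: translucent geometry still has to be submitted back to front, f-buffer or not. A minimal sketch (the object names, centers, and squared-distance key are all made up for illustration):

```python
# Minimal back-to-front sort for translucent objects before submission.
# Objects and the view setup are hypothetical; only the ordering matters.

def view_depth(center, eye):
    """Squared distance from the eye to an object's center."""
    return sum((c - e) ** 2 for c, e in zip(center, eye))

def sort_back_to_front(objects, eye):
    """Return translucent objects sorted farthest-first, the order the
    f-buffer (like a plain framebuffer) expects them submitted in."""
    return sorted(objects, key=lambda o: view_depth(o["center"], eye),
                  reverse=True)

eye = (0.0, 0.0, 5.0)
objects = [
    {"name": "near_glass", "center": (0.0, 0.0, 4.0)},
    {"name": "far_glass",  "center": (0.0, 0.0, -2.0)},
    {"name": "mid_glass",  "center": (0.0, 0.0, 1.0)},
]
ordered = [o["name"] for o in sort_back_to_front(objects, eye)]
print(ordered)  # farthest object first
```

Sorting by object center is the usual cheap approximation; it breaks down for interpenetrating geometry, which is exactly where per-fragment schemes get interesting.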

A few more F-Buffer details here:
http://www.beyond3d.com/forum/viewtopic.php?t=4717

(sireric - ATI)

Originally posted by DaveBaumann:
[b]FYI, for those that wanted OGL Extensions:
http://www.beyond3d.com/reviews/ati/r350/index.php?p=apisupp

[/b]

Thanks, I didn’t see any link to that page.

What is “Legacy Depth Bias”?

Why do they list “Projected Textures”? I guess this is texture coordinate generation.

Originally posted by cass:
[b] No, f-buffer renders fragments in the order you send them, so you’re still responsible for your own sorting.

I’m interested to hear details on the f-buffer implementation as well.

Cass[/b]

at least it does away with all the problems of multipass, making transparency a usable feature again. but depth sorting is still needed (depending on the transparency…)

about the other limits someone noted above (texcoords, texture count, etc.)… dunno, you haven’t worked on an r300 chip yet, have you? i don’t have any problems with the texture count… i mean, the full lighting equation doesn’t require any texture at all. remember, full floats => the lighting equation can be done in the (now infinitely long) shaders. if you want to do shadowmapping, you need one shadowmap per light, yep, limiting your max number of lights. but you can emulate soft shadowing, for example, by supersampling around on the shadowmap with ease, and you could do it like lightmapping, fitting several shadowmaps into one (dunno, it’s just an idea)…

all i wanna say is: texture count is limited, but that definitely makes sense, and it’s not much of an issue, as you can actually use the textures for texturing (specifying materials, that is). the texcoord count isn’t really important either: you can just pass in the object-space x,y,z and the u,v texcoords, interpolated, and generate all the other texcoords from those per pixel (dropping the vertex shader work). that is always possible, even if from time to time you’d wish to get as much as possible out of the pixel shader.

it’s just the ability to not care about multiple passes that is awesome. something a 9700 and a gfFX can’t provide. it’s a great step…

about the ones talking about opengl2.0: do we really need a fully released opengl2.0 spec to implement what we already know about gl2? no. there are papers describing more or less how gl2 should look, and ati is free to implement drivers that support gl2 as it’s known today. no, the ati card can’t do everything fully in hw. some features simply have to become identity mappings (the ddx and ddy instructions available on the gffx, for example), others possibly have to drop to software (texture sampling in the vertex shader), but who cares. it’s great to play with already, and the hw is definitely capable of supporting/emulating a good gl2 version. just remember, no gf2 can do gl1.3 or gl1.4 features in hw, yet it still claims to be such a card: it can work around most non-hw features. just don’t use 3d textures on the gf2
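davepermen’s point about deriving extra texcoords per pixel from the interpolated object-space position can be sketched in scalar code. The spherical mapping below is one arbitrary choice of generated coordinate; nothing here is specific to the r300:

```python
import math

# Per-"pixel" texcoord generation from interpolated object-space position:
# pass only x,y,z (plus a base u,v) down from the vertex stage and derive
# any additional coordinates in the fragment stage. Spherical mapping is
# used here purely as an illustrative example of a derived coordinate.

def spherical_texcoord(x, y, z):
    """Map an object-space position to an (s, t) pair in [0,1]^2
    via spherical coordinates (longitude/latitude)."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return (0.0, 0.0)
    s = 0.5 + math.atan2(z, x) / (2.0 * math.pi)  # longitude -> [0,1]
    t = 0.5 - math.asin(y / r) / math.pi          # latitude  -> [0,1]
    return (s, t)

s, t = spherical_texcoord(1.0, 0.0, 0.0)
print(s, t)  # a point on the +x axis lands in the middle of the map
```

The cost is per-fragment ALU work instead of an interpolator slot, which is exactly the trade he’s describing.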

Originally posted by cass:
No, f-buffer renders fragments in the order you send them, so you’re still responsible for your own sorting.

Read the paper by Mark and Proudfoot (you should have already, as they even cite you :wink: ). As I understand it, as long as you don’t change fragment programs (or in general, as long as you don’t change the fragment program state), you should be able to get away without sorting. It’s part of the f-buffer design. In plain terms, that means that if you manage to define a single fragment program for an object, you can render the whole object without sorting, even if it’s semi-transparent.

OTOH, if you have multiple shaders, you still have to sort. Basically the question is:

Originally posted by cass:
I’m interested to hear details on the f-buffer implementation as well. :wink:

… how is this thing really implemented, and more specifically, did ATI adopt one of the strategies proposed by Mark and Proudfoot to handle buffer overflows? My educated (but still wild) guess would be that the f-buffer is flushed whenever you change the fragment program state, but that can still lead to a situation where the on-card memory is not enough.

Gimme a sample board (with working drivers, if that’s not too much to ask).

Originally posted by m2:
In plain text that means that if you manage to define a single fragment program for an object, you can render the whole object without sorting, even if it’s semi-transparent.

No, the paper explicitly states that “partially-transparent surfaces must still be rendered in back-to-front order”.

Let’s say you have a translucent sphere with a two-pass shader. Even if you depth-sort the polygons, the first pass will cause all fragments to be covered twice (once for the back of the sphere, once for the front). The fragments on the front and on the back will be blended together and a single color will be written to the framebuffer.

Now you go and do pass two. You generate a new pair of “front” and “back” fragments. What you want is for these to be blended with the corresponding “front” and “back” fragments from pass one, but you don’t have these anymore! All you have is the combined color that was written to the framebuffer.

Using the F-buffer you can keep both the original fragments and composite the two passes correctly, then do the blending between the final “front” and “back” fragments in the framebuffer.

– Tom
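Tom’s sphere example can be checked numerically. The sketch below uses made-up single-channel colors and assumes a two-pass shader whose passes multiply (a modulate pass); with the layers’ passes composited per fragment first, the result differs from what a plain framebuffer can produce once pass 1 of both layers has been merged:

```python
def over(src, dst, a):
    """Back-to-front alpha blend: src over dst with src alpha a."""
    return src * a + dst * (1.0 - a)

# Hypothetical single-channel colors for the translucent sphere.
# The shader needs two passes whose results multiply: final = p1 * p2.
back_p1, back_p2   = 0.2, 0.1   # back-facing fragment, passes 1 and 2
front_p1, front_p2 = 0.6, 0.2   # front-facing fragment, passes 1 and 2
alpha, bg = 0.5, 0.0

# With an f-buffer: each layer's two passes are composited per fragment
# first, and only the finished layers are alpha-blended back-to-front.
back_final  = back_p1 * back_p2
front_final = front_p1 * front_p2
correct = over(front_final, over(back_final, bg, alpha), alpha)

# Without it: pass 1 of both layers is already merged in the framebuffer,
# so pass 2 can only modulate that single combined color.
merged_pass1 = over(front_p1, over(back_p1, bg, alpha), alpha)
broken = merged_pass1 * back_p2 * front_p2

print(round(correct, 3), round(broken, 3))  # 0.065 vs 0.007
```

The per-layer pairing of pass-1 and pass-2 fragments is exactly the information the framebuffer loses and the F-buffer preserves. (A purely additive multipass would happen to survive, since alpha blending is linear; modulation does not.)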

No, the paper explicitly states that “partially-transparent surfaces must still be rendered in back-to-front order”.

Dang. Wrong letter! I was thinking of the R-Buffer. :frowning:

Sounds like it might work like a 3D texture, with the depth dimension used for the stream of fragments.

I wonder what the API looks like.
Would be nice to have their modified MESA at hand.

davepermen, you say that in your real work with the R300 you never needed more than 8 texcoords, 16 texunits, etc. That’s reasonable, but try applying your own argument to yourself: did you ever need a shader longer than 1000 instructions?

I can’t understand how “unlimited” can impress you guys so much more than “limited to an obscenely long 1024”. Regarding pure FP length capability, the difference between NV30 and R350 is negligible.

Now, if the R350 F-buffer doesn’t give you everything a true F-buffer could theoretically give, then the difference is even more negligible. I am abstracting from the practical usability of 9+ texcoords or 17+ texunits. But all this would simply mean that the R350 will not save you from multipass even a bit more than NV30 does, unless you run a 1025+ instruction shader.

Another thing: from what sireric wrote, it seems F-buffers will not be transparent to the user, and will have to be explicitly allocated (super-buffers were mentioned). This, I guess, will require estimating the maximum number of written fragments, which may involve estimating the depth complexity of the rendered objects.
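The allocation estimate MZ speculates about is simple arithmetic: worst case, the buffer must hold every fragment written in the pass, i.e. resolution × depth complexity × per-fragment storage. All the figures below (float16 RGBA, depth complexity 4) are illustrative guesses, not anything ATI has stated:

```python
# Rough upper bound on explicit f-buffer allocation for one pass.
# bytes_per_fragment = 8 assumes one float16 RGBA value per fragment;
# real per-fragment state could be larger (several temporaries).

def fbuffer_bytes(width, height, depth_complexity, bytes_per_fragment=8):
    """Worst-case f-buffer size in bytes for a single pass."""
    return width * height * depth_complexity * bytes_per_fragment

mb = fbuffer_bytes(1024, 768, 4) / (1024 * 1024)
print(round(mb, 1), "MB")  # 24.0 MB at 1024x768, depth complexity 4
```

Even with these modest guesses the buffer is a sizeable chunk of a 128 MB card, which is why the overflow strategy matters so much.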

Yet another thing: we don’t know yet how many temp values can be output in a single pass. If there aren’t enough of them, this may force the shader compiler to produce longer code, because when you can’t store a temp value, you have to recompute it from scratch in the next pass(es). On the other hand, the more temp outputs are allowed, the larger the bandwidth overhead of multipass becomes, and the more buffers there are to allocate.

These are all speculations, but if they prove true, I’ll appreciate the NV30 solution as the easier one to use.

Now OT:
IMO, all of yesterday’s technical news is really overshadowed by the announcement of the GFFX 5200 price ($100)

I agree that today’s hardware probably doesn’t warrant f-buffers yet, because any shader that requires the f-buffer to kick in would probably run too slowly anyway. However, this is the first card with f-buffer support, and that’s still very cool from a nerd’s perspective.

From a developer’s point of view, the generalized FP texture support (2D/3D/cube) is more interesting, as are the multiple render targets (but 9700 already had those, I believe). These features aren’t available on NV30, AFAIK, which is a real shame.

All in all, I’m more interested in details on ATI’s f-buffer implementation than I am in any extra flexibility it might give me. How much storage do they allocate for the f-buffers? What do they do in the case of overflow, and is any application intervention required when it happens?

– Tom

Originally posted by MZ:

davepermen, you say in your real work with R300 you never needed more than 8 texcoords, 16 texunits, etc. That’s reasonable, but try to apply your own argument to yourself: did you ever needed shader longer than 1000 instructions?

yes.

i do a lot of raytracing stuff

MZ, if the driver compiles shaders directly from an HLSL, an f-buffer should be completely transparent to the app. However, if you don’t use the GL2 shading language but something more low-level, you’ll probably have to use it explicitly. I’m guessing this is why they say they’ll support it in GL2, but not in Direct3D. With the Direct3D HLSL, the driver never sees the high-level code AFAIK; MS’s runtime compiles it to pixel shader asm and sends that to the driver. And of course, the interesting difference from the NV30 is performance: which card performs better on long shaders? Since the Quadro supports 2048 instructions, I think nVidia could support any number of instructions if they wanted to.

I have a few questions about the implementation. The limitations are made clear in the paper, but a major feature like this is enormously interesting and the details are critical.

Can we get more information on how the API is exposed? Is it just additional registers? Do you require exact fragment replication between passes? How many fragments × stores can be held in the f-buffer before it overflows? What happens when the f-buffer overflows? Will you rely on HLSL compiler technology to solve any of these, and if so, will this ever be exposed at a lower level, or is it too much of a support nightmare?

Tom, there may be multipass shaders now that would benefit from the f-buffer. There’s no need to mangle your passes into a combination that works with limited framebuffer ‘registers’, AND if your multipass is fine-grained enough, a lot of the time it’s going to be fetching the previous pass’s result from the on-chip f-buffer instead of the destination framebuffer in VRAM. So the f-buffer could end up accelerating some of TODAY’S multipass, ONLY IF it’s tuned to take advantage of it. Rather than the f-buffer kicking in and dropping you to another, lower level of performance, it would hopefully kick in today, after you restructure your application a bit, and you’d see a performance win (this is of course wildly optimistic :-).
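A back-of-the-envelope version of dorbie’s bandwidth point: fine-grained multipass re-reads the previous pass’s result for every fragment, so serving those reads from an on-chip f-buffer instead of VRAM saves destination-framebuffer bandwidth. Frame rate, resolution, and pass count below are illustrative assumptions only:

```python
# Bytes/sec of previous-pass destination reads that an on-chip f-buffer
# could absorb, assuming every pass after the first reads the prior
# result once per fragment (RGBA8, so 4 bytes per read).

def vram_readback_bytes_per_sec(width, height, passes, fps, bpp=4):
    """Destination-read traffic generated by naive multipass."""
    return width * height * (passes - 1) * bpp * fps

gb = vram_readback_bytes_per_sec(1024, 768, 4, 60) / 1e9
print(round(gb, 2), "GB/s")  # ~0.57 GB/s at 1024x768, 4 passes, 60fps
```

That’s a small slice of the board’s total bandwidth, but it is all read-modify-write traffic on the framebuffer, which is the expensive kind, so the win could be larger than the raw number suggests.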

We still don’t know some critical stuff, most importantly how big the buffer is, and whether overflow is a disaster. It’d also be interesting to know whether this is entirely new memory or whether it eats into other on-chip cache.

The API is key as well; no one’s spoken to this at all. If it’s low level with just additional registers, then it may be tricky to exploit in conventional apps without some middleware or compilation magic. Then, say you DO decide to exploit the f-buffer: you may be trading state thrashing across passes and cache flushes on things like tiled textures for f-buffer fetches. For some workloads this is still a win, for others it’s not. It seemed great to think about the f-buffer in isolation and how it saves a lot of persistent framebuffer memory, but it ain’t as simple as that :-(.

For implementing longer shaders with no changes in other state it’s clearly a win, IF you can switch the shader itself quickly enough, because you don’t really care about the drawing order of primitives (you do care about meshing though!). But again, what happens with large on-screen primitives that hit a lot of fragments? Do you have to subdivide, or just take it in the shorts with an overflow? It almost demands some kind of on-chip subdivision of rasterization and automated sequential application of multiple shaders before f-buffer overflow bites you.

We need more info from ATI, it may be that even they don’t know how to best exploit this strange beast yet.

[This message has been edited by dorbie (edited 03-09-2003).]