I'm currently doing some GPGPU work for which I would like to evaluate a separable kernel in the following fashion:
- evaluate the full kernel at one corner pixel of the image
- evaluate the rest of the first row by reading the previous result, subtracting the column that moved out of the kernel area, and adding the column that moved in
- proceed similarly for the remaining rows, but this time subtracting/adding rows …
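For reference, the incremental scheme above can be sketched on the CPU like this (a plain Python sketch; the function name and the choice to skip edge pixels are mine, and it computes unnormalized box sums over a (2r+1)×(2r+1) window):

```python
def box_sums(img, r):
    """Sliding-window box sums: full kernel once, then column/row updates."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    # step 1: evaluate the full kernel at the corner pixel (r, r)
    s = sum(img[y][x] for y in range(2 * r + 1) for x in range(2 * r + 1))
    out[r][r] = s
    # step 2: rest of the first row — subtract the outgoing column, add the incoming one
    for x in range(r + 1, w - r):
        s += sum(img[y][x + r] - img[y][x - r - 1] for y in range(2 * r + 1))
        out[r][x] = s
    # step 3: remaining rows — read the result one row up, swap a row of the window.
    # Note this reads previously written output, which is exactly the dependency
    # that is problematic in a pixel shader.
    for y in range(r + 1, h - r):
        for x in range(r, w - r):
            top = sum(img[y - r - 1][c] for c in range(x - r, x + r + 1))
            bot = sum(img[y + r][c] for c in range(x - r, x + r + 1))
            out[y][x] = out[y - 1][x] + bot - top
    return out
```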
Now, for this to work I'd need to read from the current render texture, and I'd need to make sure that the data I'm reading has already been written, so I would need to control which pixels are processed in parallel in the pixel shader and which are not. I know that the results of what I'm trying to do are generally undefined, but I'd be very interested in any experiences with this or something similar, especially on Nvidia hardware (6th generation preferably …)
To be precise: can I assume that pixels generated from the same primitive command (lines, tris — obviously not points) are the only ones that will be processed in parallel, and if not, is there any way to restrict parallelization?
Thanks in advance
P.S.: I know that I could just split the separable kernel into horizontal and vertical kernels and evaluate them sequentially, but this would mean doing 2 passes, something I would like to avoid.
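For comparison, the two-pass version mentioned in the P.S. would look roughly like this on the CPU (a Python sketch; `convolve_separable` and the clamp-to-edge border handling are illustrative choices, not taken from any particular implementation):

```python
def convolve_separable(img, k):
    """Apply an odd-sized 1-D kernel k horizontally, then vertically."""
    h, w, r = len(img), len(img[0]), len(k) // 2

    def clamp(v, lo, hi):
        return max(lo, min(v, hi))

    # pass 1: horizontal
    tmp = [[sum(k[i] * row[clamp(x + i - r, 0, w - 1)] for i in range(len(k)))
            for x in range(w)] for row in img]
    # pass 2: vertical
    return [[sum(k[i] * tmp[clamp(y + i - r, 0, h - 1)][x] for i in range(len(k)))
             for x in range(w)] for y in range(h)]
```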
I’ve experimented with similar ideas in the past. I have no inside knowledge of the hardware, but it appears that texture reads go through a cache.
This is a problem, because texture caches appear to only flush when they spot normal texture updating going on (uploading texture data, switching to a different texture, etc). They don’t flush between primitives, so there’s no guarantee you’ll read the data that the last primitive just wrote to the screen.
I have played with binding a different texture when I want to flush the cache. I never found a recipe that worked very well though.
It sounds like you're trying to port the fast software box blur algorithm? That was something I also looked at, but I ended up doing a repeated series of increasing-width 8x1 horizontal and vertical kernels to get a fast blur even for a large radius.
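The repeated small-kernel idea can be sketched in 1-D like this (a Python sketch; the function names and the increasing radii are illustrative — repeated box passes approximate one wide blur, and three already come close to a Gaussian):

```python
def box_blur_1d(row, r):
    """One cheap 1-D box-average pass with clamped edges."""
    w = len(row)
    out = []
    for x in range(w):
        lo, hi = max(0, x - r), min(w - 1, x + r)
        out.append(sum(row[lo:hi + 1]) / (hi - lo + 1))
    return out

def wide_blur_1d(row, radii=(1, 2, 4)):
    """Chain several small passes instead of one expensive wide kernel."""
    for r in radii:
        row = box_blur_1d(row, r)
    return row
```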
Hmmm … seems like this is not feasible then … what I was looking for was a relatively lightweight operation to flush those caches etc. …
To be precise, I'm trying to accelerate an algorithm for creating disparity maps for stereo vision using a SAD or SSD kernel. Seems like I'll have to go with the two-pass version then …
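For context, window-based SAD block matching amounts to roughly the following (a brute-force CPU sketch with no sliding-window optimization; the names, search range, and border handling are illustrative):

```python
def disparity_sad(left, right, r, max_disp):
    """For each pixel, pick the disparity whose (2r+1)^2 SAD window is smallest."""
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            best_d, best_sad = 0, float("inf")
            for d in range(max_disp + 1):
                # sum of absolute differences between the left window
                # and the right window shifted by disparity d
                sad = sum(abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                          for dy in range(-r, r + 1) for dx in range(-r, r + 1))
                if sad < best_sad:
                    best_d, best_sad = d, sad
            disp[y][x] = best_d
    return disp
```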
I talked about this briefly at GDC this year. I wouldn’t recommend depending on the behaviour
of texturing from a buffer you’re also rendering to. You can make it work by adding glFinish in the right places, but there’s no guarantee this will work on future hardware.
You can get this to work reliably by ping-ponging between two buffers and copying back the changes, but the cost of this sometimes negates any benefit.
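The ping-pong scheme boils down to the following pattern (a Python sketch of the buffer swapping only; on the GPU `src` and `dst` would be two render-to-texture buffers rather than arrays):

```python
def run_passes(image, step, n_passes):
    """Each pass reads src and writes dst, then swaps, so a pass never
    samples the buffer it is currently writing to."""
    h, w = len(image), len(image[0])
    src, dst = image, [[0.0] * w for _ in range(h)]
    for _ in range(n_passes):
        for y in range(h):
            for x in range(w):
                dst[y][x] = step(src, x, y)
        src, dst = dst, src  # swap buffers between passes
    return src
```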