I'm currently doing some GPGPU work for which I would like to evaluate a separable kernel in the following fashion:
- evaluate the full kernel at one corner pixel of the image
- evaluate the rest of the first row by reading the previous result, subtracting the column that moved out of the kernel area, and adding the column that moved in
- proceed similarly for the remaining rows, but this time subtracting/adding rows …
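For reference, the incremental scheme above can be sketched on the CPU like this (a plain Python sketch; the function name and the choice to skip edge pixels are mine, and it computes unnormalized box sums over a (2r+1)×(2r+1) window):

```python
def box_sums(img, r):
    """Sliding-window box sums: full kernel once, then column/row updates."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    # step 1: evaluate the full kernel at the corner pixel (r, r)
    s = sum(img[y][x] for y in range(2 * r + 1) for x in range(2 * r + 1))
    out[r][r] = s
    # step 2: rest of the first row — subtract the outgoing column, add the incoming one
    for x in range(r + 1, w - r):
        s += sum(img[y][x + r] - img[y][x - r - 1] for y in range(2 * r + 1))
        out[r][x] = s
    # step 3: remaining rows — read the result one row up, swap a row of the window.
    # Note this reads previously written output, which is exactly the dependency
    # that is problematic in a pixel shader.
    for y in range(r + 1, h - r):
        for x in range(r, w - r):
            top = sum(img[y - r - 1][c] for c in range(x - r, x + r + 1))
            bot = sum(img[y + r][c] for c in range(x - r, x + r + 1))
            out[y][x] = out[y - 1][x] + bot - top
    return out
```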
Now, for this to work I'd need to read from the current render texture, and I'd need to make sure that the data I'm reading has already been written, so I would need to control which pixels are processed in parallel in the pixel shader and which are not. I know that the results of what I'm trying to do are generally undefined, but I'd be very interested in any experiences with this or something similar, especially on Nvidia hardware (6th generation preferably …)
To be precise: can I assume that pixels generated from the same primitive command (lines, tris — obviously not points) are the only ones that will be processed in parallel, and if not, is there any way to restrict parallelization?
Thanks in advance
P.S.: I know that I could just split the separable kernel into horizontal and vertical kernels and evaluate them sequentially, but this would mean doing 2 passes, something I would like to avoid.
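For comparison, the two-pass version mentioned in the P.S. would look roughly like this on the CPU (a Python sketch; `convolve_separable` and the clamp-to-edge border handling are illustrative choices, not taken from any particular implementation):

```python
def convolve_separable(img, k):
    """Apply an odd-sized 1-D kernel k horizontally, then vertically."""
    h, w, r = len(img), len(img[0]), len(k) // 2

    def clamp(v, lo, hi):
        return max(lo, min(v, hi))

    # pass 1: horizontal
    tmp = [[sum(k[i] * row[clamp(x + i - r, 0, w - 1)] for i in range(len(k)))
            for x in range(w)] for row in img]
    # pass 2: vertical
    return [[sum(k[i] * tmp[clamp(y + i - r, 0, h - 1)][x] for i in range(len(k)))
             for x in range(w)] for y in range(h)]
```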
I’ve experimented with similar ideas in the past. I have no inside knowledge of the hardware, but it appears that texture reads go through a cache.
This is a problem, because texture caches appear to only flush when they spot normal texture updating going on (uploading texture data, switching to a different texture, etc). They don’t flush between primitives, so there’s no guarantee you’ll read the data that the last primitive just wrote to the screen.
I have played with binding a different texture when I want to flush the cache. I never found a recipe that worked very well though.
It sounds like you're trying to port the fast software box blur algorithm? That was something I also looked at, but I ended up doing a repeated series of increasing-width 8x1 horizontal and vertical kernels to get a fast blur even for a large radius.
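The repeated small-kernel idea can be sketched in 1-D like this (a Python sketch; the function names and the increasing radii are illustrative — repeated box passes approximate one wide blur, and three already come close to a Gaussian):

```python
def box_blur_1d(row, r):
    """One cheap 1-D box-average pass with clamped edges."""
    w = len(row)
    out = []
    for x in range(w):
        lo, hi = max(0, x - r), min(w - 1, x + r)
        out.append(sum(row[lo:hi + 1]) / (hi - lo + 1))
    return out

def wide_blur_1d(row, radii=(1, 2, 4)):
    """Chain several small passes instead of one expensive wide kernel."""
    for r in radii:
        row = box_blur_1d(row, r)
    return row
```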
Hmmm … seems like this is not feasible then … what I was looking for was a relatively lightweight operation to flush those caches etc. …
To be precise, I'm trying to accelerate an algorithm for creating disparity maps for stereo vision using a SAD or SSD kernel. Seems like I'll have to go with the two-pass version then …
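For context, window-based SAD block matching amounts to roughly the following (a brute-force CPU sketch with no sliding-window optimization; the names, search range, and border handling are illustrative):

```python
def disparity_sad(left, right, r, max_disp):
    """For each pixel, pick the disparity whose (2r+1)^2 SAD window is smallest."""
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            best_d, best_sad = 0, float("inf")
            for d in range(max_disp + 1):
                # sum of absolute differences between the left window
                # and the right window shifted by disparity d
                sad = sum(abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
                          for dy in range(-r, r + 1) for dx in range(-r, r + 1))
                if sad < best_sad:
                    best_d, best_sad = d, sad
            disp[y][x] = best_d
    return disp
```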
I talked about this briefly at GDC this year. I wouldn’t recommend depending on the behaviour
of texturing from a buffer you’re also rendering to. You can make it work by adding glFinish in the right places, but there’s no guarantee this will work on future hardware.
You can get this to work reliably by ping-ponging between two buffers and copying back the changes, but the cost of this sometimes negates any benefit.
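The ping-pong scheme boils down to the following pattern (a Python sketch of the buffer swapping only; on the GPU `src` and `dst` would be two render-to-texture buffers rather than arrays):

```python
def run_passes(image, step, n_passes):
    """Each pass reads src and writes dst, then swaps, so a pass never
    samples the buffer it is currently writing to."""
    h, w = len(image), len(image[0])
    src, dst = image, [[0.0] * w for _ in range(h)]
    for _ in range(n_passes):
        for y in range(h):
            for x in range(w):
                dst[y][x] = step(src, x, y)
        src, dst = dst, src  # swap buffers between passes
    return src
```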