Hardware rasterizer

Hello everyone! This is my first post here :)
I’m currently doing a radiosity simulation using CUDA and OpenGL. It is based on progressive refinement and ‘rendering’ onto a hemicube. I have reasonably fast CUDA code simulating the ‘rendering pipeline’, but when it comes to software rasterization of triangles… uh, it’s really slow and ugly.
So going into the rasterization phase, I have the vertices of the triangles that need to be ‘rendered’, in screen-space coordinates, with depth and index information. Something like this:

vertices: (123,56) (110, 99) (33, 59)
triangle index: 4561 (this can be converted to color if needed)
depth: 15.3 (for zbuffer)
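
The "index can be converted to color" idea could look roughly like the sketch below. It assumes an 8-bit-per-channel RGBA target and a low-byte-in-red ordering; the names `Rgba8`, `indexToColor`, `colorToIndex` and `indexAtPixel` are made up for illustration and are not from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: the four 8-bit channels of one RGBA8 pixel.
struct Rgba8 {
    std::uint8_t r, g, b, a;
};

// Spread a 32-bit triangle index over the four channels, lowest byte in red.
// (The byte order is an arbitrary choice made for this sketch.)
inline Rgba8 indexToColor(std::uint32_t index) {
    return Rgba8{
        static_cast<std::uint8_t>(index & 0xFFu),
        static_cast<std::uint8_t>((index >> 8) & 0xFFu),
        static_cast<std::uint8_t>((index >> 16) & 0xFFu),
        static_cast<std::uint8_t>((index >> 24) & 0xFFu)};
}

// Inverse of indexToColor.
inline std::uint32_t colorToIndex(Rgba8 c) {
    return static_cast<std::uint32_t>(c.r)
         | (static_cast<std::uint32_t>(c.g) << 8)
         | (static_cast<std::uint32_t>(c.b) << 16)
         | (static_cast<std::uint32_t>(c.a) << 24);
}

// Read the triangle index stored in pixel i of a tightly packed RGBA8 buffer.
inline std::uint32_t indexAtPixel(const std::uint8_t* rgba,
                                  std::size_t pixelCount, std::size_t i) {
    assert(i < pixelCount);
    const std::uint8_t* p = rgba + 4 * i;
    return colorToIndex(Rgba8{p[0], p[1], p[2], p[3]});
}
```

For the round trip to be exact, anything that alters color values (blending, dithering, multisampling) would have to be off. It may also be worth reserving one value, for example 0, for "no triangle" so that it cannot be confused with the clear color.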

What I would like to do is use the hardware rasterizer for the rasterization, then take the result back to CUDA and process it. As far as I know, it is not possible to access the hardware rasterizer directly from CUDA, so I would like to know how to use only the rasterization phase of OpenGL. I just need to render flat triangles with a z-buffer. No textures, shading, transformations, backface culling or frustum culling (well, simple 2D clipping would be nice). Just pass in the coordinates of ready-to-render triangles and receive the result as fast as possible.
The point is, I want to avoid any overhead and operations I don’t need, because this process is going to be repeated thousands of times.

Any help, ideas or suggestions will be very helpful :)

Another option is to include hardware projection in this process as well, so culling would be done in CUDA (I use octrees to do frustum culling) and projection + rasterization in hardware. I’m just looking for the fastest solution :)

What do you need as a result of the rasterization? The depth buffer, the triangle index values, or both?

I believe the frustum-culling/clipping hardware is insanely fast, so you can probably let OpenGL handle that (as long as you do coarse culling first, so you don’t send too many triangles that end up completely culled).

I am not an expert, but this is what I would do.

  • Pass the vertices, triangle index and depth as vertex attributes in a vertex buffer object. (shared from CUDA data?)

  • Set up an FBO render target (shared with CUDA?)

  • Write a simple pass-through vertex shader that accesses this index data. Be careful with the vertex positions - the output is expected in normalized device coordinates, a cube spanning -1…1 on each axis, which is then converted to render target coordinates. (So if you already have render target coordinates, you will have to normalize them to the range -1…1.) (Also remember to set the “w” coordinate - probably to 1.)

If you don’t need the output depth buffer, you can supply the depth value as the “z” coordinate for depth comparisons. (This also needs to be in the range -1…1.)

  • Write a simple fragment shader to write out the color index you want.
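
The coordinate normalization from the steps above could be done on the host or in a CUDA kernel before the data goes into the VBO. Here is a minimal sketch of the arithmetic in plain C++. The struct `RasterVertex`, the helper names and the `zNear`/`zFar`/`flipY` parameters are assumptions made for illustration; nothing here is an OpenGL or CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-vertex record for the VBO: a position that is already in
// normalized device coordinates, plus the triangle index as an extra attribute.
struct RasterVertex {
    float x, y, z, w;        // w = 1, so the divide by w leaves x, y, z unchanged
    std::uint32_t triIndex;  // to be passed through to the fragment shader
};
static_assert(sizeof(RasterVertex) == 20, "expected 4 floats + 1 uint32, no padding");

// Map a render-target coordinate in [0, size] to NDC in [-1, 1].
inline float pixelToNdc(float pixel, int size) {
    assert(size > 0);
    return 2.0f * pixel / static_cast<float>(size) - 1.0f;
}

// Map a depth in [zNear, zFar] linearly to NDC z in [-1, 1].
// Assumption: a linear remap is good enough here. It keeps the ordering of the
// input depths, and values outside [zNear, zFar] land outside [-1, 1].
inline float depthToNdc(float depth, float zNear, float zFar) {
    assert(zFar > zNear);
    return 2.0f * (depth - zNear) / (zFar - zNear) - 1.0f;
}

// Build one vertex from screen-space input for a square target.
// flipY converts a y-down screen convention to OpenGL's y-up NDC.
inline RasterVertex makeVertex(float px, float py, float depth,
                               std::uint32_t triIndex, int resolution,
                               float zNear, float zFar, bool flipY) {
    const float x = pixelToNdc(px, resolution);
    float y = pixelToNdc(py, resolution);
    if (flipY) {
        y = -y;
    }
    return RasterVertex{x, y, depthToNdc(depth, zNear, zFar), 1.0f, triIndex};
}
```

With a 512x512 target, `makeVertex(256.0f, 128.0f, 50.0f, 4561u, 512, 0.0f, 100.0f, false)` places the vertex at x = 0, y = -0.5, z = 0. Geometry whose z falls outside -1…1 would be clipped by the hardware, so `zNear` and `zFar` need to bracket every depth that should survive.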

I think the solution is to use CUDA-OpenGL interop. I read some tutorials, and I think I can do the rendering now. But I don’t know how to get the result from the framebuffer into CUDA global memory (it should be a fast GPU-to-GPU copy).

CUDA by Example (good book) has a chapter on exactly this topic, though doubtless the material exists elsewhere as well.

Google these:

  • “cudaGLSetGLDevice” - To tell CUDA you intend to use OpenGL interop with this CUDA device.

  • “cudaGraphicsGLRegisterBuffer” - You use this to make a GL buffer object accessible from the CUDA side.

  • “cudaGraphicsMapResources”/“cudaGraphicsResourceGetMappedPointer” - to map and access this GL buffer object from the CUDA side.

IIRC you can do this kind of thing with OpenCL too.

Thanks very much! It’s working now, but here’s another question :D
I have a VBO for vertices and colors; I map it to CUDA, write some data there, then unmap it and call glDrawArrays. This works very well, and it’s fast, because it all stays in GPU memory. But after rendering I need the framebuffer back for processing, so I have a PBO and do something like this:

// set the target framebuffer to read

// read pixels from framebuffer to PBO
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, interop->framePBO.boID);
glReadPixels(0, 0, hemicubeResolution_, hemicubeResolution_ , GL_RGBA, GL_UNSIGNED_BYTE, 0);

This gives me the pixels I need. I map the PBO to CUDA and everything is fine, but I have a feeling this data is going through the CPU. For example, rendering 600 frames (in a loop) of 5000 triangles each from this on-GPU VBO takes about 0.5 s (including some prior memory management), but as soon as I add the framebuffer readback to the loop (immediately after the render code), it takes 3 s.

So I don’t know whether copying a 512x512 image is really that slow, or whether the data is going to the CPU and then, after I map it to CUDA, back to the GPU.

Is there a way to ensure that the PBO is created and kept in GPU memory, so there are only GPU->GPU operations? I experimented with the usage hint passed to glBufferData, such as GL_DYNAMIC_READ or GL_DYNAMIC_COPY. Some are better and some are worse, but it is still slow compared to just rendering. (Isn’t rendering 5000 triangles supposed to be slower than copying the rendered image?)
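
A quick back-of-the-envelope check on the numbers above may help narrow this down. The sketch below only does arithmetic on the figures quoted in this post (512x512, GL_RGBA/GL_UNSIGNED_BYTE, 600 frames, roughly 2.5 s of added time); the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one square RGBA8 frame (GL_RGBA + GL_UNSIGNED_BYTE = 4 bytes per pixel).
constexpr std::size_t frameBytes(std::size_t resolution) {
    return resolution * resolution * 4;
}

// Effective throughput in MiB/s if `frames` readbacks account for `extraSeconds`.
inline double effectiveMiBPerSecond(std::size_t resolution, std::size_t frames,
                                    double extraSeconds) {
    assert(extraSeconds > 0.0);
    const double totalMiB =
        static_cast<double>(frameBytes(resolution) * frames) / (1024.0 * 1024.0);
    return totalMiB / extraSeconds;
}

// Added time per frame in milliseconds.
inline double extraMsPerFrame(std::size_t frames, double extraSeconds) {
    assert(frames > 0);
    return extraSeconds * 1000.0 / static_cast<double>(frames);
}
```

One frame is 1 MiB, so 600 readbacks move 600 MiB; spread over the extra 2.5 s that is about 240 MiB/s, or roughly 4.2 ms per frame. That looks low for either a bus transfer or an on-GPU copy, so raw copy bandwidth may not be the whole story. One possible explanation (an assumption, not something measured here) is that OpenGL queues draw calls asynchronously, so the 0.5 s render-only loop partly measures command submission, while reading back and mapping the PBO right away makes each iteration wait for its frame to actually finish.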

Thanks again for any tips.

Edit: hmm, maybe something like offscreen rendering will help, but I’m not sure if I can map an FBO to CUDA. I will give it a shot :)

Which GPU?