EXT_draw_range_elements: Driver optimisation

The spec for ‘EXT_draw_range_elements’ says that the implementation/driver can make use of the extra information to avoid any pre-processing steps (like walking the element list). I can see how this might be useful when drawing a vertex buffer stored in system memory; you’d only want to send up as small a range of verts as possible, given by the range of vertices indexed by your element array. I can’t yet see whether this would allow optimisation for vertex data already in AGP or video memory.

Could someone shed some more light on how GL drivers might otherwise handle ‘draw elements’ (without any range information) and whether/how they make use of the range information for AGP/vid.mem. data?

Cheers.

Originally posted by DanielHawson:
[b]The spec for ‘EXT_draw_range_elements’ says that the implementation/driver can make use of the extra information to avoid any pre-processing steps (like walking the element list).

[…]

Could someone shed some more light on how GL drivers might otherwise handle ‘draw elements’ (without any range information) and whether/how they make use of the range information for AGP/vid.mem. data?

Cheers.[/b]
That note in the spec is written in the context of plain vertex arrays (no VBO, no compiled vertex arrays), and so is the following explanation.

As such, the vertex array itself cannot be in video memory, because it lives in user-allocated memory.
What you can have in AGP or video memory is a copy of the vertex array. Now, this has several problems:

  • The application can modify the vertex array as soon as the glDrawXXXXXElements call returns. This means that if you want to reuse that copy in a later glDrawXXXXXElements call, you have to check that your copy is still coherent with the original (note that if you cannot reuse the copy, there’s not much sense in having it).
  • Without the range information, the driver has to walk the array of indices to know how much of the array must be copied into AGP/vidmem. If the next glDrawXXXXXElements call involves more elements from the vertex array, the driver may need to recopy/expand the copy.

So the driver has three ways of dealing with these calls:

  • Expanding the glDrawXXXXXElements call into the command buffer (CPU intensive).

  • Locking the user memory buffer to get a physical address in order to be able to stream those vertices to the card directly. This has several problems:

      • Locking memory is an expensive operation.
      • The driver needs to know the size to lock (which means walking the array of indices if the range is not provided).
      • Because the app can modify the memory buffer right after glDrawXXXXXElements returns, the driver needs to make sure all the elements have been consumed by the graphics card before returning.

  • Copying the array into some memory accessible by the card (AGP/video) and streaming from there. This is only really useful if the driver can reuse the array for later glDrawXXXXXElements calls but, as mentioned earlier, this has coherency problems if the application modifies the buffer or if the range changes from call to call.

Which option the driver chooses comes down to a heuristic based on the size of the buffer, past behaviour/usage pattern, and maybe even the application’s name ;)
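To make the "walk the array of indices" cost concrete, here is a minimal sketch in C of the scan a driver would need when no range is supplied. The names `index_range`, `scan_index_range` and `range_size_bytes` are made up for illustration and are not taken from any real driver; the sketch also assumes 32-bit indices and a fixed vertex stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the span of vertices an index list touches. */
typedef struct {
    uint32_t min_index;
    uint32_t max_index;
} index_range;

/* Walk the whole index list to find the lowest and highest vertex used.
   This is the pre-processing pass that explicit range arguments make
   unnecessary. For an empty list the result has min_index > max_index. */
static index_range scan_index_range(const uint32_t *indices, size_t count)
{
    index_range r = { UINT32_MAX, 0 };
    for (size_t i = 0; i < count; ++i) {
        if (indices[i] < r.min_index) r.min_index = indices[i];
        if (indices[i] > r.max_index) r.max_index = indices[i];
    }
    return r;
}

/* Number of bytes that would have to be copied (or locked) to cover the
   range, assuming tightly strided vertices of `stride` bytes each. */
static size_t range_size_bytes(index_range r, size_t stride)
{
    if (r.min_index > r.max_index) {
        return 0; /* empty index list: nothing to copy */
    }
    return ((size_t)r.max_index - r.min_index + 1) * stride;
}
```

With glDrawRangeElements the application hands over the equivalent of `min_index` and `max_index` directly, so a scan like this can be skipped.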

Does that mean glDrawRangeElements is not necessary if one uses VBOs anyway? That would be nice, I could save some CPU time by not computing that information (yeah, I know it wouldn’t be much).

Although, AFAIK nVidia (maybe ATI too) still tell everyone to use the “Range”-version, so maybe they can still use that extra information in some way?

Jan.

“The range is precious information for the VBO manager, which can use it to optimize its internal memory configuration.” - nvidia’s ‘Using VBO’ white paper.

The range information becomes even more important with VBO, at least for non-static data. For example, if you map a portion of a VBO or modify it via BufferSubData, only a portion of the VBO is “dirty”. This means that if rendering operations are in flight that don’t touch the affected data, the driver doesn’t have to wait for the card to finish.

It also works the other way around. Imagine a VBO that is in system memory, but is “cached” in on-card or AGP memory. If a portion of the VBO is modified, but that portion is not used by a DrawRangeElements call, the dirty data doesn’t need to be sent to the card.
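Both cases boil down to an interval overlap test between the dirty region and the range passed to DrawRangeElements. A rough sketch in C, assuming the dirty region is tracked as an inclusive range of vertex indices (a real driver may well track bytes or pages instead); `vertex_range`, `ranges_overlap` and `draw_needs_sync` are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inclusive vertex range [first, last]. */
typedef struct {
    uint32_t first;
    uint32_t last;
} vertex_range;

/* True if the two inclusive ranges share at least one vertex. */
static bool ranges_overlap(vertex_range a, vertex_range b)
{
    return a.first <= b.last && b.first <= a.last;
}

/* A draw covering [start, end] only has to wait for (or trigger an upload
   of) the dirty region if the two ranges intersect. */
static bool draw_needs_sync(vertex_range dirty, uint32_t start, uint32_t end)
{
    vertex_range draw = { start, end };
    return ranges_overlap(dirty, draw);
}
```

Without a range, the conservative assumption is that every draw may touch the dirty region, so the cheap test above is not available.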

The “range” data typically changes when you change your index list. When you change your index list, you’re touching all the index data anyway, so you can calculate the min/max while you’re doing that. Dollars to donuts that you’re memory bound anyway, so the extra operations for min/max will not show up on a profile, assuming the compiler uses predication rather than (poorly predicted) branches.
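As a sketch of folding the min/max into the pass that writes the index list, something like the following C would do. `write_indices_with_range` and its parameters are invented for this example; it assumes 32-bit indices that are rebased by a constant offset while being copied.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` indices from src to dst, adding `base` to each, and track
   the smallest and largest value written in the same loop. The ternaries
   give the compiler the chance to emit conditional moves or min/max
   instructions instead of branches. For count == 0 the outputs are
   UINT32_MAX and 0 respectively. */
static void write_indices_with_range(uint32_t *dst, const uint32_t *src,
                                     size_t count, uint32_t base,
                                     uint32_t *out_min, uint32_t *out_max)
{
    uint32_t lo = UINT32_MAX;
    uint32_t hi = 0;
    for (size_t i = 0; i < count; ++i) {
        uint32_t v = src[i] + base;
        dst[i] = v;
        lo = v < lo ? v : lo;
        hi = v > hi ? v : hi;
    }
    *out_min = lo;
    *out_max = hi;
}
```

The resulting values go straight into the `start` and `end` arguments, e.g. `glDrawRangeElements(GL_TRIANGLES, lo, hi, (GLsizei)count, GL_UNSIGNED_INT, dst)` for a client-side index array.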