pBuffers vs renderbuffers (FBO) - memory usage

I meant the hardware solution:

That’s a “solution” to an entirely different problem. And neither ATi nor nVidia has “solved” it.

In response to REAL virtualization.

Think about what happens when a page fault occurs on a CPU. The CPU has to suspend the currently running process and invoke an interrupt handler to load the required data from disk. Once the data is loaded, it can update the page table with the new address, reload the suspended process, and continue along its merry way. A typical x86 CPU has 8-16 integer GPRs, 8-16 FP registers, and 8-16 SSE registers. This does not include all the additional control registers that need to be stored away.
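To put a rough number on that, here is a minimal sketch of the context a CPU has to save on a fault. The register counts are for x86-64; the struct is purely illustrative and is not any real kernel’s layout:

```cpp
#include <cstdint>

// Illustrative only: roughly what an x86-64 kernel must stash away
// before it can service a page fault and later resume the process.
struct CpuFaultContext {
    uint64_t gpr[16];       // integer general-purpose registers
    uint64_t rip, rflags;   // instruction pointer and flags
    uint8_t  x87_fpu[108];  // legacy x87 FP save area
    uint8_t  xmm[16][16];   // 16 SSE registers, 16 bytes each
    uint64_t control[8];    // fault address (CR2), segment/control state, etc.
};
// Total: on the order of a few hundred bytes -- tiny compared with the
// amount of in-flight state sitting inside a GPU pipeline.
```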

For a GPU to have this same mechanism, you need a GPU that can suspend rendering mid-primitive because a tile is in a page with a non-present backing. Unlike a CPU, though, the GPU has an arbitrary number of execution units performing vertex shaders, geometry shaders and pixel shaders. You also have all the fixed-function stages in between them and at the ends of the pipeline, as well as the texture units fetching data and holding control state of their own. What state do you save?

If you were to save all the state for, say, an R600, you’d have multiple megabytes worth of information that would need to be saved away and then reloaded after the data was either loaded from system memory or disk. However, you can get away with checkpointing and saving only the minimal set of state needed to restart each portion of the pipe. This is still not a cheap operation, as the GPU cannot actually handle the page fault itself; the CPU has to handle all the behind-the-scenes magic, since you do not want to keep your system memory locked down all the time.

With some new features in GPUs, such as MEMEXPORT, just using the checkpointing system may not even be possible: restarting the shader units could change the memory ordering, because the shader engines have to replay everything up to the checkpoint before they start actually running new pixels, vertices, triangles, etc.

In most cases, if this is a serious app, there are not going to be many other things for the GPU to process, so there will not be another string of commands to fill the time while it is faulting. Then again, this may push more apps toward shared contexts, with each context off doing its own thing, so that if a page fault does occur on one context there may be other work for the GPU to do while it waits. That may force more of the OpenGL programmers of the world to go through their own version of the multi-threaded programming transition.

So:

  1. the CPU still has to handle the page faults. Even if the GPU could transparently queue up transfers to/from system memory, the CPU has to be involved in finding the pages the GPU needs in system memory and locking them down (see the sketch after this list). In the worst-case scenario, the CPU will also have to bring the needed pages in from disk. If you want the GPU to be able to do all this without CPU intervention, then you will end up with even higher system memory usage, because all objects’ backing stores will need to be locked down in memory while they have commands pending. Currently, they are only locked down during the paging.

  2. the amount of memory required to store the state of a GPU is much higher than that needed for a CPU.

  3. what does the GPU do while waiting for the data to be paged in?
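To make point 1 concrete, here is a heavily simplified sketch of what the CPU-side driver would have to do each time the GPU faults. Every function and type name here is invented for illustration; no real driver exposes an interface like this, and real drivers are far more involved:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the steps the CPU cannot avoid.
struct PageRange { uint64_t first_page, page_count; };
struct GpuObject { const char* name; };

static GpuObject texture = { "some_texture" };

static GpuObject* lookup_object_for_address(uint64_t)          { return &texture; }
static PageRange  pages_needed(GpuObject*, uint64_t addr)       { return { addr >> 12, 1 }; }
static bool resident_in_system_memory(PageRange)                { return false; }
static void read_from_disk(GpuObject* o, PageRange)             { std::printf("pagefile read for %s\n", o->name); }
static void lock_pages(PageRange)                               {}  // wire down: pages must not move during the DMA
static void map_into_gart(PageRange)                            {}  // make the pages GPU-visible
static void queue_dma_to_vram(GpuObject*, PageRange)            {}
static void update_gpu_page_tables(GpuObject*, PageRange)       {}
static void unlock_pages(PageRange)                             {}  // only locked for the duration of the paging
static void resume_gpu_from_checkpoint()                        { std::printf("GPU resumed\n"); }

void service_gpu_page_fault(uint64_t gpu_virtual_address) {
    GpuObject* obj   = lookup_object_for_address(gpu_virtual_address);
    PageRange  pages = pages_needed(obj, gpu_virtual_address);

    if (!resident_in_system_memory(pages))
        read_from_disk(obj, pages);        // worst case: the data is still on disk

    lock_pages(pages);
    map_into_gart(pages);
    queue_dma_to_vram(obj, pages);
    update_gpu_page_tables(obj, pages);
    unlock_pages(pages);
    resume_gpu_from_checkpoint();
}

int main() { service_gpu_page_fault(0xABC000); }
```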

Thanks AkaTONE, very interesting post.

After reading it, it seems to me that there are roughly two ways of doing useful hardware texture/buffer virtualization:

  • transparently handled by the driver, with all buffers kept in CPU memory. A page fault on the GPU triggers the upload of the relevant block from RAM to VRAM.
  • handled by the OpenGL application, by registering a callback for GPU page faults, so the app can upload the block from disk in parallel (a rough sketch of such an API is below). The difficulty would be splitting the texture into blocks suitable for each piece of hardware, which looks more complex than the current texture formats (BGRA8, LUMINANCE_FP16, etc.).
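A rough sketch of what the application-side callback registration might look like. This API is entirely hypothetical (only glTexSubImage2D in the comment is a real call); it is invented here purely to illustrate the second option:

```cpp
// Hypothetical, extension-style API for the "app handles the fault"
// approach above. It is not an existing OpenGL extension.
typedef void (*GpuFaultCallback)(unsigned int texture,
                                 int level,            // mipmap level that faulted
                                 int xoff, int yoff,   // origin of the missing block
                                 int width, int height,
                                 void* user_data);

// The app would register one callback per texture...
void glRegisterTextureFaultCallbackHYPOTHETICAL(unsigned int texture,
                                                GpuFaultCallback callback,
                                                void* user_data);

// ...and inside the callback it would stream the block in from disk and
// hand it back with an ordinary sub-image upload, e.g.:
//   glTexSubImage2D(GL_TEXTURE_2D, level, xoff, yoff, width, height,
//                   GL_BGRA, GL_UNSIGNED_BYTE, block_pixels);
```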

And in both cases, to answer your point 3 about keeping the GPU busy doing useful things while a new block is uploaded, take advantage of mipmapping, as with clipmaps/MegaTexture/Google Earth/etc.: sample from a lower-resolution mipmap when the optimal mipmap block is not available.

Cache eviction would be LRU/LFU based, with a bias toward keeping the low-resolution mipmap levels resident. When fetching blocks, adjacent blocks (in three dimensions for 2D mipmapped textures: x, y and mip level) also become candidates for upload, to anticipate probable future needs.
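A minimal sketch of that eviction policy, reading the bias as “low-resolution levels stay resident longer” and keying blocks by (level, x, y). All names are invented for illustration; it only shows the bookkeeping, not the uploads:

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// Illustrative block cache: plain LRU, but higher (lower-resolution)
// mipmap levels get an artificial recency boost so they are evicted
// last -- they are cheap to keep and are the fallback when a high-res
// block is missing.
struct BlockKey {
    int level, x, y;
    bool operator<(const BlockKey& o) const {
        return std::tie(level, x, y) < std::tie(o.level, o.x, o.y);
    }
};

class BlockCache {
    std::map<BlockKey, uint64_t> last_use_;  // key -> biased timestamp
    uint64_t clock_ = 0;
    std::size_t capacity_;
public:
    explicit BlockCache(std::size_t capacity) : capacity_(capacity) {}

    void touch(const BlockKey& k) {
        const uint64_t bias = 1000ull * static_cast<uint64_t>(k.level);
        last_use_[k] = ++clock_ + bias;      // low-res blocks look "more recent"
        if (last_use_.size() > capacity_) evict_one();
    }

private:
    void evict_one() {
        auto victim = last_use_.begin();
        for (auto it = last_use_.begin(); it != last_use_.end(); ++it)
            if (it->second < victim->second) victim = it;
        last_use_.erase(victim);             // drop the (biased) least-recently-used block
    }
};
```

Prefetching the adjacent blocks would then just mean calling touch() (and scheduling an upload) for the neighbours of whatever block the sampler actually requested.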

I am not enough into hardware to know whether the above is even realistic, but at least it seems less complex than full-featured, CPU-like virtual memory.

Any comments?

Intriguing idea. I had never thought of registering callbacks from an application to do this. However, the problem is that now you are going to have some really awkward message/signal routing mechanism so that you can get a signal that a buffer/texture/rendertarget faulted for a specific range/region/volume. Think about what has to happen here. The GPU faults, saying it wanted to write to a specific address. The driver then has to figure out which object that address was associated with. Then it has to figure out which page and associated range/region/volume that address was in. This info then gets propagated from an interrupt up to the application. The application then has to either procedurally generate the data, go through the file system to get the data, or copy the data from its own internal memory. Then the GPU has to queue up a transfer that will either do a straight copy OR do a blt that converts from the linear format you provide the data in to the layout the GPU uses for optimal cache reuse.
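On that last step, the reason a straight copy is often not enough is that GPUs typically store textures in a tiled/swizzled layout rather than scanline order. A toy example of the kind of remapping the blt performs, using a simple 4x4 tiling that is not any particular vendor’s real layout:

```cpp
#include <cstddef>

// Toy linear -> tiled remap: the texture is stored as 4x4-pixel tiles
// laid out one after another, so neighbouring texels end up close
// together in memory. Real GPU layouts are vendor-specific and more
// elaborate; this only illustrates why the upload is a blt, not a memcpy.
std::size_t tiled_offset(std::size_t x, std::size_t y, std::size_t width_in_pixels) {
    const std::size_t tile          = 4;
    const std::size_t tiles_per_row = width_in_pixels / tile;
    const std::size_t tile_index    = (y / tile) * tiles_per_row + (x / tile);
    const std::size_t within_tile   = (y % tile) * tile + (x % tile);
    return tile_index * tile * tile + within_tile;   // offset in pixels
}
```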

As for the idea of having the LOD clamped to the levels that are present, that is a good idea. Now the question is: which component sets this clamp? Is it set when the texture fetch faults? How does the sampler proceed once this clamp is set AFTER the LOD was already calculated and turned into real offsets? Is it the driver’s responsibility to check that all of a texture is in memory? Since most drivers work at a coarse-grained level, this may not be so useful until fine-grained virtualization is supported.
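For reference, an application (or a driver acting on its behalf) can already express this kind of clamp with existing texture parameters. A minimal sketch, assuming a GL 1.2+ header and that some hypothetical paging system reports which levels are resident:

```cpp
#include <GL/gl.h>

// Restrict sampling to the mipmap levels that are actually resident.
// GL_TEXTURE_BASE_LEVEL / GL_TEXTURE_MAX_LEVEL and the MIN/MAX_LOD
// clamps are standard OpenGL; where the resident-level numbers come
// from is the part that does not exist today.
void clamp_to_resident_levels(GLuint texture,
                              GLint finest_resident_level,    // smallest level number present
                              GLint coarsest_resident_level)  // largest level number present
{
    glBindTexture(GL_TEXTURE_2D, texture);

    // Limit the set of levels the sampler may touch...
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, finest_resident_level);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL,  coarsest_resident_level);

    // ...and express the same restriction as an LOD clamp.
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_LOD, (GLfloat)finest_resident_level);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_LOD, (GLfloat)coarsest_resident_level);
}
```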

Another issue is vertices in buffer objects. Buffer objects have no concept of level of detail. If a page from a buffer object faults, you are either going to have to wait for the data OR return some default value such as 0, 0, 0, 1. People can live with lower quality textures being used, but if you get incorrect geometry, then you start to have serious issues. You could reduce the size of the buffers needed by eventually using the tessellator in the latest ATI chips. You can get amazing levels of detail by generating the geometry on the fly ON the GPU. Other than that, there is no way to deal with that info missing.

Also, color buffers and depth buffers have no concept of level of detail either. They have no fallback but to wait for the data.

Unfortunately, the only solution besides the current per-object virtualization is fine-grained, per-page virtualization. That is the only way to solve all of these problems, but you get the extra overhead of having to deal with page faults. If ATI/nVidia/Intel can get the cost of a page fault down, i.e. the amount of data that needs to be saved off, then this becomes more viable.

There is a HUGE AMOUNT OF WORK that goes on in the driver and hardware to make sure what you see on screen is what you want … usually :wink:

There’s an easier solution that works ‘most’ of the time. More VRAM :slight_smile:

Currently we have:

Disk (pagefile) -> main memory -> bus -> 512 MB DDR3 VRAM (for example)

Why not…

Disk -> RAM -> bus -> 4 GB DDR2 VRAM -> 512 MB DDR3 VRAM

Then we have the CPU responsible for virtualising between main memory and the larger 4 GB visible VRAM, and the GPU responsible for virtualising between the low-speed 4 GB VRAM and the smaller high-speed VRAM. The 4 GB would be the GPU equivalent of an L3 cache. Now, you may say that cost is a factor - but today I can buy 4 GB of DDR2 at RETAIL prices for less than £50.

The memory is the cheap part. Part of the reason the R600 was so big and expensive was its 512-bit external bus. This forced the physical chip and package to be much larger than the logic alone would need: you have to have enough area for all the extra pins to connect to the board. The successor GPUs cut the bus down to 256 bits, 128 bits and 64 bits as a cost saving; the 512-bit bus was overkill anyway. Now, if they implemented a second memory controller, they would be going back in the other direction in terms of pins and price.

Besides, there is no reason to do this at the card level when it could be done just as easily by allocating a gigantic chunk of system memory, wiring it down, and then pointing a range of the GART at that block. The driver could then treat this area as another region of VRAM, and no other driver component would need to worry about it. But, depending on how a driver lays out the backing stores of objects, it could just use the backing stores directly and not need this gigantic block of pseudo-VRAM at all.
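As a user-space analogy for “allocate a gigantic chunk of system memory and wire it down” (the real work happens in the kernel driver and the GART, not with these calls, but the idea is the same):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// User-space analogy only: a driver would pin pages in the kernel and
// program GART entries to point at them, but conceptually it is
// "allocate a big block and make sure it can never be paged out".
int main() {
    const std::size_t pseudo_vram_size = 512ull * 1024 * 1024;   // e.g. 512 MB

    void* block = mmap(nullptr, pseudo_vram_size,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) { std::perror("mmap"); return 1; }

    // "Wire it down": lock the pages so they stay resident, as a driver
    // must do before letting the GPU DMA into them.
    if (mlock(block, pseudo_vram_size) != 0) { std::perror("mlock"); return 1; }

    // ... a driver would now map this range through the GART and treat
    // it as extra, slower VRAM ...

    munlock(block, pseudo_vram_size);
    munmap(block, pseudo_vram_size);
    return 0;
}
```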

Usually, if you have more system memory, the driver will be able to allocate more space for the GART. That extra memory, combined with the current virtualization schemes, should handle just about as many cases as re-architecting the memory hierarchy of the GPU would.