Better GPU program execution handling

imported_Gedolo2 · November 18, 2013, 9:32am

Better GPU program execution handling.

Currently OpenGL still works with draw calls.
It is as if the GPU can’t store a whole program and needs to be fed high-level functions one call at a time.
This can keep performance low and introduces an artificial barrier for complex operations.

Instead of sending calls to the GPU one by one, I propose the following model:
1 shaders get compiled for GPU
2 program or pieces of program get handed to GPU.
3 CPU says what part of program on GPU to execute and where to wait for synchronization instead of doing this per call.
4 GPU executes pieces of program

In this model both the CPU and the GPU have the program that each needs to execute in their own memory, ready to go.
The CPU does not say which draw call to execute but which part of a program to execute; a part can be a bunch of draw calls.
The GPU executes the parts and then notifies the CPU when it has finished and can do the next part.
Maybe for optimal performance the GPU could keep a table keyed by synchronization point ID, listing what to execute at that point; there could be multiple parts per synchronization point. Each part would contain a program execution starting point, a program execution length, and the number of times it needs to be looped. Each processor would also need a table of which other processor(s) to notify after each part is done, for each synchronization point.
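
A minimal sketch of the table described above, written as a plain Python simulation rather than against any real driver API. The names `Part`, `SyncEntry` and `run_sync_point`, and the command strings, are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    start: int      # index of the first command in the resident program
    length: int     # number of commands in this part
    loops: int = 1  # how many times the part is repeated


@dataclass
class SyncEntry:
    parts: list                                 # parts to run at this sync point
    notify: list = field(default_factory=list)  # processors to notify afterwards


def run_sync_point(program, table, sync_id, notifications):
    """Execute every part registered for sync_id, then record notifications."""
    executed = []
    entry = table[sync_id]
    for part in entry.parts:
        for _ in range(part.loops):
            executed.extend(program[part.start:part.start + part.length])
    for target in entry.notify:
        notifications.append((sync_id, target))
    return executed


# Hypothetical resident program and table (illustrative names only).
program = ["bind_shader", "draw_body", "draw_head", "blur_pass"]
table = {
    1: SyncEntry(parts=[Part(0, 3)], notify=["CPU"]),
    2: SyncEntry(parts=[Part(3, 1, loops=2)], notify=["CPU", "audio"]),
}
notifications = []
first = run_sync_point(program, table, 1, notifications)
second = run_sync_point(program, table, 2, notifications)
```

Here the CPU side only names a synchronization point ID; everything else is looked up in the table that is already resident with the program.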

This will work much better than doing things per call.
Programmers can minimize the amount of waiting and synchronization required during runtime.

Having a function that tells the driver to upload a shader to the GPU and prepare it for execution would be necessary for the best combination of performance and programmability. The default would be to do this as early as possible, just after context creation.
To be able to change the executing program on the fly, it must be possible to compile the new shaders for the GPU, upload them to the GPU, then say where in the program to check for and switch to the new shader. (The application programmer must insert a function call to mark this; if there is no such call, the driver must not allow program changes while the program is executing.)
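
One way to picture the switch-point rule is the toy simulation below. It is only a sketch: `ResidentProgram`, `upload_shader` and the `"SWITCH_POINT"` marker are hypothetical names, not part of OpenGL or any driver.

```python
class ResidentProgram:
    """Hypothetical model of a GPU-resident program with explicit switch points."""

    def __init__(self, steps, shader):
        self.steps = list(steps)
        self.shader = shader
        self.pending_shader = None  # uploaded replacement, not yet active

    def upload_shader(self, shader):
        # Compile and upload happen ahead of time; nothing changes yet.
        self.pending_shader = shader

    def run(self):
        log = []
        for step in self.steps:
            if step == "SWITCH_POINT":
                # The swap is only allowed where the application marked it.
                if self.pending_shader is not None:
                    self.shader, self.pending_shader = self.pending_shader, None
            else:
                log.append((step, self.shader))
        return log


prog = ResidentProgram(["draw_a", "SWITCH_POINT", "draw_b"], shader="v1")
before = prog.run()
prog.upload_shader("v2")
after = prog.run()

# A program without a switch point never picks up the new shader.
locked = ResidentProgram(["draw_a"], shader="v1")
locked.upload_shader("v2")
locked_run = locked.run()
```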

Of course, other APIs that use a similar old-fashioned execution model with fully capable processors need to change as well. Fixing the deficiencies with the new execution model would certainly help other processors too, such as audio processors.
It would be optimal to let the processors also tell each other when they have done their part of a program, without CPU communication in between. This way you can avoid extra synchronization delays caused by processorA > CPU > processorB and do this instead: processorA > CPU, processorB

e.g.
processorA does part 1 then notifies processorB
processorB does part 2 then notifies CPU
CPU does part 3 then notifies processor A or B to start part 4
processor A or B does part 4 then notifies CPU
CPU does part 5 and notifies processor A and B to both start their share of part 6
processor A and B do their share of part 6 and after completion both notify the CPU independently from each other
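
As a rough illustration of the saving, the example above can be encoded as data and the notification messages counted with and without relaying through the CPU. This is a sketch under assumptions: the tuple encoding is made up, and "A or B" is arbitrarily fixed to A for part 4.

```python
# Each part: (name, processors that run it, processors notified on completion).
# Hypothetical encoding of the example; "A or B" is fixed to A for part 4.
parts = [
    ("part 1", ["A"], ["B"]),
    ("part 2", ["B"], ["CPU"]),
    ("part 3", ["CPU"], ["A"]),
    ("part 4", ["A"], ["CPU"]),
    ("part 5", ["CPU"], ["A", "B"]),
    ("part 6", ["A", "B"], ["CPU"]),
]


def count_messages(parts, relay_through_cpu):
    """Count notification messages; a relayed A-to-B message costs two hops."""
    total = 0
    for _name, workers, targets in parts:
        for worker in workers:
            for target in targets:
                needs_relay = relay_through_cpu and "CPU" not in (worker, target)
                total += 2 if needs_relay else 1
    return total


direct = count_messages(parts, relay_through_cpu=False)
relayed = count_messages(parts, relay_through_cpu=True)
```

In this small example only the part 1 hand-off goes processor to processor, so the difference is a single hop; the gap grows with every extra processor-to-processor hand-off.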

mhagain · November 18, 2013, 11:47am

The GPU executes the parts and then notifies the CPU when it has finished and can do the next part.

I may be misunderstanding your intent, but this would actually be slower than the current model because it seems to me that the CPU would have to stall while waiting on the notification. Currently the CPU and GPU run asynchronously, so there’s none of this waiting; in normal operation the CPU just hands a command to the driver, the driver places it in its internal command buffer and returns immediately, and some time later the GPU comes along and picks up the command.
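
The asynchronous hand-off can be sketched with a queue and a consumer thread. This is a toy model, not how any particular driver is implemented; the `Driver` class and the command strings are assumptions for illustration.

```python
import queue
import threading


class Driver:
    """Toy stand-in for a driver: submit() only enqueues and returns."""

    def __init__(self):
        self.commands = queue.Queue()
        self.executed = []
        self.worker = threading.Thread(target=self._gpu_loop)
        self.worker.start()

    def submit(self, command):
        self.commands.put(command)  # returns without waiting for the consumer

    def _gpu_loop(self):
        # The "GPU" picks commands up some time later, in submission order.
        while True:
            command = self.commands.get()
            if command is None:  # sentinel: no more commands
                break
            self.executed.append(command)

    def finish(self):
        # Comparable in spirit to glFinish: block until everything is consumed.
        self.commands.put(None)
        self.worker.join()


driver = Driver()
for cmd in ["bind_texture", "draw", "draw"]:
    driver.submit(cmd)
driver.finish()
```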

Remove the notification element, and what you’re talking about really just boils down to a more modern take on display lists, I think.

imported_Gedolo2 · November 19, 2013, 10:05am

No, the CPU can in the meantime do other things while waiting for the notification.

If you remove the notifications, you would have to resort to polling or to knowing the GPU execution time beforehand.
Polling is inefficient and wastes resources while GPU execution time is not known in most cases.
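
A small sketch of the notification idea, using a worker thread as a stand-in for the GPU. The function `gpu_part` and the workload are made up for illustration; the point is only that the callback fires on completion while the CPU keeps working, with no polling loop.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_part(n):
    # Stand-in for a part of the program running on the GPU.
    return sum(range(n))


notified = []
cpu_work = []

with ThreadPoolExecutor(max_workers=1) as gpu:
    future = gpu.submit(gpu_part, 1000)
    # Notification: the callback runs when the part completes.
    future.add_done_callback(lambda f: notified.append(f.result()))
    # Meanwhile the CPU keeps doing unrelated work.
    for i in range(3):
        cpu_work.append(i * i)
# Leaving the with-block waits for the worker, so the callback has run by now.
```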

The important difference is that the command buffer is on the GPU and the GPU gets told what part to execute through a single command. You could see it as some kind of a super draw call.

Here the CPU hands the command to the driver, the driver converts it to a GPU program, the GPU program gets put on the GPU, and only after that does the program get executed. The GPU already has its commands when the CPU says to execute them. Furthermore, because the conversion of instructions to the specific GPU instructions can happen separately from the rendering loop, you could tell the driver to take time to optimize things. There might be other performance advantages.

kRogue · November 19, 2013, 11:03am

There are a number of issues I see with this post, let the games begin:

1 shaders get compiled for GPU

–> this is part of the GL specification already.

2 program or pieces of program get handed to GPU.
3 CPU says what part of program on GPU to execute and where to wait for synchronization instead of doing this per call.
4 GPU executes pieces of program

GPUs are heavily and deeply pipelined beasts. The more they can be handed at once, and the more uniform that work is, the better. GPUs have -very- fixed function pipeline parts for almost every stage of 3D graphics. A GPU of today needs every part of how to process vertices in context before a single vertex is processed: vertex format, vertex shader, tessellation control and evaluation shaders, geometry shader, rasterization parameters, fragment shader, blending mode, everything. Take a look at the Intel PRM at 01.org and you will see what I mean. The reason a GPU needs all of it is essentially that it has such a deep pipeline, and that deep pipeline (together with the ability to have many execution contexts active at a time) gives the GPU the ability to hide latency. Hiding latency is almost the biggest thing, because the ratio of flops to bandwidth is so freaking high on GPUs.

mhagain · November 19, 2013, 11:47am

That’s effectively just display lists though. Or even Direct3D 3 execute buffers, if you wanted to go so far.

On the one hand it is a nice programming paradigm to have available for static content; on the other hand - and in the real world - not all content is 100% static all of the time, and dynamically updating such an “execute buffer” (yes, I’m going to use this phrase) as required can be a killer of performance.

That’s why an interface like the old display list interface would be required. So instead of issuing commands immediately, the driver captures them and records them, then it can play them back (either singly or batched together) via a glCallLists command.
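
To make the record-then-replay idea concrete, here is a toy recorder in plain Python. It is a sketch only: `ListRecorder` and its methods are hypothetical and merely mimic the compile-only flavour of the old display list interface.

```python
class ListRecorder:
    """Hypothetical recorder: captures commands between begin() and end()."""

    def __init__(self):
        self.lists = {}
        self.current = None
        self.executed = []

    def begin(self, list_id):
        self.current = list_id
        self.lists[list_id] = []

    def end(self):
        self.current = None

    def command(self, name, *args):
        if self.current is not None:
            self.lists[self.current].append((name, args))  # record only
        else:
            self.executed.append((name, args))             # run immediately

    def call_lists(self, list_ids):
        # Play back several recorded lists with one call.
        for list_id in list_ids:
            self.executed.extend(self.lists[list_id])


gl = ListRecorder()
gl.begin(1)
gl.command("bind_texture", 7)
gl.command("draw", "torso")
gl.end()
gl.begin(2)
gl.command("draw", "head")
gl.end()
recorded_only = list(gl.executed)  # nothing has executed yet
gl.call_lists([1, 2])
```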

Display lists in principle were a nice idea, but the original designers introduced complexities that may not have been apparent at first but certainly reared their heads as more functionality was added. The ability to have nested/hierarchical display lists was one such complexity, another was having to track which GL commands could and which couldn’t be compiled into a display list. If you read some old GL extension specs you’ll find that they’re full of notes on how things interoperate with display lists.

I’m not saying that it’s a bad idea, but I am saying that it needs more thinking through.

imported_Gedolo2 · November 19, 2013, 2:13pm

Reducing the number of synchronizations to minimize latency in a program loop is implied.

I think that what Display Lists wanted to achieve was a very good idea, and that the problem was one of implementation and specification rather than the idea of doing better than draw calls being fundamentally wrong.
It is more a case of having done it the wrong way than of it not being possible in any way.

imported_Gedolo2 · November 19, 2013, 2:27pm

@mhagain
Please read my whole post.
How it is achieved is essentially very different.

The timing of compilation and execution of shaders is done better.
I will think a bit about how to make this work well with lots of dynamic instructions, and about something for lots of changing data. (Instructions versus data such as textures and arguments are very different beasts to handle.)

The idea is to replace the display list interface with a transparent system where the programmer does not have to specify the glCallLists command or similar in the same way.

Nesting, hierarchical stuff is already transparently supported.

The idea is to avoid the classical problems and limitations of display lists, including not being able to use some commands.
It should be something different, as transparent as possible and practical for both driver and application developers, by dividing the tasks and assigning each one to the actor that has the best information to make that decision.
In short, by doing things differently from Display Lists.

I’m not completely clear on how exactly all the details can be worked out to get something without the problems of Display Lists but that is something that should be possible to get right, or at least good enough.

l_belev · December 3, 2013, 2:49am

It would be good to have some mechanism to solve tasks such as this example:

Let’s say we need to draw a complex character that is not a single draw primitive, but consists of numerous parts that use different shaders/textures/whatever.
All the parts are always drawn in the same order. Every draw is the same except for the animated bone transforms that the CPU pours into a uniform buffer.
It would be nice to be able to compile all the character drawing steps, including any texture/render state/shader changes, into some form of command packet for the GPU.
That would greatly reduce the CPU overhead of repeatedly re-specifying all the drawing steps and their due validations every time.
All the CPU would need to do is update the bones in the uniform buffer and then tell the GPU to execute the compiled packet.
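
A minimal sketch of that workflow as a pure Python simulation. `CommandPacket`, the step tuples and the file names are assumptions for illustration; the counter only shows that validation happens once at compile time, while each frame just updates the bones and replays the packet.

```python
class CommandPacket:
    """Hypothetical pre-validated packet of draw steps for one character."""

    def __init__(self, steps):
        self.validations = 0
        self.steps = []
        for shader, texture, mesh in steps:
            self.validations += 1  # state is validated once, at compile time
            self.steps.append((shader, texture, mesh))

    def execute(self, uniform_buffer):
        # Replays the fixed steps; only the bone data differs between frames.
        bones = tuple(uniform_buffer)
        return [(shader, texture, mesh, bones)
                for shader, texture, mesh in self.steps]


packet = CommandPacket([
    ("skin", "body.png", "torso"),
    ("hair", "hair.png", "head"),
])
bones = [0.0, 0.0]             # stand-in for the uniform buffer contents
frame1 = packet.execute(bones)
bones[0] = 0.5                 # the CPU only updates the bones...
frame2 = packet.execute(bones) # ...and asks for the same packet again
```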

Maybe some updated form of display lists would do the job.