OpenGL and how the driver works?

imported_cippyboy · August 22, 2013, 7:31am

I have been a graphics programmer for quite some time now (OpenGL and DirectX) but I have not quite understood some of the most intricate details of rendering or their implementation in hardware/drivers.

1) My first question, to whoever knows how a driver works or should work, is about draw call consistency. I realize it exists; I’m just curious whether the official specs talk about it, because I haven’t read anything of the sort anywhere. So what do I mean? Let’s take a draw call: I send vertices, they become pixels, but let’s say I have 1000 streams in hardware and the last 50 pixels are being processed. That obviously means 50 parallel threads with 950 idling because there are no more pixels, right? Does that mean that the GPU waits until all execution commands from a draw call finish before beginning a new draw call? Or can it start a new draw call and process new vertices, even pixels, before the last one is entirely finished? Those 950 streams could (theoretically) do even 2 small draw calls before the last one is finished if the pixel shader of the previous one is complex enough. If it does wait, however, that could explain why sending more data in one batch (with newer hardware) is faster than sending a dozen smaller batches.
2) My second question actually comes a bit before this: do pixel shaders get invoked only after all vertices are processed? Let’s say we again have 1000 streams and 48 vertices going down the pipe, so 952 streams are idling. Can the 952 streams process pixels from the previous vertices, or do they all wait for all vertices to be processed? The pipeline description says that all stages are in order, but I haven’t seen it say that pixels should not be processed right after rasterization is complete.
3) The case for new hardware architectures: if we already have unified architectures that can do any type of computation in parallel, does current-gen hardware (HD7XXX and GTX 7XX) still have fixed hardware dedicated to, say, rasterization? Or alpha testing, or logical operations on a framebuffer? For example, when we had DirectX10-level hardware and everyone was saying they had unified architectures, I would’ve assumed that going to DirectX11/GL4 features would not imply needing new hardware. Why? Well, you could implement tessellation as a shader stage just like the other 3 stages; in effect it doesn’t actually require new processor operations. I know DX11 also introduced bit shifting and some other things, but I don’t see how tessellation needed new hardware in a unified architecture.
4) Do drivers work in server/client mode or just user/kernel/device mode? I think the second option is true, but I can’t be 100% sure. I initially thought it’s the first one due to GL specs talking about “client” and “server”. What I imagined was that right after my GPU boots there is something like an operating system on it, technically a second computer, running one program, the driver; when I send commands to the GPU it would be just like PC networking, sending messages into a socket and getting them out at the other end, doing the processing and sending me back the results. I couldn’t have imagined true parallelism happening any other way. Then I read some driver code from Mesa and saw that there’s actually a ton of CPU code in the driver that doesn’t look like it’s dealing with sockets, and then some DirectX driver API where they even standardized GPU command buffers.

What I really wanted with 4) a few years back was to know when or if a GPU command was finished. I now realize there are synchronization APIs in DirectX10+/GL3+ that deal with that.
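On the OpenGL side, this kind of synchronization is exposed as sync objects (glFenceSync and glClientWaitSync, core since GL 3.2). The following is only a toy Python model of the idea, not driver code: commands go into a queue, a fence is just another queue entry, and it becomes signaled once everything submitted before it has executed. `ToyQueue`, `Fence` and all method names are made up for illustration.

```python
from collections import deque


class Fence:
    """Marker placed in the command stream; signaled once the 'GPU' reaches it."""

    def __init__(self):
        self.signaled = False


class ToyQueue:
    """Hypothetical in-order command queue standing in for driver + GPU."""

    def __init__(self):
        self.pending = deque()
        self.executed = []

    def submit(self, name):
        # Returns immediately; the work has only been queued, not done.
        self.pending.append(name)

    def insert_fence(self):
        # Loosely analogous to glFenceSync: the fence sits behind all
        # previously submitted commands.
        fence = Fence()
        self.pending.append(fence)
        return fence

    def step(self):
        # The "GPU" consumes one queue entry per step.
        if not self.pending:
            return
        item = self.pending.popleft()
        if isinstance(item, Fence):
            item.signaled = True
        else:
            self.executed.append(item)

    def wait(self, fence):
        # Loosely analogous to glClientWaitSync: keep the "GPU" running
        # until the fence is signaled.
        while not fence.signaled and self.pending:
            self.step()
        return fence.signaled


queue = ToyQueue()
queue.submit("draw A")
queue.submit("draw B")
fence = queue.insert_fence()
queue.submit("draw C")

still_pending = not fence.signaled   # nothing has executed yet
queue.wait(fence)
done_before_fence = list(queue.executed)
```

After `wait` returns, "draw A" and "draw B" have executed, while "draw C", submitted after the fence, is still sitting in the queue.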

Alfonse_Reinheart · August 22, 2013, 8:22am

As for #1 and #2, the OpenGL spec neither knows nor cares about these implementation details. All OpenGL says is that the visual results you get will be equivalent to what you would get if everything processed exactly in the order you provide the commands to the GL. How the implementation achieves this is irrelevant.

However, it is not unreasonable to assume that drivers and GPUs are not stupid. If there are processing resources available and there is work to be done, it is reasonable to assume that drivers will attempt to allocate these resources to that work if it is at all possible. That is the point of the unified shader architecture, after all. GPUs have various means to ensure that everything comes out the other end in order, so they engage in quite a bit of fudging during the processing of various rendering calls.

Having shader execution units idling doesn’t sell hardware. So you can expect that GPUs will do whatever possible to ensure that available work is done on these. Your responsibility is to provide that work so that it can pass it off to the appropriate processing elements as needed.
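A toy illustration of the “as if in order” guarantee, written in Python. It only models the observable behaviour the spec asks for: work items may finish in any order, but their results become visible strictly in submission order. It says nothing about how a particular GPU schedules work; `run_out_of_order` and the durations are invented for the example.

```python
def run_out_of_order(durations):
    """durations[i] is the (made-up) time work item i needs.

    All items start at once on separate execution units, so shorter items
    finish first. Results are only committed in submission order.
    """
    # Order in which the items actually finish executing.
    finish_order = sorted(range(len(durations)), key=lambda i: durations[i])

    finished = set()      # done executing, result not yet visible
    committed = []        # order in which results become visible
    next_to_commit = 0

    for item in finish_order:
        finished.add(item)
        # Commit as many consecutive items as are ready, oldest first.
        while next_to_commit in finished:
            committed.append(next_to_commit)
            next_to_commit += 1

    return finish_order, committed


finish_order, committed = run_out_of_order([3, 1, 6, 2])
```

Here item 1 finishes before item 0, yet nothing is committed until item 0 is done, so the visible order is still 0, 1, 2, 3.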

[QUOTE]If we already have unified architectures that can do any type of computation in parallel
[/QUOTE]

That’s not what “unified architectures” means. A unified shader architecture simply meant that there aren’t dedicated vertex or fragment processing units anymore. There are simply shader processors, which can be allocated “dynamically” as needed to any of the available shader processing stages, based on the current workload.

The processing stage still has to exist in the hardware in order for shaders to be allocated to it; you can’t just make up new pipeline stages without hardware changes.

A unified shader architecture does not mean that the entire pipeline is handled via shaders.

[QUOTE]still have fixed hardware dedicated to, say, rasterization? Or alpha testing, or logical operations on a framebuffer?
[/QUOTE]

That depends on the various hardware. But it’s safe to assume that rasterization is still fixed function, as are logic ops.

[QUOTE]I initially thought it’s the first one due to GL specs talking about “client” and “server”
[/QUOTE]

The specification is quite clear about these terms, as it stops to define them. And they don’t (necessarily) have anything to do with networking:

imported_cippyboy · August 22, 2013, 9:32pm

Thanks for the snappy response and insightful answers; however, this part bothers me a bit. You’re basically saying there are dedicated hardware bits for each shader stage, so for DirectX11-level hardware, if I’m not using geometry or tessellation shaders I’m taking a small (but not zero) toll on performance since data still has to go (unchanged) through those units?

I originally thought that the shader stages were just stages controlled by the driver, which takes all the data and sends work commands for vertices/triangles/pixels to the shader processors. Going from that to tessellation would be like telling it to do some intermediate work in between the geometry and pixel stages, and if you don’t use that stage, then everything is exactly like it was on DX9/10-level hardware. If the hardware does dictate the shader stages, then not using a shader stage implies either a performance hit, since that part of the pipeline won’t be utilized and will have to memcpy the data through, or that the hardware bits that control that stage will just idle?

Alfonse_Reinheart · August 22, 2013, 10:46pm

[QUOTE]You’re basically saying there are dedicated hardware bits for each shader stage, so for DirectX11-level hardware, if I’m not using geometry or tessellation shaders I’m taking a small (but not zero) toll on performance since data still has to go (unchanged) through those units?
[/QUOTE]

Even if the hardware had to pass triangles through null tessellation and geometry stages, your putting a shader there would not make it faster. So any “toll on performance” that you might have is going to be there as a function of the hardware’s design; making an explicit passthrough shader would not remove it.

Which is exactly why hardware doesn’t do that. If you don’t use a tessellation evaluation shader, your primitives don’t go through tessellation. If you don’t use the geometry shader, your primitives aren’t processed by it.

The optional stages are optional; that doesn’t stop them from being explicit, discrete pieces of hardware. The processing elements themselves aren’t, but the hardware built around those stages very much is.

Geometry shaders are attached to a primitive assembly unit; that’s what converts the stream of vertices provided into primitives. The GS is also hooked up to some form of buffer, where the output vertices and primitive data go to be processed by other hardware.

Not to mention that the tessellation primitive generator is entirely fixed function. The tessellation control shader feeds data directly to the primitive generator.

GClements · August 23, 2013, 3:31am

With X11, it may literally be client and server communicating via TCP/IP networking. The way that X works is that the X server has exclusive access to the video hardware (as well as the mouse and keyboard). Clients (i.e. GUI applications) connect to the X server and send it requests (e.g. to create, destroy or manipulate windows, draw on windows, etc), and the X server sends back requested information, error messages, and events.

This is the environment for which OpenGL was originally designed (SGI made Unix-based workstations). OpenGL is implemented as an X extension (GLX), i.e. as a set of additional commands which can be sent to an X server implementing the GLX extension. This is why OpenGL commands don’t return status codes, why glFlush() and glFinish() exist, etc. Most implementations also support “direct rendering” as an optimisation for the case where the client happens to be running on the same system as the X server to which it connects. This allows the client to perform most operations by talking directly to the video driver without having to go through the X server.

This design turned out to be useful even on other systems (e.g. Windows). Although the PCI(e) bus is much faster (higher bandwidth, lower latency) than a network connection or even a local socket, modern video hardware is so fast that even the local bus can be a bottleneck, and the need for immunity from network latency resulted in an API which is inherently suited to pipelining, which in turn facilitates large-scale concurrency.

So while Windows systems don’t implement OpenGL using a literal “server” process, the API is such that they could do so. E.g. you can’t get pointers to internal data, any access to client memory occurs at well-defined points (typically, any function which accepts a pointer reads the data before it returns, except for client-side vertex arrays which are read before the draw call returns), and the buffer-mapping protocol is designed not to require hardware-level (MMU-like) mapping. So considering that the client and server might be separated by a network connection sometimes helps clarify how certain commands will interact.
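The “client memory is read at well-defined points” rule can be sketched with a toy Python model. This is an illustration only: `ToyServer` and its methods are hypothetical and merely mimic the behaviour of a call such as glBufferData, which has finished reading the client's data by the time it returns, even though the command itself may execute later.

```python
class ToyServer:
    """Hypothetical GL 'server' that buffers commands before executing them."""

    def __init__(self):
        self.commands = []   # queued, not yet executed
        self.buffers = {}    # server-side storage

    def buffer_data(self, name, client_data):
        # Snapshot the client's memory at call time. Once this returns,
        # the client is free to modify or release its own copy.
        self.commands.append(("buffer_data", name, bytes(client_data)))

    def flush(self):
        # Execute everything queued so far, in submission order.
        for op, name, payload in self.commands:
            if op == "buffer_data":
                self.buffers[name] = payload
        self.commands.clear()


client_memory = bytearray(b"\x01\x02\x03\x04")
server = ToyServer()
server.buffer_data("vbo", client_memory)

client_memory[0] = 0xFF      # client scribbles over its array afterwards
server.flush()               # the command only executes now
stored = server.buffers["vbo"]
```

Because the data was copied when the call was made, the later write to `client_memory` does not affect what the server stores. That is exactly the property that lets the two sides sit on opposite ends of a network connection, or of a command buffer.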

Alfonse_Reinheart · August 23, 2013, 3:37am

[QUOTE]why glFlush() and glFinish() exist
[/QUOTE]

Those exist for more reasons than just supporting networking the API.

kRogue · August 24, 2013, 5:19pm

Looking over the documents of times of old, GLX was really an afterthought in terms of GL. (Also, OpenGL was the successor of a proprietary SGI 3D API; I think the name was IrisGL.)

I strongly suspect that the reason many GL calls did not return status codes in the past was not X or network transparency, but buffering and pipelining.

Additionally, running the GL server on a separate machine from the GL client is something of the past. To give an idea of why, here are some reasons:
[ol]
[li]Buffer objects. Before buffer objects, vertex data was transmitted over the wire at each call, allowing for conversion between the endianness of the client and the server at each draw call. Since buffer objects are raw bytes, the required conversion cannot be known until draw time. There are workarounds, admittedly: insist that client and server have exactly the same endianness, or have the GL implementation detect the endianness of the client and have the GPU itself handle data in a different endianness.[/li]
[li]The vast majority of core GL calls do not have a GLX protocol… there are unofficial bits from NVIDIA for many API points, but they are just that, unofficial, AND they do not by any stretch of the imagination cover all the GL calls of the GL core profile.[/li]
[/ol]
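The endianness problem with buffer objects can be shown in a few lines of Python using the standard `struct` module. This is just a sketch of the byte-order issue itself, with made-up values; it does not model GLX.

```python
import struct

values = [1, 256, 0x12345678]

# Typed transfer: both sides know these are three 32-bit unsigned ints,
# so the receiver can decode (and byte-swap) element by element.
big_endian_wire = struct.pack(">3I", *values)
decoded_typed = list(struct.unpack(">3I", big_endian_wire))

# Buffer object: the server only sees 12 raw bytes from a big-endian client.
raw = struct.pack(">3I", *values)

# A little-endian server that reads them without swapping gets garbage.
misread = list(struct.unpack("<3I", raw))

# It cannot swap blindly either: the same 12 bytes could just as well be
# six 16-bit values, which would need a different swap pattern.
as_shorts = list(struct.unpack(">6H", raw))
```

`decoded_typed` matches the original values, `misread` does not, and nothing in `raw` itself says whether the 32-bit or the 16-bit interpretation is the intended one. That information only arrives with the draw call that uses the buffer.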

If there is one thing I wish I could do, it would be to utterly eliminate the incorrect notion that X’s network transparency is a good idea.

Dark_Photon · August 25, 2013, 10:13am

If you don’t use it, I can see where you might say that. If you do (as I do), it’s “very” useful. It blows away the mentality that to do something graphical on a machine (run GUIs, etc.) you need to go to that box. Desktop mirroring + virtual machines is a lame attempt to give you this capability which X has had for decades.

That said, it’s less useful for OpenGL, for the reasons described.

GClements · August 25, 2013, 5:43pm

Not here it isn’t. Most of my Linux systems don’t have monitors. Those which do have low-end integrated graphics which are suitable for displaying the BIOS/boot screens and not much else.

Strictly speaking, this isn’t specific to GLX. The same issues would apply to using a graphics card in a system whose CPU has a different byte order to the GPU. OTOH, with both Intel and ARM being little-endian, I doubt that the IHVs are particularly concerned about the other 0.1% of the market.

Any command which can be put into a display list is sent as part of an X_GLXRender command; the only specification required is the opcode.

Yes. Those of us who actually use indirect GLX have noticed. Part of it is that the X developers would rather have consensus, and the other (possibly larger part) is that OpenGL has been in such a state of flux since 2.0 that there’s not much point putting a lot of effort into something which may become obsolete before it ever gets used.

Also, bear in mind that “official” doesn’t necessarily mean “Khronos”. GLX is the binding between OpenGL and X, and it’s as much X’s domain as it is OpenGL’s. The only real difference between GLX and WGL/AGL is that the latter are local to an individual system, so there are no interoperability issues beyond the API/ABI.

I wish that people could understand the distinction between not personally having a use for something and it being useless. For me, it’s a killer feature; if it hadn’t existed from the start, someone would have invented it.

FWIW, I have trouble understanding why there seems so little interest in exploiting one of the features which really sets OpenGL apart from DirectX.

GClements · August 25, 2013, 5:48pm

It’s less useful for OpenGL only because the wire protocol hasn’t kept up to date with recent progress. Shaders are there, but buffers seem to be a sticking point.

Alfonse_Reinheart · August 25, 2013, 8:13pm

[QUOTE]Strictly speaking, this isn’t specific to GLX. The same issues would apply to using a graphics card in a system whose CPU has a different byte order to the GPU.
[/QUOTE]

Actually no. The OpenGL standard requires that, if the client writes a string of bytes as a “GLuint”, then the server must interpret those bytes as a proper “GLuint”. So whatever bit fiddling that the server needs to do must be built into whatever processes the server uses to read that memory.

[QUOTE]FWIW, I have trouble understanding why there seems so little interest in exploiting one of the features which really sets OpenGL apart from DirectX.
[/QUOTE]

Because:

1: It requires having more than one computer.

2: Doing so requires being Linux-only.

3: It relies on the asymmetric computing situation, where your local terminal is weak and a central server has all the processing power. This situation becomes less valid every day. Between GLES 3.0-capable smart phones and Intel’s 4.1-class integrated GPUs, the chance of not being able to execute OpenGL code locally is very low.

It’s very difficult to exploit this feature unless it’s explicitly part of your application’s design requirements. It may differentiate OpenGL from Direct3D, but it’s such a niche thing that very few people ever have a bona fide need for it. It’s nice for when you need to do it, but you can’t say that it’s a pressing need for most OpenGL users.

GClements · August 26, 2013, 4:26am

I don’t really see your point. If the GPU can be made to use either byte order, then the X server can tell it to use the (X) client’s byte order rather than the server’s byte order. If the GPU’s byte order is hard-coded, then a driver for a big-endian system with a little-endian GPU would need to twiddle the buffer contents based upon the commands which use the buffer.

That’s the case for practically anything beyond “home” use.

I regularly run an X server on Windows systems.

The example of smart phones is one where it’s useful. The local terminal has decent graphics capability (where the server may have none) but limited CPU, memory and storage capacity, making it a reasonable “terminal” for a back-end system but not so good as a stand-alone system.

It’s trivial to exploit this feature. Every X11 GUI application automatically has the ability to be run remotely. Well, except for ones which rely upon OpenGL 3 support, although it’s not just the lack of GLX wire protocol which makes such reliance problematic at present.

It’s useful enough that there is no shortage of attempts to retrofit similar functionality onto other platforms.

mbentrup · August 26, 2013, 5:13am

[QUOTE=GClements;1254338]I don’t really see your point. If the GPU can be made to use either byte order, then the X server can tell it to use the (X) client’s byte order rather than the server’s byte order. If the GPU’s byte order is hard-coded, then a driver for a big-endian system with a little-endian GPU would need to twiddle the buffer contents based upon the commands which use the buffer.
[/QUOTE]

Well, indirect GLX allows the sharing of buffer objects between different clients, so “the” client byte order may be ambiguous. The way GLX handles this is that, when the client’s byte order differs from the server’s, it doesn’t allow the creation of any GL context that supports buffer objects unless the client explicitly opts in to the different byte order semantics (via the GLX_CONTEXT_ALLOW_BUFFER_BYTE_ORDER_MISMATCH_ARB attribute), and then the client (i.e. your application) is responsible for filling the buffer in the server byte order.

thokra · August 26, 2013, 5:20am

That just feels wrong. Seriously though, you probably are among a select few there.

[QUOTE]The example of smart phones is one where it’s useful. The local terminal has decent graphics capability (where the server may have none) but limited CPU, memory and storage capacity, making it a reasonable “terminal” for a back-end system but not so good as a stand-alone system.
[/QUOTE]

Am I getting this right? Do you suggest offloading rendering to your smart phone over the network is a reasonable use-case for supporting such capabilities?

[QUOTE]Every X11 GUI application automatically has the ability to be run remotely. Well, except for ones which rely upon OpenGL 3 support, although it’s not just the lack of GLX wire protocol which makes such reliance problematic at present.
[/QUOTE]

At least on Linux distributions that go down that path, as soon as X is dropped in favor of Wayland or Wayland-like architectures, remote rendering isn’t available anymore. At least not with vanilla Wayland. You can layer stuff on top of Wayland but in general the capability is gone.

kRogue · August 26, 2013, 2:22pm

I’d like to make my (usual) case for why/how I think the entire remote rendering jazz of X is borderline useless. Here goes: in times past, the idea was that the terminal (the thing that did the displaying) had a very poor CPU and could only really be used for displaying stuff. This idea made perfect sense ages ago.

Then X came along, and now that terminal needs to run an XServer. The powerful remote machine would then send the drawing commands over the wire for the terminal to display. To be honest, this sounds kind of neat and in decades past it was not a bad idea.

Now enter OpenGL: that means the terminal needs to have a good GPU to render stuff at a reasonable speed. If a box has a good GPU, it likely has a reasonable CPU. I suppose there are severe corner cases where some super-hefty CPU box is doing lots of calculations, the terminal needs to visualize the data, and the visualization does not require sending oodles of data. Seems to me like a rare corner case.

It gets worse; implementing a good XServer driver system is pain, severe pain. OpenGL remote rendering is very touch and go anyway; it can be tricky to set up, and there are limits on what one can expect to work well… can you imagine how poorly something like glMapBuffer is going to work? It is hideous. X imposes a very severe implementation burden, and the benefits of that burden are rarely used; more often than not, when remote rendering really is used, bad things and bad surprises happen.

Even ignoring the OpenGL thing, most UI tool kits usually do NOT want to use X to draw. Qt prefers to draw everything itself (it does have an X-backend which is labeled as native and it performs horribly when compared to raster). Similar story with Cairo, GDK, and on and on.

When X dies, it will likely be a very, very good thing for the Linux desktop. To give an idea of how bad X really is, watch this talk, in which the speaker, who was a major contributor to X, essentially says after a while that X is not working:
http://www.youtube.com/watch?v=RIctzAQOe44

Fast forward to 18:45… bit of a shocker.

imported_kyle · August 26, 2013, 2:34pm

[QUOTE=thokra;1254341]That just feels wrong. Seriously though, you probably are among a select few there.
[/QUOTE]

We are many :). I do that as well. It’s actually pretty handy (well, the X part, not GLX). But then again, it’s primarily used for some pretty obscure stuff.

GClements · August 27, 2013, 2:57am

Hummingbird (since acquired by OpenText) basically built their business on eXceed (a commercial X server for Windows), so it can’t be that rare.

If you’re going to use a smartphone or tablet as a terminal, using X avoids having to construct a separate client for each platform for each application.

That would be Ubuntu. Everyone else seems to view Wayland/Mir as an API for the X server to communicate with device drivers.

Alfonse_Reinheart · August 27, 2013, 3:17am

[QUOTE]If you’re going to use a smartphone or tablet as a terminal, using X avoids having to construct a separate client for each platform for each application.
[/QUOTE]

Let’s look at the evolution of, well, all computing.

In the earliest days, computers were gigantic. But they were kinda useful. So people found a way to make these large, centralized computers which could be used by multiple people. Thus, the smart server/dumb terminal paradigm came to be. Time passes and computers get a lot smaller. Personal computers made dumb terminals… effectively obsolete. They’re still used in places, but it is exceedingly rare. Even when you’re networking to a smart server, you’re generally using it from a smart terminal.

In the earliest days, the web was very much server-only. The server coughed up HTML. Someone invented PHP scripts that allowed server-side mucking with the HTML. Again, you have smart server/dumb terminal, just with the web browser as the dumb one. Fast-forward to… today. Sure, PHP scripts still exist, but client-side scripting via JavaScript is all the rage. You can’t effectively navigate half the web without JavaScript on.

In every case, we started with dumb terminals, then slowly traded them up for smart ones. That is the nature of computing: client-side wins in the long term. And the same is true for OpenGL: client-side won. There are numerous features of modern OpenGL that only improve performance if everything is running on the same machine. Mapping buffers for example would absolutely murder performance for a networked renderer compared to even a much slower client-side GPU.

That doesn’t mean that some people can’t find uses for it. But it’s very much a niche application, so niche that the ARB is spending precious little time keeping the protocol up-to-date.

[QUOTE]If you’re going to use a smartphone or tablet as a terminal, using X avoids having to construct a separate client for each platform for each application.
[/QUOTE]

Or you could make your application completely independent of a network, and therefore more useable and reliable. No network hiccup or going through a tunnel or whatever can interrupt your client-side application. Not to mention faster in many cases. Smart Phones may not have the best GPUs, but they’re reasonably serviceable for most needs.

Also, using X does nothing for being able to write a platform-independent client. Sure, your rendering code may be independent, but that would be no less true than if you were using straight OpenGL ES. You still need the platform-specific setup work; even initializing an application that will use X differs between the platforms. Not to mention processing input or any of the other tasks you need to do. Oh sure, minor quirks between implementations would not exist, but the majority of your porting work doesn’t deal with them anyway.

GClements · August 27, 2013, 3:19am

Not really. Dedicated server systems often don’t have any kind of GPU. It’s not that useful when the system is serving many users, none of whom are in physical proximity to the server.

[QUOTE=kRogue;1254354]It gets worse; implementing a good XServer driver system is pain, severe pain. OpenGL remote rendering is very touch and go anyway; it can be tricky to set up,
[/QUOTE]
It shouldn’t require any setup, beyond what is required for X itself and the OpenGL driver. To the driver, the X server is just another client.

To be honest, I don’t expect OpenGL with direct rendering to work well on Linux. It isn’t a high priority for the hardware vendors, the hardware is complex, and the hardware vendors historically haven’t been particularly open with technical specifications.

That depends upon how badly it’s misused. If you map an entire buffer but only read/write a portion of it, that’s going to be inefficient. It will be far more inefficient with GLX, but it’s significant in any case. Use of glMapBufferRange() with the invalidate/flush bits shouldn’t be any worse than glBufferSubData() or glGetBufferSubData() (clearly, you can’t avoid actually transferring data over the network).
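A rough back-of-the-envelope model of that argument, in Python. The cost model is an assumption made for illustration, not a measurement of any GLX implementation: a read/write mapping has to fetch the mapped range from the server and send it back, while a range mapped with GL_MAP_INVALIDATE_RANGE_BIT (plus GL_MAP_FLUSH_EXPLICIT_BIT to flush only what was written) skips the fetch. The function names and sizes are hypothetical.

```python
MIB = 1024 * 1024


def map_whole_buffer(buffer_size):
    # Read/write mapping of everything: fetch the contents, then send
    # them all back on unmap, regardless of how little was touched.
    return 2 * buffer_size


def map_range(touched, invalidate):
    # Only the mapped range travels. With invalidation the old contents
    # never need to be fetched from the server.
    download = 0 if invalidate else touched
    upload = touched
    return download + upload


def buffer_sub_data(touched):
    # The new bytes are sent once.
    return touched


buffer_size = 4 * MIB      # assumed buffer size
touched = 64 * 1024        # assumed size of the region actually updated

whole = map_whole_buffer(buffer_size)
ranged = map_range(touched, invalidate=True)
sub = buffer_sub_data(touched)
ratio = whole // ranged
```

Under these assumptions the invalidating range map moves exactly as many bytes as glBufferSubData() would, and mapping the whole buffer moves 128 times as much.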

This isn’t my experience.

All of those use X. Maybe you’re confusing “core X protocol” with XRender?

kRogue · August 27, 2013, 8:35am

[QUOTE]All of those use X. Maybe you’re confusing “core X protocol” with XRender?
[/QUOTE]

No; all of those use X to do exactly the following:

[ol]
[li]Create -one- window[/li]
[li]Poll X for events[/li]
[/ol]

All the drawing is done to a -buffer- by the toolkit. The entire “remote” rendering thing is dead. Running the program on one machine and displaying it on another usually means that the buffer (the window contents) is sent over the wire. What you have now is essentially a really crappy per-window VNC. One can claim that if GL was network happy on the XServer then the application would send the GL commands to the XServer and all would be great; but it does not happen that way. Sorry.

[QUOTE]It shouldn’t require any setup, beyond what is required for X itself and the OpenGL driver. To the driver, the X server is just another client.
[/QUOTE]

OpenGL resides on the XServer. The OpenGL implementation is then required to be able to take commands from a remote device (the client). OpenGL itself together with GLX are part of the X-driver often enough. Pretending that it will just work is putting one’s head in the sand; it requires heroic efforts to make a GL implementation take commands from a remote source. Compounding the pain is that many GL features do not even really make sense in this case; my favorite one being glMapBuffer, but there are others.

[QUOTE]To be honest, I don’t expect OpenGL with direct rendering to work well on Linux. It isn’t a high priority for the hardware vendors, the hardware is complex, and the hardware vendors historically haven’t been particularly open with technical specifications.
[/QUOTE]

Huh?!! AMD has released the specs for its GPUs (outside of video decode); Intel’s GL driver for Linux is entirely open source. Let’s take a real look at why it is not there: the effort to make remote rendering just work is borderline heroic. The underlying framework (DRI2) does not work over a network.

Regardless, this proves my point: remote rendering is such a rarely used/wanted feature that it is not really implemented. If there were commercial demand, it would be. Therefore the only ones wanting it are, well, no offense, borderline Slashdot trolls.

Please everyone who thinks X is network transparent and great, take the hour to watch that video (or stop when he talks about how great Wayland is); that video will wake you up to the reality: X should die.