Unified API for GPU performance counters/queries/monitors

Hello all.
It would be nice to have a unified API for performance counters in OpenGL. It would help with tuning applications and with writing cross-vendor tools for profiling OpenGL programs.
Today we have different vendor-specific extensions from AMD (AMD_performance_monitor) and Intel (INTEL_performance_query), and NVIDIA provides perf counters via NVPerfKit, which works only on Windows.
Hardware is very different, yes, but maybe we could provide a common interface (based on the existing extensions, for example), similar to ARB_texture_compression and/or ARB_get_program_binary, and query the supported counters at runtime with glGetIntegerv?

Something like this:

GLint total = 0; // glGetIntegerv writes GLint values
glGetIntegerv(GL_NUM_PERFQUERY_COUNTERS, &total);

GLint counters[total];
glGetIntegerv(GL_PERFQUERY_COUNTERS, counters);

GLuint perf;
glGenPerfQueries(1, &perf);
// opengl calls here…

// for loop on available counters array to find index of some interesting counter here…

GLenum type;
glGetPerfQueryCounterType(perf, counters[required_index], &type);

GLsizei length;
glGetPerfQueryCounterLength(perf, counters[required_index], &length);

GLuint data_uint[length];
GLfloat data_float[length];

switch (type) {
case GL_UNSIGNED_INT:
    glGetPerfQueryCounterData(perf, counters[required_index], data_uint);
    break;
case GL_FLOAT:
    glGetPerfQueryCounterData(perf, counters[required_index], data_float);
    break;
// other cases here…
}

glDeletePerfQueries(1, &perf);

// do something with data

I spent a little time studying the AMD (spec) and Intel (spec) extensions, and it seems Intel’s perfquery extension can easily be mapped onto AMD’s perfmon. I have also found that Intel supports AMD_performance_monitor in their open source Mesa driver for Linux instead of their own extension (http://www.phoronix.com/scan.php?page=news_item&px=MTUyMjQ), and it seems Qualcomm supports it too, with some additions (spec). I’m not a hardware guy, but the AMD perfmon extension seems like a good start.

Also, I hope that if a performance monitor extension (or something similar) goes to core, it will be available in OpenGL ES too (and maybe in OpenCL as well). If every vendor starts writing its own perf extension from scratch… I don’t think fragmentation is the right way to go. I hope this would help people write (or upgrade already available) OpenGL/GLES/CL-related tools, and I hope vendors would provide as much information as possible.

I completely agree that having a unified performance monitoring API could be very useful. So, you have my vote.

But, instead of reading AMD_performance_monitor, did you try to use it?
Try it and you’ll find it totally useless (at least I did two years ago). Take a look at the post. Even AMD discourages use of that extension by hiding the meaning/names of the counters. Also take a look at the status of the extension.

Did you try Intel’s extension? I didn’t, and I’m not sure whether it is supported and in which drivers. I saw it for the first time in December last year, but at that time it was not supported in the newest HD2500 drivers (if I remember correctly).

AMD doesn’t hide the counters. The counters can’t be specified because they are different even between a Radeon 7800 and a 7900.

The counters reflect hardware blocks.

Even with a standardized extension, it will take per-vendor effort to get something useful out of them.

Regardless, I agree it would be nice to have such an extension.

Why then is there no way to retrieve some meaningful names for each ID?
It is not a problem to have different counters in different hardware, but there should be a way to know what they mean.

Retrieving the names is not the same thing as retrieving the meaning.

If you program a bunch of counter-collection logic and then actually use it to adjust your workload dynamically, what happens when your application runs on a brand new driver for the first time and encounters a bunch of completely different counter names?

Specific low-level performance counters should be reacted to in a debug/design session, where you have appropriate documentation explaining what the counters mean for that specific driver. Not as a baked-in runtime query. And then you end up performance tuning different performance aspects on every driver.

For an analog, take a look at some low-level Intel CPU performance counters. Now try using them on a CPU from three years ago, or three years from now.

The ARB_texture_compression example pointed out in the OP is actually a good example of how not to do this. So you get query-able properties, like NUM_COMPRESSED_TEXTURE_FORMATS, and COMPRESSED_TEXTURE_FORMATS. Great. Other than listing them in a GPU-Info page, what do you do with that? If COMPRESSED_TEXTURE_FORMATS returns 0xDEADBEEF, do you trust that enum as a run-time compression format for your artwork? Not without reading the extension that introduces that enum, and understanding the artifacts introduced by that particular compression scheme!
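To make that concrete, here is a minimal sketch of the only safe run-time use I can think of: keep just the queried formats the application was explicitly written for and ignore the rest. The function name `pick_known_formats` and the whitelist are hypothetical; the `queried` array stands in for whatever `glGetIntegerv(GL_COMPRESSED_TEXTURE_FORMATS, ...)` would return, and the enum values are the S3TC ones from EXT_texture_compression_s3tc.

```c
#include <assert.h>
#include <stddef.h>

/* S3TC enum values as defined by EXT_texture_compression_s3tc. */
#define COMPRESSED_RGB_S3TC_DXT1  0x83F0u
#define COMPRESSED_RGBA_S3TC_DXT5 0x83F3u

/* Hypothetical whitelist: formats whose artifacts the art pipeline was checked against. */
static const unsigned int known_formats[] = {
    COMPRESSED_RGB_S3TC_DXT1,
    COMPRESSED_RGBA_S3TC_DXT5,
};

/* Copies into `out` only the queried formats that appear in the whitelist,
   preserving their order. `out` must have room for `count` entries.
   Returns the number of formats kept. */
size_t pick_known_formats(const unsigned int *queried, size_t count,
                          unsigned int *out)
{
    assert(queried != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        for (size_t k = 0; k < sizeof known_formats / sizeof known_formats[0]; ++k) {
            if (queried[i] == known_formats[k]) {
                out[kept++] = queried[i];
                break;
            }
        }
    }
    return kept;
}
```

An unknown value such as 0xDEADBEEF simply falls through the filter, which is the point: the queried list by itself tells the application nothing it can act on.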

Isn’t that the case with OpenGL in general? :wink:

But seriously, it takes some effort to figure out which counters are exposed in hardware and if a meaningful mapping between vendors is possible so an extension can provide a standardized enum to identify the counter.

Why is that? I can understand that a 7800 may expose fewer counters, but different counters? Could you be more specific as to what “different” means? Also, what is the difference between the 7000-series counter VSBusy and the universally supported counter ShaderBusyVS?

Intuitively I’d say the semantics of some counters, and the very existence of those counters, are so obvious and essential that they don’t (or shouldn’t) change, independent of the chipset and vendor. For instance, every single vendor wants to provide a counter indicating how much time vertex shading takes, or how busy the ALU or the tex units are, or how many cache hits/misses you got. Why wouldn’t it be possible for vendors to simply agree on calling the corresponding counters VS_BUSY, ALU_BUSY, CACHE_MISSES and so on … ? How they handle such names internally is a completely different matter, of course.

There are so many common concepts, it shouldn’t be a problem to at least come up with counters for the most common subset.

The temporal argument isn’t really valid, IMHO, since we’d start with hardware that is definitely exposing some counters. For instance, GPUPerfAPI returns a status value indicating whether the counter in question is available. There is no reason OpenGL couldn’t provide such an API once a uniquely identifiable subset of counters has been established. It’s the application developer’s responsibility to keep up with the GPU features when supporting some means of performance measurement. However, I’d much rather learn about a load of new counters added in version (n + 1) relative to version n than go through three different docs to look up counters for Intel, AMD and NVIDIA. It’s simply a huge pain in the rear.
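A status-returning lookup of that kind could look roughly like the sketch below. Everything in it is hypothetical: the status enum, the `find_counter` function and the `exposed` table are made up for illustration and are not taken from GPUPerfAPI or any GL extension. The table stands in for whatever a given driver reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for a lookup by standardized counter name. */
typedef enum { COUNTER_OK, COUNTER_NOT_AVAILABLE } counter_status;

/* Hypothetical table of what one particular driver exposes. */
typedef struct {
    const char *name;
    unsigned id;
} counter_entry;

static const counter_entry exposed[] = {
    { "VS_BUSY",  0 },
    { "ALU_BUSY", 1 },
};

/* Looks up a counter by its agreed-upon name. On success stores the
   driver-specific id in *id_out; otherwise leaves *id_out untouched. */
counter_status find_counter(const char *name, unsigned *id_out)
{
    assert(name != NULL && id_out != NULL);
    for (size_t i = 0; i < sizeof exposed / sizeof exposed[0]; ++i) {
        if (strcmp(exposed[i].name, name) == 0) {
            *id_out = exposed[i].id;
            return COUNTER_OK;
        }
    }
    return COUNTER_NOT_AVAILABLE;
}
```

A tool would call `find_counter("CACHE_MISSES", &id)` and just skip that metric when it gets `COUNTER_NOT_AVAILABLE`, instead of failing on hardware that lacks the counter.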

This is my two cents on performance counters:
1. Performance counters should be used by tools, not end applications, for the purpose of optimizing an application.
2. Hardware varies a great deal; which counters are available and what exactly they count really depends on the hardware. Some hardware has a unified shader architecture, other hardware does not. Some hardware might implement certain fixed-function bits in dedicated hardware, while other hardware might tack them onto the end of shaders. And so on.

To that end, a performance query API should expose:
- For each counter, its type (int, float, etc.)
- A name for each counter, serving as a short description
- A long description that tries to explain what the counter is counting

In that light the Intel extension does the above. Also, there are patches in flight (i.e. not accepted yet) adding the feature to Mesa and then quite likely i965, the Intel DRI Mesa driver.
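As a rough illustration of the three pieces of metadata listed above, a tool-facing record could be as small as the following. This is only a sketch: `counter_info`, `counter_type` and `describe_counter` are hypothetical names and do not come from the Intel or AMD extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical set of value types a counter may report. */
typedef enum { CT_UINT32, CT_UINT64, CT_FLOAT } counter_type;

/* Hypothetical per-counter metadata: type, short name, long description. */
typedef struct {
    counter_type type;
    const char *name;
    const char *description;
} counter_info;

static const char *type_name(counter_type t)
{
    switch (t) {
    case CT_UINT32: return "uint32";
    case CT_UINT64: return "uint64";
    case CT_FLOAT:  return "float";
    }
    return "unknown";
}

/* Formats "NAME (type): description" into buf, the way a profiling tool
   might list counters. Returns what snprintf returns. */
int describe_counter(const counter_info *c, char *buf, size_t size)
{
    assert(c != NULL && buf != NULL);
    return snprintf(buf, size, "%s (%s): %s",
                    c->name, type_name(c->type), c->description);
}
```

With this much information a tool can display any counter a driver reports without knowing it in advance; understanding what to do about the numbers is still up to the person reading them.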

What I’m looking for is simply a counter that can tell me the time between two frames.

Maybe have a millisecond, ?microsecond? and ?nanosecond? timer and a high precision timer.
(Not sure if the above should be separate timers or consolidated into fewer timers.)
And a timer for very long times (hours to years), e.g. for off-screen rendering that is not real-time and can take a long time.
(Programmatically, of course; the hardware can be the same.)
With different precision.
Use the type that is natural for the counter. Conversion to other types can be done by type conversion.
Uniform, same for all vendors, simple to use.
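For what it’s worth, timer queries (ARB_timer_query, core since OpenGL 3.3) already report GPU timestamps as 64-bit values in nanoseconds, so one natural type plus plain conversion may be enough. Below is a small sketch of the arithmetic only; the helper names are hypothetical, and the two inputs are assumed to be timestamps read back with glGetQueryObjectui64v after glQueryCounter(id, GL_TIMESTAMP) on consecutive frames.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds elapsed between two timestamps taken on consecutive frames.
   Assumes the counter did not wrap between the two samples. */
uint64_t frame_delta_ns(uint64_t prev_ns, uint64_t cur_ns)
{
    assert(cur_ns >= prev_ns);
    return cur_ns - prev_ns;
}

/* Whole microseconds, truncating the remainder. */
uint64_t ns_to_us(uint64_t ns)
{
    return ns / 1000u;
}

/* Milliseconds as a floating-point value, keeping the fraction. */
double ns_to_ms(uint64_t ns)
{
    return (double)ns / 1e6;
}
```

A full 64-bit nanosecond counter covers several centuries, which would also handle the long off-screen case, although an implementation may provide fewer valid bits (queryable via GL_QUERY_COUNTER_BITS).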