NVPerfKit Simplified Experiments Problem

Aleksandar · January 21, 2014, 12:52pm

Hi All,

Maybe this is not the most suitable place to ask such a question, but on the other hand is there any better place in the Universe to do it?

OK, let’s be more serious.
Has anyone ever tried to use NVIDIA PerfKit for profiling OpenGL applications?

I’ve been using it for some time, but only to read some hw counters. Recently I’ve started to use simplified experiments, hoping they would give me some useful hints about unit utilization and help isolate the bottleneck, but…

Something really unexpected is happening. Namely, instead of shaders as I expected, those experiments find bottlenecks in either FB or ROP.

ROP is the blending unit and handles both color blending and Z/stencil buffer operations.
The FB or frame buffer unit handles all requests for reading memory that missed any possible L1/L2 caches.

I have to mention that the application under test does not use blending (in fact it does, but it is skipped during experiments), the stencil buffer, or anything else that would raise the ROP load.

Other important facts are readings of other counters:
Shader_busy: 80%
Texture_busy: 12.5%
ROP_busy: 15.5%
GPU_busy: 99%

For the same scene simplified experiments returned:
ROP bottleneck: 68%
FB bottleneck: 13%
ROP utilization (SOL): 6%
FB utilization (SOL): 80%
(But this can vary widely. Sometimes ROP and sometimes FB is the bottleneck.)
I have also tried trivial fragment shaders (just outputting black, or even discarding the output), although only on a G80 GPU, and the FB is still the bottleneck.

Can anyone help me to interpret these readings?

It seems to me that simplified experiments don’t retrieve correct values, but on the other hand, how could such a problem persist for several years? (I’ve tried it with various drivers, from R266 to R332.)

skynet · January 22, 2014, 10:01pm

Try inserting a glFinish() right before the call to EndObject()/EndPass(). It helped me to get ‘stable’ results.

Aleksandar · January 23, 2014, 6:05am

Thank you for the hint, skynet. I do use glFinish() before EndObject().

With GF100/R332.21/3.3.2.13351 (GPU/Driver/PerfKit) the results are pretty stable (within a range of 20%), but on G80/R266/? (PerfSDK 19.07.2011.) two successive measurements can be totally different (for example: IDX Bottleneck: 0, ROP Bottleneck: 67 in the first and IDX Bottleneck: 51, ROP Bottleneck: 32 in the second consecutive measurement).

Here are the same test results on a GTX470. This still confuses me. Changing the multisampling rate changes the ROP Bottleneck value (x1 - 32, x2 - 74, x4 - 79, x8 - 91, x16 - 92), but isn’t it strange that a GTX470, with 40 ROP units, has a problem rendering an HD screen with x2 multisampling?

Btw, I cannot read the OGL driver counters on GF100/R332.21/3.3.2.13351. The counters exist, but the retrieved value is always 0.

skynet · January 23, 2014, 11:49am

I’m using the driver that is recommended for the use with the latest nSight (331.82) and PerfKit_3.2.1.13309 on a GTX660Ti.
One thing I noticed in the latest PerfKit/Driver combo is that the NVPMA_COUNTER_DISPLAY hint now seems to make sense for all/most counters.

I also recommend using NvAPI to set EXPORT_PERF_COUNTERS_ID to EXPORT_PERF_COUNTERS_ON.

Aleksandar · January 27, 2014, 6:28am

I’m sorry for this delay in answering!

Thank you for another tip! I haven’t noticed any difference when changing the EXPORT_PERF_COUNTERS_ID value.

I have also noticed that this driver (332.21) always reports:

OGLE: Category: 0x00001000, MessageID: 0x008C0004
The current texture related state is legal, but unexpected: Waste
of memory: Texture 2 has mipmaps, while its min filter is
inconsistent with mipmaps.

for all texture units in use. And I’m pretty sure this is not true. The same messages appear for various projects when GLExpert dump is enabled.
Also, whether or not the instrumentation is enabled, some counters can be read (for example: shader_busy, texture_busy, rop_busy, gpu_busy), while simplified experiments don’t work without it (as expected).
But this is probably a topic for the drivers section.

Anyway, two problems remain:

GL counters don’t work,
ROP bottleneck (as reported by simplified experiments) is not related to GPU load and rendering time.

Could you report values retrieved by simplified experiments for any of your applications? Is ROP Bottleneck high?

skynet · January 27, 2014, 10:56am

I seem to get quite good numbers.
Sometimes sampling counters fails for no good reason (I get NVPM_ERROR_COUNTER_NOT_ENABLED for some of the enabled counters).
In that case, I just repeat the experiment until I eventually get some results.
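
The "repeat until it works" approach can be sketched as a bounded retry loop. This is only an illustration under stated assumptions: `SampleStatus` is a stand-in enum invented for the sketch (real code would check the result codes from the PerfKit header), and `runExperiment` is a hypothetical callback that performs one complete multi-pass experiment and reads the counters.

```cpp
#include <cassert>
#include <functional>

// Stand-in status codes for this sketch only; real code would use the
// result values defined in the PerfKit header instead.
enum class SampleStatus { Ok, CounterNotEnabled, OtherError };

// Calls runExperiment (hypothetical: one full experiment plus counter
// readout) until it stops reporting CounterNotEnabled or until
// maxAttempts is reached.
// Returns the number of attempts used, or -1 if no attempt succeeded.
int sampleWithRetry(const std::function<SampleStatus()>& runExperiment,
                    int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        const SampleStatus status = runExperiment();
        if (status == SampleStatus::Ok) {
            return attempt;
        }
        if (status != SampleStatus::CounterNotEnabled) {
            return -1;  // a different error: retrying is unlikely to help
        }
    }
    return -1;  // still failing after maxAttempts
}
```

Bounding the number of attempts keeps a permanently failing counter from stalling the application forever.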

gpu_sampleCounters Bottle
Sampling Counters:
Experiment takes 17 passes.

IA Bottleneck: 0
Primitive Setup Bottleneck: 0
Rasterization Bottleneck: 9
ZCull Bottleneck: 0
SHD Bottleneck: 45
TEX Bottleneck: 9
ROP Bottleneck: 66
FB Bottleneck: 6
Stream Out Bottleneck: 0
Tessellator Bottleneck: 0
L2 Bottleneck: 12
GPU Bottleneck: 6


gpu_sampleCounters OGL
Sampling Counters:
Experiment takes 1 passes.

OGL frame time: 10
OGL driver time waiting: 9
OGL driver waits for GPU: 5
OGL driver waits for kernel: 4
OGL driver waits for lock: 0
OGL driver waits for render: 0
OGL driver waits for swap: 0
OGL memory allocated: 952M
OGL memory allocated (textures): 342M
OGL memory allocated (vertex): 43.1M
OGL batch count: 1.07K
OGL vertex count: 742K
OGL primitive count: 212K

Aleksandar · January 28, 2014, 3:29am

Thank you very much for the counter values.

Yes, your application’s OGL values are correct. The sum of all waits is equal to “time waiting”. It is interesting that there is a lot of CPU execution during your drawing code.
In my application only “waits for GPU” is non-zero (e.g. OGL frame time: 16, OGL driver time waiting: 13, OGL driver waits for GPU: 13). But that was measured on G80 with old drivers/PerfKit. I have no results on Fermi/332.
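
The consistency check described above (the individual waits should add up to “time waiting”) can be written down as a small helper. This is a sketch, not part of PerfKit: the struct, its field names, and the default tolerance are assumptions made for illustration, and the values are whatever the OGL counters report, in their own units.

```cpp
#include <cassert>
#include <cmath>

// Readings of the OGL driver wait counters (names chosen for this sketch).
struct OglDriverWaits {
    double timeWaiting;     // "OGL driver time waiting"
    double waitsForGpu;     // "OGL driver waits for GPU"
    double waitsForKernel;  // "OGL driver waits for kernel"
    double waitsForLock;    // "OGL driver waits for lock"
    double waitsForRender;  // "OGL driver waits for render"
    double waitsForSwap;    // "OGL driver waits for swap"
};

// Adds up the five individual wait counters.
double sumOfWaits(const OglDriverWaits& w) {
    return w.waitsForGpu + w.waitsForKernel + w.waitsForLock +
           w.waitsForRender + w.waitsForSwap;
}

// True if the individual waits add up to "time waiting" within a
// tolerance (the default of 0.5 is an arbitrary allowance for rounding).
bool waitsAreConsistent(const OglDriverWaits& w, double tolerance = 0.5) {
    assert(tolerance >= 0.0);
    return std::fabs(sumOfWaits(w) - w.timeWaiting) <= tolerance;
}
```

With the values posted above (time waiting 9; waits 5, 4, 0, 0, 0) the check passes, and it also passes for the G80 example (13 and 13, 0, 0, 0, 0).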

Another interesting observation is the sum of all bottleneck values. It should be 100, since each value is the percentage of time a particular unit is the bottleneck. In your case the sum is 147 (“GPU Bottleneck” is excluded, since it is not a percentage but the ID of the unit that is the bottleneck), so it is a very rough estimate. Of course, that is a consequence of the multi-pass measurement: you have 17 passes, and not all counters are read in each pass (naturally, they cannot all be read in a single pass). But PerfKit should somehow correct those values so that they represent mean values normalized to 100%.
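
For illustration, the kind of normalization meant here can be applied by the caller after reading the counters. The sketch below is an assumption about how one might post-process the values, not something PerfKit does; the type and function names are made up, and the input data are the per-unit values posted earlier in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// One per-unit reading, e.g. {"ROP", 66}.
using Reading = std::pair<std::string, double>;

// Sum of the raw per-unit bottleneck values.
double bottleneckSum(const std::vector<Reading>& readings) {
    double sum = 0.0;
    for (const auto& r : readings) {
        sum += r.second;
    }
    return sum;
}

// Rescales the values so that they add up to 100.
// If the sum is not positive, the readings are returned unchanged.
std::vector<Reading> normalizeTo100(std::vector<Reading> readings) {
    const double sum = bottleneckSum(readings);
    if (sum <= 0.0) {
        return readings;
    }
    for (auto& r : readings) {
        r.second = r.second * 100.0 / sum;
    }
    return readings;
}

// Per-unit values from the listing above; "GPU Bottleneck" is left out
// because it is a unit ID, not a percentage.
std::vector<Reading> postedReadings() {
    return {{"IA", 0},          {"Primitive Setup", 0}, {"Rasterization", 9},
            {"ZCull", 0},       {"SHD", 45},            {"TEX", 9},
            {"ROP", 66},        {"FB", 6},              {"Stream Out", 0},
            {"Tessellator", 0}, {"L2", 12}};
}
```

Applied to the posted values, the raw sum is 147 and ROP drops from 66 to roughly 45 after rescaling. Rescaling only fixes the total; it cannot recover information lost because the counters were sampled in different passes.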

On the 332 drivers there are other problems, such as:

the “GPU Bottleneck” counter should implicitly enable all other counters (it worked with old drivers/PerfKits), but it doesn’t (all GetCounterValueByName() calls return NVPM_ERROR_COUNTER_NOT_ENABLED),
GetGPUBottleneckName() cannot resolve the unit name by ID.

Does GetGPUBottleneckName() work on your configuration? Just pass the value returned by GetCounterValueByName(…, “GPU Bottleneck”, …, &value, …) to GetGPUBottleneckName().

Btw, on my (GTX470/332) configuration all experiments without “GPU Bottleneck” require 15 passes, “GPU Bottleneck” alone requires 16 passes, and all of them together also require 16 passes (so “GPU Bottleneck” should include all other experiments). According to your test, Kepler requires one pass more.