If we want to check whether a particular stage is the bottleneck in the pipeline, we decrease the load on that stage, and if the FPS increases, then that stage is a bottleneck. For example, in the case of the vertex shader, if we decrease the number of vertices or instructions in the shader, it should take less time because the workload is smaller, and hence the FPS should increase. Then how does that show it is the bottleneck stage? FS is just an example; in any stage, if we decrease the workload, the processing time would decrease, won’t it?
Yes, that is generally correct. However, there are some points you have to consider:
The FPS increases only until another stage becomes the bottleneck. If your VS and FS stages, for example, are nearly balanced, you won't see an FPS gain when you decrease the workload of either of them, because the other one will be saturated almost immediately.
DX10+ GPUs have a unified shader architecture, i.e. they use the same resources to run any type of shader. If you decrease the workload on your VS, you may free some GPU resources that can then be used for the FS. That way you could get a small FPS gain, even if the VS wasn’t the bottleneck.
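To make the first point concrete, here is a toy model, not a description of any real GPU. It assumes the stages overlap perfectly, so the slowest stage alone sets the frame time, and it ignores the unified-shader effect from the second point. The stage costs are made-up numbers.

```python
def frame_time_ms(stage_ms):
    """Toy model: fully overlapped stages, so the slowest one sets the frame time."""
    return max(stage_ms.values())


def bottleneck(stage_ms):
    """Name of the stage with the largest per-frame cost."""
    return max(stage_ms, key=stage_ms.get)


# Hypothetical, nearly balanced per-frame costs in milliseconds.
stages = {"VS": 4.0, "FS": 4.2}
before = frame_time_ms(stages)

# Halve the FS workload: the VS takes over as the limiting stage.
halved_fs = dict(stages, FS=stages["FS"] / 2)
after = frame_time_ms(halved_fs)

print(bottleneck(stages), before)      # FS 4.2
print(bottleneck(halved_fs), after)    # VS 4.0
```

Halving the FS cost only buys 0.2 ms here, because the VS becomes the limit almost immediately.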
Are we assuming that the stages are running in parallel?
Yes, if you only modify the amount of work in that stage (not any others), and if you do this with VSync disabled (so you can get continuously varying frame times).
Also, I suggest you use frame time, not FPS, as your performance metric (one of many links on this: Performance (Humus)).
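A quick bit of arithmetic shows why frame time is the easier metric to reason about. The FPS values below are only illustrative, and the helper names are my own.

```python
def fps_to_ms(fps):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


def cost_ms(fps_before, fps_after):
    """Extra time per frame implied by a drop from fps_before to fps_after."""
    return fps_to_ms(fps_after) - fps_to_ms(fps_before)


# A 450 FPS drop and a 3.75 FPS drop cost the same amount of time per frame.
print(round(cost_ms(900, 450), 2))     # 1.11
print(round(cost_ms(60, 56.25), 2))    # 1.11
```

The same 1.11 ms of extra work looks dramatic at 900 FPS and negligible at 60 FPS, which is why FPS deltas are misleading when you compare changes.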
For example, in the case of the vertex shader, if we decrease the number of vertices or instructions in the shader, it should take less time because the workload is smaller, and hence the FPS should increase. Then how does that show it is the bottleneck stage?
Modify the number of instructions, yes. Modify the number of vertices, not necessarily, because that can change both the vertex load and the fragment load, violating the assumptions of the test.
In general, start your tests by turning VSync off, and test from the tail end of the pipeline back toward the head. For instance, reduce your window size a bit. If you’re rendering the same batches, fit into the now-smaller window, and performance improves, you may be fragment limited. This could be fill (ROP) limited, fragment shader limited, texture fetch limited, etc., but it’s something having to do with your fragment pipeline.
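The window-size test boils down to comparing two measured frame times. A minimal sketch of that comparison follows. It assumes you have already measured both numbers yourself with VSync off, and the 5% threshold is an arbitrary choice of mine to filter out measurement noise, not an established rule.

```python
def relative_speedup(full_ms, small_ms):
    """Fraction of the frame time saved by shrinking the window."""
    return (full_ms - small_ms) / full_ms


def looks_fragment_limited(full_ms, small_ms, threshold=0.05):
    """True if the smaller window is faster by more than `threshold` (assumed noise floor)."""
    return relative_speedup(full_ms, small_ms) > threshold


# Hypothetical measurements: same batches, full-size window vs. reduced window.
print(looks_fragment_limited(10.0, 6.0))   # True: frame time drops with pixel count
print(looks_fragment_limited(10.0, 9.9))   # False: look earlier in the pipeline or at the CPU
```

A True result only says the limit is somewhere in the fragment pipeline; it does not distinguish fill, fragment shader, or texture fetch.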
Also, I will tell you that except in strange cases, desktop GPUs nowadays are more likely to be CPU limited than limited by something internal to the GPU. Get familiar with the maximum triangle throughput of your GPU (for non-tessellated and tessellated rendering), and if you’re not even close to that, strongly suspect inefficiency in how you’re pumping batches and state changes to the GPU through the GL driver.
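A rough way to do that throughput sanity check is sketched below. Every number is a placeholder: the peak rate in particular is hypothetical, so substitute the figure for your own GPU and your own measured frame time.

```python
def triangles_per_second(tris_per_frame, frame_ms):
    """Achieved triangle rate from a per-frame triangle count and a frame time in ms."""
    return tris_per_frame * 1000.0 / frame_ms


def fraction_of_peak(tris_per_frame, frame_ms, peak_tris_per_second):
    """How close the achieved rate is to the GPU's peak rate (0.0 to 1.0)."""
    return triangles_per_second(tris_per_frame, frame_ms) / peak_tris_per_second


# Hypothetical scene: 500k triangles per frame at 16 ms per frame.
achieved = triangles_per_second(500_000, 16.0)

# Hypothetical peak of 1 billion triangles/s; look up the real value for your GPU.
share = fraction_of_peak(500_000, 16.0, 1_000_000_000)

print(achieved)   # 31250000.0
print(share)      # 0.03125
```

At about 3% of the assumed peak, the GPU's triangle rate is unlikely to be the limit, which points back at how batches and state changes are submitted.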