Recently I have noticed really unusual behavior which leads me to the conclusion that something substantial has changed in texture handling in recent NVIDIA GPUs. (No, I don’t mean “bindless texturing”. Something that changes classical texture reading.)
Namely, adding lighting in my terrain rendering engine by using central differences to calculate normals in the vertex shader (some 4-8 additional texture reads per vertex) + additional calculation (not trivial at all) + extra mixing in the fragment shader resulted in about 30% lower performance on GF100 (Fermi/GTX470). At first glance it was an expensive task, and I tried to avoid it.
But a few days ago I tried the same test on my new laptop with a GM107 (Maxwell/GTX850M), and the results were, to say the least, “strange”. Overall performance without lighting was about 66% worse than on the GTX470. That is OK, since the GTX850M uses DDR3 memory with a 128-bit bus, etc. But after adding the lighting, frame rendering time on the GM107 was just 2.4% worse than without it. Compared to the GF100 and its 30% penalty, it was an amazing discovery: 8 additional textureOffset() reads per vertex + the additional calculation make practically no difference in execution speed. Take a look at the graphs for GF100 and GM107.
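For reference, the central-difference normal calculation is roughly the following (a NumPy sketch of the math the vertex shader performs; the heightmap, texel size, and function name are illustrative, not the actual shader code — in GLSL the neighbor reads would be the textureOffset() calls mentioned above):

```python
import numpy as np

def central_difference_normal(h, x, y, texel_size=1.0):
    """Surface normal from a heightmap via central differences.

    Mirrors the extra per-vertex texture reads: left/right
    neighbors give dh/dx, up/down neighbors give dh/dy.
    """
    dhdx = (h[y, x + 1] - h[y, x - 1]) / (2.0 * texel_size)
    dhdy = (h[y + 1, x] - h[y - 1, x]) / (2.0 * texel_size)
    n = np.array([-dhdx, -dhdy, 1.0], dtype=np.float64)
    return n / np.linalg.norm(n)

# Hypothetical heightmap standing in for the terrain texture array.
h = np.random.rand(16, 16)
n = central_difference_normal(h, 8, 8)
```

A flat heightmap yields the straight-up normal (0, 0, 1); any slope tilts it accordingly.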
There is almost nothing on the net about the improvements of NV Maxwell architecture. Or maybe the change has happened with Kepler? Does anybody have any idea what might be the cause of this strange (but great) behavior of Maxwell?
Besides two generations worth of texture improvements, Maxwell also has a large L2 cache (2 MB vs. 768 KB). I suspect this would help reduce texture VRAM fetches, especially if the map is somewhat larger than Fermi’s L2 but fits within Maxwell’s. How big is the texture map? The texture access may be bottlenecking the Fermi GPU but not Maxwell. Maxwell can also keep more threads in flight, hiding texture access latency.
In terms of processing power, Maxwell actually has more ALUs than the 470 - 640 vs. 448 - so it also may be that the additional lighting computation is less of a hit as well. The ALUs run a bit slower, though, so there’s not a big difference from a FLOPS point of view (GM 1.2 GHz, GF 1.1 GHz). The Streaming Multiprocessor structure has also changed pretty significantly several times since GF100 (GF104, GK, GM), so I’d expect overall compute efficiency to be better. Kepler also shifted away from compute to focus on graphics workloads, and GM107 seems to continue this (likely along with GM104, if such a chip exists; GM100 will likely be the compute-oriented GPU if prior model numbers are any indication).
So in all likelihood, Maxwell is simply better tuned to your application than Fermi was.
Actually, the GTX470 has 640 KB of L2 cache, since L2 cache is associated with the memory controllers on the Fermi architecture.
Fermi also has 12 kB of texture cache per SM. On Maxwell there is no separate texture cache; the L1 cache is used for texture caching.
Kepler GK104 also has a texture cache, while it has disappeared on GK110. I’m not sure how it works now. There are too few documents on the topic.
Not likely, since a terrain height texture array has 8x1100x1100x16b = 18.5 MB, and there are 3 such arrays. There are also 3 texture arrays for overlay, each with 16x3840x3840x4b = 112.5 MB. Of course, not all of the data is used. Overall memory consumption of the application is about 451.4 MB.
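As a quick sanity check on those figures (plain Python; sizes taken from the post, with 16b read as 16 bits per texel and 4b as 4 bits per texel):

```python
MB = 1024 * 1024

# One height texture array: 8 layers of 1100x1100 texels, 16 bits (2 bytes) each.
height_array = 8 * 1100 * 1100 * 2        # bytes

# One overlay texture array: 16 layers of 3840x3840 texels, 4 bits (0.5 bytes) each.
overlay_array = 16 * 3840 * 3840 * 0.5    # bytes

print(f"{height_array / MB:.1f} MB")      # 18.5 MB
print(f"{overlay_array / MB:.1f} MB")     # 112.5 MB
```

Either array alone is far larger than the L2 of both GPUs, which is why a cache-capacity explanation seems unlikely.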
I’m glad NVIDIA are streamlining GPUs to support my apps.
I found out what the problem was (and still is) on GF100. It is catastrophically bad ROP performance.
Or I have some other problem with Windows/Drivers/Environment.
Texture fetches in the VS are inexpensive. Just commenting out the following line in the fragment shader: