Did some more testing, since what V-Man replied on the gamedev.net forums simply couldn't run at 84-204 fps in the test case I mentioned.
I removed MSAA from my benchmarking app, and here are the numbers (convert to the more relevant milliseconds yourself, e.g. 209 fps ≈ 4.8 ms):
Btw, filtering is trilinear and every texture has mipmaps. The viewport is 1280x720.
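For reference, a minimal sketch of that kind of texture setup (assumes a current GL context with a loader like GLEW already initialized; the 2048x2048 size and the `pixels` buffer are placeholders, not my actual benchmark assets):

```cpp
// One mipmapped, trilinear-filtered texture (glGenerateMipmap needs GL 3.0+).
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 2048, 2048, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels); // 'pixels' = placeholder image data
glGenerateMipmap(GL_TEXTURE_2D);                 // build the full mip chain
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                GL_LINEAR_MIPMAP_LINEAR);        // trilinear minification
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
```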
Case 1: real level + a lot of CPU work (computing a skinned mesh of 40k vertices, for research):
Small textures: 209 fps (all textures are visible onscreen, btw)
900 MB of textures: 109-190 fps
900 MB of textures, including a z-pass (a depth pre-pass) to remove overdraw: 140-204 fps
Case 2: real level only.
Small textures: 740-1310 fps
900 MB of textures: 190/400/1200 fps (depending on the view: zoomed in on a complex scene / regular view / bird's-eye)
900 MB of textures with a z-pass (sketched below): 300/450/1100 fps
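For anyone unfamiliar, the z-pass above is an ordinary depth pre-pass. A minimal sketch, where drawScene() is a placeholder for whatever issues the level's draw calls:

```cpp
// Pass 1: lay down depth only - no color writes, no texturing needed.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
drawScene();

// Pass 2: full shading. GL_LEQUAL plus a read-only depth buffer means only
// the front-most fragment per pixel runs the expensive texture fetches.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glDepthFunc(GL_LEQUAL);
drawScene();
```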
But in my case, the RAM and CPU are heavy hitters: a Core 2 Duo E8500 @ 3.8 GHz with dual-channel DDR3 @ 1.6 GHz (CL7).
If OpenGL were uploading/replacing whole textures on the GPU every frame, it would have needed over 1 TB/s of bandwidth from PCIe (900 MB x ~1200 fps ≈ 1.1 TB/s), whereas my PCIe 1.0 card can only give about 4 GB/s. Thus, the GPU definitely fetches individual texels (or blocks of texels) on its own via DMA.
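A quick back-of-the-envelope check of that claim, using only the numbers from the runs above (the worst-case assumption being that the full 900 MB set is re-uploaded each frame):

```cpp
#include <cstdio>

int main() {
    const double texture_bytes = 900.0 * 1024.0 * 1024.0; // ~900 MB of textures
    const double fps           = 1200.0;                  // best case observed above
    const double needed        = texture_bytes * fps;     // bytes/s if fully re-uploaded per frame
    const double pcie1_bw      = 4.0e9;                   // ~4 GB/s for PCIe 1.0 x16

    printf("required: %.2f TB/s, available: %.1f GB/s (%.0fx short)\n",
           needed / 1e12, pcie1_bw / 1e9, needed / pcie1_bw);
    // -> required: 1.13 TB/s, available: 4.0 GB/s (283x short)
}
```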