[QUOTE=flyingpengiun;1284560]I currently have to deal with a workload where small textures are frequently uploaded (glTexSubImage2D) and are immediately bound and used for drawing a single time - so effectively a streaming workload with a very small data size (4 kB textures). As can be imagined, the per-call overhead of glTexSubImage2D is very high.
PBOs are even worse, as the nature of the workload does not allow for asynchrony.[/QUOTE]
I think you’re mixing up a few things. PBOs let you use buffer objects as an intermediary for texel transfer ops. And yes, some forms of buffer object (e.g. PBO) population lead to synchronization or parallelize poorly. For instance:
[QUOTE=flyingpengiun]…and the overhead of map/unmap is even higher.[/QUOTE]
Ordinary buffer mapping (e.g. garden-variety glMapBuffer/glUnmapBuffer) can be very slow.
However, other forms of buffer object population pipeline very well. For instance:
- UNSYNCHRONIZED buffer mapping with buffer orphaning. Or (as you indicated)
- PERSISTENT/COHERENT buffer mappings.
For details on both, see the Buffer Object Streaming wiki page.
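To make the first option concrete, here’s a minimal sketch of streaming a small texture update through a PBO using buffer orphaning plus an unsynchronized map. This is illustrative, not code from the thread: names like TEX_BYTES, texels, width, and height are placeholders you’d replace with your own.

```c
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

/* Orphan the buffer: passing NULL tells the driver to hand back a
 * fresh block of storage, so we never stall waiting on a transfer
 * from a previous frame that's still in flight. */
glBufferData(GL_PIXEL_UNPACK_BUFFER, TEX_BYTES, NULL, GL_STREAM_DRAW);

/* UNSYNCHRONIZED: promise the driver we won't touch data the GPU
 * is still reading (the orphan above guarantees that here). */
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, TEX_BYTES,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, texels, TEX_BYTES);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

/* With a PBO bound to PIXEL_UNPACK, the "pixels" pointer argument
 * is interpreted as a byte offset into the buffer, not a CPU address. */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```

In practice you’d round-robin a small pool of PBOs (or just re-orphan one as above) so each upload gets storage the GPU isn’t still consuming.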
I’ve used the former with PBOs to speed up uploads of lots of tiny (and large) texture MIPs. The latter should also work well with PBOs, though I’ve not used it.
According to NVidia, prefer the latter if your GL driver supports (and you allow it to use) multiple threads. Otherwise either should be fine.
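And a sketch of the second option: a persistently/coherently mapped PBO (requires ARB_buffer_storage / GL 4.4), mapped once at startup and written into forever after. Again illustrative only: RING_BYTES, ring_offset, TEX_BYTES, etc. are placeholder names for your own ring-buffer bookkeeping.

```c
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

/* Immutable storage, mapped persistently and coherently: writes from
 * the CPU become visible to the GPU without an explicit flush. */
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                   GL_MAP_COHERENT_BIT;
glBufferStorage(GL_PIXEL_UNPACK_BUFFER, RING_BYTES, NULL, flags);
char *base = (char *)glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0,
                                      RING_BYTES, flags);

/* Per upload: write into the next free region of the ring, then point
 * glTexSubImage2D at that region's byte offset. Note COHERENT removes
 * the need for flushes, not for synchronization -- before reusing a
 * region, wait on a fence (glFenceSync / glClientWaitSync) placed
 * after the draw that last read it. */
memcpy(base + ring_offset, texels, TEX_BYTES);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE,
                (const void *)(uintptr_t)ring_offset);
```

The map/unmap cost disappears entirely here; the price is that fence management is on you.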
I know, uploading those tiles in batches would be best - however the code is part of a library, and I can’t change the interface provided.
I wonder, would it be possible to use a persistently mapped immutable buffer (ARB_buffer_storage) for uploading those small textures?
Instead of a sampler I could manually access the buffer in a fragment shader, so the GPU could directly access system memory via the GTT.
You could, but…
If I understand you correctly, you’re talking about completely forgoing use of the GPU’s texture sampling hardware and doing the texture sampling/filtering yourself in the shader? You could, but you might want to first check out using Buffer Object Streaming methods to fill PBOs used to populate your texture MIPs. That seems to me the simpler change. The GPU’s whole texture sampling/filtering pipeline is optimized for lookup efficiency. The texture memory is even swizzled/tiled and cached to minimize the latency of memory lookups (though you do have to pay the cost of the swizzle on the first render after loading the texture).
But it’s your call, of course. If you don’t think GPU texture sampling/filtering is really buying you much in terms of performance, give it a shot. Better yet, try both and compare the performance!