glTexSubImage still unacceptably slow

Hello,

By way of introduction, my problem concerns texture-based 2-D image visualization. I know the subject pops up every once in a while, and I’ve googled it thoroughly. No real answer though, and granted, PBOs ain’t the solution here.

My problem is image load / reload speed, which gets unacceptably slow. I cannot even change the dynamic range at interactive speeds, not to mention hue, saturation or contrast. How does Photoshop implement its visualization? Any wild guesses? With hardware rendering turned on, it looks and behaves (seamless zoom, panning etc.) similar to my implementation. That is, after the image has been loaded!

Real world example: ~15Mpix RGB-image takes .899 seconds to reload.

Here are my code samples: Load is used the first time and Reload afterwards, e.g. when the image has been modified (in tests I’ve used a null modification, so the time is spent pretty much entirely in GL functions). The image is tiled into (2^N)x(2^N) blocks. A single texture is out of the question if large images are to be supported, and in my experience 256 seems to be the optimal texture size. I’ve tested RGBA / BGRA_EXT and several other combinations. LUMINANCE_ALPHA, for black & white images, is of course faster, but this cannot be a bandwidth issue, since 62MB is far below the roughly 4GB/s of PCIe x16. One more thing: I cannot really see a difference between Load and Reload. Rendering is done in the Windows WM_PAINT handler, so it does not affect this problem. Copying into the buffer is mandatory anyway, to support arbitrary dynamic ranges in floating point images.

Anyone? Would ya?


#define MY_TEXSIZE	256
class TextureObject;
void Load(){
	/*std::list<TextureObject> m_textures;*/
	DWORD *buffer = new DWORD[MY_TEXSIZE * MY_TEXSIZE];

	for ( /* loop over tiled image */ ){

		/* compute tile and texture coordinates in texture object */
		
		/* copy part of RGB (BGR) image to contiguous RGBA (BGRA) buffer */
		DWORD *d_ptr = (DWORD*)buffer;
		BYTE *im_ptr = (BYTE*)image->bits /* + location to current tile */;
		for(y=0;y<MY_TEXSIZE;++y){
			for(x=0;x<MY_TEXSIZE;++x){
				*(d_ptr++) = (*((DWORD*)im_ptr) & 0xffffff) | shiftedAlpha;
				im_ptr += 3;
			}
			im_ptr += offset;
		}
		
		glGenTextures( 1, &(tex.m_texname) );
		glBindTexture( GL_TEXTURE_2D, tex.m_texname );
		m_textures.push_back( tex );

		glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
		glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
		glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP);
		glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP);
		glPixelStorei( GL_UNPACK_ALIGNMENT, 4 ); 

		glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, MY_TEXSIZE, MY_TEXSIZE, 
			0, GL_BGRA_EXT,GL_UNSIGNED_BYTE, buffer );
	}
	delete[] buffer;	/* release the staging buffer; GL has copied the texel data */
}

void Reload(){
	/* start at the first tile's texture; the loop below advances it per tile */
	std::list<TextureObject>::const_iterator it = m_textures.begin();
	for ( /* loop over tiled image */ ){

		/* make your changes to image and load it into buffer */
		
		glBindTexture( GL_TEXTURE_2D, (*it).m_texname );
		glPixelStorei( GL_UNPACK_ALIGNMENT, 4 ); 
		glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, MY_TEXSIZE, MY_TEXSIZE,
			GL_BGRA_EXT, GL_UNSIGNED_BYTE, buffer );
		++it;
	}
}

I’m guessing that Photoshop is fine because it’s only updating one image, while you’re updating many. How many do you update per frame?

Thanks for the reply.

But I’m always updating one layer at a time, that’s what the load does:

In the case of 15Mpix (4752 x 3168), that would be 19 x 13 tiles (with about 1Mpix wasted on the border).
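
To make the arithmetic explicit, here is a minimal sketch of the tile-grid computation. The names `TilesAlong`, `TileGrid` and `ComputeTileGrid` are made up for this illustration and are not part of the code posted above; it assumes square tiles and 4 bytes per texel.

```cpp
#include <cassert>
#include <cstddef>

// Number of tiles of edge `tile` needed to cover `pixels` (ceiling division).
constexpr std::size_t TilesAlong(std::size_t pixels, std::size_t tile) {
    return (pixels + tile - 1) / tile;
}

// Hypothetical summary of a tiled image layout.
struct TileGrid {
    std::size_t cols;          // tiles per row
    std::size_t rows;          // tiles per column
    std::size_t tiles;         // total number of textures
    std::size_t wastedPixels;  // padding texels on the right and bottom border
    std::size_t uploadBytes;   // bytes sent to GL when every tile is reloaded
};

inline TileGrid ComputeTileGrid(std::size_t width, std::size_t height,
                                std::size_t tile, std::size_t bytesPerTexel) {
    assert(tile > 0);
    TileGrid g;
    g.cols = TilesAlong(width, tile);
    g.rows = TilesAlong(height, tile);
    g.tiles = g.cols * g.rows;
    const std::size_t paddedPixels = g.tiles * tile * tile;
    g.wastedPixels = paddedPixels - width * height;
    g.uploadBytes = paddedPixels * bytesPerTexel;
    return g;
}
```

For 4752 x 3168 with 256-pixel tiles and 4 bytes per texel, `ComputeTileGrid(4752, 3168, 256, 4)` gives 19 x 13 = 247 tiles, about 1.1Mpix of padding, and 64,749,568 bytes (61.75 MiB) per full reload, which is where the 62MB figure comes from.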

Originally I used Gdiplus (i.e. ‘rendering’ into a Gdiplus::Bitmap). I’m pretty confident that reloading was faster back then, although without zooming and panning capabilities. Resampling only the requested rectangle could therefore be a possible solution, but that would totally incapacitate pan and zoom. I don’t think Photoshop does that either, at least not in hardware rendering mode.

that would be 19 x 13 tiles

That’s 247 individual loads. I’m not surprised that it’s as slow as you say. Are you trying to run this on some hardware that only uses 256x256 textures?

OP: why don’t you tell us some really important things:
a) which hardware
b) which driver version
c) which OS

That’s 247 individual loads. I’m not surprised that it’s as slow as you say.

As the OP said, this is not even 60MB of data, this should be very well doable in a few milliseconds. The overhead of 247 calls should be negligible.

c;a;b:

Windows XP Pro; Intel Core i5 520M, 4096MB, Intel HD Graphics; OpenGL driver 6.14.10.5258
Windows 7 Pro 64-bit; Intel Core i3 2310M, 2048MB, Intel HD Graphics; ?
Windows XP Home; Intel Core 2 Duo E4300, 3072MB, Ati X1950Pro 256MB; ‘latest ATi non-legacy drivers’

I’m using the i5 right now, but the results are similar on both other systems as well, so I wouldn’t consider this an ‘old drivers’ issue. Please don’t judge me for the laptop hardware: I have the aforementioned Photoshop CS5 running happily on both the i3 and the E4300, which makes it the gold standard here. I use the Windows headers and the Visual Studio 2008 compiler.

The reason I tile at 256 is that (i) I don’t wanna waste memory (with this setup I can view 80MPix images with ease and still conform to power-of-2 sizes) and (ii) smaller tiles decrease performance. It seems no coincidence that the magic 256x256 = 0xffff+1 is the optimum.

If I comment out the GL calls in Reload (but keep the caller’s wglMakeCurrent stuff), the reload time shortens to 1/10 or better.

You could try GL_RGBA rather than GL_BGRA to see if the driver is swizzling the components when you’re uploading. This would have an impact on performance.

Also, it’s unclear from your Reload() code where ‘buffer’ is coming from. If you’re filling it as is done in Load(), you could instead try using glPixelStorei(GL_UNPACK_ROW_LENGTH) to give GL the full image width and use the original image, so that it properly advances to the next scanline of the tile. That way you wouldn’t need to make a temporary copy. However, to do this, you’d need to either promote the source image to BGRA, or demote the GL texture to BGR.
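
As a rough sketch of what that means in practice: with GL_UNPACK_ROW_LENGTH set to the full image width, GL steps from one source row to the next by the full-image stride, and GL_UNPACK_SKIP_PIXELS / GL_UNPACK_SKIP_ROWS select where the tile starts. The helpers below only reproduce that address arithmetic on the CPU so it can be checked; `UnpackRowStride` and `TileStartOffset` are names invented for this example, and the GL calls are shown as comments because they need a current context. It assumes GL_UNSIGNED_BYTE data.

```cpp
#include <cassert>
#include <cstddef>

// Byte stride between consecutive source rows, as GL derives it for
// GL_UNSIGNED_BYTE data: the row size rounded up to GL_UNPACK_ALIGNMENT.
inline std::size_t UnpackRowStride(std::size_t rowLengthPixels,
                                   std::size_t bytesPerPixel,
                                   std::size_t alignment) {
    assert(alignment == 1 || alignment == 2 || alignment == 4 || alignment == 8);
    const std::size_t rowBytes = rowLengthPixels * bytesPerPixel;
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Byte offset (from the image base pointer) of the first texel GL reads
// for the tile at grid position (tileX, tileY).
inline std::size_t TileStartOffset(std::size_t tileX, std::size_t tileY,
                                   std::size_t tileSize,
                                   std::size_t imageWidthPixels,
                                   std::size_t bytesPerPixel,
                                   std::size_t alignment) {
    const std::size_t stride =
        UnpackRowStride(imageWidthPixels, bytesPerPixel, alignment);
    const std::size_t skipRows = tileY * tileSize;    // GL_UNPACK_SKIP_ROWS
    const std::size_t skipPixels = tileX * tileSize;  // GL_UNPACK_SKIP_PIXELS
    return skipRows * stride + skipPixels * bytesPerPixel;
}

// Corresponding GL calls for one tile of a BGRA source image:
//   glPixelStorei(GL_UNPACK_ALIGNMENT, 4);
//   glPixelStorei(GL_UNPACK_ROW_LENGTH, imageWidth);
//   glPixelStorei(GL_UNPACK_SKIP_PIXELS, tileX * MY_TEXSIZE);
//   glPixelStorei(GL_UNPACK_SKIP_ROWS, tileY * MY_TEXSIZE);
//   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, MY_TEXSIZE, MY_TEXSIZE,
//                   GL_BGRA_EXT, GL_UNSIGNED_BYTE, imageBits);
//   glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);   // restore the defaults
//   glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
//   glPixelStorei(GL_UNPACK_SKIP_ROWS, 0);
```

Note that tiles on the right and bottom border would read past the end of the image unless the uploaded width and height are reduced for those tiles.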

So far, I believe I have triple-checked both RGBA and BGRA without any real effect. That, however, leads me to suspect that the data I’m feeding is not entirely in harmony with the graphics driver. Also, as I said, I cannot see much difference between glTexImage and glTexSubImage calls.

Yeah, the buffer in ‘Reload’ is filled exactly like in ‘Load’, except that it is also a place-holder for dynamic range changes (scale+shift) etc. In the fastest case, when the range is [0,255], it is filled exactly as in ‘Load’.

Because of dynamic range, the fact that not all of my images are RGB8, and the way image processing libraries support RGB vs. RGBA, I have chosen to stick with the extra buffer. Without the GL calls, a 15Mpix image takes 0.0407 seconds, so this is a minor factor.

I wonder why there is no way to ask the driver what format it wants!
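
For illustration only, the scale+shift step for a floating point image could look roughly like the sketch below. `ToByte` and `PackBGRA` are hypothetical helpers, not the actual code used here; the sketch assumes a linear mapping value * scale + shift, clamping to [0,255], and an opaque alpha, packed in the same 0xAARRGGBB layout that the copy loop in Load produces.

```cpp
#include <cassert>
#include <cstdint>

// Map one sample to [0, 255] as value * scale + shift, clamped and rounded
// to the nearest integer.
inline std::uint8_t ToByte(float value, float scale, float shift) {
    const float v = value * scale + shift;
    if (!(v > 0.0f)) return 0;      // negative values, zero and NaN
    if (v >= 255.0f) return 255;    // saturate bright values
    return static_cast<std::uint8_t>(v + 0.5f);
}

// Pack one RGB pixel into a 32-bit word laid out as 0xAARRGGBB, alpha = 255.
inline std::uint32_t PackBGRA(float r, float g, float b,
                              float scale, float shift) {
    assert(scale == scale && shift == shift);  // reject NaN parameters
    return 0xFF000000u |
           (static_cast<std::uint32_t>(ToByte(r, scale, shift)) << 16) |
           (static_cast<std::uint32_t>(ToByte(g, scale, shift)) << 8) |
            static_cast<std::uint32_t>(ToByte(b, scale, shift));
}
```

To display a window [lo, hi] of the data, one would pick scale = 255 / (hi - lo) and shift = -lo * scale; for images already in [0,255] that is scale = 1 and shift = 0.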

Intel HD Graphics

Try GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV instead of GL_UNSIGNED_BYTE. :wink:
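
A small sketch of what this suggestion amounts to, under a few assumptions: the GL 1.1 gl.h that ships with Windows does not define the packed-pixel enum, so it is defined by hand below (0x8367, and 0x80E1 for GL_BGRA, are the values from the OpenGL 1.2 headers); `PackBGRA8888Rev` is a made-up helper name. With GL_BGRA and GL_UNSIGNED_INT_8_8_8_8_REV each texel is one 32-bit word with blue in the lowest byte and alpha in the highest, i.e. 0xAARRGGBB.

```cpp
#include <cassert>
#include <cstdint>

// Enums missing from the GL 1.1 headers; values as in the OpenGL 1.2 headers.
#ifndef GL_BGRA
#define GL_BGRA 0x80E1
#endif
#ifndef GL_UNSIGNED_INT_8_8_8_8_REV
#define GL_UNSIGNED_INT_8_8_8_8_REV 0x8367
#endif

// Build one texel for GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV:
// bits 0-7 blue, 8-15 green, 16-23 red, 24-31 alpha.
inline std::uint32_t PackBGRA8888Rev(std::uint8_t r, std::uint8_t g,
                                     std::uint8_t b, std::uint8_t a) {
    return (static_cast<std::uint32_t>(a) << 24) |
           (static_cast<std::uint32_t>(r) << 16) |
           (static_cast<std::uint32_t>(g) << 8) |
            static_cast<std::uint32_t>(b);
}

// The upload call then only changes its type argument (needs a current
// GL context, hence shown as a comment):
//   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, MY_TEXSIZE, MY_TEXSIZE,
//                   GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, buffer);
```

On a little-endian machine such a word sits in memory as the bytes B, G, R, A, the same layout as GL_BGRA with GL_UNSIGNED_BYTE, so the buffer filled in Load/Reload should not need to change. Only the type enum differs, and the idea is that some drivers take a faster upload path when the data is described this way.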

0x8367 right…

Thank You sooo much :slight_smile: That really was the keyword, and instead of the aforementioned .889 seconds I now have

.156 seconds for 15Mpix or .328 seconds for 34Mpix, which, in my experience, both seem comparable with any other software I’m using. On Windows XP, it’s now way better than Picture Viewer, not that that proves much anyway…

Thanks again mhagain