Accessing a texture2DArray, speed it up?

Mars_999 · June 8, 2008, 10:16pm

I have 16 textures in a texture array, and would like to know if there is any way to access the array once and get all 16 textures, so I can eliminate 16 calls to texture2DArray. As of now this is killing my FPS.

Thanks

NiCo1 · June 9, 2008, 12:26am

If your hardware supports packing/unpacking, you could pack your 8-bit texture data into 32-bit floating point textures, e.g. you can pack sixteen 8-bit intensity values into one texel of a 32-bit floating point RGBA texture.

Seth_Hoffert · June 9, 2008, 5:07am

I’d be curious to know how much this helps, since the same amount of bandwidth would be utilized. Would the instruction count savings speed things up? I suppose it could help the cache…

NiCo1 · June 9, 2008, 5:39am

I recently used something similar in a paper of mine and like you already mentioned, texture caching improved performance a lot. Additionally, in case the texture is generated on the graphics card, you can write the results to a single frame buffer attachment rather than using multiple render targets which also improves performance.

Seth_Hoffert · June 9, 2008, 1:44pm

I see, good to know. It’d be cool to see packing functions exposed to GLSL through an extension, but for now I suppose one can use bitshifting operations when dealing with non-floats anyway.

Mars_999 · June 9, 2008, 4:49pm

Ok, I am not 100% sure about the packing idea. Are you saying to take all 16 8-bit RGB textures and just dump them into one RGB floating point texture? And then do some math on the value, since it will be greater than 1.0, to get to each texture? E.g. red channel 0-255, 256-511, 512-767? Then in the shader just take each range out to get each texture?

NiCo1 · June 10, 2008, 2:39am

Well, first you need to check out if your hardware supports packing/unpacking.

A common pixel location in the 16 8bit RGB textures takes up 16*3 = 48 bytes of data. One 32bit floating point RGBA texture has storage for 16 bytes per pixel. This means that you could pack the data for a single pixel in the 16-D array into 3 pixels of floating point textures. On the one hand, this way the number of texture accesses will be reduced from 16 to 3. On the other hand, you will have to insert some unpacking instructions after accessing the texture to get the 8bit values back.
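The arithmetic above can be sketched in C. This is only an illustration of the byte counting; the function name `packed_fetches` is made up for this example and is not part of any API.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes held by one texel of an RGBA texture with 32-bit float channels. */
enum { BYTES_PER_RGBA32F_TEXEL = 4 * 4 };

/* Number of RGBA32F fetches needed to cover one pixel location across
   `layers` textures that each store `bytes_per_pixel` bytes per pixel.
   The division rounds up, because a partly filled texel still costs a fetch. */
size_t packed_fetches(size_t layers, size_t bytes_per_pixel)
{
    assert(layers > 0 && bytes_per_pixel > 0);
    size_t total_bytes = layers * bytes_per_pixel;
    return (total_bytes + BYTES_PER_RGBA32F_TEXEL - 1) / BYTES_PER_RGBA32F_TEXEL;
}
```

With 16 RGB textures, `packed_fetches(16, 3)` gives 3 fetches (48 bytes in 16-byte texels). With 16 single-channel 8-bit textures, `packed_fetches(16, 1)` gives 1.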

Mars_999 · June 10, 2008, 8:34am

I have a GF8800 GTS. So I am guessing it supports it? So what do I need to look up? Is there some extension?

NiCo1 · June 10, 2008, 10:24am

A GF8800 GTS definitely supports packing/unpacking. The only problem is that these functions are currently not exposed in GLSL but you can access them using assembly as described here (search for ‘pack’) or using Cg.

Ilian_Dinev · June 10, 2008, 11:54am

You can access them through GLSL:

vec4 arg = unpack_4ubyte(value);   // value is a float; each component of arg is always 0…1
float result = pack_4ubyte(arg);   // arg is a vec4

To see the names of the other functions, I had to open cgc.exe in a hex editor and search for “unpack”. (really, I couldn’t find them in any documentation or online otherwise) :
unpack_4ubytegamma
pack_4ubytegamma
unpack_4ubyte
pack_4ubyte
unpack_4byte
pack_4byte
unpack_2ushort
pack_2ushort
unpack_2half
pack_2half

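They are nifty instructions that nVidia hardware supports. They reinterpret the raw bits of a float, much like this C code:

float value = 1001.21f;
unsigned char* ptr1 = (unsigned char*)&value; // "unpack": view the float as its 4 raw bytes
float result = *(float*)ptr1;                 // "pack": read the same 4 bytes back as a float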

They are described in http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_fragment_program.txt

The instructions are: UP2H, UP2US, UP4B, UP4UB, PK2H, PK2US, PK4B, PK4UB
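For readers without NVIDIA hardware, here is a rough CPU-side emulation of the 4-ubyte pair in C. It is a sketch based on my reading of the PK4UB/UP4UB descriptions in the NV_fragment_program spec (clamp to 0..1, scale by 255, round, x in the lowest byte); the names `emu_pack_4ubyte`, `emu_unpack_4ubyte`, `float_to_bits` and `bits_to_float` are invented for this example, so check the spec before relying on the exact rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret bits without pointer casts, so no aliasing rules are broken. */
float bits_to_float(uint32_t bits) { float f; memcpy(&f, &bits, sizeof f); return f; }
uint32_t float_to_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }

/* Clamp to [0,1] and convert to 0..255 with round-to-nearest (assumed rounding). */
static uint32_t to_ubyte(float c)
{
    if (!(c > 0.0f)) c = 0.0f;   /* also catches NaN */
    if (c > 1.0f) c = 1.0f;
    return (uint32_t)(255.0f * c + 0.5f);
}

/* Emulates pack_4ubyte: four 0..1 components become the 4 bytes of one float. */
float emu_pack_4ubyte(const float rgba[4])
{
    uint32_t bits = to_ubyte(rgba[0])
                  | to_ubyte(rgba[1]) << 8
                  | to_ubyte(rgba[2]) << 16
                  | to_ubyte(rgba[3]) << 24;
    return bits_to_float(bits);
}

/* Emulates unpack_4ubyte: the 4 bytes of a float become four 0..1 components. */
void emu_unpack_4ubyte(float packed, float rgba[4])
{
    uint32_t bits = float_to_bits(packed);
    for (int i = 0; i < 4; i++)
        rgba[i] = (float)((bits >> (8 * i)) & 0xFFu) / 255.0f;
}
```

Note that the packed float is just a container for bits. Some byte combinations (for example a fourth component of 1.0 together with a third component of 0.5 or more) form NaN bit patterns, so the packed value should never be used in arithmetic, and it is worth verifying that such patterns survive upload and sampling on the target hardware.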

If you pack all 16 8-bit textures into one float32 RGBA texture and use the above instructions to unpack, and your shader needs unfiltered color values from the 16 textures at the same coordinates, you will speed up the shader. Not because of the texture cache, but because of GDDR3 latency, I bet. (Switching GDDR addresses is a relatively slow operation; it happens when starting to read from another texture, or from a very different coordinate.)
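The CPU-side packing step could look roughly like the C sketch below. The layout is an assumption (the thread does not specify one): layer L goes into channel L / 4, byte L % 4 of that channel's 32-bit word, with byte 0 the least significant. The function name `pack_layers_rgba32f` is hypothetical. The resulting buffer would then be uploaded as a float RGBA texture with nearest filtering, since filtering packed bits would corrupt them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LAYERS = 16, CHANNELS = 4, BYTES_PER_CHANNEL = 4 };

/* Pack LAYERS single-channel 8-bit images (each `texels` bytes long) into one
   RGBA32F image of `texels` texels, i.e. `texels * CHANNELS` floats in `out`.
   Assumed layout: layer L -> channel L / 4, byte L % 4 (byte 0 = lowest). */
void pack_layers_rgba32f(const uint8_t *const *layers, size_t texels, float *out)
{
    assert(layers != NULL && out != NULL);
    for (size_t i = 0; i < texels; i++) {
        for (int c = 0; c < CHANNELS; c++) {
            uint32_t word = 0;
            for (int b = 0; b < BYTES_PER_CHANNEL; b++)
                word |= (uint32_t)layers[c * BYTES_PER_CHANNEL + b][i] << (8 * b);
            /* Copy the raw bits; the float is only a container here. */
            memcpy(&out[i * CHANNELS + c], &word, sizeof word);
        }
    }
}
```

The bits are written with memcpy rather than assigned as float values, so no floating point conversion ever touches the packed data on the CPU.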

(ATi hardware will never support these nifty instructions, afaik)

I use these instructions/“functions” when drawing to multiple render targets, or simulating MRT (whichever case performs faster, again due to GDDR3 latency). This also allows me to mix render-target formats, i.e. use an FBO’s MRT#1 as RGBA8, MRT#2 as R32F, and MRT#3 as RG16 half-float.

knackered · June 10, 2008, 12:53pm

I dabbled in unpacking some bytes from a float in GLSL a while back.
Here’s a link to some good information I got from others on this forum:
http://www.opengl.org/discussion_boards/…true#Post221227

Seth_Hoffert · June 10, 2008, 1:40pm

That’s really cool, thanks for letting us know that the functions are exposed to GLSL as well. I wish this stuff was better documented!

knackered · June 10, 2008, 2:19pm

But…they’re not exposed to GLSL. That’s just the Cg compiler kicking in again. They’re exposed to the nvidia non-compliant GLSL compiler. If you’re going to be using these kinds of nvidia-specific “language extensions”, then just use Cg itself directly. At least you then get the benefit of profiles, so you can target other hardware seamlessly.

Seth_Hoffert · June 10, 2008, 3:00pm

Ohh, that’s true. Well, hopefully we see these things get promoted to cross-vendor GLSL extensions, so we don’t have to leave GLSL.

Mars_999 · June 10, 2008, 3:55pm

Unless GL3.0+ drivers from ATI support all current extensions that Nvidia is using, e.g. texture arrays, GS, etc., I am planning on making my stuff run on Nvidia only. Sad, but ATI needs to get their head out of their asses, and get back to making GL drivers that are worth talking about again.

As for the packing, I hope to give this a try soon. If anyone else tries this, it would be nice to see some posts on the performance gains from the implementation.

Ilian_Dinev · June 10, 2008, 8:47pm

I did some extensive benchmarks.

hardware: GF7600GT AGP8x GDDR3 128-bit membus @ 1.4 GHz; win2k sp4; Sempron3000+

Scene1: 20 full-screen 1280x720 quads, depth-test disabled
Scene2: 3000 random triangles in a VBO, all with z=0, no transform, depth-test enabled, depth-func: GL_LESS

shader1 uses 16 unique textures, all the same size, at 8 bpp. Code:


varying vec2 texcoord0;

#if IS_VERTEX
void main(){
	texcoord0 = gl_MultiTexCoord0.xy;
	gl_Position = gl_Vertex;
}
#endif

#if IS_FRAGMENT
uniform samplerRECT tex[16];

void main(){
	vec4 color=vec4(0.0); // explicit constructor; "=0" only compiles on nVidia's lenient compiler
	for(int i=0;i<16;i++){
		color+=textureRect(tex[i],texcoord0);
	}
	color/=10.0; // float literal; strict GLSL has no implicit int-to-float conversion
	gl_FragColor = color;
}
#endif

shader2 uses one RGBA32F texture of the same size, with the data of the 16 textures packed into it, interleaved per texel. It produces exactly the same output as shader1. Code:


varying vec2 texcoord0;

#if IS_VERTEX
void main(){
	texcoord0 = gl_MultiTexCoord0.xy;
	gl_Position = gl_Vertex;
}
#endif

#if IS_FRAGMENT
uniform samplerRECT tex;

void main(){
	vec4 color1=textureRect(tex,texcoord0);
	
	vec4 c0 =unpack_4ubyte(color1.x); 
	vec4 c1 =unpack_4ubyte(color1.y);
	vec4 c2 =unpack_4ubyte(color1.z);
	vec4 c3 =unpack_4ubyte(color1.w);	
	
	c0+=c1;
	c2+=c3;
	c0+=c2;
	
	c0.x+=c0.y;
	c0.z+=c0.w;
	
	gl_FragColor = vec4((c0.x+c0.z)/10.0); // same value in all four channels
}
#endif

/*
	// this is the uber-slow version of the above code
	color+= c0.x;	color+= c0.y; color+= c0.z; color+= c0.w;
	color+= c1.x;   color+= c1.y; color+= c1.z; color+= c1.w;
	color+= c2.x;	color+= c2.y; color+= c2.z; color+= c2.w;
	color+= c3.x;	color+= c3.y; color+= c3.z; color+= c3.w;
	color/=10;
*/

Now I’ll vary only the texture sizes, and measure the framerate.

Textures 128x128
Scene1: Shader1=21.0FPS, Shader2=23.6FPS
Scene2: Shader1=12.7FPS, Shader2=15.0FPS

Textures 256x256
Scene1: Shader1=20.5FPS, Shader2=23.0FPS
Scene2: Shader1=9.9FPS, Shader2=14.6FPS

Textures 512x512
Scene1: Shader1=20.5FPS, Shader2=23.0FPS
Scene2: Shader1=6.5FPS, Shader2=13.5FPS

Textures 1024x1024
Scene1: Shader1=16.7FPS, Shader2=20.7FPS
Scene2: Shader1=3.8FPS, Shader2=10.5FPS

Textures 2048x2048
Scene1: Shader1=8.7FPS, Shader2=15.0FPS
Scene2: Shader1=3.1FPS, Shader2=8.6FPS
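
To read the numbers above as speedups, a small C sketch can divide the packed framerate by the 16-texture framerate. The FPS values are copied from the results in this post; the struct, array and function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the results above: texture size and FPS per scene and shader. */
struct fps_row {
    int size;
    double scene1_separate, scene1_packed;  /* Scene1: Shader1, Shader2 */
    double scene2_separate, scene2_packed;  /* Scene2: Shader1, Shader2 */
};

const struct fps_row ROWS[] = {
    {  128, 21.0, 23.6, 12.7, 15.0 },
    {  256, 20.5, 23.0,  9.9, 14.6 },
    {  512, 20.5, 23.0,  6.5, 13.5 },
    { 1024, 16.7, 20.7,  3.8, 10.5 },
    { 2048,  8.7, 15.0,  3.1,  8.6 },
};

/* Speedup of the packed shader over the 16-texture shader. */
double speedup(double separate_fps, double packed_fps)
{
    assert(separate_fps > 0.0);
    return packed_fps / separate_fps;
}

/* Print one line per texture size with the speedup for each scene. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof ROWS / sizeof ROWS[0]; i++)
        printf("%4dx%-4d  Scene1: %.2fx  Scene2: %.2fx\n",
               ROWS[i].size, ROWS[i].size,
               speedup(ROWS[i].scene1_separate, ROWS[i].scene1_packed),
               speedup(ROWS[i].scene2_separate, ROWS[i].scene2_packed));
}
```

For Scene2 the ratio grows from roughly 1.2x at 128x128 to roughly 2.8x at 1024x1024 and 2048x2048; for Scene1 it stays near 1.1x until the largest sizes.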