I have begun some research on a 16 bits/component extension of the DXT method
The idea is to work with fully separate Red, Green, Blue and Alpha planes, in order to handle something like the 3 bpp alpha DXT compression, but with 16 bits on each of the four distinct R, G, B and A planes instead of 16 bits for all of them
First, we can pre-process the picture with a color transformation, something like the YCbCr scheme frequently used in the video domain in the 4:2:2 format
(the color transformation can be YCoCg, and/or the format 4:4:4 or 4:2:0, if we want to handle more, or less, quality)
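For illustration, here is a sketch of the YCoCg transform in floating point (the function names are mine; a real pipeline would additionally subsample the Co/Cg planes to get 4:2:2 or 4:2:0):

```python
def rgb_to_ycocg(r, g, b):
    """Forward YCoCg: one luma plus two chroma axes, exactly invertible."""
    y = r / 4 + g / 2 + b / 4
    co = r / 2 - b / 2
    cg = -r / 4 + g / 2 - b / 4
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    """Inverse transform, recovering the original RGB triple."""
    return y + co - cg, y + cg, y - co - cg
```

Because all the coefficients are powers of two, the round trip is lossless in floating point, which is one reason YCoCg is often preferred over YCbCr for texture work.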
Second, each channel/plane is compressed independently of the others, using 8x8 tiles instead of only 4x4, and two 16-bit keys per channel instead of the maximum of 6 bits per channel used in the standard DXT schemes
This compression gives about 16 bits per pixel in output for an HDR picture of 16x4 = 64 bits per pixel in input, so a 4x compression to handle an RGBA HDR picture (16 bits/component)
Is this case already handled by a recent DXTn extension?
(this seems to be the case for the BC6/7 compression schemes, but they seem very difficult to handle and I have found nothing about them …)
We can also use 3 distinct textures, in order to fully separate the min/max color key maps stored in the first and second textures from the index/ramp map stored in the third texture; this greatly helps the handling of block sizes other than 4x4 or 8x8
So, the fragment shader can be as simple as this on the decompression side
(note that we can replace the uniform variables with varyings if we want a smoother transition between blocks)
#version 330 core
in vec2 TexCoord;
out vec4 FragColor;
uniform sampler2D minTexture; // HDR minimal colors map, at a low definition
uniform sampler2D maxTexture; // HDR maximal colors map, at a low definition
uniform sampler2D rampTexture; // simple 8-bit monochrome ramp map, at a high definition
void main()
{
    // .r because rampTexture is single-channel: all four components of the
    // endpoints are mixed by the same scalar coefficient
    FragColor = mix( texture(minTexture, TexCoord), texture(maxTexture, TexCoord), texture(rampTexture, TexCoord).r );
}
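For reference, the corresponding compression pass per plane is just a per-block min/max reduction plus a normalized 8-bit ramp. A minimal NumPy sketch (the function name and data layout are my own assumptions, with the block size dividing the picture size):

```python
import numpy as np

def compress_plane(plane, block=8):
    """Turn one channel plane (H x W floats) into the three maps the
    shader samples: per-block min/max endpoint keys (16-bit halfs)
    and a full-resolution 8-bit ramp coefficient."""
    h, w = plane.shape
    blocks = plane.reshape(h // block, block, w // block, block)
    c0 = blocks.min(axis=(1, 3))  # minimal endpoint of each block
    c1 = blocks.max(axis=(1, 3))  # maximal endpoint of each block
    # Broadcast the endpoints back to full resolution to derive the ramp.
    c0f = np.repeat(np.repeat(c0, block, axis=0), block, axis=1)
    c1f = np.repeat(np.repeat(c1, block, axis=0), block, axis=1)
    span = np.where(c1f > c0f, c1f - c0f, 1.0)  # guard flat blocks
    coef = np.round((plane - c0f) / span * 255.0).astype(np.uint8)
    return c0.astype(np.float16), c1.astype(np.float16), coef
```

Decompression is then exactly the shader's mix(); the per-texel error is bounded by the block's value span divided by 255, plus the half-float quantization of the endpoints.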
There is no “3 bpp Alpha DXT compression”. There is a 4 bpp version, and a different version that uses two key colors and interpolation between, like a standard S3TC block. But that’s also functionally 4 bpp (64 bits are used to store 16 alpha values).
But in any case, if you used four 4-bpp color planes… you have 16-bpp colors.
Also, ASTC and BPTC have dedicated HDR compression functionality; they likely can beat whatever you’re trying to do.
The idea is to greatly simplify all this handling by the use of 3 textures: one that stores the minimal endpoint (the minimalEndpoints texture), one that stores the other endpoint (the maximalEndpoints texture), and a third texture that stores the coefficient to apply to each component. The only constraint is that the three textures use the same number of components, but they can have very different sizes and/or different formats
(this automatically handles all the block sizes that you want, including non-integer sizes …)
This handles nearly all cases, but with a very, very simple scheme instead of extraordinarily complex schemes like BC6, BC7 or the other ASTC or BPTC formats, which are a real nightmare to fully implement …
This is something like what mipmaps and linear interpolation already handle, but with a third texture that stores the coefficient to apply to each component at each texel:
Color(u,v) = C1(u,v) * Coef(u,v) + (1 - Coef(u,v)) * C0(u,v), with u and v texcoords in the 0..1 range
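As a tiny numeric sanity check of that formula (scalar form, the function name is mine):

```python
def decode_texel(c0, c1, coef):
    """Color = C1 * Coef + (1 - Coef) * C0, i.e. GLSL's mix(C0, C1, Coef)."""
    return c1 * coef + (1.0 - coef) * c0

# coef 0 returns the minimal endpoint, coef 1 the maximal one,
# and coef 0.5 the exact midpoint:
print(decode_texel(0.25, 0.75, 0.5))  # 0.5
```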
The fragment shader can be modified to make the four steps of the method more visible on the decompression side
You can implement this I suppose, but it’s not going to be fast. And the data size of what you’ve described seems to be substantially bigger than the alternatives.
I don’t see what it matters if they’re “extraordinally[sic] complex”; that complexity isn’t something you have to deal with. Unless you’re writing a compressor, you don’t have to care how complex these formats are under the hood.
But I want both the decompressor AND the compressor to be relatively simple to handle/implement
And adding complexity isn't always the right way; think a little about RISC (Reduced Instruction Set Computer) vs CISC (Complex Instruction Set Computer) architectures, and why the SIMD model is used more and more today than before …
Note that the c0, c1 and coef texture fetches can be made in parallel, so the fragment shader can handle the 3 texture fetches in parallel; only the mix() function has to be done after them, in a serial manner
OK: why? How will that make your program meaningfully better to users of it?
Is your compressor going to beat other compressors in visual quality? Almost certainly not; ASTC compressors are really good at what they do, and your format itself is actually not that great in terms of fidelity in comparison. Are your textures going to be smaller than those that use ASTC? No; your textures in aggregate are significantly bigger than any compression format. Is your shader that accesses them going to be faster? No; 3 memory fetches is 3x as expensive as 1 memory fetch.
Oh, and your format cannot handle the most important aspect of HDR: that texel data can have values larger than 1.0.
Unless you’re doing compression on the fly, there’s no clear, objective advantage to rolling your own compressor/decompressor. For a personal project, sure, do as you like. Just be aware that what you’re doing is not better in any way that’s meaningful to users of your software. If you’re fine with that, then go forward with it.
All I see are 3 memory fetches where 1 (smaller) fetch will do. Memory fetches are the most expensive thing a shader can do.
But if the c0, c1 and coef textures are in separate memory banks, can't they be handled in parallel, using one distinct “core/thread” for each of them?
If you’re talking about caching, 1 texture is more likely to be found in the cache than 3. Especially since that 1 texture will be smaller than the aggregate size of the 3. So with your method, there’d be more cache contention, not less, more cache misses, and more having to fetch actual memory.
It’s going to be slower. It may not be precisely 3x as slow, but it certainly will be slower.
I said “memory banks” because if the c0, c1 and coef textures are at distinct memory block locations, using a distinct cache for each of them makes it possible to resolve the very high memory cache contention
(the same technique can certainly be applied to mipmaps, for example, where the lower mipmap levels reuse the last fetch location far less frequently, 2x2 = 4x less from one level to the next, than the biggest mipmap level)
I think I will manually prepare an example set of c0, c1 and coef textures to begin with, to see if this is really as slow as that on the shader side
I have made some formulas to see if this can be profitable
An RGBA HDR texture using half-float storage in uncompressed form uses (W * H) * 4 components * 16 bits = (W * H) * 64 bits
With a block size of dx * dy, this makes (W * H) * 4 components * 16 bits / (dx * dy) = (W * H) * 64 / (dx * dy) bits to store each of the C0 and C1 textures, and (W * H) * 8 bits to store the Coef texture
For a block size of 4x4, i.e. dx = dy = 4, this makes a total storage of (W * H) * (64/16 + 64/16 + 8) bits = (W * H) * 16 bits, so 4x less than the uncompressed half-float HDR picture
If we use a block of 8x8, so dx = dy = 8, this makes a total storage of (W * H) * (64/64 + 64/64 + 8) bits = (W * H) * 10 bits, so more than 6x less storage than the uncompressed half-float picture
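Those formulas can be checked mechanically; here is the bit budget as a small function (constants as in the text: two 64-bit endpoint texels per block, plus one 8-bit coefficient per texel):

```python
def bits_per_pixel(dx, dy, endpoint_bits=64, coef_bits=8):
    """Storage per source texel: two endpoint textures at 1/(dx*dy)
    of the resolution, plus a full-resolution coefficient texture."""
    return 2 * endpoint_bits / (dx * dy) + coef_bits

print(bits_per_pixel(4, 4))  # 16.0 -> 4x smaller than 64 bpp half-float
print(bits_per_pixel(8, 8))  # 10.0 -> 6.4x smaller
```

Note that the 8-bit coefficient plane dominates the budget: even with huge blocks, the rate can never drop below 8 bpp.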
This is already profitable with a 4x4 block size compared to the BC7 format, which achieves a 4x compression while handling only 8 bits per RGBA component, instead of the 16 bits per component in my scheme
(and this is clearly even more profitable with a block size of 8x8)
Note that a substantially better compression/decompression scheme seems to be the use of JPEG HDR pictures instead of BC7 pictures
(on the quality and file storage size sides, not on the speed or memory size sides, because we have to decompress the entire picture before we can use texels from it, instead of only a small part of it with BC7)
Are you aware of any hardware that has such “memory banks” or “distinct caches” for separate textures? Maybe there’ve been some recent changes, but I’m fairly sure that’s not a thing.
Nobody is suggesting that your format wouldn’t be smaller than uncompressed textures. I’m comparing it to other compressed formats.
BC6H is 8-bpp, as is ASTC with 4x4 blocks. Your hypothetical format is 16-bpp, twice the size.
And it also cannot handle values larger than 1.0, so not only are those other formats smaller, those other formats are actual HDR formats.
Why are you comparing a non-HDR format like BC7 to your hypothetical HDR format? Also, BC7 is still half the size of your hypothetical format.
ASTC with 8x8 blocks is still twice as compressed as your format.
This is exactly why I am speaking about EXTENSIONS/ADDITIONS/PROGRESSION
=> can DXTn, S3TC, ASTC or another PVRTC be directly handled by old hardware, for example, or the Intel MMX/SSE instructions be present on Intel 8086, 80286, 80386 or 80486 processors?
==> of course not, and this is why the world has greatly progressed since then
And yes, this is why modern processors now have three layers of caches …
(think “bank/main memory” instead of “textures” for the analogy)