OpenGL compressed wavelet texture support?


I have begun an API that performs wavelet compression/decompression for various picture formats, and I'm thinking of using it to handle a new compressed texture format in OpenGL (something like the GL_COMPRESSED_DXT_EXT texture formats, but with a much higher compression ratio).

I have found that wavelet compression gives a very high compression ratio (used in combination with RLE/Huffman and other compression/reordering/quantization techniques, of course), and I think this could become a very interesting OpenGL internal texture format for the next decade.

Does OpenGL have a provision (or better, internal texture formats that are already standardized) for such a texture format?

Note that mipmapping techniques can also be used very easily with such a format.
(This is intrinsic to the mathematical definition of wavelets…)

And what about internal video texture support for JPEG/MPEG/H.26x or better formats? :slight_smile:


Does OpenGL have a provision (or better, internal texture formats that are already standardized) for such a texture format?

No, and it never will. For good reasons.

The standard compressed format that GPUs support is some variation on S3TC. This format is designed specifically for the needs of texturing.

When you pull a 4x4 compressed block into the texture cache, you have 100% of the information you need to decompress those 16 texels. Furthermore, those 16 texels are arranged in a 4x4 block, so odds are that not only can you decompress the texels you yourself need, you can also decompress the texels that other nearby executions of the fragment shader will need. So not only does it cache reasonably well, it caches spatially. Also, the blocks are all of a fixed size. That makes it easy to fetch the specific texels you need.
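To make the fixed-size-block point concrete, here is a minimal software sketch of decoding one 64-bit DXT1/BC1 block (function names are mine; real hardware does this inside the texture unit). Everything needed for all 16 texels lives in those 8 bytes:

```python
import struct

def rgb565_to_rgb888(v):
    # Expand 5/6/5-bit channels to 8 bits by bit replication.
    r = (v >> 11) & 0x1F
    g = (v >> 5) & 0x3F
    b = v & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

def decode_dxt1_block(block):
    """Decode one 8-byte DXT1/BC1 block into 16 RGB texels (row-major 4x4)."""
    c0_raw, c1_raw, indices = struct.unpack('<HHI', block)
    c0 = rgb565_to_rgb888(c0_raw)
    c1 = rgb565_to_rgb888(c1_raw)
    if c0_raw > c1_raw:
        # Four-colour mode: two endpoints plus two interpolated colours.
        palette = [c0, c1,
                   tuple((2 * a + b) // 3 for a, b in zip(c0, c1)),
                   tuple((a + 2 * b) // 3 for a, b in zip(c0, c1))]
    else:
        # Three-colour mode: midpoint plus "transparent black".
        palette = [c0, c1,
                   tuple((a + b) // 2 for a, b in zip(c0, c1)),
                   (0, 0, 0)]
    # 2 bits per texel select one of the four palette entries.
    return [palette[(indices >> (2 * i)) & 0x3] for i in range(16)]
```

Note that the decoder needs no state outside the 8-byte block, which is exactly what makes random texel access cheap.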

Now, this is a tradeoff. S3TC is not the best format in terms of image compression, either quality wise or size wise. PNG can give smaller files even though it’s lossless, let alone JPEG. And JPEG is much better at being lossy than even a good S3TC compressor. They are good image compression formats.

But PNG and JPEG would make terrible texture formats. They have no pixel locality. Fetching the information for a specific texel is… difficult at best without decompressing a very large part of the image. And so forth.

Wavelet compression would have all of these problems.

Now, everything I described is true for a texture. That is, an image you’re going to be accessing for rendering purposes. One thing Civilization V does to ostensibly speed up load times for Direct3D 11-class cards is to employ some GPGPU-based decompression.

That is, they upload compressed binary data to the card. Then they run a D3D Compute “Shader” over it, decompressing the binary data into an actual texture object. The compressed data can then be discarded, as the texture data is now in the proper format for use by the renderer.

This is purely a loading time optimization; that’s all it is. It does nothing for allowing you to have more textures in memory at once (the way S3TC does), since you’re using uncompressed textures.

You could do something similar using OpenCL. Or possibly an OpenGL shader using image_load_store. Now, I know very little about wavelet compression, but I don’t think the algorithm parallelizes very well. If this is true, you might want a different compression algorithm, one that is more superscalar than wavelets allow.

It turns out I know more than a little about what you call “wavelet compression”. Such compression is built from multiple stages, of which the wavelet transform and entropy encoding are the two main ones.

The wavelet transform step is highly parallelizable and will simply be incredibly fast on a modern GPU. Entropy encoding is another matter. While encoding is easily parallelizable, decoding is quite tricky; it is possible to adapt the encoding structure to make that operation parallelizable too, but at the cost of a worse compression ratio. However, this method can be lossless.

Texture cache usage typically can't be as efficient as with S3TC.


Yes, that's why I apply only the first level of the wavelet transform, so that only one 8x8 block has to be compressed/decompressed at a time :slight_smile:

The compression is lower than with the true iterative wavelet compression (with the quantization/reordering/RLE/Zlib stages that come before and after, of course), but it seems to me much more efficient than S3TC compression…

The final compressed size is only about 10% of the original size (but I work only with the Y channel for now, so I think the compressed size would be about 15% with the Cb and Cr channels added in 4:2:0 format).

But OK, it does seem very hard (if not impossible) to do the RLE/Zlib compression/decompression in shaders :slight_smile:

For the independent 8x8 block compression/decompression algorithm, I posted sources at…-video-wavelet/ a few days ago.

The .tif sizes don't reflect the compression reduction, because I don't know (for now) how to construct .tif files other than with an uncompressed packed RGBA format… :frowning:

For the iterative part, this could perhaps be done using multiple textures (i.e. one texture per level)?

But the fact that this works on 8x8 blocks (and not 4x4) is still a problem :frowning:

That can easily be worked around if we consider that a 4x4 RGBA block is “equivalent” to four 8-bit 4x4 blocks, i.e. one 8x8 block :slight_smile:

And my “non-iterative wavelet transform” can't be lossless, because I always work with 8-bit values (i.e. I don't expand the data to float values for the “one level” wavelet transform). But the loss seems very small to me :slight_smile:

Yes, Groovounet

The “only one level” wavelet transform/inverse-transform stages do seem very fast to me compared to the RLE/Zlib stages.
(The fact that the 64 “distant” pixels are cached into a local 8x8 array in memory for the transform/inverse transform certainly has a big impact on this…)

But perhaps RLE and Zlib compression/decompression could get some hardware acceleration in the near future?
(I'm thinking of something like a “compression by tables” scheme, where the tables can be precomputed for the general case but modified/optimized for specific cases.)

The “one texture per level” approach seems (in theory; not implemented) to work for the iterative part :slight_smile:

But the quality loss is really too big :frowning:
(Logical: the error is multiplied at each level… and I don't want to work with floats => I'm now happy not to have spent my time implementing it :slight_smile: )

Hmm… it's now time to see whether the “per level” part can be “mixed/interlaced” with the “per image” part in a GOP of 8 consecutive pictures from a video source… (yes, I have a lot of odd ideas in my head :slight_smile: )

=> To keep things simple, this looks something like a “3D wavelet transform limited to 8 slices of 8x8 texels” (or perhaps only 4 slices of 4x4 texels, to be more “DXT compliant” and handle the R, G, B and A planes at the same time).

I now need some information about what current hardware can do…

Does current hardware have one DXT decompression engine per texture unit, or is the DXT decompression engine shared between all texture units (i.e. a context push/pop is necessary for each texture unit access)?

Can we get bigger chunks of data than we get with texture2D(sampler, vec2) = one vec4, but in a similar way (e.g. directly the 8 bytes used for the DXT decompression of a 4x4 block of texels, delivered as 2x vec4)?

Quantization/reordering/RLE/Zlib compression/decompression doesn't seem to exist in current hardware, so it has to be done before and after the wavelet transform/inverse transform :frowning:
=> Could current or future GPUs (or CPUs) be enhanced to handle this? (I think a hardware implementation could improve the compression/decompression speed by a factor of 10x or so.)

How is JPEG decompression currently done: in hardware or in software?

If recent GPU hardware can handle JPEG [2000] decompression directly, video texturing seems “trivial” to me to add (MJPEG?) :slight_smile:

Note that the decompression only has to be done at the glBindTexture() invocation, and the extra memory can be freed at the next glBindTexture(NULL)… (perhaps a new glUnbindTexture() function could be added?)

I have read at…ore-family.html that

“PowerVR SGX family video/image decode and encode processing support includes H.264, MPEG-4/2, VC-1 (WMV9), JPEG and others”

This is the GPU (PowerVR SGX 535) that the iPhone uses, so I think a new GeForce or Radeon card (or perhaps a “simpler” Intel GMA 500) can do the same thing, no?

And if a video contains 100,000 pictures, only a very small part of them (IBBP…IBBP…IBBP) have to be stored in decompressed form at any given time, so memory consumption doesn't seem to me to be a really big problem… (I don't think the iPhone has gigabytes of RAM: only 128 MB for the first model and 256 MB for the 3GS :slight_smile: )

Alphonse and Groovounet,

I'm thinking of something that loads the whole compressed picture at the glTexImage2D() call and only decompresses the whole picture at the glBind() invocation.

And with something like glTexImage3D() to handle a whole animation, of course :slight_smile:

With this there is no problem with shaders, because the data is already decompressed and random access to the texture can occur without any problem (the texture is accessible in decompressed form after the glBind invocation).

OK, I could also encapsulate glTexImage2D/glBind into an external function and generate individual pictures using libtiff/libjpeg/libavcodec in another task (that's exactly what I currently do to handle video texturing).

But while this works very well on an iMac, a Mac mini or a P4 3 GHz with multiple videos (which can be mixed) at more than 50 fps (NTSC is 30/60 fps while PAL/SECAM is “only” 25/50 fps) :slight_smile:
it is too CPU-intensive for a little Eee PC, an iPhone or a Samsung N210, for example :frowning:

I don't know exactly how (M)JPEG decompression is done on modern GPUs (today, at least, but this is only a question of time…), but I think the whole decompression is now handled entirely by the GPU, so perhaps an internal GL_JPEG_EXT could be handled?

After some research, it seems that (M)JPEG decompression is done at the driver level (i.e. a black box), and everything I have found about JPEG decompression on the GPU is done (partially) with things like CUDA, so I now have doubts that the whole decompression is done directly and entirely at the GPU level :frowning:

On the other hand, for a long time there have existed many old video processors (such as the Sigma Designs Hollywood Plus MPEG decoder, for example) that do the whole compression/decompression in hardware (or a very large part of it: in the past, with a 386 processor, it was night and day with versus without it) => so why not integrate the part of that old hardware that does the entire decompression into modern GPUs? (I don't think integrating a few thousand or million transistors into a GPU that already has billions of them, such as the GF100, is really a problem…)

First, could you compile all of these thoughts into a single post? I’m having trouble following the conversation. Which seems odd, since the conversation seems to be the same person responding to their own posts.

Second, how did this become about JPEG compression? JPEG 2000 uses wavelets, but the original JPEG doesn’t.

Third, you seem to have ignored the basic thrust of the point I was making. Namely, that these do not make good texture formats. Which means you need to decompress the whole image into a texture before you can actually texture from it. That means that the only thing you gain from it is having to transfer fewer bits from disk to the GPU. That’s a decent load-time gain, but it means nothing for how many textures you get to use.

Of course, Alphonse

But in the past, I was criticized for re-editing old posts multiple times :slight_smile:
(So until now, I have tried to just re-edit the last post to correct/improve it.)

=> But OK: from now on, I will draft all my replies in a local text document and send only one post per day :slight_smile:

A week ago I began an API that compresses picture data using an algorithm similar to JPEG 2000, but limited (for now) to only one level of the Haar transform (i.e. it doesn't handle the iterative part of the Haar algorithm that halves the picture at each level; it only computes the four mipmap levels for each 8x8 patch of the picture).

This form of the transform limits the compression level to what can be achieved by the quantization/reordering that comes before it and the RLE/Zlib stages that come after it (in reverse order for the decompression/inverse transform), but it seems to me consistent with letting OpenGL handle this the same way it handles DXT textures (i.e. only a very limited portion of the picture has to be decompressed to access an individual texel: 8 bytes for a 4x4 DXT patch vs. 64 bytes for this limited version of the JPEG 2000 algorithm on an 8x8 patch).

I posted the sources of this “one-level/non-recursive version of the Haar transform/inverse transform” (+ the compression/decompression part that comes with it, of course) a few days ago at…ore-family.html
(This doesn't use OpenGL for now; it's only to check that the algorithm works.)

In some cases I can obtain a compression ratio of about 10x (and in the general case this outperforms the current DXT compression ratio of 2x).
(Note that the file sizes of the final .TIF files don't reflect the compression, because I don't know how to handle .TIF files other than with an uncompressed RGBA format :frowning: … but the compression ratio is already effective in memory :slight_smile: )

This “new” compressed format could certainly be used in the future as a new internal OpenGL texture format that can be stored in compressed form (i.e. quantized/reordered/RLEed and Zlibbed) in texture memory while it is not bound (to economize on-card video memory) and only decompressed when needed (i.e. bound).

The best way I can see to handle this “transparently” in the OpenGL API would be to add a new GL_JPEG_EXT item to the glTexImage2D internalFormat argument, plus something in the level argument indicating that this texture contains the first four mipmap levels (and not only the first).

I have read at…ore-family.html that

"PowerVR SGX family video/image decode and encode processing support includes H.264, MPEG-4/2, VC-1 (WMV9), JPEG and others"

So I think this new item could be “easily” added for the PowerVR SGX 535 chipset, and certainly for new GeForce or Radeon chipsets (perhaps for the Intel GMA 500 or newer too?).

In the near future, I plan to extend this “limited version of the 2D Haar transform” to a 3D transform that can handle video streams (where consecutive 2D planes of the transform are consecutive 2D frames of the stream + some additions to handle the interleaving).
=> Do GPUs already have the necessary circuitry to handle this sort of video [de]compression?

If necessary, I think it is possible to add the equivalent of the Sigma Designs Hollywood Plus MPEG decoder into newer GPUs (adding a few thousand or million transistors to a GPU that already has billions of them, such as the GF100, doesn't seem impossible to me; it's only a question of time and means).

And on the other side, I can see how to construct the gl[Copy]Tex[SubImage]Image2DYLP() and glBindYLP() functions “that go well” to emulate this new item :slight_smile:

PS: please read the previous posts for more details :slight_smile:

But in the past, I was criticized for re-editing old posts multiple times

My point is that you should generally express a full and complete thought the first time, rather than coming back every 30 minutes to an hour with updates. Organize your thoughts before posting.

This “new” compressed format could certainly be used in the future as a new internal OpenGL texture format that can be stored in compressed form (i.e. quantized/reordered/RLEed and Zlibbed)

No, it can’t. For the reasons I outlined before (which you have continued to ignore), it makes for a terrible texture format. The very second you start talking about zlib, quantization, RLE, or any such thing, you kill all texturing performance.

Texture formats are optimized for reading. Because that’s what the user does with textures; they read them. That is the most common operation for textures. And that is why formats like S3TC are used and formats like JPEG 2000 are not: because they are not optimized for reading.

in texture memory while it is not bound (to economize on-card video memory) and only decompressed when needed (i.e. bound)

So every time I bind this texture, whether I’m going to render with it or just change some texture parameters, it’s going to cause the texture to be decompressed? I don’t know about you, but to me, that sounds like a performance problem. Even if you restrict this to just using the texture, I’d still rather just have the driver do the standard memory management.

It doesn’t save any GPU room compared to evicting unused textures. If used textures are going to be fully decompressed in GPU memory, then what’s the point of having them be compressed in the first place?

Let’s say you have a 1024x1024 texture. With RGBA8 (the alpha component is irrelevant, but needs to be present for alignment reasons), that comes to 4MB. With S3TC, that reduces down to 0.5MB. If you use something like JPEG 2000, you may be able to reduce this to 0.04MB.

Now, when you have this texture on the GPU, it takes up 0.04MB in the JPEG 2000 case, and 0.5MB in the S3TC case. However, when you have to use this texture, it takes up 4.04MB in the JPEG 2000 case, and 0.5MB in the S3TC case. This is because texture units cannot directly access and decompress JPEG 2000, so any such textures must be decompressed into GPU memory before the texture units can access them.
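The footprint comparison above works out as follows (the ~1% JPEG 2000 ratio is the illustrative figure used above, not a measurement):

```python
MiB = 1024 * 1024

rgba8 = 1024 * 1024 * 4      # uncompressed RGBA8: 4 MiB
s3tc  = rgba8 // 8           # DXT1: 4 bits/texel, 8:1 vs RGBA8 -> 0.5 MiB
jp2   = rgba8 // 100         # ~1% of original (illustrative ratio only)

# Resident footprint while the texture is actually in use:
resident_s3tc = s3tc         # texture units sample the compressed blocks directly
resident_jp2  = rgba8 + jp2  # full decompressed copy plus the compressed source

assert resident_jp2 > 8 * resident_s3tc
```

The final assertion is the whole argument in one line: while in use, the wavelet texture occupies more than eight times the memory of the S3TC one.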

0.5MB is smaller than 4.04MB. S3TC wins. I can pack more S3TC textures into video memory than JPEG 2000.

Also, let’s consider decompression performance.

You keep saying that hardware has JPEG decompression built into it. And it does (though that document says nothing about JPEG 2000, which is a very different format from regular JPEG). But how many 2048x2048 images do you think they can decompress in 1/60th of a second?

The purpose of built-in JPEG decompression in mobile GPUs is to alleviate the CPU burden when viewing images over the web. In those circumstances, you don’t need instant results. You can wait 0.2 seconds for all 20 of the website’s images to be decompressed.

You cannot wait 0.2 seconds for 20 images to be decompressed in a real-time application. That murders performance. And if the textures are always uncompressed, then you’re not saving any video memory. Indeed, you’re losing video memory by having both the uncompressed and compressed forms around.
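Putting numbers on that (the 20 images and the 2048x2048 size are the figures from the argument above; the rest is arithmetic):

```python
texels_per_image = 2048 * 2048              # ~4.2 million texels per image
images = 20
fps = 60

# Decompressing all 20 images within a single 60 fps frame would require
# this sustained decode rate, before any actual rendering work is done:
required_texels_per_second = images * texels_per_image * fps
assert required_texels_per_second > 5_000_000_000   # ~5 gigatexels/s just to decode
```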

You are not the first person to think that their wavelet/JPEG/etc based format can beat S3TC, nor will you be the last. But thus far, all of them run afoul of the simple and obvious fact that S3TC is a texture format, designed for the specific purpose of being quickly addressed and decompressed by texture units. And the alternatives are not.


The output of various DXT compression tools that I have found is between 60 and 200 MP/s (in software, with MMX/SSE giving a good boost).

My “partial but local wavelet, with the compression that comes with it” seems to run at about 300 MP/s, with nothing optimized (software, no MMX/SSE) :slight_smile:

DXT compression tools on the GPU can output between 279 and 1690 MP/s.

But the comparison can only be made on the CPU… because my algorithm doesn't exist (for now) on a GPU…

The Little Body: you did not read what Alfonse detailed.
Do you realize that video cards do dozens of gigatexels per second?
You cannot take a CPU algorithm and say it will perform equally well on a GPU.
DXT, with its block access, is well suited to GPUs.
Wavelets, not so much. That does not mean it is impossible; it means it is difficult. And you have not demonstrated that you understand GPU architecture well enough to make your point.

Of course, GPUs have very impressive texture output, multiple gigatexels per second for the best of them :slight_smile:

But what about the input???

The GPU becomes a bottleneck if it can display textures very quickly (i.e. the output) but is limited on the other side when storing and/or constructing them from disk/CPU memory (i.e. the input)…

I have read and reread Alfonse's answers many times.

But I still stubbornly believe (normal, I'm Breton :slight_smile: ) that adding a little 64-byte buffer + an unfortunate few hundred or thousand transistors is not that hard on a modern GPU that already has millions or billions of them…

So no problem: my gl[Copy]Tex[SubImage]Image2DYLP() and glBindYLP() “proof of concept” functions are on the way :slight_smile:

But what about the input???

Who cares about the input? Compressing textures happens offline. That is, before you ship your product. It’s part of what you do to prepare a build for distribution. Compressing textures does not happen in the middle of high performance applications. The textures are given in compressed form.

But I still stubbornly believe (normal, I'm Breton smile ) that adding a little 64-byte buffer + an unfortunate few hundred or thousand transistors is not that hard on a modern GPU that already has millions or billions of them…

And then there’s this nonsense. The idea that, “Well, GPUs have billions of transistors, and obviously my algorithm would only take a few thousand transistors at the most, so they could just throw it in.”

How much do you honestly know about transistors and GPU design? Based on what can you say that your algorithm, when replicated across every texture unit, would only cost “hundreds / thousands of transistors?”

IHVs do not just “add stuff” to their GPUs. Even though they have billions of transistors, you can't just throw things into them. Everything affects everything else, and in well-designed GPUs, everything has a transistor/space budget. If you make your texture units bigger, you have to change the layout and arrangement of your shader processors. In order to maintain a reasonable GPU size, you may need to have fewer texture units per shader processor, or remove shaders altogether.

Can you honestly say you know enough about the details of GPU design to even begin to speculate on whether what you’re suggesting would be at all practical?