CPU-based DXT compression

Hi all,

I would like to start a (short) discussion about CPU-based DXT compression. DXT is probably the most common texture compression method on desktops. It is widely supported and provides a pretty good compression factor (8x for DXT1). Textures can be pre-compressed and stored in DXT format, or compressed on the fly at run time. The first approach requires more storage space, since DXT’s compression factor is far behind JPEG, ECW, MrSID or similar formats. The second approach is computationally expensive. The compression (or, to be more precise, transcoding, since we have to decode from some image storage format before encoding to a texture compression format) can be done on the CPU or on the GPU. GPU-based DXT compression (apart from the image storage format decompression) is far superior, since the process can be parallelized very efficiently.
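
For context, the two upload paths I am talking about look roughly like this. This is only a sketch: it assumes a current GL context with EXT_texture_compression_s3tc and an already-included GL loader, and rgbaPixels / dxt1Blocks are placeholder buffers, not code from my benchmark.

[CODE]
// (a) In-driver compression: hand the driver raw RGBA and let it transcode to DXT1.
void UploadDxt1ViaDriver(int width, int height, const void* rgbaPixels)
{
    glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                 width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgbaPixels);
}

// (b) Pre-compressed upload: the CPU (or an offline tool) produced the DXT1 blocks,
//     so the driver just copies them; no transcoding happens at upload time.
void UploadDxt1Precompressed(int width, int height, const void* dxt1Blocks)
{
    GLsizei size = ((width + 3) / 4) * ((height + 3) / 4) * 8;  // 8 bytes per 4x4 block
    glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                           width, height, 0, size, dxt1Blocks);
}
[/CODE]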

However, using a CUDA compressor (for example) is not a wise decision in combination with OpenGL: context switching and synchronization degrade performance significantly. So, we have finally come to the topic of this thread…

There are several popular CPU-based DXT compressors; many of you have probably heard of the Squish, Crunch and stb_dxt libraries. They expose tweakable parameters for trading quality against performance. I wanted to compare their performance to the in-driver DXT compression of the OpenGL implementation. My findings are very interesting: I had no idea performance could vary by such a magnitude. (I have also found bugs in the implementations, which is surprising considering that they are open source and widely used. But let us discuss that later, if anyone is interested.)
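
For reference, this is roughly how the stb_dxt and Squish paths are typically driven. It is only a sketch, assuming an already-decoded RGBA8 image whose dimensions are multiples of 4; it is not my benchmark code, and Crunch’s API is more involved, so I omit it here.

[CODE]
#include <vector>
#include <cstring>
#define STB_DXT_IMPLEMENTATION
#include "stb_dxt.h"
#include <squish.h>

std::vector<unsigned char> CompressDxt1Stb(const unsigned char* rgba, int w, int h, int mode)
{
    std::vector<unsigned char> out((w / 4) * (h / 4) * 8);   // 8 bytes per DXT1 block
    unsigned char block[64];                                 // one 4x4 RGBA block
    unsigned char* dst = out.data();
    for (int by = 0; by < h; by += 4)
        for (int bx = 0; bx < w; bx += 4, dst += 8) {
            for (int row = 0; row < 4; ++row)                // gather the 4x4 block
                std::memcpy(block + row * 16, rgba + ((by + row) * w + bx) * 4, 16);
            stb_compress_dxt_block(dst, block, 0, mode);     // mode: STB_DXT_NORMAL or STB_DXT_HIGHQUAL
        }
    return out;
}

std::vector<unsigned char> CompressDxt1Squish(const unsigned char* rgba, int w, int h, int flags)
{
    std::vector<unsigned char> out(squish::GetStorageRequirements(w, h, flags));
    squish::CompressImage(rgba, w, h, out.data(), flags);    // e.g. squish::kDxt1 | squish::kColourRangeFit
    return out;
}
[/CODE]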

My first idea was to make a chart that graphically depicts the differences in performance, but the execution times differ by several orders of magnitude. That could be our first subtopic for the discussion: regardless of the impact on image quality (the differences are imperceptible in most cases), a 400x (40,000%) slowdown cannot be justified.

Here are the results (CPU time) of DXT1 compressing a 4096x4096 texel ortho-photo image:
1104.5 [ms] – NV OpenGL driver
714.3 [ms] – STB (STB_DXT_NORMAL)
819.4 [ms] – STB (STB_DXT_HIGHQUAL)
1244.2 [ms] – Squish (squish::kDxt1 | squish::kColourRangeFit | squish::kColourMetricUniform)
1242.9 [ms] – Squish (squish::kDxt1 | squish::kColourRangeFit | squish::kColourMetricPerceptual)
292566.7 [ms] – Squish (squish::kDxt1 | squish::kColourClusterFit | squish::kColourMetricUniform)
291219.6 [ms] – Squish (squish::kDxt1 | squish::kColourClusterFit | squish::kColourMetricPerceptual)
297179.2 [ms] – Squish (squish::kDxt1 | squish::kColourIterativeClusterFit | squish::kColourMetricUniform)
295666.9 [ms] – Squish (squish::kDxt1 | squish::kColourIterativeClusterFit | squish::kColourMetricPerceptual)
10812.7 [ms] – Crunch (m_dxt_quality = cCRNDXTQualitySuperFast)
10814.9 [ms] – Crunch (m_dxt_quality = cCRNDXTQualitySuperFast, cCRNCompFlagPerceptual = true)
17317.8 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityFast)
17132.2 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityFast, cCRNCompFlagPerceptual = true)
35206.8 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityNormal)
35798.2 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityNormal, cCRNCompFlagPerceptual = true)
122889.5 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityBetter)
83308.3 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityBetter, cCRNCompFlagPerceptual = true)
276621.2 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityUber)
192210.9 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityUber, cCRNCompFlagPerceptual = true)

The compression time depends on the content of the image. Here are the results of the DXT1 compression of an 8192x4096 texel world map:
1250.2 [ms] – NV OpenGL driver
650.2 [ms] – STB (STB_DXT_NORMAL)
717.3 [ms] – STB (STB_DXT_HIGHQUAL)
1198.4 [ms] – Squish (squish::kDxt1 | squish::kColourRangeFit | squish::kColourMetricUniform)
1198.1 [ms] – Squish (squish::kDxt1 | squish::kColourRangeFit | squish::kColourMetricPerceptual)
129952.8 [ms] – Squish (squish::kDxt1 | squish::kColourClusterFit | squish::kColourMetricUniform)
130010.8 [ms] – Squish (squish::kDxt1 | squish::kColourClusterFit | squish::kColourMetricPerceptual)
129752.7 [ms] – Squish (squish::kDxt1 | squish::kColourIterativeClusterFit | squish::kColourMetricUniform)
129910.4 [ms] – Squish (squish::kDxt1 | squish::kColourIterativeClusterFit | squish::kColourMetricPerceptual)
7219.9 [ms] – Crunch (m_dxt_quality = cCRNDXTQualitySuperFast)
7248.6 [ms] – Crunch (m_dxt_quality = cCRNDXTQualitySuperFast, cCRNCompFlagPerceptual = true)
11090.0 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityFast)
11102.2 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityFast, cCRNCompFlagPerceptual = true)
21382.1 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityNormal)
21804.8 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityNormal, cCRNCompFlagPerceptual = true)
63659.5 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityBetter)
47218.5 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityBetter, cCRNCompFlagPerceptual = true)
142282.2 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityUber)
109012.4 [ms] – Crunch (m_dxt_quality = cCRNDXTQualityUber, cCRNCompFlagPerceptual = true)

Although the world map is twice as large, the compression time is significantly lower. The reason is the huge areas of uniform color (oceans and seas).
What are your experiences with DXT compression libraries? Does anyone have a different experience with them? What do you use in your engines? Where did I make a mistake? Those numbers look absurdly large. Spending 5 minutes on something that can be done in 0.5 seconds cannot be justified by any quality improvement.

[QUOTE]Where did I make a mistake? Those numbers look absurdly large. Spending 5 minutes on something that can be done in 0.5 seconds cannot be justified by any quality improvement.[/QUOTE]

I’d say that your mistake is right there: the part where you decided up front that image quality was irrelevant.

I make no claim to having done an image quality analysis on any of these libraries. But starting from the assumption that image quality doesn’t matter seems to put a pretty big hole in your analysis. After all, DXT compressors only have two useful metrics: compression performance and image quality.

If you decide that image quality is irrelevant, then naturally algorithms that prioritize image quality over performance will come out badly. So… you haven’t really proven anything we didn’t already know.

As for your specific claim about compression time, consider a game development environment. There, it is frequently the case that you perform long batch processes overnight. Given that, what does it matter if it takes 6 hours to go through 8000 textures? Nobody’s waiting around for them. You can use lower-quality settings/algorithms for quick tests and for most of development. But when you get towards release, you run your batch process overnight, and you get the benefit of whatever image quality improvement that might provide.

Not everyone cares about performance, and those who do don’t necessarily care equally at all times.

Also, it should be noted that most compressors, for many different kinds of codecs, have “uber”-style compression settings; x264 calls its setting “placebo”. These are usually considered overkill by people actually doing compression work, with the quality gain almost never being worth the performance cost.

And this is part of the reason why, to me, the more interesting question regarding compressors is always image quality vs. performance. How much image quality does switching from stb_dxt to Squish buy you, if any? That is, rather than assuming that the image quality gain from Crunch’s “uber” modes isn’t worthwhile, actually investigate it. Then you’d have some evidence about what is and is not “justified”. Maybe there’s a sweet spot where spending 2x the time gives a 2x improvement in peak signal-to-noise ratio or some better metric.

I’ve never really used compressed textures, because so far I’ve had no need for them in my projects.

I could imagine that you can still compress raw DXT data quite well with some general-purpose compression algorithms. Then you just pick one that is optimized for decompression CPU/memory usage.

Also, if you use a lossy format like JPEG as the source you compress to DXT from, quality is no longer part of the argument anyway. :wink:

Just tested it. Created a dds file (DXT3) = ~16.2 MiB
Compressed with lzma (7z default settings) = 4.49 MiB

So there is a lot of room to just compress the dds files.
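
Roughly, the idea is just this. The sketch below uses zlib purely for illustration, since that is a compressor I can show in a few lines; the numbers above came from 7z/LZMA, which usually compresses somewhat better.

[CODE]
#include <vector>
#include <zlib.h>

// Run finished DXT blocks through a general-purpose compressor before storing them.
std::vector<unsigned char> DeflateDxt(const std::vector<unsigned char>& dxtBlocks)
{
    uLongf packedLen = compressBound(static_cast<uLong>(dxtBlocks.size()));
    std::vector<unsigned char> packed(packedLen);
    if (compress2(packed.data(), &packedLen, dxtBlocks.data(),
                  static_cast<uLong>(dxtBlocks.size()), Z_BEST_COMPRESSION) != Z_OK)
        packed.clear();                 // compression failed
    else
        packed.resize(packedLen);       // shrink to the actual compressed size
    return packed;                      // store this; inflate before glCompressedTexImage2D
}
[/CODE]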

Artists exporting textures to JPEG or other lossy formats should choose a quality level that is “good enough”. At high bitrates JPEG is often perceptually indistinguishable from the original (yet still much smaller). So using lossy compression does not mean you don’t care about quality.

Valid point.

I tested some more, and to me it looks like DXT3/5 + 7z (default compression settings) has slightly better image quality than a JPEG at the same file size (using GIMP + the DDS plugin).
And that difference gets bigger after compressing the JPEG data to DXT.

Tested with a single bmp image: http://www.dicander.com/pix/files/alps.zip

Of course there are better compression formats than JPEG. But having comparable file sizes when compressing DXT data, with no need for live DXT compression, could be a big advantage.

Thank you, Alfonse, for being the common sense of this forum! I really appreciate your opinion and completely agree with everything said, but I tried to stress something different.

I didn’t say that quality is irrelevant, but that an algorithm whose execution time is 1000x longer cannot justify it with improved quality; the quality cannot be 1000x higher. Furthermore, the improvement is sometimes just marginal, if it exists at all. Or, as you nicely put it yourself: the more interesting question regarding compressors is always image quality vs. performance.

I completely agree, and here are some quantitative results:
[ATTACH=CONFIG]1913[/ATTACH]
I’m sorry for posting an image, but the difference is more obvious if data are aligned.

In this comparison I used a metric based on Euclidean distance in RGB space; the RMS error is shown in the third column.
The last column is a derived metric obtained by multiplying CPU_execution_time by RMS and taking the inverse value, so a greater value means a better result.
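
In other words, the computation is roughly the following. This is only a sketch; the exact normalization of the RMS may differ slightly from what the table uses.

[CODE]
#include <cmath>
#include <cstddef>

// RMS of the per-texel Euclidean distance in RGB, comparing the decompressed
// result (a) against the original RGBA8 image (b).
double RmsRgbError(const unsigned char* a, const unsigned char* b, size_t texels)
{
    double sum = 0.0;
    for (size_t i = 0; i < texels; ++i) {
        double dr = double(a[i * 4 + 0]) - double(b[i * 4 + 0]);
        double dg = double(a[i * 4 + 1]) - double(b[i * 4 + 1]);
        double db = double(a[i * 4 + 2]) - double(b[i * 4 + 2]);
        sum += dr * dr + dg * dg + db * db;     // squared Euclidean distance in RGB
    }
    return std::sqrt(sum / double(texels));
}

// Derived metric from the last column: the inverse of time * RMS, higher is better.
double DerivedScore(double cpuTimeMs, double rms)
{
    return 1.0 / (cpuTimeMs * rms);
}
[/CODE]

With the numbers quoted below, that gives DerivedScore(47, 11.5) ≈ 1.8e-3 for van Waveren versus DerivedScore(291669, 10.0) ≈ 3.4e-7 for the Squish cluster fit.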

I hope it is now obvious what I wanted to say. Furthermore, I have included van Waveren’s algorithm. It has the highest RMS error (excluding the NV GL in-driver implementation, which is even worse), but it is the absolute winner by the combined metric.

Just compare the 47 ms CPU execution time and 11.5 RMS error of van Waveren’s algorithm with the 291669 ms CPU execution time and 10.0 RMS error of Squish with cluster fit and the uniform colour metric (CF MU in the table). The former is more than 6000 times faster. There is a difference in quality, but it is imperceptible. That was the point of my first post in this thread.

Are you using textures at all? If you are, it is quite strange that compression is not included. The benefits are various.

[QUOTE=Osbios;1271761]
I could imagine that you can still compress raw DXT data quite well with some general-purpose compression algorithms. Then you just pick one that is optimized for decompression CPU/memory usage.

Also, if you use a lossy format like JPEG as the source you compress to DXT from, quality is no longer part of the argument anyway. :wink:

Just tested it. Created a dds file (DXT3) = ~16.2 MiB
Compressed with lzma (7z default settings) = 4.49 MiB

So there is a lot of room to just compress the dds files.[/QUOTE]

I have to completely disagree with you.

  1. Lossless image compression cannot compete with lossy compression in any case (except for uniform colors).
  2. The zipped DXT1 files used in my tests are just 13% smaller than the originals.
  3. The image alps.bmp, used in your test, has a height that is not divisible by 4 (the block size), so the DXT file contains unused padding data (see the sketch below). alps.DXT1 is 406080 B, while zip(alps.DXT1) is 347484 B, just about 14% smaller.
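
For the curious, the padding comes from how DXT1 storage is computed per 4x4 block; a quick sketch:

[CODE]
#include <cstddef>

// DXT1 rounds both dimensions up to whole 4x4 blocks, so a width or height that is
// not a multiple of 4 carries padding texels that a zip step can partly reclaim.
size_t Dxt1SizeBytes(int width, int height)
{
    size_t blocksX = (width  + 3) / 4;
    size_t blocksY = (height + 3) / 4;
    return blocksX * blocksY * 8;   // 8 bytes per DXT1 block
}
[/CODE]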

A marginal improvement is still an improvement.

Different contexts have different requirements. If the compression is performed “on the fly” while the end user is staring at a progress bar or loading screen, then compression time matters. If it’s performed once while generating the final master of a product which took a couple of years to develop, “overnight” is probably fast enough.

In that regard, I would expect the in-driver implementation to have the worst quality as it’s the most time-sensitive. It may even get called repeatedly for the same texture as the application swaps textures in and out of the resident set. Conversely, implementations which are provided as stand-alone programs aren’t targeting on-the-fly compression.

Agree! In that context, Crunch (with m_dxt_quality = cCRNDXTQualityUber and cCRNCompFlagPerceptual = false) is the best solution.

GL DXT compression has the worst quality, as you can see from the table.

After all, I hope my little analysis can help someone choose a proper DXT compressor according to their needs. :slight_smile: