Saturday, May 21, 2011

Compression

The other day an email arrived from SourceForge which mentioned that they were hosting the file compression program 7-zip. Now I had used 7-zip under Windows, as an all-purpose archiving tool, mostly for reading zip files. I'd never thought of it in a Linux context. But both openSUSE and Ubuntu have the command-line version available, and what more do you need?

In openSUSE, the RPM is called, simply enough, 7z. In Ubuntu it's a bit more complicated. There's the p7zip package, which provides 7zr, the bare-bones standalone version of the compression program, along with a wrapper script, also called p7zip, which makes 7zr work like gzip. For the full-blown 7z program, you want the package p7zip-full, which includes 7z. While you're at it you might want to get the p7zip-rar package, which lets you decompress RAR files.

The reason you want 7z and not just 7zr is that the smaller program only compresses to the 7z format, but, as it says in the blurb,

not only does [7z] handle 7z but also ZIP, Zip64, CAB, RAR, ARJ, GZIP, BZIP2, TAR, CPIO, RPM, ISO and DEB archives.

And by handle, they mean read all of these formats and write some of them: 7z, ZIP, GZIP, BZIP2 and TAR can be created, while the rest (RAR included) are read-only. So I could create a zip archive with 7z, or a gzip/bzip2 compressed tarball. And guess what?

7z compression is 30-50% better than ZIP compression.

Is it? Well, let's find out.

Test 1

Presented for your consideration: an uncompressed tarball:

$ ls -l Ru.tar
-rw-r--r-- 1 dave dave 26716160 2011-05-21 15:41 Ru.tar

This is I/O from the Elk FP-LAPW code, so it has a lot of repeated text, along with what are, for our purposes, a bunch of random numbers. First we'll try compressing it with programs using their native formats. I'll try for maximum compression in all cases. Note that zip and 7z will create separate archives, while gzip and bzip2 compress the file in place.

Program  Command                     Size (bytes)  Ratio
zip      zip -9 Ru Ru.tar                 3899523  0.146
gzip     gzip -9 Ru.tar                   3899386  0.146
bzip2    bzip2 -9 Ru.tar                  2992422  0.112
7z       7zr a -mx=9 Ru.7z Ru.tar         2242708  0.084
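
The Ratio column is just the compressed size divided by the original 26716160 bytes. Here is a minimal sketch for checking numbers like these with awk; the function names ratio and saving are made up for this example.

```shell
# ratio COMPRESSED ORIGINAL: compressed size as a fraction of the original
ratio() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.3f\n", c / o }'
}

# saving NEW OLD: percent by which NEW is smaller than OLD
saving() {
    awk -v n="$1" -v o="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - n / o) }'
}

ratio 2242708 26716160    # 7z against the uncompressed tarball: 0.084
saving 2242708 3899523    # 7z against zip: 42.5
saving 2242708 2992422    # 7z against bzip2: 25.1
```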

Pretty good, huh? As advertised, 7z is about 40% better than zip/gzip, and 25% better than bzip2. But wait, there's more. Not every computer is going to have 7z available, so you may want to compress files using a more established format. 7z can do that, too, which is why we wanted it, not just 7zr:

Format  Command                                  Size (bytes)  Ratio
zip     7z a -mx=9 -tzip Ru.zip Ru.tar                3287420  0.123
gzip    7z a -mx=9 -tgzip Ru.tar.gz Ru.tar            3287335  0.123
bzip2   7z a -mx=9 -tbzip2 Ru.tar.bz2 Ru.tar          2989193  0.112

So 7z compresses to zip/gzip format better than the native programs do themselves. It barely improves on bzip2 here, though. The only disadvantage compared to gzip or bzip2 is that it doesn't compress the files in place, unless you go through a script such as the one in /usr/bin/p7zip.
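
For what it's worth, an in-place wrapper only takes a few lines. The following is a minimal sketch, not the script shipped in /usr/bin/p7zip; the function name is made up, and it assumes 7z is on the path.

```shell
# Hypothetical helper: compress FILE to FILE.7z, then delete FILE,
# but only if 7z exited successfully (roughly what gzip does in place).
sevenzip_inplace() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "sevenzip_inplace: no such file: $file" >&2
        return 1
    fi
    7z a -mx=9 "$file.7z" "$file" >/dev/null && rm -- "$file"
}
```

Calling sevenzip_inplace Ru.tar should leave only Ru.tar.7z behind. Because the rm runs only after 7z succeeds, a failed compression leaves the original file untouched.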

Test 2

Pi to 4 million Decimals has, duh, π to, actually, 4,194,034 places. The file pi.tar.gz has it in ASCII, with a bit of header information. If we uncompress that file, it comes in at 4362370 bytes. Since the digits of π show no repeating pattern, it's hard for a compression program to find repeated blocks of bytes to exploit. The following table lists the compressions achieved by our test programs, in whatever formats they can use. (See the above tables for the appropriate commands.) Let's see how everybody does:

Program  Format  Size (bytes)  Ratio
None     None         4362370  1.000
zip      zip          2041130  0.468
gzip     gzip         2040997  0.468
bzip2    bzip2        1863892  0.427
7z       zip          1983378  0.455
7z       gzip         1983297  0.455
7z       bzip2        1860047  0.426
7z       7z           1884999  0.432

Here bzip2 more than holds its own: it slightly beats the native 7z format. Oddly, though, the smallest file of all comes from using 7z to do the bzip2 compression. Weird, huh? But still, 7z is pretty good.
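
There's a simple reason nobody gets much below 0.43 here. As a rough estimate, assuming the digits look like uniformly random decimal digits (my assumption, and one that ignores the header and line breaks in the file), each ASCII digit takes up 8 bits but carries only log2(10), or about 3.32, bits of information. The function name below is made up; it just does that division.

```shell
# Ideal ratio for random decimal digits stored one per byte:
# log2(10) bits of information in every 8 bits of file.
digit_floor() {
    awk 'BEGIN { printf "%.3f\n", (log(10) / log(2)) / 8 }'
}

digit_floor    # prints 0.415
```

So the 0.426 achieved in bzip2 format is already within about 3% of what any compressor could hope for on this kind of data.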

Wrapping Up

So what's not to like?

For one thing, 7z is slow. If you just want to quickly compress a file, go ahead and use gzip, or bzip2. There is a price to pay for better compression.

Then, too, there's a warning on the 7z man page:

DO NOT USE the 7-zip format for backup purpose on Linux/Unix because 7-zip does not store the owner/group of the file.

You can get around this by piping tar into 7z:

tar cf - directory_to_be_archived | 7z a -si directory.tar.7z

which creates the analog of a gzip/bzip2 tarball.
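
Getting the files back out is the same pipe in reverse: 7z writes the tarball to standard output with -so, and tar unpacks it. Here is a sketch wrapping both directions; the function names are made up, and 7z is assumed to be installed.

```shell
# pack_dir DIR: archive DIR as DIR.tar.7z, with owner/group recorded by tar
pack_dir() {
    tar cf - "$1" | 7z a -si "$1.tar.7z" >/dev/null
}

# unpack_dir ARCHIVE: extract a .tar.7z made by pack_dir into the current directory
unpack_dir() {
    7z x -so "$1" | tar xf -
}
```

For example, pack_dir directory_to_be_archived followed later by unpack_dir directory_to_be_archived.tar.7z should round-trip the directory.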

And finally, the native 7z format isn't standard yet, so it's not going to be available everywhere, and might even vanish. But 7z and its compression algorithm LZMA are open source, so they are likely to stay around for a while. A few years ago bzip2 wasn't standard, and once upon a time neither was gzip. It's probably fine to compress your files to 7z format, but if you want to be sure, use 7z to compress to gzip or bzip2 format.
