Hacker Codex   Linux servers · Python development · MacOS tinkering

Speed Up Compression via Parallel BZIP2 (PBZIP2)

By pure chance one morning, I came across a post that mentioned PBZIP2. Having never heard of it, of course I had to look it up. Crikey. File this one under “Why Didn’t Someone Tell Me About This Earlier?!”

“Wait a minute,” I said aloud to nobody in particular. “BZIP2 doesn’t support symmetric multi-processing? And there’s an alternate implementation that does take advantage of multiple CPUs?”

“Whiskey. Tango. Foxtrot.”

And after a few tests, I’ll be tarred and feathered if it ain’t true: the speed improvement was, as promised, close to linear in the number of cores.

Installation

To install it via Homebrew on MacOS:

brew install pbzip2

To install it on Ubuntu or Debian:

sudo apt install pbzip2

The pbzip2 binary should now be available. Refer to the manpage for the gory details.



Testing

Using a 91 MB tar archive as my test file, I ran the following commands on a quad-core 2.93 GHz i7 running Mac OS X 10.7 (Lion) to see whether there was indeed any improvement in compression speed:

time bzip2 -k testfile.tar
time pbzip2 -k testfile.tar

The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That represents an 81% reduction in compression time and a five-fold increase in speed in this particular test.
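For anyone who wants to run the same arithmetic on their own timings, here is a minimal sketch. The `speedup` function name is my own invention for illustration, and it assumes you feed it the two wall-clock times in seconds as reported by `time`:

```shell
# Hypothetical helper: summarize two wall-clock timings (in seconds).
# $1 = baseline time (e.g. bzip2), $2 = new time (e.g. pbzip2)
speedup() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    # Percentage of time saved, then the ratio of old time to new time
    printf "%.1f%% less time, %.1fx faster\n", (old - new) / old * 100, old / new
  }'
}

speedup 18.7 3.5   # prints: 81.3% less time, 5.3x faster
```

Plugging in the numbers from this test gives the same 81% reduction, with the ratio working out to a little over five.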

While decompression speed increases weren’t nearly as dramatic, pbzip2 decompression still appears to be faster than stock bzip2.

New Aliases

I don’t want to have to remember to specifically use the pbzip2 command, so I decided to add some aliases. First, let’s detect whether pbzip2 is installed and available:

# Check to see if pbzip2 is already on path; if so, set BZIP_BIN appropriately
type -P pbzip2 &>/dev/null && export BZIP_BIN="pbzip2"
# Otherwise, default to standard bzip2 binary
# (the variable is quoted so the test behaves predictably when it is empty)
if [ -z "$BZIP_BIN" ]; then
  export BZIP_BIN="bzip2"
fi

Using the above logic, I set bz as an alias to pbzip2 if available, and if not, to bzip2:

alias bz="$BZIP_BIN"
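The detection step relies on `type -P`, which is specific to Bash. If you share your dotfiles with a shell that lacks it, something along these lines might work instead. This is only a sketch: `first_available` is a name I made up, and note that `command -v` also matches aliases and shell functions, not just binaries on the path.

```shell
# Hypothetical helper: print the first command from the argument list
# that the shell can find, or return non-zero if none of them exist.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Prefer pbzip2, fall back to bzip2, and warn if neither is installed
BZIP_BIN="$(first_available pbzip2 bzip2)" || echo "no bzip2-compatible binary found" >&2
export BZIP_BIN
```

Listing the candidates in order of preference keeps the fallback logic in one place, so adding another compatible tool later is a one-word change.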

I usually compress directories more often than individual files, so I added some commands to quickly compress directories and expand bzipped tarballs:

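tarb() {
  # Create <directory>.tbz, compressing with whichever binary BZIP_BIN names
  tar -cf "$1".tbz --use-compress-program="$BZIP_BIN" "$1"
}
untarbzip() {
  # Decompress to stdout and have tar read the archive from stdin
  "$BZIP_BIN" -dc "$1" | tar -xf - --exclude="._*"
}
alias buntar=untarbzip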

Usage:

bz myfile
tarb mydirectory
buntar mytarball.tbz
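One thing to watch for: tab completion tends to leave a trailing slash on directory names, and "mydirectory/" plus ".tbz" becomes "mydirectory/.tbz", a hidden file inside the very directory being archived. A possible guard, sketched here under the assumption that BZIP_BIN is set as described earlier (`tarb_safe` is a hypothetical name, not something I actually use):

```shell
# Hypothetical variant of tarb: strips a trailing slash and checks the argument.
# Assumes BZIP_BIN was set by the detection snippet; falls back to bzip2 if not.
tarb_safe() {
  dir="${1%/}"                      # "mydirectory/" becomes "mydirectory"
  if [ ! -d "$dir" ]; then
    echo "tarb_safe: not a directory: $1" >&2
    return 1
  fi
  tar -cf "$dir.tbz" --use-compress-program="${BZIP_BIN:-bzip2}" "$dir"
}
```

With this in place, `tarb_safe mydirectory/` and `tarb_safe mydirectory` should both produce mydirectory.tbz alongside the directory.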

Got a better method?

Have you had any experience with parallelized bzip2 compression? Find me on Twitter and let me know.