Hacker Codex   Linux servers · Python development · MacOS tinkering

Speed Up Compression via Parallel BZIP2 (PBZIP2)

By pure chance one morning, I came across a post that mentioned PBZIP2. Having never heard of it, of course I had to look it up. Crikey. File this one under “Why Didn’t Someone Tell Me About This Earlier?!”

“Wait a minute,” I said aloud to nobody in particular. “BZIP2 doesn’t support symmetric multi-processing? And there’s an alternate implementation that does take advantage of multiple CPUs?”

“Whiskey. Tango. Foxtrot.”

And after a few tests, I’ll be tarred and feathered if it ain’t true: the speed improvement was, as promised, roughly linear in the number of cores.

Installation

To install it via Homebrew on MacOS:

brew install pbzip2

To install it on Ubuntu or Debian:

sudo apt install pbzip2

The pbzip2 binary should now be available. Refer to the manpage for the gory details.


Testing

Using a 91 MB tar archive as my test file, I ran the following commands on a quad-core 2.93 GHz i7 running Mac OS X 10.7 (Lion) to see whether there was indeed any improvement in compression speed:

time bzip2 -k testfile.tar
time pbzip2 -k testfile.tar

The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That represents an 81% reduction in compression time and a five-fold increase in speed in this particular test.
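For anyone who wants to check my arithmetic, or plug in their own timings, here is a small sketch. The function name compare_times is just something I made up for illustration, and it assumes awk is available (it is on any stock macOS or Linux box I've used).

```shell
# Hypothetical helper: report the time reduction and speed-up for two timings.
# Usage: compare_times <baseline_seconds> <new_seconds>
compare_times() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    printf "reduction: %.0f%%\n", (old - new) / old * 100
    printf "speed-up: %.1fx\n", old / new
  }'
}

# The timings from the test above
compare_times 18.7 3.5
# reduction: 81%
# speed-up: 5.3x
```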

While the decompression speed increases weren’t nearly as dramatic, pbzip2 decompression still appears to be faster than stock bzip2.
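I didn't record exact decompression numbers, so if you want to measure it yourself, something like the following should do. Treat it as a sketch: bench_decompress is a name I invented, it relies on bash's built-in time keyword, and it sends the decompressed data to /dev/null so that disk writes don't muddy the comparison.

```shell
# Hypothetical helper: time decompression of one archive with each tool on the PATH.
# Usage: bench_decompress testfile.tar.bz2
bench_decompress() {
  for tool in bzip2 pbzip2; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "== $tool =="
      # -d decompresses, -c writes to stdout, leaving the archive untouched
      time "$tool" -dc "$1" > /dev/null
    fi
  done
}
```

Tools that aren't installed are skipped, so the same function works on machines that only have stock bzip2.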

New Aliases

I don’t want to have to remember to specifically use the pbzip2 command, so I decided to add some aliases. First, let’s detect whether pbzip2 is installed and available:

# Check to see if pbzip2 is already on path; if so, set BZIP_BIN appropriately
type -P pbzip2 &>/dev/null && export BZIP_BIN="pbzip2"
# Otherwise, default to standard bzip2 binary
if [ -z "$BZIP_BIN" ]; then
  export BZIP_BIN="bzip2"
fi
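One caveat: type -P and &> are bash-isms, so the snippet above won't work in a plain POSIX shell. If that matters to you, here is a rough alternative built on command -v, which POSIX does specify. The first_available function is my own hypothetical helper, not something that ships with either tool.

```shell
# Hypothetical helper: print the first of the given commands found on the PATH.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Prefer pbzip2, fall back to bzip2 (also used as the default if neither is found)
BZIP_BIN="$(first_available pbzip2 bzip2)" || BZIP_BIN="bzip2"
export BZIP_BIN
```

The nice side effect is that the preference order lives in one place, so adding another candidate is a one-word change.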

Using the above logic, I set bz as an alias to pbzip2 if available, and if not, to bzip2:

alias bz="$BZIP_BIN"

I usually compress directories more often than individual files, so I added some commands to quickly compress directories and expand bzipped tarballs:

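tarb() {
  # Strip any trailing slash so "mydirectory/" still yields mydirectory.tbz
  tar -cf "${1%/}.tbz" --use-compress-program="$BZIP_BIN" "$1"
}
untarbzip() {
  # Read the decompressed stream from stdin; skip macOS AppleDouble (._*) files
  "$BZIP_BIN" -dc "$1" | tar -xf - --exclude="._*"
}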
alias buntar=untarbzip

Usage:

bz myfile
tarb mydirectory
buntar mytarball.tbz

Got a better method?

Have you had any experience with parallelized bzip2 compression? Find me on Twitter and let me know.
