Burrows-Wheeler Transform Theory

paradox/vivid

Introduction

I know there have been at least two previous articles in Hugi discussing BWT. I will deal solely with the theory of why BWT gives the results it does.

General Motivation for the Burrows-Wheeler Transform

The motivation behind BWT is this: if we had some data in which lots of identical characters sat close together, then a Move-To-Front encoder would yield just the right kind of output for a statistical encoder like Huffman, because the difference-based encoding of MTF would turn those runs of identical characters into runs of zero bytes.

For example, "aaaaaccccc" transformed under MTF would yield lots of zeros, which means the 0-byte has a very high probability, and high probability implies low symbol entropy, which in turn implies good compression.
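To make that concrete (assuming the usual convention that the MTF list starts out as the byte values 0..255 in order): the first 'a' (byte 97) is encoded as its index, 97, and moved to the front of the list, the remaining four 'a's each encode as 0, the first 'c' (byte 99) encodes as 99, and the last four 'c's again encode as 0. So "aaaaaccccc" becomes 97 0 0 0 0 99 0 0 0 0, and eight of the ten output bytes are zero.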

Now I will take a look at each of the stages of this scheme:

(i) Grouping the identical characters together, using block-permutation followed by sorting (this is the actual Burrows-Wheeler Transform).
(ii) The Move-To-Front scheme.
(iii) The statistical encoder; for more information on statistical encoders you should read my article in this Hugi issue.

Cyclic Permutations

BWT takes a sequence of bytes and "rotates" them. If we start with "Burrows", we get the following cyclic permutations:

		     (Original is Burrows)

		    urrowsB
		    rrowsBu
		    rowsBur
		    owsBurr
		    wsBurro
		    sBurrow
		    Burrows

We could of course generate the cyclic permutations in the other order, first moving the 's' from the end of "Burrows" to the front, and then doing the same with the other symbols.
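Just to make the rotation step concrete, here is a rough C sketch (my own illustration, not taken from any particular BWT implementation): rotation number i simply starts at position i of the block and wraps around at the end.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *in = "Burrows";
    size_t n = strlen(in), i, j;

    for (i = 0; i < n; i++) {            /* one rotation per symbol */
        for (j = 0; j < n; j++)
            putchar(in[(i + j) % n]);    /* wrap around the block   */
        putchar('\n');
    }
    return 0;
}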

Now the list is sorted alphabetically, to get:

		    Burrows
		    owsBurr
		    rowsBur
		    rrowsBu
		    sBurrow
		    urrowsB
		    wsBurro

We have now grouped all the final letters of the permutations, the letters we intend to output, into little sets which share the same, or very similar, contexts, the hope being that identical or similar contexts will be followed by identical characters. This is what makes the MTF scheme applied later work; we can now see that BWT is using contexts to predict the next character.

Let's look at how this works. Take the example of English text containing many occurrences of the word "the" (of course it would have many other words in it too). By performing our permutations and sorting, we group together many characters whose context is "he", and in most cases these will be 't's, resulting in lots of 't's at the end of our permutations. The same argument can be applied to every letter in the block. Well, not quite.

Consider the case of "Burrows" above. The first cyclic permutation gives the 'B' a context of "urrows", which is completely correct. The second cyclic permutation gives 'u' a context of "rrowsB", which is partially incorrect: the true context is "rrows". It's pretty obvious that after n cyclic permutations the context given to a symbol contains n - 1 incorrect symbols, and on the last cyclic permutation every symbol in the context is incorrect. This makes sense, because we have no information about what follows the last symbol of the block, which is the one we output on the last cyclic permutation. Since it has no context after it, it's like trying to compress the first byte of a data stream given no information: we simply can't.

These incorrect contexts can even help compression when we have something like "BurrowsBurrows": after 7 permutations (there are 7 characters in "Burrows") we get exactly the right context again, and in fact the rule of n - 1 incorrect context symbols doesn't apply here.

The fact that there are incorrect symbols in the context doesn't harm compression much, because the most correct symbols, the leftmost ones, are used as the primary sorting keys. For example, in "these", at the permutation "eseth", the "ese" is correct and the 't' is incorrect, but the "ese" is what gets used as the primary sorting key. A typical block size could be about 500K, and only at the very last few permutations will the predictions possibly become very poor; in PPM, for example, sometimes 5 symbols are used to predict the next character, so 500 * 1024 symbols is a pretty good context. Also note that a context of such a huge size would normally be bad for predicting the next symbol, but BWT takes care of that: as I stated before, it uses the symbols most recently after the symbol it is trying to predict as the primary sorting keys.

At this stage we have done the Burrows-Wheeler transform.

(About sorting alphabetically: strictly speaking it's lexicographic, since we also have to deal with non-alphabet characters like spaces and full stops when sorting.)
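To tie the permutation and sorting stages together, here is a rough sketch of the forward transform in C. It's my own illustration and a deliberately naive one (real implementations use much smarter suffix sorting): the rotations are represented by their starting offsets, compared byte by byte as unsigned values, and the output is the last symbol of each sorted rotation together with the index of the row that holds the original string.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const unsigned char *buf;   /* block being transformed          */
static size_t blen;                /* block length                     */

/* compare two cyclic rotations given by their starting offsets */
static int rot_cmp(const void *pa, const void *pb)
{
    size_t a = *(const size_t *)pa, b = *(const size_t *)pb, k;
    for (k = 0; k < blen; k++) {
        unsigned char ca = buf[(a + k) % blen];
        unsigned char cb = buf[(b + k) % blen];
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return 0;                      /* identical rotations              */
}

int main(void)
{
    const char *in = "Burrows";    /* toy block, at most 64 bytes here */
    size_t idx[64], i, primary = 0;

    buf  = (const unsigned char *)in;
    blen = strlen(in);

    for (i = 0; i < blen; i++)
        idx[i] = i;                            /* rotation start points */
    qsort(idx, blen, sizeof idx[0], rot_cmp);  /* lexicographic sort    */

    for (i = 0; i < blen; i++) {
        /* last symbol of the rotation starting at idx[i] */
        putchar(buf[(idx[i] + blen - 1) % blen]);
        if (idx[i] == 0)
            primary = i;           /* row containing the original string */
    }
    printf("\nprimary index: %lu\n", (unsigned long)primary);
    return 0;
}

For "Burrows" this prints the last column "srruwBo", exactly the final letters in the table below.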

Move-To-Front

Now we perform Move-To-Front encoding on the final letters of each of the sorted permuted strings. Before even thinking about it, we can see that just applying a statistical encoder at this stage would be silly: we have only rearranged the data, so the symbol frequencies haven't changed at all.

MTF encodes a character as (sort of) the distance back to the previous occurrence of that character. It does this by keeping a list of all the symbols, outputting a symbol's current index in that list, and then moving the symbol to the front of the list.
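Here is a minimal MTF encoder sketch in C (again my own illustration, assuming the list starts out as the byte values 0..255 in their natural order, which is the usual convention):

#include <stdio.h>
#include <string.h>

static void mtf_encode(const unsigned char *in, size_t n,
                       unsigned char *out)
{
    unsigned char list[256];
    size_t i;
    int j;

    for (i = 0; i < 256; i++)
        list[i] = (unsigned char)i;          /* initial order 0..255     */

    for (i = 0; i < n; i++) {
        for (j = 0; list[j] != in[i]; j++)   /* find the current index   */
            ;
        out[i] = (unsigned char)j;           /* emit the index           */
        memmove(list + 1, list, (size_t)j);  /* shift earlier entries    */
        list[0] = in[i];                     /* move the symbol to front */
    }
}

int main(void)
{
    /* the final letters from the "Burrows" example below */
    const unsigned char data[] = "srruwBo";
    unsigned char enc[sizeof data - 1];
    size_t i;

    mtf_encode(data, sizeof data - 1, enc);
    for (i = 0; i < sizeof enc; i++)
        printf("%u ", enc[i]);
    putchar('\n');
    return 0;
}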

The idea here is that, if we have drawn together characters with similar contexts (and hence, hopefully, identical characters themselves), then applying MTF gives generally low differences between all the occurrences of those characters, and as a result lots of very frequent low values, perfect for statistical encoding.

		    Burrow - s
		    owsBur - r
		    rowsBu - r
		    rrowsB - u
		    sBurro - w
		    urrows - B
		    wsBurr - o

Above is the data we send to the decoder for this BWT block: only the final letters are transmitted, together with the index of the row that holds the original string, so the decoder can undo the transform. (Actually the word "Burrows" isn't very compressible under BWT. :))

Applying MTF first gives better results than applying a statistical encoder immediately: if the transformed block were something like "aaaaaaaaaaabbbbbbbb", the MTF encoder would turn it into a couple of small values plus long runs of zeros, great output to hand to a statistical encoder.

However, MTF doesn't just make the zero byte very frequent, it also makes other low bytes frequent, like byte number 1: every time a prediction fails under BWT, the corresponding MTF output tends to be a small value such as 1, so these bytes become frequent too, acting as a kind of mistake indicator. In fact, a statistical encoder that assigned probabilistic weights to low (small-difference) bytes would probably achieve better compression on a BWT/MTF data stream, but I'll update you on that in some other Hugi issue.

Note how well MTF suits BWT data: when we hit a boundary between two groups of identical characters, the newly encountered byte is moved to the front of the list immediately, and we start emitting zero bytes again.

Statistical Encoding

Now a statistical encoder is applied. As I said, there are lots of very frequent characters, which implies high probability for those characters, which implies low symbol entropy, which in turn implies very good compressibility. I.e. the data is very predictable and compresses well under Huffman or arithmetic coding.
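If you want to see the "low symbol entropy" claim in actual numbers, here is a little C sketch (my own, using a plain order-0 model) that counts byte frequencies and sums -p*log2(p); the lower the result, the less work is left for the Huffman or arithmetic coder. (Link with the maths library, -lm on most unix compilers.)

#include <math.h>
#include <stdio.h>

static double order0_entropy(const unsigned char *in, size_t n)
{
    size_t count[256] = {0}, i;
    double h = 0.0;

    for (i = 0; i < n; i++)
        count[in[i]]++;                    /* symbol frequencies      */

    for (i = 0; i < 256; i++) {
        if (count[i]) {
            double p = (double)count[i] / (double)n;
            h -= p * log(p) / log(2.0);    /* bits per symbol         */
        }
    }
    return h;
}

int main(void)
{
    /* "aaaaaccccc" after MTF, as in the worked example earlier */
    const unsigned char mtf_like[] = { 97, 0, 0, 0, 0, 99, 0, 0, 0, 0 };

    printf("%.3f bits/symbol\n",
           order0_entropy(mtf_like, sizeof mtf_like));
    return 0;
}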

Some final words about BWT

It was stated in the article "Why BWT Works" that BWT is a conversion from data compressible under one model to data compressible under another model; this is a very nice observation.

In short, BWT is an algorithm which uses unbounded order contexts to predict the next element of the data stream.

The main reason the Burrows-Wheeler transform is so good is not just that it achieves excellent compression (obviously that is a good part of the reason), but that it combines excellent compression with very high speed. It is not quite "state of the art": a compressor called PPMZ-2 achieves the best compression rates at the moment, but it processes less than 1 byte every 20,000 CPU cycles, which even on the new 850 MHz machines is pretty damn slow, and PPMZ needs roughly 30 times the size of the file in memory. BWT is a hell of a lot faster and achieves compression which is almost as good; it is really very fast in good implementations (bzip2).

Conclusion

Bye now, I hope this explained things a bit.

vivid.paradox