The Burrows-Wheeler Transform
Introduction
This little text will describe the Burrows-Wheeler Transform, which is used, for example, in the state-of-the-art file compression program bzip2. It is a very interesting (weird?) and elegant algorithm, and easy to implement too. I first read about it in an article [1] by Mark Nelson which originally appeared in Dr. Dobb's Journal in 1996.
The Basics
As you can see from the title, this is a transformation algorithm, not a compression one. That means the data won't be compressed in any way; actually, it'll even get expanded a bit. But more about that later. All you should know for now is that the Burrows-Wheeler Transform (I'll abbreviate it as BWT) rearranges the data in a way that makes it compress better with standard methods (like Huffman coding, for example).
The original paper by Burrows and Wheeler [2] is called "A Block-sorting Lossless Data Compression Algorithm", and that describes pretty well what's happening: You take a block of input data, then sort it (in a special way). Let our test data be the string
HUGENDUBEL
which is the first part of the "original" name of this mag (at least in the first issues). You take this string and make n strings out of it (with n being the length of the string), each of which is a copy of the previous one "rotated" left by one character. Then we have:
Nr | String
---+-----------
 0 | HUGENDUBEL
 1 | UGENDUBELH
 2 | GENDUBELHU
 3 | ENDUBELHUG
 4 | NDUBELHUGE
 5 | DUBELHUGEN
 6 | UBELHUGEND
 7 | BELHUGENDU
 8 | ELHUGENDUB
 9 | LHUGENDUBE
Now we've got an array of strings, which we sort (this is the "block-sorting" part). Then our array looks like this:
Nr | String       | Original #
---+--------------+-----------
 0 | B ELHUGEND U | 7
 1 | D UBELHUGE N | 5
 2 | E LHUGENDU B | 8
 3 | E NDUBELHU G | 3
 4 | G ENDUBELH U | 2
 5 | H UGENDUBE L | 0
 6 | L HUGENDUB E | 9
 7 | N DUBELHUG E | 4
 8 | U BELHUGEN D | 6
 9 | U GENDUBEL H | 1
You'll notice that I also listed each string's original number (index). I will soon explain what to do with it. Also, I separated the first and last columns a bit from the rest. I'll call them F[i] and L[i] from now on, with i being the index (string number). They have a key role in the BWT. The F array consists of the characters of the input string, sorted ("BDEEGHLNUU" in this case). The L array has no particular order (at least at first glance), but it also contains all input characters. Because of this property, you can easily reconstruct the F array from the L array by just sorting it. This is VERY important for the BWT. The actual output of the BWT is the L array and a so-called "primary index", which is the index of the original first character within the L array. It is always the number of the row whose original index is 1 (the last character of rotation 1 is exactly the first character of the original string), in this case 9.
Two basic BWT encoders are provided in the supplementary archive called "bwt.zip". The first one (bwtex1.cpp) uses standard C string comparison functions, the second one (bwtex2.cpp) uses a 32-bit compare function which should be faster on 32-bit architectures (although the compare function is still written in C++).
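To give you an idea of what such an encoder looks like, here's a minimal sketch (this is NOT bwtex1.cpp, just an illustration; the function name bwtEncode is made up for this example). It builds all rotations, sorts them, and outputs the L column plus the primary index:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Minimal forward BWT sketch: sort all rotations of the input, emit the
// last column (L) and the primary index. Illustration only -- the
// supplied bwtex1.cpp/bwtex2.cpp work on raw buffers and are faster.
void bwtEncode(const std::string &in, std::string &L, int &primaryIndex)
{
    const int n = (int)in.size();
    std::vector<int> rot(n);               // rot[i] = original rotation index
    for (int i = 0; i < n; i++) rot[i] = i;

    // sort rotation indices by comparing the rotated strings character by character
    std::sort(rot.begin(), rot.end(), [&](int a, int b) {
        for (int k = 0; k < n; k++) {
            char ca = in[(a + k) % n], cb = in[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                      // equal rotations
    });

    L.resize(n);
    primaryIndex = 0;
    for (int i = 0; i < n; i++) {
        L[i] = in[(rot[i] + n - 1) % n];   // last character of rotation rot[i]
        if (rot[i] == 1) primaryIndex = i; // row whose original index is 1
    }
}

int main()
{
    std::string L;
    int primary;
    bwtEncode("HUGENDUBEL", L, primary);
    std::cout << L << " " << primary << std::endl;  // prints: UNBGULEEDH 9
    return 0;
}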
A big problem
Now, we've got our transformation down, but we also have a big problem: How do we reverse this transform? This doesn't seem possible (at least not with the data we have). The thing you'd need is a so-called transformation vector.
The transformation vector
The transformation vector will give us a chance to restore the original data. I'll call it T. It is again organized by rows and tells you how to "unsort" the F array to reconstruct your original data: for each row, it says in which row the successor of that row's string can be found. By successor I mean the ORIGINAL successor (the one-character left-rotated version of the string), not the next row in the sorted array. Phew! It is pretty hard to explain, so I'll give you an example: at row 5 in our "stringlist" the original string "HUGENDUBEL" can be found. T[5] is the row of the rotated version "UGENDUBELH", which is 9. The rotated version of "UGENDUBELH" is "GENDUBELHU", which is stored in row 4, so T[9] is 4. Got the idea? I hope so. I'll write down the final table with all the information we need:
Nr | String     | Orig# | F | L | Rotated string | T
---+------------+-------+---+---+----------------+---
 0 | BELHUGENDU |     7 | B | U | ELHUGENDUB     | 2
 1 | DUBELHUGEN |     5 | D | N | UBELHUGEND     | 8
 2 | ELHUGENDUB |     8 | E | B | LHUGENDUBE     | 6
 3 | ENDUBELHUG |     3 | E | G | NDUBELHUGE     | 7
 4 | GENDUBELHU |     2 | G | U | ENDUBELHUG     | 3
 5 | HUGENDUBEL |     0 | H | L | UGENDUBELH     | 9
 6 | LHUGENDUBE |     9 | L | E | HUGENDUBEL     | 5
 7 | NDUBELHUGE |     4 | N | E | DUBELHUGEN     | 1
 8 | UBELHUGEND |     6 | U | D | BELHUGENDU     | 0
 9 | UGENDUBELH |     1 | U | H | GENDUBELHU     | 4
And here is a simple decoder which will decode our L array back to the original string, given the T vector and the primary index (it is directly derived from Mark Nelson's code). I'll use C++ syntax.
#include <iostream>
using namespace std;

int T[]={2, 8, 6, 7, 3, 9, 5, 1, 0, 4};
char L[]="UNBGULEEDH";
int primary_ind=9;

int main()
{
  int ind=primary_ind;

  // output the character of the current row, then follow T to the row
  // holding the next character of the original string
  for (int i=0; i<10; i++)
  {
    cout << L[ind];
    ind=T[ind];
  }
  cout << endl;
  return 0;
}
This is also included in the supplementary archive. It is called "unbwtex1.cpp".
Now it's getting tricky
Okay, we've got this transformation vector thing, and the problem is, we definitely need it for decoding. But we cannot store it in the file, because that would enlarge the data too much. So there's only one possible solution: we have to generate it from the data we have, the L array and our primary index. Actually, we only need the L array to rebuild the transformation vector. I'll explain how to do that now.
For creating the transformation vector, you need a copy of the F and L arrays. But, as I already stated before, we can get the F array for free by sorting the L array.
Now I need to say something about creating the transformation vector (again), because now we've got a problem: we only have the first and last column of each string, but not the rest (at least not in order). So what can we do? The scheme I used to calculate the transformation vector in the previous part won't work. But take a look at the "final table" again: the transformation vector can be generated in an easier way. Think of it like this: when you rotate a string left by one character, where does its F[i] character end up? At the end of the rotated string, i.e. it becomes the L entry of the row the rotated string is stored in. So we only have to search where in the L array our F[i] character can be found, and that row is T[i]. If you do that, you'll see that the transformation vector values you get are right. But you'll soon run into another problem: in our example string some characters appear more than once, so which entry in the L array do we take for them? The answer is simple: one after another! The F array (and the rest of the strings) is sorted, and the L array is only a function of F (at least you can think of it that way), so the entries in the L array for a given character appear in the same order as in the F array. If you don't understand this (or if I didn't explain it well enough :), simply believe me, it's right and it works. So the first 'U' in the F array corresponds to the first 'U' in the L array. With this additional information you can also build up the transformation vector in reverse (from the L array):
Our L array is: "UNBGULEEDH"
After sorting we get F, which is: "BDEEGHLNUU"
You take the first letter in L, 'U'. Then you search for the first occurrence of 'U' in F, which is at index 8. Therefore T[8]=0, which matches our first calculated transformation vector.
Then you take the second letter, 'N'. Its only occurrence in F is at position 7, so T[7]=1, which is also correct.
You can do that for all characters in L, and I promise you, it works just perfectly.
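If you want to see this matching rule in code, here's a small, naive sketch (the name buildTnaive is made up; the supplied sources use the faster method from the next section). It sorts L to get F and then pairs the k-th occurrence of each character in F with the k-th occurrence of the same character in L:

#include <algorithm>
#include <string>
#include <vector>

// Naive reconstruction of the transformation vector from L alone:
// sort L to get F, then match occurrences of equal characters in order.
std::vector<int> buildTnaive(const std::string &L)
{
    const int n = (int)L.size();
    std::string F = L;
    std::sort(F.begin(), F.end());           // F is just the sorted L array

    std::vector<int> T(n);
    std::vector<bool> used(n, false);         // L positions already matched
    for (int i = 0; i < n; i++) {             // i walks over F
        for (int j = 0; j < n; j++) {         // find the next unused match in L
            if (!used[j] && L[j] == F[i]) {
                T[i] = j;                     // F[i] ends up as L[T[i]]
                used[j] = true;
                break;
            }
        }
    }
    return T;                                 // for "UNBGULEEDH": 2 8 6 7 3 9 5 1 0 4
}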
Optimizing it a bit
The algorithm I've explained so far works perfectly. But you can actually do without creating the F array at all, which will save you both memory and some computation time. The idea is simple: we only need the F array during the computation of the transformation vector, and there we only need the POSITIONS of the characters. So I adapted an idea by Mark Nelson: first you count the occurrences of all possible input values (i.e. 0-255 for normal bytes) in the input block, then you calculate an array of "starting positions", i.e. where each possible character value starts in F. This can be done by this trivial C++ function:
static unsigned long startpos[256];
static unsigned long counts[256];

void generateStartingPositions(char *data, unsigned long length)
{
  unsigned long i, sum;

  // count the occurrences of every possible byte value in the block
  // (cast to unsigned char so values >= 128 don't index negatively)
  for (i=0; i<256; i++) counts[i]=0;
  for (i=0; i<length; i++) counts[(unsigned char)data[i]]++;

  // startpos[v] = index in F where the run of byte value v begins
  sum=0;
  for (i=0; i<256; i++)
  {
    startpos[i]=sum;
    sum+=counts[i];
  }
}
Easy enough? I hope so. Later you'll also have a use for the counts array: before actually building the transformation vector, you reset it to 0 and count the occurrences of the bytes again. This way you can use the counts array in conjunction with the startpos array to index your transformation vector. Here is the complete transformation vector build code (it uses the above function):
static unsigned long *T;

void buildTransformationVector(char *L, unsigned long L_len)
{
  unsigned long i;

  if (T) delete[] T;
  T=new unsigned long[L_len];

  generateStartingPositions(L, L_len);

  // walk over L and count the bytes again: startpos[val]+counts[val] is
  // the position of this particular occurrence of val in the (implicit)
  // F array, and that F position's successor is row i
  for (i=0; i<256; i++) counts[i]=0;
  for (i=0; i<L_len; i++)
  {
    unsigned char val=(unsigned char)L[i];
    T[startpos[val]+counts[val]]=i;
    counts[val]++;
  }
}
Then you can decode your data with the same code as in my first example. I won't list that program here. It is included in the supplementary zipfile, again, and is called "unbwtex2.cpp".
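Just to show how the pieces connect (this is only a rough sketch using the code above, NOT the actual unbwtex2.cpp), decoding a block boils down to rebuilding T and then following it from the primary index, exactly like in the first decoder example:

#include <cstdio>

// Rough sketch, relying on buildTransformationVector() and the global
// T array from above: rebuild the transformation vector from L, then
// walk it starting at the primary index to emit the original bytes.
void decodeBlock(char *L, unsigned long L_len, unsigned long primary_ind)
{
  unsigned long ind=primary_ind;

  buildTransformationVector(L, L_len);
  for (unsigned long i=0; i<L_len; i++)
  {
    putchar(L[ind]);   // output the character of the current row
    ind=T[ind];        // jump to the row holding its original successor
  }
}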
Making the data more redundant
The BWT only transforms data, it doesn't reduce its redundancy much. At least not for plain (non-adaptive) statistical compression algorithms like static Huffman coding, because only the ordering changes, not the actual data. So Burrows and Wheeler recommended letting a second transform run over the data: Move To Front encoding. It's a very simple algorithm, easy to understand, and easily reversible. You have an order table which holds all possible values of your input data (i.e. all 256 byte values for normal bytes). How they are ordered in the beginning doesn't matter, but the start order table for the coder and the decoder must be the same, and it must _really_ hold all possible values (otherwise you'll get problems). The algorithm itself is very simple: you take an input byte, look up its position in the order table, output that position and then move the byte value to the front (the top) of the order table (if it isn't already there), shifting all the values that were in front of it down one entry.
Example: imagine you have a very small order table with only 3 entries: 'a', 'b' and 'c'. Also, you have an input stream,
"abacababb".
Then your MTF output would be:
0; the character a is already on top, no shifting
1; the character b is moved up, the character a gets shifted down a level
1; the characters a and b again switch their places
2; c would get first, a second, b third
1; a would get first again, then c, then b
2; b would get first, we have b, a, c
1; a is first now, then b and c
1; b is on front again, then a and finally c
0; nothing changes.
An easy algorithm, isn't it? The decoding algorithm works the same way, except that you know the position of the character instead of its value.
I've written two programs, an MTF encoder called "mtfenc.cpp" and an MTF decoder called "mtfdec.cpp". I won't explain them here (or any of their core functions) because the algorithm should be easy enough for everyone to understand; also, the code is heavily commented.
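To make the description above a bit more concrete, here's a tiny MTF sketch over full bytes, using a 256-entry order table initialized to 0..255. This is NOT the supplied mtfenc.cpp/mtfdec.cpp, and the function names are made up for this example:

#include <string>
#include <vector>

// MTF encoder: output the position of each byte in the order table,
// then move that byte to the front of the table.
std::vector<unsigned char> mtfEncode(const std::string &in)
{
    std::vector<unsigned char> table(256), out;
    for (int i = 0; i < 256; i++) table[i] = (unsigned char)i;
    for (unsigned char c : in) {
        int pos = 0;
        while (table[pos] != c) pos++;          // look up its position
        out.push_back((unsigned char)pos);      // output the position
        for (int k = pos; k > 0; k--)           // shift the values that were
            table[k] = table[k - 1];            // in front of it down one...
        table[0] = c;                           // ...and move c to the front
    }
    return out;
}

// MTF decoder: same table updates, but we start from the position
// instead of the value.
std::string mtfDecode(const std::vector<unsigned char> &in)
{
    std::vector<unsigned char> table(256);
    for (int i = 0; i < 256; i++) table[i] = (unsigned char)i;
    std::string out;
    for (unsigned char pos : in) {
        unsigned char c = table[pos];           // we know the position, fetch the value
        out.push_back((char)c);
        for (int k = pos; k > 0; k--)
            table[k] = table[k - 1];
        table[0] = c;
    }
    return out;
}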
How to actually compress the data
So, now you've applied two transformations to your input data, but you still haven't compressed anything. However, the data you get out of the MTF step will compress very well with statistical methods like Huffman coding. You may also let a run-length encoding step run over it before the final encoding. Algorithms that won't work well are dictionary-based coders like the whole LZ78 family (LZW, for example) and sliding-window coders like LZ77, LZSS, LZP and the like, because the BWT scrambles most of the repeated phrases these types of codecs are looking for.
Mark Nelson used arithmetic coding for his article, but I don't recommend it to you, for several reasons. First, most (or all, I don't know exactly) arithmetic coding algorithms are patented, which may lead to problems. Second, arithmetic coding is slow (at least slower than simple Huffman coding), and the BWT is already slow enough :)
I myself only tried Huffman and arithmetic coders because I don't have access to other statistical compressors, like the PPM family of coders, for example. However, even the results with Huffman coding were very pleasant.
I won't explain Huffman coding here, the article is already long enough. If you need a description, simply contact me and I'll explain the algorithm to you. If there is enough demand, maybe I'll also write an article about it for the next Hugi issue.
Implementation tips
If you really want to compress data with this algorithm, you'll need some further additions and tips. First, this algorithm compresses bigger files better than smaller ones. Also, the decoding isn't very simple (compared to LZ77-like algorithms, for example). And you won't be able to compress bigger files in one chunk; you have to split them into blocks. Good block sizes lie between 100k and 900k, as used in bzip2. Below that, the compression achieved won't be very good, and with higher values you'll soon run into memory problems. (You can reduce the memory usage, of course. For example, you don't have to actually store the F array in memory; with the trick I explained to you, this can be done on the fly.) And the BWT is slow. This is mainly because you have to sort an array of <blocksize> strings of <blocksize> length, and string compares are slow. So the sorting/comparing part is definitely the part to optimize.
I also included a combined BWT+MTF encoder/decoder. It produces output files you may filter through other codecs, a Huffman coder for example. A very simple Huffman coder written by me is included, too. This version is very basic; for example, it stores all input character frequencies in the file. It is also not commented because it was never meant to be released. (Again: mail me if you want a description of Huffman coding.) For my benchmarks (see next section) I also used an arithmetic coder, but I won't give that one to you, simply because a) it was not written by me and b) arithmetic coding is heavily covered by patents and I'm not interested in getting into any patent problems because I gave it to you...
The encoder is called fbwtenc.cpp, the decoder is called fbwtdec.cpp (the f stands for "final"). The decoder is pretty fast, but the encoder is SLOOOOW. This is because of the many string compares that have to be done. I used Mark Nelson's method of reducing them somewhat (it's mostly a hack, hehe), but the encoder is still quite slow. I can't change this without largely rewriting it, and then the code wouldn't be easily understandable anymore. I'm pretty sure it wouldn't be understandable at all (take a look at bzip2's source code and you'll see what I mean). I reduced the block size to 100kb, and then it was nearly acceptable (before that I used 500kb, and it took about 5 minutes to encode 542kb of data, argh!). I know the encoder's execution time is bad promotion for the BWT, but look at its compression and you'll be happy :)
Some numbers
Today on IRC I was asked why the output of a BWT/MTF encoder compresses better (hi tmbinc), and, well, I couldn't answer :) I just promised him it does. So, for all you people out there asking yourselves the same question, here's a table with some numbers:
Compression Method                                 | Size (bytes)
---------------------------------------------------+-------------
None                                               |       555479
Plain Huffman coding (with the encoder supplied)   |       372179
Plain arithmetic coding                            |       366670
BWT (fbwtenc, Blocksize 100kb) & Huffman coding    |       164586
BWT (fbwtenc, Blocksize 500kb) & Huffman coding    |       159162
BWT (fbwtenc, Blocksize 100kb) & arithmetic coding |       155690
gzip (best compression)                            |       148504
BWT (fbwtenc, Blocksize 500kb) & arithmetic coding |       146102
bzip2 (BWT, Blocksize 600kb, Huffman coder)        |       127883
I used Ralf Brown's Portlist, Version 57 (part of his interrupt list) for testing. It contains many repetitions which LZ77-style algorithms can compress very well, and this is the main reason why gzip ranked so well.
References & suggested reading
[1] "Data Compression with the Burrows-Wheeler Transform"
by Mark Nelson
1996
released in some 1996 DDJ
http://www.dogma.net/markn/articles/bwt/bwt.htm
[2] "A Block-sorting Lossless Data Compression Algorithm"
by M. Burrows and D. J. Wheeler
1994
Digital Systems Research Center Research Report 124
http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/...
...src-rr-124.html
Closing words
So, that was my first article for Hugi (well, it was my first diskmag article ever), and I hope you liked it. And please send me criticism, criticism, criticism! Thanks in advance.
Oh... and I want to thank Adok for providing me with the information I needed to make this article "Hugi-compatible". And for telling me about Aurora. Man, it's the best text editor I've ever seen for MSDOS, and it rules. :)
If you want some additional information, take a look at the two documents mentioned above. At least the first one is worth reading (I wasn't able to download the original paper yet; maybe this is a permanent problem, I need some information about this).
Okay, so much for my first article. Below are the ways to contact me.
Contacting info
email: fabian.giesen@t-online.de
or gfabian@jdcs.su.nw.schule.de
snailmail: (only use if you really really have to :)
Fabian Giesen
In der Beckersbitze 1a
D-53639 Königswinter
Germany