Static Huffman
The aim of Huffman coding is to give shorter codes to the bytes that occur more frequently. It's rarely used standalone; rather it can be (and is) used together with other schemes, like LZ variants (ARJ, ZIP) or even image compression (JPEG). So it's very useful. Huffman is not the best entropy coder, but it's easy, good and, to my knowledge, free of patents. To learn it you only need experience in programming, nothing else. I hope I've written enough explanation and examples so you can easily understand it.
The Huffman algorithm is quite old. David Huffman published it in 1952, a few years after Claude Shannon and Robert Fano had published their method. Huffman's method always produces optimal codes, while Shannon-Fano's doesn't. Probably this is because Huffman re-sorts the probabilities after combining them, and Shannon-Fano doesn't. The algorithm takes the symbols' probabilities and from them creates variable-length codes. Note that every code stands for exactly ONE symbol. Huffman uses binary trees, so before explaining Huffman in detail I'll explain binary trees. If you have already used them, skip that section.
There are many ways to make variable-length codes, but some of them don't work at all. The codes MUST be uniquely decodable, so the decoder has no problems. For example, take the following codes:
Byte Code
---- ----
'a' 01b
'b' 0b
'c' 11b
When the decoder gets a 0 it can't know whether it's a 'b' or the start of an 'a' (probably it would choose 'b'). Codes that create no such ambiguity could be:
Byte Code
---- ----
'a' 1b
'b' 01b
'c' 001b
'd' 0001b
These codes obey the following rule: no code may be the prefix of another code. (In our first set, the code 0b was a prefix of the code 01b.) But the codes above waste too many bits, and that's why we use Huffman codes. The Huffman codes would be something like this:
Byte Code
---- ----
'a' 1b
'b' 01b
'c' 00b
They obey the rule and are also optimal. Note that these codes are instantaneous too: we only need the bits of the code itself to recognize it. (The moment we have read 00b we know the code corresponds to 'c'.)
Binary trees
A binary tree is something like this:
1
/ \
/ \
2 3
/ \
/ \
3 3
The '1' is the root of the tree, '2' is a node, and the '3's are the leaves. If we want to scan a binary tree (and we do) we must be able to tell the difference between a node and a leaf, so the structure of every element of the tree has an integer telling whether it is a node (or the root) or a leaf. You can also see that every element (except the leaves) has two pointers to other elements, so the data structure of an element is as follows:
INT value ;identify the element
POINTER left ;pointers
POINTER right
The integer holds the value of the element itself. In our case (Huffman codes), if it's between 0-255 it's a byte (a leaf); if it's 257 it's a node. A pointer is 0 if it doesn't point to another element; otherwise it holds the address (in memory) of the next element. Let's do the tree again. This binary tree is invented, just an example of what a binary tree looks like; it's not exactly the tree that Huffman builds (but almost). Imagine these offsets for the elements:
Element Offset
------- ------
1 00h
2 01h
3 02h
4 03h
5 04h
And we have three bytes: 21, 34, 56. So here is our tree:
(257,01h,04h) <- 1 (root)
/ \
/ \
(node) 2-> (257,02h,03h) (56,00h,00h) <-5 (leaf)
/ \
/ \
(leaf)3-> (21,00h,00h) (34,00h,00h) <-4 (leaf)
Note that the nodes (and the root) have a value of 257, which makes them 'invalid' bytes. Also note that the pointers of the leaves are 00h because they don't point anywhere. Note too that to scan the binary tree we only need a pointer to its first element, the root.
Our Huffman trees will have this structure:
INT value ;if it's a byte or a node
UNSIGNED LONG probability ;the probability of the byte
POINTER left ;pointers
POINTER right
The probability is used for making the tree (combining elements). When I implemented it, my structure was:
INT value
INT dummy
UNSIGNED LONG probability
POINTER left
POINTER right
So it was 16 bytes long, which made sorting easier. (Indexing the elements would have needed multiplications, so I replaced them with shifts.)
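In C the element could be declared like this - a minimal sketch with names of my own choosing; the 16-byte size only works out with 16-bit shorts and 32-bit pointers:

typedef struct element
{
    short value;               /* 0-255 = a byte (a leaf), 257 = a node/root */
    short dummy;               /* padding, keeps the element 16 bytes long */
    unsigned long probability; /* frequency of the byte (or the combined sum) */
    struct element *left;      /* child pointers, 0 (NULL) in the leaves */
    struct element *right;
} element;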
How to scan binary trees
Because we don't know the size (the number of nodes and leaves) of a binary tree in advance, the routine that scans it must be recursive. The depth of a binary tree is also called its height. (The height of our sample binary tree is two.)
It starts at the root. First check whether we are at a leaf, and if so do whatever is needed. Otherwise check the right pointer: if it's NOT 0, call the same function again with that pointer. Then do the same with the left pointer. Okay?
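In C the scan could be sketched like this, using the element structure from above; process_leaf() is just a placeholder for whatever work you need to do at each leaf:

/* Visit every element reachable from 'e'; pass the root to start. */
void scan(element *e)
{
    if (e == NULL)
        return;
    if (e->value <= 255)       /* a leaf: the value is the byte itself */
    {
        process_leaf(e);       /* placeholder for the real work */
        return;
    }
    scan(e->right);            /* a node: descend into both subtrees */
    scan(e->left);
}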
How it works
I'll use a simple string as an example: 'acbacaa'. First we need the probability of each symbol. To get them, keep a list of unsigned longs (double words) with 256 entries (yes, one for every byte), scan the whole file (or whatever the input is - it may be the output of another compressor), and for every byte read increment its entry. (Not very difficult; see the sketch after the table.)
Byte Probability
---- -----------
'a' 4
'b' 1
'c' 2
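Counting them takes only a few lines of C - a sketch, reading from an already opened file:

#include <stdio.h>

/* Count how many times each byte occurs; 'counts' must hold
   256 unsigned longs and be zeroed by the caller. */
void count_frequencies(FILE *f, unsigned long counts[256])
{
    int c;
    while ((c = fgetc(f)) != EOF)
        counts[c]++;
}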
Then we sort them by probability, biggest first.
Sorted list
-----------
'a' 4
'c' 2
'b' 1
Now let's start building the binary tree. What we will do is the following: take the last two elements and combine them into a small binary tree. First we store both elements with a new fictional node pointing to them; then we discard both from the probabilities list and insert a new element whose probability is the sum of theirs.
Sorted list Binary tree
----------- ----------- (257(1),3,&c,&b)
'a' 4 / \
'c' 2 \_ Fictional node(1) / \
'b' 1 / prob.= 2+1 = 3 ('c',2,0,0) ('b',1,0,0)
After doing this we sort the list again and join the last two elements (we keep doing this until everything is combined into a single tree - though, as you'll see, the last step is special):
Sorted list Binary tree
----------- ----------- (257(2),7,&257(1),&a)
'a' 4 \_ Fictional node(2) / \
257 3 / prob.= 4+3 = 7 / \
(257,3,&c,&b) ('a',4,0,0)
/ \
/ \
('c',2,0,0) ('b',1,0,0)
Some things to note:
During this process we keep two pointers, one into the list of probabilities and one into the binary tree. The pointer into the list points at the LAST element. The pointer into the tree points at the start of the binary tree and, while the tree is being built, at the next free position.
What we do is: first create the new fictional node (at the tree pointer), set its value to 257 (not a byte), get its probability by adding the probabilities of the last two elements of the list, and aim its pointers at the next positions in the binary tree, where the two combined elements are stored. Then increment the pointer into the list (so the LAST element is skipped) and write the new fictional node there, adding the last probability to the current one and setting the value to 257. (In fact I first copied the elements into the tree, and then made the new fictional node.)
Once this is done, sort the elements of the probabilities list again, and repeat until only two elements are left.
When only two elements remain in the probabilities list you shouldn't do the same; this is a special case (it's the end). What should you do instead? Read on in the example.
Phew, what a long explanation - now an example. Note one thing: the entries of the probabilities list use the very same data structure as the elements of the binary tree.
Pointer to list
---------------
'a',000000004h, 00000000h, 00000000h
'c',000000002h, 00000000h, 00000000h
'b',000000001h, 00000000h, 00000000h
Pointer to tree
---------------
... (empty)
As you can see, the list of probabilities holds the byte itself, its probability and two pointers. This makes creating the binary tree easier.
Let's assume the value of the pointer to the (start of the) binary tree is 00050000h. And now let's start with the real example, step by step:
-> copy both LAST elements
Pointer to list
---------------
'a',000000004h, 00000000h, 00000000h
'c',000000002h, 00000000h, 00000000h
'b',000000001h, 00000000h, 00000000h
Pointer to tree Offset
--------------- ------
'c',000000002h, 00000000h, 00000000h (00050010h) (let's invent the
'b',000000001h, 00000000h, 00000000h (00050000h) length)
-> create the new fictional node
Pointer to tree Offset
--------------- ------
257,000000003h, 00050010h, 00050000h (00050020h)
'c',000000002h, 00000000h, 00000000h (00050010h)
'b',000000001h, 00000000h, 00000000h (00050000h)
Note that putting the correct offsets in the fictional node is only a matter of subtracting the length of an element from the current pointer (to the fictional node), once for each of the two elements. (In our case 10h, 16 bytes.)
-> discard both elements and do a new one (fictional element)
Pointer to list
---------------
'a',000000004h, 00000000h, 00000000h
257,000000003h, 00000000h, 00000000h
Note that we don't initialize the pointers (no need); all we do is add the previous probability ('b': 1h) to this one and put 257 in place of the value.
-> sort the list
Pointer to list
---------------
'a',000000004h, 00000000h, 00000000h
257,000000003h, 00000000h, 00000000h
Pointer to tree
---------------
257,000000003h, 00050010h, 00050000h
'c',000000002h, 00000000h, 00000000h
'b',000000001h, 00000000h, 00000000h
In this case sorting the list of probabilities didn't change it, but you'll find that most of the time it does.
-> and then continue
And when only TWO elements are left we can't work as we did before; we have to do something different:
-> copy FIRST element
Pointer to list
---------------
'a',000000004h, 00000000h, 00000000h
257,000000003h, 00000000h, 00000000h
Pointer to tree
---------------
'a',000000004h, 00000000h, 00000000h
257,000000003h, 00050010h, 00050000h
'c',000000002h, 00000000h, 00000000h
'b',000000001h, 00000000h, 00000000h
-> as usual, make the fictional node (this is the ROOT)
Pointer to tree Offset
--------------- ------
257,000000007h, 00050020h, 00050030h (00050040h)
'a',000000004h, 00000000h, 00000000h (00050030h)
257,000000003h, 00050010h, 00050000h (00050020h)
'c',000000002h, 00000000h, 00000000h (00050010h)
'b',000000001h, 00000000h, 00000000h (00050000h)
As you see, making the root is as easy as making a fictional node (in fact it is the same thing). Now your binary tree is finished; save the pointer to the ROOT so you can scan the tree.
And about the way you make your probabilities list: it could be left unsorted, but then the Huffman codes wouldn't be optimal, and optimality is the whole point of using Huffman codes.
And about combining elements: of course you can skip the re-sorting. I never implemented it that way, but I think it would produce the same codes Shannon-Fano does.
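To sum the whole construction up, here is a compact sketch in C. It keeps pointers to the elements in a plain array and re-sorts after every combination, just as described. (Which of the two combined elements goes left and which goes right only swaps 0s and 1s in the final codes, so don't worry if it doesn't match the worked example bit for bit.)

#include <stdlib.h>

/* qsort comparator: biggest probability first */
static int by_prob_desc(const void *a, const void *b)
{
    const element *x = *(const element * const *)a;
    const element *y = *(const element * const *)b;
    if (x->probability > y->probability) return -1;
    if (x->probability < y->probability) return 1;
    return 0;
}

/* Combine the two least probable elements until only the root is left.
   'list' holds pointers to the n leaves and is destroyed in the process. */
element *build_tree(element **list, int n)
{
    while (n > 1)
    {
        element *node = malloc(sizeof *node);   /* the new fictional node */
        qsort(list, n, sizeof *list, by_prob_desc);
        node->value = 257;                      /* not a byte */
        node->left  = list[n - 2];              /* the two LAST elements */
        node->right = list[n - 1];
        node->probability = node->left->probability
                          + node->right->probability;
        list[n - 2] = node;                     /* replace both with the node */
        n--;
    }
    return list[0];                             /* the root */
}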
Scanning the Huffman binary tree
Once we are here we have our binary tree; now we'll make the codes. As you can see, the bytes with the highest probability end up near the top, and the bytes with the LOWEST probability at the bottom of the binary tree. Let's take advantage of that. We scan the binary tree; every time we go right we append a 0 to the code, and every time we go left we append a 1. The length of a code equals the depth of its byte in the tree. (So we already know that 'a' will get a one-bit code, and 'c' and 'b' two-bit codes.)
Once we arrive at a byte, we assign the current code to it. Note that we must also keep track of the length of the code.
Pseudo code:
do_codes:
 Increment 'length'
 Are we at a byte?
  - yes -
   - Assign 'code' and 'length' to this byte
   - end (go to return)
  - no -
   - Is the right pointer valid? (not if it's 0)
    - yes - shift 'code' left and or with 0
          - call do_codes
          - next pointer
    - no -
          - next pointer
   - Is the left pointer valid?
    - yes - shift 'code' left and or with 1
          - call do_codes
          - end (go to return)
    - no -
          - end (go to return)
 return
Okay? Of course you must take care to push 'code' and 'length' before the recursive calls (and restore them afterwards), and almost nothing else.
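In C the routine could look like this - a sketch where 'code' and 'length' travel as arguments instead of being pushed; the example below instead follows the pseudo code with its -1 trick:

unsigned long codes[256];   /* the code assigned to every byte... */
int lengths[256];           /* ...and its length in bits */

/* Walk the tree; 'code' and 'length' describe the path so far.
   Start it with do_codes(root, 0, 0). */
void do_codes(element *e, unsigned long code, int length)
{
    if (e->value <= 255)                    /* a byte: store code and length */
    {
        codes[e->value]   = code;
        lengths[e->value] = length;
        return;
    }
    if (e->right)                           /* to the right: append a 0 */
        do_codes(e->right, (code << 1) | 0, length + 1);
    if (e->left)                            /* to the left: append a 1 */
        do_codes(e->left, (code << 1) | 1, length + 1);
}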
Example about doing codes
Now as an example let's do the codes from our example.
(257(2),7,&257(1),&a)
/ \
/ \
(257,3,&c,&b) ('a',4,0,0)
/ \
/ \
('c',2,0,0) ('b',1,0,0)
Code=0b Length=-1
Note that length starts at -1 so the increment at the start of the recursive routine leaves the root at length 0. Remember that 'code' and 'length' are restored after every recursive call.
Read root. Length=0. Not a byte. Right pointer valid. Code=0b. Call again.
Read 'a'. Length=1. A byte. Assign code 0b, length 1. Return.
Back at the root. Left pointer valid. Code=1b. Call again.
Read fictional node. Length=1. Not a byte. Right pointer valid. Code=10b. Call again.
Read 'b'. Length=2. A byte. Assign code 10b, length 2. Return.
Back at the fictional node. Left pointer valid. Code=11b. Call again.
Read 'c'. Length=2. A byte. Assign code 11b, length 2. Return.
Return all the way up, and we're done.
The final codes are:
Byte Code Length
---- ---- ------
'a' 0b 1
'b' 10b 2
'c' 11b 2
In the output list (256 entries) you save, for every byte, its code and its length.
Bit stream
Once you have this list, you rescan the whole file and for every byte output its code with its length. An example for our string 'acbacaa':
Read 'a'. Output 0b.
Read 'c'. Output 11b.
Read 'b'. Output 10b.
Read 'a'. Output 0b.
Read 'c'. Output 11b.
Read 'a'. Output 0b.
Read 'a'. Output 0b.
And this makes the bit stream: 0111001100b
As you can see, there is no way to go wrong when decoding this stream.
Of course you call putbits with the length of the code and the code itself as the value.
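A putbits and the coding loop could be sketched like this in C, writing the most significant bit first, which matches the left-to-right rule of the next section ('codes' and 'lengths' are the arrays from before):

#include <stdio.h>

static unsigned char bitbuf = 0;    /* bits waiting to be written */
static int bitcount = 0;

/* Output the lowest 'length' bits of 'code', most significant bit first. */
void putbits(FILE *out, unsigned long code, int length)
{
    while (length--)
    {
        bitbuf = (unsigned char)((bitbuf << 1) | ((code >> length) & 1));
        if (++bitcount == 8)
        {
            fputc(bitbuf, out);
            bitbuf = 0;
            bitcount = 0;
        }
    }
}

/* Flush whatever bits remain, padded with zeros. */
void flushbits(FILE *out)
{
    if (bitcount)
        fputc(bitbuf << (8 - bitcount), out);
    bitbuf = 0;
    bitcount = 0;
}

/* The coding loop itself: one putbits call per input byte. */
void encode(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF)
        putbits(out, codes[c], lengths[c]);
    flushbits(out);
}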
Saving the codes
You should note that the bits of a code MUST be sent from left to right, because that is the order they were created in.
For example the code 011b must be sent as: 0b 1b 1b
If you don't do it as above, then imagine our list:
Byte Code Length
---- ---- ------
'a' 0b 1
'b' 10b 2
'c' 11b 2
If we output the code for 'b' from right to left, we send the 0b first instead of the 1b, so the decoder will get the 0b and simply output 'a'! Another solution is to invert all the codes; this makes coding and decoding faster. Let's take as an example:
Byte Code Length
---- ---- ------
'a' 0b 1
'b' 10b 2
'c' 111b 3
'd' 110b 3
Inverted, this will be:
Byte Code Length
---- ---- ------
'a' 0b 1
'b' 01b 2
'c' 111b 3
'd' 011b 3
Now the codes read from right to left, so we can write each one with a single putbits call; there's no need for slow routines that output codes bit by bit.
Of course fast ways of inverting the codes are welcome.
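Here is a simple bit-by-bit inversion as a sketch (table-driven methods are faster):

/* Reverse the lowest 'length' bits of 'code':
   110b with length 3 becomes 011b, like 'd' in the table above. */
unsigned long invert_code(unsigned long code, int length)
{
    unsigned long inverted = 0;
    while (length--)
    {
        inverted = (inverted << 1) | (code & 1);
        code >>= 1;
    }
    return inverted;
}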
Output of a Huffman coder
Now here comes how to save the file. Of course you can't output only the bit stream; the decoder also needs to know what every code represents and how long it is. So we also need to save the codes. I'll present two ways of doing it.
Save the binary tree
This is the solution you'll find everywhere: save the whole binary tree and let the decompressor (or decoder, call it what you want) read it. Once it has the tree it can start decompressing. Get one bit and step down the binary tree; if we reach a leaf, stop scanning and output the byte at that leaf. If it was a node, get another bit and continue until a leaf is found. I've read many times that this scheme is slow, but I never implemented it, so I can't tell you whether it is or not.
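In C that decoding loop is tiny - a sketch, where getbit() is assumed to deliver the next bit of the stream and the tree uses the element structure from before:

/* Decode one byte by walking the tree bit by bit. Remember the
   convention used here: a 0 bit goes right, a 1 bit goes left. */
int decode_byte(element *root)
{
    element *e = root;
    while (e->value > 255)             /* still at a node */
        e = getbit() ? e->left : e->right;
    return e->value;                   /* reached a leaf: that's the byte */
}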
Save the codes and lengths
When I implemented my Huffman coder I needed (of course) a decoder, and for it I invented this scheme. Probably someone had already invented and implemented it, but I've never seen it anywhere. Feel free to use it.
My idea is the following:
We divide all the codes by length. For example, length 5 (5 bits of code) may have 3 different codes; I keep track of how many codes every length has. Then I get a bit and search among the codes of length 1. If I find the code there, I output its byte. If I don't, I get another bit and search in the next list of lengths. If you implement this you have to take care of a few things: what is the minimum length of a code? (It's different for every input file.) Start by reading that minimum number of bits.
You must also take care of lengths that have no codes at all.
How did I do all this?
First, saving it: I sorted the list by byte value, then looped over it. Read byte [0]: does it have a code? If yes, output a 1, then the length and the code; if not, simply output a 0. And so on for all 256 bytes. For the length and the code I recommend first outputting 5 bits with the length, and then exactly the bits the code needs. (The decoder will have no problem: it already knows the length.)
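As a sketch, saving the table that way could look like this (putbits is the routine from before; note that the 5-bit field limits code lengths to 31, which is plenty in practice):

/* Write the code table: for every byte a 1-bit flag and, if the byte
   has a code, 5 bits of length followed by the code bits themselves. */
void save_codes(FILE *out)
{
    int byte;
    for (byte = 0; byte < 256; byte++)
    {
        if (lengths[byte])
        {
            putbits(out, 1, 1);                       /* flag: has a code */
            putbits(out, lengths[byte], 5);           /* the length... */
            putbits(out, codes[byte], lengths[byte]); /* ...then the code */
        }
        else
            putbits(out, 0, 1);                       /* flag: no code */
    }
}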
The decompressor reads this and builds a nice decompression list: first the minimum number of bits needed, then how many codes of that minimum length exist and, if more than 0, the codes and the bytes they correspond to; then the next length, its count, and so on. Once this list is finished, start decompressing: get one bit, read how many codes that length has; if 0, skip it (read another bit) and move on to the next list.
I think you don't need an example about this, however if you feel it's needed, let me know and I'll (probably) do a new version.
Improvements
As far as I know, static Huffman can only be improved (in ratio) with one trick: canonical Huffman. I won't explain it fully here, just a bit.
Canonical Huffman: we do everything the same way, but we don't care about the codes themselves, only their lengths; then, based on the lengths, we construct the codes. So when saving the tree (or using my scheme) we only output the lengths, and the decoder just reads the lengths and rebuilds the codes.
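I won't give full details, but as a sketch, the usual canonical construction (the standard one, used by e.g. Deflate - not necessarily the only way) rebuilds the codes from the lengths like this: count the codes of every length, compute the first code of each length, then hand out consecutive codes:

#define MAX_LEN 31                      /* matches the 5-bit length field */

/* Rebuild codes[] from lengths[] alone; codes of equal length become
   consecutive numbers, so only the lengths ever need saving. */
void make_canonical(void)
{
    unsigned long count[MAX_LEN + 1] = {0};
    unsigned long next_code[MAX_LEN + 1] = {0};
    unsigned long code = 0;
    int len, byte;

    for (byte = 0; byte < 256; byte++)      /* codes per length */
        count[lengths[byte]]++;
    count[0] = 0;

    for (len = 1; len <= MAX_LEN; len++)    /* first code of every length */
    {
        code = (code + count[len - 1]) << 1;
        next_code[len] = code;
    }
    for (byte = 0; byte < 256; byte++)      /* hand them out in order */
        if (lengths[byte])
            codes[byte] = next_code[lengths[byte]]++;
}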
Of course there's another improvement: adaptive Huffman. It works in a more complicated way than the static scheme; if you need an adaptive entropy coder you should check out an (unpatented) arithmetic coder instead.
Arithmetic coding is better than Huffman because Huffman assumes that the minimum length of a code is 1 bit, while arithmetic coding doesn't.
About adaptive Huffman, Charles Bloom told me: "Slow, bad compression." Of course feel free to implement it anyway - and feel free to implement Shannon-Fano too!
Closing words
A lot of effort has gone into this article, as you can see from the explanations and examples, so feedback is welcome.
"Walking blind nightly
kissing lovely
from the darkness
to the madness."
DARiO PhONG