Arithmetic Coding

paradox/vivid

Disclaimer

I'm no expert. Sorry if this has mistakes. You're welcome to email me if you want to report any. I'm sure there are at least a few.

Introduction

Hi all. This article is about arithmetic coding theory. I will explain why the algorithm works. I will also provide a quick overview of the algorithm, not suitable for those new to it, but perhaps clarifying the motivation and the most basic underlying theory.

Background

[ this section is a re-introduction to the algorithm ]

Arithmetic coding, as the name suggests, is a coding algorithm. The purpose of a coding algorithm is to express each symbol in the number of units given by its self information. The self information of symbol i is given by:

                          1
              S = log  -------
                        P(i)

Where P(i) is the probability of symbol i, and where logarithms are usually taken in base two, giving lengths in bits.
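To make that concrete, here is a throwaway Python sketch (the probabilities are purely illustrative):

import math

def self_information(p, base=2):
    """Self information of a symbol with probability p (bits when base is 2)."""
    return math.log(1.0 / p, base)

# Example, with made-up probabilities:
for symbol, p in [("A", 0.75), ("B", 0.25)]:
    print(symbol, self_information(p))   # A ~ 0.415 bits, B = 2.0 bits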

The reason this is desirable is that it results in the minimum expected representation of any file, given a model, where a model is something which generates a probability distribution for any file we give it.

We separate arithmetic coding from the idea of a model. A model provides us with a probability distribution; in arithmetic coding we are concerned only with writing out bits so that each symbol is represented in the number of bits given by its self information.

Arithmetic coding is the most important coding algorithm. It is the most widely used in the field of text compression, since its bounds on redundancy are very small.

The Idea

In arithmetic coding, the basic idea is something like this:

Since the interval [0, 1) contains an infinite number of numbers, and there are an infinite number of files, we should be able to map each file to a number inside [0, 1). Notice that we can always express a positive integer n in log2(n) bits, or the next integer above that. Now, if we could factor the statistical properties of our file into that mapping, so that the file corresponds to a subinterval of [0, 1) whose width is the file's probability, then the negative log of that width would be the ideal code length for the file.
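As a quick check of the "log2(n) bits, or the next integer above" observation, here is a throwaway Python snippet (the sample values are arbitrary):

import math

# An integer n fits in ceil(log2(n + 1)) bits; for n not a power of two this
# is exactly "the next integer above log2(n)".
for n in [5, 100, 1000]:
    bits = math.ceil(math.log2(n + 1))
    print(n, bits, len(format(n, "b")))  # the two counts agree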

The Algorithm

To perform arithmetic coding, we assign each symbol a subinterval of [0, 1) which is proportional to (as close as we can get it) its probability.

E.g. If we had two symbols A and B with probabilities P(A) = 0.75 and P(B) = 0.25, we could assign B the subinterval [0, 0.25) and A the subinterval [0.25, 1).

Now the coding step can begin: we repeatedly subdivide the subintervals to code the file.

E.g. Given A and B, as above, if we had the stream "BBA", we would first subdivide [0, 1) to [0, 0.25), since that is the subinterval denoted by 'B'. We would then subdivide the resultant subinterval, i.e. [0, 0.25), into its first quarter, since B occurs again, giving [0, 0.0625). Finally we would subdivide this into its last three quarters, giving [0.015625, 0.0625).

Given the subinterval corresponding to a file after all its symbols have been seen, we output *any* number within that subinterval as the code for that file.

This process is reversible: once we see that the number output lies in a certain subinterval, e.g. above for "BBA" it would be in [0, 0.25), we know the first symbol was 'B'. We can then continue to decode the entire file by undoing our subdivision. It is possible to prove that this process is reversible by showing it generates prefix free codes (which are always uniquely decodable). However, although we haven't done so rigorously, we can see that this algorithm is reversible.
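Here is a rough Python sketch of that decoding idea, using the toy A/B intervals from above, plain floating point, and the simplifying assumption that the decoder already knows how many symbols to recover (a real coder would transmit a length or an end-of-stream symbol):

# Toy model: symbol -> (low, high) subinterval of [0, 1).
INTERVALS = {"B": (0.0, 0.25), "A": (0.25, 1.0)}

def decode(code, n_symbols):
    """Recover n_symbols symbols from a number inside the file's final subinterval."""
    out = []
    for _ in range(n_symbols):
        for sym, (lo, hi) in INTERVALS.items():
            if lo <= code < hi:
                out.append(sym)
                # Undo the subdivision: rescale the code back to [0, 1).
                code = (code - lo) / (hi - lo)
                break
    return "".join(out)

print(decode(0.03, 3))  # 0.03 lies in [0.015625, 0.0625), so this prints "BBA"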

For a much more detailed and rich discussion of the above I recommend [1]. Another good explanation of the arithmetic coding algorithm (but not of why it works!) is [2]. Finally, if you don't want to splash out on these (really expensive) books I recommend [3]; it gives a gentle introduction to arithmetic coding with the necessary implementation details, though, once again, it omits why the algorithm works.

To subdivide the interval we use the following algorithm:

H = 1.0
L = 0.0
for each symbol
   R = H - L
   H = L + sH * R
   L = L + sL * R

Where H and L represent the high and low ends of the current subinterval, R is its width, and sL and sH represent the low and high ends of the subinterval assigned to the particular symbol being coded.

What this is essentially doing is subdividing the interval so that each subdivision gives a symbol a slice of the current interval in exactly the same proportions as the slice it was assigned on the original, undivided interval [0, 1).
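As a minimal illustration, here is that loop in Python, using the toy A/B intervals from earlier and plain floating point (a real coder uses fixed-precision integer arithmetic and renormalisation, which this sketch deliberately ignores):

INTERVALS = {"B": (0.0, 0.25), "A": (0.25, 1.0)}

def encode_interval(symbols):
    """Return the final [L, H) subinterval for a stream of symbols."""
    low, high = 0.0, 1.0
    for s in symbols:
        s_low, s_high = INTERVALS[s]
        width = high - low            # compute the current width first,
        high = low + s_high * width   # then place the new high and low
        low = low + s_low * width     # inside the old interval
    return low, high

print(encode_interval("BBA"))  # (0.015625, 0.0625), as derived by hand above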

The first thing we must notice is that after n events, H - L is the product of the probabilities of those n events; call the file containing them f.

It is simple to prove this. Consider two events, x and y, to which we assigned the subintervals [A, B) and [C, D) respectively.

We know that: P(x) = B - A, and P(y) = D - C

Our conjecture is that H - L = P(x) * P(y) after two steps of our coding algorithm. After the first step, H = B and L = A; after the second, H = A + ((B - A) * D) and L = A + ((B - A) * C), so:

H - L = A + ((B - A) * D) - (A + ((B - A) * C))
      = (B - A) * D - (B - A) * C
      = (B - A) * (D - C)
      = P(x) * P(y)

So we can see it is true.
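We can also sanity-check this numerically; here is a quick Python sketch with two arbitrarily chosen subintervals (the seed is only there to make the run repeatable):

import random

random.seed(1)
# Two arbitrary symbol subintervals [A, B) and [C, D) inside [0, 1).
A, B = sorted(random.random() for _ in range(2))
C, D = sorted(random.random() for _ in range(2))

low, high = 0.0, 1.0
for s_low, s_high in [(A, B), (C, D)]:   # two steps of the update
    width = high - low
    high = low + s_high * width
    low = low + s_low * width

print(abs((high - low) - (B - A) * (D - C)) < 1e-12)  # True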

If the file is i.i.d., the product of the probabilities of the events is simply the probability of the file, P(f) = H - L. We note that:

                            1
          length = log  ---------
                          H - L

is the ideal number of units (bits, nats or Hartleys) for that file.
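For the "BBA" example above, H - L = 0.25 * 0.25 * 0.75 = 0.046875, so the ideal length comes out at about 4.415 bits:

import math

width = 0.25 * 0.25 * 0.75      # H - L for "BBA" = P(B) * P(B) * P(A)
print(math.log2(1.0 / width))   # about 4.415 bits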

For the decoder to recover the file, we must output a number within [L, H). The question is: how many digits does this take?

This is simple, but not obvious. Suppose we have some number in the subinterval [L, H); call this number 0.rrr, where 0.rrr represents a number with an arbitrary number of digits specified. This number is within the subinterval as stated, so:

L <= 0.rrr < H

Now suppose we also have the first n digits of 0.rrr; call this number 0.xxx. If, no matter how many more digits we append to 0.xxx, the result is still inside [L, H), then there is no point in specifying any more digits: whatever actually comes next in 0.rrr makes no difference to which interval the number lies in. We want to find out when this happens, since this is the minimum amount of information we have to output to provide unique decodability.

If we assume all zeros follow 0.xxx (i.e. they are the next digits in 0.rrr), then this may be less than the true 0.rrr, and so may fall outside [L, H). If we assume all nines follow 0.xxx, then this may be greater than the true 0.rrr, and may fall outside [L, H) on the other side. But if both of these extensions of 0.xxx are inside [L, H), then any extension of 0.xxx is also inside, since 0.xxx with nines on the end is the largest any extension could be, and 0.xxx with zeros on the end is the smallest.

Hence, for every extension of 0.xxx to lie inside [L, H), we need:

0.xxx999 - 0.xxx000 < H - L 

Where 0.xxx000 represents an infinite number of zeros appended to 0.xxx, and similarly with 0.xxx999.

From our inequality above:

0.xxx999 - 0.xxx000 < H - L
=> 0.000999... < H - L
=> b^-n < H - L
=> -n < log(H - L)
=> n > -log(H - L)

Where b is the base we are working in (for the sake of example b was 10 above), n is the number of digits specified in 0.xxx (three in the example), and the logarithm is taken in base b. Note that 0.000999..., with the nines repeating forever, is exactly 0.001 = b^-n.

So we must specify more than -log(H - L) digits; -log(H - L) + 1 digits (bits, if we work in base two) are always enough. But H - L = P(f).

So,

n = -log(H - L) + 1
  = -log(P(f)) + 1

This is within one bit of the ideal code length (the self information) of f.
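As a sanity check on the "BBA" example, here is a small Python sketch that searches for the shortest binary prefix all of whose further digits keep the number inside the final interval; the helper bits_needed is mine, purely for illustration:

import math

def bits_needed(low, high):
    """Smallest n for which some n-bit binary fraction k / 2**n satisfies
    low <= k / 2**n and (k + 1) / 2**n <= high, i.e. every finite extension
    of those n bits stays inside [low, high)."""
    n = 1
    while True:
        k = math.ceil(low * 2 ** n)          # smallest n-bit numerator >= low
        if (k + 1) / 2 ** n <= high:
            return n
        n += 1

low, high = 0.015625, 0.0625                 # the final "BBA" interval
print(bits_needed(low, high))                # 5
print(-math.log2(high - low) + 1)            # about 5.415, so 5 bits is within the bound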

This is significant since we can compute the code in time linear in the length of f, rather than in exponential time, as is the case with Huffman coding over extensions of the source (see the next section). This is why arithmetic coding is useful.

Bounds

Call the average length of an arithmetic code for a source S, C(S). It is easy to show that (the 2 comes from the +1 above plus at most one more bit for rounding up to a whole number of bits):

H(S) <= C(S) < H(S) + 2

Now consider the following: we have a source S,

S = { s1, s2, s3, ..., sn }

And we consider all possible sequences of k symbols,

e.g. if S = { a, b, c } and k = 2

then S^k = { aa, ab, ac, ba, bb, bc, ca, cb, cc }

In general, if we have n symbols in the original source we will have n^k in the extended source. I'll write this as S^k; it is called the k-th extension of the source S. Note that S^k is not a number - it doesn't mean a number S raised to the power k.
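In Python terms, just to pin the notation down:

from itertools import product

S = ["a", "b", "c"]
k = 2
S_k = ["".join(block) for block in product(S, repeat=k)]
print(S_k)       # ['aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
print(len(S_k))  # 9 == 3**2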

We can establish this bound:

H(S^k)    C(S^k)    H(S^k) + 2
------ <= ------ <  ----------
   k        k            k

The division by k is because we want the entropy and average code length of the code per symbol - not per block of k symbols.

A simple property of entropy is that H(S^k) = kH(S) when the symbols in a block are independent. This is because a choice from n^k ('^' denotes exponent here) different blocks can be made as k successive choices, each from the n symbols of the original source, and the information in independent choices simply adds.
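Here is a quick numerical check of H(S^k) = kH(S), assuming blocks of independent symbols and some made-up probabilities:

import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs)

probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # illustrative source probabilities
k = 2
# Block probabilities for the k-th extension, assuming i.i.d. symbols.
block_probs = [math.prod(probs[s] for s in block)
               for block in product(probs, repeat=k)]

print(entropy(block_probs))         # ~ 2.971
print(k * entropy(probs.values()))  # ~ 2.971 as well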

So:

kH(S)     C(S^k)    kH(S) + 2
-----  <= ------ < ----------
  k         k           k

           C(S^k)           2
=> H(S) <= ------ <  H(S) + -
             k              k

For the k-th extension of the source S. The convergence to optimality is slower than for Huffman coding over the same extension, which was to be expected, since Huffman is optimal. Of course, extensions are really a theoretical device and are never used directly in practice; Huffman coding is completely intractable in exactly the regime where its bound beats arithmetic coding, because it must build a code over all n^k blocks. Arithmetic coding achieves almost exactly what optimal Huffman coding over extensions does, in linear rather than exponential time.

Conclusion

I haven't addressed any of the implementation details here. The implementation details of arithmetic coding are very important, and non-trivial.

As always, send (almost) anything to: nnash@stokes2.thphys.may.ie

Byeee...

--paradox

References

[1] Introduction To Data Compression, second edition, by Khalid Sayood, published by Academic Press, 2000. Available at amazon bookstores.

[2] Text Compression by Bell, Cleary and Witten, published by Prentice Hall, 1990. Available at amazon bookstores.

[3] Arithmetic Coding + Statistical Modelling = Data Compression by Mark Nelson, Dr. Dobb's Journal, February 1991. Available at www.dogma.net/markn.