Terminology for data compression

paradox/vivid

Introduction

Hi all, this is just an article I'd advise looking at before attempting to read my data compression articles. I've found that a lot of people don't know the basic terminology of data compression, and as a result they can't follow articles about it.

Terminology

(i) Probability

We start with the idea of a probability space, in which each event, w, is assigned a particular value Pr(w), called its probability. A probability must be a real number between 0 and 1, inclusive, and the probabilities in a particular space must sum to one.

(ii) Probability Distribution

A probability distribution is a function, f, which maps each event w in a probability space to its probability: f(w) = Pr(w).
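
As a small illustration (made up for this article, not tied to any particular compressor), a distribution over four symbols might look like this in Python:

    # Each value is Pr(w): every probability lies between 0 and 1,
    # and together they sum to one.
    distribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    assert all(0.0 <= p <= 1.0 for p in distribution.values())
    assert abs(sum(distribution.values()) - 1.0) < 1e-9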

(iii) Sample space

Strictly speaking, a sample space is the set of all possible events; in these articles the term is used interchangeably with probability space.

(iv) Frequency

In data compression, the frequency of a particular symbol is the number of times it has occurred up to the current position in the stream. The total frequency of a symbol is the number of times it occurs in the data stream as a whole.

(v) Data compression probability definitions

The actual probability of a symbol in the data stream is the total frequency of the symbol divided by the total length of the data stream. What I call the "adaptive" probability of a symbol is its frequency so far divided by the number of bytes read from the data stream so far.
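
A minimal Python sketch of these two definitions, using a made-up stream:

    data = b"abracadabra"      # example data stream
    symbol = b"a"

    # Actual probability: total frequency / total length of the stream.
    actual = data.count(symbol) / len(data)     # 5 / 11

    # "Adaptive" probability after reading the first n bytes:
    # frequency so far / number of bytes read so far.
    n = 5
    adaptive = data[:n].count(symbol) / n       # 2 / 5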

Contexts

(i) What they are

A context is defined to be a set of bytes coming immediately before another byte or set of bytes. For example, look at the English text:

		     "This is a test."

The context of the 's' in "This" is "Thi", the context of the 'i' in "This" is "Th", and so on for the rest of the text. Symbols may also be assigned adaptive and actual probabilities based on their contexts.

(ii) The Order of a context

A context is said to have an order, which is essentially the length of the context. For example, above, the context before the 's' is "Thi", which is three symbols long, so it is called an order-3 context. If we took just the "hi" before the 's', that would be an order-2 context, and you can also have order-1 contexts, which are just a single preceding letter. An order-0 context is no context at all: zero previous symbols are taken into account when encoding a symbol, so all symbols are treated as independent. An order-(-1) context simply assigns a fixed probability to each symbol; the probability does not vary and is typically 1/256 for a file of 8-bit symbols. (See the sketch below.)
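
Here is a small Python sketch of reading off contexts of different orders from the example text (the context() helper is made up for this article):

    text = "This is a test."

    def context(position, order):
        # The order-N context of the symbol at 'position' is the N symbols
        # immediately before it; order 0 means no preceding symbols at all.
        return text[max(0, position - order):position]

    i = text.index('s')      # the 's' in "This"
    print(context(i, 3))     # "Thi" -> order-3 context
    print(context(i, 2))     # "hi"  -> order-2 context
    print(context(i, 1))     # "i"   -> order-1 context
    print(context(i, 0))     # ""    -> order-0: no context, symbols independent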

"Order-n" is another term for an order-(-1) context; the "n" stands for negative.

Entropy

(i) The Entropy of a symbol, short definition

Entropy is the smallest number of bits a symbol can be losslessly encoded into, given that the symbols have no correlations (on average, of course!)

For a symbol, A, it is -log2(Pr(A))

(reads: minus the log in the base two of the probability of the symbol 'A')

The entropy of a whole stream is the expected value of this quantity over all symbols: H = -sum over all symbols A of Pr(A)*log2(Pr(A)). No lossless code can use fewer bits per symbol than this, on average.
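
A minimal Python sketch of both the per-symbol cost and the expected value (the "abracadabra" stream is just a made-up example):

    from collections import Counter
    from math import log2

    data = b"abracadabra"
    counts = Counter(data)
    total = len(data)

    # Entropy of the single symbol 'a': -log2(Pr('a')).
    pr_a = counts[ord("a")] / total
    print(-log2(pr_a))        # ~1.14 bits

    # Entropy of the whole stream: -sum of Pr(A) * log2(Pr(A)) over all symbols.
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    print(entropy)            # ~2.04 bits per symbol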

Types Of Encoders

LZ Encoders

These are string-matching encoders: they look for repeated runs of characters. For example, in the previous sentence they would find the word "matching", and a later occurrence of that word could be encoded as a short reference back to it.
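
As a rough illustration, here is a toy Python sketch of a greedy match finder (the function name and parameters are made up for this article, and real LZ implementations are far more sophisticated):

    def lz_tokens(data, window=4096, min_match=3):
        # Emit either a literal byte or an (offset, length) reference to an
        # earlier occurrence of the same bytes inside the search window.
        i, out = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (j + length < i
                       and i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_match:
                out.append(("match", best_off, best_len))
                i += best_len
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    print(lz_tokens(b"string matching encoders match strings"))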

Statistical Encoders

These work on a symbol basis: they try to encode a stream of text symbol by symbol. In terms of compression ratio, these compressors have generally been more successful than LZ encoders.

(i) Huffman

Huffman coding expresses each symbol in a whole number of bits, as close to its entropy as possible.
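
A compact Python sketch of deriving Huffman code lengths from symbol frequencies (the function name and heap-based approach are just one way to sketch it, not a production coder):

    import heapq

    def huffman_code_lengths(freqs):
        # Repeatedly merge the two least frequent nodes; every merge adds one
        # bit to the code length of every symbol inside the merged node.
        heap = [(f, n, {sym: 0}) for n, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)
            f2, _, b = heapq.heappop(heap)
            merged = {s: bits + 1 for s, bits in {**a, **b}.items()}
            heapq.heappush(heap, (f1 + f2, n, merged))
            n += 1
        return heap[0][2]

    # Frequencies of the symbols in "abracadabra":
    print(huffman_code_lengths({"a": 5, "b": 2, "r": 2, "c": 1, "d": 1}))
    # -> 'a' gets a 1-bit code, the other symbols get 3-bit codes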

(ii) Arithmetic Encoding

Arithmetic encoding can express symbols in non-integer numbers of bits (where needed), which gets them much closer to their entropy.
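
A full arithmetic coder is beyond this article, but a small Python sketch can show why fractional bits matter; the frequencies are the "abracadabra" counts again, and the Huffman lengths come from the sketch above:

    from math import log2

    freqs = {"a": 5, "b": 2, "r": 2, "c": 1, "d": 1}
    total = sum(freqs.values())

    # Ideal cost of a symbol is -log2(Pr(A)) bits, usually not a whole number.
    # An arithmetic coder can get very close to this; Huffman can never spend
    # less than one whole bit on a symbol.
    ideal = sum(f * -log2(f / total) for f in freqs.values())
    huffman = 5 * 1 + (2 + 2 + 1 + 1) * 3

    print(ideal)      # ~22.4 bits for the whole stream
    print(huffman)    # 23 bits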

Finally, combinations of statistical and LZ encoding schemes can give good results.

Conclusion

Hopefully you will be able to read my articles with ease now :)

PvAiRvAiDdOX