Compression Ramblings

paradox/vivid

Introduction

Here are some compression ramblings for you to read; no amazing algorithms are presented, just a few things I've thought about.

I've included the source code to some routines.

True Context Orders

Storing almost any information about every order-4 or higher context is impossible on most PCs at the moment (just wait 5 years...). As a result we cannot store information about their positions of occurrence in the data stream.

However, various compromises have been used; one, for example, is hashing. The way I use "true" context orders is very simple, and I'm sure it is nothing new. We record the position of a sequence of bytes each time it occurs, and then, when we are trying to find a context matching the current one, we search through the data using these recorded positions to speed things up. This is quite simple; the diagram below should explain it:

	 [ "hello John", I suddenly heard, and then replied, "hello Paul" ]
       |       |	       |	   |		 |
     record   record	      record	  record	record
    position  position	     position	 position      position
								\ look for
							 matching context now.

The algorithm would search after all occurrences of 'h' for a matching context; it would eventually find the "hello" as a matching context, if the order of the context was high enough.

This method is suitable because storing the position of a byte in the data stream takes only 4 bytes (a 32-bit integer), so storing 1024 positions for each byte value uses only 1MB of memory.
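Roughly, in C++ (this is only a minimal sketch of the idea, with made-up names like MAX_POSITIONS and CONTEXT_ORDER, not the exact code from the included sources):

#include <vector>
#include <cstddef>
using std::size_t;

const size_t MAX_POSITIONS = 1024;   // positions kept per byte value
const size_t CONTEXT_ORDER = 5;      // longest context we try to match

std::vector<size_t> positions[256];  // positions[b] = where byte b has occurred

// Standing at 'pos' (e.g. at the 'h' of the second "hello" above), scan every
// recorded occurrence of that byte and keep the longest earlier string that
// matches the bytes from here on. Positions are recorded AFTER searching at a
// position, so every entry lies before 'pos' and we never match ourselves.
size_t findMatchingContext(const unsigned char *data, size_t size,
                           size_t pos, size_t &bestLen)
{
    bestLen = 0;
    size_t bestPos = 0;
    const std::vector<size_t> &occ = positions[data[pos]];

    for (size_t i = 0; i < occ.size(); ++i) {
        size_t cand = occ[i];                 // an earlier occurrence of this byte
        size_t len = 1;                       // that byte itself already matches
        while (len < CONTEXT_ORDER && pos + len < size &&
               data[cand + len] == data[pos + len])
            ++len;
        if (len > bestLen) { bestLen = len; bestPos = cand; }
        if (bestLen == CONTEXT_ORDER) break;  // can't do any better, stop early
    }
    return bestPos;
}

// Called for every byte once we have finished searching at its position.
// For simplicity we just stop recording after MAX_POSITIONS entries; a real
// window-based version would recycle the oldest ones.
void recordPosition(const unsigned char *data, size_t pos)
{
    std::vector<size_t> &occ = positions[data[pos]];
    if (occ.size() < MAX_POSITIONS)
        occ.push_back(pos);
}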

Above, I said we record the position of a sequence of bytes each time it occurs. The length of this sequence of bytes (L) is, in practical circumstances, optimal at 1 byte; if L is greater than this, we cannot store as many occurrences. Here is a semi-formal proof:

Assume: We are storing the maximum number of order-1 positions.

Proof:

If we store order-2 positions in favour of order-1 positions, the number of positions we can store per context is 256 times smaller (there are 256 times as many order-2 contexts to cover).

In storing 256 times as many order-1 positions we would encompass all these order-2 positions and more, and hence have a better opportunity to find a matching context.

Furthermore, in storing order-2 positions in favour of order-1 positions the number of bytes stored about the data stream is:

        2L         L
       -----  =  -----
        256       128

By basic axioms of information theory, since we are storing less information, it will be less advantageous.

It also follows from that proof that storing positions of bits instead of bytes would be more advantageous with respect to information content. That is true, because we could store 8 times more occurrences, but it would be much, much slower and totally impractical. Thus it is justified to say that 8 bits is "optimal" for this kind of searching (it is sub-optimal with respect to information content).

We usually store a large number (> 1000) of occurrences of each byte, and this is very, very effective at finding the highest-order matching context.

This searching algorithm goes really, really slow on long runs of the same byte, just like BWT does (if quicksort is used). It's pretty simple why it goes so slow; the runtime can be represented by an arithmetic series, look -

Data stream: [ aaaaaaaaaaaaaaa ]

On the first occurrence of 'a' no searches are performed, on the second one is performed, etc. The number of searches performed is represented by the arithmetic progression 0, 1, 2, 3, ... i.e.:

     Tn = a + d(n - 1)

where a = 0, d = 1.

The total number of searches performed for a given data stream containing repetitions is simply the sum of this series,

                         n(n - 1)
     Total Searches  =  ----------
                             2

Which is the (n-1)th triangular number, btw, and hence the worst case is O(n²), just like qsort. Note that the total number of searches performed on some data is always increasing in an arithmetic series, but usually not this fast; the rate of change of Tn with respect to the number of bytes processed is much lower.

Remember that for each search there is at least 1 data comparison, but in the worst case, there will be (where CO is the context order):

     CO(n² - n)
     ----------
          2

Pretty damn slow. Running an RLE encoder over the data first fixes this problem.
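Something along these lines is enough (my own sketch, not one of the included programs): after two equal bytes, a count byte says how many more copies follow, so a run of thousands of 'a's collapses to a handful of bytes before the transform ever sees it.

#include <cstdio>

void rleEncode(FILE *in, FILE *out)
{
    int prev = EOF, c;
    while ((c = fgetc(in)) != EOF) {
        fputc(c, out);
        if (c == prev) {                 // second byte of a pair: emit a run count
            int run = 0, next = EOF;
            while (run < 255 && (next = fgetc(in)) != EOF && next == c)
                ++run;
            fputc(run, out);             // how many further copies of c follow
            if (run < 255 && next != EOF) {
                fputc(next, out);        // the byte that ended the run
                prev = next;
            } else
                prev = EOF;              // start fresh after a full or final run
        } else
            prev = c;
    }
}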

Finally, let's talk about memory consumption. For excellent performance, around 4000 order-1 positions per byte should be used, so that's 256 * 4000 * 4 bytes of memory; round that up to 4096 positions and it comes to 4MB, which is really quite a lot. Unless the input is very large, not all of these 4000 occurrences of each byte will be used, so we should only allocate what we need, which usually results in a lot less than 4MB since I worked with a window based encoder.

In short, this straightforward method is excellent for detecting high order contexts using small amounts of memory; it is quite slow, however. It can be adjusted to trade off speed against compression, though.

Basic Symbol Difference Encoding

The basic idea in non-order-0 symbol difference encoding is to define a maximum context order; then, on the occurrence of a byte, its context is sought in the earlier data stream, up to this maximum length (although smaller contexts are accepted), and having found the context, the current byte is encoded as the difference between the predicted byte and itself.

Example:

	 [ "hello John", I suddenly heard, and then replied, "hello Paul" ]
       ÄÄÄÄÄ		  Ä		    Ä		 ÄÄÄÄÄ

Consider the above example: if the maximum context order was 5, the "hello"s in the data stream would be matched, and the space character predicted. Also, the 'e' would correctly predict an 'n' at one point. Note that only the contexts of the successful predictions are marked above.

What this results in is lots of zero-bytes, making the data more compressible.

The encoder and decoder for this transformation are extremely similar with only one or two lines of code different.
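Here is a sketch of the difference step itself (an assumed form, not the exact FSDE code); predictByte() stands for whatever byte the context search predicts from the data seen so far, and here it just falls back to predicting the previous byte:

#include <cstddef>
using std::size_t;

// Stand-in predictor: predict the previous byte (the simple consecutive-
// difference case discussed below). The real thing would use the context
// search described earlier; it must only look at bytes before 'pos'.
unsigned char predictByte(const unsigned char *history, size_t pos)
{
    return pos > 0 ? history[pos - 1] : 0;
}

void sdeEncode(const unsigned char *in, unsigned char *out, size_t size)
{
    for (size_t pos = 0; pos < size; ++pos) {
        unsigned char p = predictByte(in, pos);    // predict from the original data
        out[pos] = (unsigned char)(in[pos] - p);   // store the difference (mod 256)
    }
}

void sdeDecode(const unsigned char *in, unsigned char *out, size_t size)
{
    for (size_t pos = 0; pos < size; ++pos) {
        unsigned char p = predictByte(out, pos);   // same history: bytes restored so far
        out[pos] = (unsigned char)(in[pos] + p);   // the changed line: add, not subtract
    }
}

A correct prediction gives a zero byte; a wrong one gives the random-looking difference discussed below.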

An arithmetic encoder is then applied, and it compresses the transformed data better than it would the original (in most circumstances, for our kind of data).

Slightly Deeper Analysis

It is true that the difference encoding results in lots of zero bytes; however, when a prediction is incorrect we get a random byte, and random data is incompressible. Consider order-0 SDE (symbol difference encoding).

After the Burrows-Wheeler transform is performed on data, typically an MTF encoder is applied. Move-To-Front has some interesting aspects, for example:

Input:          [ ...aaaaaaxaaaxaa... ]
                 /                    \
    Move-To-Front:                     Symbol Difference Encoding:

[ ...000000?100110... ]                [ ...000000**00**0... ]

Totals: 9 zero bytes.                  Totals: 9 zero bytes.
        3 one bytes.                           4 random bytes.
        1 '?' byte.

The '?' in the MTF column represents the fact that we don't know the last time 'x' occurred before this window; it's there because I couldn't be bothered making up the preceding data.

Looking at the output, both had the same effectiveness at decreasing the zero byte's entropy. However, MTF also mapped the 'x's and the bytes after them to low codes (the 1s), decreasing their entropy too, whereas order-0 SDE simply introduced random bytes in these places.
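For reference, Move-To-Front itself is tiny (this is the standard algorithm, not code from the included sources):

#include <cstddef>
using std::size_t;

void mtfEncode(const unsigned char *in, unsigned char *out, size_t size)
{
    unsigned char table[256];
    for (int i = 0; i < 256; ++i)
        table[i] = (unsigned char)i;          // initial symbol order 0..255

    for (size_t pos = 0; pos < size; ++pos) {
        unsigned char c = in[pos];
        int rank = 0;
        while (table[rank] != c)              // find the symbol's current rank
            ++rank;
        out[pos] = (unsigned char)rank;       // recently seen symbols give small codes
        for (int j = rank; j > 0; --j)        // move it to the front
            table[j] = table[j - 1];
        table[0] = c;
    }
}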

Note that the term random is used correctly: the difference between two consecutive bytes of arbitrary data is entirely random under an order-0 model. In BWTed data this only applies to bytes which sit in the middle of a localised area, like the 'x's in the block of 'a's; I mean bytes which have been incorrectly predicted.

It is not altogether that surprising that MTF gets better results: it stores much more information than SDE. SDE stores 1 byte of information, while MTF stores a 256-byte table.

It seems likely that we can develop an equivalent version of MTF for our transformation.

Developing an MTF Equivalent

Consider adding an equivalent form of MTF to our transformation algorithm. It is not that simple, because our algorithm doesn't store information like the last context that predicted a certain byte. In fact, we would need to store this information, and BWT DOES store it, in a very special way: it localises the data into blocks which are predicted by the same context, and the MTF encoder can work away perfectly!

Let me explain further: a BWT data stream tells us that all the bytes localised together had the same context. If we see an 'a' we know it had the same context as the 'a' beside it.

Note: the above is only true in theory; no two bytes have exactly the same context in BWT, but it is the idea behind it.

We could only attain this information by creating a "natural order" version of BWT. We can even see an extremely probable way that BWT was discovered: its inventors were faced with the problem we are facing, they realised that to exploit the redundancy in the way they wanted they had to store which context predicted which bytes, and they came up with a much better way of doing it - block sorting.

Ok, so our MTF development failed. We could do it, but we'd get a really slow compressor which needed lots of memory - basically it'd suck.

There follow two ideas for getting fast AND good compression, possibly better than BWT.

Optimising Things for Speed, Compression, and Memory

We developed a really simple string searching algorithm which records occurrences of bytes to find a matching string; if we wanted to search for a string of length n, it would take this number of data comparisons:

		f(s[1]) + f(s[2]) + ... + f(s[n])

Where f(x) is the frequency of symbol x in the data stream. If we had the string "hellothere", a better way to store the information than we currently do is as follows:

h [ e, e ]
e [ l, r ]
l [ l, o ]
o [ t ]
t [ h ]
r [ e ]

See? Now we only store the exact number of occurrences we need. I implemented this in the transform emfsde.cpp, which uses exact memory; it's *lots* less than usual. Note something, though: we CAN'T use this method to untransform the data. It's obvious why: we don't know which symbols should have lots of space allocated for their occurrences and which should have only a little. So we'd have to do something horrible like send the frequencies of the symbols with each block, or allocate memory as the compressor advances... hum... ok.
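The allocation idea, in a rough sketch (my own illustration of it; the real thing is in emfsde.cpp and may differ): count the frequencies first, then hand each byte value exactly that many position slots.

#include <cstddef>
#include <cstdlib>
using std::size_t;

struct ExactIndex {
    size_t *slots[256];   // slots[b] has exactly freq[b] entries
    size_t  used[256];    // how many entries are filled so far
};

void buildExactIndex(ExactIndex &ix, const unsigned char *data, size_t size)
{
    size_t freq[256] = { 0 };
    for (size_t i = 0; i < size; ++i)       // first pass: frequencies
        ++freq[data[i]];

    for (int b = 0; b < 256; ++b) {         // allocate exactly what is needed
        ix.slots[b] = (size_t *)std::malloc(freq[b] * sizeof(size_t));
        ix.used[b]  = 0;
    }

    for (size_t i = 0; i < size; ++i) {     // second pass: record positions
        unsigned char b = data[i];
        ix.slots[b][ix.used[b]++] = i;
    }
}

This is also exactly why the decoder can't do the same thing: it would need freq[] before it has the data.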

Now what? Well, using our string searching algorithm we could try an LZ based encoder, LZP if you like: find an order-n context, use it as a match, and that way we'd have nice order-4 contexts or more. This compressor is significantly faster than the crappy symbol difference encoder example (both can be adjusted of course; try using FSDE.EXE with CONTEXT_ORDER at 4 and MAX_PROBES at 25, not so slow eh?).

The compression rates are quite good for its simplicity but nothing special.

I haven't yet tried to improve this compressor. An obvious way is to do what can be done for LZP implementations: take into account the last x occurrences of an order-n context instead of just the last one, and send to the decoder which one was used (unlike the key based encrypter later on!). If we use the last 64 occurrences of 2-byte sequences to compress with, we can get better compression than standard LZP (note that 64 occurrences needs lots of memory), and we can write which occurrence we took in just 6 bits. Doing standard LZP that way gives compression similar to using "my" "virtually true context order" method; I think combining the two could give better compression.
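Something like this is what I have in mind for the "last 64 occurrences" variant (again only a sketch with made-up names, nothing from the included sources):

#include <vector>
#include <cstddef>
using std::size_t;

const int HISTORY = 64;                          // occurrences kept per context

// recent positions of each order-2 context, most recent last;
// after processing position p (p >= 2) we do:
//   contextPos[(data[p-2] << 8) | data[p-1]].push_back(p);
std::vector<size_t> contextPos[65536];

struct Match { size_t length; int which; };       // which = index into the history

Match findBestMatch(const unsigned char *data, size_t pos, size_t size)
{
    Match best = { 0, 0 };
    if (pos < 2) return best;

    unsigned ctx = (data[pos - 2] << 8) | data[pos - 1];
    const std::vector<size_t> &occ = contextPos[ctx];

    for (int i = 0; i < (int)occ.size() && i < HISTORY; ++i) {
        size_t cand = occ[occ.size() - 1 - i];    // i = 0 is the most recent occurrence
        size_t len = 0;
        while (pos + len < size && data[cand + len] == data[pos + len])
            ++len;
        if (len > best.length) { best.length = len; best.which = i; }
    }
    // best.which fits in 6 bits because HISTORY is 64
    return best;
}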

Anyway... enough rambling.

An Algorithm for Key Based Cryptography

I've never tested this:

There is a very simple way to make encrypted data using an algorithm similar to the one above, and depending on one number we can change in the encrypter it is more or less secure. Typically the number is about 500 for high speed, and depending on the size of the data it gets raised to a power of about 10,000 or even 100,000. This new number represents the number of combinations needed to decode the data, and even if a computer could try one every second it would take millions of years, and in practice trying a combination would take much longer. Also note that this algorithm will compress the data extremely well and operate at a good speed. It does have disadvantages however, especially on non-redundant data.

The algorithm works as follows: first a number, N, is defined; then the encryption/compression can begin:

A byte is read from the input file and the last N occurrences of it are searched; if data matching the data after it is found at one of these N occurrences, a match length is written. The algorithm advances in the input by the length of the match and repeats this process.

That algorithm compresses (or encodes, or encrypts) the data in a LOSSY way: we do not know which occurrence of the byte led to a match. A key is generated as the data compresses, which consists of which occurrence was used for each match in the data. If we set N to 1024, the key would be 10 bits times the number of matches long, and even sending it WITH the data would result in compression (for most of the data we would be using).
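A very rough sketch of one encoding step, with my own names (untested, like the rest of this idea): the match lengths go to the receiver, the occurrence indices go into the key.

#include <vector>
#include <cstddef>
using std::size_t;

const int N = 1024;                               // occurrences kept per byte (10-bit key entries)

std::vector<size_t> lastPos[256];                 // recent positions of each byte value,
                                                  // appended to as positions are passed (not shown)

void encodeStep(const unsigned char *data, size_t size, size_t &pos,
                std::vector<size_t> &lengthsOut,  // goes to the receiver
                std::vector<int> &keyOut)         // kept secret
{
    unsigned char c = data[pos];
    const std::vector<size_t> &occ = lastPos[c];

    size_t bestLen = 0;
    int bestIndex = 0;
    int tries = (int)occ.size() < N ? (int)occ.size() : N;
    for (int i = 0; i < tries; ++i) {             // search the last N occurrences
        size_t cand = occ[occ.size() - 1 - i];
        size_t len = 0;
        while (pos + len < size && data[cand + len] == data[pos + len])
            ++len;
        if (len > bestLen) { bestLen = len; bestIndex = i; }
    }

    // The very first occurrence of a byte value would have to be sent as a
    // literal; that detail is skipped here.
    lengthsOut.push_back(bestLen);   // without the key this doesn't say WHERE the match was
    keyOut.push_back(bestIndex);     // 10 bits per entry when N = 1024
    pos += (bestLen > 0) ? bestLen : 1;
}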

Isn't all this great: we have a combined encrypter/compressor which is as secure as RSA, AND it is not open to statistical attacks because it generates random-looking data. Well, unfortunately it's not as good as that. On very small files it will generate easily decodable/decryptable output. It will also generate easily decodable output on random files and files with little redundancy. Having said that, it is still quite powerful; the above applies to compressors too, yet they are still very important over the internet, so I conclude that this algorithm is also useful and has some very good properties, mainly:

(i) It generates compressed output.
(ii) It can be tuned for speed/security.
(iii) It is "perfectly" secure.

Let's lay encryption aside for a moment and think about using this lossy method for compression. I have one method which I'm not sure about, because I don't know exactly how Cyclic Redundancy Check (CRC) algorithms work, but I've heard people say they can detect a modification of the data with only about a 1 in 4 billion chance of it NOT being detected. So we could encode the data in the lossy way mentioned above, send the CRC of the decoded data, and then try all possibilities until the CRC is satisfied; we'd have amazing compression with only about a 1 in 4 billion chance of going wrong. The only problem I can see is TIME! It's the same as the encryption algorithm, millions of years; maybe a variant will work.

Source Code

I have included source code to a windowed version of the transformation algorithm; it processes the data in blocks (about 400K is good). There are one or two features which are added to speed up the algorithm further. There is EMFSDE, which is an exact-memory version of FSDE (just look at the memory allocation differences); don't try decoding with EMFSDE. There are some other utilities: entropy.cpp, which computes the entropy of a file, and test.cpp, which gives some info about a file (intended for FSDEd files).

Conclusion

I hope you enjoyed this article!
Feedback??

paradox slash vivid