Lz77, The Basics Of Compression

Dario Phong

Table of contents

 - Introduction
 - Theory
 - First Implementation
 - Working with bits
 - Some ASM instructions
 - Pseudo-code
 - How to improve it
 - Some greetings
 - Text to compress
 - Closing words
 - Contacting the author

Introduction

In this article, you will learn the basics of compression. I will deal with Lz77, a simple algorithm for a good and fast compression.

Maybe you are asking, "Where are the sources codes?" There are no source codes. I will not teach you how to cut'n'paste a piece of lame C code, 'coz a lot of lamers can do that, and I don't like that. So in this article you will think and implement the algorithm in any language you want, though I recommend Assembler, and when it's finished you can improve it, and this will be YOUR compressor.

If you have already implemented the lz77 go directly to: "How to improve it."

Theory

In 1977 and 1978 Abraham Lempel and Jacob Ziv discovered the Lz compression algorithm. What's the theory? It's very simple and intuitive. When you find a match of enough bytes, instead of writing those bytes you write the offset and the length of the repetition: where is it and how long is it. This is a scheme with dynamic dictionary, and the offset/length pair is called codeword.

First Implementation

Imagine you are compressing a text: "ab ab". You read the first part, "ab ", and write it uncompressed, then you read "ab" and write that two bytes are repeated from offset 0. And decompression? You first read "ab " and then the offset and length, and you copy the bytes from there, so: "ab ", copy 2 bytes - "ab ab".

But how can the decompressor know if there is an offset/length pair or an uncompressed byte? Simple, with a prefix. If it's a 0, then it's an uncompressed byte, if it's 1, then it's an offset/length pair.

Working with bits

But wait a minute, if I only have to write 0 or 1, why should I waste a byte on it? You can use a bit instead. Well, you must use bits if you want good compression.

But if all the operations work with bytes and when you save info to a file there are bytes, how do I use the bits? With a clever use of some instructions. For this I will use ASM, but it can be done in C, too. If you don't know ASM, learn it! If you don't want to learn it, read the articles about 'Lzss' and 'lzp' from Hugi #12 (http://www.hugi.de/). Be sure to download it, it's very cool.

Well, let's continue with the bit operations in ASM.

The main idea is to keep a byte and a counter with the bits written, and when you have written 8 bits, you write that byte and start again with another byte. I will use some instructions, be sure to read the explanation of them in the section 'some ASM instructions'.

Here is the main idea of the put_bits, don't copy it, rewrite it and understand it!

 @@put_bits_loop:
      push    cx                      ;the number of bits to write
      mov     bh,_byte_out            ;the output byte (where to write)
      mov     bl,_byte_in             ;the input byte (the bits to write)
      mov     al,bl                   ;we store the byte to read from al
      shr     al,1                    ;we shift al to the right, so we leave
                                      ;the first bit in the carry flag
      xor     ah,ah                   ;set ah=0
      adc     ah,0                    ;we add 0 to ah and the carry flag, so
                                      ;if the carry flag was 1, then ah=1
      mov     bl,al                   ;save the input byte
      mov     cl,_total_bits          ;the bits that we have written
      shl     ah,cl                   ;put the bit in its position by
                                      ;shifting it to the left
      or      bh,al                   ;put the bit in the output byte
      mov     _byte_out,bh            ;save it
      inc     _total_bits             ;the bits written
      cmp     _total_bits,8           ;Have we written the whole byte?
      jne     @@no_write_byte         ;nope E-)
      mov     di,ptr_buffer           ;the pointer to the buffer
      mov     es:[di],bh              ;save the byte (es is the segment of
                                      ;the buffer)
      inc     di                      ;next byte in the buffer
      mov     ptr_buffer,di           ;save it for the next time
      inc     bytes_writed            ;when the buffer its full write it to
                                      ;a file or something like that
      mov     _byte_out,0             ;so the next time is clear
 @@no_write_byte:
      pop     cx                      ;we saved it
      dec     cx                      ;more bits to write?
      jnz     @@put_bits_loop         ;yes, repeat everything

Well, it's done. As I mentioned I don't like spreading source code, but I thought that was the best way to understand it.

The names of the variables are self-explanatory, but anyway:

 - _byte_out  -> the byte that will be written to the output buffer;
                 it stores the bits that we are currently writing
 - _byte_in   -> the byte that stores the bits that we want to write
 - total_bits -> the number of bits written so far, at start 0
 - ptr_buffer -> if you are under real mode, it stores the offset to the
                 buffer and es the segment; if you are in pmode it stores
                 the whole offset

When you enter this routine cx must have the number of bits to write, and _byte_in the bits to write. Be careful, after entering the loop test if cx is 0 'coz if it is and you don't test it, you will write 1 bit, then decrement cx, so that it's 255, and then you will write 255 bits! So remember:

        test    cx,cx                   ;i will NOT explain that ;-D
        jz      @@put_bits_end

This is the 'structure' of a byte:

        00000000b
               |
               bit 1

When you have written all the bits (ex.: the compression is over) then you have to test if any bits are waiting for being written, and if there are any (total_bits!=0), you write the _byte_out and increment all the pointers.

Note that this function will fail if you pass more than 8 bits to it 'coz it takes the input bits in a byte, not in a word. So if you want to write more than 8 bits, first write the low-part (8 bits) and then the rest. To write and to read it, you should use 'and', 'shl', and 'shr'.

Now you need the get_bits function. Hey I explained the basic operations with bits, you should now be able to make that function yourself, happy coding! E-)

Some ASM instructions

Mov: Do you really think I will explain this one? ;-D

Shr Shift (move) the bits to the right:

         shr     al,1    ;first al=11100010b then al=01110001b (cf=0)
         shr     al,2    ;first al=01011010b then al=00010111b (cf=1)

The last bit shifted will go to the carry flag. Instead of a direct value you may use cl, and only cl.

         mov     cl,2
         shr     al,cl   ;this will shift al by 2

Note that the 'new' bit is filled with 0.

Shl: The same but to the left. (The carry flag is changed, too.)

         shl     al,2    ;first al=01001011b then al=00101100b (cf=1)

Adc: Add with carry. This adds to any register any value and the content of the carry flag too.

         adc     ah,0    ;first al=0 (cf=1) then al=1
         adc     ah,0    ;first al=0 (cf=0) then al=0
         adc     ah,3    ;first al=1 (cf=1) then al=5

Or: This performs a bit operation. There is the table: - 0+0=0 - 0+1=1 - 1+1=1 -

         or      al,00001111b    ;first al=0, then al= 00001111b
         or      al,00001111b    ;first al=10110010, then al= 10111111b

Maybe you knew how these instructions worked; if you didn't, then learn it.

Pseudo-code

So go now and first of all get the basics of your program, getting the parameters, the work with the files, and do some test with the bit operations, save it to a file, read them, and test if the returned bits are the right ones.

Oki, now I will assume that you have already done all this work. This is how the file will be saved:

- First 2 bytes (or more if needed) containing the lenght of the uncompressed file.

- Then the compressed data.

The compressed data looks about like that:

Uncompressed chunk: 1 bit <equal to 0> 8 bits <the uncompressed byte>
Compressed chunk: 1 bit <equal to 1> 16 bits <offset> 6 bits <lenght>

Well, the inner loop:

 - Loop till there are no bytes to compress.
         - Scan the input buffer till the current byte that we are comparing.
           Note that the decompressor  can't copy bytes from a position where
           the bytes aren't already defined.
         - Have we found an equal?
                 - Yes.
                   - Then compare the next byte from the current position with
                     the byte in the next position  when we found a byte equal
                     to the first.
                   - Continue comparing till you have found a byte that is not
                     equal, but remember keeping the number of equal bytes.
                     - Now you have found a byte that isn't equal.
                       Is the number of found bytes more than 2?
                       - Yes. Write the offset of the FIRST byte you found and
                         the number of repeated bytes (length).
                         Then increase the pointer to  the current byte by the
                         number of repeated bytes  ('coz we have saved it) and
                         continue searching.
                       - No. Continue searching.
                 - No. Continue searching.
         - If you didn't find any match, simply write an uncompressed byte.

That is all the work. Remember to write the prefixes too.

Go and implement it, test it with a simple text, check the 'text to compress' section, try the first one and the second one.

Your compressor does not seem to have any bugs? Yes? Well, now we have to do a decompressor:

 - You read the lenght of the uncompressed file (the first word).
 - Then you loop till you uncompressed the whole file.
         - Read a bit (the prefix).
                 - It's 0:
                   - Read 8 bits and write  them to the output buffer
                     (remember they are an uncompressed byte).
                     Increment the pointer to the output.
                 - It's 1:
                   - Read the whole offset, 16 bits, then the length.
                     Copy 'length' byte from 'offset'  to the current
                     position, and add the 'length' to the pointer to
                     the output.

Now the compressor and the decompressor are done, correct, when you have finally killed all your bugs (not so difficult as it seems), you are ready for the next mini-part.

If you have implemented the compressor and the decompressor without any bugs, then you have done a good work, congratulations, but there is still a lot of things to do. E-) Hey, keep up the good work! ;-)

How to improve it

First take a look at this string: "444 4444 4444"

Imagine you are compressing it and have already compressed "444 4444 ". Then you read '4' and find a match, "444", right? NO. You have found a match, but maybe it isn't the best one. What you have to do is to save this temporarily and continue until you have scanned all the buffer till the current position, and then get the best one and write it. (This is string 3.)

Another one is the length of the offset. Maybe 16 bits is too much. Of course you may do something to have less bits. One good idea is to reduce the loop back buffer (the buffer starts in the first byte to compare and ends in the current position) so that you will gain speed, 'coz you need to compare less bytes, and you waste less bits in the offset. Imagine your loop back buffer is 8192 bytes long, so the offset will be 13 bits, and you will not spend so much time on comparing.

More? Of course. E-) The length of the length. E-D I've done some tests, and I can say that usually the lengths are smaller than 15 bytes! So you only have to use 4 bits on it. Remember the minimum bytes to write a compressed code? So to save some bits we can do: length-= minimum_bytes. So if minimum_bytes is two, a length of four will be saved as 1, and the decompressor will get the length (1) and add it the minimum_bytes, and we have our length back! E-) Now the maximum value that our 4 bits can hold is 18 instead of 15. Good, no?

More about the offsets: We can do a variable offset length so that it depends on the length of the loop back buffer in the current position. Example: If we are at position 412 bytes, we need 9 bits to store that number, so we read only 9 bits. If we are at 19000, we will read 14 bits. For this you may use an instruction called bsr. Look it up in Helppc or something like that.

Even more about the offsets: We can do another thing to have variable length offsets, namely to waste a bit as a prefix and use it to determine if we have a maximum offset length (let's say 15 bits) or a smaller one (9 bits).

Lazy encoding, what's that? The theory behind it is that sometimes it is better to discard some bytes and find better lengths. For example, look at the string number 4, think a lot and code a little.

Argh, do you still need more? I need it too. };-) I found info about those topics: Huffman, Markov, Lzp.

And think a lot, first of all you have to know how the files are. Do they have big repetitions in them? Then you are ready for coding.

Some greetings

Greetings? Why, if I did all the work alone. E-)

Well, at least thanx to:
Picard - For his lzss article, I learned a lot with it.
Red XIII - For his lzp article, man I don't have time to read it. ;-(

Text to compress

You have to save every string to a diferent file. This is only for debugging purposes, and to improve you compressor.

 1 - "11 222 11 222"
 2 - "111222111312221"
 3 - "444 4444 4444"
 4 - "curry urrent current"

Closing words

First af all, if there is any problem with the put_bits, don't mind too much, rewrite it, be sure you get and put the bits in a correct way, etc. Excuse my poor English.

Now I suppose that you have read the whole text and you still have to start working. It may seem a tad dificult, but it isn't. I learned all this stuff in two weeks with the only help of the mentioned articles in Hugi #12. Maybe you are asking what my compressor is all about. Well, I have already implemented all this stuff, now I'm doing the lazy encoding, of course I always get the best length, now I'm implementing a better usage of the prefixes. I have a lot of ideas but almost no time for coding. E-(

I still don't know how Huffman, Markov or the hash index works, so if anybody wants to enlight me, contact me, or better, write an article, like I did, for everybody and send me a copy, and we will discuss about compression.

Contacting the author

You can contact me via e-mail, sorry for possible delays but I only check it once a week:

dario_phong@mixmail.com

Well, here's my address:

                ARTURO SAN EMETERIO CAMPOS
                C/ROSAL  48  4,1
                CP: 08004   BARCELONA (SPAIN)

Other productions by DAriO PhONG:
dp_shan.zip - My first 4k intro, cool design (I think }E-)
dp_hintr.zip - An article about hintros.
dp_ktc10.zip - A program that makes the text look cooler.
If you want to get them, send an email to me.

C U SO0N! And good luck with the compressor!

- DAriO PhONG, Barcelona 16/12/1998