Optimizing "Getbits" Part I
Introduction
After reading putbits I and II, we all know how to do a 'putbits' routine (right?). So here is the other vital routine for any depacker/decompressor. This document contains tips and advice on how to read a number of bits from a bitstream.
Terminology
"Bitstream": This is a continuous block of bytes which is used like a huge series of bits. If our bitstream is 100 bits long then we need 13 bytes of storage (actually only 12.5 with 4 bits spare).
"BitRack": A temporary byte/word/dword in which bits are shifted out from, or into.
"InputByte": This fetches the next byte from our bitstream and passes it back in the AL register.
Bit-by-Bit
As we are still dealing with a bit-stream, we might as well start with fetching 1 bit at a time from it.
Right, we need a counter to record how many bits we have left in the BitRack, and the BitRack itself.
Initialise:
BitRack = InputByte()
BitPos = 7
Get1Bit:
newbit = ( BitRack >> BitPos) AND 1
BitPos - 1
If BitPos < 0 then BitRack = InputByte()
BitPos = 7
Now, let's use some 80x86 to make life easy.
Initialise:
call InputByte
mov ch, al ; CH = BitRack
mov cl, 7 ; Bit = 7
Get1Bit:
mov al, ch
shr al, cl
and al, 1 ; newbit = (BitRack >> BitPos) AND 1
dec cl ; BitPos - 1
jns short notempty ; if BitPos < 0 then
call InputByte ;
mov ch, al ; BitRack = InputByte()
mov cl, 7 ; BitPos = 7
notempty:
The bits a-x in the bitstream are stored like this:
--------------------> bitstream
abcdefgh ijklmnop qrstuvwx .... etc ...
byte 0 byte 1 byte 2
Mask and Wrap
As you may have already noticed the BitPos counter (register CL) counts down to -1 and then begins again from 7, i.e. it wraps on a power of 2.
All the BitPos is currently doing is telling us when the 8th bit has been shifted out of the BitRack.
Another way to do this is to use a logical AND instruction to mask off unwanted bits and limit the results to a small range, in this case 8 (7 to 0).
E.g.
dec cl ; 2
and cl, 00000111b ; 2
jnz short <label> ; 2
This takes 6 bytes and uses the ZF (Zero Flag) to indicate when the 8th bit has been reached.
A better way is to use the MOD 256 property of the byte register itself to perform the logical AND for free, and simply scale up the countdown of CL to be 1/8th of 256 (which is 32). E.g.
sub cl,32 ; 2
jnz short <label> ; 2
As a bonus, we get the CF (Carry Flag) set every 8th time, which could be an advantage depending on your code.
The Bit Loop (or "fake BitRack")
Another way to perform the once every 8th (or 16th or 32nd) event counter is by using a single set bit within a byte, word or dword.
This is just a "fake" BitRack which is rotated round and round without being reloaded or modified in any other way.
Intialise:
mov cl, 01h
Get1Bit:
rol cl, 1
jnc short notyet
call InputByte
mov ch, al ; ch = InputByte()
notyet:
shl ch, 1 ; BitRack << 1
; CF = next bit
As you can see, we never need to reload CL after the initialisation, the ROL instruction takes care of it for us.
This "fake" BitRack method can be used for every 2nd 4th, 16th etc. event tick, simply load the fake BitRack with the correct value during initialisation:
E.g.
00000001 = every 8th
00010001 = every 4th
01010101 = every 2nd
11111111 = every 1 event tick.
This method can be employed in many different programming problems, not just packing/depacking bits in a bitstream.
9 Bit BitRacks
So far we have seen the countdown technique and the Bit-Loop (or "fake BitRack") technique. Both are pretty good in terms of speed and memory usage. But there is another method which combines them both into a single BitRack, and so can be implemented in a single variable or register.
This method can be tricky to understand and code, but don't worry, it's not that bad.
At first sight it may seem that we can just shift the BitRack until is it empty, or zero. Then ALL 8 bits have been shifted out, right?
WRONG !!
Try loading the BitRack with the value 192 (binary 11000000) and you will soon realise that after just 2 shift operations the BitRack would be 0 (binary 00000000), which would indicate that it is empty, but in fact there are 6 more bits which must be shifted out.
The problem is that they are all 0's, because our bitstream is made up from individual bits and a bit can ONLY have two values (unless you know better), it is impossible to have a tri-state bit. We need this 3rd state to denote a "marker" bit.
But for certain values where the final bit is 1 this method does indeed work for all 8 bits. Values such as:
1st 8th
11000001
00101001
00001111
10000001
00011001
After the 8th bit has been shifted out of the BitRack, we do indeed have an empty BitRack, and its value is 00000000.
Well, in fact every ODD number works. The trick lies in the final (8th) bit to be shifted out of the BitRack MUST be 1.
But if we set this 8th bit to 1 all the time then we could only store 7 bits in each byte. This is NOT good news for compression, is it? #:o(
We need to somehow pack 9 bits into just 8!
RCL and RCR
The solution to this is to use the RCL and RCR instructions and set the CF (Carry Flag) to 1 before shifting out the first bit and every 8th bit (or whenever the BitRack is reloaded with a new byte).
Initialise:
call InputByte
mov ch,al ; CH = InputByte()
stc ; **
rcl ch, 1 ; **
CF = 1st bit
Get1Bit:
shl ch, 1
jnz short notyet
call InputByte
mov ch, al ; reload BitRack
stc ; **
rcl ch, 1 ; **
notyet:
As you can see this method is slightly different from the normal ones. Please note the ** instructions. These shift the 1st bit of each byte into the CF while also setting the marker bit (Bit #0) to 1.
Say we had a byte with bits abcdefgh, then the ** instructions do this:
CF = 1 ----------->>-----------+ stc
|
|
CF = a <--<--- ch = bcdefgh1 <-+ rcl ch, 1
^
The CF has the value of the 1st bit, and the final bit in the BitRack has the value 1, which is the marker bit.
When the marker bit has itself been shifted out of the BitRack, the BitRack will be 0 (binary 00000000). At this point we need to reload the BitRack with the next byte from our bitstream and perform the 1st shift operation again, just like the initialisation did.
TIP: If your InputByte routine (read from memory, disk etc.) does not affect the CF then you can remove every STC instruction AFTER the initialisation one, because we know the marker bit has just been shifted out of the BitRack, so CF is already equal to 1.
Of course in your packer/compressor you could encode the 1st byte of the bitstream with the final bit = 1. This would mean the first byte only having 7 bits, but it would possibly save a STC instruction in your depacker or decompressor (saving 7 bits).
Multiple Bits
As I said in Optimizing PutBits II, a bit-by-bit approach can be very slow when used within a loop. For decoding a .GIF picture this would be very lame.
"token" - This is a single code read from
the bitstream.
Usually 16-bit variable/register.
"tokenbits" - How many bits should be read to
make up the "token".
Initialise:
BitsLeft = 0
GetBits:
If BitsLeft = 0 then BitRack = InputByte()
BitsLeft = 8
token = BitRack >> ( 8 - BitsLeft )
While tokenbits > BitsLeft
BitRack = InputByte()
BitsLeft = BitsLeft + 8
token = token OR (BitRack << BitsLeft)
Wend
BitsLeft - tokenbits
token = token AND ((1 << tokenbits) -1)
NOTE: Be careful, the BitRack is normally only 8 bits long, but the "token" is usually 16 bits long, so YOU must deal with the zero extending and data conversion yourself.
Or, just look at the following 80x86 code.
Initialise:
cl = 0 ; BitsLeft = 0
BL = tokenbit
GetBits:
cmp cl, 0
jnz short notzero
call InputByte
mov ch, al ; ch = InputByte()
mov cl, 8 ; BitsLeft = 8
notzero:
mov dh, 0
mov dl, ch
push cx
neg cl
add cl,8
shr dx,cl ; token = BitRack >> ( 8 - BitsLeft)
pop cx
whilelp:
cmp bl, cl
jbe short wendlp ; while tokenbits > BitsLeft
call InputByte
mov ch, al ; BitRack = InputByte()
add cl, 8 ; BitsLeft + 8
mov ah, 0
mov al, ch
shl ax, cl
or dx, ax ; token OR (BitRack << BitsLeft)
jmp short whilelp
wendlp:
sub cl, bl ; BitsLeft - tokenbits
push cx
mov ax, 1
mov cl, bl
shl ax, cl
pop cx
dec ax
and ax, dx ; token AND ((1 << tokenbits) - 1)
Gives:
AX = token
Improvements
You could extend this method to use 16 or 32 bits for your BitRack instead of the current 8 bits.
Of course, the "InputByte" shouldn't be a sub-routine which you call, I only used it to help explain things. The most likely replacement is probably a LODSB, LODSW or some other MOV & INC instruction pair.
TIP: For packing & depacking really huge files it is much easier to break it down into 16 or 32 Kb chunks (sometimes even 1 or 2 Kb chunks). This not only means that all the code can be easily written in Real/8086 mode (and so works on ANY machine), but it allows you to use the most optimum compression method for each, individual chunk.
Take EXE files for example, at the start there is normally a very large range of possible byte values but towards the end of the EXE there is plenty of data and redundant 0's. If you break the EXE down into small 2 Kb chunks then you can use a byte before each chunk to tell the decompressor which method of compression was used on it. The most optimum would just be a single byte and a repeat-count, and the worst would be a "RAW" uncompressed chunk.
No matter how fantastic you think your compression is, there are always situations in which you CANNOT compress the data, remember this. Then think about the "chunk" method...
Closing Words
Well, I hope you have enjoyed reading this document and I hope some coders out there either learned something, or better still, have been inspired to code their own packer/depacker routines.
If you are new to compression, then I would suggest reading about the LZ77, LZ78 and LZW algorithms, but first read about the elegant Huffman, respect due.
Have fun.
Regards