Optimizing "Putbits" Part II
Introduction
After reading Hugi #14, and especially Dario Phong's "Optimizing Putbits", I thought there was quite a bit more which needed to be said on this subject. And so here is my contribution, which I hope will inspire others to take this subject still further.
As Dario Phong correctly said, the "putbits" function is very heavily used when writing a compressor (or "packer" as many call it), so it seems sensible to try and optimize it to reduce the amount of time needed to compress a file.
But, this should be done only after first optimizing the rest of the compressor, because let's face it, searching for the longest string match in a large block, or for the optimal packing sequence, can take 100 or 1,000 times longer than the "Putbits" routine which outputs the finished compression token.
If anything the "Getbits" routine (used in the depacker/decompressor) would be a far better place to spend your time optimizing, after, of course, the actual compression algorithm itself.
In this article I will only focus on optimizing the "Putbits" routine, and I leave the rest of your super-compression engine for you to optimize.
There are no real brain-blasting methods or fantastic new algorithms here; most of these techniques have been known about for years. The marker bit, for example, was used in the ZX Spectrum tape load/save routines, and probably years before 1982.
Terminology
"Bitstream": A continuous block of bytes treated as one long series of bits.
"Token": This represents a command or a number of data bytes. The token can range from a single bit up to 16 or more bits in length.
"Putbits": This takes a token and writes all of its bits into the Bitstream. It has to deal with crossing byte boundaries and writing bit(s) to the correct bit positions in each Bitstream byte.
"Sequence": I use the terms sequence and string to mean the same thing: a continuous block of bytes of a certain length. A string is more usually used when dealing with ASCII characters, so sequence seems a better choice.
"Bitrack": A temporary byte (or word) into which a number of bits are shifted until an entire byte (or word) has been filled. It is then written to disk/memory and the BitRack is re-initialised, so the process can start again.
Bit-by-Bit
I will first describe a 1-bit "Putbits" routine in which the "newbit" is only 1 bit in length and has the value 0 or 1.
Initialise:
    BitRack = 0
    BitPosition = 0

Putbit:
    BitRack = BitRack OR (newbit << BitPosition)
    BitPosition + 1
    If BitPosition = 8 then
        OutputByte(BitRack)
        BitRack = 0
        BitPosition = 0
This would output bits into the bitstream in this order:
--------------------> bitstream
hgfedcba ponmlkji xwvutsrq .... etc ...
 byte 0   byte 1   byte 2
where the very first bit written goes to bit #0 of the first byte.
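The routine above can be sketched in C like this (a minimal sketch: `OutputByte` and the buffer it fills are hypothetical stand-ins for whatever sink your packer uses):

```c
#include <stdint.h>

/* Hypothetical output sink: collects finished bytes in a buffer. */
static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint8_t bit_rack = 0;   /* BitRack      */
static int     bit_pos  = 0;   /* BitPosition  */

/* Write a single bit (0 or 1); the first bit lands in bit #0. */
static void PutBit(int newbit)
{
    bit_rack |= (uint8_t)(newbit << bit_pos);
    bit_pos  += 1;
    if (bit_pos == 8) {        /* a full byte: flush and restart */
        OutputByte(bit_rack);
        bit_rack = 0;
        bit_pos  = 0;
    }
}
```

Feeding it the eight bits of a byte lowest bit first reproduces that byte in the output, matching the "hgfedcba" diagram above.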
Countdown
A much easier way to understand (and code) is to start at bit 7 and countdown until we go past bit 0 and then output the byte before repeating the process again from bit 7.
Initialise:
    BitPosition = 7
    BitRack = 0

PutBit:
    BitRack = BitRack OR (newbit << BitPosition)
    BitPosition - 1
    If BitPosition < 0 then
        OutputByte(BitRack)
        BitRack = 0
        BitPosition = 7
Now bits are written in this order:
--------------------> bitstream
abcdefgh ijklmnop qrstuvwx .... etc ...
 byte 0   byte 1   byte 2
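The countdown version differs only in the direction of travel. A C sketch (again with a hypothetical `OutputByte` sink, so the block stands alone):

```c
#include <stdint.h>

static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint8_t bit_rack = 0;
static int     bit_pos  = 7;    /* countdown starts at bit 7 */

/* Write a single bit (0 or 1); the first bit lands in bit #7. */
static void PutBit(int newbit)
{
    bit_rack |= (uint8_t)(newbit << bit_pos);
    bit_pos  -= 1;
    if (bit_pos < 0) {          /* gone past bit 0: flush */
        OutputByte(bit_rack);
        bit_rack = 0;
        bit_pos  = 7;
    }
}
```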
The Marker Bit
We can actually get rid of the BitPosition variable and combine it with the Bitrack. I will use some 80x86 for this.
Rather than taking the input bit, shifting it to the correct bit-location and combining it with our bitstream byte, we can simplify this by using a rotating byte (word, or dword etc.) instead.
Initialise:
mov [BitRack], 01h ; Bit 0 = marker bit
PutBit:
mov al, <newbit> ; AL = new bit
shr al, 1 ; shift the bit into CF
rcl [BitRack], 1 ; rotate CF into Bitrack
jnc short notfull
mov al, [BitRack]
call OutputByte ; OutputByte(BitRack)
mov [BitRack], 01h ; re-init the BitRack marker
notfull:
This time the BitRack itself acts as the bit counter. It is initialised with 01h (binary 00000001). As each new bit is rotated into the BitRack the marker bit will also be rotated up, until finally the marker bit falls into the CF. At this point an entire 8 bits have been stored in the BitRack in the order abcdefgh; we just have to output the BitRack and re-initialise it with the marker value.
As you can see, it is far more efficient in terms of speed and register usage than the "BitPosition" method.
This time bits are written into each byte starting from bit 7:
--------------------> bitstream
abcdefgh ijklmnop qrstuvwx .... etc ...
 byte 0   byte 1   byte 2
PLEASE NOTE: In order to output the final BitRack, you must do this:
FlushMarker:
cmp [BitRack], 01h ; Is the BitRack empty ?
jz short rackempty ; yes: only the marker is left
flushloop:
shl [BitRack], 1 ; shift in 0's until
jnc short flushloop ; the marker-bit falls out
mov al, [BitRack] ; then write the final byte
call OutputByte ; OutputByte(BitRack)
rackempty:
Without the above code the final byte of the bitstream would be missing. The SHL loop is VERY important: it aligns the remaining bits in the BitRack to begin at bit position 7. Without it our marker-bit would still be in the BitRack somewhere, with the last 0 to 6 bits after it. (E.g. 00001xxx where 'xxx' are the last 3 bits written into the BitRack with our marker-bit before them.)
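C has no carry flag, so a C rendering of the marker-bit trick has to test bit 7 by hand before shifting; the shifted-out bit plays the role of CF. This sketch models both PutBit and FlushMarker (`OutputByte` is once more a stand-in sink):

```c
#include <stdint.h>

static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint8_t bit_rack = 0x01;     /* bit 0 = marker bit */

/* RCL emulation: the bit shifted out of bit 7 acts as the carry flag. */
static void PutBit(int newbit)
{
    int carry = (bit_rack >> 7) & 1;
    bit_rack  = (uint8_t)((bit_rack << 1) | (newbit & 1));
    if (carry) {                    /* marker fell out: 8 bits stored */
        OutputByte(bit_rack);
        bit_rack = 0x01;            /* re-init with the marker */
    }
}

/* Output any partial final byte, padded with 0's after the real bits. */
static void FlushMarker(void)
{
    int carry;
    if (bit_rack == 0x01)           /* only the marker left: empty */
        return;
    do {                            /* shift in 0's until the marker */
        carry = (bit_rack >> 7) & 1;/* falls out (CF of the SHL)     */
        bit_rack <<= 1;
    } while (!carry);
    OutputByte(bit_rack);
    bit_rack = 0x01;
}
```

After three bits 1, 0, 1 the flush emits 10100000b, i.e. the last bits aligned to bit 7 exactly as described above.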
Multiple-Bits
Using the "Putbit" routine is fine for single bit tokens, but for multiple bits a loop would be needed, and so it would be much, much slower.
A far better method is to write multiple bits into the bitrack at once as this avoids having to loop for each and every bit.
token     - This is the value we wish to
            write into the bitstream.
tokenbits - The number of bits to write.
            (max. 16 bits)

Initialise:
    BitPos = 0
    BitRack = 0

PutBits:
    BitRack = BitRack OR (token << BitPos)
    BitPos + tokenbits
    While BitPos >= 8
        OutputByte(BitRack)
        BitPos - 8
        BitRack = token >> (tokenbits - BitPos)
    Wend
Now bits are written into each byte starting from bit 0:
--------------------> bitstream
hgfedcba ponmlkji xwvutsrq .... etc ...
 byte 0   byte 1   byte 2
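In C the same routine might look like this (a sketch with the usual hypothetical `OutputByte` sink; note that after each flushed byte the rack is re-loaded with the token bits not yet written):

```c
#include <stdint.h>

static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint8_t bit_rack = 0;
static int     bit_pos  = 0;

/* Write 'tokenbits' bits (max. 16) of 'token', lowest bit first. */
static void PutBits(unsigned token, int tokenbits)
{
    bit_rack |= (uint8_t)(token << bit_pos);
    bit_pos  += tokenbits;
    while (bit_pos >= 8) {              /* the BitRack overflowed */
        OutputByte(bit_rack);
        bit_pos -= 8;
        /* re-load the rack with the bits not yet written */
        bit_rack = (uint8_t)(token >> (tokenbits - bit_pos));
    }
}
```

For example, writing a 3-bit token and then a 10-bit token flushes one full byte and leaves 5 bits pending in the rack.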
Now in 80x86:
dx = token
ch = tokenbits
PutBits:
mov cl, [BitPos]
mov al, dl
shl al, cl
or [BitRack], al ; BitRack OR (token << BitPos)
add cl, ch ; BitPos + tokenbits
whilelp:
cmp cl, 8 ; is BitRack full ?
jb short wendlp ; no...
mov al, [BitRack]
call OutputByte ; OutputByte(BitRack)
sub cl, 8 ; BitPos - 8
push cx
mov ax, dx
sub cl, ch
neg cl ; CL = tokenbits - BitPos
shr ax, cl
mov [BitRack], al ; BitRack = token >> (tokenbits - BitPos)
pop cx
jmp whilelp
wendlp:
mov [BitPos], cl
Could do Better
Now, the above looks like a nice, neat loop, but in fact it's a very general solution. The loop is only entered when the BitRack has been filled with 8 bits, in which case the BitRack is output and the remaining bits of the token are placed in the BitRack, repeating until the entire token has been shifted into the BitRack.
And whenever the BitRack is full (and the while loop is entered), the remaining bits from the token are written starting at bit position 0 of the BitRack.
Here is a much better solution:
dx = token
ch = tokenbits (0 to 16)
PutBits:
mov cl, [BitPos]
mov ax, dx
shl ax, cl
or al, [BitRack] ; AX = BitRack OR (token << BitPos)
add cl, ch ; BitPos + tokenbits
cmp cl, 8
jb short notfull ; BitRack is NOT full?
call OutputByte ; OutputByte(AL)
mov al, ah ; AL = BitRack (bits 15..8)
sub cl, 8
cmp cl, 8
jb short notfull ; BitRack is NOT full?
call OutputByte ; OutputByte(AL)
sub cl, 8
push cx
mov ax, dx
sub cl, ch
neg cl ; CL = 16 - old BitPos
shr ax, cl ; AL = overflow byte (bits 23..16)
pop cx
notfull:
mov [BitRack], al
mov [BitPos], cl
To understand this, look at this terrible ASCII diagram. Here is the worst possible case:

    BitPos    = 7
    tokenbits = 16
bit 15          0
    aaaaaaaa bbbbbbbb    <-- the token

bit  7      0
     _rrrrrrr            <-- the BitRack

After shifting, the token looks like this:

bit 23    16 15     8 7      0
    _aaaaaaa abbbbbbb b_______

Combined with the BitRack:

bit 23    16 15     8 7      0
    _aaaaaaa abbbbbbb brrrrrrr
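The same unrolled idea is easy to express in C if the 23-bit intermediate value is simply held in a 32-bit accumulator. This is a sketch of the technique rather than a literal translation of the assembly; with at most 7 + 16 = 23 bits pending, at most two full bytes can ever be flushed per call, so no loop is needed:

```c
#include <stdint.h>

static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint8_t bit_rack = 0;
static int     bit_pos  = 0;

/* Unrolled PutBits: 'tokenbits' is at most 16, so the accumulator
   holds at most 23 bits and at most two bytes become complete. */
static void PutBits(unsigned token, int tokenbits)
{
    uint32_t acc = ((uint32_t)token << bit_pos) | bit_rack;
    bit_pos += tokenbits;
    if (bit_pos >= 8) {
        OutputByte((uint8_t)acc);  acc >>= 8;  bit_pos -= 8;
        if (bit_pos >= 8) {
            OutputByte((uint8_t)acc);  acc >>= 8;  bit_pos -= 8;
        }
    }
    bit_rack = (uint8_t)acc;    /* leftover bits stay in the rack */
}
```

Exercising the worst case from the diagram (7 bits already pending, then a 16-bit token) flushes two bytes and leaves 7 bits in the rack.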
Improvements
Well, I have only used 8086 instructions and 16-bit registers to make this as compatible as possible. You will no doubt be using 32-bit registers and addressing-modes which will certainly help speed things up a great deal.
The SHLD and SHRD instructions might be useful.
The BitRack is probably best extended to 32 bits to reduce memory accesses. Using protected mode will again help to speed things up, and of course it allows huge buffers to be easily maintained without having to juggle segment registers.
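For instance, the marker-bit trick from earlier extends naturally to a 32-bit BitRack, so the output path is only taken once per 32 bits. A sketch (the byte order inside the rack is my own choice here: oldest bits are written first so the stream order matches the bit-by-bit version):

```c
#include <stdint.h>

static uint8_t out_buf[64];
static int     out_len = 0;

static void OutputByte(uint8_t b) { out_buf[out_len++] = b; }

static uint32_t bit_rack = 1;       /* bit 0 = marker bit */

static void PutBit(int newbit)
{
    int carry = (int)(bit_rack >> 31) & 1;
    bit_rack  = (bit_rack << 1) | (uint32_t)(newbit & 1);
    if (carry) {                    /* 32 bits collected */
        OutputByte((uint8_t)(bit_rack >> 24));  /* oldest bits first */
        OutputByte((uint8_t)(bit_rack >> 16));
        OutputByte((uint8_t)(bit_rack >>  8));
        OutputByte((uint8_t)(bit_rack));
        bit_rack = 1;               /* re-init with the marker */
    }
}
```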
"RAW" bytes
I almost forgot this tip.
Roughly 50% of the time (hopefully less) you will be forced to output "RAW" 8-bit bytes because you are unable to compress the data. For example, the first time you encounter a byte which hasn't been seen before, you need some way to encode it. This is normally done with a single-bit prefix of '0' or '1', so that during decompression "RAW" bytes can be decoded quickly without too much overhead.
Now, rather than encode every 8-bit byte using our "putbits" routine which has to perform shifting, byte boundary checks and so on... We can output the byte directly into the bitstream.
This means "RAW" bytes are encoded and decoded without the need to shift or deal with 8 bits across byte boundaries in the bitstream.
To use this method you must use some kind of output buffer which allows random access because you will need to go back and insert/overwrite a previously written byte later on.
You must also reserve space in the bitstream for each BitRack image BEFORE you output any "RAW" bytes.
+------>--------->------+ +>+
| | | |
xxxxxxxx <RAW byte> <RAW byte> xxxxxxxx xxxxxxxx <RAW byte>
bitrack bitrack bitrack
image 1 image 2 image 3
In the above diagram we have 3 RAW bytes and 24 BitRack bits (denoted by the 'x' characters).
Because we are writing BitRack images and RAW bytes at different speeds we must be VERY careful how we output bytes. We must reserve space for the BitRack image byte before writing any RAW bytes.
byte OutputBuffer[1000] ; an example output buffer
OutAddr = 0
Initialise:
    RackAddr = OutAddr ; reserve space for the
    OutAddr + 1        ; 1st BitRack image
    BitRack = 0
    BitPos = 0
We have 2 output routines now, Put1Bit and PutRAW:
Put1Bit:
    BitRack = BitRack OR (newbit << BitPos)
    BitPos + 1
    If BitPos = 8 then
        OutputBuffer[RackAddr] = BitRack
        RackAddr = OutAddr
        OutAddr + 1
        BitRack = 0
        BitPos = 0
And the PutRAW routine:
PutRAW:
    OutputBuffer[OutAddr] = rawbyte
    OutAddr + 1
The reason why we reserve a byte for the BitRack BEFORE it is full is that when we come to depack the bitstream we need to read the BitRack image first, to know what to do with the following bitstream bytes.
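Putting the pieces together in C (the article itself stops at Put1Bit and PutRAW; the `FlushStream` helper that back-fills the final, possibly partial, rack image is my own addition for illustration):

```c
#include <stdint.h>

static uint8_t OutputBuffer[1000];  /* an example output buffer */
static int     OutAddr  = 0;
static int     RackAddr = 0;

static uint8_t BitRack = 0;
static int     BitPos  = 0;

static void Initialise(void)
{
    RackAddr = OutAddr;             /* reserve space for the  */
    OutAddr += 1;                   /* 1st BitRack image      */
    BitRack  = 0;
    BitPos   = 0;
}

static void Put1Bit(int newbit)
{
    BitRack |= (uint8_t)(newbit << BitPos);
    BitPos  += 1;
    if (BitPos == 8) {
        OutputBuffer[RackAddr] = BitRack;  /* back-fill reserved slot */
        RackAddr = OutAddr;                /* reserve the next one    */
        OutAddr += 1;
        BitRack  = 0;
        BitPos   = 0;
    }
}

static void PutRAW(uint8_t rawbyte)        /* written straight through */
{
    OutputBuffer[OutAddr] = rawbyte;
    OutAddr += 1;
}

/* Hypothetical helper: back-fill the last (partial) rack image. */
static void FlushStream(void)
{
    OutputBuffer[RackAddr] = BitRack;
}
```

A rack image always sits in front of the RAW bytes it describes: here one flag bit, then a RAW byte, then seven more flag bits leave the full rack image back-filled at offset 0 and the RAW byte at offset 1.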
Closing Words
Well, that's all for putbits. Don't worry if you can't understand it all in one go; just start off with simple code and work your way up slowly, making sure you understand things fully before moving on to something more difficult.
Enjoy.
Regards