Optimizing "Putbits" Part III
(or how I beat Dario Phong... heh heh)
Introduction
Earlier today I recieved a curious email from Dario Phong (whose original article in Hugi #14 inspired me to write my first ever article).
After some very fair comments about LZT (which I had outlined in my LZT article) he mentioned about the 'putbits' routine in which he said some strange things which I disagree with, but rather than argue what's the fastest putbits etc. etc. I will describe some size optimizations for dealing with bitstreams together with a very simple way to write multiple bits to the bitstream using just ONE shift operation.
------ Serious mode off ------
I want to dedicate this entire article to Dario Phong for his rather confusing email he sent me today. I've read it a number of times, but I'm still baffled!
------ Serious mode on ------
Random Access.
Dario said: "I think random access should not be done in the putbits."
As I am unclear about what he meant by "random access" I will reply to the two possible things which I think he is referring to.
1. Random Bit Access:
For a single-bit write, a random-access to the current bit position inside the bitrack is slow. That's why I wrote the 'marker-bit' section in which the bitrack is initialised with 01h (binary 00000001) and is used to indicate when a full 8-bits have been shifted in.
But for multiple bits (i.e. writing more than 1-bit per loop) this random-access is vital. It not only tells us the starting bit-position within the BitRack for the new bits to go, but it is also used to determine if the BitRack is full.
2. Random BitRack Access:
The only other random-access I can think of is for the BitRack and Raw (literal) bytes. Because bits are written into the BitRack and complete Raw bytes are written at different rates, we need random-access to go back and output the full BitRack later on.
As far as I know there is no way to overcome this problem because of the way most depackers work (BitRack bytes and Raw/literal bytes are intermixed throughout the packed data-stream).
The only way to remove this random-access would be to use two buffers, one for the bitstream and one for the bytestream (raw literal bytes). But this complicates the depacker code because another memory pointer is needed for the 2nd buffer, not to mention a couple more instructions to set this pointer up. The only good thing about this method is that some other compression scheme can be performed on the raw/literal byte stream, so by gaining a little more compression.
Too many Shifts.
Okay, here is the bit which most people have been waiting for... (apart from the end of this text... heh heh).
I am surprised that I didn't think of this method before, but after writing close to 20 articles for Hugi #15 I did get rather bored.
What is the problem with the original 'putbits' routine?
Well, nothing; except for tokens which cross the BitRack (byte) boundary. This needs two shift operations, one to align the lower bits with the current bit-position within the BitRack, and one to align the overflow bits with bit #0 of the next BitRack byte.
Say we have a 16-bit token we wish to write into the bitstream:
<---- token ---->
abcdefgh ijklmnop
And we have a BitRack with 3 bits (xxx) already in it:
BitRack
76543210
_____xxx
If we use a 24-bit temporary buffer and shift the token left 3 places:
<----- the "shifter" ---->
76543210 76543210 76543210
_____xxx <-- the BitRack
_____abc defghijk lmnop___ <-- the token << 3 bits
So it looks like we need a 24-bit register to perform the above operation. This would mean a 80386 or better CPU for those nice 32-bit reggies.
But after thinking about some old 8086 sprite code that I wrote on an old 8Mhz Amstrad PC (yuck) I remembered this obvious trick. Sometimes even with the latest super-sonic PC, the old tricks are still the best.
Don't stop now, I'm on a ROL-L
Yeah, instead of a SHL (shift left) use a ROL (rotate left) instruction. This means the rotated token would look like this:
76543210 76543210 76543210
_____xxx <-- the BitRack
... defghijk lmnopabc <-- the token << 3 bits
ÀÂ- ^^^
À---->------>------
Now let's forget all about the BitRack and access the Bitstream directly.
(Hey, Dario... check this out...)
*** WARNING ***
This code is TOTALLY untested, so there probably is a heap of bugs in it. I typed it directly off the top of my head, anyway I will demonstrate a much better 'putbits' very soon..
(AX) = value to write (the 'token') (CH) = bits to write ('NumBits')
push bx dx di
mov bx,[BitPos]
mov di,[BitPtr] ; [ES:DI] --> Bitstream byte
mov cl,bl
rol ax,cl ; rotate token left
mov dx,ax
and dl,table[bx] ; get overflow bits
xor al,dl
add cl,ch ; BitPos + NumBits
or es:[di],al ; combine with previous BitRack bits
cmp cl,8
jb short @@done ; < 8 bits in shifter?
inc di
mov es:[di],ah ; else output Bits 15..8
cmp cl,16
jb short @@done ; < 16 bits in shifter?
inc di
mov es:[di],dl ; else output Bits 23..16 (overflow)
@@done:
and cl,7 ; BitPos MOD 8
mov [BitPtr],di
mov byte ptr [BitPos],cl
pop di dx bx
ret
table db 00000000b
db 00000001b
db 00000011b
db 00000111b
db 00001111b
db 00011111b
db 00111111b
db 01111111b
db 11111111b
Faster? Are you taking the P-mode?
Most of us 80x86 coders are now using p-mode, right? If not, why not? Well the following method of accessing the bit-stream directly can be used in either P-mode or Real/V86-mode, of course 32-bit code segments will require a few less override bytes.
(EAX) = value to write (the 'token') (CH) = number of bits to write
mov cl, [BitPos]
mov edi, [BitPtr]
shl eax, cl ; shift token to BitPos
or [edi], eax ; combine with Bitstream
add cl, ch ; BitPos + NumBits
movzx eax, cl
shr eax, 3 ; EAX = BitPos / 8
and cl, 7 ; BitPos MOD 8
add [BitPtr], eax ; Advance BitPtr by EAX bytes
mov [BitPos], cl
Well, that certainly looks fast and compact, but there are two small details which I have missed..
In order for the above code to work you MUST zero out the bitstream buffer before using the above routine. A nice REP STOSD or optimized loop will do.
Also it may overrun the bitstream buffer by upto 3 bytes due to the
or [edi], eax ; combine with Bitstream
instruction, so an extra dword in your buffer is perhaps a good idea.
You can remove the need to clear the bitstream by replacing the OR instruction with these:
or al, [edi] ; combine with Bitrack byte
mov [edi], eax ; wrt BitRack + overflow byte(s)
But remember to zero out the very 1st byte in the bitstream buffer!
Also the 'token' value in the (EAX) register MUST be zero-extended, so you can't load 1234h into (EAX) and only write the lower 8 bits, you MUST load (EAX) with 34h. This is due to this instruction:
or [edi], eax ; combine with Bitstream
If you want to only write a certain number of bits from (EAX) then you can perform a logical AND instruction before putbits (i.e. and eax, 255).
BTW: bits are written into bytes starting from bit 0:
--------------------> bitstream
hgfedcba ponmlkji xwvutsrq .... etc ...
byte 0 byte 1 byte 2
It's nice to see that Intel's low...high byte order can sometimes be very useful (heh heh).
Improvements
Of course the above code could be optimized. The usual AGI stalls and pairing rules should be used.
You could use the above code as a MACRO in your source code, and reserving the (CL) and (EDI) registers for the bitstream could also help speed things up a little more.... but like I said in a previous article the string-search and other CPU hungry parts of your packer are better places to spend your time optimizing.
You could even write a MMX version using this 'sprite' technique. But perhaps the MMX instructions would be better used in the string compare/search routines in your packer.
The P-mode version can handle up to 24 bits, which should be enough for most people.
Maybe using dwords (32 bits) instead of bytes (8 bits) for the bitrack would also be a good idea, as this would reduce the extra clock-cycles needed for a mis-aligned dword. Of course if you need more than 24 bits then you could give the SHLD and SHRD instructions a quick look.
Closing Words
Well, I hope that's the last 'PutBits' article that I will write.
...unless I get another email from Dario (grin).
Enjoy.
Regards