Gettin jiggy with the Mr. Bit biggy
This is another old Hugi article update inspired by my updated GetBits article. This time its the 'PutBits' routine that gets some 80x86 attention in terms of optimisation.
Introduction
As you all know already, 'putbits' writes a varying number of bits into a bitstream so they can be later read back out using a 'getbits' routine. In many ways writing bits into a bitstream can be easier than reading 'em back out, as you will soon see.
Bit-blast from the past.
Okay, here is a snippet of code from my old Hugi 16 article. You pass it the number of bits you want to write along with the bits themselves. It will only handle a maximum of 24-bits, but this should be enough for 'most' packers.
;      (EAX) = value to write (the 'data')
;       (CH) = number of bits to write
;
PutBits:
                mov     cl, [BitPos]
                mov     edi, [BitPtr]
                shl     eax, cl          ; shift token to BitPos
                or      [edi], eax       ; combine with Bitstream
                add     cl, ch           ; BitPos + NumBits
                movzx   eax, cl
                shr     eax, 3           ; EAX = BitPos / 8
                and     cl, 7            ; BitPos MOD 8
                add     [BitPtr], eax    ; Advance BitPtr by EAX bytes
                mov     [BitPos], cl
(double) words of warning.
Like I mentioned in that old article, the 'OR [EDI],EAX' assumes that the bitstream buffer has already been cleared (at init time). Also allow for an extra 3 bytes (or better still 4 bytes) at the end of a bitstream.
Anyway, here is some init/pre-clear code to deal with that possible problem.
        .DATA?
BitPtr          dd      ?
BitPos          db      ?
bitstream       db      16384 dup (?)
        .CODE
Init:
                mov     [BitPtr], offset bitstream
                mov     [BitPos], 0
                mov     dword ptr [bitstream], 00000000h
                ret
;      (EAX) = value to write (the 'token')
;       (CH) = number of bits to write
;
PutBits24:
                mov     cl, [BitPos]
                mov     edi, [BitPtr]
                shl     eax, cl          ; shift token to BitPos
                or      [edi], eax       ; combine with Bitstream
                add     cl, ch           ; BitPos + NumBits
                movzx   eax, cl
                shr     eax, 3           ; EAX = BitPos / 8
***             mov dword ptr [edi+4], 0
                and     cl, 7            ; BitPos MOD 8
                add     [BitPtr], eax    ; Advance BitPtr by EAX bytes
                mov     [BitPos], cl
        *** = clear ahead 4 bytes
The 'Init' part clears the first dword of the bitstream, this allows upto 32-bits to be output before more of the buffer needs to be cleared. The rest of the buffer is pre-cleared inside the 'PutBits24' routine.
Of course it would be much quicker to clear the entire bitstream buffer at 'Init' time rather than inside the 'PutBits24' routine. I used this method to show an easy size-optimization method.
Random-access
Now we get to the fun part. The improved PutBits routine, which as you've already guessed it is based on random-access rather than incremental or advancing the address-pointer and bit-position variables.
        .DATA?
BitSeekPos      dd      ?
bitstream       db      16384 dup (?)
        .CODE
Init:
                mov     [BitSeekPos], 00000000h
                mov     dword ptr [bitstream], 00000000h
                ret
;      (EAX) = value to write (the 'data')
;       (CH) = number of bits to write
;
PutBits24:
                mov     edx, [BitSeekPos]
                mov     cl, 7
                and     cl, dl
                shr     edx, 3
***             mov dword ptr bitstream[edx+1], 0
                or      bitstream[edx], eax
                movzx   edx, ch
                add     [BitSeekPos], edx
                ret
        *** = clear ahead 4 bytes
Closing words.
Easy Huh?
An improvement to both the PutBits and GetBits routine would be to handle data in dword chunks rather than bytes. This would speed up access (no mis-aligned dword memory accessing) but would require slightly more code. Maybe some total 80x86 nutter might like to code a MMX or FPU version of PutBits or GetBits? hehe..
Just in case you don't understand Roman Numerals, 'Part IV' means the same as Part X-VI. Hope that helps ;)