Storing and Searching For Byte Sequences
Introduction
A common problem which coders face is that of storing and searching for ASCII strings (or in fact, any variable-length sequences of bytes). Anyone who has ever written an assembler or a dictionary-based compression program will understand how important a fast string search and storage method is. In this article I hope to describe some techniques which are fast and, more importantly, easy to understand.
Terminology
"Link": Just like its name-sake this is a way to connect two items together. An address-pointer is stored inside one item and its value gives the address of the next item. By changing the value of a link we can point to an entirely different area of memory.
"Null Link": An invalid value (usually 0) is used as the link's value. Instead of using this value as an address to the next item, it denotes that there is no connected item, i.e. the end of the link chain (end of a list etc.).
"Node": A small, simple data structure which only has a few data items and most important, it has a number of "Links". These links can be used to either point to a 2nd node, or point to another data structure.
"Hash": This is a small number which has been created from a much larger value such as an ASCII string, or a number of different data values. As a hash value is such a crude representation of something much bigger it can only be used as the first step in finding an ASCII string or byte sequence.
Byte-by-Byte
Compression algorithms like LZW or LZ78 require you to search through an existing dictionary of byte sequences (or "strings") to locate the longest one which matches the input data. This is basically a slow string-search, and this can take a great deal of time, not to mention storage space for each and every byte sequence.
And after you have located the longest matching sequence, you must create a new sequence from this string by adding the next byte. This new sequence is then added to the existing dictionary. This means over time very long sequences can be built up, but this also means more time will be spent searching for the longest string the next time around.
For example, say we found that the longest match was "ABCDEF" and the next byte was "Z" then we would need to add the sequence "ABCDEFZ" to our dictionary. Then if we found "ABCDEFZ" and the next byte was "q" then we would need to add "ABCDEFZq" to the dictionary. As you can see these sequences can add up to a great deal of memory.
But as each sequence is built up from another sequence which is just one character shorter, we can use some nodes and links instead.
First we need to define a structure for our node:
node STRUC
Character DB ? ; the final byte in this sequence
SideLink DW ? ; link to the next node which shares
; the same sequence, but has a different
; 'Character' value.
LongerLink DW ? ; link to next node for the sequence
; which is 1 byte longer than this one.
node ENDS
As the above doesn't mean much, let's draw a nice ASCII diagram (horizontal arrows are "LongerLink"s, vertical lines are "SideLink"s).
       Longer  Longer  Longer  Longer
     A--->-- D--->-- D--->-- E--->-- D
side |               |
link |               |
     v               v
     B--->-- E       C--->-- D--->-- D
side |                       |       |
link |                       |       |
     v                       v       v
     C                       X       Y
side |
link |
     v
     W
The above diagram shows 14 nodes; each node has its own 'Character' field (e.g. A, D, E, B, X, Y, W). As each node is 5 bytes long (1+2+2) we need 70 bytes. This may seem like a lot, but it actually describes all these sequences:
A
AD
ADD
ADDE
ADDED
ADC
ADCD
ADCDD
ADCX
ADCDY
B
BE
C
W
Finding the longest byte sequence
Okay, the diagram looks very pretty (well, as pretty as ASCII can be), but how does it help with byte sequences?
Well, we begin with the 1st byte that we want to find and compare this against the 1st node's "Character" field. If we get a match then we follow the "LongerLink" pointer and repeat for the next byte that we want to find. This continues until we run out of bytes, encounter a null "LongerLink" or a non-matching "Character" field.
When we do encounter a byte which does not match the "Character" field, we follow the "SideLink" pointer until we either find a match, or encounter a null "SideLink".
E.g.
lea dx, dictionary ; [DS:DX] --> 1st node
sub cx, cx ; match length = 0
call AllocateNode
mov di, ax ; [DS:DI] --> next free node
@@find:
call InputByte ; AL = next byte/character to find
@@follow:
mov si, dx ; [DS:SI] --> node
cmp [si+node.Character], al
jz short @@match ; jump if the node matches our new byte
mov dx, [si+node.SideLink]
test dx, dx
jnz short @@follow ; follow "SideLink" if <> 0
mov [si+node.SideLink], di ; else add a new "SideLink" node
jmp short @@extend
@@match:
inc cx ; count number of byte matches
mov dx, [si+node.LongerLink]
test dx, dx
jnz short @@find ; follow "LongerLink" if <> 0
call InputByte ; AL = the byte which follows the match
mov [si+node.LongerLink], di ; else add a new "LongerLink" node
@@extend:
mov [di+node.Character],al ; \
mov [di+node.SideLink],0 ; + build up the new node
mov [di+node.LongerLink],0 ; /
The above example code not only searches for the longest matching sequence which is already in the node dictionary, but it also adds a new node for the new, longer sequence.
Of course you need to write your own "AllocateNode" and "InputByte" code. Both of these routines will be very simple. The "AllocateNode" could look something like this:
E.g.
AllocateNode:
mov ax, [nodepnt]
add [nodepnt], 5
ret
Where [nodepnt] has been initialised like this:
InitNodes:
lea di, dictionary
call InputByte
mov [di+node.Character], al
mov [di+node.LongerLink], 0
mov [di+node.SideLink], 0
add di, 5
mov [nodepnt], di
ret
NOTE: I have allocated and built a node for the very 1st byte. This removes the need to check in the main loop for an empty dictionary (no nodes have been defined).
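If the 16-bit assembly is awkward to follow, here is the same search-and-extend idea as a rough C sketch. The names, the fixed-size pool and the buffer-based interface are placeholder choices, not part of the original routines.

E.g.

#include <stddef.h>

typedef struct Node {
    unsigned char ch;       /* the final byte in this sequence       */
    struct Node  *side;     /* same prefix, different final byte     */
    struct Node  *longer;   /* the sequence which is one byte longer */
} Node;

/* A trivial bump allocator, like AllocateNode above.  The pool size
   is an arbitrary placeholder. */
static Node pool[4096];
static int  used;

static Node *alloc_node(unsigned char c)
{
    Node *n = &pool[used++];            /* no overflow check, as in the text */
    n->ch = c;
    n->side = n->longer = NULL;
    return n;
}

/* Search for the longest known sequence starting at in[0], then grow
   the dictionary by one node (the match plus one more byte).
   Returns the number of bytes matched. */
static int find_and_extend(Node *root, const unsigned char *in, int len)
{
    Node *n = root;
    int matched = 0;

    while (matched < len) {
        if (n->ch == in[matched]) {              /* byte matches this node */
            matched++;
            if (matched == len)
                break;                           /* ran out of input bytes */
            if (n->longer == NULL) {             /* add a "LongerLink" node */
                n->longer = alloc_node(in[matched]); /* one byte longer     */
                break;
            }
            n = n->longer;                       /* follow the "LongerLink" */
        } else if (n->side != NULL) {
            n = n->side;                         /* follow the "SideLink"   */
        } else {
            n->side = alloc_node(in[matched]);   /* new sibling sequence    */
            break;
        }
    }
    return matched;
}

In use, find_and_extend would be called in a loop over the input buffer, starting from a root node built from the very first byte, exactly as InitNodes does above.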
Mash the Hash
The node method is very good for LZ type compression, where sequences are built up byte-by-byte. But for assemblers, compilers and other programs which require a more flexible way to build strings, the previous node method isn't very efficient in terms of memory storage.
Currently there are around 400 or more instructions, reserved symbols and assembler/compiler directives, not to mention operators and functions. Given that projects (and so, source code) are getting ever larger, with a typical project ranging from 10,000 up to 100,000 lines of code, an efficient string-search to locate a particular symbol is vital.
On a typical line of source code there could be around 4 symbols which need to be scanned, found and processed. Also when you take into account macros, text-equates, register names and so on... speed is even more important.
When searching for a particular string we could use a single loop, but as more and more symbols are added into the symbol-table (our dictionary of ASCII byte sequences) the time taken would get worse and worse.
We need to break up the vast number of these strings to help narrow down the search more quickly. Something like a binary-search wouldn't really help, because the order in which strings are added to the symbol-table is almost random (i.e. symbols are not added in alphabetical order). We would end up having to sort the entire symbol-table every time we added a new symbol.
But, we could use the number of characters in each string to group together every symbol with the same length (e.g. "ADD" "SUB" "ECX" "END" etc.). Now when we come to find a particular symbol in the symbol-table we use the length of the symbol, index into a table and compare each string until we either find a match or run out of symbols.
By using the length of a symbol to divide up a large symbol-table we have used it as a form of "hash", although it's a very basic hash, it does demonstrate the basic idea of one. But there is a much better way.
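As a quick sketch in C, grouping by length needs nothing more than a small table of list heads. The 32-character cap and all the names here are placeholder assumptions.

E.g.

#include <string.h>

#define MAXLEN 32                   /* assumed longest symbol length  */

typedef struct Sym {
    struct Sym *next;               /* next symbol of the same length */
    const char *name;               /* NUL-terminated symbol text     */
} Sym;

static Sym *by_length[MAXLEN + 1];  /* one list head per length       */

/* Only symbols of the same length as the search string are compared. */
static Sym *find_symbol(const char *name)
{
    size_t len = strlen(name);
    if (len > MAXLEN)
        return NULL;
    for (Sym *s = by_length[len]; s != NULL; s = s->next)
        if (memcmp(s->name, name, len + 1) == 0)
            return s;
    return NULL;
}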
Many years ago I wrote a 68000 assembler on the old Atari ST. After performing a few tests to help speed up the string searching, I discovered that the length method wasn't that good, because most symbols (labels, equates, macros) had roughly the same length of between 6 and 10 characters. #:o(
A widely used hash technique is the multiply-by-13 method; it was used by the Amiga's disk filing system (if I remember correctly). As you fetch each character of the symbol name you multiply the running total by 13 and then add the new character to the total.
E.g.
total = 0
symblen = 0
newchar = InputByte()
While newchar is valid
    total = ( total * 13 ) + newchar
    symblen = symblen + 1
    newchar = InputByte()
Wend
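In C, the same loop might look something like this (only a sketch; the NUL-terminated string interface is an assumption):

E.g.

/* Multiply-by-13 hash of a symbol; the length comes back in *symblen. */
unsigned hash13(const char *s, unsigned *symblen)
{
    unsigned total = 0, len = 0;
    while (*s != '\0') {
        total = (total * 13) + (unsigned char)*s++;
        len++;
    }
    *symblen = len;
    return total;
}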
This 'total' is a hash of the entire symbol-string, and so it can be used to quickly narrow down a search in the symbol-table. For example, you could logically AND this total with 3FF hex to form a 10-bit value and use that directly as an index into a table of starting nodes. Each node would describe a unique string (e.g. "DrawSprite" or "TrackerCountDown") and have a link to the next node which shares the same hash value. This way you only string-compare the symbols which share the same hash value. In the most ideal situation symbols will be distributed evenly throughout the entire 10-bit (1024 entry) start table, so given 10,000 symbols we would only need to compare about ten strings!!
But this assumes that every symbol is the same length. In fact the length can help reduce the number of string compares even further. In the node structure we could add another link, "LengthLink". This is used to narrow down our search to only the symbols which share the same hash and the same length.
A node structure for a symbol could look like this:
symb STRUC
NextLink dw ? ; pointer to the next symbol of the same hash
; and same length
LengthLink dw ? ; pointer to the next symbol of the same hash
; but of a different length
Length dw ? ; number of characters in the following string
String dw ? ; address of the ASCII string
symb ENDS
The process goes something like this:
1. Scan the symbol "search" string, create the hash and count the length.
2. Index into the hashtable using the hash, this gives the starting node. If the hashtable entry is 0 (null), then there is no string which matches the hash, so define the new entry in the hashtable, and finish.
3. If the node's string length <> symbol string length then follow the "LengthLink" pointer and repeat until we find the correct length. (If we hit a null link instead, the symbol is new, so add a node for it and finish.)
4. Compare the node's string against the search string; if they are equal then we have found the string, so quit. Otherwise follow the "NextLink" and repeat the process. (Again, a null link means the symbol is new.)
As you can hopefully see, it's an efficient way to reduce the number of string compares. First the hash divides the number of possible searches by 1024 or so, and then the length of the symbol reduces this further. And all of this happens before a single string-compare has been done.
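Putting steps 1 to 4 together, the whole lookup could be sketched in C as below. All the names here are placeholders of mine, the error checks are omitted, and hash13 is the little hashing sketch from earlier, repeated so this example stands on its own.

E.g.

#include <stdlib.h>
#include <string.h>

typedef struct Symb {
    struct Symb *next_link;     /* same hash, same length          */
    struct Symb *length_link;   /* same hash, different length     */
    unsigned     length;        /* characters in the string        */
    const char  *string;        /* the symbol's ASCII text         */
} Symb;

static Symb *hashtable[1024];   /* start node for each 10-bit hash */

static unsigned hash13(const char *s, unsigned *symblen)
{
    unsigned total = 0, len = 0;
    while (*s != '\0') { total = total * 13 + (unsigned char)*s++; len++; }
    *symblen = len;
    return total;
}

/* Find a symbol, adding it to the table if it is not already there. */
static Symb *lookup(const char *name)
{
    unsigned len;
    unsigned h = hash13(name, &len) & 0x3FF;   /* step 1: hash and length */

    /* Steps 2 and 3: walk the "LengthLink" chain to the right length. */
    Symb **linkp = &hashtable[h];
    while (*linkp != NULL && (*linkp)->length != len)
        linkp = &(*linkp)->length_link;

    /* Step 4: walk the "NextLink" chain comparing the actual strings. */
    if (*linkp != NULL) {
        Symb *s = *linkp;
        for (;;) {
            if (memcmp(s->string, name, len) == 0)
                return s;                      /* found it */
            if (s->next_link == NULL) {
                linkp = &s->next_link;         /* append after the chain */
                break;
            }
            s = s->next_link;
        }
    }

    /* Not found anywhere: build a new node and hang it on the chain. */
    Symb *n = calloc(1, sizeof *n);
    char *copy = malloc(len + 1);
    memcpy(copy, name, len + 1);
    n->length = len;
    n->string = copy;
    *linkp = n;
    return n;
}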
A groovy new hash technique
Some time ago I had a nice idea to improve the string hashing function, and it all came from thinking about disks, sectors and formatting.
Okay, so what do we need for a hashing function?
It must produce the greatest number of hash values for the widest range of input strings. It should be very fast, and it must give the same, identical value for the same, identical input string. Most important of all, the hash value must be much smaller than the input string; a word or a byte is ideal. And similar strings should give very different hash values.
There is something which already does this...
Do the letters C...R...C... ring any bells?
After reading about CRCs and writing a simple test program, it does indeed seem to work and it's no slower than the "multiply by 13" method.
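As a rough sketch of the idea, here is a small table-driven CRC used in place of the multiply-by-13 total. Any CRC width would do; the 32-bit version with the standard 0xEDB88320 polynomial is simply one choice.

E.g.

#include <stdint.h>

static uint32_t crc_table[256];

/* Build the usual CRC-32 lookup table (polynomial 0xEDB88320). */
static void crc_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

/* CRC-32 of a symbol, masked down to a 10-bit start-table index. */
static unsigned crc_hash(const char *s)
{
    uint32_t c = 0xFFFFFFFFu;
    while (*s != '\0')
        c = crc_table[(c ^ (unsigned char)*s++) & 0xFF] ^ (c >> 8);
    return (c ^ 0xFFFFFFFFu) & 0x3FF;
}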
Closing Words
Well, that's my 3rd article for Hugi done. I hope it wasn't too dull. If you are still having problems trying to understand the techniques used here then try reading about "Linked-Lists" and "Binary-Trees" in case you need a little reminder.
Have fun.
Regards