Introduction to Sorting Algorithms
Sorting is a topic every computer science student who wants to graduate from
university successfully will have to deal with. That's why it's important to
have some knowledge about it (even though you don't absolutely need to master
it for everyday coding, as you can simply use the qsort() function from the
C standard library). My doc can be used both as a tutorial for beginners and
as a reference.
Introduction & Declarations

Definition. In general, sorting means the rearrangement of data according to a defined pattern. The task of a sorting algorithm is to transform the original unsorted sequence into the sorted sequence.

Types of algorithms. Computer scientists differentiate between internal and external sorting algorithms. While external ones need a large amount of extra memory, the internal algorithms manipulate the original array and only need a few bytes for the stack and temporary variables. With the exception of Merge Sort, all of the algorithms I'm going to describe belong to the class of internal sorting algorithms.

Practical use. Sorting algorithms such as the ones covered in this doc are used in a lot of applications. Some examples are: office apps (databases, spreadsheets, word processors, ...), compression utils (e.g. for Huffman encoding), computer games and even demos ;) (e.g. for Z-buffering).

Limits. In this doc I only deal with the sorting of numerical sequences. But the same methods can be applied to all kinds of mono-key data, including strings. However, the algorithms I am going to describe cannot be used for sorting sequences according to several keys, e.g. surname and first name.

Conventions. From now on I'll refer to the number of elements in the unsorted sequence as n. The first element gets the index 0, which is why the last element gets the index n-1. The central value of the unsorted sequence is the element with index int(n/2). Sub-sequences are sequences inside a larger sequence. I'll also use the terms "section", "partition" and "area", which are supposed to have the same meaning as "sub-sequence". "Ascending order" and "incremental order" mean "iterating forwards" (e.g. for(i=1; i<100; i++)). "Descending order" and "decremental order" mean "iterating backwards" (e.g. for(i=99; i>0; i--)).

Example sequence. The initial sequence I'm going to use in order to show how the following algorithms work is 8, 4, 1, 5, 4. Our aim is to transform it into 1, 4, 4, 5, 8.

Selection Sort

The principle of Selection Sort is to find the smallest element of the data sequence (from index 0 to n-1) and put it at the beginning of the sequence. This procedure is then applied to the yet unsorted areas (1 to n-1, 2 to n-1 and so on), until the area from n-2 to n-1 has been sorted. Now the entire data sequence has been transformed into sorted form. To get the smallest element to the right position, simply exchange it with the first element of the (sub-)sequence. Example:

8 4 1 5 4   (1 is the smallest element; exchange it with 8)
1 4 8 5 4   (4 is the smallest element of area 1 to 4; it already is in place)
1 4 8 5 4   (the rear 4 is the smallest element of area 2 to 4; exchange it with 8)
1 4 4 5 8   (5 is the smallest element of area 3 to 4; it already is in place)
1 4 4 5 8   (sorted)
Selection Sort is an iterative algorithm (although a recursive implementation is possible, too - but it's just a waste of brain- and CPU-power...). You use a for-loop during which the counter is gradually incremented from 0 to n-2. As a consequence the data is sorted after n-1 iterations. My implementation of Selection Sort in C:
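A sketch of what such a function may look like (I assume an array of int; swap() is the helper defined below):

void swap(int *a, int *b);   /* defined below */

/* Sorts the n ints in array using Selection Sort. */
void selectionsort(int *array, int n)
{
    int i, j, min;

    for (i = 0; i < n - 1; i++)   /* n-1 iterations */
    {
        min = i;   /* index of the smallest element found so far */
        for (j = i + 1; j < n; j++)
            if (array[j] < array[min])
                min = j;
        /* put the smallest element to the beginning of the area */
        swap(&array[i], &array[min]);
    }
}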
It uses the function swap that is defined as follows:
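For int elements, a minimal version may look like this:

/* Exchanges the values of the two ints a and b point to. */
void swap(int *a, int *b)
{
    int tmp = *a;

    *a = *b;
    *b = tmp;
}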
Division and Insertion Algorithms: Insertion Sort

The data sequence is divided into two sections: an already sorted target sequence and a still unsorted source sequence. The target sequence initially consists of the first element only; however, it grows by one element per iteration. In every iteration the first element of the current source sequence is compared with the elements of the target sequence in incremental order. If it is lower than or equal to an element of the target sequence, all elements from that target element up to the element just before the first element of the source sequence are moved one position to the right, the current element is inserted at the location that has become free, and the iteration is stopped. Sorting is finished once all elements belong to the target sequence. Example (the target sequence is put in square brackets):

[8] 4 1 5 4   (4 <= 8: insert the 4 before the 8)
[4 8] 1 5 4   (1 <= 4: insert the 1 before the 4)
[1 4 8] 5 4   (5 <= 8: insert the 5 before the 8)
[1 4 5 8] 4   (4 <= 4: insert the rear 4 before the front 4)
[1 4 4 5 8]   (sorted)
Like Selection Sort, Insertion Sort is iterative, too. The moving of the elements can be done by swapping; see my code below.
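A sketch of this basic version (int data again; swap() as above):

/* Basic Insertion Sort: the target sequence is array[0..source-1].
   array[source] is compared with the target elements in incremental
   order; once its place is found, it is moved there by swapping,
   which shifts the greater elements one position to the right. */
void insertionsort(int *array, int n)
{
    int source, i, j;

    for (source = 1; source < n; source++)
        for (i = 0; i < source; i++)
            if (array[source] <= array[i])
            {
                for (j = source; j > i; j--)
                    swap(&array[j - 1], &array[j]);
                break;
            }
}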
Optimizing Insertion Sort

This version isn't optimal yet; the algorithm can be further improved. One way is not to compare the first element of the current source sequence with the elements of the target sequence in incremental order, but in decremental order. I.e. instead of comparing array[source] first with array[0], then array[1] etc., compare array[source] first with array[source-1], then array[source-2] etc. The advantage of this: If array[source] is greater than or equal to array[source-1], the execution of the iteration can be stopped and array[source] can be treated as part of the target sequence. However, if array[source] is lower than array[source-1], you can go on by swapping the elements, then comparing array[source-1] with array[source-2] etc. In C code:
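A sketch of the optimized version:

/* Optimized Insertion Sort: array[source] sinks to the left by
   swapping as long as its left neighbour is greater. */
void insertionsort(int *array, int n)
{
    int source, i;

    for (source = 1; source < n; source++)
        for (i = source; i > 0 && array[i] < array[i - 1]; i--)
            swap(&array[i - 1], &array[i]);
}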
Division and Insertion Algorithms: Binary Insertion Sort

This is a differently optimized variant of the original, slow Insertion Sort algorithm. Now it isn't the smallest element of the target sequence that is compared with the current element of the source sequence first; instead, it's the central element. Depending on whether the source element is lower or greater than this central element, the search continues in the same way among the elements to the left or to the right of it. (If the two values are equal, simply insert the source sequence element to the right of the target element in order to reduce the number of move operations.) In this way the number of comparisons may be reduced.

Division and Insertion Algorithms: Shell Sort

Shell Sort is an extension of Insertion Sort. First various parts of the data sequence are "pre-sorted" using a slightly altered Insertion Sort function that has a parameter used to specify the distance between the elements that are to be compared. Then the entire sequence is processed using a normal call of Insertion Sort (i.e. using distance 1). Due to the previous "pre-sorting" this may work faster than directly applying Insertion Sort to the unsorted sequence. My experiments have led to the conclusion that the efficiency of Shell Sort heavily depends on two factors: a. the initial distance, b. the value by which the distance is divided in each iteration. Here's some statistics:
In other words, the only one of these three ways that brought an improvement in these two cases in comparison to Insertion Sort (where the number of iterations = n) was the third one, with an initial distance of n/10, divided by 10 per iteration. Here is my implementation of it:
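A sketch along these lines (insertionsort_dist is the Insertion Sort from above, extended by a distance parameter; the name is my choice):

/* Insertion Sort with a configurable distance between the compared
   elements; distance 1 is the normal Insertion Sort. */
void insertionsort_dist(int *array, int n, int dist)
{
    int source, i;

    for (source = dist; source < n; source++)
        for (i = source; i >= dist && array[i] < array[i - dist]; i -= dist)
            swap(&array[i - dist], &array[i]);
}

/* Shell Sort: pre-sort with initial distance n/10, dividing the
   distance by 10 per iteration; the final pass uses distance 1. */
void shellsort(int *array, int n)
{
    int dist;

    for (dist = n / 10; dist > 1; dist /= 10)
        insertionsort_dist(array, n, dist);
    insertionsort_dist(array, n, 1);
}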
Exchange Algorithms: Bubble Sort

The name of this algorithm is supposed to indicate that the elements "rise" like bubbles. All elements are compared with their successors; if the element with the lower index is the greater one, the elements are exchanged. Once the entire sequence has been processed this way, the whole process is repeated - until there's been an iteration during which no exchange took place. Then the data is sorted. Example (the elements that are currently compared with each other are in bold print):
Bubble Sort is iterative. In contrast to the other algorithms explained so far, the number of iterations of the main loop cannot be deduced from n; that's why one usually makes use of a while loop. Bubble Sort is very slow most of the time - except if the unsorted sequence actually is already sorted. Then Bubble Sort works faster than the other algorithms: already after one iteration (consisting of n-1 comparisons) it realizes that the sequence doesn't need to be sorted, and it stops. Here's my implementation of Bubble Sort:
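A sketch of it (swap() as above):

/* Bubble Sort: scan the sequence again and again until a complete
   iteration happens without a single exchange. */
void bubblesort(int *array, int n)
{
    int i, exchanged = 1;

    while (exchanged)
    {
        exchanged = 0;
        for (i = 0; i < n - 1; i++)
            if (array[i] > array[i + 1])
            {
                swap(&array[i], &array[i + 1]);
                exchanged = 1;
            }
    }
}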
Exchange Algorithms: Shaker Sort

This is a variant of Bubble Sort. The neighbouring elements are first compared in incremental, then in decremental order and so on. This allows the following optimizations:

1. You are now able to regard a sub-sequence ending with index n-1 as already sorted if the last iteration in incremental order ("from left to right") was executed without a single exchange in it. Scanning through the data in decremental order ("from right to left") you can now ignore this sub-sequence; simply continue by comparing the initial element of this sub-sequence with its predecessor.

2. Due to this, the following applies to sub-sequences starting with the element with index 0 if the last iteration was in decremental order: continue with the element of this sub-sequence that has the greatest index and compare it with its successor.

The data sequence is sorted once one of the two scans (from left to right or from right to left) has occurred without an exchange operation. Example (the elements that are currently compared with each other are in bold print):

You see: This example takes two comparisons less than the same one using Bubble Sort.
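Shaker Sort might be coded like this sketch (a simple variant whose sorted borders move inwards by one element per scan - slightly weaker than the optimization described above, but the principle is the same; swap() as above):

/* Shaker Sort: scan alternately from left to right and from right
   to left; after each scan one more element is in its final place. */
void shakersort(int *array, int n)
{
    int left = 0, right = n - 1, i, exchanged = 1;

    while (exchanged && left < right)
    {
        exchanged = 0;
        for (i = left; i < right; i++)          /* left to right */
            if (array[i] > array[i + 1])
            {
                swap(&array[i], &array[i + 1]);
                exchanged = 1;
            }
        right--;        /* the greatest element has risen to the end */
        if (!exchanged)
            break;
        exchanged = 0;
        for (i = right; i > left; i--)          /* right to left */
            if (array[i - 1] > array[i])
            {
                swap(&array[i - 1], &array[i]);
                exchanged = 1;
            }
        left++;         /* the smallest element has sunk to the front */
    }
}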
Exchange Algorithms: Comb Sort

The Comb Sort algorithm is based on a generalization of the principle of Bubble Sort. At the beginning it doesn't compare neighbouring elements but elements with a greater distance between them (the "initial distance"). After every iteration the distance is decreased until it finally is 1, like in Bubble Sort. If an iteration occurs then without an exchange, the data sequence can be regarded as sorted. The performance of Comb Sort heavily depends on its initial distance. Here's an example with an initial distance of 3 (the elements that are currently compared with each other are in bold print):
Sometimes - especially if there's a lot of data - Comb Sort is more efficient than Bubble Sort. However, the opposite may also be the case, and it applies to our example: while Bubble Sort needs 12 comparisons until the sequence is sorted, the Comb Sort shown above needs 13. However, this isn't really the fault of the algorithm; it's the fault of one of its parameters, the initial distance. In this example I've chosen an initial distance of 3, which has proved to be not very efficient. If we, however, set it to 2, performance rises:

Compared to Bubble Sort, we've saved 40% of the comparisons.
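A sketch of Comb Sort in C (here I simply let the distance shrink by 1 per iteration; how fast it shrinks is the second tuning parameter besides the initial distance; swap() as above):

/* Comb Sort: like Bubble Sort, but the compared elements start with
   a greater distance that is decreased after every iteration. */
void combsort(int *array, int n, int initdist)
{
    int i, dist = initdist, exchanged = 1;

    while (dist > 1 || exchanged)
    {
        exchanged = 0;
        for (i = 0; i + dist < n; i++)
            if (array[i] > array[i + dist])
            {
                swap(&array[i], &array[i + dist]);
                exchanged = 1;
            }
        if (dist > 1)
            dist--;     /* decrease the distance until it is 1 */
    }
}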
Of course the ideas behind Comb Sort and Shaker Sort can also be combined into new algorithms.

Divide and Conquer Algorithms: General

The term Divide and Conquer stands for algorithms which divide a task into several sub-tasks, solve each of them and then re-merge them into a total result. They're all recursive algorithms.

Divide and Conquer Algorithms: Merge Sort

The unsorted sequence is divided into two sub-sequences. These, again, are divided into two sub-sub-sequences, and so on. The smallest partitions consist of just one element each; thus, each of them can be regarded as sorted within its borders. Now neighbouring partitions are merged into larger sub-sequences (that's the source of the name of this algorithm), whereby the elements are brought into the correct order. The resulting sub-sequences are, again, merged until finally the entire data sequence has been reunited. Example:

One realizes instantly that the number of necessary divisions is t = log2(n) if n is a power of 2, otherwise t = int(log2(n))+1. After the i-th merge the "borders" of the sub-sequences match the ones they received in partition t-i. As soon as i is equal to t, the entire data sequence has been reunited and is sorted.
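A sketch of a recursive Merge Sort in C (as an external algorithm it needs extra memory - here a temporary buffer of n ints; the names mergesort_rec and buf are my choice):

#include <stdlib.h>

/* Sorts array[lo..hi]; buf is a temporary buffer of at least n ints. */
void mergesort_rec(int *array, int *buf, int lo, int hi)
{
    int mid, i, j, k;

    if (lo >= hi)
        return;                              /* one element: sorted */
    mid = (lo + hi) / 2;
    mergesort_rec(array, buf, lo, mid);      /* divide... */
    mergesort_rec(array, buf, mid + 1, hi);
    i = lo; j = mid + 1; k = lo;             /* ...and merge */
    while (i <= mid && j <= hi)
        buf[k++] = (array[i] <= array[j]) ? array[i++] : array[j++];
    while (i <= mid)
        buf[k++] = array[i++];
    while (j <= hi)
        buf[k++] = array[j++];
    for (k = lo; k <= hi; k++)
        array[k] = buf[k];                   /* copy the merged run back */
}

void mergesort(int *array, int n)
{
    int *buf = (int *)malloc(n * sizeof(int));

    if (buf == NULL)
        return;                              /* out of memory */
    mergesort_rec(array, buf, 0, n - 1);
    free(buf);
}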
Divide and Conquer Algorithms: Quick Sort

Quick Sort, developed by C.A.R. Hoare, combines certain aspects of the division and exchange algorithms. Its main principle: the unsorted sequence is rearranged by means of exchanging in such a way that to the left of a certain element only values lower than or equal to it remain, and to the right of the same element only values greater than or equal to it remain. Now the sequence is divided into two sub-sequences which are sorted according to the same principle, and so on.

How Quick Sort works in detail: First you pick a value of this sequence. It can be the first, the last or any other - you can even use a random number generator to choose it for you. I've decided on the central value (see the Introduction & Declarations part for a definition). Now the main loop starts, which includes two sub-loops. In the first, a variable (let's call it min-counter) iterates through the list in ascending order, starting with the element with index 0, until an element greater than or equal to the central value is found. That's where the min-counter loop stops. In the second, another variable (let's call it max-counter) iterates through the list in descending order, starting with the element with index n-1, until an element lower than or equal to the central value is found. If min-counter now is lower than or equal to max-counter, the elements to which these two variables point are exchanged; then min-counter is increased and max-counter is decreased by 1, and the main loop continues. However, if min-counter is greater than max-counter, the execution of the main loop is stopped. Now it's time to check which sub-sequences haven't been sorted yet. Only if min-counter has reached the greatest index and max-counter the lowest index of the current part of the sequence can this section be regarded as sorted. Otherwise the function needs to call itself with new parameters:

- If min-counter is lower than the greatest index in the current section, the sequence from the element with index min-counter to the one with the greatest index in the current section has to be sorted.
- If max-counter is greater than the lowest index in the current section, the sequence from the element with the lowest index in the current section to the one with index max-counter has to be sorted.

Example (the current section is in bold print, the current number in italics, and the positions where the loops stop are marked with > > for min-counter, < < for max-counter and | | for both):
As you see, Quick Sort usually deserves its name. But in some cases (e.g. if the chosen comparison value repeatedly splits the section very unevenly) it's very slow - as computer scientists would write it: it has a worst-case complexity of O(n^2). Therefore Quick Sort isn't the way to go in realtime applications such as the controlling mechanisms of medical devices. Furthermore, due to the fact that it's a recursive algorithm, it may need a lot of (stack) memory. Here's my implementation of Quick Sort:
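A sketch of such a function (the parameters lo and hi are the borders of the current section; swap() as above):

/* Sorts the section of array from index lo to index hi. */
void quicksort(int *array, int lo, int hi)
{
    int min = lo, max = hi;
    int mid = array[(lo + hi) / 2];      /* the central value */

    while (min <= max)
    {
        while (array[min] < mid)         /* stop at an element >= mid */
            min++;
        while (array[max] > mid)         /* stop at an element <= mid */
            max--;
        if (min <= max)
        {
            swap(&array[min], &array[max]);
            min++;
            max--;
        }
    }
    if (lo < max)
        quicksort(array, lo, max);       /* left part not sorted yet */
    if (min < hi)
        quicksort(array, min, hi);       /* right part not sorted yet */
}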
Note that while the other algorithms are called using a pointer to the array and the value of n as parameters, this one requires the pointer, 0 and n-1 as parameters. To standardize the calling process you may additionally use this function:
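For instance (the name quicksort_n is my choice):

/* Lets Quick Sort be called like the other sorting functions. */
void quicksort_n(int *array, int n)
{
    if (n > 1)
        quicksort(array, 0, n - 1);
}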
Optimizing Quick Sort

Quick Sort can be made slightly faster and more economical by transforming it into an iterative algorithm. Basically it works very similarly to the recursive one, but you have to organize the stack with the parameters yourself:
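A sketch of the idea in C (the fixed-size stack and the name quicksort_it are my choices; the partitioning itself is the same as in the recursive version):

/* Iterative Quick Sort: the section borders are kept on a stack
   that we manage ourselves instead of the CPU stack. */
void quicksort_it(int *array, int n)
{
    int stack[256];   /* pairs of (lo, hi) - make sure it's big enough! */
    int sp = 0;       /* stack pointer */
    int lo, hi, min, max, mid;

    if (n < 2)
        return;
    stack[sp++] = 0;                     /* push the whole sequence */
    stack[sp++] = n - 1;
    while (sp > 0)
    {
        hi = stack[--sp];                /* pop a section */
        lo = stack[--sp];
        min = lo;
        max = hi;
        mid = array[(lo + hi) / 2];
        while (min <= max)               /* partition as usual */
        {
            while (array[min] < mid) min++;
            while (array[max] > mid) max--;
            if (min <= max)
            {
                swap(&array[min], &array[max]);
                min++;
                max--;
            }
        }
        if (lo < max)                    /* push the unsorted parts */
        {
            stack[sp++] = lo;
            stack[sp++] = max;
        }
        if (min < hi)
        {
            stack[sp++] = min;
            stack[sp++] = hi;
        }
    }
}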
It saves a bit of memory because only the parameters are put on the stack, while the pointer to the array and the return address aren't. It's also a bit faster due to this. But it's harder to code because you have to take care of the stack yourself. Always keep in mind what can happen if you don't allocate enough memory!

Conclusion

Well, this was my doc on sorting algorithms. In the bonus pack you'll find a .cpp prog that demonstrates the usage of Selection Sort, Insertion Sort, Shell Sort, Bubble Sort and Quick Sort. BTW, Dario Phong explained yet another sorting algorithm, Heap Sort, in an article in Hugi #19. And in case you have some open questions on Quick Sort, feel free to consult his tut in Hugi #15.