I was looking for a method to sort a really large data-set without using all a users RAM, using all their disk space, or having them sit and wait for N² comparisons.
I found the answer in the form of a Trie-tree, a neat data structure that flattens ASCII characters into… well a tree.
Lets say we have the following words:
A Trie Tree would represent the data in the following way:
Each node represents a character which can have 24 children (or more depending on the character set). A node can be marked as an endpoint (in red) to signal the completion of a character string. Duplicate characters in similar words only use one node between them, as seen with A P for ape, apple and app, this has the wonderful side affect of compressing the sort data-structure. Our input data-set contained 24 characters, our tree contains 17 characters (Ignoring overheads).
To retrieve a sorted list of words, you can traverse the tree taking the left most node, checking if its an endpoint (marked in red) if it is, return this, if not, take the left most child, unless there is no more children, then travel up to the parent and take the next adjacent child.
1. It’s not going to be great for supporting the full unicode character set. A binary tree would probably work better.
2. There is a overhead traversing the structure to add and read the items, but compared to other sorts this isn’t extreme.
3. In this implementation only one copy of each word is returned. This can easily be fixed by keeping a counter at each endpoint for duplicates.
4. Features like #3 eat ram in every node. This adds overhead to the tree and the compression is less efficient.