mh-z - Heap-based Boggle/Tangleword Solver Algorithm

mh-z / Heap-based Boggle/Tangleword Solver Algorithm

Introduction:

After reading about the trie-based algorithm on the main page, you should be aware of its limitations. I came up with this algorithm as a way of avoiding the use of the trie altogether, one benefit of which is that it is more suitable for use in a CGI program.

Assuming that there are no duplicate letters on the board, the heap-based algorithm is very simple. I will explain the algorithm in this way initially. The fact that there can be duplicate letters severely complicates things, but we'll deal with that later.

Heap:

A heap is a data structure also known as a priority-queue. A regular queue is similar to a line of people who are treated in a first-come, first-serve manner. In a computer program, a queue can be used to store many bits of information. Let's say we have a queue named queue. We can perform the following operations to insert the letters t, h, and e into the queue:

queue.insert('t'); queue.insert('h'); queue.insert('e');

Then, later, we can do the following:

queue.remove();

...and the result we get back will be "t." The queue can be used with any kind of data: for example letters, numbers, and complex objects.

A heap has the property that there is some sort of weight factor associated with each element in the queue, and the element with the highest priority is always taken out first. A heap where the lowest weight is assigned the highest priority is called a minheap and a heap where the highest weight is assigned the highest priority is called a maxheap.

In the case of the heap used with the Tangleword Solver, a minheap is used and the weight factor is the numerical value of the letter, such that "a" is 1, "b" is 2 and so on. So if we perform the following sequence of operations on a heap named heap:

heap.insert('t'); heap.insert('a'); heap.insert('r');

...and then perform

heap.remove(); heap.remove(); heap.remove();

...we will get back the letters in alphabetical order: first "a," then "r," then "t."

Rather than just inserting letters, we could insert some additional information along with each letter. For example, in the case of the Tangleword Solver, x and y coordinates of the letter's position on the board. That information is not used in determining the letter's weight, although such designations are completely up to the programmer.

No-duplicate solution:

Let's say we have the following board:

s l a r e t u op

First, we stick all of the letters on the board into a heap (heap1). Then we take one letter out, which in this case is a. Then we insert all letters surrounding a (l, e, t) into another heap (heap2). We take one letter out, which is e.

Now, are there any words that start with ae? Assume that we have a very simple dictionary and that there aren't. We scan through the dictionary, starting at the first word, until we hit the first word that either has the prefix ae or is alphabetized after where it would appear. Since we are pretending that there aren't any words with that prefix, the latter is the case.

So, ae is a dead path and we go back to heap2 for another letter: l. Are there any words that start with al? Sure, there are tons of them (but the program doesn't know that yet). We scan through the dictionary and find that there is a word that starts with al. Now we check if the dictionary word is al and not just a bigger word with that prefix. Since our dictionary only has regular words (not names) the result is that it only a prefix. So, the path al is good and we continue with it. We add the letters adjacent to l to a new heap (heap3). It should be mentioned at this point that only letters not yet in the path are added: s, r and e. We take out one letter from heap3: e.

Are there any words that start with ale? We scan forward through the dictionary until we hit a word with that prefix or pass the position where one would be. As it turns out, we do find a word with that prefix: "ale." We check if the prefix is the same as the whole word: it is. So, we've found the first word on the board, "ale," and print it out. Now, since ale is a valid path, we have to continue from there by adding the letters surrounding it to heap4: s r t u o p.

Where we stand:

heap1 contains: s, l, r, e, t, u, o, p
heap2 contains: l, t
heap3 contains: s, r
heap4 contains: s, r, t, u, o, p

The current path is ale

So you don't have to refer so far back up on the page, the board is:

s l a r e t u o p

We continue:

aleo does not start any words.

alep starts at least one word (add t and o to the next heap).

alept does not start any words.

alepo does not start any words.

aler does not start any words.

ales starts a word and is the word, so we print it out (add r to the next heap).

alesr does not start any words.

alet does not start any words.

Now the path from ale is exhausted. We continue:

alr does not start any words.

als starts at least one word (add r and e to the next heap).

alse does not start any words.

alsr does not start any words.

alt starts at least one word (add e, o and p to the next heap).

alte starts at least one word (add s, r, u, o and p to the next heap).

alteo does not start any words.

altep does not start any words.

alter starts a word and is the word, so we print it out (add s, u and o to the next heap).

altero does not start any words.

alteru does not start any words.

alters starts a word and is the word, so we print it out (there are no more letters to add to the next heap).

altes does not start any words.

And so on... the pattern should be clear by now. We continue until heap1 is empty, at which point all the words on the board that are in the dictionary will have been found, in alphabetical order.

Duplicate letters:

What are the problems associated with having duplicate letters on the board?

Let's say we have this board:

a c d e f g h b a

Recall that the first step is to add all of the letters on the board into a heap. We do that, and then take out the first letter, which is a. Notice that there are two a's on the board. If we started with the upper-left a, the first path taken would begin with ac. But, the first path stemming from the lower-right a is ab, which should come first! We need to get the words in alphabetical order.

Assuming we somehow figured out to go with the lower-right a first, then on to ab and fully explore that prefix, the next path to explore from the same a is af. But, that comes after a path which is possible if we had gone with the upper-right a, namely ac. As you can see, we have to jump around a bit, trying prefixes in this order: a, ab, ac, ae, af, ag, and so on. Whether we use one a or the other can change at each step.

Look at the prefix af. There are two ways of getting to it. Which one should we use? The short answer is that we should use both. But which one should be explored first?

On a large board there may be over 10 repetitions of some letters, with many letters on the board being repeated at least once. Obviously a simple method must be devised to take care of the situations that will arise.

Solution with duplicates:

Let's say we have this board, with two duplicates:

a c b e f g h b a

Our first step, as described above, was to add all of the letters on the board into a heap. This time, however, not only do we add each letter into the heap, but we attach some information to each letter: the coordinates of its position on the board, and its predecessor letter structure (if there is one). So one a would be a(1,1,null) and the second a would be a(3,3,null). null means that the letter is the very first one in the word being built, so there is no previous letter to point to.

We add all letters surrounding both a's into the next heap (heap2). For the first, we have c(1,2,a(1,1,null)), e(2,1,a(1,1,null)), and f(2,2,a(1,1,null)). For the second, we have f(2,2,a(3,3,null)), g(2,3,a(3,3,null)), and b(3,2,a(3,3,null)).

We draw a letter out of heap2, which is b. But we have to call it by its full name, b(3,2,a(3,3,null)). (Since the other b on the board is not adjacent to our b, it's irrelevant for now.) We add the letters surrounding this b to heap3: the first one added is e(2,1,b(3,2,a(3,3,null))).

After the path from ab has been exhausted, we take out the next letter from heap2: c for short, or c(1,2,a(1,1,null)) in full. We are at the prefix ac, but now the other a is in use, and that information is stored in the letter structure. The letter structure's information also tells us which letters on the board have already been visited in the current path, so we don't add them into to the next heap.

One more consideration is that it is wasteful to actually store the data about the entire chain of predecessors in each letter structure, because we would have to copy it for each new letter added. This wastes time and memory. For example, if we have e(2,1,b(3,2,a(3,3,null))) as the current letter and want to add the a in the upper-left corner, we would have to make a(1,1,e(2,1,b(3,2,a(3,3,null)))) Rather, we can store a pointer to the predecessor, which contains a pointer to its predecessor, and so on. A pointer is just an index of where a particular item can be found in the computer's memory, that is, its memory address. So in the example here we would have a(1,1,pointer) rather than having to copy that whole string in place of pointer.

Pseudocode and implementation details:

The following is pseudocode for the heap-based algorithm used by my Tangleword solver.

Define a structure called Letter which contains:
- A letter
- The coordinates of the letter on the board
- A pointer to a Letter (a memory address)
- The structure also defines the less-than operator so that two Letters may be compared and in general Letters can be ordered and sorted.
- Each letter on the board will be represented by a Letter containing it and its coordinates.
Create a heap called heap big enough to hold N-squared Letters, where N is the length of a side of the board. Add all letters on the board to heap each as a Letter with its pointer member set to null.
Create a character array called word large enough to hold the longest word in the dictionary file. 50 characters is plenty. Create a variable pos to store the index of the current position in the word, starting out at 1 (the index of the first letter in word).
(A) Begin the following recursive procedure, given a particular heap named heap:
- Create another heap, called nextheap, with the capacity to hold 8 times the number of letters currently in heap.
- (B) Begin a loop:
  - Remove a Letter from heap, call it currentletter.
  - Set the the letter at pos in the array word to the letter contained in Letter.
  - The letters in word up to index pos form a prefix. Scan forward in the dictionary file to see if there are any words that start with this prefix.
  - If there are no words starting with the current prefix, and heap is not empty, go back to the beginning of the current loop.
  - If there are words starting with the current prefix:
    - Check if the dictionary word that starts with the above prefix is the same length as the prefix (i.e., the words are identical), in which case a word was found and should be output.
    - Check if currentletter is q, in which case insert a new Letter into nextheap containing u, and the same coordinates and pointer member as currentletter.
    - Restore the trail of currentletter:
      
      If the pointer member of currentletter is not null, it points to another Letter. Get that Letter and mark its coordinates as visited. Repeat. (Essentially we have a linked-list of Letters and are restoring the data contained in the list).
    - Add adjacent Letters to nextheap:
      
      For each Letter adjacent to currentletter, check if that Letter's coordinates have been marked as visited. If no, add that Letter to nextheap with its pointer member set to point to currentletter.
    - Remove the trail of currentletter which was restored above (a stack may be used to keep track of and remove the trail, or the entire board can be cleared, or we can follow the linked-list starting from currentletter again as we did before, this time marking the coordinates as being unvisited).
    - Test: does the next Letter in heap hold the same letter as currentletter does?
      - If true, repeat the current loop (B) which will add more letters to nextheap. [This is why the maximum size of nextheap is 8 times the number of letters in heap; each Letter can have up to 8 adjacent Letter structures, and nextheap is not cleared between duplicate letters.
      - If false, then the next Letter in heap is not a duplicate. Call (A) using nextheap as heap and incrementing pos by 1. [The current function calls itself, or in other words, recurs.]
  - If heap is not empty, jump to (2). Otherwise, all paths from the letter are exhausted, and we are...
- Done (return to location of last call if there was one).

Possible optimizations:

Although the above method as described is quite efficient already, here are some ideas which could possibly be used to optimize the algorithm even further.

Memory: The amount of memory allocated for nextheap is usually much larger than is necessary. [If you use a real tree for your heap (i.e., memory is allocated on a per-item basis) this does not apply. But in general a tree-style heap uses far more memory than a heap stored in an array, since the pointers necessary to hold the tree structure end up taking up about as much room as the elements.]

The largest number of elements that would need to be stored in nextheap depends on the maximum number of repetitions of letters in heap. For example, take the board

a c e d e f g b a

We need to get the words out of the board in alphabetical order, but we might have words that start with ab begin in the lower right corner, then ac and ad in the upper left corner, finally af in the lower right corner again. This suggests that the letters following both a's should be in the same nextheap so that they can be extracted in alphabetical order. Now let's say the board was bigger than the 3x3 board pictured here. If both a's were not touching the edges of the board, each would require all 8 letters surrounding it to be added to the nextheap. [Note: the same position on the board can be added to the same nextheap, since the Letter structure at that position would have a different pointer member.]

Since nextheap is emptied by the recursive call for each set of letters added to it, the maximum number of letters nextheap must be capable of storing is 8 times the maximum number of duplicates in heap. However, since it requires significant additional computation to determine the maximum number of duplicate letters in each heap, the upper bound is taken to be that every letter in heap is the same (which is entirely possible), although improbable. Unless the user types in a board with every letter the same, just as an experiment...

If one used a quick way to determine the maximum number of repetitions of a particular element in a heap (maybe as a built-in variable in the heap code which is automatically updated as elements are added and removed) then this optimization could be implemented. Note that it is relatively minor (it saves a small amount of memory at the expense of some additional processing) and the program already doesn't use much memory.

Speed: Scanning through the dictionary file does not necessarily require that every word be read in (although that is the case in the current implementation). The current implementation uses a function ScanToPrefix which simply gets the next word in the dictionary, checks if the word is greater than, less than, or equal to the current prefix, and decides either that the current prefix is or is not a prefix to any dictionary words, or that the loop should get the next word in the dictionary and check it. Every single word in the dictionary (up to the last prefix that is checked) is read and temporarily stored into memory. Basically, a 5 megabyte file is read through (in order) on disk each time the Tangleword solver is run.

A better method might be to: Jump ahead in the dictionary file a certain number of bytes, as determined by (a) the last prefix that was checked, (b) the prefix being checked now, and (c) a table of the number of words in the dictionary with common prefixes. Then, perform a binary search by jumping around a tiny bit in the file or just jumping back and scanning to find the starting point of the current prefix or the point at which the prefix should be in the dictionary. Using a method like this it is possible that a great deal of the dictionary file could be bypassed.

Some considerations might be the fact that seeking continually within a file is slower than reading it linearly (I don't know this to be a fact) or that it would interfere with the file being cached well by the operating system. On the other hand, if the program were run several times the entire file could be cached to memory, making seeking within the file fast.

See also:

CGI Boggle/Tangleword Solver