One useful and classic data structure for storing and retrieving dictionary information is a
Binary Search Tree. I have worked on implementing a self balancing BST, specifically
red black since my input is sorted and I need to ensure that the tree is well balanced which would allow us to have search and insertion operations within O(log n)
With such a structure #2 in this case be solved with it too since the closest(lexicographic) values to an element that is not found upon hitting a Nil value are the parent or Nil, parent of parent, and left or right sibling. For example if we search for the misspelled word explsive in balanced BST we will likely to reach a place like this:
Lexicographically, explsive value is greater than explosive and greater than explosiveness where explosiveness is a leaf in the tree so the search will keep going right until it hits a Nil node then it would return the thee nearest lexicographic node.
|
explosive
/ \
/ \
explosion explosiveness
/ \ / \
/ \ / \
Nil Nil Nil Nil
class node(object):
"""node element in the BST, properties:left child, right child, parent, key"""
def __init__(self,key):
self._left = None
self._right = None
self._parent = None
self._key = key
self._red = False
left = property(fget = lambda self:self._left, doc = "The node's left child")
right = property(fget = lambda self:self._right,doc = "The node's right child")
parent = property(fget = lambda self:self._parent,doc = "The node's parent")
key = property(fget = lambda self:self._key,doc = "The node's key")
red = property(fget = lambda self:self._red, doc = "Boolean to check if the node is red or not")
class tree(object):
"""Self balancing binary search tree (red black)"""
def __init__(self,create_node = node):
self._nil = create_node(key = None)
self._root = self.nil
self._create_node = create_node
root = property(fget = lambda self: self._root,doc = "The tree's root node")
nil = property(fget = lambda self: self._nil, doc = "The tree's nil node")
def insert_key(self,key):
"""Inserts a new node into the tree"""
self.insert_node(self._create_node(key=key))
def insert_node(self,newnode):
"""Inserts a new node into the tree and call the rebalance method"""
y = self.nil
x = self.root
while x != self.nil:
y = x
if newnode.key < x.key:
x = x.left
else:
x = x.right
newnode._parent = y
if y == self.nil:
self._root = newnode
elif newnode.key < y.key:
y._left = newnode
else:
y._right = newnode
newnode._left = self.nil
newnode._right = self.nil
newnode._red = True
self._insert_fixup(newnode)
def _insert_fixup(self,newnode):
"""Rebalance the tree after insertion of a new node"""
while newnode.parent.red:
if newnode.parent == newnode.parent.parent.left:
y = newnode.parent.parent.right
if y.red:
newnode.parent._red = False
y._red = False
newnode.parent.parent._red = True
newnode = newnode.parent.parent
else:
if newnode == newnode.parent.right:
newnode = newnode.parent
self._left_rotate(newnode)
newnode.parent._red = False
newnode.parent.parent._red = True
self._right_rotate(newnode.parent.parent)
else:
y = newnode.parent.parent.left
if y.red:
newnode.parent._red = False
y._red = False
newnode.parent.parent._red = True
newnode = newnode.parent.parent
else:
if newnode == newnode.parent.left:
newnode = newnode.parent
self._right_rotate(newnode)
newnode.parent._red = False
newnode.parent.parent._red = True
self._left_rotate(newnode.parent.parent)
self.root._red = False
def _left_rotate(self,x):
"Rotates the tree to the left based on a node x"
y = x.right
x._right = y.left
if y.left != self.nil:
y.left._parent = x
y._parent = x.parent
if x.parent == self.nil:
self._root = y
elif x == x.parent.left:
x.parent._left = y
else:
x.parent._right = y
y._left = x
x._parent = y
def _right_rotate(self,y):
"Rotates the tree to the right based on a node y"
x = y.left
y._left = x.right
if x.right != self.nil:
x.right._parent = y
x._parent = y.parent
if y.parent == self.nil:
self._root = x
elif y == y.parent._right:
y.parent._right = x
else:
y.parent._left = x
x._right = y
y._parent = x
def search(self,targetkey,current,parent):
"""Recursive subtree search starting from current node"""
if current == self.nil:
return (parent.parent.key,parent.parent.right.key,parent.parent.left.key)
elif current.key == targetkey:
return (current.key,)
elif current.key > targetkey:
return self.search(targetkey,current.left,current)
else:
return self.search(targetkey,current.right,current)
Here is a little test program the input file is the .doc I pointed out in my previous post above.
from rbmodule import tree
dic_tree = tree()
print "Loading file..."
for line in open("en_US.dic","r"):
newn = dic_tree.insert_key(line[:line.find("/")])
while(True):
userinput = raw_input("Enter a word to check: ")
search = dic_tree.search(userinput,dic_tree.root,None)
if str(search[0]) == userinput:
print "correct"
else:
print search
Another alternative to using a Red Black Tree to decrease the complication and ensure a balanced BST would have been to read the sorted file into a list and from there have a function that divides the array into sub problems and for each one takes the middle element, least as left and right child as the greatest element and build all subtrees in to one balanced BST. Something like this(not tested, wrote it here directly):
def list_to_bst(words,start,end):
if start > end:
return None
else:
mid = (start+end)/2
element = node(words[mid])
element._left = list_to_bst(words, start, (mid-1))
element._right = list_to_bst(words, (mid+1), end)
return element
All this as long as we are sure that there wont be more additions after the dictionary has loaded.
I realized while making this post that the search function is not very accurate in corrections if the target node is not a leaf or at least one element before a leaf, since the base case is for hitting a Nil node. I will work on enhancing it later today maybe.
Yet the use of BST for a spell checker is not really the best option, one reason is that nodes are compared according to their keys instead part of their associated records. Which makes a
Trie structure much more suitable for this kind of task.
Here is what I have for now I'll see what I can add later on.