[Exercise] Spell Checker

Joe · Nov 14, 2012

I believe this set of problems to be more ambitious than anything we've done here. This is why I would encourage collaboration between members. If you have an idea, even if it's partially relevant, do not hesitate to share it.

According to James Hague, a spell checker used to be a "major feat of software engineering" some 30 years ago. Today's exercise will explore the problem in 2012.

Note on dictionaries
For the purpose of this exercise, we will need a dictionary, a file containing a (more or less) exhaustive list of all the words making up a language (let's stick with English for now).

If you're on Unix (Linux, MacOSX, ...), your OS is already providing you with a file like that. Its location varies according to the distribution, so check with your vendor documentation (or with Google) to find the one on your system. On Solaris it's at /usr/share/lib/dict/words.

If you're on other systems, you might have to check with your vendor documentation or find a dict file on the web.

In any case, let me know if you cannot find the file, I'll upload one to a share host immediately.

Problem 1: Easy exercise
Write a function that takes a text as an input, and returns a list of misspelled words. A word is considered misspelled if it is not found inside the dictionary.

Extra credit:
Note that you're responsible of managing the behavior of your program when dealing with upper/lowercase letters. It is tempting to ignore the issue altogether by making the search completely case-insensitive. However here are a few examples of mistakes that would be missed:

i am living in paris for the moment.
What do yoU thInk Of tHis exErcise

Of course the behaviour is heavily dictated by the dictionary you're using. Including a short text to describe what you're doing will help us follow.

Problem 2: Suggestions
As a follow up to the first problem, modify the function so that it returns several suggestions to replace the mistake detected. For instance, "hosn" could return ("horn", "hose", "horse"). I'll leave it up to you to define what constitutes a valid suggestion (pay special attention to the fact that the mistake might happen at any letter of the word, even the first one).

Extra Credit:
You must determine the suggestion "most likely to be correct". Again, the definition of "most likely" is very loose and up to you to define it. One idea could be for instance the percentage of mistaken letters ("hosn" is more likely to be "horn" than "horse"). An other idea would be to determine it using statistics. We could determine, for instance, that the mistake "s->r" is more likely to occurr than "n->e" therefore "horn" is more likely than "hose". We could add many factors into account, like user-specific habits and/or the layout of a QWERTY keyboard, or whatever you think about.

Problem 3: Conjugating verbs
This is my favorite (actually the only one I like) feature of the Microsoft Office suite. Of course, realistically, we will consider the most simple case possible. The general solution to this problem is outside the scope of this exercise. (Note that some languages like French or German seem more complicated by a whole order of magnitude).

To the best of my knowledge, the English language has only one simple rule for conjugating verbs (excluding to be and to have auxiliaries): In the present tense, we append a "s" to any verb used in the 3rd singular person.

Note that "appending an s" follows a more general rule in the language, with special behaviours for different word endings. For instance a verb ending with a "y" or an existing "s" will behave differently. Feel free to ignore these corner cases and focus on simpler cases like the verbs "run" or "eat".

The goal of this exercise is to detect the subject of the sentence and verify if the verb is properly conjugated.

Disclaimer: I honestly have no idea of how difficult this last problem really is.

Ayman · Nov 14, 2012

Nice exercise Rahmu I will for sure do Problems 1 and 2 in one program and maybe attempt 3 if I have time.

One useful spelling dictionary is that of Open Office Spelling Dictionary Extension download the .oxt file and open it in via winrar(or similar archiving tool) the "en_US.dic" file within is the text file containing a list of 62121 words.

Mm0ei · Nov 14, 2012

1st exercise:

print [s for s in re.compile("[^\w']").split("this is oNly a teest!") if s not in open('local.dic', 'rb').read() or len([s[i] for i in range(1, len(s)) if s[i].isupper()])>0]

returns

['oNly', 'teest']

This will split the string according to special characters except for the apostrophe ' because of some words like "doesn't or don't". The words with capital letters other than the first letter will be considered false and also the words that are not included in the dictionary.

I'll edit my post later on to solve the others.

Ayman · Nov 16, 2012

One useful and classic data structure for storing and retrieving dictionary information is a Binary Search Tree. I have worked on implementing a self balancing BST, specifically red black since my input is sorted and I need to ensure that the tree is well balanced which would allow us to have search and insertion operations within O(log n)

With such a structure #2 in this case be solved with it too since the closest(lexicographic) values to an element that is not found upon hitting a Nil value are the parent or Nil, parent of parent, and left or right sibling. For example if we search for the misspelled word explsive in balanced BST we will likely to reach a place like this:

Lexicographically, explsive value is greater than explosive and greater than explosiveness where explosiveness is a leaf in the tree so the search will keep going right until it hits a Nil node then it would return the thee nearest lexicographic node.

                         |
                   explosive
                 /              \
               /                 \
        explosion      explosiveness
            /     \               /     \
          /        \             /        \ 
        Nil       Nil         Nil       Nil

class node(object):
    """node element in the BST, properties:left child, right child, parent, key"""

    def __init__(self,key):
        self._left = None
        self._right = None
        self._parent = None
        self._key = key
        self._red = False

    left = property(fget = lambda self:self._left, doc = "The node's left child")
    right = property(fget = lambda self:self._right,doc = "The node's right child")
    parent = property(fget = lambda self:self._parent,doc = "The node's parent")
    key = property(fget = lambda self:self._key,doc = "The node's key")
    red = property(fget = lambda self:self._red, doc = "Boolean to check if the node is red or not")

class tree(object):
    """Self balancing binary search tree (red black)"""

    def __init__(self,create_node = node):
        self._nil = create_node(key = None)
        self._root = self.nil
        self._create_node = create_node

    root = property(fget = lambda self: self._root,doc = "The tree's root node")
    nil = property(fget = lambda self: self._nil, doc = "The tree's nil node")

    def insert_key(self,key):
        """Inserts a new node into the tree"""
        self.insert_node(self._create_node(key=key))

    def insert_node(self,newnode):
        """Inserts a new node into the tree and call the rebalance method"""
        y = self.nil
        x = self.root

        while x != self.nil:
            y = x

            if newnode.key < x.key:
                x = x.left

            else:
                x = x.right

        newnode._parent = y

        if y == self.nil:
            self._root = newnode

        elif newnode.key < y.key:
            y._left = newnode

        else:
            y._right = newnode

        newnode._left = self.nil
        newnode._right = self.nil
        newnode._red = True

        self._insert_fixup(newnode)

    def _insert_fixup(self,newnode):
        """Rebalance the tree after insertion of a new node"""
        while newnode.parent.red:
            if newnode.parent == newnode.parent.parent.left:

                y = newnode.parent.parent.right

                if y.red:
                    newnode.parent._red = False
                    y._red = False
                    newnode.parent.parent._red = True
                    newnode = newnode.parent.parent

                else:
                    if newnode == newnode.parent.right:
                        newnode = newnode.parent
                        self._left_rotate(newnode)

                    newnode.parent._red = False
                    newnode.parent.parent._red = True
                    self._right_rotate(newnode.parent.parent)
            else:

                y = newnode.parent.parent.left

                if y.red:
                    newnode.parent._red = False
                    y._red = False
                    newnode.parent.parent._red = True
                    newnode = newnode.parent.parent

                else:
                    if newnode == newnode.parent.left:
                        newnode = newnode.parent
                        self._right_rotate(newnode)

                    newnode.parent._red = False
                    newnode.parent.parent._red = True
                    self._left_rotate(newnode.parent.parent)

        self.root._red = False

    def _left_rotate(self,x):
        "Rotates the tree to the left based on a node x"
        y = x.right
        x._right = y.left

        if y.left != self.nil:
            y.left._parent = x

        y._parent = x.parent

        if x.parent == self.nil:
            self._root = y

        elif x == x.parent.left:
            x.parent._left = y

        else:
            x.parent._right = y

        y._left = x
        x._parent = y

    def _right_rotate(self,y):
        "Rotates the tree to the right based on a node y"
        x = y.left
        y._left = x.right

        if x.right != self.nil:
            x.right._parent = y

        x._parent = y.parent

        if y.parent == self.nil:
            self._root = x

        elif y == y.parent._right:
            y.parent._right = x

        else:
            y.parent._left = x

        x._right = y
        y._parent = x

    def search(self,targetkey,current,parent):
        """Recursive subtree search starting from current node"""
        
        if current == self.nil:
            return (parent.parent.key,parent.parent.right.key,parent.parent.left.key)
        
        elif current.key == targetkey:
            return (current.key,)

        elif current.key > targetkey:
            return self.search(targetkey,current.left,current)

        else:
            return self.search(targetkey,current.right,current)

Here is a little test program the input file is the .doc I pointed out in my previous post above.

from rbmodule import tree

dic_tree = tree()

print "Loading file..."

for line in open("en_US.dic","r"):
     newn = dic_tree.insert_key(line[:line.find("/")])

while(True):
     userinput = raw_input("Enter a word to check: ")
	
     search = dic_tree.search(userinput,dic_tree.root,None)
	
     if str(search[0]) == userinput:
          print "correct"
     else:
          print search

Another alternative to using a Red Black Tree to decrease the complication and ensure a balanced BST would have been to read the sorted file into a list and from there have a function that divides the array into sub problems and for each one takes the middle element, least as left and right child as the greatest element and build all subtrees in to one balanced BST. Something like this(not tested, wrote it here directly):

def list_to_bst(words,start,end):
    if start > end:
        return None

    else:
        mid = (start+end)/2

        element = node(words[mid])
        element._left = list_to_bst(words, start, (mid-1))
        element._right = list_to_bst(words, (mid+1), end)

    return element

All this as long as we are sure that there wont be more additions after the dictionary has loaded.

I realized while making this post that the search function is not very accurate in corrections if the target node is not a leaf or at least one element before a leaf, since the base case is for hitting a Nil node. I will work on enhancing it later today maybe.

Yet the use of BST for a spell checker is not really the best option, one reason is that nodes are compared according to their keys instead part of their associated records. Which makes a Trie structure much more suitable for this kind of task.

Here is what I have for now I'll see what I can add later on.

Eentropy · Nov 17, 2012

While I don't have time to code something up at the moment, the following essay by Peter Norvig is pretty informative on the topic.

Zzalabata · Jun 19, 2014

Hello, i wanted to share this code.

This Code was my first Perl program.

#!user/bin/perl
# Strict and Warnings are used to debug (figure out what else could be wrong)
# Use Text Fuzzy to calculate string distance
# Use Scalar::Util qw(looks_like_number) to check numeric
use strict;
use warnings;
use Getopt::Long;
use Text::Fuzzy;
use Scalar::Util qw(looks_like_number);





my $max = 2;

GetOptions (
    "max=i" => \$max,
);

	print "Welcome, enter the sentence\n ";
	
	my $sentence;
	my @dictionary;
	my @words;
	my @need_to_be_corrected;
	my %hash_mistack;
	$sentence = <STDIN>;
	chomp $sentence;
	
	#read Dictionary
	@dictionary = readDictionary();
	
	@words = split(/[. ]+/,$sentence);	
	
	foreach my $wrd (@words) {
		$wrd = lc($wrd); 
		#if the word in dictionary
		if(!grep $_ eq $wrd, @dictionary)
		{
			#check if numeric
			if(looks_like_number($wrd)==0)
			{
			  #escape Punct
			  $wrd =~ s/[[:punct:]]//g;	
			  push(@need_to_be_corrected,$wrd);
			}
		}			

	}
	
	%hash_mistack = map { $_ => "" } @need_to_be_corrected;
	
	foreach my $mistake (sort keys %hash_mistack) {
		
	       	
		for(my $index=0;$index<=$#dictionary;$index++){
			#create fuzzy string with max
			my $tf = Text::Fuzzy->new ($mistake, max => $max);
			#calculate distance between the dictionary word and the mistake		
			my $dist = $tf->distance ($dictionary[$index]);
			
			if ($dist==1) {
				#Add the nearest Distance to the hash
				$hash_mistack{$mistake} .= $dictionary[$index].", ";
				
			}
	        }

	}

#Display 
foreach my $mistake(sort keys %hash_mistack)
{
	print $mistake.": ".$hash_mistack{$mistake}."\n\n";		
	
}


sub readDictionary {
	my @array;
	open(my $fh, "<", "dictionary.txt")
		or die "Failed to open file: $!\n";
	while(<$fh>) { 
			chomp; 
			push @array, $_;
		} 
	close $fh;
	return @array;
}