I believe this set of problems to be more ambitious than anything we've done here. This is why I would encourage collaboration between members. If you have an idea, even if it's partially relevant, do not hesitate to share it.
According to James Hague, a spell checker used to be a "major feat of software engineering" some 30 years ago. Today's exercise will explore the problem in 2012.
Note on dictionaries
For the purpose of this exercise, we will need a dictionary, a file containing a (more or less) exhaustive list of all the words making up a language (let's stick with English for now).
If you're on Unix (Linux, MacOSX, ...), your OS is already providing you with a file like that. Its location varies according to the distribution, so check with your vendor documentation (or with Google) to find the one on your system. On Solaris it's at /usr/share/lib/dict/words.
If you're on other systems, you might have to check with your vendor documentation or find a dict file on the web.
In any case, let me know if you cannot find the file, I'll upload one to a share host immediately.
Problem 1: Easy exercise
Write a function that takes a text as an input, and returns a list of misspelled words. A word is considered misspelled if it is not found inside the dictionary.
Extra credit:
Note that you're responsible of managing the behavior of your program when dealing with upper/lowercase letters. It is tempting to ignore the issue altogether by making the search completely case-insensitive. However here are a few examples of mistakes that would be missed:
Problem 2: Suggestions
As a follow up to the first problem, modify the function so that it returns several suggestions to replace the mistake detected. For instance, "hosn" could return ("horn", "hose", "horse"). I'll leave it up to you to define what constitutes a valid suggestion (pay special attention to the fact that the mistake might happen at any letter of the word, even the first one).
Extra Credit:
You must determine the suggestion "most likely to be correct". Again, the definition of "most likely" is very loose and up to you to define it. One idea could be for instance the percentage of mistaken letters ("hosn" is more likely to be "horn" than "horse"). An other idea would be to determine it using statistics. We could determine, for instance, that the mistake "s->r" is more likely to occurr than "n->e" therefore "horn" is more likely than "hose". We could add many factors into account, like user-specific habits and/or the layout of a QWERTY keyboard, or whatever you think about.
Problem 3: Conjugating verbs
This is my favorite (actually the only one I like) feature of the Microsoft Office suite. Of course, realistically, we will consider the most simple case possible. The general solution to this problem is outside the scope of this exercise. (Note that some languages like French or German seem more complicated by a whole order of magnitude).
To the best of my knowledge, the English language has only one simple rule for conjugating verbs (excluding to be and to have auxiliaries): In the present tense, we append a "s" to any verb used in the 3rd singular person.
Note that "appending an s" follows a more general rule in the language, with special behaviours for different word endings. For instance a verb ending with a "y" or an existing "s" will behave differently. Feel free to ignore these corner cases and focus on simpler cases like the verbs "run" or "eat".
The goal of this exercise is to detect the subject of the sentence and verify if the verb is properly conjugated.
Disclaimer: I honestly have no idea of how difficult this last problem really is.
According to James Hague, a spell checker used to be a "major feat of software engineering" some 30 years ago. Today's exercise will explore the problem in 2012.
Note on dictionaries
For the purpose of this exercise, we will need a dictionary, a file containing a (more or less) exhaustive list of all the words making up a language (let's stick with English for now).
If you're on Unix (Linux, MacOSX, ...), your OS is already providing you with a file like that. Its location varies according to the distribution, so check with your vendor documentation (or with Google) to find the one on your system. On Solaris it's at /usr/share/lib/dict/words.
If you're on other systems, you might have to check with your vendor documentation or find a dict file on the web.
In any case, let me know if you cannot find the file, I'll upload one to a share host immediately.
Problem 1: Easy exercise
Write a function that takes a text as an input, and returns a list of misspelled words. A word is considered misspelled if it is not found inside the dictionary.
Extra credit:
Note that you're responsible of managing the behavior of your program when dealing with upper/lowercase letters. It is tempting to ignore the issue altogether by making the search completely case-insensitive. However here are a few examples of mistakes that would be missed:
- i am living in paris for the moment.
- What do yoU thInk Of tHis exErcise
Problem 2: Suggestions
As a follow up to the first problem, modify the function so that it returns several suggestions to replace the mistake detected. For instance, "hosn" could return ("horn", "hose", "horse"). I'll leave it up to you to define what constitutes a valid suggestion (pay special attention to the fact that the mistake might happen at any letter of the word, even the first one).
Extra Credit:
You must determine the suggestion "most likely to be correct". Again, the definition of "most likely" is very loose and up to you to define it. One idea could be for instance the percentage of mistaken letters ("hosn" is more likely to be "horn" than "horse"). An other idea would be to determine it using statistics. We could determine, for instance, that the mistake "s->r" is more likely to occurr than "n->e" therefore "horn" is more likely than "hose". We could add many factors into account, like user-specific habits and/or the layout of a QWERTY keyboard, or whatever you think about.
Problem 3: Conjugating verbs
This is my favorite (actually the only one I like) feature of the Microsoft Office suite. Of course, realistically, we will consider the most simple case possible. The general solution to this problem is outside the scope of this exercise. (Note that some languages like French or German seem more complicated by a whole order of magnitude).
To the best of my knowledge, the English language has only one simple rule for conjugating verbs (excluding to be and to have auxiliaries): In the present tense, we append a "s" to any verb used in the 3rd singular person.
Note that "appending an s" follows a more general rule in the language, with special behaviours for different word endings. For instance a verb ending with a "y" or an existing "s" will behave differently. Feel free to ignore these corner cases and focus on simpler cases like the verbs "run" or "eat".
The goal of this exercise is to detect the subject of the sentence and verify if the verb is properly conjugated.
Disclaimer: I honestly have no idea of how difficult this last problem really is.