• Coding
  • [Exercise] Alien Language Pattern Matching

Hello, I was going through some exercises on Google Code Jam and one nice exercise was 2009 qualification round problem A. Not a hard problem but interesting to solve.
Code Jam 2009 - Qualification Problem A wrote Problem

After years of study, scientists at Google Labs have discovered an alien language transmitted from a faraway planet. The alien language is very unique in that every word consists of exactly L lowercase letters. Also, there are exactly D words in this language.

Once the dictionary of all the words in the alien language was built, the next breakthrough was to discover that the aliens have been transmitting messages to Earth for the past decade. Unfortunately, these signals are weakened due to the distance between our two planets and some of the words may be misinterpreted. In order to help them decipher these messages, the scientists have asked you to devise an algorithm that will determine the number of possible interpretations for a given pattern.

A pattern consists of exactly L tokens. Each token is either a single lowercase letter (the scientists are very sure that this is the letter) or a group of unique lowercase letters surrounded by parenthesis ( and ). For example: (ab)d(dc) means the first letter is either a or b, the second letter is definitely d and the last letter is either d or c. Therefore, the pattern (ab)d(dc) can stand for either one of these 4 possibilities: add, adc, bdd, bdc.
Code Jam 2009 - Qualification Problem A wrote Input
The first line of input contains 3 integers, L, D and N separated by a space. D lines follow, each containing one word of length L. These are the words that are known to exist in the alien language. N test cases then follow, each on its own line and each consisting of a pattern as described above. You may assume that all known words provided are unique.
Code Jam 2009 - Qualification Problem A wrote Output
For each test case, output

Case #X: K
where X is the test case number, starting from 1, and K indicates how many words in the alien language match the pattern.
Sample Input/Output:
Input 
 
3 5 4
abc
bca
dac
dbc
cba
(ab)(bc)(ca)
abc
(abc)(abc)(abc)
(zyx)bc

Output:

Case #1: 2
Case #2: 1
Case #3: 3
Case #4: 0
My Python solution coming up in a few minutes. Would love to see how you would solve this in your favorite language.
Here is my python solution works very well on the large dataset on the site.
# Parses each string patterns to list of strings and lists(options)
def parse_patterns(patterns):
    lines = []

    def getclosing(i,pattern):
        for x in xrange(i,len(pattern)):
            if pattern[x] == ")":
                return x
        return -1

    for pattern in patterns:
        line = []
        i = 0
        
        while i < len(pattern): 
            if pattern[i] == "(":
                j = getclosing(i,pattern)
                
                if j > i and j-i > 2:
                    line.append(list(pattern[i+1:j]))
                    i = j

            elif pattern[i] != ")":
                line.append(pattern[i])
            i += 1

        lines.append(line)

    return lines

# Check if a string matches a pattern represented as a list
def checkmatch(word, pattern):
    for i,el in enumerate(word):
        return !(type(pattern[i]) == str and el != pattern[i]) or (type(pattern[i])==list and el not in pattern[i]):
    return True

f = open('A-large-practice.in')
lines = f.readlines()
f.close()

L, D, N = map(int,lines[0].rstrip('\n').split(' '))

# Get list of all alien words
words = [w.rstrip('\n') for w in lines[1:D+1]]

# Get list of patters and parse them
p_patterns = parse_patterns([w.rstrip('\n') for w in lines[(D+1):(D+N+1)]])

# Keep track of number of matches for each pattern here
matches = [0]*len(p_patterns)

# Check every word for matching with any pattern
for word in words:
    for i, pattern in enumerate(p_patterns):
        if checkmatch(word,pattern):
            matches[i] += 1

# Printe results
for i in xrange(0,len(matches)):
    print "Case #{0}: {1}".format(i+1,matches[i])
The idea here is to parse the input into different lists. For the patterns list which is a list of strings parse it into a python list of lists and strings to be able to deal with the pattern more easily. The list of chars inside the list would denote an or between the characters.

Example:
"(ab)(bc)(ca)" converts to [["a","b"],["b","c"],["c","a"]]
"abc" converts to ["a","b","c"]
"(zyx)bc" converts to [["z","y","x"],"b","c"]
This way matching a word against a pattern would become trivial. traverse the word on character at a time and compare against the elements of the list. If the current index (e.g. char a in word) is against a nested list then a must be an element of that list to continue the match. If the character is against another character then they must me equal for the pattern to continue matching.
a month later
So I finally found some time to work on this last night. This is a nice problem and my solution is actually the same as yours:

1. Read the words into a list
2. Parse the patterns into a list of pattern, where each pattern is a list of tokens,
and a token can either be empty, a character, or a "choice" between several characters

3. To match a word against a pattern, make sure their lengths match,
then for each letter in the word, see if it matches the corresponding token in that word.

4. To match a letter against a token, if the token is empty there's no match, if it's a character, then just compare them,
and if it's a choice then see if the character is one of the options for that choice.

5. For each pattern in the list of patterns, match each word in the list of words against it and count the number of matches.

Of course, I resorted to Haskell. I don't think my code is clean or even efficient, but I'm working on getting my skill level up to writing "idiomatic" Haskell.
import System.IO
import Control.Monad

data Token  = Empty
            | Char Char
            | Choice String
            deriving (Show)

type Pattern = [Token]

main :: IO ()
main = do
    input   <- openFile "A-large-practice.in" ReadMode
    line    <- hGetLine input

    let (l, d, n) = inputConstraints line
       
    ws      <- replicateM d (hGetLine input)
    pStrs   <- replicateM n (hGetLine input)

    let ps          = patterns pStrs
        solutions   = zip [1..] $ solve ps ws

    forM_ solutions $ \(index, count) ->
        putStrLn $ "Case #" ++ show index ++ ": " ++ show count

    hClose input
    return ()

solve :: [Pattern] -> [String] -> [Int]
solve ps ws = map countMatches ps where
    countMatches p = length $ filter (`matches` p) ws

matches :: String -> Pattern -> Bool
matches word pattern
    | length word /= length pattern = False
    | otherwise = foldr match True $ zip word pattern
    
    where
        match (c, token) value
            | not value = False
            | otherwise = case token of
                Empty       -> False
                Char x      -> c == x
                Choice cs   -> c `elem` cs

patterns :: [String] -> [Pattern]
patterns = foldr parsePattern [] where
    parsePattern s zs = case pattern s of
                            Just p  -> p:zs
                            Nothing -> zs

pattern :: String -> Maybe Pattern
pattern s = case parseToken s of
    (Just Empty, _) -> Just []
    (Just t, rest)  -> case pattern rest of
                        Just ts -> Just (t:ts)
                        _       -> Nothing
    _               -> Nothing

parseToken :: String -> (Maybe Token, String)
parseToken ""   = (Just Empty, "")
parseToken s@(c:cs)
    | c == '('  = parseChoice s
    | otherwise = (Just $ Char c, cs)

parseChoice :: String -> (Maybe Token, String)
parseChoice ('(':cs) = tryChoice cs (Just $ Choice []) False where

    tryChoice [] value isClosed
        | isClosed  = (value, [])
        | otherwise = (Nothing, [])

    tryChoice s@(x:xs) value _
        | x == '('  = (Nothing, s)
        | x == ')'  = (value, xs)
        | otherwise = case value of
            Just (Choice ps) -> let newValue = (Just $ Choice (ps ++ [x]))
                                in  tryChoice xs newValue False

            _                -> (Nothing, s)

parseChoice s = (Nothing, s)

inputConstraints :: String -> (Int, Int, Int)
inputConstraints line = let xs = (map read $ words line) :: [Int]
                        in (head xs, xs !! 1, xs !! 2)
I'll try to improve my solution over time so I put it on my github at: https://github.com/saeidw/alien-lang
Nice Haskell work saeidw thanks for the contribution!
a month later
Well, I could definitely write a much much cleaner solution right now, but I'd like to 1-UP everyone here by having solved it 4 year ago(I'm at 2677). There's a link to download the solution.

Edit: for the lazy. Below is the code:
#!/usr/bin/python

import psyco
psyco.full()

import sys

def build_dict(words):
    d = {}
    for word in words:
        cur_dict = d
        for l in word:
            if l not in cur_dict:
                cur_dict[l] = {}
            cur_dict = cur_dict[l]
        cur_dict['word'] = word
    return d

def word_in_dict(word, dict):
    if len(word) == 0:
        if 'word' in dict:
            return 1
        return 0
    if word[0] != '(':
        if word[0] not in dict:
            return 0
        d = dict[word[0]]
        word = word[1:]
        return word_in_dict(word, d)
    last = word.index(')')
    possible = word[1:last]
    num = 0
    w = word[last+1:]
    for p in possible:
        if p in dict:
            d = dict[p]
            num += word_in_dict(w, d)
    return num

def main():
    L, D, N = map(int, raw_input().split())
    words = set()
    for i in range(D):
        words.add(raw_input().strip())
    sys.stderr.write("building\n")
    d = build_dict(words)
    sys.stderr.write("built\n")
    for i in range(1, N+1):
        word = raw_input().strip()
        n = word_in_dict(word, d)
        print "Case #%d: %d" % (i, n)
        sys.stderr.write(str(i) + "\n")

if __name__ == "__main__":
    main()
Edit 2: I notice that my solution is actually quite different from yours(and a bit simpler and shorter in my opinion). So I'll explain what I did.

First, I build a trie with the dictionary words. Then, with each given word, I recurse down the trie, trying all branches when there are multiple possible letters. I then sum the number of words found from each branch and that's the result.
That's an elegant solution, raja. I never would have considered using a trie, but it fits perfectly. I'm going to have to try using tries in different situations now!
from string import maketrans
from re import findall

f = open('A-large-practice.in')
lines = f.readlines()
f.close()

L, D, N = map(int,lines[0].rstrip('\n').split(' '))
words = [i.rstrip('\n') for i in lines[1:D+1]]
patterns = [i.rstrip('\n') for i in lines[D+1:D+N+1]]

all = "".join(i for i in words)

for i, line in enumerate(patterns):
	trans = line.translate(maketrans('()', '[]'))
	print 'Case #%d: %d' % (i, len(findall(trans, all)))
The idea is simple, converting parentheses to square brackets because [abc] => [a|b|c] which means any character from a,b or c. Then, we can apply that regex pattern and find all possibilities.
@m0ei I ran your code and it gave wrong answers.

The first wrong answer is in case #1. The correct answer is 1, yours is 3226. Did you try submitting your output to check for correctness? Maybe I'm simply running it wrong or have a different input file.
m0ei wins!
raja says no!
But the spirit of the solution is elegant.
While I agree that this is by far the most elegant way to do it(just run it through a regexp engine and let it do all the heavy lifting) he should get it to run properly first(give the correct answer) before we can say he wins :P

Edit: I just read the source code properly and the problem is that he's joining all the dict into one big line. So you can have a pattern that matches the end of a word and the start of the next and it will get counted by his algo when it shouldn't be.

Edit 2: Another error is his cases start at Case #0 instead of Case #1.

Below is a corrected version of the solution(I checked its output and it's correct now):

from string import maketrans
from re import match

f = open('A-large-practice.in')
lines = f.readlines()
f.close()

L, D, N = map(int,lines[0].rstrip('\n').split(' '))
words = [i.rstrip('\n') for i in lines[1:D+1]]
patterns = [i.rstrip('\n') for i in lines[D+1:D+N+1]]

for i, line in enumerate(patterns):
	trans = line.translate(maketrans('()', '[]'))
        print 'Case #%d: %d' % (i+1, len(filter(lambda w:match(trans,w) is not None, words)))

PS: This runs 3x slower than the original code(clocks in at 4.8 seconds vs 1.4 for incorrect version). Probably because it runs the regex on every single string from the dict instead of running it once on all of them concatenated into one big string
@raja:
from string import maketrans
from re import findall

f = open('A-large-practice.in')
lines = f.readlines()
f.close()

L, D, N = map(int,lines[0].rstrip('\n').split(' '))
words = [i.rstrip('\n') for i in lines[1:D+1]]
patterns = [i.rstrip('\n') for i in lines[D+1:D+N+1]]

all = " ".join(i for i in words)

for i, line in enumerate(patterns):
    trans = line.translate(maketrans('()', '[]'))
    print 'Case #%d: %d' % (i+1, len(findall(trans, all)))
I didn't pay attention to the missing space between the words, I added it now. The solution works fine now.

Updated.
You could just do " ".join(....) to get a space between the words.
Not sure how I missed it, thanks anyway :)