[Exercise] String Indexer

xterm · Feb 8, 2012

arithma wroteraja's interpretation and solution seem to be correct. That's how I intended the problem to be.
However this solution is correct, but I am not sure it is optimized. xterm, yours is definitely wrong. We can either use raja's solution as a reference or prove it wrong (which I doubt).

Then I take myself out of this exercise because i clearly cannot seem to understand the requirements. How is it possible that given an array of 10 strings of 33 characters where 4 of the strings diverge into different character sequence on the 15th character would result in merely 4 single character compares is beyond me.

xterm · Feb 8, 2012

rahmu wroteAnyone can brute force. I don't mean this negatively. I'm trying to make sense of the exercise. It is my understanding that arithma wanted to answer the following question:

Is there anything clever you can do to avoid bruteforcing the solution?.

I made it quite clear that this my code is a mere bruteforcing solution, not an optimized one. As i understood the exercise is "Calculate the least amount of single character compares needed to find the string".

The reason i wrote the brute force is to have a comparison measure when i write a better solution.

Rraja · Feb 8, 2012

xterm, this is my attempt at an explanation:

First consider the String Finder(tm) machine. This machine has a list of strings in its internal memory, and when you input into it a string, it will tell you where, in its memory, this input string resides.

Now, you are a maker of String Finder(tm) machines. When customers contact you, they give you the strings they want to have in your machine. Your job is to provide them with the machine that will be as fast as possible, given their spec.

Arithma's question is: A customer has contacted you and given you a set of strings. You build for them a String Finder(tm) machine. Given that a single character comparison takes 1s, how much time would the customer have to wait, in the worst case, after they input a string? You are allowed to take as much time as you want when building the String Finder(tm), however, it is not allowed to be slow when the customer uses it.

arithma · Feb 8, 2012

xterm wroteI presume there is limitation as well on working on the set prior to starting the character checks.

On the following input:

input = ["fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD",
"I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ",
"lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u",
"rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN",
"lGOmu7loH80DbOyfb6NuuWcEP81D7J5r",
"tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2",
"lGOmu7loH80DOyncyYpajmdnNIGH6HSu",
"zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B",
"JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6",
"lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5",]
str = "lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5"
Your code gave an output of 4 compares, which would only mean that you might be acting too much on the given set prior to doing the compares. Whether this is allowed, is really beyond my understanding, we really need to get more requirements info from arithma.

My code gave 83 character compares with a straight up brute force algorithm.

I am a little skeptical about the ability to find an index using only 4 compares, so this might be worth checking.
What I meant earlier is that raja's interpretation is spot on, and after looking at the code, it does seem to capture the right way. I would guess it should fail to halt anytime soon on the above dataset, but again, this is all hunches. It may as well halt with 4.

xterm · Feb 8, 2012

Give me a moment while i explain the differences between raja's implementation and mine (which is insanely subtle)

arithma · Feb 8, 2012

xterm wrote
arithma wroteraja's interpretation and solution seem to be correct. That's how I intended the problem to be.
However this solution is correct, but I am not sure it is optimized. xterm, yours is definitely wrong. We can either use raja's solution as a reference or prove it wrong (which I doubt).
Then I take myself out of this exercise because i clearly cannot seem to understand the requirements. How is it possible that given an array of 10 strings of 33 characters where 4 of the strings diverge into different character sequence on the 15th character would result in merely 4 single character compares is beyond me.

You must understand this problem does not admit approximate solutions (since the output is just a single integer, not an integer and a machine and lax requirements). How it returned 4 is beyond me, but it could be possible.

Joe · Feb 8, 2012

Well, correct me if I'm wrong, but 4 comparisons wouldn't be enough to compare one char from each string, even taking into account common chars.

arithma · Feb 8, 2012

Four comparisons can ideally split a set into 16 (2^4) buckets. So it is possible. We're talking comparisons that are dependent on each other. If I get an "equal" on the first comparison it means I pruned the data set in a particular way..

Rraja · Feb 8, 2012

I'm instrumenting my code to give the sequence of comparisons that are necessary, this will take a bit but I'll have an answer on how the hell to do it with 4(or find potential bugs) :)

xterm · Feb 8, 2012

rahmu wroteWell, correct me if I'm wrong, but 4 comparisons wouldn't be enough to compare one char from each string, even taking into account common chars.

It is, but what i'm trying to point out, is that acting on the given set is also a time factor. The questionable logic at hand is clearly the pruning.

strings = prune(strings)

which invokes

def prune(strings):
    """
    Removes from the given strings all the parts common to all strings
    """
    result = [""] * len(strings)
    for i in range(len(strings[0])):
        if len(set([a[i] for a in strings])) > 1:
            for j, s in enumerate(strings):
                result[j] += s[i]
    return result

There is heavy work at stake here, which is the same factor as the single character comparison. That for some reason is allowed based on your requirements it seems.

The following input will output 2 in raja's code:

XXXXX0XXX
XXXXX00XX
XXXXX000X
XXXXX0000

for string XXXXX0000

How did we get that 2?

prune the strings:

XXXXX0XXX
XXXXX00XX
XXXXX000X
XXXXX0000

you're left with

XXX
0XX
00X
000

the only two that are valid compares based on his code are

00X
000

you compare twice before getting the index.

Do i make sense?

Rraja · Feb 8, 2012

Indeed, and why would the pruning not be allowed? Understand, the pruning isn't done at run-time by the machine, it is done at build-time by you, that is the distinction. So basically, given my explanation before, if a customer hands you strings that are all alike in some parts, your machine will ignore those and never compare to them, that seems logical.

Rraja · Feb 8, 2012

Okay, as promised, here's the algorithm to do it in 4 comparisons:

First, compare the input against the 5th letter of 'fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD'
Match:
This problem is reduced to max_compares of:
['fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD', 'tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2',
'zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B']
Compare the input repeatedly against the first letter of those in turn
if it matches the first:
you're done at two comparisons,
Otherwise
Compare against the first letter of the second, you're done at 3 comparisons either way
Doesn't match:
The problem is reduced to max_compares of:
['JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6', 'rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN',
'I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ', 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u',
'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r', 'lGOmu7loH80DOyncyYpajmdnNIGH6HSu',
'lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5']
Now, compare against the 11th letter of 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u' (that's 0).
Match:
Reduced to: ['lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u',
'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r',
'lGOmu7loH80DOyncyYpajmdnNIGH6HSu',
'lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5']
Hit the 28th letter of 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u' (that is the last 'H')
Match:
Reduced to:
['lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u', 'lGOmu7loH80DOyncyYpajmdnNIGH6HSu']
One more comparison and done at 4 compares
No match:
Reduced to:
['lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5', 'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r']
One more comparison and done at 4 compares
No Match:
Reduced to:
['rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN', 'I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ',
'JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6']
Compare in turn against the first letter of each, needs 2 compares to locate the string
Done at 4 compares

arithma · Feb 8, 2012

Awesome. Does it take long to figure this out? I am amazed how the exponentiality doesn't peek its head in this input.

xterm · Feb 8, 2012

I could've saved you the trouble since I did the changes manually to the original input. Sorry.

xterm · Feb 8, 2012

arithma wroteAwesome. Does it take long to figure this out? I am amazed how the exponentiality doesn't peek its head in this input.

That's because of the way the input was created. I generated 10 strings, copied part of one onto 4 others and just modified 3 or 4 characters.

arithma · Feb 8, 2012

Oh, so pruning saved raja's ass on this one. Cute.

arithma · Feb 10, 2012

@raja: I found a loophole, it seems your algorithm is still not yet complete:

max_comparisons(["aaa", "aab", "aba", "abb", "bcc", "ccd", "ddc", "edd"])
4

It should return 3 instead.
Consider comparing against a in the first index, then compare against the second and then third index accordingly. Try it on paper, you'll see what I mean.

Rraja · Feb 10, 2012

Right you are, sir.

I have no clue what I was thinking when I wrote that piece of code. Here is a much clearer and more correct version(having worked out more examples by hand I have a much better feel for how this algo should go now):

def prune(strings):
    """
    Removes from the given strings all the parts common to all strings
    """
    result = [""] * len(strings)
    for e in strings:
        break
    
    for i in range(len(e)):
        if len(set([a[i] for a in strings])) > 1:
            for j, s in enumerate(strings):
                result[j] += s[i]
    return set(result)

def classify(strings, n):
    """
    given an array of strings gives a map of nth letter => array of strings
    """
    result = {}
    for s in strings:
        if s[n] not in result:
            result[s[n]] = set()
        result[s[n]].add(s)
    return result

cache = {}
def max_comparisons(strings):
    l = len(strings)
    if l <= 3:
        return l-1
    
    key = tuple(sorted(strings))
    if key in cache:
        return cache[key]

    strings = prune(strings)

    for e in strings: # Hack to retrieve a single element from set
        break
    if len(e) == 1:
        return len(strings)-1
    
    
    m = None
    for i in range(len(e)):
        groups = classify(strings, i).values()
        for g in groups:
            left_side = max_comparisons(g)
            right_side = max_comparisons(strings-g)
            tmp = max(left_side, right_side) + 1
            if m is None or m > tmp:
                m = tmp
    cache[key] = m
    return m

input =   ["fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD",
             "I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ",
             "lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u",
             "rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN",
             "lGOmu7loH80DbOyfb6NuuWcEP81D7J5r",
             "tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2",
             "lGOmu7loH80DOyncyYpajmdnNIGH6HSu",
             "zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B",
             "JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6",
             "lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5"]
print max_comparisons(input) # prints 4

input = ["aaa", "aab", "aba", "abb", "bcc", "ccd", "ddc", "edd"]
print max_comparisons(input) # prints 3

PS: The cache is necessary otherwise the smaller input takes ~3s and the larger one doesn't complete in less than 2min(I cut it off at that point). With the cache the small input finishes instantly(in less than 30ms, which is mostly taken up by the interpreter startup time) and the large one finishes in ~900ms.

The if len(strings) <= 3 is just a simplification(brings down the execution time of the large one from 1.1s to 900ms) and it is a safe assumption since:
1 string: no comparisons needed
2 strings: always 1 comparison needed(since they are unique)
3 strings: always 2 comparisons needed(a comparison will split the set into a part with 1 string and a part with 2)
If we have 4 or more strings it becomes impossible to simplify(4 strings may require 3 or 2 comparisons)

arithma · Feb 11, 2012

You don't need that classify stuff anymore it seems. I did understand the line of thought that led you to believe you needed it (in the previous algorithm). It was such a huge pain to create a challenge for it.

Rraja · Feb 11, 2012

Kudos on proving me wrong. I had a simplification that was wrong.

Classifying is still helpful though in that checking a single string of the same class is the same as checking the others, so it's an optimization. Furthermore, when I check against something from one class it is the strings from that class that are kept in case of a match and all the others that are kept otherwise.