[Exercise] String Indexer

arithma · Feb 3, 2012

Given a set of unique strings, all of the same size, determine the minimum number of compares needed to identify an input of the same size to one of the strings in the given set, and the string is always one of the given strings.

Cleaner Restatement
For a set of unique strings, all of the same size, there exists a machine that takes a string input, and only by doing single character equality comparisons will determine an index that identifies one of those strings.
Given a set of such unique strings, determine the minimum number of comparisons needed by such a machine to identify any of those strings.
Further Clarification
Please do note that the comparison constraint applies only to the mentioned machine and not to the algorithm itself. You may compare as you please within the algorithm.

Example 1
Set: ["a", "b"] (two strings as the given set)

Output: 1. (Only one comparison is needed to decided which of the two strings an input is equal to)

Example 2
Set: ["ab", "bc"]

Output: 1.

Example 3
Set: ["abc", "abd", "dbc"]

Output: 2.

Note: Doing the sample exercises is fun, but I am not sure whether this problem is actually hard or not. Worth the thought though.

P.S.: I just really, reaally hope that geek won't find it pretty to solve using Mathematica.

Jjsaade · Feb 4, 2012

I am not sure i understand it? It looks to me it will always be length-1

arithma · Feb 4, 2012

For 2^10 values having only bits, you would need 10 comparisons. For a subset of these 2^10 strings, I would imagine you need less comparisons.

arithma · Feb 5, 2012

I think I under defined the problem.

Comparisons are only equal operations to a single character at a single location that returns either true or false.

Has anyone found this exercise interesting?

Joe · Feb 5, 2012

I'm going to have to go with ZeRaW on this one. It seems that it's always going to be len(set) -1, unless I misunderstood the problem ... :-/

Jjsaade · Feb 5, 2012

@arithma, the character by character comparison made sense now.
I ll try to get involved in it tomorrow.

arithma · Feb 6, 2012

If it still doesn't make sense, I think this will help (I will merge all of this back to the original problem statement when I know what to say):

A machine is going to make a single equality comparison at a time on the input to determine the input's index in the original set. How many comparisons does an optimized machine need to do to guarantee its ability to output the index. Note that there is for every different set of strings a different machine altogether.

Example 4
Set: ["aa", "ab", "ba", "bb"]

Output: 2.
If the machine sees an "a" at index 0, then the input is either "aa", or "ab".

Example 5
Set: ["aa", "ab", "ba"]

Output: 2.

arithma · Feb 6, 2012

I updated the original post to include a cleaner problem statement.

Rraja · Feb 7, 2012

This is my solution. It runs in exponential time versus the length of the strings(though it does have some pruning, it is still exponential in the worst case).

If anyone needs explanation on how and why it works(or if you have counter examples) let me know :)

def prune(strings):
    """
    Removes from the given strings all the parts common to all strings
    """
    result = [""] * len(strings)
    for i in range(len(strings[0])):
        if len(set([a[i] for a in strings])) > 1:
            for j, s in enumerate(strings):
                result[j] += s[i]
    return result

def classify(strings, n):
    """
    given an array of strings gives a map of nth letter => array of strings
    """
    result = {}
    for s in strings:
        if s[n] not in result:
            result[s[n]] = []
        result[s[n]].append(s)
    return result

def max_comparisons(strings):
    if len(strings) == 1:
        return 0
    if len(strings[0]) == 1:
        return len(strings) - 1
    strings = prune(strings)
    m = None
    for i in range(len(strings[0])):
        groups = map(max_comparisons, classify(strings, i).values())
        groups.sort(reverse=True)
        groups[0] += 1
        for i in range(1, len(groups)):
            groups[i] += i
        if m is None:
            m = max(groups)
        m = min(m, max(groups))
    return m

xterm · Feb 7, 2012

this exercise seems somewhat similar to the super self insertion exercise.

Correct me if i'm wrong.

xterm · Feb 8, 2012

Simple quick solution. Test it here.

It's somewhat quadratic (On2).

Can we please define a given set of parameters so that we could all test against.

Perhaps:

input = ["bced","abed","abce","aceb","abcd"]
str = "abcd"

(Using quote to preserve indentation when copying to the console)

input = ["bced","abed","abce","aceb","abcd"]
str = "abcd"
checks = 0

kick = {s -> s.length()>1?s[1..-1]:null}
check = {s,c -> s[0] == c}

str.each { c ->
drop = []
input.eachWithIndex { s,i ->
print "$c : $s "
checks += 1
if(check(s,c)){
print "(caught)"
input = kick(s)
} else{
print "(not caught)"
drop << input
}
println()
}
drop.each { input.remove(it) }
}

println "Total single char checks : $checks"

Rraja · Feb 8, 2012

Hey there xterm, I ran your code(I'm assuming that's groovy) and it outputs "14". However, that is not correct. Given the input strings ["bced","abed","abce","aceb","abcd"] you can construct a machine which will uniquely identify one of them with 3 comparisons at most.

Here are the steps to uniquely identify a string from among those 5 strings.

compare the input string to the second letter of "bced"
If it matches:
We can eliminate "abed" "abce" and "abcd". Leaving us with 2 choices: "bced" and "aceb"
One further comparison is needed to distinguish those two, compare first letter of input with
first letter of "bced".
If it matches:
respond "bced"
Otherwise:
respond "aceb"
Total comparisons: 2
If it doesn't match:
We can eliminate "bced" and "aceb" leaving us with "abed" "abce" and "abcd"
We compare the input with the third letter of "abed".
If it matches:
respond "abed". Done, 2 comparisons
If it doesn't match:
We are left with "abce" and "abcd". Compare the last letter of the input with the last letter of "abce".
If it matches:
respond "abce". Done, 3 comparisons
Otherwise:
respond "abcd". Done, 3 comparisons

The answer is thus 3, since that is the number of comparisons an algorithm will need to make if it knows the set of all possible strings ahead of time and can devise a perfect search algorithm(which is how I understood this problem, though I'm open to correction)

PS: If my understanding of the problem is correct, it is quite obvious that the upper bound on the solution is the length of the input strings. Meaning if our input strings are of length 4, as in your example, that is the upper bound. Since you can, with each 1 comparison, eliminate all strings that are not equal at that position.

PPS: Given that observation, if one could devise an efficient solution to the following subproblem: "Is it possible to uniquely identify one of these strings in `n` comparisons?", it would be possible to find the solution via a binary search on the entire solution space which is 0=>length(strings[0])

xterm · Feb 8, 2012

My understanding is the total number of single character compares. My initial implementation mimicked a NxM grid that does a simple character by character "check and kick". I ammended the code with some debugging info for your consideration

(Yes this is groovy)

Seems like we still need more information arithma.

P.S.: Keep in mind raja that i simply opted for a quick solution, not the optimal one, but i presume you know that since i already mentioned that its quadratic.

Rraja · Feb 8, 2012

That is true, total number of single compares, but you're allowed to use your prior knowledge of the set of all possible strings(the input in your code). Given that knowledge, my explanation shows why your input should have an output of 3, no?

xterm · Feb 8, 2012

raja wroteThat is true, total number of single compares, but you're allowed to use your prior knowledge of the set of all possible strings(the input in your code). Given that knowledge, my explanation shows why your input should have an output of 3, no?

OH! You were giving a better algorithm, for some reason i thought you meant my code vs the output is incorrect. Yes, I'm pretty sure your pseudo is reasonable.

How many single character compares does your code gives on my input?

Rraja · Feb 8, 2012

My code outputs 3 compares. My code essentially analyzes the input set and determines the best algorithm(the one that minimizes compares) to uniquely identify any one of these strings. It then outputs the depth of the deepest comparison tree of this algorithm.

This is how I understood this problem. It would really be helpful if Arithma were to clear things up for us.

xterm · Feb 8, 2012

I presume there is limitation as well on working on the set prior to starting the character checks.

On the following input:

input = ["fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD",
"I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ",
"lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u",
"rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN",
"lGOmu7loH80DbOyfb6NuuWcEP81D7J5r",
"tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2",
"lGOmu7loH80DOyncyYpajmdnNIGH6HSu",
"zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B",
"JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6",
"lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5",]
str = "lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5"

Your code gave an output of 4 compares, which would only mean that you might be acting too much on the given set prior to doing the compares. Whether this is allowed, is really beyond my understanding, we really need to get more requirements info from arithma.

My code gave 83 character compares with a straight up brute force algorithm.

Rraja · Feb 8, 2012

I don't think there is a limitation actually. From Arithma's statement:

For a set of unique strings, all of the same size, there exists a machine that takes a string input, and only by doing single character equality comparisons will determine an index that identifies one of those strings

That seems to very strongly imply that said machine is tailor-made to that input.

Now, as for your example, my answer may not be correct, but yours definitely is too high. Given that set of inputs, all you would need to do is compare the last letter of every string, in turn, to your input string. Given that the strings in that set all have unique endings, this will take at most 9(len(set)-1) comparisons.

arithma · Feb 8, 2012

raja's interpretation and solution seem to be correct. That's how I intended the problem to be.
However this solution is correct, but I am not sure it is optimized. xterm, yours is definitely wrong. We can either use raja's solution as a reference or prove it wrong (which I doubt).

BTW, raja, check your Private Message Inbox.

Rraja · Feb 8, 2012

Yes, I'm not sure it's optimized either. In my PPS above I outlined a possibly better approach though I haven't given it much further thought(I might later).

As for the PM, check your email inbox.