[Exercise] String Indexer

arithma · February 8 2012

xterm wrote:

arithma wrote:
raja's interpretation and solution seem to be correct. That's how I intended the problem to be.
However this solution is correct, but I am not sure it is optimized. xterm, yours is definitely wrong. We can either use raja's solution as a reference or prove it wrong (which I doubt).
Then I take myself out of this exercise because i clearly cannot seem to understand the requirements. How is it possible that given an array of 10 strings of 33 characters where 4 of the strings diverge into different character sequence on the 15th character would result in merely 4 single character compares is beyond me.

You must understand this problem does not admit approximate solutions (since the output is just a single integer, not an integer and a machine and lax requirements). How it returned 4 is beyond me, but it could be possible.

Joe · February 8 2012

Well, correct me if I'm wrong, but 4 comparisons wouldn't be enough to compare one char from each string, even taking into account common chars.

arithma · February 8 2012

Four comparisons can ideally split a set into 16 (2^4) buckets. So it is possible. We're talking comparisons that are dependent on each other. If I get an "equal" on the first comparison it means I pruned the data set in a particular way..

raja · February 8 2012

I'm instrumenting my code to give the sequence of comparisons that are necessary, this will take a bit but I'll have an answer on how the hell to do it with 4(or find potential bugs) :)

Last edited by raja (February 8 2012)

xterm · February 8 2012

rahmu wrote:

Well, correct me if I'm wrong, but 4 comparisons wouldn't be enough to compare one char from each string, even taking into account common chars.

It is, but what i'm trying to point out, is that acting on the given set is also a time factor. The questionable logic at hand is clearly the pruning.

strings = prune(strings)

which invokes

def prune(strings):
    """
    Removes from the given strings all the parts common to all strings
    """
    result = [""] * len(strings)
    for i in range(len(strings[0])):
        if len(set([a[i] for a in strings])) > 1:
            for j, s in enumerate(strings):
                result[j] += s[i]
    return result

There is heavy work at stake here, which is the same factor as the single character comparison. That for some reason is allowed based on your requirements it seems.

The following input will output 2 in raja's code:

XXXXX0XXX
XXXXX00XX
XXXXX000X
XXXXX0000

for string XXXXX0000

How did we get that 2?

prune the strings:

XXXXX0XXX
XXXXX00XX
XXXXX000X
XXXXX0000

you're left with

XXX
0XX
00X
000

the only two that are valid compares based on his code are

00X
000

you compare twice before getting the index.

Do i make sense?

raja · February 8 2012

Indeed, and why would the pruning not be allowed? Understand, the pruning isn't done at run-time by the machine, it is done at build-time by you, that is the distinction. So basically, given my explanation before, if a customer hands you strings that are all alike in some parts, your machine will ignore those and never compare to them, that seems logical.

raja · February 8 2012

Okay, as promised, here's the algorithm to do it in 4 comparisons:

First, compare the input against the 5th letter of 'fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD'
Match:
This problem is reduced to max_compares of:
['fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD', 'tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2',
'zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B']
Compare the input repeatedly against the first letter of those in turn
if it matches the first:
you're done at two comparisons,
Otherwise
Compare against the first letter of the second, you're done at 3 comparisons either way
Doesn't match:
The problem is reduced to max_compares of:
['JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6', 'rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN',
'I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ', 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u',
'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r', 'lGOmu7loH80DOyncyYpajmdnNIGH6HSu',
'lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5']
Now, compare against the 11th letter of 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u' (that's 0).
Match:
Reduced to: ['lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u',
'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r',
'lGOmu7loH80DOyncyYpajmdnNIGH6HSu',
'lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5']
Hit the 28th letter of 'lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u' (that is the last 'H')
Match:
Reduced to:
['lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u', 'lGOmu7loH80DOyncyYpajmdnNIGH6HSu']
One more comparison and done at 4 compares
No match:
Reduced to:
['lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5', 'lGOmu7loH80DbOyfb6NuuWcEP81D7J5r']
One more comparison and done at 4 compares
No Match:
Reduced to:
['rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN', 'I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ',
'JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6']
Compare in turn against the first letter of each, needs 2 compares to locate the string
Done at 4 compares

Last edited by raja (February 8 2012)

arithma · February 8 2012

Awesome. Does it take long to figure this out? I am amazed how the exponentiality doesn't peek its head in this input.

xterm · February 8 2012

I could've saved you the trouble since I did the changes manually to the original input. Sorry.

xterm · February 8 2012

arithma wrote:

Awesome. Does it take long to figure this out? I am amazed how the exponentiality doesn't peek its head in this input.

That's because of the way the input was created. I generated 10 strings, copied part of one onto 4 others and just modified 3 or 4 characters.

arithma · February 8 2012

Oh, so pruning saved raja's ass on this one. Cute.

arithma · February 10 2012

@raja: I found a loophole, it seems your algorithm is still not yet complete:

max_comparisons(["aaa", "aab", "aba", "abb", "bcc", "ccd", "ddc", "edd"])
4

It should return 3 instead.
Consider comparing against a in the first index, then compare against the second and then third index accordingly. Try it on paper, you'll see what I mean.

raja · February 10 2012

Right you are, sir.

I have no clue what I was thinking when I wrote that piece of code. Here is a much clearer and more correct version(having worked out more examples by hand I have a much better feel for how this algo should go now):

def prune(strings):
    """
    Removes from the given strings all the parts common to all strings
    """
    result = [""] * len(strings)
    for e in strings:
        break
    
    for i in range(len(e)):
        if len(set([a[i] for a in strings])) > 1:
            for j, s in enumerate(strings):
                result[j] += s[i]
    return set(result)

def classify(strings, n):
    """
    given an array of strings gives a map of nth letter => array of strings
    """
    result = {}
    for s in strings:
        if s[n] not in result:
            result[s[n]] = set()
        result[s[n]].add(s)
    return result

cache = {}
def max_comparisons(strings):
    l = len(strings)
    if l <= 3:
        return l-1
    
    key = tuple(sorted(strings))
    if key in cache:
        return cache[key]

    strings = prune(strings)

    for e in strings: # Hack to retrieve a single element from set
        break
    if len(e) == 1:
        return len(strings)-1
    
    
    m = None
    for i in range(len(e)):
        groups = classify(strings, i).values()
        for g in groups:
            left_side = max_comparisons(g)
            right_side = max_comparisons(strings-g)
            tmp = max(left_side, right_side) + 1
            if m is None or m > tmp:
                m = tmp
    cache[key] = m
    return m

input =   ["fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD",
             "I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ",
             "lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u",
             "rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN",
             "lGOmu7loH80DbOyfb6NuuWcEP81D7J5r",
             "tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2",
             "lGOmu7loH80DOyncyYpajmdnNIGH6HSu",
             "zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B",
             "JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6",
             "lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5"]
print max_comparisons(input) # prints 4

input = ["aaa", "aab", "aba", "abb", "bcc", "ccd", "ddc", "edd"]
print max_comparisons(input) # prints 3

PS: The cache is necessary otherwise the smaller input takes ~3s and the larger one doesn't complete in less than 2min(I cut it off at that point). With the cache the small input finishes instantly(in less than 30ms, which is mostly taken up by the interpreter startup time) and the large one finishes in ~900ms.

The if len(strings) <= 3 is just a simplification(brings down the execution time of the large one from 1.1s to 900ms) and it is a safe assumption since:
1 string: no comparisons needed
2 strings: always 1 comparison needed(since they are unique)
3 strings: always 2 comparisons needed(a comparison will split the set into a part with 1 string and a part with 2)
If we have 4 or more strings it becomes impossible to simplify(4 strings may require 3 or 2 comparisons)

Last edited by raja (February 10 2012)

arithma · February 11 2012

You don't need that classify stuff anymore it seems. I did understand the line of thought that led you to believe you needed it (in the previous algorithm). It was such a huge pain to create a challenge for it.

Last edited by arithma (February 11 2012)

raja · February 11 2012

Kudos on proving me wrong. I had a simplification that was wrong.

Classifying is still helpful though in that checking a single string of the same class is the same as checking the others, so it's an optimization. Furthermore, when I check against something from one class it is the strings from that class that are kept in case of a match and all the others that are kept otherwise.

arithma · February 12 2012

So here's my solution. It doesn't use caching yet.

Pastie Version: http://pastie.org/3368456

// MinComparisons
// arithma
//

#include <iostream>
#include <vector>
#include <string>
#include <map>
#include <set>
#include <algorithm>
#include <assert.h>
using namespace std;

typedef vector<string> vec;

struct bounds{
    int min, max;
};
bounds getBounds(vec const &in);
struct partition
{
    vec eq;
    vec neq;
    bounds bounds;
    
    void setBounds(){
        ::bounds beq = getBounds(eq);
        ::bounds bneq = getBounds(neq);
        bounds.max = ::max(beq.max, bneq.max);
        bounds.min = ::max(beq.min, bneq.min);
    }
    
    bool operator<(partition const & other) const {
        return bounds.max < other.bounds.max;
    }
};
typedef map<char, ::partition> charMap;
typedef vector< ::partition> parts;


int comparisons(vec const &in);
void prune(vec &in);
parts partitions(vec const &in, int idx);


int main ()
{
    string strs[] = {
        "fVq9jcBEMNrVl0M2A8VmfNPLgd7kK5JD",
        "I9FlZT90gu2n9ywNFBljC7jce2rR0SFJ",
        "lGOmu7loH80DbOyaYqzQBGwTPDwHlr8u",
        "rFqF5g9u5TXAAXM2sdxHpfZ4VzFBOgNN",
        "lGOmu7loH80DbOyfb6NuuWcEP81D7J5r",
        "tEANjAjjNwvGfG3uhjjAtFfg3OmYXVK2",
        "lGOmu7loH80DOyncyYpajmdnNIGH6HSu",
        "zQUwjMzloRmzlR5jYuvaaa9sqo9AH09B",
        "JsNn1Hu7J2FjMxoi8t49AogNogfPKlk6",
        "lGOmu7loH80DbOyFbtVa1K1m6UxQFvX5"};
    
    vec input(strs, strs + 10);
    
    /*
    string strs[] = {
        "aaa", "aab", "aba", "abb", "bcc", "bcd", "bdc", "bdd"
    };
    
    vec input(strs, strs + 8);
    */
    
    cout << comparisons(input) << endl;
    
    return 0;
}

int comparisons(vec const &in){
    vec c = in;
    prune(c);
    
    bounds b = getBounds(c);
    if(b.min == b.max) return b.min;
    
    parts p;
    int cols = (int)in[0].size();
    for(int col = 0; col < cols; col++){
        parts colPartitions = partitions(in, col);
        p.insert(p.end(), colPartitions.begin(), colPartitions.end());
    }
    sort(p.begin(), p.end());
    int min = b.max;
    for(parts::iterator itr = p.begin(); itr < p.end(); itr++){
        if(itr->bounds.min >= min)
            continue;
        int t = max(comparisons(itr->eq), comparisons(itr->neq));
        min = ::min(t, min);
    }
    
    return min + 1;
}

void prune(vec &in){
    int strings = (int)in.size();
    int cols = (int) in[0].size();
    vector<bool> same(cols, true);
    for(int col = 0; col < cols; col++){
        char firstColChar = in[0][col];
        for(int str = 1; str < strings; str++){
            if(in[str][col] != firstColChar){
                same[col] = false;
                break;
            }
        }
    }
    
    for(int col = cols - 1; col >= 0; col--){
        if(same[col]){
            for(int str = 0; str < strings; str++){
                in[str].erase(in[str].begin() + col);
            }
        }
    }
}

parts partitions(vec const &in, int idx){
    charMap charMap;
    set<char> chars;
    
    int strings = (int)in.size();
    for(int i = 0; i < strings; i++){
        chars.insert(in[i][idx]);
    }
    
    for(set<char>::iterator c = chars.begin(); c != chars.end(); c++){
        for(int i = 0; i < strings; i++){
            if(*c == in[i][idx]){
                charMap[*c].eq.push_back(in[i]);
            }
            else{
                charMap[*c].neq.push_back(in[i]);
            }
        }
        prune(charMap[*c].eq);
        prune(charMap[*c].neq);
        charMap[*c].setBounds();
    }
    
    parts result;
    for(charMap::iterator itr = charMap.begin(); itr != charMap.end(); itr++){
        result.push_back(itr->second);
    }
    
    return result;
}

bounds getBounds(vec const &in){
    bounds result;
    int height = (int)in.size();
    int width = (int)in[0].size();
    result.max = max(width, height-1);
    int min = 0;
    int pw2 = 1;
    while(pw2 < height){
        pw2 *= 2;
        min++;
    }
    result.min = min;
    
    assert(result.min <= result.max);
    return result;
}

LebGeeks

#26 February 8 2012

Re: [Exercise] String Indexer

#27 February 8 2012

Re: [Exercise] String Indexer

#28 February 8 2012

Re: [Exercise] String Indexer

#29 February 8 2012

Re: [Exercise] String Indexer

#30 February 8 2012

Re: [Exercise] String Indexer

#31 February 8 2012

Re: [Exercise] String Indexer

#32 February 8 2012

Re: [Exercise] String Indexer

#33 February 8 2012

Re: [Exercise] String Indexer

#34 February 8 2012

Re: [Exercise] String Indexer

#35 February 8 2012

Re: [Exercise] String Indexer

#36 February 8 2012

Re: [Exercise] String Indexer

#37 February 10 2012

Re: [Exercise] String Indexer

#38 February 10 2012

Re: [Exercise] String Indexer

#39 February 11 2012

Re: [Exercise] String Indexer

#40 February 11 2012

Re: [Exercise] String Indexer

#41 February 12 2012

Re: [Exercise] String Indexer

Board footer