[Exercise] Comparing 2 csv files

Joe · Mar 5, 2012

I chose CSV because they're particularly easy to manipulate, and almost all (dynamic) languages will include some sort of package to parse them easily ^^.

Consider the following file:

before.csv

A; ; B;
B; A; H;
C; ; D;
D; C; E G;
E; K D; F;
F; E; H;
G; D; ;
H; B F G;

And a modified version of the file:

after.csv

A; ; B;
B; A; H;
C; ; D;
D; ; E G;
E; K D; F;
F; E; H;
G; D; ;
K; ; E;

The first field of the CSV is a unique identifier of each line. The exercise consists of detecting the changes applied to the file, by comparing before and after.

There are 3 types of changes you should detect:

- ADDED (line is present in after.csv but not in before.csv)
- REMOVED (line is present in before.csv but not in after.csv)
- MODIFIED (line is present in both, but second and/or third field are modified)

In my example, there are three modifications:

- ADDED line (K)
- REMOVED line (H)
- MODIFIED line (G)

Added complexity: The script is supposed to run on very large files. Try avoiding using a quadratic solution like this one:

BEWARE!! Bad code, do not do this!

for line in before_csv:
    if not after_csv.hasline(line):
        print "REMOVED" + line

for line in after_csv:
    if not before_csv.hasline(line):
        print "ADDED" + line

Ayman · Mar 5, 2012

The files can have different number of rows?

Joe · Mar 6, 2012

@Ayman: Yes if there are added/deleted lines. Remember that each row is identified by the value of its first field (it's unique).

Here's my code. It could use a lot of polishing and can be simplified some more as well. As usual, I'm looking for comments and feedback, so please, be harsh.

import collections
import csv
import sys

class P_SIC(dict):
    fieldnames = ["name", "incond", "outcond"]
    
    def __init__(self, input):
        map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
                                              skipinitialspace=True))

    def readline(self, line):
        self[line["name"]] = line

    def get_elem(self, name):
        for i in self:
            if i == name:
                return self[i]

class Change:
    def __init__(self, *args):
        self.args=args
        
    def echo(self):
        print "\t".join(self.args)

class P_Comparator(collections.Counter):

    def __init__(self, in_psic, out_psic):
        self.change_list = []

        self.in_psic = in_psic
        self.out_psic = out_psic
        
        self.readfile(in_psic, 1)
        self.readfile(out_psic, -1)

    def readfile(self, file, factor):
        for key in file:
            self[key] += factor

    def scant(self):
        for i in self:
            if self[i] == -1:
                self.change_list.append(Change("ADD", i))
            elif self[i] == 1:
                self.change_list.append(Change("DELETE", i))
            else: # element exists in two files. Check if modified
                j = J_Comparator(self.in_psic.get_elem(i), self.out_psic.get_elem(i))
                if len(j) > 0:
                    self.change_list += j


class J_Comparator(list):
    def __init__(self, indict, outdict):
        for i in indict:
            if indict[i] != outdict[i]:
                self.append(Change("MODIFY", indict["name"], i, indict[i], "BECOMES", outdict[i]))
        if len(self) == 0:
            self = None


p = P_Comparator(P_SIC(open(sys.argv[1], "rb")), P_SIC(open(sys.argv[2], "rb")))
p.scant()

print "{} changes".format(len(p.change_list))
[c.echo() for c in p.change_list]

Ayman · Mar 7, 2012

Implementation turned out to be a bit similar to that of rahmu. I was thinking that this could be done in JS and have nice clean web interface, would be a useful visual tool. I'll do it when I have time.

C#

 static void Main(string[] args)
        {
            int i = 0;
            int j = 0;
            int compare = 0;

            List<string[]> before = parseCSV("before.csv");
            List<string[]> after = parseCSV("after.csv");

            while (i < after.Capacity || j < before.Capacity)
            {
                if (i >= after.Capacity) compare = -1;

                else if (j >= before.Capacity) compare = 1;

                else compare = before[j][0].CompareTo(after[i][0]);

                if (compare < 0)
                    Console.WriteLine(string.Format("REMOVED line ({0})",before[j++]));
                
                else if (compare > 0)
                    Console.WriteLine(string.Format("ADDED line ({0})",after[i++]));

                else 
                {
                    if (before[j][1] != after[i][1] || before[j][2] != after[i][2])
                        Console.WriteLine(string.Format("MODIFIED ({0})", before[j]));
                    
                    j++;  i++;
                 }
            }

 private static List<string[]> parseCSV(string path)
        {
            List<string[]> parsedData = new List<string[]>();
            try
            {
                StreamReader readFile = new StreamReader(path);
          
                string line;
                string[] row;

                while ((line = readFile.ReadLine()) != null)
                {
                    row = line.Split(';');
                    parsedData.Add(row);
                }
            }
            catch (Exception e){ Console.WriteLine(e.Message);}

            return parsedData;
           }
        }

xterm · Mar 7, 2012

Off topic

Ayman you could replace your entire parseCSV block with this:

private static List<string[]> parseCSV(string path)
{
       return new StreamReader(path).ReadToEnd().Split('\n').Select(line => line.Split(';')).ToList();
}

Ayman · Mar 7, 2012

@xterm thanks for the tip! You're right that would be much neater. I always tend to write old school style code.

A bit close to the topic, just found this really nice browser based tool for checking diffs between different files including csv.

Joe · Mar 7, 2012

Ayman, I'm not sure (I don't have a C# compiler handy), but I think you're assuming that both files will be sorted so that you comare:

 else compare = before[j][0].CompareTo(after[i][0]);

What if the order of the lines is shuffled?

Ayman · Mar 7, 2012

Yes exactly I am assuming the files are sorted, I thought they were supposed to be sorted actually. I was thinking of sorting the 2 2d arrays after loading them but that wont be much efficient in case of large files right?

Joe · Mar 7, 2012

Yeah probably won't be. Also, how would you deal with an after file like this one?

A; ; B;
AA; ; A;
B; A; H;
C; ; D;
D; ; E G;
E; K D; F;
F; E; H;
G; D; ;
K; ; E;

One more thing:

if (before[j][1] != after[i][1] || before[j][2] != after[i][2])

I would loop over all the fields. What if I added a new field (let alone 3 new fields)? You want your algorithm to still work.
But hey, this is just a personal opinion :)

Ayman · Mar 7, 2012

@rahmu yes good point regarding looping over all the fields, I forgot to.

Regarding your first question, an after file with an index that has more than one character it would still work, in this case it will detect that AA has been added. Its not comparing the characters but the whole string. The CompareTo method would return 1 for "AA".CompareTo("A") which means AA< A in sort order.