[Exercise] Word count 2

Joe · Dec 13, 2011

This is an implementation similar to the second one I presented, except it uses a 2D array instead of a dictionary, using the indexes as count.

(In other words, all the words in the list wlist[3] have appeared 3 times in the text). The solution is simple as it only relies on python lists.

import sys

class WCount():

    def __init__(self, filename):
        self.wlist = [[]]
        self.filename = filename

    def getcurrcount(self, word, count=0):
        if count +1 > len(self.wlist):
            return 0

        if word in self.wlist[count-1]:
            return (count - 1) % len(self.wlist)
        else:
            return self.getcurrcount(word, count+1)

    def incposition(self, word):
        n = self.getcurrcount(word)
        if n != 0:
            self.wlist[n].remove(word)

        if n == len(self.wlist) - 1:
            self.wlist.append([word])
        else:
            self.wlist[n+1].append(word)
            
    def readfile(self):
        with open(self.filename, "r") as f:
            for line in f:
                for w in line.split():
                    self.incposition(w)

    def pprint(self, count=1):
        if count == len(self.wlist):
            return
        for w in self.wlist[count]:
            print "{}\t{}".format(count, w)
        self.pprint (count+1)

w = WCount(sys.argv[1])
w.readfile()
w.pprint()

The code relies on a couple of recursive calls, which made it impossible for me to benchmark, especially on large files( RuntimeError: maximum recursion depth exceeded). I thought cpython would be able to optimize this.

I compared my three implementations on a moderate large file (60k). Here are the average results:

- first (map then sort): 1.76s
- second (map with indexes as key): 2.05s
- third (2D matrix): 2.70s

Note that the results of the third implementation could be greatly improved by removing the recursions. I just find them more pleasant to read :)

Jjsaade · Dec 13, 2011

same as yours in c++, test file 485K - 0.35seconds, no recursion

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <vector>
#include <list>
#include <time.h>
using namespace std;
size_t getCount(vector<list<string>> &arrayList, string word)
{
	if(!arrayList.size()) return 0;
	for(size_t i =0; i < arrayList.size(); ++i){
		if(find(arrayList[i].begin(), arrayList[i].end(), word) != arrayList[i].end()) return i;
	}
	return 0;
}
void main()
{
	string word;
	size_t maxSize = 40000;
    fstream file("file.txt",ios_base::in);
    vector<list<string>> arrayList(maxSize);
    while(file>>word){
       transform(word.begin(), word.end(), word.begin(), tolower);
       size_t count = getCount(arrayList,word);
	   if(!count) arrayList[1].push_back(word);
	   else {
		   arrayList[count++].remove(word);
		   arrayList[count].push_back(word);
	   }
    }
    file.close();
	//print
    for (size_t i = 1; i < arrayList.size(); ++i){
        
        if(arrayList[i].size() != 0){
            cout <<"Frequency: "<< i <<'\t'<<" Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin();iter != arrayList[i].end(); ++iter )
                cout<<*iter<<", ";
            cout<<endl;
        }
    }
}

Joe · Dec 13, 2011

ZeRaW wrotetest file 485K - 0.35seconds, no recursion

Can you please add the benchmark of your other implementation (map + occurrences as keys) with the same file on the same computer?

geek · Dec 13, 2011

another dick-size competition? here it goes: 27ms for 562KiB using my simple snippet.
what's with the meaningless numbers fellows? where's the insight? shotgun optimization will get you nowhere.

Jjsaade · Dec 13, 2011

geek wroteanother dick-size competition? here it goes: 27ms for 562KiB using my simple snippet.
what's with the meaningless numbers fellows? where's the insight? shotgun optimization will get you nowhere.

agree with you, but for the sake of fun
File 485KB
Method 1: using stl map, sorted alphabetically, 0.03s
Method 2: using indices as, hash table as elements, 0.08s
Method 3: 2D lists 0.35s
the 2D lists lose a lot of time with finding the count of the word.

Joe · Dec 13, 2011

another dick-size competition?

No dick size competition. Humbly comparing (my own) implementations.

I'm not interested in numbers, but in deltas between 2 versions.Notice I asked ZeRaW for measurements for the same file on the same computer. My goal is not to prove that my algorithm is better than others, but that storing everything in a hash table then sorting it cannot be an optimal idea.

Not that I have managed to do it so far ... :(

ZeRaW wrotethe 2D lists lose a lot of time with finding the count of the word.

I think that since wlist[1] is clearly larger than others (most words appear only once), you could improve this function by starting from the last array all the way to array wlist[1]. But I wonder if this kind of micro optimization matters, or even makes sense at all.

Rraja · Jan 13, 2012

Although I usually like to write clean and well documented code, this problem has so many good solutions that I decided to write something in as few lines as I could manage.

I'm still convinced that I can manage something in one line instead of the current three, but I'm giving up for now:

def word_count(fname):
    result = {}
    map(lambda a:result.__setitem__(a, result.setdefault(a, 0)+1), open(fname).read().split())
    return "\n".join(map(lambda row: row[0]+","+str(row[1]), sorted(result.items(), key=lambda a:a[1], reverse=True)[:10]))

print word_count("test_file.txt")

test_file.txt

$ cat test_file.txt 
this is
some 
random file
is is is

Result:

$ python wordcount.py 
is,4
this,1
random,1
some,1
file,1

Edit: Got it as a oneliner, here goes:

def word_count(fname):
    return map(lambda a: sorted([(a.count(word), word) for word in set(a)], reverse=True), [open(fname).read().split()])[0][:10]

Quite a few hacks to get this to work, and it's not at all efficient(goes over the whole file for every word, so it's O(n^2) where n is the number of words) but it works and it's one line.

Note that the map is there to be able to assign a value(in this case "a") to "open(fname).read().split()" so I don't have to read the file multiple times(I'm using it as the equivalent of a "let" essentially but python doesn't have that)

Joe · Aug 20, 2012

By the way, I recognized this problem from this article on Donald Knuth and Dough McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style.

For fun I implemented McIlroy's solution in Python. Here's his shell code:

tr -cs A-Za-z '\n' |  
tr A-Z a-z |          
sort |                
uniq -c |             
sort -rn |            
sed ${1}q

Here's my Python version of it. I wrote it in an almost litterate style, as to make the comparison with the shell code more obvious:

import itertools
import sys

## Just like the Unix pipeline, the code reads a file f on the standard input
with sys.stdin as f:

## reads the input stream and splits it into words     #
    a = f.read().split() # tr -cs A-Za-z '\n' |        #

## lower case all characters                       #
    b = (x.lower() for x in a) # tr A-Z a-z | #

## sort the list alphabetically #
    c = sorted(b) # sort        #

## group items that are similar and store their count                          #
    d = map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(c)) # uniq -c #

## sort items (in reverse order) according to count        #
    e = reversed(sorted(d, key=lambda x: x[1])) # sort -rn #

## print the first item        #
    print(e.next()) # sed {1}q #

Note that I purposefully gave bad names to the intermediate variables. In order to replicate the Unix pipeline more accurately we should think of the code as a single instruction that takes other instructions as arguments.
That's right, we're writing a one liner:

print(reversed(sorted(map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(sorted((x.lower() for x in sys.stdin.read().split())))), key=lambda x: x[1])).next())

The goal of this code was to show similarities between the Unix pipeline and the functional programming paradigm.

Ppatrickas · Feb 23, 2013

Perl6 One liner for fun?

slurp('f.txt').comb(/\w+/).classify(*.lc).map({$_.key=>+$_.value}).sort(-*.value).say

I like that it works operation by operation left to right without much nesting.