another dick-size competition?
No dick size competition. Humbly comparing (my own) implementations.
I'm not interested in the numbers themselves, but in the deltas between two versions. Notice I asked ZeRaW for measurements on the same file on the same computer. My goal is not to prove that my algorithm is better than others', but that storing everything in a hash table and then sorting it cannot be an optimal idea.
Not that I have managed to do it so far ... :(
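To make the claim concrete, here is a sketch of my own (not anyone's posted code, and the function name is mine): keep the hash table, but replace the final full sort with a bounded selection. heapq.nlargest picks the top k in O(u log k) over u distinct words, instead of O(u log u) for sorting all of them.

import heapq

def top_k_words(fname, k=10):
    # Tally with a plain hash table, as in the versions above.
    counts = {}
    for word in open(fname).read().split():
        counts[word] = counts.get(word, 0) + 1
    # Bounded selection: a heap of size k instead of a full sort.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])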
The 2D lists lose a lot of time finding the count of a word.
I think that since wlist[1] is clearly larger than the others (most words appear only once), you could improve this function by starting from the last array and working back to wlist[1] (see the sketch below). But I wonder whether this kind of micro-optimization matters, or even makes sense at all.
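To show what I mean, a sketch against a hypothetical layout (I haven't seen the actual code, so the shape of wlist here is my guess): wlist[c] holds the words that occur exactly c times.

def top_k_from_buckets(wlist, k=10):
    # Assumed layout: wlist[c] is the list of words occurring exactly c
    # times, so wlist[1] (the words seen once) is by far the largest bucket.
    # Walking from the highest count downward lets us stop after k words
    # without ever scanning most of wlist[1].
    result = []
    for count in range(len(wlist) - 1, 0, -1):
        for word in wlist[count]:
            result.append((word, count))
            if len(result) == k:
                return result
    return result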
Although I usually like to write clean and well-documented code, this problem has so many good solutions that I decided to write something in as few lines as I could manage.
I'm still convinced that I can manage something in one line instead of the current three, but I'm giving up for now:
def word_count(fname):
    result = {}
    map(lambda a: result.__setitem__(a, result.setdefault(a, 0)+1), open(fname).read().split())
    return "\n".join(map(lambda row: row[0]+","+str(row[1]), sorted(result.items(), key=lambda a: a[1], reverse=True)[:10]))

print word_count("test_file.txt")
$ cat test_file.txt
this is
some
random file
is is is
Result:
$ python wordcount.py
is,4
this,1
random,1
some,1
file,1
Edit: Got it as a one-liner; here goes:
def word_count(fname):
    return map(lambda a: sorted([(a.count(word), word) for word in set(a)], reverse=True), [open(fname).read().split()])[0][:10]
Quite a few hacks to get this to work, and it's not at all efficient (it scans the whole word list once per distinct word, so it's O(n^2) where n is the number of words), but it works and it's one line.
Note that the map is there so I can bind a name (in this case "a") to "open(fname).read().split()" and avoid reading the file multiple times; I'm essentially using it as the equivalent of a "let", which Python doesn't have.
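For comparison, and with no golfing: assuming Python 2.7+ for collections.Counter, the standard library already does both the tallying and the sort-and-truncate.

from collections import Counter

def word_count(fname):
    # Counter tallies the words; most_common sorts by count and truncates.
    counts = Counter(open(fname).read().split())
    return "\n".join("%s,%d" % (word, n) for word, n in counts.most_common(10))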
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style.
For fun I implemented McIlroy's solution in Python. Here's his shell code:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Here's my Python version of it. I wrote it in an almost literate style, so as to make the comparison with the shell code more obvious:
import itertools
import sys

## Just like the Unix pipeline, the code reads a file f on the standard input
with sys.stdin as f:
    ## reads the input stream and splits it into words
    a = f.read().split()                                # tr -cs A-Za-z '\n' |
    ## lower-cases all characters
    b = (x.lower() for x in a)                          # tr A-Z a-z |
    ## sorts the list alphabetically
    c = sorted(b)                                       # sort |
    ## groups identical items and stores their count
    d = map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(c))  # uniq -c |
    ## sorts items (in reverse order) according to count
    e = reversed(sorted(d, key=lambda x: x[1]))         # sort -rn |
    ## prints the first item
    print(e.next())                                     # sed ${1}q
Note that I purposefully gave bad names to the intermediate variables. In order to replicate the Unix pipeline more accurately we should think of the code as a single instruction that takes other instructions as arguments.
That's right, we're writing a one liner:
print(reversed(sorted(map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(sorted((x.lower() for x in sys.stdin.read().split())))), key=lambda x: x[1])).next())
The goal of this code was to show similarities between the Unix pipeline and the functional programming paradigm.
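One gap worth flagging in my translation: f.read().split() splits on whitespace only, so "file," and "file" would count as different words, whereas tr -cs A-Za-z '\n' squeezes every non-letter into a newline. A closer rendering of that first stage, swapping in re.findall (my own sketch):

import re
import sys

with sys.stdin as f:
    ## a closer match for: tr -cs A-Za-z '\n' |
    ## keeps maximal runs of letters, dropping digits and punctuation
    a = re.findall(r'[A-Za-z]+', f.read())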
Perl 6 one-liner for fun?
slurp('f.txt').comb(/\w+/).classify(*.lc).map({$_.key=>+$_.value}).sort(-*.value).say
I like that it works operation by operation left to right without much nesting.
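For comparison, a rough Python analogue (my sketch, assuming Python 2.7+ for collections.Counter); note how it nests inside-out where the Perl 6 chain reads left to right:

import re
from collections import Counter

# classify(*.lc) ~ lower-casing before counting; sort(-*.value) ~ most_common()
print(Counter(re.findall(r'\w+', open('f.txt').read().lower())).most_common())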