another dick-size competition?
No dick size competition. Humbly comparing (my own) implementations.
I'm not interested in the numbers themselves, but in the deltas between two versions. Notice I asked ZeRaW for measurements on the same file on the same computer. My goal is not to prove that my algorithm is better than others', but that storing everything in a hash table and then sorting it cannot be an optimal idea.
Not that I have managed to do it so far ... :(
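To make the claim concrete, here is a sketch of my own (not anyone's posted code, and the function name is mine): keep the hash table, but replace the final full sort with a bounded selection. heapq.nlargest picks the top k in O(u log k) over u distinct words, instead of O(u log u) for sorting all of them.

import heapq

def top_k_words(fname, k=10):
    # Tally with a plain hash table, as in the versions above.
    counts = {}
    for word in open(fname).read().split():
        counts[word] = counts.get(word, 0) + 1
    # Bounded selection: a heap of size k instead of a full sort.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])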
The 2D lists lose a lot of time finding the count of a word.
I think that since wlist[1] is clearly larger than the others (most words appear only once), you could improve this function by starting from the last array and working back to wlist[1] (see the sketch below). But I wonder whether this kind of micro-optimization matters, or even makes sense at all.
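To show what I mean, a sketch against a hypothetical layout (I haven't seen the actual code, so the shape of wlist here is my guess): wlist[c] holds the words that occur exactly c times.

def top_k_from_buckets(wlist, k=10):
    # Assumed layout: wlist[c] is the list of words occurring exactly c
    # times, so wlist[1] (the words seen once) is by far the largest bucket.
    # Walking from the highest count downward lets us stop after k words
    # without ever scanning most of wlist[1].
    result = []
    for count in range(len(wlist) - 1, 0, -1):
        for word in wlist[count]:
            result.append((word, count))
            if len(result) == k:
                return result
    return result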
Although I usually like to write clean and well-documented code, this problem has so many good solutions that I decided to write something in as few lines as I could manage.
I'm still convinced that I can manage something in one line instead of the current three, but I'm giving up for now:
def word_count(fname):
    result = {}
    map(lambda a: result.__setitem__(a, result.setdefault(a, 0)+1), open(fname).read().split())
    return "\n".join(map(lambda row: row[0]+","+str(row[1]), sorted(result.items(), key=lambda a: a[1], reverse=True)[:10]))

print word_count("test_file.txt")
$ cat test_file.txt
this is
some
random file
is is is
Result:
$ python wordcount.py
is,4
this,1
random,1
some,1
file,1
Edit: Got it as a one-liner; here goes:
def word_count(fname):
    return map(lambda a: sorted([(a.count(word), word) for word in set(a)], reverse=True), [open(fname).read().split()])[0][:10]
Quite a few hacks to get this to work, and it's not at all efficient (it scans the whole word list once per distinct word, so it's O(n^2) where n is the number of words), but it works and it's one line.
Note that the map is there so I can bind a name (in this case "a") to "open(fname).read().split()" and avoid reading the file multiple times; I'm essentially using it as the equivalent of a "let", which Python doesn't have.
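For comparison, and with no golfing: assuming Python 2.7+ for collections.Counter, the standard library already does both the tallying and the sort-and-truncate.

from collections import Counter

def word_count(fname):
    # Counter tallies the words; most_common sorts by count and truncates.
    counts = Counter(open(fname).read().split())
    return "\n".join("%s,%d" % (word, n) for word, n in counts.most_common(10))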
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style.
For fun I implemented McIlroy's solution in Python. Here's his shell code:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Here's my Python version of it. I wrote it in an almost literate style, so as to make the comparison with the shell code more obvious:
import itertools
import sys

## Just like the Unix pipeline, the code reads a file f on the standard input
with sys.stdin as f:
    ## reads the input stream and splits it into words
    a = f.read().split()                                # tr -cs A-Za-z '\n' |
    ## lower-cases all characters
    b = (x.lower() for x in a)                          # tr A-Z a-z |
    ## sorts the list alphabetically
    c = sorted(b)                                       # sort |
    ## groups identical items and stores their count
    d = map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(c))  # uniq -c |
    ## sorts items (in reverse order) according to count
    e = reversed(sorted(d, key=lambda x: x[1]))         # sort -rn |
    ## prints the first item
    print(e.next())                                     # sed ${1}q
Note that I purposefully gave bad names to the intermediate variables. In order to replicate the Unix pipeline more accurately we should think of the code as a single instruction that takes other instructions as arguments.
That's right, we're writing a one liner:
print(reversed(sorted(map(lambda x: (x[0], len(list(x[1]))), itertools.groupby(sorted((x.lower() for x in sys.stdin.read().split())))), key=lambda x: x[1])).next())
The goal of this code was to show similarities between the Unix pipeline and the functional programming paradigm.
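One gap worth flagging in my translation: f.read().split() splits on whitespace only, so "file," and "file" would count as different words, whereas tr -cs A-Za-z '\n' squeezes every non-letter into a newline. A closer rendering of that first stage, swapping in re.findall (my own sketch):

import re
import sys

with sys.stdin as f:
    ## a closer match for: tr -cs A-Za-z '\n' |
    ## keeps maximal runs of letters, dropping digits and punctuation
    a = re.findall(r'[A-Za-z]+', f.read())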
Perl 6 one-liner for fun?
slurp('f.txt').comb(/\w+/).classify(*.lc).map({$_.key=>+$_.value}).sort(-*.value).say
I like that it works operation by operation left to right without much nesting.
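For comparison, a rough Python analogue (my sketch, assuming Python 2.7+ for collections.Counter); note how it nests inside-out where the Perl 6 chain reads left to right:

import re
from collections import Counter

# classify(*.lc) ~ lower-casing before counting; sort(-*.value) ~ most_common()
print(Counter(re.findall(r'\w+', open('f.txt').read().lower())).most_common())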