Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
can you provide the text? or should we just get a random one?
it doesn't really matter, you can choose randomly. I should be able to test it on a text of my liking.
ok, on it
Easy solution in python
#!/usr/local/bin/python2.7
class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:
                self.pprint()

    def addcount(self, word):
        if self._dict.has_key(word):
            self._dict[word] += 1
        else:
            self._dict[word] = 1

    def pprint(self):
        for word in sorted(self._dict, key=self._dict.get, reverse=True):
            print "{}\t{}".format(word, self._dict[word])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
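For comparison, the same map-then-sort idea fits in a few lines of modern Python 3 with collections.Counter, which also covers the "n most frequently used words" part of the original problem. This is just a sketch, not the class above; it accepts any iterable of lines, such as an open file:

```python
from collections import Counter

def top_words(lines, n):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    counts = Counter(word for line in lines for word in line.split())
    return counts.most_common(n)  # sorted by count, highest first

# usage: with open(sys.argv[1]) as f: print(top_words(f, 10))
```

Counter.most_common(n) performs the same sort-by-value step that the class above implements with sorted(..., key=..., reverse=True).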
well here is mine!!!
words = input("Please insert your text: ").replace(",", " ").replace(".", " ").replace(":", " ").replace(";", " ").replace("!", " ").replace("?", " ").lower().split()
print(words)
listOfWords = []
freqOfWords = []
for i in words:
    if i not in listOfWords:
        listOfWords.append(i)
        freqOfWords.append(words.count(i))
freqAndWords = []
for i in range(len(freqOfWords)):
    freqAndWords.append(str(freqOfWords[i]) + " = " + listOfWords[i])
freqAndWords.sort()
for i in range(len(freqAndWords)):
    print(freqAndWords[-(i+1)])
Last edited by AxisFlame (December 10 2011)
#!/usr/bin/perl
use strict;
use warnings;

my ($n, $f) = @ARGV;
my @words;
my %unique;

open IN, "<", $f or die;
push(@words, split(' ', lc $_)) while (<IN>);

foreach my $word (@words) {
    $word =~ s/[[:punct:]]//g;
    if (grep {$_ eq $word} keys(%unique)) { $unique{$word}++; }
    else { $unique{$word} = 1; }
}

my @freqs = sort {$b <=> $a} values(%unique);
foreach my $i (0 .. $n - 1) {
    while (my ($k, $v) = each(%unique)) {
        if ($v == $freqs[$i]) { print "$k: $v\n"; last; }
    }
}
I'm sure it can be condensed into a one-liner but I'm not smart enough to do it.
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style.
Last edited by saeidw (December 10 2011)
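For those who haven't read the article: McIlroy's answer was a short shell pipeline, roughly tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q. Here is a rough Python translation of the same steps, just to show the structure; the regex split standing in for the tr calls is my approximation:

```python
# A rough Python rendering of McIlroy's classic pipeline:
#   tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
import re
from collections import Counter

def mcilroy(text, n):
    # tr -cs A-Za-z '\n' + tr A-Z a-z: lowercase, keep only runs of letters
    words = re.findall(r"[a-z]+", text.lower())
    # sort | uniq -c | sort -rn | sed ${n}q: count and take the top n
    return Counter(words).most_common(n)
```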
Sort[Tally[StringSplit[StringReplace[ToLowerCase@Import["lipsum.txt"], RegularExpression["\\W"] -> " "]]], #1[[2]] > #2[[2]] &] // TableForm
geek, what language is this? Its use of square brackets weirds me out ...
Ruby
text = File.open("/path/to/file", "r").read
words = text.split
bucket = {}
words.each do |w|
  bucket[:"#{w}"].nil? ? bucket[:"#{w}"] = 1 : bucket[:"#{w}"] += 1
end
bucket.sort_by { |word, fq| fq }
geek, what language is this? Its use of square brackets weirds me out ...
It smells like Mathematica. Correct me if I'm wrong.
<?php
$f = 'filename.txt';
$h = fopen($f, 'r');
$words = explode(" ", strtolower(fread($h, filesize($f))));
fclose($h);
foreach ($words as $word) {
    @$freq[$word] ? $freq[$word]++ : $freq[$word] = 1;
}
arsort($freq);
var_dump($freq);
?>
It smells like Mathematica. Correct me if I'm wrong.
It is, nice one geek :)
Should the sorting be by frequency or by word?
<?php
$text = file_get_contents("./file.txt");
$words = explode(" ", $text);
$index = array();
foreach ($words as $word) {
    $tWord = strtolower(trim($word)); // remove needless spaces and change to lower case
    if (!empty($tWord)) {
        isset($index[$tWord]) ? $index[$tWord]++ : $index[$tWord] = 1;
    }
}
ksort($index);
?>
<pre>
<?php print_r($index); ?>
</pre>
Last edited by rolf (December 11 2011)
no c++?
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
using namespace std;

int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> dictionary;
    map<string, int>::iterator p;
    while (!file.eof()) {
        file >> word;
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        p = dictionary.find(word);
        if (p != dictionary.end()) p->second++;
        else dictionary.insert(make_pair(word, 1));
    }
    file.close();
    for (map<string, int>::const_iterator iter = dictionary.begin(); iter != dictionary.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
}
Last edited by ZeRaW (December 12 2011)
@ZeRaW: Your solution doesn't sort the output by frequency, it merely stores all the words in a dictionary.
Also, it seems to misbehave in certain cases. Try putting the following in the file:
file.txt
j jj j
output:
j 3
jj 1
std::map sorts by key, so it sorts alphabetically by word. This was my first question: are you sorting by frequency or alphabetically? :)
In any case, my output for j jj j is: [VS 2010]
j 2
jj 1
I am not sure what compiler you are using, but the default behavior of std::map is to sort by key.
Smaller code :)
int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> words;
    while (file >> word) {
        words[word]++;
    }
    file.close();
    for (map<string, int>::const_iterator iter = words.begin(); iter != words.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
}
Last edited by ZeRaW (December 12 2011)
So all the solutions are the same: create a dictionary {word: number of occurrences}, then sort by value and print.
(I'm not sure about geek's solution, I don't really speak... Mathematica?)
Basically, we're all relying on the sort() functions provided by our favorite languages. Where's the challenge?
In the spirit of "thinking outside the box", I tried to come up with something different. I store the words in a dictionary that has numbers as the keys, and the values are lists containing the corresponding words.
Here's the code; notice mainly the addcount function. The rest is syntactic sugar.
#!/usr/local/bin/python2.7
class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:
                self.pprint()

    def addcount(self, word):
        _tmpindex = 0
        for n in self._dict.keys():
            if word in self._dict[n]:
                _tmpindex = n
        if _tmpindex != 0:
            self._dict[_tmpindex].remove(word)
        else:
            self._dict[_tmpindex] = []
        if not self._dict.has_key(_tmpindex + 1):
            self._dict[_tmpindex + 1] = []
        self._dict[_tmpindex + 1].append(word)

    def pprint(self):
        for num in sorted(self._dict.iterkeys(), reverse=True):
            if len(self._dict[num]) != 0:
                print "{}\t{}".format(num, self._dict[num])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
The code still needs a lot of polishing, but it basically implements my idea. I've never studied algorithms before, so if anyone has any criticism, I'm all ears (well, eyes).
Maybe we can come up with an even better solution? Are there any Knuths in the house? ;)
EDIT Just tested it, no time difference for small files. For very large files, my second solution is extremely slow. Maybe there's something wrong with my implementation. Or my reasoning. Probably both.
std::map is a self-balancing binary search tree, so all operations are O(log n).
@rahmu: I see what you're going for, you are trying to avoid sorting.
But in the implementation, the append/remove operations could be costly.
As a C++ implementation I would pre-allocate a large enough frequency array
and minimize the append/remove operations. I'll try to implement something :)
Last edited by ZeRaW (December 12 2011)
Fast implementation of rahmu's algorithm.
Some notes:
- using std::map as a hash table
- preallocating some values to avoid dynamic allocations
- I am pretty sure this performs slowly for large files (and is expensive)
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
#include <vector>
#include <list>
using namespace std;

int main()
{
    string word;
    const size_t maxSize = 4000; // max frequency per word
    fstream file("file.txt", ios_base::in);
    map<string, size_t> words;
    vector<list<string> > arrayList(maxSize);
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        arrayList[words[word]].remove(word);
        words[word]++;
        arrayList[words[word]].push_back(word);
    }
    file.close();
    // print
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
}
As expected for a large file (lorem ipsum 1.52MB) my first implementation takes 0.2-0.3 seconds and this one 0.8-0.9 seconds.
Last edited by ZeRaW (December 12 2011)
This is an implementation similar to the second one I presented, except it uses a 2D array instead of a dictionary, using the indices as counts.
(In other words, all the words in the list wlist[3] have appeared 3 times in the text.) The solution is simple, as it only relies on Python lists.
import sys

class WCount():
    def __init__(self, filename):
        self.wlist = [[]]
        self.filename = filename

    def getcurrcount(self, word, count=0):
        if count + 1 > len(self.wlist):
            return 0
        if word in self.wlist[count-1]:
            return (count - 1) % len(self.wlist)
        else:
            return self.getcurrcount(word, count+1)

    def incposition(self, word):
        n = self.getcurrcount(word)
        if n != 0:
            self.wlist[n].remove(word)
        if n == len(self.wlist) - 1:
            self.wlist.append([word])
        else:
            self.wlist[n+1].append(word)

    def readfile(self):
        with open(self.filename, "r") as f:
            for line in f:
                for w in line.split():
                    self.incposition(w)

    def pprint(self, count=1):
        if count == len(self.wlist):
            return
        for w in self.wlist[count]:
            print "{}\t{}".format(count, w)
        self.pprint(count + 1)

w = WCount(sys.argv[1])
w.readfile()
w.pprint()
The code relies on a couple of recursive calls, which made it impossible for me to benchmark on large files (RuntimeError: maximum recursion depth exceeded). I thought CPython would be able to optimize this.
I compared my three implementations on a moderately large file (60k). Here are the average results:
- first (map then sort): 1.76s
- second (map with indexes as key): 2.05s
- third (2D matrix): 2.70s
Note that the results of the third implementation could be greatly improved by removing the recursions. I just find them more pleasant to read :)
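For what it's worth, the two recursive helpers translate directly into plain loops, which also sidesteps the recursion limit. Here is a sketch of the iterative equivalents, written as standalone functions over the same wlist structure, implementing the intended lookup rather than the modular-index trick:

```python
# Iterative versions of the two recursive helpers in the 2D-list solution.
# wlist[i] holds the words seen exactly i times, as in the class above.

def getcurrcount(wlist, word):
    # plain linear scan instead of recursion; returns 0 if the word is new
    for count, bucket in enumerate(wlist):
        if word in bucket:
            return count
    return 0

def pprint(wlist):
    # plain loop instead of tail recursion (CPython does not eliminate tail calls)
    for count in range(1, len(wlist)):
        for w in wlist[count]:
            print("{}\t{}".format(count, w))
```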
Same as yours in C++; test file 485K, 0.35 seconds, no recursion.
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <vector>
#include <list>
#include <time.h>
using namespace std;

size_t getCount(vector<list<string> > &arrayList, string word)
{
    if (!arrayList.size()) return 0;
    for (size_t i = 0; i < arrayList.size(); ++i) {
        if (find(arrayList[i].begin(), arrayList[i].end(), word) != arrayList[i].end()) return i;
    }
    return 0;
}

int main()
{
    string word;
    size_t maxSize = 40000;
    fstream file("file.txt", ios_base::in);
    vector<list<string> > arrayList(maxSize);
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        size_t count = getCount(arrayList, word);
        if (!count) arrayList[1].push_back(word);
        else {
            arrayList[count++].remove(word);
            arrayList[count].push_back(word);
        }
    }
    file.close();
    // print
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
}
test file 485K - 0.35 seconds, no recursion
Can you please add the benchmark of your other implementation (map + occurrences as keys) with the same file on the same computer?
another dick-size competition? here it goes: 27ms for 562KiB using my simple snippet.
what's with the meaningless numbers fellows? where's the insight? shotgun optimization will get you nowhere.
Agree with you, but for the sake of fun:
File 485KB
Method 1: using std::map, sorted alphabetically, 0.03s
Method 2: using indices as keys, hash table as elements, 0.08s
Method 3: 2D lists, 0.35s
The 2D lists lose a lot of time finding the current count of each word.
Last edited by ZeRaW (December 13 2011)
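The slow part of the 2D-list method is exactly that lookup: finding a word's current bucket means scanning every bucket. Keeping a word-to-count dict next to the frequency buckets, which is essentially what Method 2 above does, makes the lookup O(1). A Python sketch of the combined structure:

```python
# Counting with buckets indexed by frequency, plus a dict for O(1) lookups.
# counts[word] records which bucket the word is currently in, so we never
# have to scan the buckets the way the plain 2D-list version does.

def bucket_count(words):
    counts = {}        # word -> current frequency
    buckets = [[]]     # buckets[i] holds the words seen exactly i times
    for word in words:
        n = counts.get(word, 0)
        if n:
            buckets[n].remove(word)   # move the word up one bucket
        counts[word] = n + 1
        if n + 1 == len(buckets):
            buckets.append([])        # grow on demand, no preallocation needed
        buckets[n + 1].append(word)
    return buckets
```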