[Exercise] Word count 2
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Can you provide the text? Or should we just get a random one?
It doesn't really matter, you can choose randomly. I should be able to test it on a text of my liking.
ok, on it :D
Easy solution in Python
#!/usr/local/bin/python2.7

class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:  # end of file reached
                self.pprint()

    def addcount(self, word):
        if self._dict.has_key(word):
            self._dict[word] += 1
        else:
            self._dict[word] = 1

    def pprint(self):
        # sort by frequency, highest first
        for word in sorted(self._dict, key=self._dict.get, reverse=True):
            print "{}\t{}".format(word, self._dict[word])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
Well, here is mine!!!
words = input("Please insert your text: ").replace(",", " ").replace(".", " ").replace(":", " ").replace(";", " ").replace("!", " ").replace("?", " ").lower().split()
print(words)

listOfWords = []
freqOfWords = []
for i in words:
    if i not in listOfWords:
        listOfWords.append(i)
        freqOfWords.append(words.count(i))

# pair each frequency with its word so the sort is numeric
# (sorting "freq = word" strings would put "10" before "2")
freqAndWords = []
for i in range(len(freqOfWords)):
    freqAndWords.append((freqOfWords[i], listOfWords[i]))
freqAndWords.sort()

for freq, word in reversed(freqAndWords):
    print(str(freq) + " = " + word)
PERL!
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style (his pipeline is quoted after the code below).
#!/usr/bin/perl
use strict;
use warnings;

my ($n, $f) = @ARGV;
my @words;
my %unique;

open my $in, '<', $f or die $!;
push(@words, split(' ', lc $_)) while (<$in>);
close $in;

foreach my $word (@words) {
    $word =~ s/[[:punct:]]//g;
    $unique{$word}++;    # autovivifies, so no need to scan the keys first
}

# sort the words by descending frequency and print the top $n
my @sorted = sort { $unique{$b} <=> $unique{$a} } keys %unique;
foreach my $word (@sorted[0 .. $n - 1]) {
    last unless defined $word;   # in case $n exceeds the number of unique words
    print "$word: $unique{$word}\n";
}
I'm sure it can be condensed into a one-liner but I'm not smart enough to do it.
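For reference, McIlroy's solution from that article is a six-stage shell pipeline, quoted here from memory (so treat the exact flags as approximate): squeeze every run of non-letters into a newline, lowercase, sort, count duplicates, sort by count, and keep the first n lines.
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Each stage does exactly one job, which is the Unix style the article praises.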
Sort[Tally[StringSplit[StringReplace[ToLowerCase@Import["lipsum.txt"], RegularExpression["\\W"] -> " "]]], #1[[2]] > #2[[2]] &] // TableForm
geek, what language is this? Its use of square brackets weirds me out ...
Ruby
text = File.open("/path/to/file", "r").read
words = text.split

bucket = Hash.new(0)   # default value 0, so no nil check is needed
words.each do |w|
  bucket[w] += 1
end

# sort_by returns a new array, so print it (descending by frequency)
bucket.sort_by { |_word, freq| -freq }.each do |word, freq|
  puts "#{word}\t#{freq}"
end
rahmu wrote: geek, what language is this? Its use of square brackets weirds me out ...
It smells like Mathematica. Correct me if I'm wrong.
<?php
$f = 'filename.txt';
$h = fopen($f, 'r');
$words = explode(" ", strtolower(fread($h, filesize($f))));
fclose($h);

$freq = array();
foreach ($words as $word) {
    // isset() avoids the undefined-index notice without the @ suppressor
    isset($freq[$word]) ? $freq[$word]++ : $freq[$word] = 1;
}

arsort($freq); // sort by value, descending
var_dump($freq);
?>
samer wrote: It smells like Mathematica. Correct me if I'm wrong.
It is, nice one geek :)
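For anyone who doesn't read Mathematica, here is a rough Python equivalent of what geek's one-liner does (a sketch; like the original, it assumes a lipsum.txt in the working directory):
import re
from collections import Counter

with open("lipsum.txt") as f:
    text = f.read().lower()
# replace every non-word character with a space, then split and tally
words = re.sub(r"\W", " ", text).split()
for word, count in Counter(words).most_common():
    print "{}\t{}".format(word, count)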
Should the sorting be by frequency or by word?
<?php
$text = file_get_contents("./file.txt");
$words = explode(" ", $text);
$index = array();
foreach ($words as $word) {
    $tWord = strtolower(trim($word)); // remove needless spaces and change to lower case
    if (!empty($tWord)) {
        isset($index[$tWord]) ? $index[$tWord]++ : $index[$tWord] = 1;
    }
}
ksort($index); // sorts by key, i.e. alphabetically by word
?>
<pre>
<?php print_r($index); ?>
</pre>
No C++?
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <cctype>
#include <map>
using namespace std;

int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> dictionary;
    map<string, int>::iterator p;
    while (!file.eof()) {
        file >> word;
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        p = dictionary.find(word);
        if (p != dictionary.end()) p->second++;
        else dictionary.insert(make_pair(word, 1));
    }
    file.close();
    for (map<string, int>::const_iterator iter = dictionary.begin(); iter != dictionary.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
    return 0;
}
@ZeRaW: Your solution doesn't sort the output, it merely stores all the words in a dictionary.
Also, it seems that it bugs in certain cases. Try putting the following in the file:
file.txt:
j jj j
output:
j 3
jj 1
rahmu wrote: @ZeRaW: Your solution doesn't sort the output, it merely stores all the words in a dictionary.
Also, it seems that it bugs in certain cases. Try putting the following in the file:
file.txt: j jj j
output:
j 3
jj 1
stl map sorts by key, so it sorts alphabetically by word. This was my first question: are you sorting by frequency or alphabetically? :)
In any case, my output for "j jj j" is (VS 2010):
j 2
jj 1
I am not sure what compiler you are using, but the default behavior of stl::map is to sort.

Smaller code :)
int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> words;
    while (file >> word) {  // reading in the loop condition stops cleanly at end of file
        words[word]++;      // operator[] default-constructs the count to 0 on first use
    }
    file.close();
    for (map<string, int>::const_iterator iter = words.begin(); iter != words.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
    return 0;
}
So all the solutions are the same: create a dictionary {word: number of occurrences}, then sort by value and print.
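In Python, that common pattern boils down to something like this (a sketch, assuming Python 2.7 for collections.Counter and the filename as the first argument):
import sys
from collections import Counter

with open(sys.argv[1]) as f:
    counts = Counter(f.read().lower().split())

# the language's built-in sort does all the real work
for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print "{}\t{}".format(word, n)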
(I'm not sure about geek's solution, I don't really speak... Mathematica?)
Basically, we're all relying on the "sort()" functions provided by our favorite languages. Where's the challenge?
In the spirit of "thinking outside the box", I tried to come up with something different. I store the words in a dictionary that has numbers as the keys, and the values are lists containing the corresponding words.
Here's the code; notice mainly the "addcount" function. The rest is syntactic sugar.
#!/usr/local/bin/python2.7

class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:  # end of file reached
                self.pprint()

    def addcount(self, word):
        # linear scan over every bucket to find the word's current count
        _tmpindex = 0
        for n in self._dict.keys():
            if word in self._dict[n]:
                _tmpindex = n
        if _tmpindex != 0:
            self._dict[_tmpindex].remove(word)
        # move the word up to the next bucket
        if not self._dict.has_key(_tmpindex + 1):
            self._dict[_tmpindex + 1] = []
        self._dict[_tmpindex + 1].append(word)

    def pprint(self):
        for num in sorted(self._dict.iterkeys(), reverse=True):
            if len(self._dict[num]) != 0:
                print "{}\t{}".format(num, self._dict[num])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
The code still needs a lot of polishing, but it basically implements my idea. I've never studied algorithms before, so if anyone has any criticism, I'm all ears (well, eyes). Maybe we can come up with a better solution? Are there any Knuths in the house? ;)
EDIT: Just tested it, no time difference for small files. For very large files, my second solution is extremely slow. Maybe there's something wrong with my implementation. Or my reasoning. Probably both.
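The slowness most likely comes from addcount scanning every bucket for every word. Here is a sketch of one way to keep the bucket idea but make the lookup O(1), using a companion word -> count dict (the names are mine and this is untested against the same files):
from collections import defaultdict

class WCountFast(object):
    def __init__(self):
        self.counts = {}                  # word -> current frequency
        self.buckets = defaultdict(set)   # frequency -> set of words at that frequency

    def addcount(self, word):
        n = self.counts.get(word, 0)
        if n:
            self.buckets[n].discard(word)  # O(1) removal from a set
        self.counts[word] = n + 1
        self.buckets[n + 1].add(word)

    def pprint(self):
        for num in sorted(self.buckets, reverse=True):
            if self.buckets[num]:
                print "{}\t{}".format(num, sorted(self.buckets[num]))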
Map is a self-balancing binary search tree, so all operations are O(log n).
@rahmu: I am not sure about your logic; you are trying to avoid sorting,
but in the implementation the append/remove operations could be costly.
As a C++ implementation, I would pre-allocate a large enough frequency array
and minimize append/remove operations. I'll try to implement something :)
Fast implementation for rahmu's algorithm:
Some notes:
- using stl::map as a hash table
- preallocating some values to avoid dynamic allocations
- I am pretty sure this performs slowly for large files (and is expensive)
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <cctype>
#include <map>
#include <vector>
#include <list>
using namespace std;

int main()
{
    string word;
    const size_t maxSize = 4000; // maximum expected frequency per word; a word seen this often would index out of bounds
    fstream file("file.txt", ios_base::in);
    map<string, size_t> words;                 // word -> current frequency
    vector<list<string> > arrayList(maxSize);  // frequency -> words at that frequency
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        arrayList[words[word]].remove(word);   // drop the word from its old bucket
        words[word]++;
        arrayList[words[word]].push_back(word);
    }
    file.close();
    // print, lowest frequency first
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
    return 0;
}
As expected, for a large file (lorem ipsum, 1.52 MB) my first implementation takes 0.2-0.3 seconds and this one takes 0.8-0.9 seconds.