[Exercise] Word count 2
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Can you provide the text? Or should we just get a random one?
It doesn't really matter, you can choose randomly. I should be able to test it on a text of my liking.
ok, on it :D
Easy solution in Python
#!/usr/local/bin/python2.7

class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:  # end of file reached
                self.pprint()

    def addcount(self, word):
        if self._dict.has_key(word):
            self._dict[word] += 1
        else:
            self._dict[word] = 1

    def pprint(self):
        # sort by frequency, highest first
        for word in sorted(self._dict, key=self._dict.get, reverse=True):
            print "{}\t{}".format(word, self._dict[word])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
Well, here is mine!!!
words = input("Please insert your text: ").replace(",", " ").replace(".", " ").replace(":", " ").replace(";", " ").replace("!", " ").replace("?", " ").lower().split()
print(words)

listOfWords = []
freqOfWords = []
for i in words:
    if i not in listOfWords:
        listOfWords.append(i)
        freqOfWords.append(words.count(i))

# pair each frequency with its word so the sort is numeric
# (sorting "freq = word" strings would put "10" before "2")
freqAndWords = []
for i in range(len(freqOfWords)):
    freqAndWords.append((freqOfWords[i], listOfWords[i]))
freqAndWords.sort()

for freq, word in reversed(freqAndWords):
    print(str(freq) + " = " + word)
PERL!
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style (his pipeline is quoted after the code below).
#!/usr/bin/perl
use strict;
use warnings;

my ($n, $f) = @ARGV;
my @words;
my %unique;

open my $in, '<', $f or die $!;
push(@words, split(' ', lc $_)) while (<$in>);
close $in;

foreach my $word (@words) {
    $word =~ s/[[:punct:]]//g;
    $unique{$word}++;    # autovivifies, so no need to scan the keys first
}

# sort the words by descending frequency and print the top $n
my @sorted = sort { $unique{$b} <=> $unique{$a} } keys %unique;
foreach my $word (@sorted[0 .. $n - 1]) {
    last unless defined $word;   # in case $n exceeds the number of unique words
    print "$word: $unique{$word}\n";
}
I'm sure it can be condensed into a one-liner but I'm not smart enough to do it.
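For reference, McIlroy's solution from that article is a six-stage shell pipeline, quoted here from memory (so treat the exact flags as approximate): squeeze every run of non-letters into a newline, lowercase, sort, count duplicates, sort by count, and keep the first n lines.
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Each stage does exactly one job, which is the Unix style the article praises.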
Sort[Tally[StringSplit[StringReplace[ToLowerCase@Import["lipsum.txt"], RegularExpression["\\W"] -> " "]]], #1[[2]] > #2[[2]] &] // TableForm
geek, what language is this? Its use of square brackets weirds me out ...
Ruby
text = File.open("/path/to/file", "r").read
words = text.split

bucket = Hash.new(0)   # default value 0, so no nil check is needed
words.each do |w|
  bucket[w] += 1
end

# sort_by returns a new array, so print it (descending by frequency)
bucket.sort_by { |_word, freq| -freq }.each do |word, freq|
  puts "#{word}\t#{freq}"
end
rahmu wrote: geek, what language is this? Its use of square brackets weirds me out ...
It smells like Mathematica. Correct me if I'm wrong.
<?php
$f = 'filename.txt';
$h = fopen($f, 'r');
$words = explode(" ", strtolower(fread($h, filesize($f))));
fclose($h);

$freq = array();
foreach ($words as $word) {
    // isset() avoids the undefined-index notice without the @ suppressor
    isset($freq[$word]) ? $freq[$word]++ : $freq[$word] = 1;
}

arsort($freq); // sort by value, descending
var_dump($freq);
?>
samer wrote: It smells like Mathematica. Correct me if I'm wrong.
It is, nice one geek :)
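For anyone who doesn't read Mathematica, here is a rough Python equivalent of what geek's one-liner does (a sketch; like the original, it assumes a lipsum.txt in the working directory):
import re
from collections import Counter

with open("lipsum.txt") as f:
    text = f.read().lower()
# replace every non-word character with a space, then split and tally
words = re.sub(r"\W", " ", text).split()
for word, count in Counter(words).most_common():
    print "{}\t{}".format(word, count)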
Should the sorting be by frequency or by word?
<?php
$text = file_get_contents("./file.txt");
$words = explode(" ", $text);
$index = array();
foreach ($words as $word) {
    $tWord = strtolower(trim($word)); // remove needless spaces and change to lower case
    if (!empty($tWord)) {
        isset($index[$tWord]) ? $index[$tWord]++ : $index[$tWord] = 1;
    }
}
ksort($index); // sorts by key, i.e. alphabetically by word
?>
<pre>
<?php print_r($index); ?>
</pre>
No C++?
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <cctype>
#include <map>
using namespace std;

int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> dictionary;
    map<string, int>::iterator p;
    while (!file.eof()) {
        file >> word;
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        p = dictionary.find(word);
        if (p != dictionary.end()) p->second++;
        else dictionary.insert(make_pair(word, 1));
    }
    file.close();
    for (map<string, int>::const_iterator iter = dictionary.begin(); iter != dictionary.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
    return 0;
}
@ZeRaW: Your solution doesn't sort the output, it merely stores all the words in a dictionary.
Also, it seems that it bugs in certain cases. Try putting the following in the file:
file.txt:
j jj j
output:
j 3
jj 1
rahmu wrote: @ZeRaW: Your solution doesn't sort the output, it merely stores all the words in a dictionary.
Also, it seems that it bugs in certain cases. Try putting the following in the file:
file.txt: j jj j
output:
j 3
jj 1
stl map sorts by key, so it sorts alphabetically by word. This was my first question: are you sorting by frequency or alphabetically? :)
In any case, my output for "j jj j" is (VS 2010):
j 2
jj 1
I am not sure what compiler you are using, but the default behavior of stl::map is to sort.

Smaller code :)
int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> words;
    while (file >> word) {  // reading in the loop condition stops cleanly at end of file
        words[word]++;      // operator[] default-constructs the count to 0 on first use
    }
    file.close();
    for (map<string, int>::const_iterator iter = words.begin(); iter != words.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
    return 0;
}
So all the solutions are the same: create a dictionary {word: number of occurrences}, then sort by value and print.
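In Python, that common pattern boils down to something like this (a sketch, assuming Python 2.7 for collections.Counter and the filename as the first argument):
import sys
from collections import Counter

with open(sys.argv[1]) as f:
    counts = Counter(f.read().lower().split())

# the language's built-in sort does all the real work
for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print "{}\t{}".format(word, n)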
(I'm not sure about geek's solution, I don't really speak... Mathematica?)
Basically, we're all relying on the "sort()" functions provided by our favorite languages. Where's the challenge?
In the spirit of "thinking outside the box", I tried to come up with something different. I store the words in a dictionary that has numbers as the keys, and the values are lists containing the corresponding words.
Here's the code; notice mainly the "addcount" function. The rest is syntactic sugar.
#!/usr/local/bin/python2.7

class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:  # end of file reached
                self.pprint()

    def addcount(self, word):
        # linear scan over every bucket to find the word's current count
        _tmpindex = 0
        for n in self._dict.keys():
            if word in self._dict[n]:
                _tmpindex = n
        if _tmpindex != 0:
            self._dict[_tmpindex].remove(word)
        # move the word up to the next bucket
        if not self._dict.has_key(_tmpindex + 1):
            self._dict[_tmpindex + 1] = []
        self._dict[_tmpindex + 1].append(word)

    def pprint(self):
        for num in sorted(self._dict.iterkeys(), reverse=True):
            if len(self._dict[num]) != 0:
                print "{}\t{}".format(num, self._dict[num])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
The code still needs a lot of polishing, but it basically implements my idea. I've never studied algorithms before, so if anyone has any criticism, I'm all ears (well, eyes). Maybe we can come up with a better solution? Are there any Knuths in the house? ;)
EDIT: Just tested it, no time difference for small files. For very large files, my second solution is extremely slow. Maybe there's something wrong with my implementation. Or my reasoning. Probably both.
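The slowness most likely comes from addcount scanning every bucket for every word. Here is a sketch of one way to keep the bucket idea but make the lookup O(1), using a companion word -> count dict (the names are mine and this is untested against the same files):
from collections import defaultdict

class WCountFast(object):
    def __init__(self):
        self.counts = {}                  # word -> current frequency
        self.buckets = defaultdict(set)   # frequency -> set of words at that frequency

    def addcount(self, word):
        n = self.counts.get(word, 0)
        if n:
            self.buckets[n].discard(word)  # O(1) removal from a set
        self.counts[word] = n + 1
        self.buckets[n + 1].add(word)

    def pprint(self):
        for num in sorted(self.buckets, reverse=True):
            if self.buckets[num]:
                print "{}\t{}".format(num, sorted(self.buckets[num]))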
Map is a self-balancing binary search tree, so all operations are O(log n).
@rahmu: I am not sure about your logic; you are trying to avoid sorting,
but in the implementation the append/remove operations could be costly.
As a C++ implementation, I would pre-allocate a large enough frequency array
and minimize append/remove operations. I'll try to implement something :)
Fast implementation for rahmu's algorithm:
Some notes:
- using stl::map as a hash table
- preallocating some values to avoid dynamic allocations
- I am pretty sure this performs slowly for large files (and is expensive)
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <cctype>
#include <map>
#include <vector>
#include <list>
using namespace std;

int main()
{
    string word;
    const size_t maxSize = 4000; // maximum expected frequency per word; a word seen this often would index out of bounds
    fstream file("file.txt", ios_base::in);
    map<string, size_t> words;                 // word -> current frequency
    vector<list<string> > arrayList(maxSize);  // frequency -> words at that frequency
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        arrayList[words[word]].remove(word);   // drop the word from its old bucket
        words[word]++;
        arrayList[words[word]].push_back(word);
    }
    file.close();
    // print, lowest frequency first
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
    return 0;
}
As expected, for a large file (lorem ipsum, 1.52 MB) my first implementation takes 0.2-0.3 seconds and this one takes 0.8-0.9 seconds.