Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
can you provide the text? or should we just get a random one?
it doesn't really matter, you can choose randomly. I should be able to test it on a text of my liking.
ok, on it
Easy solution in python
#!/usr/local/bin/python2.7
class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:
                self.pprint()

    def addcount(self, word):
        if self._dict.has_key(word):
            self._dict[word] += 1
        else:
            self._dict[word] = 1

    def pprint(self):
        for word in sorted(self._dict, key=self._dict.get, reverse=True):
            print "{}\t{}".format(word, self._dict[word])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
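For comparison, the same map-then-sort idea fits in a few lines of modern Python 3 with collections.Counter, which also covers the "n most frequently used words" part of the original problem. This is just a sketch, not the class above; it accepts any iterable of lines, such as an open file:

```python
from collections import Counter

def top_words(lines, n):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    counts = Counter(word for line in lines for word in line.split())
    return counts.most_common(n)  # sorted by count, highest first

# usage: with open(sys.argv[1]) as f: print(top_words(f, 10))
```

Counter.most_common(n) performs the same sort-by-value step that the class above implements with sorted(..., key=..., reverse=True).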
well here is mine!!!
words = input("Please insert your text: ").replace(",", " ").replace(".", " ").replace(":", " ").replace(";", " ").replace("!", " ").replace("?", " ").lower().split()
print(words)
listOfWords = []
freqOfWords = []
for i in words:
    if i not in listOfWords:
        listOfWords.append(i)
        freqOfWords.append(words.count(i))
freqAndWords = []
for i in range(len(freqOfWords)):
    freqAndWords.append(str(freqOfWords[i]) + " = " + listOfWords[i])
freqAndWords.sort()
for i in range(len(freqAndWords)):
    print(freqAndWords[-(i+1)])
Last edited by AxisFlame (December 10 2011)
#!/usr/bin/perl
use strict;
use warnings;

my ($n, $f) = @ARGV;
my @words;
my %unique;

open IN, "<", $f or die;
push(@words, split(' ', lc $_)) while (<IN>);

foreach my $word (@words) {
    $word =~ s/[[:punct:]]//g;
    if (grep {$_ eq $word} keys(%unique)) { $unique{$word}++; }
    else { $unique{$word} = 1; }
}

my @freqs = sort {$b <=> $a} values(%unique);
foreach my $i (0 .. $n - 1) {
    while (my ($k, $v) = each(%unique)) {
        if ($v == $freqs[$i]) { print "$k: $v\n"; last; }
    }
}
I'm sure it can be condensed into a one-liner but I'm not smart enough to do it.
By the way, I recognized this problem from this article on Donald Knuth and Doug McIlroy. McIlroy's solution is absolutely brilliant, and shows the real power of the Unix style.
Last edited by saeidw (December 10 2011)
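For those who haven't read the article: McIlroy's answer was a short shell pipeline, roughly tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q. Here is a rough Python translation of the same steps, just to show the structure; the regex split standing in for the tr calls is my approximation:

```python
# A rough Python rendering of McIlroy's classic pipeline:
#   tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
import re
from collections import Counter

def mcilroy(text, n):
    # tr -cs A-Za-z '\n' + tr A-Z a-z: lowercase, keep only runs of letters
    words = re.findall(r"[a-z]+", text.lower())
    # sort | uniq -c | sort -rn | sed ${n}q: count and take the top n
    return Counter(words).most_common(n)
```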
Sort[Tally[StringSplit[StringReplace[ToLowerCase@Import["lipsum.txt"], RegularExpression["\\W"] -> " "]]], #1[[2]] > #2[[2]] &] // TableForm
geek, what language is this? Its use of square brackets weirds me out ...
Ruby
text = File.open("/path/to/file", "r").read
words = text.split
bucket = {}
words.each do |w|
  bucket[:"#{w}"].nil? ? bucket[:"#{w}"] = 1 : bucket[:"#{w}"] += 1
end
bucket.sort_by { |word, fq| fq }
geek, what language is this? Its use of square brackets weirds me out ...
It smells like Mathematica. Correct me if I'm wrong.
<?php
$f = 'filename.txt';
$h = fopen($f, 'r');
$words = explode(" ", strtolower(fread($h, filesize($f))));
fclose($h);
foreach ($words as $word) {
    @$freq[$word] ? $freq[$word]++ : $freq[$word] = 1;
}
arsort($freq);
var_dump($freq);
?>
It smells like Mathematica. Correct me if I'm wrong.
It is, nice one geek :)
Should the sorting be by frequency or by word?
<?php
$text = file_get_contents("./file.txt");
$words = explode(" ", $text);
$index = array();
foreach ($words as $word) {
    $tWord = strtolower(trim($word)); // remove needless spaces and change to lower case
    if (!empty($tWord)) {
        isset($index[$tWord]) ? $index[$tWord]++ : $index[$tWord] = 1;
    }
}
ksort($index);
?>
<pre>
<?php print_r($index); ?>
</pre>
Last edited by rolf (December 11 2011)
no c++?
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
using namespace std;

int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> dictionary;
    map<string, int>::iterator p;
    while (!file.eof()) {
        file >> word;
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        p = dictionary.find(word);
        if (p != dictionary.end()) p->second++;
        else dictionary.insert(make_pair(word, 1));
    }
    file.close();
    for (map<string, int>::const_iterator iter = dictionary.begin(); iter != dictionary.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
}
Last edited by ZeRaW (December 12 2011)
@ZeRaW: Your solution doesn't sort the output by frequency, it merely stores all the words in a dictionary.
Also, it seems to misbehave in certain cases. Try putting the following in the file:
file.txt
j jj j
output:
j 3
jj 1
std::map sorts by key, so it sorts alphabetically by word. This was my first question: are you sorting by frequency or alphabetically? :)
In any case, my output for j jj j is: [VS 2010]
j 2
jj 1
I am not sure what compiler you are using, but the default behavior of std::map is to sort by key.
Smaller code :)
int main()
{
    fstream file("file.txt", ios_base::in);
    string word;
    map<string, int> words;
    while (file >> word) {
        words[word]++;
    }
    file.close();
    for (map<string, int>::const_iterator iter = words.begin(); iter != words.end(); ++iter)
        cout << iter->first << '\t' << iter->second << '\n';
}
Last edited by ZeRaW (December 12 2011)
So all the solutions are the same: create a dictionary {word: number of occurrences}, then sort by value and print.
(I'm not sure about geek's solution, I don't really speak... Mathematica?)
Basically, we're all relying on the sort() functions provided by our favorite languages. Where's the challenge?
In the spirit of "thinking outside the box", I tried to come up with something different. I store the words in a dictionary that has numbers as the keys, and the values are lists containing the corresponding words.
Here's the code; notice mainly the addcount function. The rest is syntactic sugar.
#!/usr/local/bin/python2.7
class WCount:
    def __init__(self, filename=None):
        self.filename = filename
        self._dict = {}

    def wordcount(self):
        with open(self.filename, "r") as f:
            try:
                while True:
                    _l = f.next().strip("\n")
                    for word in _l.split():
                        self.addcount(word)
            except StopIteration:
                self.pprint()

    def addcount(self, word):
        _tmpindex = 0
        for n in self._dict.keys():
            if word in self._dict[n]:
                _tmpindex = n
        if _tmpindex != 0:
            self._dict[_tmpindex].remove(word)
        else:
            self._dict[_tmpindex] = []
        if not self._dict.has_key(_tmpindex + 1):
            self._dict[_tmpindex + 1] = []
        self._dict[_tmpindex + 1].append(word)

    def pprint(self):
        for num in sorted(self._dict.iterkeys(), reverse=True):
            if len(self._dict[num]) != 0:
                print "{}\t{}".format(num, self._dict[num])

def main():
    import sys
    w = WCount(sys.argv[1])
    w.wordcount()

if __name__ == "__main__":
    main()
The code still needs a lot of polishing, but it basically implements my idea. I've never studied algorithms before, so if anyone has any criticism, I'm all ears (well, eyes).
Maybe we can come up with an even better solution? Are there any Knuths in the house? ;)
EDIT Just tested it, no time difference for small files. For very large files, my second solution is extremely slow. Maybe there's something wrong with my implementation. Or my reasoning. Probably both.
std::map is a self-balancing binary search tree, so all operations are O(log n).
@rahmu: I see what you're going for, you are trying to avoid sorting.
But in the implementation, the append/remove operations could be costly.
As a C++ implementation I would pre-allocate a large enough frequency array
and minimize the append/remove operations. I'll try to implement something :)
Last edited by ZeRaW (December 12 2011)
Fast implementation of rahmu's algorithm.
Some notes:
- using std::map as a hash table
- preallocating some values to avoid dynamic allocations
- I am pretty sure this performs slowly for large files (and is expensive)
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>
#include <vector>
#include <list>
using namespace std;

int main()
{
    string word;
    const size_t maxSize = 4000; // max frequency per word
    fstream file("file.txt", ios_base::in);
    map<string, size_t> words;
    vector<list<string> > arrayList(maxSize);
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        arrayList[words[word]].remove(word);
        words[word]++;
        arrayList[words[word]].push_back(word);
    }
    file.close();
    // print
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
}
As expected for a large file (lorem ipsum 1.52MB) my first implementation takes 0.2-0.3 seconds and this one 0.8-0.9 seconds.
Last edited by ZeRaW (December 12 2011)
This is an implementation similar to the second one I presented, except it uses a 2D array instead of a dictionary, using the indices as counts.
(In other words, all the words in the list wlist[3] have appeared 3 times in the text.) The solution is simple, as it only relies on Python lists.
import sys

class WCount():
    def __init__(self, filename):
        self.wlist = [[]]
        self.filename = filename

    def getcurrcount(self, word, count=0):
        if count + 1 > len(self.wlist):
            return 0
        if word in self.wlist[count-1]:
            return (count - 1) % len(self.wlist)
        else:
            return self.getcurrcount(word, count+1)

    def incposition(self, word):
        n = self.getcurrcount(word)
        if n != 0:
            self.wlist[n].remove(word)
        if n == len(self.wlist) - 1:
            self.wlist.append([word])
        else:
            self.wlist[n+1].append(word)

    def readfile(self):
        with open(self.filename, "r") as f:
            for line in f:
                for w in line.split():
                    self.incposition(w)

    def pprint(self, count=1):
        if count == len(self.wlist):
            return
        for w in self.wlist[count]:
            print "{}\t{}".format(count, w)
        self.pprint(count + 1)

w = WCount(sys.argv[1])
w.readfile()
w.pprint()
The code relies on a couple of recursive calls, which made it impossible for me to benchmark on large files (RuntimeError: maximum recursion depth exceeded). I thought CPython would be able to optimize this.
I compared my three implementations on a moderately large file (60k). Here are the average results:
- first (map then sort): 1.76s
- second (map with indexes as key): 2.05s
- third (2D matrix): 2.70s
Note that the results of the third implementation could be greatly improved by removing the recursions. I just find them more pleasant to read :)
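For what it's worth, the two recursive helpers translate directly into plain loops, which also sidesteps the recursion limit. Here is a sketch of the iterative equivalents, written as standalone functions over the same wlist structure, implementing the intended lookup rather than the modular-index trick:

```python
# Iterative versions of the two recursive helpers in the 2D-list solution.
# wlist[i] holds the words seen exactly i times, as in the class above.

def getcurrcount(wlist, word):
    # plain linear scan instead of recursion; returns 0 if the word is new
    for count, bucket in enumerate(wlist):
        if word in bucket:
            return count
    return 0

def pprint(wlist):
    # plain loop instead of tail recursion (CPython does not eliminate tail calls)
    for count in range(1, len(wlist)):
        for w in wlist[count]:
            print("{}\t{}".format(count, w))
```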
Same as yours in C++; test file 485K, 0.35 seconds, no recursion.
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <vector>
#include <list>
#include <time.h>
using namespace std;

size_t getCount(vector<list<string> > &arrayList, string word)
{
    if (!arrayList.size()) return 0;
    for (size_t i = 0; i < arrayList.size(); ++i) {
        if (find(arrayList[i].begin(), arrayList[i].end(), word) != arrayList[i].end()) return i;
    }
    return 0;
}

int main()
{
    string word;
    size_t maxSize = 40000;
    fstream file("file.txt", ios_base::in);
    vector<list<string> > arrayList(maxSize);
    while (file >> word) {
        transform(word.begin(), word.end(), word.begin(), ::tolower);
        size_t count = getCount(arrayList, word);
        if (!count) arrayList[1].push_back(word);
        else {
            arrayList[count++].remove(word);
            arrayList[count].push_back(word);
        }
    }
    file.close();
    // print
    for (size_t i = 1; i < arrayList.size(); ++i) {
        if (arrayList[i].size() != 0) {
            cout << "Frequency: " << i << '\t' << " Values: ";
            for (list<string>::const_iterator iter = arrayList[i].begin(); iter != arrayList[i].end(); ++iter)
                cout << *iter << ", ";
            cout << endl;
        }
    }
}
test file 485K - 0.35 seconds, no recursion
Can you please add the benchmark of your other implementation (map + occurrences as keys) with the same file on the same computer?
another dick-size competition? here it goes: 27ms for 562KiB using my simple snippet.
what's with the meaningless numbers fellows? where's the insight? shotgun optimization will get you nowhere.
Agree with you, but for the sake of fun:
File 485KB
Method 1: using std::map, sorted alphabetically, 0.03s
Method 2: using indices as keys, hash table as elements, 0.08s
Method 3: 2D lists, 0.35s
The 2D lists lose a lot of time finding the current count of each word.
Last edited by ZeRaW (December 13 2011)
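The slow part of the 2D-list method is exactly that lookup: finding a word's current bucket means scanning every bucket. Keeping a word-to-count dict next to the frequency buckets, which is essentially what Method 2 above does, makes the lookup O(1). A Python sketch of the combined structure:

```python
# Counting with buckets indexed by frequency, plus a dict for O(1) lookups.
# counts[word] records which bucket the word is currently in, so we never
# have to scan the buckets the way the plain 2D-list version does.

def bucket_count(words):
    counts = {}        # word -> current frequency
    buckets = [[]]     # buckets[i] holds the words seen exactly i times
    for word in words:
        n = counts.get(word, 0)
        if n:
            buckets[n].remove(word)   # move the word up one bucket
        counts[word] = n + 1
        if n + 1 == len(buckets):
            buckets.append([])        # grow on demand, no preallocation needed
        buckets[n + 1].append(word)
    return buckets
```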