This is an implementation similar to the second one I presented, except it uses a 2D array instead of a dictionary, using the indexes as count.
(In other words, all the words in the list wlist[3] have appeared 3 times in the text). The solution is simple as it only relies on python lists.
I compared my three implementations on a moderate large file (60k). Here are the average results:
- first (map then sort): 1.76s
- second (map with indexes as key): 2.05s
- third (2D matrix): 2.70s
Note that the results of the third implementation could be greatly improved by removing the recursions. I just find them more pleasant to read :)
(In other words, all the words in the list wlist[3] have appeared 3 times in the text). The solution is simple as it only relies on python lists.
import sys
class WCount():
def __init__(self, filename):
self.wlist = [[]]
self.filename = filename
def getcurrcount(self, word, count=0):
if count +1 > len(self.wlist):
return 0
if word in self.wlist[count-1]:
return (count - 1) % len(self.wlist)
else:
return self.getcurrcount(word, count+1)
def incposition(self, word):
n = self.getcurrcount(word)
if n != 0:
self.wlist[n].remove(word)
if n == len(self.wlist) - 1:
self.wlist.append([word])
else:
self.wlist[n+1].append(word)
def readfile(self):
with open(self.filename, "r") as f:
for line in f:
for w in line.split():
self.incposition(w)
def pprint(self, count=1):
if count == len(self.wlist):
return
for w in self.wlist[count]:
print "{}\t{}".format(count, w)
self.pprint (count+1)
w = WCount(sys.argv[1])
w.readfile()
w.pprint()
The code relies on a couple of recursive calls, which made it impossible for me to benchmark, especially on large files( RuntimeError: maximum recursion depth exceeded). I thought cpython would be able to optimize this.I compared my three implementations on a moderate large file (60k). Here are the average results:
- first (map then sort): 1.76s
- second (map with indexes as key): 2.05s
- third (2D matrix): 2.70s
Note that the results of the third implementation could be greatly improved by removing the recursions. I just find them more pleasant to read :)