Exercise - WordCount

Joe · Sep 20, 2010

This will be the first in a series of programming exercises I plan to propose on the forum. The idea is simple: I propose a subject, and anyone who wants to participate can post their code. It is not a competition; the point is to get code review from other developers. So here goes:

WordCount

You are asked to create a program that counts the number of words in a text file.

Definition of a word: the longest continuous sequence of alphanumeric characters. Words can be separated by a space character (" "), \t, \r or \n.

For the sake of simplicity, we won't go into the validation of the input, considering that it will always be a valid text file (as opposed to a bitmap for example).
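
To pin down the expected results before anyone writes C, here is a rough reference sketch in Python 3. It assumes the definition is read the way the separators suggest: a word is any maximal run of characters other than space, \t, \r and \n. The name count_words is made up for this illustration.

```python
import re

def count_words(text):
    """Count maximal runs of characters that are not one of the four separators."""
    # Assumption: only space, tab, carriage return and newline separate words.
    return len(re.findall(r"[^ \t\r\n]+", text))

print(count_words("Hello   Lebgeeks!\r\n"))  # two words, however many separators
print(count_words(""))                        # an empty file has no words
```

A C solution can be checked against this on a few small files.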

Oh, and one last thing: the program should be written in C.

I'm gonna start working on it, and post my code as soon as I'm done. Do not hesitate to post your code, questions or whatever.

Next exercise will be Java :)

Have fun.

N.B: It is not a competition, don't try to post your code super fast. Take your time to develop a pretty (I want to say artistic) solution.

rolf · Sep 21, 2010

<?php

$file = "somedir/somefile.txt";
$count = str_word_count(file_get_contents($file));
echo "$file contains $count words";

?>

Then autogenerate C code from that using HipHop.

Sorry, couldn't resist :-)
When I have some free time, hopefully later this week, I'll get into the real problem, because I want to get into C.

mrmat · Sep 21, 2010

Here's mine. Not much of a C/C++ guy, but the concepts are all the same.

By the way rahmu, why not disallow the use of libraries such as RegEx, etc.?

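#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Returns true if Chr separates words. EOF counts as a delimiter so that
   a word ending exactly at the end of the file is still counted. */
bool isDelimiter(int Chr)
{
    /* int, not char: EOF is an int and may not fit in a char */
    const int word_delimiters[] = {'\n', '\r', '\t', ' ', EOF};
    size_t i;

    for (i = 0; i < sizeof(word_delimiters) / sizeof(word_delimiters[0]); i++)
    {
        if (Chr == word_delimiters[i])
        {
            return true;
        }
    }
    return false;
}

/* Returns the number of words in the file, or -1 if it cannot be opened. */
int getWordsCount(void)
{
    FILE *fp;
    int curChar, prevChar;
    int NOW = 0; /* running word count */

    fp = fopen("C:\\words.txt", "r");
    if (fp == NULL)
    {
        return -1;
    }
    prevChar = '\n';

    while (1)
    {
        curChar = fgetc(fp);

        /* A word ends where a delimiter follows a non-delimiter */
        if (isDelimiter(curChar) && !isDelimiter(prevChar))
        {
            NOW++;
        }

        if (curChar == EOF)
        {
            break;
        }

        prevChar = curChar;
    }

    fclose(fp);
    return NOW;
}

int main(void)
{
    int count = getWordsCount();

    if (count < 0)
    {
        fprintf(stderr, "Could not open the file.\n");
        return EXIT_FAILURE;
    }
    printf("Number of Words = %i\n", count);
    return 0;
}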

The above code assumes that non-alphanumeric characters such as brackets are part of a word. That can be changed by adding them to the word_delimiters array, or the whole code can be changed to check whether each character is alphanumeric rather than whether it is a delimiter.
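
The second option (testing for alphanumeric characters instead of delimiters) could be sketched like this in Python 3. It is only an illustration of the idea, not a translation of the C code; count_alnum_words is a hypothetical name, and str.isalnum also accepts non-ASCII letters and digits, which C's isalnum may not.

```python
from itertools import groupby

def count_alnum_words(text):
    """Count maximal runs of alphanumeric characters."""
    # groupby splits the text into consecutive runs of characters that
    # share the same isalnum() result; every True run is one word.
    return sum(1 for is_word, _run in groupby(text, key=str.isalnum) if is_word)

print(count_alnum_words("foo(bar) baz"))  # brackets now separate words
```

With this rule punctuation splits words too, so "don't" counts as two.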

Joe · Sep 21, 2010

Nice one mrmat!

I have just one remark: I would suggest you stay away from stdbool.h and the bool type when coding in C. It was introduced as part of C99 and is still not supported by all compilers. As a general rule, I tend to ignore this header and consider that bool was introduced with C++ (which is true to a certain extent).

Also, I consider the line

prevChar = '\n'

to be misleading. I understand that it's meant for the first iteration of the loop, but some clearer code should be considered. Typically, this is one of those times where I would use a do ... while loop instead of a simple while.

All in all good code. I'm glad I got quality code as a response to my first exercise.

I'll post mine tomorrow, not using bool, you'll tell me then what you think ;-)

mrmat · Sep 22, 2010

Thanks for the bool warning.

I'm aware that the line

prevChar = '\n'

is sort of misleading. But its main purpose is to treat the beginning of the file as a delimiter, so that it isn't counted as a word; look closely at the code and you'll understand. I tried using do ... while, but as far as I remember it failed somewhere. I'm sure a better approach can be used.

Anyway waiting impatiently to see what you'd come up with :D

Joe · Sep 22, 2010

Here's my code:

#include <stdio.h>
#include <stdlib.h>

enum {out, in}; /* In this case enum is more appropriate than bool */

void wordCount (FILE* myFile)
{
    int myChar, wordNumber, state;

    /* Initializing variables */
    myChar = 0;
    wordNumber = 0;
    state = out; /* Starting out of a word */

    while ((myChar = fgetc(myFile)) != EOF)
    {
        printf("%c", myChar);
        if (myChar == ' ' || myChar == '\t' || myChar == '\n' || myChar == '\r')
        {
            state = out;
        }
        else if (state == out) /* We enter a new word */
        {
            state = in;
            wordNumber++;
        }
    }
    
    printf("\n%d word(s)\n", wordNumber);
}


int main (int argc, char *argv[])
{
    FILE* myFile = NULL;

    if (argc < 2) /* Is the file path supplied ? Check before touching argv[1] */
    {
        fprintf(stderr, "No file supplied as an argument.\n");
        return EXIT_FAILURE;
    }

    myFile = fopen(argv[1], "r");

    if (myFile == NULL) /* Testing if fopen worked */
    {
        fprintf(stderr, "There was a problem accessing the file. Please make sure the file exists and is available.\n");
        return EXIT_FAILURE;
    }

    wordCount (myFile);
    fclose(myFile);

    return 0;
}

It doesn't use bool; instead it uses enum, which has many advantages (the most important being improved source code readability). I know that enum was added to the original C later on, but as far as I can tell it is part of the C89 standard.

Instead of enums, defines could be used.

Also, in this code the file is supplied as a command-line argument, so you don't have to change the source code (and recompile) every time you want to count a different file.

I hope the code is clear enough. Do not hesitate to give any remarks and criticism, I'm listening :)

Padre · Sep 22, 2010

wouldn't have written it this way, but i guess it works :)

Joe · Sep 22, 2010

Show us some code :)

Joe · May 3, 2012

My first one liner!

# filename is a string with your file's name (Python 2).
with open(filename) as f: print reduce(lambda x, y: x + y, [len(line.split()) for line in f], 0)

The lambda could (should) be replaced by operator.add, but the import statement would have stopped this from being a one-liner :P

Also it's stupid that you cannot just use '+'.
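
For reference, the operator.add variant might look like this in Python 3, where reduce lives in functools. The function names are made up for the example, and both take any iterable of lines (an open file works) so they can be tried without a file on disk. The built-in sum does the same job without reduce.

```python
from functools import reduce
import operator

def count_with_reduce(lines):
    """Add up the per-line word counts with reduce and operator.add."""
    # The initial value 0 keeps reduce from failing on an empty file.
    return reduce(operator.add, (len(line.split()) for line in lines), 0)

def count_with_sum(lines):
    """Same result using the built-in sum."""
    return sum(len(line.split()) for line in lines)

lines = ["Hello   Lebgeeks!\n", "\n", "one\ttwo three\n"]  # stand-in for an open file
print(count_with_reduce(lines), count_with_sum(lines))
```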

xterm · May 3, 2012

>.<

len(open(filename).read().split())

Off-topic: Your code reminds me of when I got so excited with reduce, that I used it to join a list of strings instead of simply using join.
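
To illustrate the off-topic point, a small Python 3 sketch with a made-up list of strings: both expressions build the same string, but join says what it means and avoids creating a new intermediate string at each step.

```python
from functools import reduce

words = ["Hello", "Lebgeeks", "forum"]  # hypothetical sample data

# reduce concatenates pairwise, producing a new string at every step
joined_with_reduce = reduce(lambda a, b: a + " " + b, words)

# join builds the result in one go
joined_with_join = " ".join(words)

print(joined_with_reduce == joined_with_join)
```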

Joe · May 3, 2012

ARRRGH you win!

:)

Off-topic: Yeah, I guess reduce tends to do that to you. You want to use it everywhere, just because you can.

NuclearVision · May 4, 2012

I'm not into Objective-C, but if I were, I would write code that only counts spaces and separator characters.
I can write it in Python.
Take the expression "How are you mate?". Let s be the number of spaces and n the number of words; it is easy to see that n = s + 1. So my code in Python 2 would be:

expression = raw_input("Enter the expression: ")
def words(text):
    # one more word than there are spaces
    s = text.count(" ")
    return s + 1
print words(expression)

or

# read the whole file; "%file%path%" is a placeholder for the real path
expression = open("%file%path%").read()
def words(text):
    # one more word than there are spaces
    s = text.count(" ")
    return s + 1
print words(expression)

Joe · May 4, 2012

hmmm are you sure?

How many words do you count in the following sentence :)

Hello          Lebgeeks!
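
To make the question concrete, here is a small Python 3 sketch of the "spaces plus one" rule. The function name and the ten-space gap are assumptions chosen for the illustration; the exact number of spaces does not matter, since any run longer than one breaks the rule.

```python
def words_by_spaces(text):
    """The rule under discussion: one more word than there are spaces."""
    return text.count(" ") + 1

sentence = "Hello" + " " * 10 + "Lebgeeks!"  # two words, ten spaces between them

print(words_by_spaces("How are you mate?"))  # single spaces: the rule holds
print(words_by_spaces(sentence))             # every extra space adds a phantom word
print(words_by_spaces(""))                   # even an empty string is reported as one word
```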

Johnaudi · Jun 19, 2014

while(str.Contains("  ")) str = str.Replace("  ", " ");
int n_words = str.Trim().Split(' ').Length;

I could have used lambda, but I'm really bad at understanding them. I will once I finish reading a few.

NuclearVision · Jun 19, 2014

j=0;s='your text  here   '+' '  # trailing space so the last word is followed by one
for i in xrange(len(s)-1):
    # a word ends where a non-space is followed by a space
    if s[i]!=' ' and s[i+1]==' ':
        j+=1
print j

xterm · Jun 19, 2014

Johnaudi, NuclearVision,

The requirements state that you need to handle \n \r \t as well.

Johnaudi · Jun 19, 2014

xterm wrote:
Johnaudi, NuclearVision,

The requirements state that you need to handle \n \r \t as well.

Alright:

// Turn the other separators into spaces, then collapse runs of spaces.
str = str.Replace('\n', ' ').Replace('\r', ' ').Replace('\t', ' ');
while (str.Contains("  ")) str = str.Replace("  ", " ");
int n_words = str.Trim().Split(' ').Length;

NuclearVision · Jun 19, 2014

j=0;s='your text  here   '+' '  # trailing space so the last word is followed by one
for c in ["\t","\r","\n"]:
    s=s.replace(c," ")
for i in xrange(len(s)-1):
    # a word ends where a non-space is followed by a space
    if s[i]!=' ' and s[i+1]==' ':
        j+=1
print j

Thanks xterm I missed it :)

xterm · Jun 20, 2014

If you want to manually iterate over the given string, there's no reason to do two passes (replace, then split) over the content. It would be better to modify your logic to collect the word count in the first loop. Otherwise:

The Python version, as seen above, would be:

len(text.split())

Its exact C# equivalent would be:

text.Split(
    new char[]{' ', '\t', '\r', '\n'},
    StringSplitOptions.RemoveEmptyEntries
).Length;
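
For the manual route mentioned at the top of this post, a single-pass version could look roughly like the following Python 3 sketch. The function name and the SEPARATORS constant are made up here; the idea is simply to count the transitions from "outside a word" to "inside a word" while walking the text once.

```python
SEPARATORS = " \t\r\n"  # the four separators from the exercise statement

def count_words_single_pass(text):
    """Count words in one pass, without building any intermediate strings."""
    count = 0
    in_word = False
    for ch in text:
        if ch in SEPARATORS:
            in_word = False
        elif not in_word:
            # first character of a new word
            in_word = True
            count += 1
    return count

print(count_words_single_pass("your text  here   "))
```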

NuclearVision · Jun 20, 2014

Thanks xterm I didn't know that split ignores space characters!
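
A quick Python 3 sketch of the difference, using a made-up sample string: with no argument, split() treats any run of whitespace as a single separator and drops empty strings at the ends, while split(" ") splits on every single space and leaves tabs and newlines alone.

```python
text = "  your\ttext  here \n"  # hypothetical sample input

no_argument = text.split()       # runs of any whitespace collapse
single_space = text.split(" ")   # every space is a split point

print(no_argument)
print(single_space)
```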