Concurrent access
The first solution that comes to mind is concurrent access to the file: multiple workers counting lines in different parts of the file at the same time. Here's what I tried.
Let's use the shell!
The Unix shell comes with its own scripting language that is quirky, if not flat-out annoying at times, but there's one thing it does well: concurrency.
Asynchronous shell programming
The basis of concurrency in the shell is its job control features. I'll show a very simple example: you can use the & character to send a job to the background.
function hello_world_async () {
    sleep 1 && echo hello world &
}
time hello_world_async
Here's the output of the code above:
$ bash async.sh
real 0m0.001s
user 0m0.000s
sys 0m0.000s
$ hello world
We can see the problem here: even though the function sleeps for 1s, the timer indicates it executed in 0.001s. Moreover, if you pay close attention, you'll notice that the string "hello world" was printed on the screen after the shell printed its prompt (**$**), so after my program exited.
In order to avoid this, we'll use the shell builtin wait. As you may have guessed from its name, this statement blocks any further execution until all background jobs have finished. Using it, we can now measure the execution time of the function:
time {
    hello_world_async
    wait
}
$ bash async.sh
hello world
real 0m1.004s
user 0m0.001s
sys 0m0.002s
$
The measurement is now correct!
Control group
Let's go back to the problem of reading a large file. To run some preliminary tests, we can use this simple function:
function async_countline_chunk {
    local file=$1        # file to read
    local blocksize=$2   # dd block size
    local start=$3       # offset where this worker starts, in blocks
    local count=$4       # number of blocks this worker reads
    # read the chunk and count its lines in a background job
    dd if="$file" bs="$blocksize" skip="$start" count="$count" | wc -l &
}
This function reads a given number of blocks from the file, starting at a given offset, and counts the lines it finds there, all in a background job.
In order to run some tests I need an input file that's large enough. I created one by piping output from /dev/random. This can take some time, so think of running that write job concurrently too ;)
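For reference, here's a sketch of how such a file could be generated (the path and size are assumptions; adjust count to the size you want, and keep in mind that /dev/random can be slow to read on some systems):
# write roughly 4 GB of random bytes to /tmp/large, as a background job
# (random data contains a newline byte about every 256 bytes on average,
# which is what gives us "lines" to count)
dd if=/dev/random of=/tmp/large bs=4k count=1000000 &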
Finally, I'm running these tests on a MacBook Air laptop sporting an SSD and a 4-core 2GHz Intel i7 CPU.
As for the control group, I'll count the lines in the first 2GiB of the file. I assume we know the size of the file beforehand; I won't deal with unknown sizes or infinite streams here.
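As an aside, the block count can be derived from the file size; a quick sketch (stat -f%z is the BSD/macOS syntax, Linux uses stat -c%s):
file=/tmp/large
blocksize=4096
filesize=$(stat -f%z "$file")                         # file size in bytes
blocks=$(( (filesize + blocksize - 1) / blocksize ))  # blocks needed to cover it
echo "$blocks blocks of $blocksize bytes"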
time {
    async_countline_chunk /tmp/large 4k 0 500000
    echo "waiting..."
    wait
}
Note: while running my tests I noticed that the first run is always slower than the following ones. Beware of disk caching: ignoring the first result is usually enough to get consistent numbers. From now on I'll only show the results of meaningful runs.
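If you'd rather take the cache out of the equation entirely instead of discarding the first run, the disk cache can be flushed between runs; a sketch (on macOS the purge command does this and may require sudo, on Linux the drop_caches interface needs root):
# macOS: flush the disk cache before a run
sudo purge

# Linux equivalent (run as root)
# sync && echo 3 > /proc/sys/vm/drop_caches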
And here's the result when I run it:
$ bash large_countline
waiting...
500000+0 records in
500000+0 records out
2048000000 bytes transferred in 2.219873 secs (922575222 bytes/sec)
7998585
real 0m2.224s
user 0m1.925s
sys 0m1.957s
2 Parallel jobs
time {
    async_countline_chunk /tmp/large 4k 0 250000
    async_countline_chunk /tmp/large 4k 250000 250000
    echo "waiting..."
    wait
}
waiting...
250000+0 records in
250000+0 records out
1024000000 bytes transferred in 1.658643 secs (617372150 bytes/sec)
3999882
250000+0 records in
250000+0 records out
1024000000 bytes transferred in 1.663097 secs (615718693 bytes/sec)
3998703
real 0m1.667s
user 0m2.908s
sys 0m2.903s
The result is an improvement of about 25% over the control. Notice, however, that each job runs at a slower pace than before: it now takes 1.6s to read half of a file that a single job read in 2.2s.
4 Parallel jobs
time {
    async_countline_chunk /tmp/large 4k 0 125000
    async_countline_chunk /tmp/large 4k 125000 125000
    async_countline_chunk /tmp/large 4k 250000 125000
    async_countline_chunk /tmp/large 4k 375000 125000
    echo "waiting..."
    wait
}
waiting...
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.318498 secs (388320680 bytes/sec)
1999784
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.355170 secs (377812375 bytes/sec)
1999439
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.353463 secs (378288897 bytes/sec)
2000098
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.367312 secs (374457342 bytes/sec)
1999264
real 0m1.371s
user 0m2.961s
sys 0m2.338s
Again, a better overall result, but each individual worker is slower.
Distributing the load
The workers are most probably getting slower because several of them are accessing the same file at once. To give different processes access to a single file, the OS usually has to set up costly locking mechanisms. There are several ideas one could explore to improve on this:
- The locking mechanism is usually implemented at the level of the file system. You could experiment with different file systems.
- Depending on the file system, it could be interesting to split the file into multiple files, bypassing locking mechanisms. This may or may not be possible (see the sketch after this list).
- Copying the file to separate disks removes the need for locking altogether, but it adds a new layer of complexity to the job: the copy operation might be as costly as the counting operation, so this is more of a solution for heavier processing.
- For extremely large files, or more complex operations, it might be worth considering distributing the tasks on a Hadoop cluster.
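To illustrate the splitting idea above, here's a rough sketch (the chunk size and paths are assumptions; since we only count newlines, cutting a line at a chunk boundary doesn't change the total):
# split the large file into 512 MB pieces, one per worker
mkdir -p /tmp/parts
split -b 512m /tmp/large /tmp/parts/chunk_

# count lines in each piece in parallel, then wait for all of them
for chunk in /tmp/parts/chunk_*; do
    wc -l "$chunk" &
done
wait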
Use memory
Memory is notoriously a lot faster than disks (even SSD, yes). However, it's also considerably more expensive. I'm going to assume for now that we have infinite memory and we're able to hold the whole file in there.
The easiest way to start playing is to create a ramdisk: a chunk of memory acting like block storage and mounted on our file system. The creation of ramdisks varies according to your operating system; Mac users, this is how I created one on my machine.
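If you want to reproduce it, the usual macOS recipe looks something like this (the volume name and size below are my assumptions; ram:// takes a size in 512-byte sectors, so 4194304 sectors gives 2 GiB, and the volume ends up under /Volumes rather than the /tmp/rdisk path used below, so adjust to taste):
# create a 2 GiB ramdisk (4194304 sectors x 512 bytes), format it as HFS+
# and mount it as a volume named "rdisk"; size it to at least your file
diskutil erasevolume HFS+ 'rdisk' $(hdiutil attach -nomount ram://4194304)

# copy the test file onto the ramdisk so that reads come from memory
cp /tmp/large /Volumes/rdisk/large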
Now let's read 2GiB from the file in memory:
$ time dd if=/tmp/rdisk/large bs=4k count=500000| wc -l
500000+0 records in
500000+0 records out
2048000000 bytes transferred in 2.200442 secs (930722068 bytes/sec)
7998585
dd if=/tmp/rdisk/large bs=4k count=500000 0.12s user 1.55s system 75% cpu 2.204 total
wc -l 1.76s user 0.43s system 99% cpu 2.203 total
2.2 seconds. This is actually the same value we get when reading the file from disk, after the disk has done its caching.
Again, caching itself might be a costly operation (how many times are you going to count the lines of a file before it changes? more than once?), but the good news is that your file system knows how to take advantage of memory on its own, without asking the user for any extra help.
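You can see this at work by simply running the same read twice; a quick sketch (same paths and sizes as before):
# the first read comes from disk, the second should be served from the page cache
time dd if=/tmp/large bs=4k count=500000 of=/dev/null
time dd if=/tmp/large bs=4k count=500000 of=/dev/null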
CPU oversubscription
What happens if I spawn more workers than I have CPU cores on this machine?
time {
    async_countline_chunk /tmp/large 4k 0 62500
    async_countline_chunk /tmp/large 4k 62500 62500
    async_countline_chunk /tmp/large 4k 125000 62500
    async_countline_chunk /tmp/large 4k 187500 62500
    async_countline_chunk /tmp/large 4k 250000 62500
    async_countline_chunk /tmp/large 4k 312500 62500
    async_countline_chunk /tmp/large 4k 375000 62500
    async_countline_chunk /tmp/large 4k 437500 62500
    echo "waiting..."
    wait
}
waiting...
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.231143 secs (207936853 bytes/sec)
999367
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.241617 secs (206182750 bytes/sec)
1000072
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.255201 secs (203951383 bytes/sec)
1000646
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.260712 secs (203059835 bytes/sec)
999452
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.280079 secs (199987675 bytes/sec)
998961
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.295565 secs (197597204 bytes/sec)
999836
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.306251 secs (195980704 bytes/sec)
999428
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.363523 secs (187748941 bytes/sec)
1000823
real 0m1.373s
user 0m2.961s
sys 0m2.297s
1.3s. Same result as if I had 4 workers.
Recap
- The shell is great for running simple tasks in parallel
- You can considerably improve your result by distributing the load
- Modern file systems will deal with in-memory caching implicitly
- It's no use running more workers than you have CPU cores
Going further
- What if you don't know the size of the file, or you're dealing with an infinite stream (like a socket)?
- What if the file is larger than all the memory you have?
- Run some tests by separating workers more and more: working on different files, on different partitions, on different disks, on different machines?
- It could be interesting to study how different file systems behave. I'm particularly curious to know how distributed file systems like GlusterFS deal with concurrent access.
- We've only done concurrent read. Concurrent write might be even more complicated.
- UTF-8 is a bit more complicated. Since it's a variable-size encoding, splitting is not a trivial task. I suggest trying to define the size of each chunk as a number of characters, not a number of blocks.
- I did not deal with proper result aggregation. In the example of counting lines, the shell could do something simple enough (see the sketch below), but should you want a more complex operation, possibly with the workers needing to communicate with each other, you're better off using a more modern framework like Twisted or node.js.
- I only have an SSD available at the moment. I don't know if a physical disk would behave any differently.
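For the aggregation point above, here's a minimal sketch of how the shell could sum the partial counts (the helper name, paths and chunk sizes are made up for illustration):
# variant of the worker that writes its count to a file instead of stdout
function countline_chunk_to_file {
    dd if="$1" bs="$2" skip="$3" count="$4" 2>/dev/null | wc -l > "$5" &
}

tmpdir=$(mktemp -d)
countline_chunk_to_file /tmp/large 4k 0      250000 "$tmpdir/part0"
countline_chunk_to_file /tmp/large 4k 250000 250000 "$tmpdir/part1"
wait

# sum the partial counts once every worker is done
total=0
for f in "$tmpdir"/part*; do
    total=$(( total + $(cat "$f") ))
done
echo "total lines: $total"
rm -r "$tmpdir"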