Concurrent access
The first solution that comes to mind is concurrent access to the file: multiple workers counting lines in different parts of the file at the same time. Here's what I tried.
Let's use the shell!
The Unix shell comes with its own scripting language that is quirky, if not flat-out annoying at times, but there's one thing it does well: concurrency.
Asynchronous shell programming
The basis of concurrency in the shell is its job control features. I'll show a very simple example: you can use the & character to send a job to the background.
function hello_world_async () {
    sleep 1 && echo hello world &
}
time hello_world_async
Here's the output of the code above:
$ bash async.sh
real 0m0.001s
user 0m0.000s
sys 0m0.000s
$ hello world
We can see the problem here: even though the function sleeps for 1s, the timer indicates it executed in 0.001s. Moreover, if you pay close attention, you'll notice that the string "hello world" was printed on the screen after the shell printed its prompt (**$**), so after my program exited.
In order to avoid this, we'll use the shell builtin wait. As you may have guessed from its name, this statement blocks any further execution until all background jobs have finished. Using it, we can now measure the execution time of the function:
time {
    hello_world_async
    wait
}
$ bash async.sh
hello world
real 0m1.004s
user 0m0.001s
sys 0m0.002s
$
The measurement is now correct!
Control group
Let's go back to the problem of reading a large file. To run some preliminary tests, we can use this simple function:
function async_countline_chunk {
    local file=$1        # file to read
    local blocksize=$2   # dd block size
    local start=$3       # offset where this worker starts, in blocks
    local count=$4       # number of blocks this worker reads
    # read the chunk and count its lines in a background job
    dd if="$file" bs="$blocksize" skip="$start" count="$count" | wc -l &
}
This function reads a given number of blocks from the file, starting at a given offset, and counts the lines it finds there, all in a background job.
In order to run some tests I need an input file that's large enough. I created one by piping output from /dev/random. This can take some time, so think of running that write job concurrently too ;)
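For reference, here's a sketch of how such a file could be generated (the path and size are assumptions; adjust count to the size you want, and keep in mind that /dev/random can be slow to read on some systems):
# write roughly 4 GB of random bytes to /tmp/large, as a background job
# (random data contains a newline byte about every 256 bytes on average,
# which is what gives us "lines" to count)
dd if=/dev/random of=/tmp/large bs=4k count=1000000 &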
Finally, I'm running these tests on a MacBook Air laptop sporting an SSD and a 4-core 2GHz Intel i7 CPU.
As for the control group, I'll count the lines in the first 2GiB of the file. I assume we know the size of the file beforehand; I won't deal with unknown sizes or infinite streams here.
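As an aside, the block count can be derived from the file size; a quick sketch (stat -f%z is the BSD/macOS syntax, Linux uses stat -c%s):
file=/tmp/large
blocksize=4096
filesize=$(stat -f%z "$file")                         # file size in bytes
blocks=$(( (filesize + blocksize - 1) / blocksize ))  # blocks needed to cover it
echo "$blocks blocks of $blocksize bytes"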
time {
    async_countline_chunk /tmp/large 4k 0 500000
    echo "waiting..."
    wait
}
Note: while running my tests I noticed that the first run is always slower than the following ones. Beware of disk caching: ignoring the first result is usually enough to get consistent numbers. From now on I'll only show the results of meaningful runs.
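If you'd rather take the cache out of the equation entirely instead of discarding the first run, the disk cache can be flushed between runs; a sketch (on macOS the purge command does this and may require sudo, on Linux the drop_caches interface needs root):
# macOS: flush the disk cache before a run
sudo purge

# Linux equivalent (run as root)
# sync && echo 3 > /proc/sys/vm/drop_caches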
And here's the result when I run it:
$ bash large_countline
waiting...
500000+0 records in
500000+0 records out
2048000000 bytes transferred in 2.219873 secs (922575222 bytes/sec)
7998585
real 0m2.224s
user 0m1.925s
sys 0m1.957s
2 Parallel jobs
time {
    async_countline_chunk /tmp/large 4k 0 250000
    async_countline_chunk /tmp/large 4k 250000 250000
    echo "waiting..."
    wait
}
waiting...
250000+0 records in
250000+0 records out
1024000000 bytes transferred in 1.658643 secs (617372150 bytes/sec)
3999882
250000+0 records in
250000+0 records out
1024000000 bytes transferred in 1.663097 secs (615718693 bytes/sec)
3998703
real 0m1.667s
user 0m2.908s
sys 0m2.903s
The result is an improvement of about 25% over the control. Notice, however, that each job runs at a slower pace than before: it now takes 1.6s to read half of a file that a single job read in 2.2s.
4 Parallel jobs
time {
    async_countline_chunk /tmp/large 4k 0 125000
    async_countline_chunk /tmp/large 4k 125000 125000
    async_countline_chunk /tmp/large 4k 250000 125000
    async_countline_chunk /tmp/large 4k 375000 125000
    echo "waiting..."
    wait
}
waiting...
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.318498 secs (388320680 bytes/sec)
1999784
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.355170 secs (377812375 bytes/sec)
1999439
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.353463 secs (378288897 bytes/sec)
2000098
125000+0 records in
125000+0 records out
512000000 bytes transferred in 1.367312 secs (374457342 bytes/sec)
1999264
real 0m1.371s
user 0m2.961s
sys 0m2.338s
Again, a better overall result, but each individual worker is slower.
Distributing the load
The workers are most probably getting slower because several of them are accessing the same file at once. To give different processes access to a single file, the OS usually has to set up costly locking mechanisms. There are several ideas one could explore to improve on this:
- The locking mechanism is usually implemented at the level of the file system. You could experiment with different file systems.
- Depending on the file system, it could be interesting to split the file into multiple files, bypassing locking mechanisms. This may or may not be possible (see the sketch after this list).
- Copying the file to separate disks removes the need for locking altogether, but it adds a new layer of complexity to the job: the copy operation might be as costly as the counting operation, so this is more of a solution for heavier processing.
- For extremely large files, or more complex operations, it might be worth considering distributing the tasks on a Hadoop cluster.
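To illustrate the splitting idea above, here's a rough sketch (the chunk size and paths are assumptions; since we only count newlines, cutting a line at a chunk boundary doesn't change the total):
# split the large file into 512 MB pieces, one per worker
mkdir -p /tmp/parts
split -b 512m /tmp/large /tmp/parts/chunk_

# count lines in each piece in parallel, then wait for all of them
for chunk in /tmp/parts/chunk_*; do
    wc -l "$chunk" &
done
wait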
Use memory
Memory is notoriously a lot faster than disks (even SSD, yes). However, it's also considerably more expensive. I'm going to assume for now that we have infinite memory and we're able to hold the whole file in there.
The easiest way to start playing is to create a ramdisk: a chunk of memory acting like block storage and mounted on our file system. The creation of ramdisks varies according to your operating system; Mac users, this is how I created one on my machine.
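If you want to reproduce it, the usual macOS recipe looks something like this (the volume name and size below are my assumptions; ram:// takes a size in 512-byte sectors, so 4194304 sectors gives 2 GiB, and the volume ends up under /Volumes rather than the /tmp/rdisk path used below, so adjust to taste):
# create a 2 GiB ramdisk (4194304 sectors x 512 bytes), format it as HFS+
# and mount it as a volume named "rdisk"; size it to at least your file
diskutil erasevolume HFS+ 'rdisk' $(hdiutil attach -nomount ram://4194304)

# copy the test file onto the ramdisk so that reads come from memory
cp /tmp/large /Volumes/rdisk/large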
Now let's read 2GiB from the file in memory:
$ time dd if=/tmp/rdisk/large bs=4k count=500000| wc -l
500000+0 records in
500000+0 records out
2048000000 bytes transferred in 2.200442 secs (930722068 bytes/sec)
7998585
dd if=/tmp/rdisk/large bs=4k count=500000 0.12s user 1.55s system 75% cpu 2.204 total
wc -l 1.76s user 0.43s system 99% cpu 2.203 total
2.2 seconds. This is actually the same value we get when reading the file from disk, after the disk has done its caching.
Again, caching itself might be a costly operation (how many times are you going to count the lines of a file before it changes? more than once?), but the good news is that your file system knows how to take advantage of memory on its own, without asking the user for any extra help.
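You can see this at work by simply running the same read twice; a quick sketch (same paths and sizes as before):
# the first read comes from disk, the second should be served from the page cache
time dd if=/tmp/large bs=4k count=500000 of=/dev/null
time dd if=/tmp/large bs=4k count=500000 of=/dev/null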
CPU oversubscription
What happens if I spawn more workers than I have CPU cores on this machine?
time {
    async_countline_chunk /tmp/large 4k 0 62500
    async_countline_chunk /tmp/large 4k 62500 62500
    async_countline_chunk /tmp/large 4k 125000 62500
    async_countline_chunk /tmp/large 4k 187500 62500
    async_countline_chunk /tmp/large 4k 250000 62500
    async_countline_chunk /tmp/large 4k 312500 62500
    async_countline_chunk /tmp/large 4k 375000 62500
    async_countline_chunk /tmp/large 4k 437500 62500
    echo "waiting..."
    wait
}
waiting...
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.231143 secs (207936853 bytes/sec)
999367
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.241617 secs (206182750 bytes/sec)
1000072
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.255201 secs (203951383 bytes/sec)
1000646
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.260712 secs (203059835 bytes/sec)
999452
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.280079 secs (199987675 bytes/sec)
998961
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.295565 secs (197597204 bytes/sec)
999836
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.306251 secs (195980704 bytes/sec)
999428
62500+0 records in
62500+0 records out
256000000 bytes transferred in 1.363523 secs (187748941 bytes/sec)
1000823
real 0m1.373s
user 0m2.961s
sys 0m2.297s
1.3s. Same result as if I had 4 workers.
Recap
- The shell is great for running simple tasks in parallel
- You can considerably improve your result by distributing the load
- Modern file systems will deal with in-memory caching implicitly
- It's no use running more workers than you have CPU cores
Going further
- What if you don't know the size of the file, or you're dealing with an infinite stream (like a socket)?
- What if the file is larger than all the memory you have?
- Run some tests by separating workers more and more: working on different files, on different partitions, on different disks, on different machines?
- It could be interesting to study how different file systems behave. I'm particularly curious to know how distributed file systems like GlusterFS deal with concurrent access.
- We've only done concurrent read. Concurrent write might be even more complicated.
- UTF-8 is a bit more complicated. Since it's a variable-size encoding, splitting is not a trivial task. I suggest trying to define the size of each chunk as a number of characters, not a number of blocks.
- I did not deal with proper result aggregation. In the example of counting lines, the shell could do something simple enough (see the sketch below), but should you want a more complex operation, possibly with the workers needing to communicate with each other, you're better off using a more modern framework like Twisted or node.js.
- I only have an SSD available at the moment. I don't know if a physical disk would behave any differently.
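For the aggregation point above, here's a minimal sketch of how the shell could sum the partial counts (the helper name, paths and chunk sizes are made up for illustration):
# variant of the worker that writes its count to a file instead of stdout
function countline_chunk_to_file {
    dd if="$1" bs="$2" skip="$3" count="$4" 2>/dev/null | wc -l > "$5" &
}

tmpdir=$(mktemp -d)
countline_chunk_to_file /tmp/large 4k 0      250000 "$tmpdir/part0"
countline_chunk_to_file /tmp/large 4k 250000 250000 "$tmpdir/part1"
wait

# sum the partial counts once every worker is done
total=0
for f in "$tmpdir"/part*; do
    total=$(( total + $(cat "$f") ))
done
echo "total lines: $total"
rm -r "$tmpdir"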