Introduction
This exercise is longer than what we usually deal with. It will be spread over 3 parts which I will publish gradually. Our ultimate goal is to build a basic
version control system (VCS), a tool used by programmers in order to keep their sanity. If you've ever heard of Git, SVN, Mercurial or CVS (just to name a few), you know what I'm talking about. A VCS helps manage several versions of a file (or a group of files). Here are the notions I'm hoping to explore through this exercise:
- Algorithms
- Manipulating files
- Storing data (as plain text and/or relational database)
- Managing large(r) projects (I estimate it's going to be several hundred lines of codes)
Now without further ado, let's introduce our first part.
diff
The first part (and probably most difficult) part of the exercise is to write a
diff program which compares two versions of a file (or simply two files) and outputs the difference between them. The Wikipedia article does a terrific job at explaining how
diff works and you're strongly encouraged to read it. However, for the sake of brevity, I will lay out simple guidelines to help you start. Keep in mind that we're not trying to create a modern full-fledged
diff program. The GNU version of diff is over 11000 (yes,
eleven thousand) lines of C code, clearly that's not what we're going after.
Here's the simple program we want:
- The program takes to arguments: old, the old version of the file, and new the new version. You're free to decide how to feed it these arguments, however I recommend you make it simple as we're going to use diff extensively in subsequent parts of the exercise.
- There are 2 types of modifications our program can detect: add and remove. Your modification object must hold info on the location of the change as well as its content (if needed).
- Our program operates on full lines. If a single letter has changed in a line, it means that the whole line has been replaced.
- Try to keep the number of modifications as low as possible.
Examples
Take a look at this poem by Edgar Allan Poe: A dream within a dream.
|--------+---------------------------------+---------------------------------|
| line # | old | new |
|--------+---------------------------------+---------------------------------|
| 1 | Take this kiss upon the brow! | Take this kiss upon the brow! |
| 2 | And, in parting from you now, | And, in parting from you now, |
| 3 | Thus much let me avow-- | Thus much let me avow-- |
| 4 | You are not wrong, who deem | You are not wrong, who deem |
| 5 | That my days have been a dream; | That my days have been a dream; |
| 6 | Yet if hope has flown away | In a night, or in a day, |
| 7 | In a night, or in a day, | In a vision, or in none, |
| 8 | In a vision, or in none, | Is it therefore the less gone? |
| 9 | Is it therefore the less gone? | Something changed in the file |
| 10 | All that we see or seem | All this we see or seem |
| 11 | Is but a dream within a dream. | Is but a dream within a dream. |
|--------+---------------------------------+---------------------------------|
The old version is the original poem, new is the one that I have modified. Your program should detect 4 changes:
- Removed line 6.
- Added "Something changed in the file" after line 9.
- Removed line 10.
- Added "All this we see or seem" after line 9.
Remarks
Please note the following:
- Each modification object is specifying the location (line) where it happens. Line is always given relative to the "old" file.
- In my example, modification 2 and modification 4 are both applied to line number 9. This assumes that there's an order amongst modifications. If we add "All this we see ..." before "Something changed ...", we will end up with a different result.
- Optionally, you can consider that modif. 3 (remove line 10) and modif. 4 (add "..." after line 9) can be considered a single modification of a new type: change. It's up to you to decide if it's worth the extra effort or not.
- Make sure the output of your program is easy to parse, as the upcoming Part 2 of the exercise will consist of applying these modifications to old files in order to retrieve new files.
Extra reading
I already mentioned the Wikipedia page (and I really insist you take a few minutes reading it). Here are other resources which might help you in this exercise:
- This paper describes the algorithm generally implemented in modern applications. Here you can find a interesting explanation of it.
- This blog post defines how you could format the output of your diff. You can also check what GNU has defined in their documentation.
- Here's an implementation that focuses on simplicity and ease-of-understanding at the cost of performance. Definitely worth checking out
Upcoming
As mentioned, the exercise is not done. I plan on adding more steps until we can create a small functional VCS. If you guys show enough interest, I'll post Part 2 soon!