This is based on a real-life scenario I came across at my job recently. Our scheduler is the program responsible for batch management. It basically manages the dependencies between jobs.
Let's take the following example:
A---B
\---.---.---H
/
/---E---F
C---D
\---G
You can understand this wonderful drawing like this:
A and C will start immediately in parallel.
B will execute only after A is finished. D waits for C
E and G run in parallel after D
F runs once E is done.
H waits for F and B to finish.
Internally (I'm super-simplifying), the program handles dependency by assigning COND to each job:
- IN COND if the job should wait for another one to finish.
- OUT COND if the job should notify another waiting one that it's done.
To make things clearer, let's take a look at this CSV extract of the database.
# Jobname; Incond; Outcuond;
A; ; B;
B; A; H;
C; ; D;
D; C; E G;
E; K D; F;
F; E; H;
G; D; ;
H; B F G;
- Incond and outcond are supposed to be a list of jobnames separated by a whitespace (look at job D and H).
- In order to create the dependency A---B, we need to define
Job A: OUTCOND B
Job B: INCOND A (look at the first two lines of the CSV).
- Job H has 3 INCONDS, one of them is G. But G doesn't have H as OUTCOND so the dependency is not created.
- Job E has K as INCOND. But job K does not exist. The condition is ignored.
Side note
This situation happens a lot. Usually, job K existed before but got deleted in a terribly unelegant way.
Unfortunately, we often have to modify that file by hand, and many of us make mistakes. It's easy to forget an existing condition when deleting a job. Our work CSV file has over 6500 entries, uses the worst naming convention of all time (ALL_CAPS_WITH_NUMBERS_004); it can get messy.
Exercise:
The goal of this exercise is to write a script that would parse the given CSV file and detect all the non valid conditions. A condition is considered valid if it is symmetrically matched.
Bonus points if your script can make the difference between a missing condition and a deleted job.