Dependencies and Task StatusΒΆ
Learning Objectives
- Introduce task
uptodate
- Introduce file dependencies
If you run doit
again, you’ll notice that you get the same output as
before – the dot and task name indicating that the task was run again.
Preferably, we’d like the task not to run once it already has (the file
might be large, for example). pydoit handles this two ways: through
dependencies and the uptodate
function. File dependencies are
quite intuitive. The task names the files it depends on, and if any of
those change, the task is rerun. Our download command already adds a bit
of complexity though, because it does not depend on any files. This is
where uptodate
comes in.
uptodate
is another entry in the task dictionary. It is a list of
booleans, callables (that is, functions), or strings specifying shell
commands. If any of the uptodate
items evaluates to False
, the
task is out-of-date and run.
For our download task, we are going to use a function which is included
in the doit library, run_once
. As one might expect, this makes sure
the task is run at least once. Let’s add it to our dodo.py
.
from doit.tools import run_once
def task_download_data():
return {'actions': ['curl -OL https://s3.amazonaws.com/pydoit-intermediate/Melee_data.csv.gz'],
'targets': ['Melee_data.csv.gz'],
'uptodate': [run_once]}
Now if we run doit
again (twice more, actually), our output will
change.
-- download_data
The dash-dash indicates that the task was determined to be up to date, and was not executed.
By default, the task name will be taken from the function defining it. We can also define our own task names with thename
entry in the task dictionary.
doit’s system for determining whether a task is up to date is actually somewhat complex. The documentation provides a more detailed list of conditions where a task is not up to date:
- An uptodate item is (or evaluates to)
False
- A file is added to or removed from file_dep
- A file_dep changed since last successful execution
- A target path does not exist
- A task has no file_dep and uptodate item equal to
True
The last of these explains why we need to include the uptodate
entry
in our download task; it has no file dependencies, and so will always be
considered out of date, even if the target exists.
Experimenting with uptodate
What would happen if we changed
uptodate
to[True]
? How about[False]
?
Now that we know more about how tasks are considered up to date, let’s introduce a file dependency. The file we downloaded was a gzip archive, so we’ll write a task to extract it. The command we would run in the shell might be:
$ gunzip Melee_data.csv.gz
This would produce a file called Melee_data.csv
. We can see then
that we have a target (Melee_data.csv
), an action (running
gunzip
), and a file dependency (Melee_data.csv.gz
). Let’s add
the task to our dodo.py
.
def task_gunzip_data():
return {'actions': ['gunzip -c %(dependencies)s > %(targets)s'],
'targets': ['Melee_data.csv'],
'file_dep': ['Melee_data.csv.gz']}
On top of the file dependency, this task also introduces automatic
variables. These are in the actions
string, and are recognized by
the task creator. This removes redundancy and saves us some code.
When we run doit
, we get output showing that only the gunzip task
was executed.
-- download_data
. gunzip_data