Python TasksΒΆ
Learning Objectives
- Write a Python task
- Expand on automatic variables
So far, we have downloaded the data and gunzipped it. Both of those tasks were simple shell commands. However, pydoit can do much more than that; actions can also be arbitrary python code. We will take advantage of that to do some “analysis” of our Super Smash Bros data and generate a plot.
Python tasks are defined in the same way as any other task, but the actions entry will include a function name instead. Python lets us define functions within functions and access variables from the outer function’s namespace (there are called closures, which are beyond the scope of this workshop); to make things simpler, we’ll define our task this way.
def task_plot_heatmap():
def do_plot(dependencies, targets):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Read the data in a DataFrame
data = pd.read_csv(list(dependencies)[0], index_col=0)
# Make a heatmap and dendrogram with seaborn
clst = sns.clustermap(data, linewidths=.5, figsize=(8, 8), square=True,
method='ward', z_score=0, row_cluster=False)
clst.savefig(targets[0])
return {'actions': [do_plot],
'file_dep': ['Melee_data.csv'],
'targets': ['Melee_data.csv.heatmap.pdf']}
The python action takes two parameters – file_dep
and targets
.
These behave similarly to the automatic variables we accessed earlier,
but instead the actual python objects are passed to the function and can
be accessed. It is important to note that only the task function
task_plot_heatmap
is executed immediately when we run the pipeline;
the do_plot
function will be defined, and then only executed when
and if the task is determined to be out of date.
Run it and take a look at the output.
Well that sucks.
It’s likely that your labels are all garbled and overlapping. Let’s add some code to fix them and rerun it.
def task_plot_heatmap():
def do_plot(dependencies, targets):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
data = pd.read_csv(list(dependencies)[0], index_col=0)
clst = sns.clustermap(data, linewidths=.5, figsize=(8, 8), square=True,
method='ward', z_score=0, row_cluster=False)
# We like pretty charts, so rotate the labels
plt.setp(clst.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.setp(clst.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
clst.savefig(targets[0])
return {'actions': [do_plot],
'file_dep': ['Melee_data.csv'],
'targets': ['Melee_data.csv.heatmap.pdf']}
It didn’t run! That’s because we didn’t change any of the targets or
dependencies, so as far as doit is concerned, nothing has changed. Not
having the dodo file be a dependency is a design decision defended in
the documentation; in order to regenerate the plot, you’ll have to
rm
the PDF file and run doit again.