Python Tasks

Learning Objectives

  • Write a Python task
  • Expand on automatic variables

So far, we have downloaded the data and gunzipped it. Both of those tasks were simple shell commands. However, pydoit can do much more than that; actions can also be arbitrary Python code. We will take advantage of that to do some “analysis” of our Super Smash Bros data and generate a plot.

Python tasks are defined in the same way as any other task, but the actions entry holds a Python function (the function object itself, not a string) instead of a shell command. Python lets us define functions within functions and access variables from the outer function’s namespace (these are called closures, which are beyond the scope of this workshop); to make things simpler, we’ll define our task this way.
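If nested functions are new to you, here is the pattern in isolation. This is only a small sketch with made-up names (make_greeter, greet and hello are hypothetical and have nothing to do with doit): an inner function is defined inside an outer one, reads a variable from the outer namespace, and is returned without being called.

```python
def make_greeter(greeting):
    # `greet` is defined inside `make_greeter` and can read `greeting`
    # from the enclosing function's namespace.
    def greet(name):
        return f"{greeting}, {name}!"

    # Return the inner function itself, without calling it.
    return greet


hello = make_greeter("Hello")
print(hello("Fox"))
```

The task that follows has the same shape: do_plot is defined inside task_plot_heatmap and handed over as an object rather than being called on the spot.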

def task_plot_heatmap():

    def do_plot(dependencies, targets):
        import matplotlib.pyplot as plt
        import pandas as pd
        import seaborn as sns

        # Read the data in a DataFrame
        data = pd.read_csv(list(dependencies)[0], index_col=0)
        # Make a heatmap and dendrogram with seaborn
        clst = sns.clustermap(data, linewidths=.5, figsize=(8, 8), square=True,
                              method='ward', z_score=0, row_cluster=False)
        clst.savefig(targets[0])

    return {'actions': [do_plot],
            'file_dep': ['Melee_data.csv'],
            'targets': ['Melee_data.csv.heatmap.pdf']}

The Python action takes two parameters, dependencies and targets. doit recognizes these parameter names and fills them in from the task’s file_dep and targets entries. They behave similarly to the automatic variables we accessed earlier, but here the actual Python objects are passed to the function. It is important to note that only the task function task_plot_heatmap is executed immediately when we run the pipeline; do_plot is defined at that point, but it is executed only when and if the task is determined to be out of date.
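The difference between defining the action and running it can be seen without doit at all. The sketch below is a hand-rolled stand-in, not doit’s own code: task_example, do_work, input.csv and output.pdf are hypothetical names, and the last two lines only imitate what a runner would do once it decides the task is out of date.

```python
calls = []


def task_example():
    # Runs as soon as the task definitions are collected.
    calls.append("task function ran")

    def do_work(dependencies, targets):
        # Runs only if the action is actually invoked.
        calls.append(f"action ran: {dependencies[0]} -> {targets[0]}")

    return {'actions': [do_work],
            'file_dep': ['input.csv'],
            'targets': ['output.pdf']}


# Collecting the task calls the outer function, but not the action.
task = task_example()
after_collection = list(calls)

# Imitate a runner invoking the action with the task's own entries.
action = task['actions'][0]
action(dependencies=task['file_dep'], targets=task['targets'])
```

After collection, calls holds only the entry from the task function; the entry from do_work appears only once the action is invoked.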

Run it and take a look at the output.

Well that sucks.

It’s likely that your labels are all garbled and overlapping. Let’s add some code to fix them and rerun it.

def task_plot_heatmap():

    def do_plot(dependencies, targets):
        import matplotlib.pyplot as plt
        import pandas as pd
        import seaborn as sns

        data = pd.read_csv(list(dependencies)[0], index_col=0)
        clst = sns.clustermap(data, linewidths=.5, figsize=(8, 8), square=True,
                              method='ward', z_score=0, row_cluster=False)
        # We like pretty charts, so rotate the labels
        plt.setp(clst.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
        plt.setp(clst.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
        clst.savefig(targets[0])

    return {'actions': [do_plot],
            'file_dep': ['Melee_data.csv'],
            'targets': ['Melee_data.csv.heatmap.pdf']}

It didn’t run! That’s because we only changed the code in the dodo file, not any of the targets or dependencies, so as far as doit is concerned, nothing has changed. Not having the dodo file be a dependency is a design decision defended in the doit documentation; to regenerate the plot, you’ll have to rm the PDF file and run doit again.
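To see why editing do_plot is not enough, it can help to model the check in a few lines. This is a toy model under loud assumptions, not doit’s implementation: it treats a task as up to date when every target exists and the content hash of every dependency matches a saved one. The helpers file_signature and is_up_to_date and the file contents are made up for illustration, and doit’s real bookkeeping differs in detail.

```python
import hashlib
import os
import tempfile


def file_signature(path):
    """Return a content hash for one dependency file."""
    with open(path, 'rb') as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def is_up_to_date(file_dep, targets, saved_signatures):
    """Toy check: every target exists and no dependency's content changed."""
    if not all(os.path.exists(target) for target in targets):
        return False
    return all(saved_signatures.get(dep) == file_signature(dep)
               for dep in file_dep)


with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, 'Melee_data.csv')
    pdf_path = csv_path + '.heatmap.pdf'
    with open(csv_path, 'w') as handle:
        handle.write('character,wins\nFox,10\n')  # made-up contents
    with open(pdf_path, 'w') as handle:
        handle.write('placeholder for the plot')

    # Pretend the task already ran once and its dependency state was saved.
    saved = {csv_path: file_signature(csv_path)}

    # Editing do_plot touches neither file, so the check still passes.
    skipped_after_code_edit = is_up_to_date([csv_path], [pdf_path], saved)

    # Removing the target is what makes the task out of date again.
    os.remove(pdf_path)
    skipped_after_rm = is_up_to_date([csv_path], [pdf_path], saved)

print(skipped_after_code_edit, skipped_after_rm)
```

In this model the code of the action never enters the comparison, which is why only removing the PDF (or changing the CSV) triggers a rerun.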