---
title: Parallelize R code on a Slurm cluster
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parallelize R code on a Slurm cluster}
  %\VignetteEngine{knitr::rmarkdown_notangle}
  %\VignetteEncoding{UTF-8}
---

Many computing-intensive processes in R involve the repeated evaluation of a function over many items or parameter sets. These so-called [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) calculations can be run serially with the `lapply` or `Map` function, or in parallel on a single machine with `mclapply` or `mcMap` (from the `parallel` package).

The rslurm package simplifies the process of distributing this type of calculation across a computing cluster that uses the [Slurm](http://slurm.schedmd.com/) workload manager. Its main functions, `slurm_apply` and the related `slurm_map`, automatically divide the computation over multiple nodes and write the necessary submission scripts. The package also includes functions to retrieve and combine the output from different nodes, as well as wrappers for common Slurm commands.

### Table of contents

- [Basic example](#basic-example)
- [Single function evaluation](#single-function-evaluation)
- [Applying a function to a list of complex objects](#applying-a-function-to-a-list-of-complex-objects)
- [Adding auxiliary data and functions](#adding-auxiliary-data-and-functions)
- [Configuring Slurm options](#configuring-slurm-options)
- [How it works / advanced customization](#how-it-works-advanced-customization)

## Basic example

To illustrate a typical rslurm workflow, we use a simple function that takes a mean and standard deviation as parameters, generates a million normal deviates, and returns the sample mean and standard deviation.

```{r}
test_func <- function(par_mu, par_sd) {
    samp <- rnorm(10^6, par_mu, par_sd)
    c(s_mu = mean(samp), s_sd = sd(samp))
}
```

We then create a parameter data frame where each row is a parameter set and each column matches an argument of the function.

```{r}
pars <- data.frame(par_mu = 1:10,
                   par_sd = seq(0.1, 1, length.out = 10))
head(pars, 3)
```

We can now pass that function and the parameters data frame to `slurm_apply`, specifying the number of cluster nodes to use and the number of CPUs per node. The latter (`cpus_per_node`) determines how many processes will be forked on each node, since it is passed as the `mc.cores` argument of `parallel::mcMap`.

```{r}
library(rslurm)
sjob <- slurm_apply(test_func, pars, jobname = 'test_apply',
                    nodes = 2, cpus_per_node = 2, submit = FALSE)
```

The output of `slurm_apply` is a `slurm_job` object that stores a few pieces of information (job name, job ID, and the number of nodes) needed to retrieve the job's output.
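Before involving the cluster at all, it can be useful to test the function with the single-machine equivalents mentioned in the introduction. A minimal sketch, not part of the rslurm workflow itself:

```{r eval = FALSE}
# Local equivalents of the distributed calculation above: serial with Map,
# or forked with mcMap (forking requires a Unix-alike OS)
res_serial <- Map(test_func, pars$par_mu, pars$par_sd)
res_fork <- parallel::mcMap(test_func, pars$par_mu, pars$par_sd,
                            mc.cores = 2)
```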
The default argument `submit = TRUE` would submit a generated script to the Slurm cluster and print a message confirming the job has been submitted to Slurm, assuming you are running R on a Slurm head node. When working from an R session without direct access to the cluster, you must set `submit = FALSE`. Either way, the function creates a folder called `_rslurm_[jobname]` in the working directory that contains scripts and data files. This folder may be moved to a Slurm head node, the shell command `sbatch submit.sh` run from within the folder, and the folder moved back to your working directory.

After completion of the `test_apply` job, i.e. following either manual or automatic (with `submit = TRUE`) submission to the cluster, the `_rslurm_test_apply` folder contains one `results_*.RDS` file for each node:

```{r}
list.files('_rslurm_test_apply', 'results')
```

The results from all the nodes can be read back into R with the `get_slurm_out()` function. In this example we set `wait = FALSE`; with the default argument `wait = TRUE`, execution is paused until the Slurm job finishes running.

```{r}
res <- get_slurm_out(sjob, outtype = 'table', wait = FALSE)
head(res, 3)
```

The utility function `print_job_status` displays the status of a submitted job (i.e. in queue, running or completed), and `cancel_slurm` will remove a job from the queue, aborting its execution if necessary. These functions are R wrappers for the Slurm command line utilities `squeue` and `scancel`, respectively.

When `outtype = 'table'`, the outputs from each function evaluation are row-bound into a single data frame; this is an appropriate format when the function returns a simple vector. The default `outtype = 'raw'` combines the outputs into a list and can thus handle arbitrarily complex return objects.

```{r}
res_raw <- get_slurm_out(sjob, outtype = 'raw', wait = FALSE)
res_raw[1:3]
```

The utility function `cleanup_files` deletes the temporary folder for the specified Slurm job.

```{r eval = FALSE}
cleanup_files(sjob)
```

## Single function evaluation

In addition to `slurm_apply`, rslurm also defines a `slurm_call` function, which sends a single function call to the cluster. It is analogous in syntax to the base R function `do.call`, accepting a function and a named list of parameters as arguments.

```{r}
sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1),
                   jobname = 'test_call', submit = FALSE)
```

Because `slurm_call` involves a single process on a single node, it does not recognize the `nodes` and `cpus_per_node` arguments; otherwise, it accepts the same additional arguments (detailed in the sections below) as `slurm_apply`.

```{r eval = FALSE}
cleanup_files(sjob)
```

## Applying a function to a list of complex objects

The function passed to `slurm_apply` can only receive atomic parameters stored within a data frame. Suppose we want instead to apply a function `func` to a list of complex R objects, `obj_list`. In that case we can use the function `slurm_map`, which is similar in syntax to `lapply` from base R and `map` from the `purrr` package. Its first argument is a list, which can contain objects of any type, and its second argument is a function that acts on a single element of the list.

```{r eval = FALSE}
sjob <- slurm_map(obj_list, func, nodes = 2, cpus_per_node = 2)
```

The output generated by `slurm_map` is structured the same way as that of `slurm_apply`. The procedures for checking the job status, extracting the results of the job, and cleaning up job files are also the same as described above.
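As a concrete illustration, `obj_list` and `func` might look like the following; the data frames and summary statistics here are invented purely for this sketch.

```{r eval = FALSE}
# Hypothetical example: summarize each data frame in a list
obj_list <- list(data.frame(x = rnorm(100)),
                 data.frame(x = runif(100)))
func <- function(d) c(n = nrow(d), mean_x = mean(d$x))
sjob <- slurm_map(obj_list, func, nodes = 2, cpus_per_node = 2)
```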
## Adding auxiliary data and functions

Each of the tasks started by `slurm_apply` and `slurm_map` begins by default in an "empty" R environment containing only the function to be evaluated and its parameters. If we want to pass additional arguments to the function that do not vary with each task, we can simply add them as additional arguments to `slurm_apply` or `slurm_map`, as in this example, where we want to take the logarithm of many integers but always use `log(x, base = 2)`.

```{r eval = FALSE}
sjob <- slurm_apply(log, data.frame(x = 1:10000), base = 2,
                    nodes = 2, cpus_per_node = 2)
```

To pass additional objects to the jobs that aren't explicitly included as arguments to the function passed to `slurm_apply` or `slurm_map`, we can use the argument `global_objects`. For example, we might want to use an inline function that calls two other previously defined functions.

```{r eval = FALSE}
sjob <- slurm_apply(function(a, b) c(func1(a), func2(b)),
                    data.frame(a, b),
                    global_objects = c("func1", "func2"),
                    nodes = 2, cpus_per_node = 2)
```

The `global_objects` argument specifies the names of any R objects (besides the parameters data frame) that must be accessed by the function passed to `slurm_apply`. These objects are saved to a `.RDS` file that is loaded on each cluster node prior to evaluating the function in parallel.

By default, all R packages attached to the current R session will also be attached (with `library`) on each cluster node, though this can be modified with the optional `pkgs` argument.

## Configuring Slurm options

Particular clusters may require the specification of additional Slurm options, such as time and memory limits for the job. The `slurm_options` argument allows you to set any of the command line options ([view list](http://slurm.schedmd.com/sbatch.html)) recognized by the Slurm `sbatch` command. It should be formatted as a named list, using the long names of each option (e.g. "time" rather than "t"). Flags, i.e. command line options that are toggled rather than set to a particular value, should be set to `TRUE` in `slurm_options`. For example, the following code sets the command line options `--time=1:00:00 --share`.

```{r eval = FALSE}
sopt <- list(time = '1:00:00', share = TRUE)
sjob <- slurm_apply(test_func, pars, slurm_options = sopt)
```

## How it works / advanced customization

As mentioned above, the `slurm_apply` function creates a job-specific folder. This folder contains the parameters as an RDS file and (if applicable) the objects specified as `global_objects` saved together in an RData file. The function also generates an R script (`slurm_run.R`) to be run on each cluster node, as well as a Bash script (`submit.sh`) to submit the job to Slurm.

More specifically, the Bash script tells Slurm to create a job array, and the R script takes advantage of the unique `SLURM_ARRAY_TASK_ID` environment variable that Slurm sets on each cluster node. This variable is read by `slurm_run.R`, which allows each instance of the script to operate on a different parameter subset and write its output to a different results file. The R script calls `parallel::mcMap` to parallelize calculations on each node.
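In outline, each array task runs logic along the following lines. This is a simplified sketch for the `test_apply` example, not the actual generated script; the file names, chunking rule, and variable names are illustrative only.

```{r eval = FALSE}
# Illustrative sketch of the per-node logic in slurm_run.R
task_id <- as.integer(Sys.getenv('SLURM_ARRAY_TASK_ID'))   # 0, 1, ... one per node
pars <- readRDS('params.RDS')                              # full parameter table
n_nodes <- 2
rows <- which(seq_len(nrow(pars)) %% n_nodes == task_id)   # this task's subset
res <- parallel::mcMap(test_func, pars$par_mu[rows], pars$par_sd[rows],
                       mc.cores = 2)                       # cpus_per_node
saveRDS(res, paste0('results_', task_id, '.RDS'))          # one results file per node
```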
Additionally, the `--dependency` option can be used with the job ID stored in the `slurm_job` object returned by the `slurm_apply`, `slurm_map`, and `slurm_call` functions; the ID can be added manually to the Slurm options. In the following example, the job ID of `sjob1` is used to ensure that `sjob2` does not begin running until after `sjob1` finishes.

```{r eval = FALSE}
# Job 1
sopt1 <- list(time = '1:00:00', share = TRUE)
sjob1 <- slurm_apply(test_func, pars, slurm_options = sopt1)

# Job 2 depends on Job 1
pars2 <- data.frame(par_mu = 2:20,
                    par_sd = seq(0.2, 2, length.out = 19))
sopt2 <- c(sopt1, list(dependency = sprintf("afterany:%s", sjob1$jobid)))
sjob2 <- slurm_apply(test_func, pars2, slurm_options = sopt2)
```

Both `slurm_run.R` and `submit.sh` are generated from templates, using the [`whisker`](https://cran.r-project.org/package=whisker) package; these templates can be found in the `rslurm/templates` subfolder in your R package library. There are two templates for each script, one for `slurm_apply` and the other (with the word "single" in its title) for `slurm_call`. While you should avoid changing any existing lines in the template scripts, you may want to add `#SBATCH` lines to the `submit.sh` templates in order to permanently set certain Slurm command line options and thus customize the package to your particular cluster setup.
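To inspect or copy the installed templates, you can query their location from within R. A minimal sketch:

```{r eval = FALSE}
# Locate the template scripts bundled with the installed rslurm package
templates_dir <- system.file('templates', package = 'rslurm')
list.files(templates_dir)
```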