Many computing-intensive processes in R involve the repeated evaluation of a function over many items or parameter sets. These so-called embarrassingly parallel calculations can be run serially with the lapply or Map functions, or in parallel on a single machine with mclapply or mcMap (from the parallel package).
The rslurm package simplifies the process of distributing this type of calculation across a computing cluster that uses the Slurm workload manager. Its main function, slurm_apply (and the related slurm_map), automatically divides the computation over multiple nodes and writes the necessary submission scripts. The package also includes functions to retrieve and combine the output from different nodes, as well as wrappers for common Slurm commands.
To illustrate a typical rslurm workflow, we use a simple function that takes a mean and standard deviation as parameters, generates a million normal deviates and returns the sample mean and standard deviation.
test_func <- function(par_mu, par_sd) {
    samp <- rnorm(10^6, par_mu, par_sd)
    c(s_mu = mean(samp), s_sd = sd(samp))
}
We then create a parameter data frame where each row is a parameter set and each column matches an argument of the function.
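One way to build such a data frame, here with three parameter sets chosen to match the output shown below:
pars <- data.frame(par_mu = 1:3,
                   par_sd = c(0.1, 0.2, 0.3))
pars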
## par_mu par_sd
## 1 1 0.1
## 2 2 0.2
## 3 3 0.3
We can now pass that function and the parameters data frame to slurm_apply, specifying the number of cluster nodes to use and the number of CPUs per node. The latter (cpus_per_node) determines how many processes will be forked on each node, as it is passed to the mc.cores argument of parallel::mcMap.
library(rslurm)
sjob <- slurm_apply(test_func, pars, jobname = 'test_apply',
                    nodes = 2, cpus_per_node = 2, submit = FALSE)
## Submission scripts output in directory _rslurm_test_apply
The output of slurm_apply
is a slurm_job
object that stores a few pieces of information (job name, job ID, and
the number of nodes) needed to retrieve the job’s output.
The default argument submit = TRUE would submit the generated script to the Slurm cluster and print a message confirming the job has been submitted to Slurm, assuming you are running R on a Slurm head node. When working from an R session without direct access to the cluster, you must set submit = FALSE. Either way, the function creates a folder called _rslurm_[jobname] in the working directory that contains the necessary scripts and data files. You can move this folder to a Slurm head node, run the shell command sbatch submit.sh from within the folder, and then move the folder back to your working directory. After completion of the test_apply job, whether it was submitted manually or automatically (with submit = TRUE), the _rslurm_[jobname] folder contains one results_*.RDS file per node:
## [1] "results_0.RDS" "results_1.RDS"
The results from all the nodes can be read back into R with the get_slurm_out() function. In this example we set wait = FALSE; with the default argument wait = TRUE, execution is paused until the Slurm job finishes running.
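A sketch of that call, using the sjob object created above (the outtype argument is discussed below):
res <- get_slurm_out(sjob, outtype = 'table', wait = FALSE)
res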
## s_mu s_sd
## 1 1.000137 0.09995552
## 2 2.000144 0.19988175
## 3 2.999822 0.30030102
The utility function print_job_status displays the status of a submitted job (i.e. in queue, running or completed), and cancel_slurm removes a job from the queue, aborting its execution if necessary. These functions are R wrappers for the Slurm command line utilities squeue and scancel, respectively.
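Both take the slurm_job object as their argument, for example:
print_job_status(sjob)  # show whether the job is queued, running or completed
cancel_slurm(sjob)      # remove the job from the queue, aborting it if running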
When outtype = 'table'
, the outputs from each function
evaluation are row-bound into a single data frame; this is an
appropriate format when the function returns a simple vector. The
default outtype = 'raw'
combines the outputs into a list
and can thus handle arbitrarily complex return objects.
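For example, retrieving the same results as a list:
res_raw <- get_slurm_out(sjob, outtype = 'raw', wait = FALSE)
res_raw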
## [[1]]
## s_mu s_sd
## 1.00013690 0.09995552
##
## [[2]]
## s_mu s_sd
## 2.0001445 0.1998817
##
## [[3]]
## s_mu s_sd
## 2.999822 0.300301
The utility function cleanup_files
deletes the temporary
folder for the specified Slurm job.
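For example:
cleanup_files(sjob)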
In addition to slurm_apply
, rslurm also defines a
slurm_call
function, which sends a single function call to
the cluster. It is analogous in syntax to the base R function
do.call
, accepting a function and a named list of
parameters as arguments.
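A minimal sketch of such a call, reusing test_func with illustrative parameter values:
sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1),
                   jobname = 'test_call', submit = FALSE)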
## Submission scripts output in directory _rslurm_test_call
Because slurm_call
involves a single process on a single
node, it does not recognize the nodes
and
cpus_per_node
arguments; otherwise, it accepts the same
additional arguments (detailed in the sections below) as
slurm_apply
.
The function passed to slurm_apply
can only receive
atomic parameters stored within a data frame. Suppose we want instead to
apply a function func
to a list of complex R objects,
obj_list
. In that case, we can use the function
slurm_map
, which is similar in syntax to
lapply
from base R and map
from the
purrr
package. Its first argument is a list which can
contain objects of any type, and its second argument is a function that
acts on a single element of the list.
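A minimal sketch, where obj_list and the summary function applied to each element are purely illustrative:
obj_list <- lapply(1:6, function(i) matrix(rnorm(9), nrow = 3))
sjob <- slurm_map(obj_list, function(m) sum(diag(m)),
                  jobname = 'test_map', nodes = 2, cpus_per_node = 2,
                  submit = FALSE)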
The output generated by slurm_map is structured the same way as that of slurm_apply. The procedures for checking the job status, extracting the results of the job, and cleaning up job files are also the same as described above.
Each of the tasks started by slurm_apply and slurm_map begins by default in an “empty” R environment containing only the function to be evaluated and its parameters. If we want to pass additional arguments to the function that do not vary with each task, we can simply add them as additional arguments to slurm_apply or slurm_map, as in the example below, where we want to take the logarithm of many integers but always use log(x, base = 2).
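A sketch of that call (the range of integers is arbitrary):
sjob <- slurm_map(as.list(1:1000), log, base = 2,
                  jobname = 'test_log', nodes = 2, cpus_per_node = 2,
                  submit = FALSE)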
To pass additional objects to the jobs that aren’t explicitly included as arguments to the function passed to slurm_apply or slurm_map, we can use the argument global_objects. For example, we might want to use an inline function that calls two other previously defined functions.
# func1 and func2, as well as the vectors a and b, are assumed to be defined in the current session
sjob <- slurm_apply(function(a, b) c(func1(a), func2(b)),
                    data.frame(a, b),
                    global_objects = c("func1", "func2"),
                    nodes = 2, cpus_per_node = 2)
The global_objects
argument specifies the names of any R
objects (besides the parameters data frame) that must be accessed by the
function passed to slurm_apply
. These objects are saved to
a .RDS
file that is loaded on each cluster node prior to
evaluating the function in parallel.
By default, all R packages attached to the current R session will
also be attached (with library
) on each cluster node,
though this can be modified with the optional pkgs
argument.
Particular clusters may require the specification of additional Slurm options, such as time and memory limits for the job. The slurm_options argument allows you to set any of the command line options (see the sbatch documentation for the full list) recognized by the Slurm sbatch command. It should be formatted as a named list, using the long name of each option (e.g. “time” rather than “t”). Flags, i.e. command line options that are toggled rather than set to a particular value, should be set to TRUE in slurm_options. For example, the following code sets the command line options --time=1:00:00 --share.
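A minimal sketch, reusing test_func and pars from the earlier example (with submit = FALSE so it can be run away from the cluster):
sopt <- list(time = '1:00:00', share = TRUE)
sjob <- slurm_apply(test_func, pars, slurm_options = sopt, submit = FALSE)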
As mentioned above, the slurm_apply function creates a job-specific folder. This folder contains the parameters as an RDS file and (if applicable) the objects specified as global_objects saved together in an RData file. The function also generates an R script (slurm_run.R) to be run on each cluster node, as well as a Bash script (submit.sh) to submit the job to Slurm.
More specifically, the Bash script tells Slurm to create a job array, and the R script takes advantage of the unique SLURM_ARRAY_TASK_ID environment variable that Slurm sets on each cluster node. This variable is read by slurm_run.R, which allows each instance of the script to operate on a different parameter subset and write its output to a different results file. The R script calls parallel::mcMap to parallelize calculations on each node.
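As a simplified illustration of that mechanism (this is not the actual template code), each instance of slurm_run.R identifies its task roughly as follows:
# Simplified sketch: determine which task this node is running
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
# The script then applies the function to this task's chunk of the parameters
# (in parallel via parallel::mcMap) and saves the output to a task-specific file,
# e.g. results_0.RDS for task 0, results_1.RDS for task 1, and so on.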
Additionally, the --dependency option can be used by taking the job ID from the slurm_job object returned by the slurm_apply, slurm_map, and slurm_call functions. The ID can be added manually to the slurm_options list. In the following example, the job ID of sjob1 is used to ensure that sjob2 does not begin running until after sjob1 finishes.
# Job1
sopt1 <- list(time = '1:00:00', share = TRUE)
sjob1 <- slurm_apply(test_func, pars, slurm_options = sopt1)
# Job2 depends on Job1
pars2 <- data.frame(par_mu = 1:20,
                    par_sd = seq(0.2, 2, length.out = 20))
sopt2 <- c(sopt1, list(dependency = sprintf("afterany:%s", sjob1$jobid)))
sjob2 <- slurm_apply(test_func, pars2, slurm_options = sopt2)
Both slurm_run.R
and submit.sh
are
generated from templates, using the whisker
package; these templates can be found in the
rslurm/templates
subfolder in your R package library. There
are two templates for each script, one for slurm_apply
and
the other (with the word “single” in its title) for
slurm_call
.
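For example, the installed template files can be located from within R:
list.files(system.file("templates", package = "rslurm"), full.names = TRUE)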
While you should avoid changing any existing lines in the template
scripts, you may want to add #SBATCH
lines to the
submit.sh
templates in order to permanently set certain
Slurm command line options and thus customize the package to your
particular cluster setup.