Parallelise a for loop inside an R script using Slurm

I have thousands of data frames and I want to parallelise their analysis with Slurm.
Here I am providing a simplified example:
I have an R script called test.R that contains these commands:
library(tidyverse)
df1 <- tibble(col1=c(1,2,3),col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9),col2=c(10,11,12))
files <- list(df1,df2)
for (i in seq_along(files)) {
  # summarise the i-th data frame and write it to its own file
  df3 <- as.data.frame(files[[i]]) %>%
    summarise(across(everything(), list(mean = mean, sd = sd)))
  write.table(df3, paste0("df", i))
}
Created on 2022-04-15 by the reprex package (v2.0.1)
I want to parallelise the for loop and analyse each data frame as a separate job.
Any help, guidance, and tutorials are appreciated.
Would the --array option help?
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
#SBATCH --array=1-2
module load R/4.1.3
Rscript test.R $SLURM_ARRAY_TASK_ID

If the data frames are stored in separate files with numerical identifiers, you can build each file name from a bash variable and submit the array of identifying numbers in your sbatch command. $SLURM_ARRAY_TASK_ID can then be used as an input argument to your R code to point to that specific data frame's file.
Just for the purpose of the following example, say the data frames are stored as .csv files named dataframe_1.csv, dataframe_2.csv, ..., dataframe_100.csv. Your command to run the parallel jobs would be something like this:
sbatch -a 1-100 jobscript.sh
And jobscript.sh would resemble your sample code in your question:
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
module load R/4.1.3
Rscript test.R "dataframe_${SLURM_ARRAY_TASK_ID}.csv"
Note that you may need to break these sbatch submissions into batches to process thousands of data frames, since clusters typically cap the maximum job-array size. If a file name has four digits in its numeric ID, you can prepend the first digit before ${SLURM_ARRAY_TASK_ID} in the last line of jobscript.sh:
Rscript test.R "dataframe_1${SLURM_ARRAY_TASK_ID}.csv"
or if you want to avoid making multiple scripts, pass that number as an argument to jobscript.sh:
sbatch -a 1-100 jobscript.sh 1
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
prepend=$1
module load R/4.1.3
Rscript test.R "dataframe_${prepend}${SLURM_ARRAY_TASK_ID}.csv"

Related

Executing all the scripts within a directory using a bash script

I'd like to run all the R scripts located in a directory called scripts using a bash script. How would you do it? My script so far (not working) looks as follows:
#!/usr/bin/bash
#SBATCH --job-name=name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
module load R/4.0.2-gnu9.1
module load sqlite/3.26.0
Rscript $R_SCRIPT
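One way to drive this (a sketch, assuming the R scripts sit in a scripts/ directory next to the submission script and that the job script above is saved as jobscript.sh) is to loop over the files and submit one job per script, passing the path in through --export so that $R_SCRIPT is defined inside the job:
# submit one Slurm job per R script in scripts/
for script in scripts/*.R; do
    sbatch --export=ALL,R_SCRIPT="$script" jobscript.sh
done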

Create multiple similar jobs and submit them to cloud computing platform with the same bash script

I have an R script that takes in a string and compares it with other strings, and I submit a bash script that runs the R script. But there are about 3000 strings that I want to process, and I don't want to submit each job manually. How can I automate the job submission? Basically, my question is how I can submit multiple jobs that use the same bash script.
I want to take in the first line of each file and use that string to do the comparison.
My R script looks similar to this:
sfile <- commandArgs(trailingOnly = TRUE)
print(sfile == another_string)
My bash script looks like this:
#!/bin/bash
#SBATCH -J BV1
#SBATCH --account=def-*****
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G # 4 GiB of memory
#SBATCH -t 0-10:00 # Running time of 10 hr
module load r
Rscript --vanilla $HOME/projects/def-*****/h*****/****/mappabilityprofile/mappabilityprofile.R $1 > $HOME/projects/def-*****/h*****/****/Rout/testRunOutput.$1.Rout 2>&1
The code that I tried to use to submit the automated jobs from the command line is this:
for ii in /path/to/files; do
> line=$(head -n 1 $f)
> sbatch mappabilityprofile.sh line
> done
This doesn't really work because it only submits one job, when I want it to submit one job per file. Is there any way I can achieve this?
I found out that I can use
while read first; do read second; sbatch mappabilityprofile.sh "$second"; done
thanks to this post!
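For completeness, the original for loop can also be made to submit one job per file by globbing the directory and quoting the extracted line (a sketch, assuming the input files live under /path/to/files/):
for f in /path/to/files/*; do
    # take the first line of each file and submit it as the argument
    line=$(head -n 1 "$f")
    sbatch mappabilityprofile.sh "$line"
done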

Slurm job array error: slurmstepd: error: execve(): Rscript: No such file or directory

I am trying to get a very basic job array script working using the Slurm job scheduler on an HPC. I am getting the error:
slurmstepd: error: execve(): Rscript: No such file or directory
This is similar to this but I am not using any export commands, so that isn't the cause here. Some sources say it could be something to do with creating these scripts in Windows, so the line endings will not work for Unix. Could this be the issue? If so, how can I check for this?
My shell script:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=user@email.address
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
srun Rscript hello.R $SLURM_ARRAY_TASK_ID
The script hello.R is:
#!/usr/bin/env
# Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))
Thanks to some offline help I have got an answer to my question. It seems I didn't need to use srun in the shell script, and I needed to include Rscript in the shebang line of hello.R.
Shell script is:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=tclewis1@sheffield.ac.uk
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
Rscript hello.R $SLURM_ARRAY_TASK_ID
And hello.R is now:
#!/usr/bin/env Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))
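As for the Windows line-ending question raised above: you can check a script for carriage returns and strip them if present, for example (assuming the shell script is saved as jobscript.sh; dos2unix may need to be installed or loaded on the cluster):
file jobscript.sh               # reports "with CRLF line terminators" if edited on Windows
cat -v hello.R | head           # CRLF endings show up as ^M at the end of each line
dos2unix jobscript.sh hello.R   # convert both scripts to Unix line endings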

Debugging R code when using Slurm

I am running simulations in R on a cluster. Each R file contains 100 models. Each model analyses a different data set. Cluster commands are included in a slurm file, shown below.
A small percentage of models apparently do not converge well enough to estimate the Hessian and an error is generated for these models. The errors are placed in an error log file. However, I cannot determine from looking at the parameter estimates, the error log file and the output log file which of the 100 models are generating the errors.
Here is an example of an error message
Error in chol.default(fit$hessian) :
the leading minor of order 3 is not positive definite
Calls: chol2inv -> chol -> chol.default
Parameter estimates are returned despite these errors. Some SEs are huge, but I think SEs can sometimes be large even when an error message is not returned.
Is it possible to include an additional line in my slurm file below that will generate a log file containing both the errors and the rest of the output, with each error shown in its original location (for example, where it appears when I run the code on my Windows laptop)? That way I could quickly determine which models generate the errors by looking at the log file. I have been trying to think of a workaround, but have not been able to come up with anything so far.
Here is a slurm file:
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH -e JS_N200_301_400_Oct31_17c.err
#SBATCH --mail-user markwm#myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
Not sure if this is what you want, but the R option error lets you control what should happen with errors that you don't otherwise catch. For instance, setting
options(error = function() {
  traceback(2L)
  dump.frames(dumpto = "last.dump", to.file = TRUE)
})
at the beginning of your *.R script, or in a .Rprofile startup script, will (a) output the traceback if there's an error, but more importantly, it'll also (b) dump the call stack to file last.dump.rda, which you can load in a fresh R session as:
dump <- get(load("last.dump.rda"))
Note that get(load( is not a mistake. Here dump is an object of class dump.frames which allows you to inspect the call stack and its content.
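For example, after loading the dump as above, you can step through the frames interactively with utils::debugger():
debugger(dump)   # pick a frame from the menu to browse its local variables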
You can of course customize error to do other things.
I learned from an IT person in charge of the cluster that I can have the error messages added to the output log by simply removing the reference to the error log in the slurm file. See below. It seems to be good enough.
I plan to also output the model number into the log at the beginning and the end of each model's output for added clarity (which I should have been doing from the start).
#!/bin/bash
#SBATCH -J JS_N200_301_400_Oct31_17c.R
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 2000
#SBATCH -p community.q
#SBATCH -o JS_N200_301_400_Oct31_17c.out
#SBATCH --mail-user markwm#myuniversity.edu
#SBATCH --mail-type ALL
Rscript JS_N200_301_400_Oct31_17c.R
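To make the failing models easier to spot in the combined log, one option is to wrap each fit in tryCatch and print the model index around it. This is only a sketch: fit_model() and model_data are placeholders standing in for whatever the real script does.
for (i in seq_along(model_data)) {
  cat("=== model", i, ": start ===\n")
  fit <- tryCatch(
    fit_model(model_data[[i]]),   # placeholder for the actual fitting call
    error = function(e) {
      cat("model", i, "failed:", conditionMessage(e), "\n")
      NULL
    }
  )
  cat("=== model", i, ": end ===\n")
}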

How can I load my data file in R from the command line?

I would like to run a job using R. I wrote a job script as follows:
#!/bin/sh
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --ntasks=1
#SBATCH --job-name=StrAuto
#SBATCH --error=R.%J.err
#SBATCH --output=R.%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=asallam#unl.edu
module load R/3.3
R CMD BATCH AF1.R
Then I wrote the command lines in the AF1.R file as follows:
outer(names(m[,-1]), names(m[,-1]), function(x,y){colSums((m[,x]-m[,y])**2/156854,na.rm=TRUE)})
Now I would like to ask how I load my data file m.txt. What is the command that I should write before the outer(...) call?
Thanks in advance
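Assuming m.txt is a delimited text file with a header row (the format is not shown in the question), the data can be read into m with read.table() before the outer() call:
m <- read.table("m.txt", header = TRUE)   # add sep = "\t" or "," if the file is not whitespace-delimited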
