How can I load my data file in R using command lines?

I would like to run a job using R.
I wrote the job submission script as follows:
#!/bin/sh
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --ntasks=1
#SBATCH --job-name=StrAuto
#SBATCH --error=R.%J.err
#SBATCH --output=R.%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=asallam#unl.edu
module load R/3.3
R CMD BATCH AF1.R
Then I wrote the following command in the AF1.R file:
outer(names(m[,-1]), names(m[,-1]), function(x,y){colSums((m[,x]-m[,y])**2/156854,na.rm=TRUE)})
Now I would like to ask how to load my data file m.txt. What command should I write before the outer(...) call?
Thanks in advance
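A minimal sketch of what could go at the top of AF1.R, assuming m.txt is a plain whitespace-delimited text file with a header row (the separator, header setting and path are assumptions; adjust them to match the real file):
# Load the data into m before the outer() call.
# Assumes m.txt lives in the directory the job is submitted from
# and is whitespace-delimited with a header row.
m <- read.table("m.txt", header = TRUE)
The outer(...) line from AF1.R can then follow unchanged, since it only needs m to exist.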

Related

Didn't get any GPU using sbatch when submitting a job script through Slurm

Here is my Slurm job script. I requested 4 GPUs and 1 compute node. My script is as follows:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-gpu=12
#SBATCH --mem-per-gpu=40G
#SBATCH --time=0:15:00
module use /ifs/opt_cuda/modulefiles
module load python/gcc/3.10
module load cuda11.1/toolkit cuda11.1/blas cuda11.1/fft cudnn8.0-cuda11.1 tensorrt-cuda11.1/7.2.3.4
# activate TF venv
source /ifs/groups/rweberGrp/venvs/py310-tf210/bin/activate
python -c "import torch;print(torch.cuda.device_count())"
so torch.cuda.device_count() should give me 4, but the actual output is:
0
I have no idea why this is happening. Does anyone have any idea? Thanks

Executing all the scripts within a directory using a bash script

I'd like to run all the R scripts located in a directory called scripts using a bash script. How would you do it? My script so far (not working) looks as follows:
#!/usr/bin/bash
#SBATCH --job-name=name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
module load R/4.0.2-gnu9.1
module load sqlite/3.26.0
Rscript $R_SCRIPT
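One possible workaround on the R side (a sketch, not an answer tested on this cluster): keep the job script as it is, but point it at a single wrapper script, here hypothetically called run_all.R, that sources every R script found in the scripts directory:
# run_all.R (hypothetical wrapper): source every .R file in ./scripts
script_files <- list.files("scripts", pattern = "\\.R$", full.names = TRUE)
for (f in script_files) {
  message("Running ", f)
  source(f)
}
The last line of the job script would then become Rscript run_all.R instead of Rscript $R_SCRIPT.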

Slurm won't run MPI job with specific number of nodes

I am currently trying to run calculations that require large amounts of memory per core on an HPC cluster.
I am using a single node/machine with 512 GB of RAM. I have 28 cores per machine, but every process needs more than 512/28 GB of RAM.
I have no issue using 12 or 2 processes (so I intentionally don't saturate the node), but whenever I try to use 6 or 7 I get:
srun: error: node058: tasks 3-5: Exited with exit code 255
The relevant part of my slurm script is:
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --tasks-per-node=6
#SBATCH --hint=nomultithread
#SBATCH --partition=mem512
#SBATCH --time=1008:00:00
#SBATCH --mail-type=NONE
#SBATCH --job-name=$NAME
#SBATCH --exclusive
#SBATCH --export=NONE
export SLURM_EXPORT_ENV=ALL
export I_MPI_DEBUG=5
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so.0
#set period as decimal point
export LC_NUMERIC=C
ulimit -s unlimited
ulimit -l hard
export TMPDIR=/scratch/$SLURM_JOB_USER/$SLURM_JOBID
srun --cpu-bind=cores some_program < input 1> $SLURM_SUBMIT_DIR/output 2>error
Thank you for reading,
Cheers!

Parallelise a for loop inside an R script using Slurm

I have thousands of data frames and I want to parallelise their analysis with Slurm.
Here I am providing a simplified example:
I have an R script called test.R
test.R contains these commands:
library(tidyverse)
df1 <- tibble(col1=c(1,2,3),col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9),col2=c(10,11,12))
files <- list(df1,df2)
for(i in 1:length(files)){
  df3 <- as.data.frame(files[[i]]) %>%
    summarise(across(everything(), list(mean=mean,sd=sd)))
  write.table(df3, paste0("df",i))
}
Created on 2022-04-15 by the reprex package (v2.0.1)
I want to parallelise the for loop and analyse each data frame as a separate job.
Any help, guidance, and tutorials are appreciated.
Would the array command help?
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
#SBATCH --array=1-2
module load R/4.1.3
Rscript test.R $SLURM_ARRAY_TASK_ID
If the dataframes are stored in separate files with numerical identifiers, you can create a list of these files as a bash variable and submit the array of identifying numbers in your sbatch command. The $SLURM_ARRAY_TASK_ID can be used as an input argument for your R code to point to that specific dataframe's file.
Just for the purpose of the following example, say the dataframes are stored as .csv files with names dataframe_1.csv, dataframe_2.csv, ... dataframe_100.csv. Your command to run the parallel jobs would be something like this:
sbatch -a 1-100 jobscript.sh
And jobscript.sh would resemble your sample code in your question:
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
module load R/4.1.3
Rscript test.R "dataframe_${SLURM_ARRAY_TASK_ID}.csv"
Note that you may need to break up these sbatch jobs to allow for processing thousands of dataframes. If a file name has four digits in its numeric ID, you can prepend the first digit before the ${SLURM_ARRAY_TASK_ID} in the last line of the jobscript.sh:
Rscript test.R "dataframe_1${SLURM_ARRAY_TASK_ID}.csv"
or if you want to avoid making multiple scripts, pass that number as an argument to jobscript.sh:
sbatch -a 1-100 jobscript.sh 1
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
prepend=$1
module load R/4.1.3
Rscript test.R "dataframe_${prepend}${SLURM_ARRAY_TASK_ID}.csv"
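On the R side, test.R would then read the file named by that argument instead of building the data frames inside the script. A minimal sketch, reusing the summarise logic from the reprex above (the input and output file names are assumptions):
# test.R: summarise one data frame whose file name is passed by the job script
library(tidyverse)
args <- commandArgs(trailingOnly = TRUE)
infile <- args[1]   # e.g. "dataframe_17.csv"
df <- read.csv(infile)
df3 <- df %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
# one output file per input file (assumed naming scheme)
write.table(df3, paste0("summary_", basename(infile)))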

Slurm job array error: slurmstepd: error: execve(): Rscript: No such file or directory

I am trying to get a very basic job array script working using the Slurm job scheduler on an HPC. I am getting the error:
slurmstepd: error: execve(): Rscript: No such file or directory
This is similar to this question, but I am not using any export commands, so that isn't the cause here. Some sources say it could be something to do with creating these scripts in Windows, so the line endings will not work on Unix. Could this be the issue? If so, how can I check for this?
My shell script:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=user#email.address
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
srun Rscript hello.R $SLURM_ARRAY_TASK_ID
The hello.R script is:
#!/usr/bin/env
# Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))
Thanks to some offline help I have got an answer to my question. It seems I didn't need to use srun in the shell script, and I needed to include Rscript in the shebang line of hello.R.
The shell script is now:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=tclewis1#sheffield.ac.uk
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
Rscript hello.R $SLURM_ARRAY_TASK_ID
And hello.R is now:
#!/usr/bin/env Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))