Saving history for a script run with Rscript through the terminal/console - r

For my work, I run scripts on virtual machines on a computer cluster. These jobs are typically large and produce a lot of output. What I'd like to do is run a script via the terminal and have it, at the end, create a duplicate of itself containing every line that was part of the script (minus the last one if necessary). This is vital for replicability and debugging in my work, because I often can't tell which parameters or variables a particular job used: I submit the same script repeatedly with slightly different parameters, and the folders can't be version controlled.
Imagine this file test.R:
a <- rnorm(100)
#test
# Saving history for reproducibility to folder
savehistory(file = 'test2.R')
Now, I run this via the console on my virtual node and get the following error:
[XX home]$ Rscript test.R
Error in .External2(C_savehistory, file) : no history available to save
Calls: savehistory
Execution halted
Is there any command like savehistory that works inside a script that is just executed?
The desired outcome is that a file called test2.R is saved, containing this:
a <- rnorm(100)
#test
# Saving history for reproducibility to folder

You could copy the file instead. Since you're using Rscript, the script name is included in commandArgs() in the form --file=test.R. A simple function like this will return the path of the executing script:
get_filename <- function() {
  c_args <- commandArgs()
  r_file <- c_args[grepl("\\.R$", c_args, ignore.case = TRUE)]
  r_file <- gsub("--file=", "", r_file)
  r_file <- normalizePath(r_file)
  return(r_file)
}
Then you can copy the file as you see fit. For example, appending ".backup":
script_name <- get_filename()
backup_name <- paste0(script_name, ".backup")
file.copy(script_name, backup_name)
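Putting it together, test.R from the question could end with a self-copy instead of the savehistory() call. This is a minimal sketch, assuming get_filename() from above is defined at the top of the script and that test2.R (as in the desired outcome) is the target name:
a <- rnorm(100)
#test
# Saving a copy of this script for reproducibility to the folder
file.copy(get_filename(), "test2.R", overwrite = TRUE)
Note that the copy will also contain the file.copy() line itself, which is slightly more than the desired output shown in the question.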

Related

Running system() inside foreach()

I am running R version 3.4.1 on Windows 7.
I need to run a certain program using N=500 bootstrapped data files that I have stored within a working directory. The executable will produce 3 more datafiles for each of the N runs, all of which are stored in one working directory. I'd like to use parallel computing to increase the speed of this process.
The loop requires one starting file ('starter') that gets modified at every iteration of the loop. This is a really important part that I can't get around. Also, at every iteration any pre-existing data files are removed and then the executable is called using system(). The executable produces the three data files called Report, CompReport, and covar.
In a normal loop the process looks something like this:
starter <- SS_readstarter(file="starter.ss") # read starter file
for(iboot in 1:N){
  starter$datfile <- paste("BootData", iboot, ".ss", sep="")
  # code to replace the starter file with the new one
  SS_writestarter(starter, overwrite=TRUE)
  # any old files are removed
  file.remove("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.remove("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.remove("covar.sso", paste("covar_",iboot,".sso",sep=""))
  # then the program is called
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.copy("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.copy("covar.sso", paste("covar_",iboot,".sso",sep=""))
}
The above process takes 5 days to complete, so I've been exploring options for parallelization. What I've worked with so far doesn't work, so any help or suggestions are appreciated. Here's what I was thinking: use a foreach loop and make sure I'm not reading and overwriting the starter file. This is what I have:
library(doParallel)  # also attaches foreach and parallel
# read starter file
starter_strt <- SS_readstarter(file="starter.ss")
iter <- 2 # just for an example here even though I said N=500 above
cores <- parallel::detectCores() - 1
cl <- makeCluster(cores)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
foreach(iboot=1:iter, .packages='r4ss', .export="starter_strt", system("ss3")) %dopar% {
  # code to create the starter file so I don't overwrite it
  datfile <- data.frame(rep(paste("BootData",iboot,".ss",sep=""),2))
  colnames(datfile) <- "datfile"
  starter_list <- cbind(starter_strt, datfile)
  # replace starter file with modified version
  SS_writestarter(starter_list, overwrite=TRUE)
  # delete files as above
  # run program
  system("ss3")
  # the files produced get pushed to the working directory
  file.copy("Report.sso", paste("Report_",iboot,".sso",sep=""))
  file.copy("CompReport.sso", paste("CompReport_",iboot,".sso",sep=""))
  file.copy("covar.sso", paste("covar_",iboot,".sso",sep=""))
}
The program runs through once but then I get a warning (see below) and the expected files are missing from the directory.
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): starter_strt
I'm pretty stumped at this point. I saw the example "Running multiple R scripts using the system() command", where the system command is called in a parLapply function. Maybe this is what I need? Help and/or suggestions are greatly appreciated.
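One way the loop could be restructured is sketched below. This is only a sketch, not a tested solution: the per-iteration folder layout, the copying of the ss3 input files, and the way the executable is invoked are assumptions. The idea is to drop the stray system("ss3") argument and the redundant .export from foreach() (variables referenced in the body are exported automatically, which is what the warning is about) and to give every iteration its own subdirectory so the workers never overwrite each other's starter or output files:
library(doParallel)
library(r4ss)

N <- 500
starter_strt <- SS_readstarter(file = "starter.ss")
main_dir <- getwd()
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

report_files <- foreach(iboot = 1:N, .packages = "r4ss") %dopar% {
  run_dir <- file.path(main_dir, paste0("boot_", iboot))  # per-iteration folder (assumption)
  dir.create(run_dir, showWarnings = FALSE)
  # ... copy the ss3 input files and BootData<iboot>.ss into run_dir here ...
  setwd(run_dir)                      # each worker is a separate R process, so this is safe

  starter <- starter_strt             # %dopar% exports this automatically; no .export needed
  starter$datfile <- paste("BootData", iboot, ".ss", sep = "")
  SS_writestarter(starter, overwrite = TRUE)

  system("ss3")                       # run the executable inside run_dir
  file.path(run_dir, "Report.sso")    # return the path to this run's report
}
stopCluster(cl)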

Run shell script in shiny

I think there is a problem in my Shiny script with executing a shell command, and I was wondering whether there is a way to run this command within Shiny.
Outside of Shiny, my code works.
# Function to calculate the MaxEnt score
calculate_maxent <- function(seq_vector, maxent_type){
  # Save the working directory from before
  Current_wkdir <- getwd()
  # First, change the working directory to the Perl script location
  setwd("S:/Arbeitsgruppen VIRO/AG_Sch/Johannes Ptok_JoPt/Rpackages/rnaSEQ/data/maxent")
  # Create a new text file with the sequences saved
  cat(seq_vector, file="Input_sequences", sep="\n", append=TRUE)
  # Execute the respective Perl script with the respective sequence file
  if(maxent_type == 3) cmd <- paste("score3.pl", "Input_sequences")
  if(maxent_type == 5) cmd <- paste("score5.pl", "Input_sequences")
  # Run the command and capture the calculated MaxEnt scores
  x <- shell(cmd, intern=TRUE)
  # Reset the working directory
  setwd(Current_wkdir)
  # Extract the scores from the MaxEnt score table
  x <- substr(x, (regexpr("\t", x)[[1]]+1), nchar(x))
  # Return the MaxEnt table
  return(x)
}
So basically I just try to execute the following code:
shell("score5.pl Input_sequences")
This does not seem to be possible that way within Shiny.
I don't know the shell() command well, but executing shell commands is possible via system(). It even uses the current working directory set by R.
So you might try:
x <- system(cmd, intern = TRUE)
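If system() alone does not run the .pl file (for example because the shell does not know to hand it to Perl), another option is to call the Perl interpreter explicitly via system2(). This is only a sketch; the helper name run_maxent and the assumption that perl is on the PATH are mine:
run_maxent <- function(input_file, maxent_type) {
  # pick the Perl script and capture its standard output as a character vector
  script <- if (maxent_type == 3) "score3.pl" else "score5.pl"
  system2("perl", args = c(script, input_file), stdout = TRUE)
}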

Calling external program in parallel using foreach and doSNOW: How to import results?

I'm using R to call an external program in parallel on a cluster with multiple nodes and multiple cores. The external program requires three input data files and produces one output file (all files are stored in the same subfolder).
To run the program in parallel (or rather call it in a parallel fashion) I've initially used the foreach function together with the doParallel library. This works fine as long as I'm just using multiple cores on a single node.
However, I wanted to use multiple nodes with multiple cores. Therefore I modified my code accordingly to use the doSNOW library in conjunction with foreach (I tried Rmpi and doMPI, but I did not manage to run the code on multiple nodes with those libraries).
This works fine, i.e. the external program is now indeed run on multiple nodes (with multiple cores), and the cluster logfile shows that it produces the required results. The problem I'm facing now, however, is that the external program no longer stores the results/output files on the master node/in the specified subfolder of the working directory (it did so when I was using doParallel). This makes it impossible for me to import the results into R.
Indeed, if I check the content of the relevant folder it does not contain any output files, despite the logfile clearly showing that the external program ran successfully. I guess they are stored on the different nodes (?).
What modifications do I have to make to either my foreach function or the way I set up my cluster, to get those files saved on the master node/in the specified subfolder in my working directory?
Here some example R code, to showcase, what I'm doing:
# #Set working directory in non-interactive mode
setwd(system("pwd", intern = T))
# #Load some libraries
library(foreach)
library(parallel)
library(doParallel)
# ####Parallel tasks####
# #Create doSNOW cluster for parallel tasks
library(doSNOW)
nCoresPerNode <- as.numeric(Sys.getenv("PBS_NUM_PPN"))
nodeNames <- system("cat $PBS_NODEFILE | uniq", intern=TRUE)
machines <- rep(nodeNames, each = nCoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
# #How many workers are we using?
getDoParWorkers()
#####DUMMY CODE#####
# #The following 3 lines of code are just dummy code:
# #The idea is to create input files for the external program "myprogram"
external_Command_Script.cmd # #command file necessary for external program "myprogram" to run
startdata # #some input data for "myprogram"
enddata # #additional input data for "myprogram"
####DUMMY CODE######
# #Write necessary command and data files for external program: THIS WORKS!
for(i in 1:100){
  write(external_Command_Script.cmd[[i]], file=paste("./mysubfolder/external_Command_Script.",i,".cmd", sep=""))
  write.table(startdata, file=paste("./mysubfolder/","startdata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
  write.table(enddata, file=paste("./mysubfolder/","enddata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
}
# #Run external program "myprogram" in parallel: THIS WORKS!
foreach(i = 1:100) %dopar% {
  system(paste('(cd ./mysubfolder && ', "myprogram", ' ', "enddata.", i, ".txt ", "startdata.", i, ".txt", ' < external_Command_Script.', i, '.cmd)', sep=""))
}
# #Import results of external program: THIS DOES NOT WORK WHEN RUN ON MULTIPLE NODES!
results <- list()
for(i in 1:100){
  results[[i]] = read.table(paste("./mysubfolder/","enddata.txt.",i,".log.txt", sep=""), sep = "\t", quote="\"", header = TRUE)
}
# #The import does NOT work as the files created by the external program are NOT stored on the master node/in the
# #subfolder of the working directory!
# #Instead I get the following error message:
# #sh: line 0: cd: ./mysubfolder: No such file or directory
# #Error in { : task 6 failed - "cannot open the connection"
My pbs script for the cluster looks something like this:
#!/bin/bash
# request resources:
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
module add languages/R-3.3.3-ATLAS
export PBS_O_WORKDIR="/panfs/panasas01/gely/xxxxxxx/workingdirectory"
# on compute node, change directory to 'submission directory':
cd $PBS_O_WORKDIR
# run your program and time it:
time Rscript ./R_script.R
I'd like to suggest that you look into the batchtools package. It provides methods for interacting with TORQUE / PBS from R.
If you're ok with using its predecessor BatchJobs for a while, I'd also recommend trying that, and when you understand how it works, look into the doFuture foreach adaptor. This will allow you to use the future.BatchJobs package. The combination of doFuture, future.BatchJobs, and BatchJobs allows you to do everything from within R, and you don't have to worry about creating temporary R scripts etc. (Disclaimer: I'm the author of both.)
Example of what it'll look like when you've got it set up:
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from BatchJobs
library("future.BatchJobs")
plan(batchjobs_torque)
and then you use:
res <- foreach(i = 1:100) %dopar% {
  my_function(pathname[i], arg1, arg2)
}
This will evaluate each iteration in a separate PBS job, i.e. you'll see 100 jobs added to the queue.
The future.BatchJobs vignettes have more examples and info.
UPDATE 2017-07-30: The future.batchtools package has been on CRAN since May 2017 and is now recommended over future.BatchJobs. Usage is very similar to the above; e.g. instead of plan(batchjobs_torque) you now use plan(batchtools_torque).
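A sketch of the same example with the newer package (only the plan() call changes):
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from batchtools
library("future.batchtools")
plan(batchtools_torque)

res <- foreach(i = 1:100) %dopar% {
  my_function(pathname[i], arg1, arg2)
}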
Problem solved:
I made a mistake: the external program is actually NOT running - I misinterpreted the log file. The reason the external program does not run is that the subfolder (containing the necessary input data) is not found. It seems the cluster defaults to the user's home directory instead of the working directory specified in the PBS submission script. This behaviour is different from clusters created with doParallel, which do recognize the working directory. The problem is therefore solved by adding the relative path to the working directory and subfolder in the R script, i.e. ./workingdirectory/mysubfolder/ instead of just ./mysubfolder/. Alternatively, you can use the full path to the folder.
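In code, the fix described above amounts to building the paths from the submission directory. This is a sketch; it assumes PBS_O_WORKDIR is exported by the PBS script, as in the example above:
work_dir <- Sys.getenv("PBS_O_WORKDIR")         # submission/working directory
sub_dir  <- file.path(work_dir, "mysubfolder")  # full path instead of ./mysubfolder

foreach(i = 1:100) %dopar% {
  system(paste0("(cd ", sub_dir, " && myprogram enddata.", i, ".txt startdata.", i,
                ".txt < external_Command_Script.", i, ".cmd)"))
}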

Set two paths to point to the same path

I am sharing most of my code with my colleague, and in doing so we have different root directories that we need to edit to run the code. For example, I am accessing all my files in:
/usethis/mypath/mydir/now_same/mapk/
and he is in:
/media/hispath/hisdir/now_same/mapk/
What I want to do is have any subsequent path that accesses files/subroutines in the code
point to my directory, i.e. /usethis/mypath/mydir/, without changing anything afterwards, i.e. /now_same/mapk/. So if he sends me code with /media/hispath/hisdir/now_same/mapk/, I just want to use it without changing anything in the code.
How do we do it?
Pass the working directory as an argument; see the example:
myScript.R
args <- commandArgs(trailingOnly = TRUE)
setwd(args[1])
# other code
# ...
# end of myScript.R
Now run the script with a custom working directory:
Rscript myScript.R path/to/my/directory
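With the root passed on the command line, every subsequent path in the shared code can stay relative to it; for example (somefile.txt is just a placeholder):
# inside myScript.R, after setwd(args[1])
dat <- read.table(file.path("now_same", "mapk", "somefile.txt"))
You would then call Rscript myScript.R /usethis/mypath/mydir and your colleague would call Rscript myScript.R /media/hispath/hisdir, without either of you touching the rest of the code.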

Copy script after executing

Is there a possibility to copy the executed lines, or basically the script itself, into the working directory?
My normal scenario is: I have stand-alone scripts which just need to be sourced within a working directory, and they will do everything I need.
After a few months, I make updates to these scripts, and I would love to have a snapshot of the script from when I executed the source...
So basically file.copy(ITSELF, '.') or something like this.
I think this is what you're looking for:
file.copy(sys.frame(1)$ofile,
          to = file.path(dirname(sys.frame(1)$ofile),
                         paste0(Sys.Date(), ".R")))
This will take the current file and copy it to a new file in the same directory, named after the current date, for example 2015-07-14.R.
If you want to copy to the working directory instead of the original script directory, use
file.copy(sys.frame(1)$ofile,
          to = file.path(getwd(),
                         paste0(Sys.Date(), ".R")))
Just note that sys.frame(1)$ofile only works if a saved script is sourced; trying to run it in the terminal will fail. It is worth mentioning, though, that this might not be the best practice. Perhaps looking into a version control system would be better.
Explanation:
TBH, I might not be the best person to explain this (I copied the idea from somewhere and use it sometimes), but I'll try. Basically, in order to have information about the script file, R needs to be running it as a file inside an environment that carries that information, and when that environment belongs to a source() call it contains the ofile entry. We use (1) to select the next environment (source()'s) following the global environment (which is 0). When you're running this from the terminal, there's no frame/environment other than the global one (that's what the error message says), since no file is being run - the commands are sent straight to the terminal.
To illustrate that, we can do a simple test:
> sys.frame(1)
Error in sys.frame(1) : not that many frames on the stack
But if we call that from another function:
> myf <- function() sys.frame(1)
> myf()
<environment: 0x0000000013ad7638>
Our function's environment doesn't have anything in it, so it exists but, in this case, does not have ofile:
> myf <- function() names(sys.frame(1))
> myf()
character(0)
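For completeness, a tiny sourced example shows where ofile comes from; the path C:/temp/myscript.R is just a placeholder:
## myscript.R contains a single line:
cat("This file lives at:", sys.frame(1)$ofile, "\n")

## From an interactive session:
## > source("C:/temp/myscript.R")
## This file lives at: C:/temp/myscript.R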
I just wanted to add my solution, since I decided to wrap the copy command in try() before executing it... because I have the feeling I'd otherwise be missing some error handling...
try({
  script_name <- sys.frame(1)$ofile
  copy_script_name <-
    paste0(sub('\\.R', '', basename(script_name)),
           '_',
           format(Sys.time(), '%Y%m%d%H%M%S'),
           '.R')
  file.copy(script_name, copy_script_name)
})
This will copy the script into the current directory and also add a timestamp to the filename. In case something goes wrong, the rest of the script will still execute.
I originally posted this in another thread, and I think it addresses your problem: https://stackoverflow.com/a/62781925/9076267
In my case, I needed a way to copy the executing file to back up the original script together with its outputs. This is relatively important in research.
What worked for me while running my script on the command line was a mixture of other solutions presented here, which looks like this:
library(scriptName)
file_dir <- paste0(gsub("\\", "/", fileSnapshot()$path, fixed=TRUE))
file.copy(from = file.path(file_dir, scriptName::current_filename()),
          to = file.path(getwd(), scriptName::current_filename()))
Alternatively, one can add the date and hour to the file name, to help distinguish that file from the source, like this:
file.copy(from = file.path(file_dir, current_filename()),
          to = file.path(getwd(), subDir, paste0(current_filename(), "_", Sys.time(), ".R")))
