Running Plink from within R

I'm trying to use a Windows computer to SSH into a Mac server, run a program, and transfer the output data back to my Windows machine. I've been able to do this manually using PuTTY.
Now, I'm attempting to automate the process using Plink. I've added Plink to my Windows PATH, so if I open cmd and type in a command, I can successfully log in and pass commands to the server.
However, I'd like to automate this using R, to streamline the data analysis process. Based on some searching, the internet seems to think the shell command is best suited to this task. Unfortunately, it doesn't seem to find Plink, even though passing other commands through shell to the terminal works.
If I try the same thing but set the path to Plink manually in the shell call, no output is returned and the commands do not seem to run (e.g. TESTFOLDER is not created).
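For reference, the kind of call I'm attempting looks roughly like this (the Plink path, the user/host, and the folder name here are placeholders, not my actual setup):
## rough sketch only -- path, user@host, and folder name are placeholders
shell('"C:\\Program Files\\PuTTY\\plink.exe" -ssh user@host "mkdir TESTFOLDER"')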
Does anyone have any ideas for why Plink is unavailable when I try to call it from R? Alternatively, if there are other ideas for how this could be accomplished in R, those would also be appreciated.
Thanks in advance,
-sam

I came here looking for an answer to this question, so I only have so much to offer, but I think I managed to get PLINK's initial steps to work in R using the shell function...
This is what worked for me:
NOT in R:
Install PLINK and add its location to your PATH.
Download the example files from PLINK's tutorial (http://pngu.mgh.harvard.edu/~purcell/plink/tutorial.shtml) and put them in a folder whose path contains NO spaces (unless you know something I don't, in which case, space it up).
Then, in R:
## Set your working directory as the path to the PLINK program files: ##
setwd("C:/Program Files/plink-1.07-dos")
## Use shell to check that you are now in the right directory: ##
shell("cd")
## At this point, the command "plink" should at least be recognized
# (though you may get a different error)
shell("plink")
## Open the PLINK example files ##
# FYI mine are in "C:/PLINK/", so replace that accordingly...
shell("plink --file C:\\PLINK\\hapmap1")
## Make a binary PED file ##
# (provide the full path, not just the file name)
shell("plink --file C:\\PLINK\\hapmap1 --make-bed --out C:\\PLINK\\hapmap1")
... and so on.
That's all I've done so far. But with any luck, mirroring the structure and general format of those lines of code should allow you to do what you like with PLINK from within R.
Hope that helps!
PS. The PLINK output should just print in your R console when you run the lines above.
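PPS. If you'd rather capture that output in an R object instead of just seeing it printed, I believe shell's intern argument will do it (untested on my end):
## capture PLINK's console output as a character vector instead of printing it
plink_out <- shell("plink --file C:\\PLINK\\hapmap1", intern = TRUE)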
All the best,
- CC.

Just saw Caitlin's response and it reminded me I hadn't ever updated with my solution. My approach was kind of a workaround, rather than solving my specific problem, but it may be useful to others.
After adding Plink to my PATH, I created a batch script in Windows containing all my Plink commands, then called the batch script from R using the shell command:
So, in R:
shell('BatchScript.bat')
The batch script contained all my commands that I wanted to use in Plink:
:: transfer file to phosphorus
pscp C:\Users\Sam\...\file zipper@144.**.**.208:/home/zipper/
:: open connection to Dolphin using plink
plink -ssh zipper@144.**.**.208 Batch_Script_With_Remote_Machine_Commands.bat
:: transfer output back to local machine
pscp zipper@144.**.**.208:/home/zipper/output/ C:\Users\Sam\..\output\
Hope that helps someone!

Related

SparkR: source() other R files in an R Script when running spark-submit

I'm new to Spark and newer to R, and am trying to figure out how to 'include' other R-scripts when running spark-submit.
Say I have the following R script which "sources" another R script:
main.R
source("sub/fun.R")
mult(4, 2)
The second R script, which lives in a subdirectory "sub", looks like this:
sub/fun.R
mult <- function(x, y) {
  x * y
}
I can invoke this with Rscript and successfully get this to work.
Rscript main.R
[1] 8
However, I want to run this with Spark, and use spark-submit. When I run spark-submit, I need to be able to set the current working directory on the Spark workers to the directory which contains the main.R script, so that the Spark/R worker process will be able to find the "sourced" file in the "sub" subdirectory. (Note: I plan to have a shared filesystem between the Spark workers, so that all workers will have access to the files).
How can I set the current working directory that SparkR executes in such that it can discover any included (sourced) scripts?
Or, is there a flag/sparkconfig to spark-submit to set the current working directory of the worker process that I can point at the directory containing the R Scripts?
Or, does R have an environment variable that I can set to add an entry to the "R-PATH" (forgive me if no such thing exists in R)?
Or, am I able to use the --files flag to spark-submit to include these additional R-files, and if so, how?
Or is there generally a better way to include R scripts when run with spark-submit?
In summary, I'm looking for a way to include files with spark-submit and R.
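For example, if --files is the right tool, I imagine the invocation would look something like the sketch below, with main.R changed to source("fun.R"), since (as I understand it) --files drops files into the working directory without the sub/ prefix. I haven't been able to verify this:
spark-submit --files sub/fun.R main.R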
Thanks for reading. Any thoughts are much appreciated.

Need help getting bash/batch to work for R on Windows 10

I'm trying to get batch mode working for R on Windows 10. The ultimate goal is to run many iterations of some R code in batch on an external server.
I successfully installed bash (the Unix wrapper for Windows 10?) from my cmd prompt. I am working through a tutorial on using batch mode, and I'm not sure whether I want this to run through cmd or through the R code directly: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/02-batch-mode.Rmd
Following the tutorial, I am testing batch/bash with a simple script, myscript1.R. The command I enter into the cmd prompt/bash looks like:
R CMD BATCH "F:/Google Drive/Documents/batch/myscript1.R" "F:/Google Drive/Documents/batch/myscript1-output.R"
Currently, the closest I get in cmd/bash is that an output file is created in the right folder, but it is blank and I am told: \usr\lib\R\bin\BATCH: cannot create myscript1-output.R: Permission denied.
I have done everything possible to allow full permissions to all users and am not sure what is going on. Can anyone who knows how to use batch mode or bash for R on Windows advise me?
Thank you
Answer here thanks to Phil. I did not need to use "Bash" through Ubuntu... I think.
Instead, I just needed to call CMD BATCH through regular cmd with the following parts:
the path to my R.exe: "C:\Program Files\R\R-3.5.3\bin\x64\R.exe" (replace with your version)
CMD BATCH
the path to the project file/script: "F:\project_folders_batch\myscript1.R"
the path for the desired output (so it doesn't default to C:/Users/username). In this case, I output to the same folder as the script: "F:\project_folders_batch\myscript1-output.R"
Also, in case you are outputting plots or anything (I was), go ahead and cd (change directory) to the project folder before you do this. Final result in 2 steps:
cd /d "F:\projectfolder\batch"
"C:\Program Files\R\R-3.5.3\bin\x64\R.exe" CMD BATCH "F:\projectfolder\batch\myscript1.R" "F:\projectfolder\batch\myscript1-output.R"
Also mind your antivirus... it blocked access a few times.

Why do which and Sys.which return different paths?

I tried to run a Python script from R with:
system('python script.py arg1 arg2')
And got an error:
ImportError: No module named pandas
This was a bit of a surprise since the script was working from the terminal as expected. Having encountered this type of issue before (with knitr, whence the engine.path chunk option), I know to check:
Sys.which('python')
# python
# "/usr/bin/python"
And compare it to the command line:
$ which python
# /Users/michael.chirico/anaconda2/bin/python
(i.e., the error arises because I have pandas installed for the anaconda distribution, though TBH I don't know why I have a different distribution)
Hence I can fix my issue by running:
system('/Users/michael.chirico/anaconda2/bin/python script.py arg1 arg2')
My question is two-fold:
How does R's system/Sys.which find a different python than my terminal?
How can I fix this besides writing out the full binary path each time?
I read ?Sys.which for some hints, but to no avail. In particular, ?Sys.which suggests Sys.which is using which:
This is an interface to the system command which
This is clearly (?) untrue; to be sure, I checked Sys.which('which') and which which to confirm both are pointing to /usr/bin/which (goaded on by this tidbit):
On a Unix-alike the full path to which (usually /usr/bin/which) is found when R is installed.
To the latter, on a whim I tried Sys.setenv(python = '/Users/michael.chirico/anaconda2/bin/python') to no avail.
As some of the comments hint, this is a problem that arises because the PATH environment variable is different for programs launched by Finder (or the Dock) than it is in the Terminal. There are ways to set the PATH for Dock-launched applications, but they aren't pretty. Here's a place to start looking if you want to go that route:
https://apple.stackexchange.com/questions/51677/how-to-set-path-for-finder-launched-applications
The other thing you can do, which is probably more straightforward, is tell R to set the PATH variable when it starts up, using Sys.setenv to add the path to your desired Python instance. You can do that for just one project, for your whole user account, or for the whole system, by placing the command in a .Rprofile file in the corresponding location. More information on how to do this here:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html
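For example, a minimal sketch of what that .Rprofile entry could look like (the anaconda path is the one from the question; substitute your own):
## prepend the desired Python's directory to PATH when R starts
Sys.setenv(PATH = paste("/Users/michael.chirico/anaconda2/bin",
                        Sys.getenv("PATH"), sep = ":"))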

How do I pass the environment of one R script to another?

I'm effectively trying to tack save.image() onto the end of a script without modifying that script.
I was hoping something like Rscript target_script.R | saveR.R destination_path would work, where saveR.R reads,
args.from.usr <- commandArgs(TRUE)
setwd(args.from.usr[1])
save.image(file=".RData")
But that clearly does not work. Any alternatives?
You can write an R script file that takes two parameters: 1, the script file you want to run, and 2, the directory where you want to save the image.
# runAndSave.R ------
args.from.usr <- commandArgs(trailingOnly=TRUE)  # [1] script to run, [2] destination directory
source(args.from.usr[1])                         # run the target script in this session
setwd(args.from.usr[2])                          # move to the destination directory
save.image(file=".RData")                        # save the resulting workspace there
And then run it with
Rscript runAndSave.R target_script.R destination_path
You could schedule a task within the OS of that computer. On Linux you would use the terminal and a tool called cron; on Windows you can use Task Scheduler. If you schedule the OS to open a terminal, run the script, and then save the image, you may get what you need: the data generated by the script is saved without the script itself being modified.
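For example, on Linux a crontab entry along these lines (paths are hypothetical, and you may need the full path to Rscript) would run the wrapper from the answer above every night at 02:00:
# run the wrapper script nightly at 02:00, saving the image to the destination directory
0 2 * * * Rscript /home/user/runAndSave.R /home/user/target_script.R /home/user/destination_path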

When using mpirun with R script, should I copy manually file/script on clusters?

I'm trying to understand how openmpi/mpirun handles the script file associated with an external program, here an R process (doMPI/Rmpi).
I can't imagine that I have to copy my script onto each host before running something like:
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R
But apparently it doesn't work until I copy the script 'file.R' onto the cluster nodes and then run mpirun. And when I do this, the results are written on the cluster nodes, whereas I expected them to be returned to the working directory on localhost.
Is there another way to send an R job from localhost to multiple hosts, including the script to be evaluated?
Thanks!
I don't think it's surprising that mpirun doesn't know details of how scripts are specified to commands such as "R", but the Open MPI version of mpirun does include the --preload-files option to help in such situations:
--preload-files <files>
Preload the comma separated list of files to the current working
directory of the remote machines where processes will be
launched prior to starting those processes.
Unfortunately, I couldn't get it to work, which may be because I misunderstood something, but I suspect it isn't well tested: few people use that option, since doing parallel computing without a distributed file system is quite painful.
If --preload-files doesn't work for you either, I suggest that you write a little script that calls scp repeatedly to copy the script to the cluster nodes (a rough sketch is below). There are some utilities that do that, but none seem to be very common or popular, which again I think is because most people prefer to use a distributed file system. Another option is to set up an sshfs file system.
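As a rough illustration of the "call scp repeatedly" idea (the host names, user, and paths are placeholders taken loosely from your mpirun line):
#!/bin/sh
# copy the R script to the same location on every cluster node before calling mpirun
for host in clust1 clust2; do
    scp file.R randy@"$host":/home/randy/
done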
