How can I copy data over to Amazon EC2 and run a script? - r

I am a novice at cloud computing, but I get the concept and am pretty good at following instructions. I'd like to do some simulations on my data, and each step takes several minutes. Given the hierarchy in my data, it takes several hours for each set. I'd like to speed this up by running it on Amazon's EC2 cloud.
After reading this, I know how to launch an AMI, connect to it via the shell, and launch R at the command prompt.
What I'd like help on is being able to copy data (.rdata files) and a script and just source it at the R command prompt. Then, once all the results are written to new .rdata files, I'd like to copy them back to my local machine.
How do I do this?

I don't know much about R, but I do similar things with other languages. What I suggest should give you some ideas.
Set up an FTP server on your local machine.
Create a "startup script" that you launch with your instance.
Have the startup script download the R files from your local machine, initialize R and do the calculations, then upload the new files back to your machine.
Startup script:
#!/bin/bash
set -e -x
# curl is needed for the upload; add any other packages your R code needs
apt-get update && apt-get install -y curl
# fetch the script/data from the FTP server on your local machine
wget -O /mnt/data_old.R ftp://yourlocalmachine:21/r_files
# run the script non-interactively; it should save its results to /mnt/data_new.R
R CMD BATCH /mnt/data_old.R /mnt/data_old.Rout
# upload the results back to your local machine
/usr/bin/curl -T /mnt/data_new.R -u user:pass ftp://yourlocalmachine:21/new_r_files
Start the instance with the startup script:
ec2-run-instances --key KEYPAIR --user-data-file my_start_up_script ami-xxxxxx

First, I'd use Amazon S3 for storing the files, both on the way in from your local machine and on the way back from the instance.
As stated before, you can create startup scripts, or even bundle your own customized AMI with all the needed settings and run your instances from it.
So: download the files from a bucket in S3, execute and process them, and finally upload the results back to the same or a different bucket in S3.
Assuming the data is small (scripts rarely get very big), S3's cost and usability make this very effective.
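A minimal sketch of that round trip, assuming the AWS CLI is installed and configured on both machines (the bucket and file names below are placeholders):
# on your local machine: push the input data and the script
aws s3 cp sims.rdata s3://my-sim-bucket/input/sims.rdata
aws s3 cp run_sims.R s3://my-sim-bucket/input/run_sims.R
# on the EC2 instance: pull, run, push the results back
aws s3 cp s3://my-sim-bucket/input/sims.rdata .
aws s3 cp s3://my-sim-bucket/input/run_sims.R .
R CMD BATCH run_sims.R
aws s3 cp results.rdata s3://my-sim-bucket/output/results.rdata
# back on your local machine: fetch the results
aws s3 cp s3://my-sim-bucket/output/results.rdata .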

Related

R Automate the execution of a script on Linux

I would like to run an R script on a Linux server (CentOS) in an automated way. This should be done once a day (if possible several times a day). I would like to download stock prices using R (and later enter them into a database).
For example, the R script looks like this:
library(tidyquant)
library(lubridate)
data <- tq_get("AAPL", from = "2021-01-01", to = today())
How should I write a job so that I can run the script automatically within a certain interval?
Can anyone help me?
Many thanks in advance!
You might want to create a service; which type (systemd or an init daemon) depends on your CentOS version.
Full information on timer units and how they work is here.
A simple tutorial on how to create services is here.
This lets you create a service with the desired conditions and run your application/script.
Service example:
Services are located in /etc/systemd/system/.
For example, create the file: sudo touch /etc/systemd/system/updatestockdb.service
Then open it and write your service: sudo vim /etc/systemd/system/updatestockdb.service
/etc/systemd/system/updatestockdb.service:
[Unit]
Description=Update stock price DB
[Service]
Type=simple
ExecStart=/opt/scripts/fetch_Stonks.sh --full --to-external
Restart=on-failure
PIDFile=/tmp/yourservice.pid
[Install]
WantedBy=multi-user.target
The scheduling directives go in a companion timer unit, /etc/systemd/system/updatestockdb.timer:
[Unit]
Description=Run updatestockdb daily
[Timer]
OnCalendar=daily
AccuracySec=12h
Persistent=true
[Install]
WantedBy=timers.target
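Once both unit files are in place, a minimal sketch of wiring them up (assuming the unit names used above):
sudo systemctl daemon-reload
sudo systemctl enable --now updatestockdb.timer    # enable and start the timer
systemctl list-timers updatestockdb.timer          # confirm the next scheduled run
journalctl -u updatestockdb.service                # inspect the logs of past runs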

Need help getting bash/batch to work for R on Windows 10

I'm trying to get batch mode working for R on Windows 10. The ultimate goal is to run many iterations of some R code in batch on an external server.
I successfully installed bash (the Unix wrapper for Windows 10?) in my cmd prompt. I am working through a tutorial on using batch mode. I'm not sure if I want this to run through cmd or through the R code directly? https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/02-batch-mode.Rmd
Via the tutorial I am testing batch/bash with a simple script, myscript1.R. The code I enter into the cmd prompt/bash looks like:
R CMD BATCH "F:/Google Drive/Documents/batch/myscript1.R" "F:/Google Drive/Documents/batch/myscript1-output.R"
Currently the closest I get in cmd/bash is that an output file is created in the right folder, but it is blank and I am told: \usr\lib\R\bin\BATCH: cannot create myscript1-output.R: Permission denied.
I have done everything possible to allow full permissions to all users and am not sure what is going on. Can anyone who knows how to use batch mode or bash with R for Windows advise me?
Thank you
Answered here thanks to Phil. I did not need to use "bash" through Ubuntu... I think.
Instead, I just needed to call CMD BATCH through the regular cmd prompt with the following parts:
the path to my R.exe: "C:\Program Files\R\R-3.5.3\bin\x64\R.exe" (replace with your version)
CMD BATCH
the path to the project file/script: "F:\project_folders_batch\myscript1.R"
the path of the desired output (so it doesn't default to C:/Users/username). In this case, I output to the same folder as the script: "F:\project_folders_batch\myscript1-output.R"
Also, in case you are outputting plots or anything (I was), go ahead and cd (change directory) to the project folder before you do this. Final result, in two steps:
cd /d "F:\projectfolder\batch"
"C:\Program Files\R\R-3.5.3\bin\x64\R.exe" CMD BATCH "F:\projectfolder\batch\myscript1.R" "F:\projectfolder\batch\myscript1-output.R"
Also mind your antivirus... it blocked access a few times.
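Since the ultimate goal is to run many iterations in batch on an external server, here is a minimal sketch of a loop around R CMD BATCH in bash; myscript1.R is the script from above, and having it read the iteration number via commandArgs() is an assumption on my part:
#!/bin/bash
# run 100 iterations, each with its own transcript file
for i in $(seq 1 100); do
    R CMD BATCH --no-save "--args iter=$i" myscript1.R "myscript1-output-$i.Rout"
done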

How to monitor a directory and include new files with tail -f in CentOS (for shiny-server logs in Docker)

Because I need to direct shiny-server logs to stdout so that "docker logs" (and monitoring utilities relying on it) can see them, I'm trying to do something like:
tail -f <logs_directory>/*
This works as needed as long as no new files are added to the directory; the problem is that shiny-server dynamically creates files in this directory, and those need to be picked up automatically.
I found that other users have solved this via the xtail package; the problem is that I'm using CentOS and xtail is not available for it.
The question is: is there any "clean" way of doing this with the standard tail command, without needing xtail? Or is there an equivalent to xtail for CentOS?
You will probably find it easier to use the docker run -v option to mount a host directory into the container and write the logs there. Then you can use any tool that collects log files out of a directory (logstash is popular, but far from the only option).
This also avoids the problem of having to both run your program and a log collector inside your container; you can just run the service as the main container process, and not have to do gymnastics with tail and supervisord and whatever else to try to keep everything running.
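A minimal sketch of that approach, assuming the stock rocker/shiny image and its default log directory /var/log/shiny-server (swap in your own image and host path):
# mount a host directory over the container's log directory
docker run -d -p 3838:3838 \
    -v /srv/shiny-logs:/var/log/shiny-server \
    rocker/shiny
# the host can now watch the files with any tool, e.g.
tail -f /srv/shiny-logs/*.log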

Run an R script using the Docker Kaggle image

I am trying to reproduce the results of an R script on my local Windows OS (the results it gave on the Kaggle server). For this, someone suggested using Docker images to run the R script locally.
I have installed Docker and finished the setup steps by following the instructions given here: https://docs.docker.com/windows/step_one/
After installing, I am struggling with how to create the Kaggle R image and run an R script locally using local resources/data. Can someone please help me with this?
You can pull the already built rstats image from Docker Hub and run it:
docker run kaggle/rstats
To use your local data, create a volume:
docker run -v /your/local/data/path:/path/in/container kaggle/rstats
A volume binds your local storage to the container's storage. You can also create an additional volume for output data.
The last line in the rstats Dockerfile is
CMD ["R"]
It means that the R console will start when the container starts. Just paste your script into the terminal (the script should read data from the mounted input volume in the container and write its results to the mounted output volume). After the script finishes, you can stop the container. Your output data will be saved on your local machine.
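If you'd rather not paste the script interactively, here is a sketch of running it non-interactively instead (the host paths and script name are placeholders; adjust them for how your Docker install maps Windows drives):
docker run --rm \
    -v /c/Users/me/project:/input \
    -v /c/Users/me/project/out:/output \
    kaggle/rstats \
    Rscript /input/my_script.R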
P.S. The image is giant (6 GB). I have never seen such a large Docker image before.

When using mpirun with an R script, do I have to copy the file/script to the cluster nodes manually?

I'm trying to understand how Open MPI/mpirun handles a script file associated with an external program, here an R process (doMPI/Rmpi).
I can't imagine that I have to copy my script onto each host before running something like:
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R
But apparently it doesn't work until I copy the script 'file.R' onto the cluster nodes and then run mpirun. When I do this, the results are written on the cluster nodes, but I expected them to be returned to the working directory on localhost.
Is there another way to send an R job from localhost to multiple hosts, including the script to be evaluated?
Thanks!
I don't think it's surprising that mpirun doesn't know details of how scripts are specified to commands such as "R", but the Open MPI version of mpirun does include the --preload-files option to help in such situations:
--preload-files <files>
Preload the comma separated list of files to the current working
directory of the remote machines where processes will be
launched prior to starting those processes.
Unfortunately, I couldn't get it to work. That may be because I misunderstood something, but I suspect the option isn't well tested, since few people use it: doing parallel computing without a distributed file system is quite painful.
If --preload-files doesn't work for you either, I suggest writing a little script that calls scp repeatedly to copy the script to the cluster nodes. There are utilities that do this, but none seem to be very common or popular, which again I think is because most people prefer a distributed file system. Another option is to set up an sshfs file system.
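A minimal sketch of such a script, assuming passwordless SSH to the nodes, that the current working directory already exists on each node, and that file.R saves its results to results.rdata (a name I made up):
#!/bin/bash
# copy the script to every node, run the MPI job, then fetch the results back
for host in clust1 clust2; do
    scp file.R "$host:$(pwd)/file.R"
done
mpirun --prefix /home/randy/openmpi -H clust1,clust2 -n 32 R --slave -f file.R
for host in clust1 clust2; do
    scp "$host:$(pwd)/results.rdata" "./results-$host.rdata"
done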
