Run R script using Docker Kaggle image

I am trying to reproduce, on my local Windows machine, the results that an R script gave on the Kaggle server. For this, someone suggested using Docker images to run the R script locally.
I have installed Docker and set it up by following the instructions here: https://docs.docker.com/windows/step_one/
After installing, I am struggling with how to create the Kaggle R image and run an R script locally using local resources/data. Can someone please help me with this?

You can run the already built kaggle/rstats image from Docker Hub:
docker run kaggle/rstats
To use your local data you should create a volume:
docker run -v /your/local/data/path:/path/in/docker/container kaggle/rstats
A volume binds your local storage to the container's storage. You can also create an additional volume for output data.
The last line in the rstats Dockerfile is
CMD ["R"]
This means the R console will start when the container starts. Just paste your script into the terminal (the script should read data from the mounted volume inside the container and write results to the mounted output volume). After the script finishes you can stop the container; your output data will remain on your local machine.
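Putting the pieces together, a minimal sketch might look like this (the host paths and the script name are placeholders, not taken from the question; on Docker for Windows the host path may need to be written in the /c/Users/... form):
# mount one folder for input data and one for results
docker run -it --rm -v /c/Users/me/data:/input -v /c/Users/me/results:/output kaggle/rstats
# inside the container's R console, read from /input and write to /output,
# e.g. source("/input/my_script.R")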
P.S. The image is giant (6 GB). I have never seen such a large Docker image before.

Related

Getting RStudio in Docker Environment Command Window as opposed to browser?

This is a question about using Docker to run R scripts. The problem I'm having is that the person evaluating my results wants to be able to type ./test.sh in the command window, have that run my R script, and have a CSV of my results written to the local directory.
Is it possible to get the rocker RStudio image to run in the command window instead of having to log into RStudio through the browser? All the resources online seem to say something along the lines of adding -p 8002:8787 and -d to the docker command, but that means you have to go to a local browser to actually run your R script.
I've found that this code snippet works, but is there an alternative to using /bin/bash at the end so that the RStudio commands can just stay in the window?
$ docker run -e PASSWORD=MYPASSWORD -v "$(pwd):/data:ro" -v "$(pwd):/workdir" -it thatguy /bin/bash
Or, better yet, is there a way to put this docker run command in my Dockerfile so that when I run docker build this line just runs automatically?
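Not an authoritative answer, but one sketch of the workflow described above is a small wrapper that skips RStudio entirely and runs the script non-interactively with Rscript, so the CSV lands in the directory you launched from (thatguy and Script.R are placeholders for your image and script):
#!/bin/sh
# test.sh -- run the analysis head-less; results are written to the current directory
docker run --rm -v "$(pwd):/workdir" -w /workdir thatguy Rscript Script.R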

How do I find where a file is located inside container?

I'm trying to give special permissions to a file located inside a container but I'm getting a "No such file or directory" error message.
The Dockerfile basically runs an R script that generates an output.pptx file inside an output folder created in the container.
I want to send that output to an S3 bucket, but for some reason it isn't finding the file inside the container.
# Make the output directory
RUN mkdir output
# Process main file
CMD ["RScript", "Script.R"]
# install AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
RUN unzip awscliv2.zip
RUN ./aws/install -i /usr/local/bin/aws -b /usr/local/bin
# run AWS CLI to push output file to s3 folder
RUN chmod 775 ./output/*.pptx
RUN aws s3 cp ./output/*.pptx s3://bucket
Could this be related to the path I'm using for the file?
I get the idea that there is a misunderstanding of how the image should be used. That is, a Dockerfile creates an image, and the CMD is not actually run when building the image.
Up front:
an image is really just a tarball of filesystems; there are multiple "layers" reflecting the steps of the build process (which can be squashed); an image has no "running" component, and no processes are active in an image; and
a container is an image that is in a running state. It might be the CMD you specify, or it might be something else (e.g., docker run -it --rm myimage /bin/bash to run a bash shell with the container as the filesystem/environment). When the running command finishes and exits, the container is stopped.
Typically, you create an image "once" (security updates and such notwithstanding), and then run it as needed (either manually or via some trigger, e.g., cron or CI triggers). That might look like
docker run --rm myimage # using the default `CMD`
docker run --rm myimage Rscript Script.R # specifying the command manually
with a couple assumptions:
the image has R (and Rscript) installed and on the PATH ... though you could specify the full path to the binary instead; and
the default working directory (the Dockerfile's WORKDIR directive) contains Script.R ... or you can specify the full internal path to Script.R, as in the example after this list
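If either assumption does not hold, you can be explicit on the command line, for example (the paths are illustrative only):
# override the working directory and give the full in-container path to the script
docker run --rm -w /workdir myimage Rscript /workdir/Script.R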
With this workflow, I suggest a few changes:
from the Dockerfile, remove the lines after # run AWS CLI;
add to Script.R steps to copy the file to your S3 bucket, either using the awscli you installed in the image, or by using the aws.s3 R package (which might preclude the need to install the awscli);
I don't use AWS S3, but I suspect you need credentials to be able to access the bucket. There are many ways of dealing with images and "secrets" like S3 credentials; the most naïve approaches involve hard-coding the credentials into the container, which is a security risk, while others involve "docker secrets" or environment variables. For instance,
docker run -it --rm -e "S3TOKEN=asldkfjlaskdf" myimage
though even that might be intercepted by neighboring users on the docker host.
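One marginally safer sketch (assuming the credentials are already exported in your shell): pass just the variable names, so docker forwards the values from your environment and the secret does not appear in the command line itself. The AWS CLI, and I believe the aws.s3 package, read these standard variables:
docker run -it --rm -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION myimage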

Use crontab to automate R script

I am attempting to automate an R script using RStudio Server on an EC2 machine.
The R script works without errors. I then navigated to the terminal in RStudio Server and ran the R script with the command Rscript "Rfilename", and it works.
At this point I created a shell script containing the command above. This also runs fine: sh "shellfilename".
But when I try to schedule the shell command using crontab, it does not produce any result. I am using the following cron entry:
* * * * * /usr/bin/sh ./shellfilename.sh
I am using cron for the first time and need help debugging what is going wrong. My intuition is that there is a difference between the environment the command sees when I run it in the terminal and when it runs from crontab. In case it is relevant: I am doing all of this from a user account created for myself on this machine, so it differs from the admin account.
Can someone help resolve this issue? Thanks!
The issue arose from relative paths used in the script for importing files and objects. Changing these to absolute paths resolved the issue.
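For reference, a cron entry along these lines (paths are examples, not from the question) avoids the relative-path trap and logs everything, which makes debugging much easier:
# cd to an absolute working directory, use absolute paths, and capture all output
* * * * * cd /home/myuser/project && /usr/bin/sh /home/myuser/project/shellfilename.sh >> /home/myuser/project/cron.log 2>&1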

Let Docker image build fail when R package installation returns error

I am trying to create a custom Docker image based on Rocker using Dockerfile. In the Dockerfile I am pulling my own R package from a custom GitLab server using:
RUN R -e "devtools::install_git('[custom gitlab server]', quiet = FALSE)"
Everything usually works, but I have noticed that when the GitLab server is down, or the machine running Docker is low on RAM memory, the package does not install correctly and returns an error message in the R console. This behavior is to be expected. However, Docker does not notice the error produced by R and continues evaluating the rest of the Dockerfile. I would like Docker to fail building the image when this occurs. In that way, I could ultimately prevent automatic deployment of the incomplete Docker container by Kubernetes.
So far I have thought of two potential solutions, but I am struggling with the execution:
R level: Wrap tryCatch() around devtools::install_git to catch the error. But then what? Use stop? Will this cause the Docker building process to stop as well? Could withCallingHandlers() be used?
Dockerfile level: Use a shell command to check for errors? I cannot check the contents of R --help as I do not have a Linux machine at the moment, so I am not sure what R -e actually does (execute, I presume) and which other options could be passed along with R.
It seems that a similar issue is discussed here and here, but I do not understand how they solved it.
Thus how to make sure no Docker image ends up running on the Kubernetes cluster without the custom package?
The Docker build process should stop once one of the commands in the Dockerfile returns a non-zero status.
install_git doesn't seem to throw an error when the package isn't installed successfully, so execution keeps going.
An obvious way to go would be to wrap the installation inside a dedicated R script and throw an error if it didn't finish successfully, which would then stop the build.
So I would suggest something like this ...
Create installation script install_gitlab.R:
### file install_gitlab.R
## change repo- and package name!!
repo <- '[custom gitlab server]'
pkgname <- 'testpackage'
devtools::install_git(repo, quiet = FALSE)
stopifnot(pkgname %in% installed.packages()[,'Package'])
Modify your Dockerfile accordingly (replace the install_git line):
...
ADD install_gitlab.R /runscripts/install_gitlab.R
RUN Rscript /runscripts/install_gitlab.R
...
One thing to keep in mind: this approach assumes the package you're trying to install is NOT already installed before the command is called.
If you're using a rocker image, they already have the littler package installed, which has the handy installGithub.r script. I believe it should already have the functionality you want. If not, it at least simplifies the running of the custom install_github.r script.
A docker RUN command using littler just looks like:
RUN installGithub.r "yourRepo"

How can I copy data over the Amazon's EC2 and run a script?

I am a novice as far as cloud computing goes, but I get the concept and am pretty good at following instructions. I'd like to do some simulations on my data, and each step takes several minutes. Given the hierarchy in my data, it takes several hours for each set. I'd like to speed this up by running it on Amazon's EC2 cloud.
After reading this, I know how to launch an AMI, connect to it via the shell, and launch R at the command prompt.
What I'd like help on is being able to copy data (.rdata files) and a script and just source it at the R command prompt. Then, once all the results are written to new .rdata files, I'd like to copy them back to my local machine.
How do I do this?
I don't know much about R, but I do similar things with other languages. What I suggest would probably give you some ideas.
Set up an FTP server on your local machine.
Create a "startup script" that you launch with your instance.
Let the startup script download the R files from your local machine, initialize R and do the calculations, then upload the new files back to your machine.
Startup script:
#!/bin/bash
set -e -x
# install curl plus whatever packages your script needs
apt-get update && apt-get install -y curl
# fetch the R file(s) from the FTP server on your local machine
wget -O /mnt/data_old.R ftp://yourlocalmachine:21/r_files
# run the script non-interactively; output goes to /mnt/data_new.Rout
R CMD BATCH /mnt/data_old.R /mnt/data_new.Rout
# push the result back to your local machine
/usr/bin/curl -T /mnt/data_new.Rout -u user:pass ftp://yourlocalmachine:21/new_r_files
Start instance with a startup script
ec2-run-instances --key KEYPAIR --user-data-file my_start_up_script ami-xxxxxx
First, I'd use Amazon S3 for storing the files, both from your local machine and back from the instance.
As stated before, you can create startup scripts, or even bundle your own customized AMI with all the needed settings and run your instances from it.
So: download the files from a bucket in S3, execute and process, and finally upload the results back to the same or a different bucket in S3.
Assuming the data is small (scripts can only be so big), S3's cost and usability would be very effective.
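A rough sketch of that S3 round trip from inside the instance, assuming the AWS CLI is installed and configured there (bucket and file names are placeholders):
# pull the data and the script from S3, run the script, push the results back
aws s3 cp s3://my-bucket/input.rdata /mnt/input.rdata
aws s3 cp s3://my-bucket/script.R /mnt/script.R
Rscript /mnt/script.R
aws s3 cp /mnt/results.rdata s3://my-bucket/results.rdata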
