I have tried a number of ways to put files into my EC2 RStudio instance: uploading via PuTTY, adding Dropbox (via Louis Aslett's AMI, but only very small files sync), and using FileZilla and WinSCP. Unfortunately, after several months I can get the files in, but they are corrupted on arrival.
There are some older questions related to this
To access S3 bucket from R
but they all quote packages that are no longer maintained or have no accepted answer.
This seems possible in Python using boto, and there are a lot more answered questions on this for Python.
I think AWS might prefer using S3, so I am trying that. Is there any R package that is current and works for getting files from S3 into EC2? I have set up both EC2 and S3. Does anyone have the lines of R code to use, and is there anything behind the scenes that needs to be done? I was told by a colleague that it is possible to do this in R with only 4 lines of code once the packages are installed, but she does not know how to actually do it.
Related
I have searched online but I cannot find a way to use packages such as aws.translate. My company accesses AWS using SSO and we cannot generate key pairs. I launched an EC2 instance where I run a docker file which contains R and Python. I believe lots of corporate users face similar issues. Could someone please guide me on how to use the cloudyr packages in an SSO environment?
I installed aws.translate, aws.ec2metadata and aws.signature. use_credentials cannot find the .aws/credentials file. ecs.metadata() returns NULL. I also receive a Bad Request (HTTP 400) error if I call the translate function directly.
Problem:
I would like to make julia available for our developers on our corporate network, which has no internet access at all (no proxy), due to sensitive data.
As far as I understand julia is designed to use github.
For instance julia> Pkg.init() tries to access:
git://github.com/JuliaLang/METADATA.jl
Example:
I solved this problem for R by creating a local CRAN repository (rsync) and setting up a local webserver.
I also solved this problem for python the same way by creating a local PyPi repository (bandersnatch) + webserver.
Question:
Is there a way to create a local repository for metadata and packages for julia?
Thank you in advance.
Roman
Yes, one of the benefits of using the Julia package manager is that you should be able to fork METADATA and host it anywhere you'd like (and keep a branch where you can check new packages before allowing your clients to update). You might be one of the first people to actually set up such a system, so expect that you will need to submit some issues (or better yet, pull requests) in order to get everything working smoothly.
See the extra arguments to Pkg.init() where you specify the METADATA repo URL.
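One way to get a copy of METADATA that you can host yourself is a bare git mirror, created on a machine that can reach GitHub and then moved to the internal server. This is only a sketch: mirror_repo is a helper name I made up, and the destination path in the usage comment is hypothetical.

```shell
#!/bin/bash
set -eu

# Create or refresh a bare mirror of a git repository.
#   $1: upstream URL, e.g. git://github.com/JuliaLang/METADATA.jl
#   $2: local path for the mirror (hypothetical, e.g. /srv/git/METADATA.jl.git)
mirror_repo() {
    local upstream="$1" dest="$2"
    if [ -d "$dest" ]; then
        # Mirror already exists: fetch new commits and prune deleted refs.
        git -C "$dest" remote update --prune
    else
        # First run: clone all refs into a bare repository.
        git clone --mirror "$upstream" "$dest"
    fi
}

# Usage (not executed here, needs network access):
#   mirror_repo git://github.com/JuliaLang/METADATA.jl /srv/git/METADATA.jl.git
```

The location of the mirror on your internal server is then what you would pass to Pkg.init() as the METADATA repo URL.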
If you want a solution that is simpler to manage, I would also think about a two tier setup where you install packages on one system (connected to the internet), and then copy the resulting ~/.julia directory to the restricted system. If the packages you use have binary dependencies, you might run into problems if you don't have similar systems on both sides, or if some of the dependencies are installed globally, but Pkg.build("Pkgname") might be helpful.
This is how I solved it (for now), using the second suggestion by ivarne. I use a two tier setup with two networks: one connected to the internet (office network) and one air gapped (development network).
System information: openSuSE-13.1 (both networks), julia-0.3.5 (both networks)
Tier one (office network)
installed julia on an NFS share, /sharename/local/julia.
soft linked /sharename/local/bin/julia to /sharename/local/julia/bin/julia
appended /sharename/local/bin/ to $PATH using a script in /etc/profile.d/scriptname.sh
created /etc/gitconfig on all office network machines: [url "https://"] insteadOf = git:// (to solve proxy server problems with github)
now every user on the office network can simply run # julia
Pkg.add("PackageName") is then used to install various packages.
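For the gitconfig step, the same rewrite rule can be written with git config instead of editing the file by hand. A minimal sketch: the GITCONFIG variable is my own addition, and it defaults to a temporary file so it can be tried without root. Point it at /etc/gitconfig to get the system-wide setup described above.

```shell
#!/bin/bash
set -eu

# Target file. /etc/gitconfig is the system-wide file used on the office network;
# a temporary file is the default here so the sketch can be tried without root.
GITCONFIG="${GITCONFIG:-$(mktemp)}"

# Rewrite git:// URLs to https:// so that clones go through the proxy.
git config --file "$GITCONFIG" url."https://".insteadOf git://

# Show the resulting [url "https://"] section.
cat "$GITCONFIG"
```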
The two networks are connected periodically (with certain security measures ssh, firewall, routing) for automated data exchange for a short period of time.
Tier two (development network)
installed julia on an NFS share, same as on tier one.
When the networks are connected I use a shell script with rsync -avz --delete to synchronize the .julia directory of tier one to tier two for every user.
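The sync script is roughly of the following shape. This is a simplified sketch, not my exact script: the host name, the /home layout and the RUN switch are placeholders for illustration. By default it only prints the rsync commands; set RUN=1 to actually run them.

```shell
#!/bin/bash
set -eu

# Hypothetical values: adjust to the real tier two host and home directory layout.
TIER_TWO_HOST="${TIER_TWO_HOST:-devhost.example}"
HOME_ROOT="${HOME_ROOT:-/home}"

# Mirror one user's ~/.julia to tier two.
# The trailing slashes make rsync copy directory contents, and --delete removes
# packages on tier two that no longer exist on tier one.
sync_user() {
    local user="$1"
    local src="$HOME_ROOT/$user/.julia/"
    local dest="$TIER_TWO_HOST:$HOME_ROOT/$user/.julia/"
    if [ "${RUN:-0}" = "1" ]; then
        rsync -avz --delete "$src" "$dest"
    else
        # Dry run: only show what would be executed.
        echo "rsync -avz --delete $src $dest"
    fi
}

# Loop over every home directory that contains a .julia directory.
sync_all() {
    local dir
    for dir in "$HOME_ROOT"/*/; do
        [ -d "${dir}.julia" ] || continue
        sync_user "$(basename "$dir")"
    done
}

sync_all
```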
Conclusion (so far):
It seems to work reasonably well.
As ivarne suggested, there are problems if installing a package on tier one involves more than just copying files (e.g. compiling something): the package won't run on tier two. But this can be resolved with Pkg.build("Pkgname").
PackageCompiler.jl seems like the best tool for using modern Julia (v1.8) on secure systems. The following approach requires a build server with the same architecture as the deployment server, something your institution probably already uses for developing containers, etc.
Build a sysimage with PackageCompiler's create_sysimage()
Upload the build (sysimage and depot) along with the Julia binaries to the secure system
Alias a script to julia, similar to the following example:
#!/bin/bash
set -Eeu -o pipefail
unset JULIA_LOAD_PATH
export JULIA_PROJECT=/Path/To/Project
export JULIA_DEPOT_PATH=/Path/To/Depot
export JULIA_PKG_OFFLINE=true
/Path/To/julia -J/Path/To/sysimage.so "$@"
I've been able to run a research pipeline on my institution's secure system, for which there is a public version of the approach.
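A note on argument forwarding in wrapper scripts like the one above: "$@" expands to the arguments themselves, each kept as a separate word, while "$#" expands only to the number of arguments. The small demo below shows the difference. The function names are made up for the illustration and are not part of the wrapper.

```shell
#!/bin/bash
set -eu

# Stand-in for the real program: report how many arguments arrived, then list them.
show_args() {
    echo "count=$#"
    local arg
    for arg in "$@"; do
        echo "arg=[$arg]"
    done
}

# Forwards every argument unchanged, keeping embedded spaces intact.
forward_all() {
    show_args "$@"
}

# Passes "$#", i.e. only the number of arguments, as a single argument.
forward_count() {
    show_args "$#"
}

forward_all script.jl "two words"
forward_count script.jl "two words"
```

The first call reports two arguments (script.jl and "two words"). The second reports a single argument, the string 2.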
I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up etc. I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions are below.
Running R on EMR: Yes you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may have to use just one sizable instance. In that case I would rather use an EC2 instance with an R AMI (look for Louis Aslett).
Moving output files:
Yes you can. It is possible to transfer your program output from EMR to an S3 storage bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional etc.). I am still unable to call the R program as a step. I have to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
First I should say that a lot of this is over my head, so I apologize in advance for using incorrect terminology and potentially asking an unclear question. I'm doing my best.
Also, I saw ThisPost; is RCurl the tool I want to use for this task?
Every day for 4 months I'll be analyzing new data and generating .csv files and .png's that need to be uploaded to a web site that other team members will be checking. I've (nearly) automated all of the data collecting, data downloading, analysis, and file saving. The analysis is carried out in R, and R saves the files. Currently I use Filezilla to manually upload the new files to the website. Is there a way to use R to upload the files to the web site, so that I don't have to open Filezilla and drag+drop files?
It'd be nice to run my R-code and walk away, knowing that once it finishes running, the newly saved files will automatically be put on the website.
Thanks for any help!
You didn't specify which protocol you use to upload your files using FileZilla. I assume it is ftp. If so, you can use the ftpUpload function of RCurl:
library(RCurl)
ftpUpload("yourfile", "ftp://ftp.yourserver.foo/yourfile",
userpwd="username:passwd")
RCurl also has methods for scp and should also support sftp using ftpUpload.
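If you would rather drive the upload from the shell (for example from the same scheduled job that runs the R script), a rough equivalent with the curl command line tool might look like the sketch below. This is not the RCurl approach, just the same idea via curl's -T upload option. OUT_DIR, the RUN switch and reading credentials from ~/.netrc are my assumptions. By default it only prints the commands.

```shell
#!/bin/bash
set -eu

# Hypothetical values: replace with the real server and the folder R saves into.
FTP_URL="${FTP_URL:-ftp://ftp.yourserver.foo}"
OUT_DIR="${OUT_DIR:-./output}"

# Upload one file with curl -T (upload file).
# -n reads the login from ~/.netrc so the password is not on the command line.
upload_file() {
    local file="$1"
    local target="$FTP_URL/$(basename "$file")"
    if [ "${RUN:-0}" = "1" ]; then
        curl -n -T "$file" "$target"
    else
        # Dry run: only show what would be executed.
        echo "curl -n -T $file $target"
    fi
}

# Upload every .csv and .png found in the output folder.
upload_all() {
    local file
    for file in "$OUT_DIR"/*.csv "$OUT_DIR"/*.png; do
        [ -f "$file" ] || continue
        upload_file "$file"
    done
}

upload_all
```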
Often in "restricted security" situations in which programs can't be installed on a computer I run R from a flash drive. Works like a charm. I've recently started using dropbox and was thinking it could be used in a similar fashion to the flash drive. For anyone who has tried this does it work?
I can test it myself but don't want to go to the bother if it's a dead end.
Thanks in advance.
PS: this has the advantage of storing an .Rprofile, so that people you share the Dropbox folder with can then run your R code. This is particularly nice for people unfamiliar with R.
It should just work.
R is set up in such a way that all its files are relative to a given top-level directory. Whether that is a F:\ or Z:\ drive from your flashdrive, or your Dropbox folder should not matter.
By the same token, R can run happily off a shared folder, be it via Samba, NFS or another mechanism.
It is fine if you want to share .Rprofile or .Rhistory. However, I see a problem with .RData, because it can be big (for example 4 GB). For me, saving a 100 MB file to Dropbox takes minutes, and .RData can be far bigger.
An alternative would be a remote server, where you could connect through ssh.