To access S3 bucket from R

I have set up R on an EC2 instance on AWS.
I have a few CSV files uploaded to an S3 bucket.
I was wondering if there is a way to access the CSV files in the S3 bucket from R.
Any help/pointers would be appreciated.

Have a look at the cloudyr aws.s3 package (https://github.com/cloudyr/aws.s3); it might do what you need. Unfortunately (at the time of writing), this package is still at an early stage and a little unstable.
I've had good success simply using R's system() command to make calls to the AWS CLI. This is relatively easy to get started with, very robust, and very well supported.
Start here: http://aws.amazon.com/cli/
List objects using S3 API: http://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html
Get objects using S3 API: http://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html
So, for example, on the command line try the following:
pip install awscli
aws configure
aws s3 help
aws s3api list-objects --bucket some-bucket --query 'Contents[].{Key: Key}'
aws s3api get-object --bucket some-bucket --key some_file.csv new_file_name.csv
In R, you can then do something like:
system("aws s3api list-objects --bucket some-bucket --query 'Contents[].{Key: Key}' > my_bucket.json")

Enter the following command: install.packages("AWS.tools")
From there, use the s3.get() command. The Help tab should tell you what arguments it takes.

Install the libdigest-hmac-perl package:
sudo apt-get install libdigest-hmac-perl

Related

How to see the 'http' address of the Git repository in use from RStudio?

I have started using version control in RStudio with GitLab, but now I wish to use GitHub instead. So I need to move all of my repositories from GitLab to GitHub, as specified e.g. here.
But for some time I was using both GitHub and GitLab simultaneously, and developing a single project on both repos individually! Quite stupid, but it happens.
In short, I now need to know which repo (GitHub or GitLab) my RStudio project is using. I am looking for some code that will print out the URL of the repository in use,
something like repo.print, that returns the associated URL: https://gitlab.com/xx/yy/z or https://github.com/xx/yy/z.
Run git config --get remote.origin.url in the Terminal, or
shell("git config --get remote.origin.url")
in the R console / a script.
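If you want the URL back as an R character vector (for example to print or compare it), a small sketch using system(), which also works outside Windows where shell() is not available:
# Capture the remote URL as a string; intern = TRUE returns the command's output.
repo_url <- system("git config --get remote.origin.url", intern = TRUE)
print(repo_url)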

How to use R libraries in Azure Databricks without depending on CRAN server connectivity?

We are using a few R libraries in Azure Databricks that do not come preinstalled. To install them during job runs on job clusters, we use an init script:
sudo R --vanilla -e 'install.packages("package_name",
repos="https://mran.microsoft.com/snapshot/YYYY-MM-DD")'
During one of our production runs, the Microsoft Server was down (could the timing be any worse?) and the job failed.
As a workaround, we now install libraries in /dbfs/folder_x and when we want to use them, we include the following block in our R code:
.libPaths('/dbfs/folder_x')
library("libraryName")
This does work for us, but what is the ideal solution? If we want to update a library to another version, remove a library, or add one, we have to go through the following steps every time, and there is a chance of forgetting this during code promotions:
install.packages("xyz")
system("cp -R /databricks/spark/R/lib/xyz /dbfs/folder_x/xyz")
It is a very simple and workable solution, but not ideal.
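One small variation worth noting, as a sketch only: install.packages() can write straight into the DBFS folder via its lib argument, which may remove the manual cp step (the path, package name, and snapshot date below are just examples):
# Install directly into the shared DBFS library folder (example path),
# then point .libPaths() at it before loading.
install.packages("xyz", lib = "/dbfs/folder_x",
                 repos = "https://mran.microsoft.com/snapshot/YYYY-MM-DD")
.libPaths(c("/dbfs/folder_x", .libPaths()))
library("xyz")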

SparkR - list of files for an AWS-mounted bucket

I just started working with SparkR. There is an AWS S3 bucket that is mounted in Databricks.
I would like to use list.files() or dir() to list files that contain a certain pattern in their names. However, I can't work out the path for that. I can read a single file with SparkR::read.df, but how do I find all the files I am interested in out of the several thousand in the bucket?
Many thanks if you can help!
There are multiple ways to do this. There is an R package called aws.s3 that can help you with that: https://github.com/cloudyr/aws.s3
OR, if you have the AWS command line tool installed, you could call it via system() from within R. You could include a grep in the call and do something like
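For instance, a minimal aws.s3 sketch that lists keys matching a prefix (the bucket name and prefix are placeholders, and it assumes your AWS credentials are already available to the session):
library(aws.s3)
# List objects in the bucket whose keys start with "abc" (placeholder values).
objs <- get_bucket_df(bucket = "some-bucket", prefix = "abc")
objs$Key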
listOfBucketsWithABCpattern<- system("aws s3 ls | grep abc", intern=TRUE)
Just to clarify, this isn't a SparkR question so much as it is an R question and an AWS Command Line Interface question.
Hopefully this helps. Cheers!

How to install the jvmr package on Databricks

I want to call R functions from a Scala script on Databricks. Is there any way we can do that?
I use
JVMR_JAR=$(R --slave -e 'library("jvmr"); cat(.jvmr.jar)')
scalac -cp "$JVMR_JAR"
scala -cp ".:$JVMR_JAR"
on my Mac, and it automatically opens a Scala shell that can call R functions.
Is there any way I can do similar stuff on databricks?
On Databricks Cloud, you can use sbt-databricks to deploy external libraries to the cloud and attach them to specific clusters, which are the two necessary steps to make sure jvmr is available to the machines you're running this on.
See the plugin's GitHub README and the blog post.
If those resources don't suffice, perhaps you should ask your questions to Databricks' support.
If you want to call an R function from the Scala notebook, you can use the %r shortcut.
df.registerTempTable("temp_table_scores")
Create a new cell, then use:
%r
scores <- table(sqlContext, "temp_table_scores")
local_df <- collect(scores)
someFunc(local_df)
If you want to pass the data back into the environment, you can save it to S3 or register it as a temporary table.
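For example, to hand a data frame back from the R cell, something along these lines should work with the same old SparkR API used above (the table name is a placeholder):
%r
# Convert the local R data frame back to a Spark DataFrame and register it
# so the Scala/Python cells can see it again (placeholder table name).
scores_back <- createDataFrame(sqlContext, local_df)
registerTempTable(scores_back, "temp_table_scores_r")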

How can I copy data over to Amazon's EC2 and run a script?

I am a novice as far as cloud computing goes, but I get the concept and am pretty good at following instructions. I'd like to do some simulations on my data, and each step takes several minutes. Given the hierarchy in my data, each set takes several hours. I'd like to speed this up by running it on Amazon's EC2 cloud.
After reading this, I know how to launch an AMI, connect to it via the shell, and launch R at the command prompt.
What I'd like help with is copying data (.rdata files) and a script over, and just sourcing it at the R command prompt. Then, once all the results are written to new .rdata files, I'd like to copy them back to my local machine.
How do I do this?
I don't know much about R, but I do similar things with other languages. What I suggest will probably give you some ideas.
Set up an FTP server on your local machine.
Create a "startup script" that you launch with your instance.
Let the startup script download the R files from your local machine, initialize R and do the calculations, then upload the new files to your machine.
Start up script:
#!/bin/bash
set -e -x
apt-get update && apt-get install -y curl wget  # plus any packages you need
wget -O /mnt/data_old.R ftp://yourlocalmachine:21/r_files
R CMD BATCH /mnt/data_old.R /mnt/data_new.Rout
/usr/bin/curl -T /mnt/data_new.Rout -u user:pass ftp://yourlocalmachine:21/new_r_files
Start the instance with the startup script:
ec2-run-instances --key KEYPAIR --user-data-file my_start_up_script ami-xxxxxx
First, I'd use Amazon S3 for storing the files, both from your local machine and back from the instance.
As stated before, you can create startup scripts, or even bundle your own customized AMI with all the needed settings and run your instances from it.
So: download the files from a bucket in S3, execute and process, and finally upload the results back to the same (or a different) bucket in S3, as sketched below.
Assuming the data is small (and how big can scripts be?), S3's cost and usability make it very effective.
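If you go the S3 route from R itself, a rough sketch with the aws.s3 package (the bucket, object, and file names are placeholders, and AWS credentials are assumed to be configured on the instance):
library(aws.s3)
# Pull the data and the script down from S3 (placeholder names).
save_object(object = "data_old.rdata", bucket = "my-bucket", file = "data_old.rdata")
save_object(object = "my_script.R",    bucket = "my-bucket", file = "my_script.R")
load("data_old.rdata")
source("my_script.R")   # assume this writes its results to data_new.rdata
# Push the results back up to the bucket.
put_object(file = "data_new.rdata", object = "data_new.rdata", bucket = "my-bucket")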
