SparkR - list files in an AWS S3 bucket mounted as a directory

I just started working with SparkR. There is an AWS S3 bucket mounted in Databricks.
I would like to use list.files() or dir() to list the files whose names contain a certain pattern, but I can't work out the right path for that. I can read a single file with SparkR::read.df, but how do I find the files I am interested in among the several thousand in the bucket?
Many thanks if you can help!

There are multiple ways to do this. There is an R package called aws.s3 that can help you with that: https://github.com/cloudyr/aws.s3
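For example, a minimal sketch using aws.s3's get_bucket(); the bucket name and pattern below are placeholders, and credentials are assumed to be available in the usual AWS environment variables:
library(aws.s3)
# get_bucket() returns a list of the objects in the bucket ("my-bucket" is a placeholder)
objects <- get_bucket(bucket = "my-bucket")
keys <- vapply(objects, function(x) x[["Key"]], character(1))
# Keep only the keys whose names contain the pattern of interest
abc_keys <- grep("abc", keys, value = TRUE)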
OR, if you have the AWS command line tool installed, you could call it via system() from within R. You could include a grep in the call and do something like:
listOfBucketsWithABCpattern <- system("aws s3 ls | grep abc", intern = TRUE)
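Note that aws s3 ls on its own lists your buckets; to list the objects inside a particular bucket, point it at the bucket instead, e.g. (the bucket name is a placeholder):
filesWithABCpattern <- system("aws s3 ls s3://my-bucket/ --recursive | grep abc", intern = TRUE)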
Just to clarify, this isn't a SparkR question so much as it is an R question and an AWS Command Line Interface question.
Hopefully this helps. Cheers!

Related

Sparklyr copy_to fails

I am using the sparklyr library to read and write data from R to HDFS. Reading data works as expected, but writing gives issues.
To be able to use the spark_write_csv function, I need to convert my R data.frames into Spark objects.
I use the sparklyr sdf_copy_to function for this (I also tried copy_to). However, I always get errors.
Code:
table1 <- sdf_copy_to(sc, dataframe, overwrite = TRUE)
spark_write_csv(table1, "path")
Error:
Error: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
hdfs://iacchadoopdev01.dap:8020/tmp/Rtmp2gpelH/spark_serialize_62547a7b0f9ad206fd384af04e585deb3a2636ca7b1f026943d4cc1d11c7759a.csv
Did somebody encounter the same problem and know how to solve it?
A possible reason might be that the sdf_copy_to function stores the data in my local Linux /tmp folder, while the write function looks for the data in the HDFS /tmp folder.
I had the same problem. You need to put the .csv into HDFS.
You can do this via the shell.
You log in to your cluster via SSH. Then you use put to copy the .csv into HDFS.
In the shell, after you have connected to the cluster, write:
hdfs dfs -put 'path to local file/file.csv' 'path to folder in hdfs of your choosing'
Then you can use the HDFS path to load the file.
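Once the file is in HDFS, you can read it back with sparklyr; a minimal sketch, assuming the file was uploaded to /user/me/data/file.csv (a placeholder path) and that sc is an existing sparklyr connection:
library(sparklyr)
# Read the CSV straight from HDFS into a Spark DataFrame
table1 <- spark_read_csv(sc, name = "table1",
                         path = "hdfs:///user/me/data/file.csv")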

R-Hadoop integration - how to connect R to remote hdfs

I have a case where I will run R code on data downloaded from Hadoop. Then the output of the R code will be uploaded back to Hadoop as well. Currently I am doing this manually and I would like to avoid the manual downloading/uploading.
Is there a way to do this in R by connecting to HDFS? In other words, at the beginning of the R script it connects to Hadoop and reads the data, and at the end it uploads the output data to Hadoop again. Are there any packages that can be used? Are any changes required on the Hadoop server or in R?
I forgot to note the important part: R and Hadoop are on different servers.
Install the rmr2 package; its from.dfs function solves your requirement of getting data from HDFS, as shown below:
input_hdfs <- from.dfs("path_to_HDFS_file", format = "format_columns")
To store the results back into HDFS, you can do something like:
write.table(data_output, file = pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file, sep = '')), row.names = FALSE, col.names = FALSE, sep = ',', quote = FALSE)
(or)
You can use the rmr2 to.dfs function to store the data back into HDFS.
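For example, a minimal sketch of the to.dfs()/from.dfs() round trip; the Hadoop paths and object names below are placeholders, and rmr2 must be able to reach your Hadoop installation from the machine running R:
library(rmr2)
# rmr2 needs to know where the Hadoop binaries and streaming jar live (placeholder paths)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
# Write an R object to HDFS...
to.dfs(data_output, output = "/user/me/output/data_output")
# ...and read it back; from.dfs() returns a key/value structure, so use values()
result <- values(from.dfs("/user/me/output/data_output"))
As noted in the next answer, this only works when R runs on a machine that can reach the Hadoop installation directly.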
So... have you found a solution for this?
Some months ago I stumbled upon the same situation. After fiddling around for some time with the Revolution Analytics packages, I couldn't find a way to make them work when R and Hadoop are on different servers.
I tried using WebHDFS, which at the time worked for me.
You can find an R package for WebHDFS access here:
The package is not available on CRAN, so you need to run:
devtools::install_github(c("saurfang/rwebhdfs"))
(yeah... you will need the devtools package)

How to install jvmr package on databricks

I want to call an R function from a Scala script on Databricks. Is there any way we can do that?
I use
JVMR_JAR=$(R --slave -e 'library("jvmr"); cat(.jvmr.jar)')
scalac -cp "$JVMR_JAR"
scala -cp ".:$JVMR_JAR"
on my Mac and it automatically opens a Scala console that can call R functions.
Is there any way I can do similar stuff on databricks?
On Databricks Cloud, you can use the sbt-databricks plugin to deploy external libraries to the cloud and attach them to specific clusters, which are two necessary steps to make sure jvmr is available on the machines you're calling this on.
See the plugin's GitHub README and the blog post.
If those resources don't suffice, perhaps you should ask your questions to Databricks' support.
If you want to call an R function from the Scala notebook, you can use the %r shortcut.
df.registerTempTable("temp_table_scores")
Create a new cell, then use:
%r
scores <- table(sqlContext, "temp_table_scores")
local_df <- collect(scores)
someFunc(local_df)
If you want to pass the data back into the environment, you can save it to S3 or register it as a temporary table.
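For example, a minimal sketch of the temp-table route, using the same SparkR 1.x-style API as the snippet above (the names below are placeholders):
%r
# Convert the local R data.frame back into a Spark DataFrame and register it
# so that Scala cells can read it from the same SQLContext
result_sdf <- createDataFrame(sqlContext, local_df)
registerTempTable(result_sdf, "temp_table_results")
A Scala cell can then pick it up again with sqlContext.table("temp_table_results").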

Running Plink from within R

I'm trying to use a Windows computer to SSH into a Mac server, run a program, and transfer the output data back to my Windows. I've been able to successfully do this manually using Putty.
Now, I'm attempting to automate the process using Plink. I've added Plink to my Windows PATH, so if I open cmd and type in a command, I can successfully log in and pass commands to the server.
However, I'd like to automate this using R to streamline the data analysis process. Based on some searching, the internet seems to think the shell command is best suited to this task. Unfortunately, shell doesn't seem to find Plink, even though passing other commands through shell to the terminal works.
If I try the same thing but manually set the path to Plink in the shell call, no output is returned and the commands do not seem to run (e.g. TESTFOLDER is not created).
Does anyone have any ideas for why Plink is unavailable when I try to call it from R? Alternately, if there are other ideas for how this could be accomplished in R, that would also be appreciated.
Thanks in advance,
-sam
I came here looking for an answer to this question, so I only have so much to offer, but I think I managed to get PLINK's initial steps to work in R using the shell function...
This is what worked for me:
NOT in R:
Install PLINK and add its location to your PATH.
Download the example files from PLINK's tutorial (http://pngu.mgh.harvard.edu/~purcell/plink/tutorial.shtml) and put them in a folder whose path contains NO spaces (unless you know something I don't, in which case, space it up).
Then, in R:
## Set your working directory as the path to the PLINK program files: ##
setwd("C:/Program Files/plink-1.07-dos")
## Use shell to check that you are now in the right directory: ##
shell("cd")
## At this point, the command "plink" should at least be recognized
# (though you may get a different error)
shell("plink")
## Open the PLINK example files ##
# FYI mine are in "C:/PLINK/", so replace that accordingly...
shell("plink --file C:\\PLINK\\hapmap1")
## Make a binary PED file ##
# (provide the full path, not just the file name)
shell("plink --file C:\\PLINK\\hapmap1 --make-bed --out C:\\PLINK\\hapmap1")
... and so on.
That's all I've done so far. But with any luck, mirroring the structure and general format of those lines of code should allow you to do what you like with PLINK from within R.
Hope that helps!
PS. The PLINK output should just print in your R console when you run the lines above.
All the best,
- CC.
Just saw Caitlin's response and it reminded me I hadn't ever updated with my solution. My approach was more of a workaround than a fix for my specific problem, but it may be useful to others.
After adding Plink to my PATH, I created a batch script in Windows containing all my Plink commands, then called the batch script from R using the shell command:
So, in R:
shell('BatchScript.bat')
The batch script contained all my commands that I wanted to use in Plink:
:: transfer file to phosphorus
pscp C:\Users\Sam\...\file zipper@144.**.**.208:/home/zipper/
:: open connection to Dolphin using plink
plink -ssh zipper@144.**.**.208 Batch_Script_With_Remote_Machine_Commands.bat
:: transfer output back to local machine
pscp zipper@144.**.**.208:/home/zipper/output/ C:\Users\Sam\..\output\
Hope that helps someone!

To access S3 bucket from R

I have set up R on an EC2 instance on AWS.
I have a few CSV files uploaded to an S3 bucket.
I was wondering if there is a way to access the CSV files in the S3 bucket from R.
Any help/pointers would be appreciated.
Have a look at the cloudyr aws.s3 package (https://github.com/cloudyr/aws.s3); it might do what you need. Unfortunately (at the time of writing), this package is at quite an early stage and a little unstable.
I've had good success simply using R's system() command to make a call to the AWS CLI. This is relatively easy to get started on, very robust and very well supported.
Start here: http://aws.amazon.com/cli/
List objects using S3 API: http://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html
Get objects using S3 API: http://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html
So, for example, on the command line try the following:
pip install awscli
aws configure
aws s3 help
aws s3api list-objects --bucket some-bucket --query 'Contents[].{Key: Key}'
aws s3api get-object --bucket some-bucket --key some_file.csv new_file_name.csv
In R, you can just do something like:
system("aws s3api list-objects --bucket some-bucket --query 'Contents[].{Key: Key}' > my_bucket.json")
Enter the following command: install.packages("AWS.tools")
From there, use the s3.get() command. The Help tab should tell you what arguments it takes.
Install the libdigest-hmac-perl package:
sudo apt-get install libdigest-hmac-perl
