R-Hadoop integration - how to connect R to remote HDFS

I have a case where I will be running R code on data that is downloaded from Hadoop. The output of the R code then needs to be uploaded back to Hadoop as well. Currently I am doing this manually, and I would like to avoid the manual downloading/uploading.
Is there a way I can do this in R by connecting to HDFS? In other words, at the beginning of the R script it connects to Hadoop and reads the data, and at the end it uploads the output data to Hadoop again. Are there any packages that can be used? Are any changes required on the Hadoop server or in R?
I forgot to note the important part: R and Hadoop are on different servers.

Install the rmr2 package; it provides a from.dfs function that covers your requirement of reading data from HDFS, as shown below:
input_hdfs <- from.dfs("path_to_HDFS_file", format = "format_columns")
For storing the results back into HDFS, you can do something like:
write.table(data_output, file = pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file, sep = ' ')), row.names = FALSE, col.names = FALSE, sep = ',', quote = FALSE)
(or)
Or you can use rmr2's to.dfs function to store the results back into HDFS.
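A minimal sketch of that round trip, assuming rmr2 is installed and the HADOOP_CMD / HADOOP_STREAMING environment variables point at your Hadoop client; the HDFS paths and the csv format below are placeholders, not values from the question:
# minimal sketch, assuming rmr2 is configured (HADOOP_CMD, HADOOP_STREAMING)
library(rmr2)
# read a CSV file from HDFS into R; the path is a placeholder
input_hdfs <- from.dfs("/user/me/input.csv", format = "csv")
data_input <- values(input_hdfs)        # from.dfs returns key/value pairs
# ... run your R code on data_input to produce data_output ...
# write the result back to HDFS; the output path is a placeholder
to.dfs(data_output, output = "/user/me/output", format = "csv")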

So... Have you found a solution for this?
Some months ago I stumbled upon the same situation. After fiddling around for some time with the Revolution Analytics packages, I couldn't find a way to make them work when R and Hadoop are on different servers.
I tried using webHDFS, which at the time worked for me.
You can find an R package for webhdfs access here.
The package is not available on CRAN, so you need to run:
devtools::install_github(c("saurfang/rwebhdfs"))
(yeah... You will need the devtools package)
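If you prefer not to depend on that package, the same idea works with plain HTTP calls against the WebHDFS REST API; a rough sketch with httr, where the host, port, user, and path are all placeholders:
# rough sketch of reading a file over WebHDFS with httr; host, port,
# user and path are placeholders, not values from the question
library(httr)
namenode <- "namenode.example.com"
url <- sprintf("http://%s:50070/webhdfs/v1/user/me/input.csv?op=OPEN&user.name=me",
               namenode)
resp <- GET(url)                 # WebHDFS redirects to a datanode and streams the file
stop_for_status(resp)
input <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))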

Related

How to use R libraries in Azure Databricks without depending on CRAN server connectivity?

We are using a few R libraries in Azure Databricks which do not come preinstalled. To make them available during job runs on job clusters, we install them with an init script:
sudo R --vanilla -e 'install.packages("package_name",
repos="https://mran.microsoft.com/snapshot/YYYY-MM-DD")'
During one of our production runs, the Microsoft Server was down (could the timing be any worse?) and the job failed.
As a workaround, we now install libraries in /dbfs/folder_x and when we want to use them, we include the following block in our R code:
.libPaths('/dbfs/folder_x')
library("libraryName")
This does work for us, but what is the ideal solution? If we want to update a library to another version, remove a library, or add one, we have to go through the following steps every time, and there is a chance of forgetting this during code promotions:
install.packages("xyz")
system("cp -R /databricks/spark/R/lib/xyz /dbfs/folder_x/xyz")
It is a very simple and workable solution, but not ideal.
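One way to trim those steps, sketched below under the assumption that /dbfs/folder_x is the persistent library path from the workaround above, is to install straight into that path and prepend it to the library search path in job code; the package name and repository are placeholders:
# sketch of the workaround above: install once into the persistent DBFS path,
# then point the library search path at it in every job
dir.create("/dbfs/folder_x", showWarnings = FALSE, recursive = TRUE)
install.packages("xyz", lib = "/dbfs/folder_x",
                 repos = "https://cran.r-project.org")   # or a pinned snapshot mirror
# in job code:
.libPaths(c("/dbfs/folder_x", .libPaths()))
library(xyz)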

Including a dataset in an R package

This question may seem pretty naive and I beg your patience.
I have saved extension.RData and documented it in extension.R. Both are saved in the /data folder of the R package I am developing.
However, when I close RStudio and reload the package, I cannot call the data until I execute devtools::document() or devtools::load_all(). Does this suggest that my dataset is not being loaded with the package? How can I avoid executing devtools every time I start working on the package?
Thank you very much.
As I understand it, you have just created the files extension.RData and extension.R (with documentation) in your project directory. However, this is not enough for RStudio to be able to reach your data. You have to install the package by running devtools::install() or clicking the 'Build & Reload' button on the 'Build' tab of RStudio.
Edit: putting extension.R into the R folder solves the problem.
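For reference, a sketch of one conventional layout, assuming the dataset object is named extension; usethis::use_data is an extra tool not mentioned in the question:
# sketch of the conventional layout; `extension` is the dataset object
usethis::use_data(extension)    # writes data/extension.rda
# R/extension.R then carries the roxygen documentation for the dataset:
#' Extension dataset.
#'
#' @format A data frame with ... rows and ... columns.
"extension"
# after devtools::install() (or Build & Reload), data(extension) works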

How to install jvmr package on databricks

I want to call an R function from a Scala script on Databricks. Is there any way to do that?
I use
JVMR_JAR=$(R --slave -e 'library("jvmr"); cat(.jvmr.jar)')
scalac -cp "$JVMR_JAR"
scala -cp ".:$JVMR_JAR"
on my Mac and it automatically opens a Scala REPL that can call R functions.
Is there any way I can do similar stuff on databricks?
On Databricks Cloud, you can use the sbt-databricks plugin to deploy external libraries to the cloud and attach them to specific clusters, which are the two necessary steps to make sure jvmr is available on the machines you're calling this from.
See the plugin's github README and the blog post.
If those resources don't suffice, perhaps you should ask your questions to Databricks' support.
If you want to call an R function from the Scala notebook, you can use the %r magic command. First register the DataFrame as a temporary table in the Scala cell:
df.registerTempTable("temp_table_scores")
Create a new cell, then use:
%r
scores <- table(sqlContext, "temp_table_scores")
local_df <- collect(scores)
someFunc(local_df)
If you want to pass the data back into the environment, you can save it to S3 or register it as a temporary table.
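A rough sketch of the temporary-table route back, assuming the older sqlContext-based SparkR API used above; the table name is a placeholder:
%r
# rough sketch, assuming the sqlContext-based SparkR API used above;
# the table name is a placeholder
spark_df <- createDataFrame(sqlContext, local_df)
registerTempTable(spark_df, "temp_table_scores_r")
# the table is now visible to Scala cells via sqlContext.table("temp_table_scores_r")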

Connect to Hive from R

I need to process data stored on Hadoop in R (some clustering and statistics). I previously used Hive to analyze the data. I found a JDBC package for R and would like to use it. However, it doesn't work; it seems a lot of jars are not available. Could you provide a good instruction or tutorial? How do I query data from Hive in R?
You need to copy Hive's jars to your R classpath and load them into RJDBC. You can read the details with a sample in my blog post here: http://simpletoad.blogspot.com/2013/12/r-connection-to-hive.html
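A bare-bones RJDBC sketch of that route; the driver class is the standard HiveServer2 driver, while the jar directory, host, port, and credentials are placeholders:
# bare-bones sketch; jar directory, host, port and credentials are placeholders
library(RJDBC)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("/path/to/hive/jars", pattern = "\\.jar$",
                                   full.names = TRUE))
conn <- dbConnect(drv, "jdbc:hive2://hive-server:10000/default", "user", "password")
result <- dbGetQuery(conn, "SELECT * FROM my_table LIMIT 10")
dbDisconnect(conn)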
Alternatively, you can use the RHive package, which lets you connect to HiveServer2 from R.
Below are the commands that I used.
Sys.setenv(HIVE_HOME = "/usr/local/hive")
Sys.setenv(HADOOP_HOME = "/usr/local/hadoop")
library(RHive)
rhive.env(ALL = TRUE)
rhive.init()
rhive.connect("localhost")

R Temporary Directory Set to External Drive

When running the RecordLinkage package in R on a large dataset, the GUI failed and closed down.
I realize now that, as a result of R's activity, 120 GB of data had been stored in my Windows temporary folder (in .ff files), running up against the limits of my hard drive.
I would like to plug into an external drive with more space, and set the temporary directory for R to use there.
Can I do this in R, before running my analysis? What is the command?
Is there another way around this problem I'm not thinking about? Thanks kindly.
If you are generating *.ff files, you appear to be making use of the ff package.
Assuming this to be true, you should be able to set the fftempdir option as follows...
...
library(ff)
options("fftempdir"="/EnterYourFilePathHere/"...)
...
Just replace EnterYourFilePathHere with a path to your external hdd.
You should read more about the ff package and the fftempdir in the package documentation: http://cran.r-project.org/web/packages/ff/index.html
The package handles the temp files differently (e.g. deletion, etc.) depending on whether or not it takes fftempdir from your working directory (i.e. getwd()) or from the fftempdir option.
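A quick way to check the setting took effect, sketched here with a hypothetical external-drive path:
# quick sketch; "E:/ff_temp" is a hypothetical external-drive path
library(ff)
options(fftempdir = "E:/ff_temp")
x <- ff(vmode = "double", length = 1e6)   # creates a temporary backing file
filename(x)                               # should point inside E:/ff_temp
delete(x); rm(x)                          # clean up the test file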
