How to deal with dependencies in the case of an (interactive) SparkR job?
I know Java jobs can be submitted as a fat JAR containing all the dependencies. For any other job, the --packages option can be specified on the spark-submit command. But I would like to connect from R (RStudio) to my little cluster using SparkR (this works pretty straightforwardly).
However, I need some external packages, e.g. to connect to a database (Mongo, Cassandra) or to read a CSV file. In local mode I can easily specify these packages at launch. This naturally does not work against the already running cluster.
https://github.com/andypetrella/spark-notebook provides a very convenient way to load such external packages at runtime.
How can I similarly load Maven-coordinate packages onto the Spark classpath, either at runtime from my interactive SparkR session or during image creation of the dockerized cluster?
You can also try configuring the two properties spark.driver.extraClassPath and spark.executor.extraClassPath in the SPARK_HOME/conf/spark-defaults.conf file, setting their values to the path of the jar file. Ensure that the same path exists on the worker nodes.
From No suitable driver found for jdbc in Spark
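For the interactive SparkR part of the question specifically, a common approach is to put the Maven coordinates into SPARKR_SUBMIT_ARGS before the SparkContext is created. A minimal sketch, assuming a Spark 1.x-style SparkR session (the master URL is a placeholder and spark-csv is just an example coordinate; Mongo or Cassandra connectors would be added the same way, comma-separated):
# --packages must be set before sparkR.init() creates the context; Spark then
# ships the resolved jars to the executors of the standalone cluster as well
Sys.setenv(SPARKR_SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.10:1.3.0 sparkr-shell")
library(SparkR)
sc <- sparkR.init(master = "spark://172.17.0.1:7077")
sqlContext <- sparkRSQL.init(sc)
Note that this only takes effect when the context is created, not in an already running SparkR session.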
Related
I am working on a machine without admin rights. I use SQL Developer to connect to an internal database, and I would like to connect via R as well.
Is there any way I can do this without admin rights? Some solutions require me to set up a system DSN, which I cannot do.
Others require me to install jvm.dll.
My environment: Windows 7, SQL Developer, connecting via a TNS file.
Connecting to an Oracle database from R is far more involved than for other databases I've encountered. It's important that you have ojdbc6.jar on your machine and that you know the file path to where it is installed. Installing the jar file does not require admin rights; you can download it from Oracle's website.
I use the RJDBC package to connect like so:
library(RJDBC)
# classPath must point at your local copy of ojdbc6.jar
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver", classPath = "file path to where ojdbc6.jar is installed on your computer")
# Note the @ before the server name in the thin-driver URL
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@YOUR_SERVER", "YOUR_USERNAME", "YOUR_PASSWORD")
You can then test the connection with a number of commands; I typically use:
dbListTables(jdbcConnection)
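A plain DBI query is another quick smoke test (the table name is a placeholder):
# Runs a query through the same RJDBC connection
dbGetQuery(jdbcConnection, "SELECT COUNT(*) FROM SAMPLE_TABLE_NAME")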
Another favorite of mine is to use dbplyr for dplyr-like functions when working with databases:
library(dbplyr)
tbl(jdbcConnection, "SAMPLE_TABLE_NAME")
The resulting output will be the data from the queried table in tibble form.
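As a hedged sketch of that workflow (assuming the connection supports dbplyr's SQL translation as described above; SOME_COLUMN is a placeholder), you can filter in the database and collect only the result locally:
library(dplyr)
# The filter is translated to SQL and runs in the database; collect() pulls the
# result into a local tibble
local_df <- tbl(jdbcConnection, "SAMPLE_TABLE_NAME") %>%
  filter(SOME_COLUMN > 0) %>%
  collect()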
You can set the environment variables in your R session.
Sys.setenv(OCI_LIB64="/Path/to/instantclient",OCI_INC="/Path/to/instantclient/sdk/include")
You can put this in the file .Rprofile in your home directory, and RStudio will run it each time you begin a new session. Once you have this in .Rprofile you should be able to install ROracle.
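Once those variables are visible to the session, installing and connecting with ROracle might look like the following sketch (the TNS alias and credentials are placeholders):
install.packages("ROracle")  # builds against the Instant Client pointed to above
library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "YOUR_USERNAME", password = "YOUR_PASSWORD",
                 dbname = "YOUR_TNS_ALIAS")  # an alias from your TNS file
dbListTables(con)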
This question is basically the same as this one (but in R).
I am developing an R package that uses SparkR. I have created some unit tests (several .R files) in PkgName/inst/tests/testthat using the testthat package. For one of the tests I need to read an external data file, and since it is small and only used in the tests, I read that it can be placed in the same folder as the tests.
When I deploy this with Maven on a standalone Spark cluster, using "local[*]" as master, it works. However, if I use a "remote" Spark cluster (via Docker; the image has Java, Spark 1.5.2 and R; I create a master at, e.g., http://172.17.0.1 and then a worker that successfully links to that master), it does not work. It complains that the data file cannot be found, because it seems to look for it using an absolute path that is valid only on my local PC, not on the workers. The same happens if I use only the file name (without the preceding path).
I have also tried delivering the file to the workers with the --files argument to spark-submit, and the file is successfully delivered (apparently it is placed at http://192.168.0.160:44977/files/myfile.dat, although the port changes with every execution). If I try to retrieve the file's location using SparkFiles.get, I get a path (with a random number in one of the intermediate folders), but apparently it still refers to a path on my local machine. If I try to read the file using the path I've retrieved, it throws the same error (file not found).
I have set the environment variables like this:
SPARK_PACKAGES = "com.databricks:spark-csv_2.10:1.3.0"
SPARKR_SUBMIT_ARGS = " --jars /path/to/extra/jars --files /local/path/to/mydata.dat sparkr-shell"
SPARK_MASTER_IP = "spark://172.17.0.1:7077"
The INFO messages say:
INFO Utils: Copying /local/path/to/mydata.dat to /tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
INFO SparkContext: Added file file:/local/path/to/mydata.dat at http://192.168.0.160:47110/files/mydata.dat with timestamp 1458995156845
This port changes from one run to another. From within R, I have tried:
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
(here I use callJStatic only for debugging purposes) and I get
/tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
but when I try to read from fullpath in R, I get a FileNotFoundException, probably because fullpath is not the location on the workers. When I try to read simply from "mydata.dat" (without a full path), I get the same error, because R is still trying to read from the local path where my project is placed (it just appends "mydata.dat" to that local path).
I have also tried delivering my R package to the workers (not sure whether this helps or not), following the correct packaging conventions (a JAR file with a strict structure and so on). I get no errors (it seems the JAR file with my R package can be installed on the workers), but still no luck.
Could you help me please? Thanks.
EDIT: I think I was wrong and I don't need to access the file on the workers, only on the driver, because the access is not part of a distributed operation (it is just a call to SparkR::read.df). Anyway, it does not work. Surprisingly, though, if I pass the file with --files and read it with read.table (from base R utils, not SparkR), passing the full path returned by SparkFiles.get, it works (although this is useless to me). By the way, I'm using SparkR version 1.5.2.
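For reference, a minimal sketch of the workaround described in this EDIT, assuming the SparkR 1.5.x API and using the paths and master URL from above as placeholders:
library(SparkR)
sc <- sparkR.init(master = "spark://172.17.0.1:7077")
# Resolve the driver-local copy of the file shipped with --files
full_path <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
# Plain base-R read.table works against that driver-local path (SparkR::read.df did not)
local_data <- read.table(full_path)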
I have a case where I will be running R code on data that is downloaded from Hadoop. Then, the output of the R code is uploaded back to Hadoop as well. Currently I am doing this manually, and I would like to avoid the manual downloading/uploading.
Is there a way I can do this in R by connecting to HDFS? In other words, at the beginning of the R script it connects to Hadoop and reads the data, and at the end it uploads the output data to Hadoop again. Are there any packages that can be used? Are any changes required on the Hadoop server or in R?
I forgot to note the important part: R and Hadoop are on different servers.
Install the rmr2 package; its from.dfs function addresses your requirement of getting data from HDFS, as shown below:
input_hdfs <- from.dfs("path_to_HDFS_file", format = "format_columns")
For storing the results back into HDFS, you can do something like:
write.table(data_output, file = pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file)),
            row.names = FALSE, col.names = FALSE, sep = ',', quote = FALSE)
(or)
Alternatively, you can use rmr2's to.dfs function to store the results back into HDFS.
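A minimal sketch of that route (the output path is a placeholder; to.dfs uses rmr2's native serialization by default):
library(rmr2)
# Write the R object data_output back to HDFS
to.dfs(data_output, output = "path_to_output_hdfs_file")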
So... Have you found a solution for this?
Some months ago I stumbled upon the same situation. After fiddling around for some time with the Revolution Analytics packages, I couldn't find a way to make them work in a situation where R and Hadoop are on different servers.
I tried using webHDFS, which at the time worked for me.
You can find an R package for WebHDFS access here.
The package is not available on CRAN, so you need to run:
devtools::install_github(c("saurfang/rwebhdfs"))
(yeah... You will need the devtools package)
I want to call an R function from a Scala script on Databricks. Is there any way we can do that?
I use
JVMR_JAR=$(R --slave -e 'library("jvmr"); cat(.jvmr.jar)')
scalac -cp "$JVMR_JAR"
scala -cp ".:$JVMR_JAR"
on my Mac, and it automatically opens a Scala REPL that can call R functions.
Is there any way I can do something similar on Databricks?
On Databricks Cloud, you can use the sbt-databricks plugin to deploy external libraries to the cloud and attach them to specific clusters, which are the two steps necessary to make sure jvmr is available to the machines you're calling this on.
See the plugin's GitHub README and the blog post.
If those resources don't suffice, you could ask Databricks support directly.
If you want to call an R function from the Scala notebook, you can use the %r shortcut. First register the DataFrame as a temporary table in a Scala cell:
df.registerTempTable("temp_table_scores")
Then create a new R cell and use:
%r
# Load the temp table as a SparkR DataFrame, collect it into a local data.frame,
# then call your R function on it
scores <- table(sqlContext, "temp_table_scores")
local_df <- collect(scores)
someFunc(local_df)
If you want to pass the data back into the environment, you can save it to S3 or register it as a temporary table.
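For the temporary-table route, a minimal sketch (assuming the SparkR 1.x API available in the notebook; the table name temp_table_results is just an illustrative choice):
%r
# Convert the local result back to a Spark DataFrame and expose it as a temp table
result_df <- createDataFrame(sqlContext, local_df)
registerTempTable(result_df, "temp_table_results")
A Scala cell can then read it back with sqlContext.table("temp_table_results").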
I have written a simple Hadoop job. Now I want to run it without creating a jar file, unlike the many tutorials found on the net.
I am calling it from a shell script on Ubuntu, which runs the Cloudera CDH4 distribution of Hadoop (2.0.0+91).
I can't create a jar file for the job because it depends on several other third-party jars and configuration files which are already centrally deployed on my machine and are not accessible at the time of creating the jar. Hence I am looking for a way to include these custom jar files and configuration files.
I also can't use the -libjars and DistributedCache options because they only affect the map/reduce phase, but my driver class also uses these jars and configuration files. My job uses several in-house utility classes which internally use these third-party libraries and configuration files, which I only have read access to at a centrally deployed location.
Here is how I am calling it from a shell script.
sudo -u hdfs hadoop x.y.z.MyJob /input /output
It shows me the following:
Caused by: java.lang.ClassNotFoundException: x.y.z.MyJob
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
My calling shell script successfully sets the Hadoop classpath and includes all my required third-party libraries and configuration files from a centrally deployed location.
I am sure that my class x.y.z.MyJob and all required libraries and configuration files are present in both the $CLASSPATH and $HADOOP_CLASSPATH environment variables, which I set before calling the hadoop job.
Why can't my program find the class when the script runs?
Can't I run the job as a normal Java class? All my other normal Java programs use the same classpath and they can always find the classes and configuration files without any problem.
Please let me know how I can access centrally deployed Hadoop job code and execute it.
EDIT: Here is my code to set the classpath
CLASSES_DIR=$BASE_DIR/classes/current
BIN_DIR=$BASE_DIR/bin/current
LIB_DIR=$BASE_DIR/lib/current
CONFIG_DIR=$BASE_DIR/config/current
DATA_DIR=$BASE_DIR/data/current
CLASSPATH=./
CLASSPATH=$CLASSPATH:$CLASSES_DIR
CLASSPATH=$CLASSPATH:$BIN_DIR
CLASSPATH=$CLASSPATH:$CONFIG_DIR
CLASSPATH=$CLASSPATH:$DATA_DIR
LIBPATH=`$BIN_DIR/lib.sh $LIB_DIR`
CLASSPATH=$CLASSPATH:$LIBPATH
export HADOOP_CLASSPATH=$CLASSPATH
lib.sh is the file that concatenates all third-party jars into a :-separated list, and CLASSES_DIR contains my job code (the x.y.z.MyJob class).
All my configuration files are under CONFIG_DIR.
When I print my CLASSPATH and HADOOP_CLASSPATH, they show the correct values. However, whenever I run hadoop classpath just before executing the job, it shows me the following output:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:myname:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*
$
It obviously does not have any of the previously set $CLASSPATH and $HADOOP_CLASSPATH variables appended. Where did these environment variables go?
Inside my shell script, I was running the hadoop jar command as Cloudera's hdfs user:
sudo -u hdfs hadoop jar x.y.z.MyJob /input /output
The script itself was being run by a regular Ubuntu user, which set the CLASSPATH and HADOOP_CLASSPATH variables as described above. But at execution time the hadoop jar command was not run as that same regular Ubuntu user: sudo -u hdfs starts with a fresh environment, so the exported variables were not visible to it, and hence the class was not found.
So you have to run the job as the same user who actually sets the CLASSPATH and HADOOP_CLASSPATH environment variables.
Thanks all for your time.