I need to process data stored in Hadoop from R (some clustering and statistics). I previously used Hive to analyse the data. I found the RJDBC package for R and would like to use it; however, it doesn't work, apparently because a lot of jars are missing. Could you point me to a good instruction or tutorial? How do I query data from Hive in R?
You need to copy Hive's jars to your R classpath and load them into RJDBC. You can read the details, with a sample, in my blog post here: http://simpletoad.blogspot.com/2013/12/r-connection-to-hive.html
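A minimal sketch of that approach, assuming HiveServer2 is listening on localhost:10000 and that the Hive and Hadoop client jars live under the paths below (jar locations, credentials and the table name are placeholders that depend on your distribution):

library(rJava)
library(DBI)
library(RJDBC)

hive_lib  <- "/usr/local/hive/lib"                    # assumed location of the Hive jars
hadoop_cp <- "/usr/local/hadoop/share/hadoop/common"  # assumed location of hadoop-common

cp <- c(list.files(hive_lib, pattern = "jar$", full.names = TRUE),
        list.files(hadoop_cp, pattern = "jar$", full.names = TRUE))

drv  <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath = cp, identifier.quote = "`")
conn <- dbConnect(drv, "jdbc:hive2://localhost:10000/default", "user", "password")

head(dbGetQuery(conn, "SELECT * FROM my_table LIMIT 10"))  # my_table is a placeholder
dbDisconnect(conn)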
Alternatively, there is the RHive package: you can connect to HiveServer2 from R with it quite simply. Below are the commands that I used:
Sys.setenv(HIVE_HOME = "/usr/local/hive")
Sys.setenv(HADOOP_HOME = "/usr/local/hadoop")
library(RHive)
rhive.env(ALL = TRUE)
rhive.init()
rhive.connect("localhost")
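Once connected, queries can be issued with rhive.query(), which returns a data frame; my_table is a placeholder here:

df <- rhive.query("SELECT * FROM my_table LIMIT 10")
head(df)
rhive.close()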
I am trying to install a package from my own repository in order to test if the functions work. The link to the repository is here: https://github.com/hharder74/SampleMeansII. When I try to install it using the following code:
devtools::install_github("hharder74/SampleMeansII")
I get the following error:
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  Not Found
  Did you spell the repo owner (`hharder74`) and repo name (`SampleMeansII`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.
I am really confused about where this error is coming from. This is my first time trying to upload a package to GitHub, and I just wanted to check that the package can be installed before I turn it in to my teacher. Here is a bit of code to test the functions if anyone needs it:
vec <- runif(100)
sample_mean(vec, 50)
many_sample_means(vec, reps = 10, n = 50)
sample_means_ns(vec, reps = 10, ns = c(5, 50, 500))
You have not yet created a package. You've just created some files with R code in them. An R package has a very particular structure that includes things like a DESCRIPTION file and a NAMESPACE file. In theory you could create these yourself, but often it's easier to use things like devtools::create and roxygen to create them for you. Or if you are using RStudio, you can create a new R Package project with default versions of these files.
To add a DESCRIPTION File, try running
usethis::use_description()
That will fill in defaults you can change.
Then you will need to create a NAMESPACE file. If you just want all the functions you define inside your R files to be available outside the package, you can put
exportPattern("^[[:alpha:]]+")
in that file and that should work.
You might also consider following guides like http://r-pkgs.had.co.nz/package.html or https://swcarpentry.github.io/r-novice-inflammation/08-making-packages-R/ for a better overview of creating a package.
Once your repo looks like a proper R package, then you can use devtools::install_github to install it.
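A rough sketch of one possible workflow, assuming your functions live in .R files that you move into the package's R/ directory:

usethis::create_package("SampleMeansII")  # creates DESCRIPTION, NAMESPACE and R/
devtools::document()                      # regenerates NAMESPACE from roxygen comments, if you use them
devtools::check()                         # optional, but catches common problems
devtools::install()                       # test a local install first

# after committing and pushing the package structure to GitHub:
devtools::install_github("hharder74/SampleMeansII")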
Note that GitHub can be useful for tracking changes to any type of file. You may perform an analysis in an R script whose changes you would like to track, and save that on GitHub, but it may not make sense to turn that analysis script into a package. You generally make a package when you want to reuse functions or data across different projects, so that those projects can install and load it. Not all R code lives inside an R package, but devtools::install_github can only be used to install actual packages.
How do I deal with dependencies in the case of an (interactive) SparkR job?
I know Java jobs can be submitted as a fat jar containing all the dependencies. For any other job, the --packages option can be specified on the spark-submit command line. But I would like to connect from R (RStudio) to my little cluster using SparkR (this works pretty much out of the box).
But I need some external packages, e.g. to connect to a database (Mongo, Cassandra) or to read a CSV file. In local mode I can easily specify these packages on launch; this naturally does not work against the already running cluster.
https://github.com/andypetrella/spark-notebook provides a very convenient mode to load such external packages at runtime.
How can I similarly load maven-coordinate packages into the spark classpath either during runtime from my sparkR (interactive session) or during image creation of the dockerized cluster?
You can also try to configure these two variables, spark.driver.extraClassPath and spark.executor.extraClassPath, in the SPARK_HOME/conf/spark-defaults.conf file, setting their values to the path of the jar file. Ensure that the same path exists on the worker nodes.
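For illustration, the relevant lines in SPARK_HOME/conf/spark-defaults.conf would look roughly like this; the jar path is a placeholder and must exist on every node:

# path to the connector jar(s), identical on driver and all workers (placeholder path)
spark.driver.extraClassPath    /opt/spark/extra-jars/my-connector.jar
spark.executor.extraClassPath  /opt/spark/extra-jars/my-connector.jar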
From No suitable driver found for jdbc in Spark
I have a case where I will be running R code on data that is downloaded from Hadoop. Then the output of the R code will be uploaded back to Hadoop as well. Currently I am doing this manually, and I would like to avoid the manual downloading/uploading.
Is there a way I can do this in R by connecting to hdfs? In other words, in the beginning of the R script, it connects to Hadoop and reads the data, then in the end it uploads the output data to Hadoop again. Are there any packages that can be used? Any changes required in Hadoop server or R?
I forgot to note the important part: R and Hadoop are on different servers.
Install the rmr2 package; its from.dfs function solves your requirement of getting the data from HDFS, as shown below:
input_hdfs <- from.dfs("path_to_HDFS_file",format="format_columns")
For storing the results back into HDFS, you can do something like:
write.table(data_output,file=pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file, sep='')),row.names=F,col.names=F,sep=',',quote=F)
(or)
You can use the rmr2 to.dfs function to store results back into HDFS.
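A minimal sketch of that, assuming rmr2 is already configured (HADOOP_CMD and HADOOP_STREAMING set) and using a placeholder output path:

library(rmr2)
# write the R object holding your results to HDFS in rmr2's native format
to.dfs(data_output, output = "/user/me/output_data", format = "native")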
So... Have you found a solution for this?
Some months ago I stumbled upon the same situation. After fiddling around for some time with the Revolution Analytics packages, I couldn't find a way to make them work in a situation where R and Hadoop are on different servers.
I tried using webHDFS, which at the time worked for me.
You can find an R package for WebHDFS access here.
The package is not available on CRAN, so you need to run:
devtools::install_github(c("saurfang/rwebhdfs"))
(yeah... You will need the devtools package)
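I don't remember rwebhdfs' exact API, so purely as an illustration of what WebHDFS gives you, here is how the underlying REST call could be made directly with httr; host, port and path are placeholders:

library(httr)

namenode  <- "http://namenode.example.com:50070"  # placeholder WebHDFS endpoint
hdfs_path <- "/user/me/data/input.csv"            # placeholder HDFS path

# op=OPEN redirects to a datanode and streams the file content back
resp  <- GET(paste0(namenode, "/webhdfs/v1", hdfs_path), query = list(op = "OPEN"))
input <- read.csv(text = content(resp, as = "text"))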
I want to call an R function from a Scala script on Databricks. Is there any way to do that?
I use
JVMR_JAR=$(R --slave -e 'library("jvmr"); cat(.jvmr.jar)')
scalac -cp "$JVMR_JAR"
scala -cp ".:$JVMR_JAR"
on my Mac, and it automatically opens a Scala REPL that can call R functions.
Is there any way I can do similar stuff on databricks?
On Databricks Cloud, you can use the sbt-databricks plugin to deploy external libraries to the cloud and attach them to specific clusters, which are the two necessary steps to make sure jvmr is available to the machines you're calling this on.
See the plugin's github README and the blog post.
If those resources don't suffice, perhaps you should ask your questions to Databricks' support.
If you want to call an R function in the scala notebook, you can use the %r shortcut.
df.registerTempTable("temp_table_scores")
Create a new cell, then use:
%r
scores <- table(sqlContext, "temp_table_scores")
local_df <- collect(scores)
someFunc(local_df)
If you want to pass the data back into the environment, you can save it to S3 or register it as a temporary table.
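For example, the round trip back into Spark from the %r cell might look like this (mirroring the SparkR 1.x API used above; the result names are placeholders):

%r
# convert the local R data frame back into a Spark DataFrame and expose it as a temp table
result_df <- createDataFrame(sqlContext, local_df)
registerTempTable(result_df, "temp_table_results")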
Installing the RODBC package on Ubuntu is a bit of a kludge. First I learned that I needed to install the following:
$ sudo apt-get install r-cran-rodbc
That wasn't good enough as the package was still looking for header files. I solved this issue by:
$ sudo apt-get install unixodbc-dev
Good, RODBC installed properly on the Ubuntu machine. But when I try to run the following script:
## import excel file from Dropbox
require("RODBC")
channel <- odbcConnectExcel("~/Dropbox/DATA/SAMPLE/petro.xls")
petro <- sqlFetch (channel, "weekly")
odbcClose(channel)
str(petro)
head(petro)
I get an error saying that the function odbcConnectExcel was not found. I checked the case of each letter, making sure it was not a simple typo. Nope. Then I ran this same script on a Windows R installation (with a different file path, of course) and the script works.
Any idea of why Ubuntu R installation cannot find the odbcConnectExcel function and how I can get this to work?
That functionality relies on the Excel ODBC driver, which is only available on Windows. In other words: not on Ubuntu.
For reference, from the R Data Import / Export manual (with my highlighting):
4.3.2 Package RODBC
Package RODBC on CRAN provides an interface to database sources supporting an ODBC interface. This is very widely available, and allows the same R code to access different database systems. RODBC runs on Unix/Linux, Windows and Mac OS X, and almost all database systems provide support for ODBC. We have tested Microsoft SQL Server, Access, MySQL, PostgreSQL, Oracle and IBM DB2 on Windows and MySQL, Oracle, PostgreSQL and SQLite on Linux.

ODBC is a client-server system, and we have happily connected to a DBMS running on a Unix server from a Windows client, and vice versa.

On Windows ODBC support is normally installed, and current versions are available from http://www.microsoft.com/data/odbc/ as part of MDAC. On Unix/Linux you will need an ODBC Driver Manager such as unixODBC (http://www.unixODBC.org) or iODBC (http://www.iODBC.org: this is pre-installed in Mac OS X) and an installed driver for your database system.

Windows provides drivers not just for DBMSs but also for Excel (.xls) spreadsheets, DBase (.dbf) files and even text files. (The named applications do not need to be installed. Which file formats are supported depends on the versions of the drivers.) There are versions for Excel 2007 and Access 2007 (go to http://download.microsoft.com, and search for Office ODBC, which will lead to AccessDatabaseEngine.exe), the '2007 Office System Driver'.
I've found RODBC to be a real pain in the Ubuntu. Maybe it's because I don't know the right incantations, but I switched to RJDBC and have had much better luck with it. As discussed here.
As Dirk says, that won't solve your Excel problem. For writing Excel files I've had very good luck with the WriteXLS package. On Ubuntu I found it quite easy to set up: I had Perl and many of the modules already installed and simply had to install Text::CSV_XS, which I did with the GUI package manager. The reason I like WriteXLS is the ability to write data frames to different sheets in the Excel file. And now that I look at your question I see that you want to READ Excel files, not WRITE them. Hell. WriteXLS doesn't do that. Stick with gdata, like Dirk said in his comments:
gdata is on CRAN, and you are going to want its read.xls() function:
read.xls("//path//to/excelfile.xls", sheet = 1, verbose=FALSE, pattern, ...,
method=c("csv","tsv","tab"), perl="perl")
You may need to run installXLSXsupport(), which installs the needed Perl modules.
read.xls expects sheet numbers, not names. The method parameter is simply the intermediate file format: if your data contains tabs, don't use tab as the intermediate format, and likewise for commas and csv.
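Applied to the file from the question, a minimal sketch (assuming "weekly" is the first sheet, since read.xls wants a number):

library(gdata)
# installXLSXsupport() only needs to be run once if .xlsx support is required
petro <- read.xls("~/Dropbox/DATA/SAMPLE/petro.xls", sheet = 1, perl = "perl")
str(petro)
head(petro)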