Is it possible to access Hive data in a Hadoop HDInsight cluster using R?

Is it possible to access Hive data in a Hadoop HDInsight cluster using R? Say we don't have R Server; all I am interested in is using R as a client tool to access Hive data.

Yes, it's possible to access Hive without R Server. There are several solutions for doing this, listed below.
RHive, an R extension facilitating distributed computing via Apache Hive. There is a slide deck you can refer to, but it seems quite old and does not support YARN.
RJDBC, a package implementing DBI in R on the basis of JDBC. There is a blog post which introduces its usage for R with Hive; see the hedged sketch below.
The R package hive; there is documentation for this package that you can consult to learn how to use it.
It seems that the R package hive is a good choice, because it supports Apache Hadoop >= 2.6.0, which HDInsight is based on.
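For the RJDBC route, here is a minimal, hedged sketch of querying Hive from R; the driver jar path, host, port, and credentials are placeholders to adjust for your cluster (HDInsight typically exposes HiveServer2 through its HTTPS gateway, so check your cluster's JDBC connection string):

library(RJDBC)

# Hive JDBC driver; the jar path below is a placeholder for your local copy.
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = "/path/to/hive-jdbc-standalone.jar")

# Placeholder connection URL and credentials.
conn <- dbConnect(drv, "jdbc:hive2://<cluster-host>:10001/default",
                  "your_user", "your_password")

# hivesampletable is the sample table that HDInsight clusters ship with.
head(dbGetQuery(conn, "SELECT * FROM hivesampletable LIMIT 10"))
dbDisconnect(conn)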
Hope it helps.

Related

Connect H2O with Teradata

I'm trying to find out if one can connect to Teradata using H2O. Upon reading some of the basic documentation on H2O, I found that H2O has the ability to connect to relational databases provided they supply a JDBC driver.
Link: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/getting-data-into-h2o.html?highlight=jdbc
However, the documentation suggests: "Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB"
So I'm wondering if H2O can connect to other databases like Teradata, because they do have a JDBC driver.
Link: https://downloads.teradata.com/download/connectivity/jdbc-driver
-Suhail
The core H2O function importSqlTable in the water.jdbc.SQLManager class is called by both h2o.import_sql_table and h2o.import_sql_select (H2O R API; the Python counterparts should be similar). After inspecting the importSqlTable source code, I found a problem that will likely prevent you from loading from Teradata due to SELECT syntax.
Still, I'd suggest trying it and reporting the result (and the error, if it fails) in the comments. When starting the H2O server, put the Teradata JDBC driver on the classpath and register it via the jdbc.drivers system property:
java -cp <path_to_h2o_jar>:<path_to_Teradata_jdbc_driver_jar> -Djdbc.drivers=com.teradata.jdbc.TeraDriver water.H2OApp
UPDATE:
Use version Xia (3.22.0.1), released 10/26/2018, or later, which fixed JDBC support for Teradata.
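For reference, a hedged sketch of the R side once the server above is running with the driver on its classpath; the JDBC URL, table name, and credentials are placeholders:

library(h2o)
h2o.init()  # connects to the H2O instance started with the Teradata driver

# Placeholders: adjust the Teradata JDBC URL, table, and credentials.
frame <- h2o.import_sql_table(
  connection_url = "jdbc:teradata://<host>/database=<db>",
  table          = "my_table",
  username       = "your_user",
  password       = "your_password"
)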

Connecting to Cassandra Data using Sparklyr

I am using RStudio. I installed a local version of Spark, ran a few things, and am quite happy. Now I am trying to read my actual data from a cluster, using RStudio Server and a standalone version of Spark. The data is in Cassandra, and I do not know how to connect to it. Can anyone point me to a good primer on how to connect and read that data?
There is a package called crassy, which is listed under Extensions on the official sparklyr site. You can search for crassy on GitHub.
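If you prefer to wire it up yourself rather than use crassy, here is a hedged sketch with plain sparklyr and the DataStax spark-cassandra-connector; the connector coordinates, Cassandra host, Spark master URL, keyspace, and table are all assumptions to adapt:

library(sparklyr)

conf <- spark_config()
# Assumed connector version; match it to your Spark/Scala versions.
conf$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"
conf$spark.cassandra.connection.host <- "<cassandra-host>"

sc <- spark_connect(master = "spark://<spark-master>:7077", config = conf)

# Read a Cassandra table through the connector's data source API.
tweets <- spark_read_source(
  sc, name = "my_table",
  source  = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "my_keyspace", table = "my_table")
)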

Rstudio - connect to HDInsight Spark Cluster SparkR

I am using RStudio on my local laptop and trying to connect to an existing, remote HDInsight Spark Cluster.
A couple of questions:
1) Do I need to have RStudio installed on the HDInsight Spark cluster?
2) How do I connect local RStudio to a remote Spark cluster? I've been looking at the SparkR docs here, but they don't seem to give a connection example (i.e., URL, credentials, etc.).
HDInsight includes an R Server option that can be integrated into your HDInsight cluster. This option allows R scripts to use Spark and MapReduce to run distributed computations.
For more details, refer to "Get started using R Server on HDInsight".
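Alternatively, as a hedged sketch (not the R Server route): a local RStudio session can often reach an HDInsight cluster through its Livy endpoint with sparklyr; the cluster name and gateway credentials below are placeholders:

library(sparklyr)

# Placeholders: HDInsight cluster name and gateway (admin) credentials.
sc <- spark_connect(
  master = "https://<cluster-name>.azurehdinsight.net/livy",
  method = "livy",
  config = livy_config(username = "admin", password = "<cluster-password>")
)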

How to write a table reference when working with RJDBC?

I am working with really big data in R. My data is in Hive and I am using RJDBC. I am thinking of using a table reference in R, because it's impossible to load the table into R even with just a 10% sample. I am using the tbl function from dplyr.
transaction <- tbl(conn,"transaction")
R gave me this error message:
the dbplyr package is required to communicate with the database backends.
I am using a remote computer, and it's impossible to install packages on this R installation.
Any other solutions to solve the problem?
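For context, a hedged sketch of the pattern the error points at: dplyr's tbl() delegates SQL translation to dbplyr, so dbplyr must be available; if the system library is read-only, a per-user library (where permitted) is one common workaround:

# Assumes a writable per-user library is allowed on the remote machine.
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)
install.packages("dbplyr", lib = Sys.getenv("R_LIBS_USER"))

library(dplyr)
transaction <- tbl(conn, "transaction")  # a lazy reference; nothing is loaded into R yet
transaction %>% count()                  # the counting is pushed down to Hive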

Running R script on hadoop and mapreduce

I have an R script that does stuff with a bunch of tweets, and I would like to use the same script on the same data saved in a Hadoop file system. According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear.
Can I use the very same R script, taking advantage of the MapReduce paradigm, by using Revolution R? Should I change my code, or is there a way to execute the same functions optimized for a Hadoop architecture?
My wish would be to write my code in a standard R IDE like RStudio and then use it, or most of it, on my cloud services (such as Microsoft Azure) with MapReduce underneath.
Yes, you can run any R script across different data platforms, from Hadoop to Spark to Teradata and SQL Server, by using an environment-specific compute context.
The following two links should help you get started with Revolution R / Microsoft R Server on Hadoop (a hedged sketch follows the links):
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd
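To give a feel for the compute-context idea, here is a hedged RevoScaleR sketch; the edge-node host, user, share directories, HDFS path, and column name are all placeholders:

library(RevoScaleR)

# Placeholder cluster details; see the RxHadoopMR documentation for your setup.
cc <- RxHadoopMR(
  sshHostname  = "<edge-node>",
  sshUsername  = "<user>",
  hdfsShareDir = "/user/RevoShare/<user>",
  shareDir     = "/var/RevoShare/<user>"
)
rxSetComputeContext(cc)  # the same rx* calls now run as MapReduce jobs

tweets <- RxTextData("/data/tweets.csv", fileSystem = RxHdfsFileSystem())
rxSummary(~ retweet_count, data = tweets)  # placeholder column name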
