Connecting to Cassandra Data using sparklyr

I am using RStudio. I installed a local version of Spark, ran a few things, and was quite happy. Now I am trying to read my actual data from a cluster, using RStudio Server and a standalone version of Spark. The data is in Cassandra, and I do not know how to connect to it. Can anyone point me to a good primer on how to connect and read that data?

There is a package called crassy, which is listed under Extensions on the official sparklyr site. You can search for crassy on GitHub.
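If you want to stay with plain sparklyr, one common route (an assumption here, not something the answer above spells out) is to load the DataStax spark-cassandra-connector and read the table with spark_read_source. A minimal sketch; the connector version, host, keyspace and table names are placeholders you would need to adjust:

library(sparklyr)

config <- spark_config()
# Assumption: the connector coordinates must match your Spark/Scala versions
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"
config$spark.cassandra.connection.host <- "cassandra-host"   # placeholder host

sc <- spark_connect(master = "spark://spark-master:7077", config = config)

# Read a Cassandra table into a Spark DataFrame
my_tbl <- spark_read_source(
  sc,
  name    = "my_table",
  source  = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "my_keyspace", table = "my_table")
)

head(my_tbl)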

Related

Connect H2O with Teradata

I'm trying to find out if one can connect to Teradata using H2O. Upon reading some of the basic documentation on H2O, I found that H2O has the ability to connect to relational databases provided they supply a JDBC driver.
Link: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/getting-data-into-h2o.html?highlight=jdbc
However, the documentation suggests: "Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB"
So I'm wondering if H2O can connect to other databases like Teradata, because they do have a JDBC driver.
Link: https://downloads.teradata.com/download/connectivity/jdbc-driver
-Suhail
The core H2O function importSqlTable in the water.jdbc.SQLManager class is called by both h2o.import_sql_table and h2o.import_sql_select (H2O R API; the Python counterparts should be similar). After inspecting the importSqlTable source code, I found a problem that will likely prevent you from loading from Teradata due to the SELECT syntax used.
Still, I'd suggest trying it and reporting the result (and the error, if it fails) in the comments. When starting the H2O server, add the Teradata JDBC driver to the classpath and register the driver class on your command line:
-cp <path_to_h2o_jar>:<path_to_Teradata_jdbc_driver_jar> -Djdbc.drivers=com.teradata.jdbc.TeraDriver
UPDATE:
Use version Xia (3.22.0.1), released 10/26/2018, or later, which fixed JDBC support for Teradata.
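For reference, a minimal sketch of what the call might look like from R once the driver is on the H2O server's classpath; the connection URL, table name and credentials below are placeholders, and Teradata support is not guaranteed, as discussed above:

library(h2o)
h2o.init()   # assumes the running H2O server was started with the Teradata driver on its classpath

# Placeholder connection details
connection_url <- "jdbc:teradata://td-host/DATABASE=mydb"
df <- h2o.import_sql_table(connection_url, table = "my_table",
                           username = "user", password = "password")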

Is it possible to access Hive data in a Hadoop HDInsight cluster using R?

Is it possible to access Hive data in a Hadoop HDInsight cluster using R? Say we don't have R Server; all I am interested in is using R as a client tool to access Hive data.
Yes, it's possible to access Hive without R Server. There are several ways to do this, as below.
RHive, an R extension facilitating distributed computing via Apache Hive. There is a slide deck you can refer to, but it seems to be quite old and does not support YARN.
RJDBC, a package implementing DBI in R on the basis of JDBC. There is a blog post which introduces its usage for R with Hive (see the sketch after this list).
The R package hive; there is documentation for this package that you can refer to in order to learn how to use it.
It seems that the R package hive is a good choice, because it supports Apache Hadoop >= 2.6.0, which HDInsight is based on.
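For the RJDBC option, a minimal sketch of connecting to HiveServer2 from R; the driver jar path, host, credentials and query are placeholders you would need to adapt to your cluster:

library(RJDBC)

# Placeholder path to the Hive JDBC driver jar
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            classPath = "/path/to/hive-jdbc-standalone.jar")

conn <- dbConnect(drv, "jdbc:hive2://headnode-host:10000/default",
                  "username", "password")

res <- dbGetQuery(conn, "SELECT * FROM my_hive_table LIMIT 10")
dbDisconnect(conn)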
Hope it helps.

Running R script on hadoop and mapreduce

I have an R script that does stuff with a bunch of tweets, and I would like to use the same script on the same data but saved in a Hadoop file system. According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear.
Can I use the very same R script, taking advantage of the MapReduce paradigm, by using this Revolution R? Should I change my code, or is there a way to execute the same functions optimized for a Hadoop architecture?
My wish would be to write my code in a standard R IDE like RStudio and then use it, or most of it, on my cloud services (such as Microsoft Azure) with MapReduce underneath.
Yes, you can run any R script across different data platforms, from Hadoop to Spark to Teradata and SQL Server, by using an environment-specific compute context.
The following two links should help you get started with Revolution R / Microsoft R Server on Hadoop:
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd
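A minimal sketch of the compute-context switch mentioned above, using the RevoScaleR package that ships with Revolution R / Microsoft R Server; the share directories and HDFS path are placeholders, and the exact arguments depend on your cluster setup:

library(RevoScaleR)

# Placeholder Hadoop MapReduce compute context; adjust directories for your cluster
cc <- RxHadoopMR(hdfsShareDir = "/user/RevoShare/me",
                 shareDir     = "/var/RevoShare/me",
                 consoleOutput = TRUE)
rxSetComputeContext(cc)

# Point at the tweet data already stored in HDFS
hdfsFS <- RxHdfsFileSystem()
tweets <- RxTextData("/data/tweets.csv", fileSystem = hdfsFS)

# rx* analysis functions now run as MapReduce jobs on the cluster
rxSummary(~ ., data = tweets)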

Access S3 bucket from AWS EC2 via r command lines

I have tried a number of things to put files into my EC2 RStudio instance, particularly uploading via PuTTY, adding Dropbox (via Louis Aslett's AMI, but only very small files sync), and using FileZilla and also WinSCP. Unfortunately, after several months, I can get them in, but they are corrupted on arrival.
There are some older questions related to this
To access S3 bucket from R
but they all quote packages that are no longer maintained or have no accepted answer.
This seems possible in Python using boto, and there are a lot more answered questions on this for Python.
I think AWS might prefer using S3, so I am trying that. Is there any R package that is current and works for moving files from S3 into EC2? I have set up EC2 and S3. Does anyone have the R code lines to use, and is there anything behind the scenes that needs to be done? I was told by a colleague that it was possible to do this in R with only 4 lines of R code once the packages were installed, but she does not know how to actually do it.
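For what it's worth, the "4 lines of R" may refer to something like the cloudyr aws.s3 package (an assumption, since no package is named in the question). A minimal sketch, assuming an IAM role or AWS credentials are already configured on the EC2 instance; the bucket and file names are placeholders:

install.packages("aws.s3")
library(aws.s3)

bucketlist()                                              # sanity-check the connection
save_object("myfile.csv", bucket = "my-bucket",
            file = "~/myfile.csv")                        # copy an S3 object onto the EC2 instance
put_object(file = "~/results.csv", object = "results.csv",
           bucket = "my-bucket")                          # copy a local file back up to S3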

Amazon EMR: Using R code in Amazon EMR

I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up etc. I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions below:
Running R on EMR: Yes, you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may have to just use one sizable instance. I would rather use an EC2 instance with an R AMI (look for Louis Aslett's).
Moving output files:
Yes, you can. It is possible to transfer your program output from EMR to an S3 storage bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
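As a rough sketch of that last point (the package choice and file paths here are assumptions, not part of the original answer), reading an Excel file that has been copied onto the cluster and writing the results back out as .txt might look like:

# Assumes the readxl package was installed during the bootstrap action
library(readxl)

sheet <- read_excel("/home/hadoop/input/mydata.xlsx", sheet = 1)

# ... run the existing analysis on `sheet` here ...

write.table(sheet, "/home/hadoop/output/results.txt",
            sep = "\t", row.names = FALSE)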
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional, etc.). I am still unable to call the R program as a step; I am having to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
