How to connect to HDFS from R and read/write Parquet files using arrow?

I have a couple of Parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS in Parquet format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is that nowhere in the R arrow docs can I find information about working with HDFS, and in general there is not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. While reading is fine (but slow), inserting larger amounts of data into Impala/Hive this way is just awful (slow, often fails, and Impala isn't really built to digest data this way).
I know I could probably use pyarrow to work with HDFS, but I would like to avoid installing Python in my Docker image just for this purpose.

The bindings for this are not currently implemented in R; there is a ticket open here on the project JIRA, which at time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.

Related

Running R script on hadoop and mapreduce

I have an R script that processes a bunch of tweets, and I would like to use the same script on the same data stored in a Hadoop file system. According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear how.
Can I use the very same R script, taking advantage of the MapReduce paradigm, by using this Revolution R? Should I change my code, or is there a way to execute the same functions optimized for a Hadoop architecture?
My wish would be to write my code in a standard R IDE like RStudio and then use it, or most of it, on my cloud services (such as Microsoft Azure) with MapReduce underneath.
Yes, you can run any R script across different data platforms, from Hadoop to Spark to Teradata and SQL Server, by using an environment-specific compute context.
Following two links should help you get started on how to use Revolution R / Microsoft R Server on Hadoop:
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd

Getting Data in and out of Rhipe [R + Hadoop]

I was trying out rhipe and the RHadoop series of packages [rmr, rhdfs, rhbase, etc.].
With both packages [rhipe and rmr] I can read data stored in CSV or text files. Both of them more or less support creating new file formats, but I find rmr has better support for it, or at least more resources to get started. This matters when one plans to process raw data stored in HDFS and then store the result back in HDFS in a format recognizable by other Hadoop components such as Hive and Impala. Both packages can write in a native format that only the package itself recognizes. The rmr package also supports a few other formats.
For reference on rmr, have a look at: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md
However, for rhipe I could not find any such document, and the various approaches I tried all failed.
So my question is: how can I write back to text [as an example; any other recognizable format will also work] after reading a file stored in HDFS and running rhwatch in rhipe?

Reading LabVIEW TDMS files with R

As part of a transition from MATLAB to R, I am trying to figure out how to read TDMS files created with National Instruments LabVIEW using R. TDMS is a fairly complex binary file format (http://www.ni.com/white-paper/5696/en/).
Add-ons exist for excel and open-office (http://www.ni.com/white-paper/3727/en/), and I could make something in LabVIEW to make the conversion, but I am looking for a solution that would let me read the TDMS files directly into R. This would allow us to test out the use of R for certain data processing requirements without changing what we do earlier in the data acquisition process. Having a simple process would also reduce the barriers to others trying out R for this purpose.
Does anyone have any experience with reading TDMS files directly into R, that they could share?
This is far from supporting the full TDMS specification, but I started a port of the Python npTDMS package to R here https://github.com/msuefishlab/tdmsreader and it has been tested in the context of a Shiny app here
You don't say if you need to automate the reading of these files using R, or just convert the data manually. I'm assuming you or your colleagues don't have any access to LabVIEW yourselves otherwise you could just create a LabVIEW tool to do the conversion (and build it as a standalone application or DLL, if you have the professional development system or app builder - you could run the built app from your R code by passing parameters on a command line).
The document on your first link refers to (a) add-ins for OpenOffice Calc and for Excel, which should work for a manual conversion and which you might be able to automate using those programs' respective macro languages, and (b) a C DLL for reading TDMS - would it be possible for you to use one of those?

R dataset connection to tableau

Tableau recently added an R connection feature in release 8.1. I want to know if there is any way I can load an entire table created in R into Tableau, or an .rds object that contains the dataset.
There is a tutorial on the Tableau website for this and a blog post on r-bloggers that discuss it. The tutorial has a number of comments, and one of them (in early Dec I think) asks how to get an rds file in. You need to start Rserve and then execute a script on it to get your data.
Sorry I can't be more help as I only looked into it briefly and put it on the back-burner but if you get stuck they seem to come back quickly if you post a comment on the page:
http://www.tableausoftware.com/about/blog/2013/10/tableau-81-and-r-25327
Just pointing out that the Tableau Data Extract API might be useful here, even if the current version of R integration doesn't yet meet your needs. (Note, that link is to the version 8.1 docs released in late 2013 - so look for the latest version to see what functionality they've added since)
If what you want to do is manipulate data in R and then send a table of data to Tableau for visualization, you could first try the simple step of exporting the data from R as a CSV file and then visualizing that data in Tableau. I know that's not sexy, but it's always good to make sure you've got a way to get the output you need before investing time in optimizing the process.
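A minimal sketch of that CSV export step in base R. The `dataset` data frame and the file name are made up for illustration; substitute your own object and path.

```r
# Hypothetical example data standing in for the table you built in R.
dataset <- data.frame(
  region = c("North", "South", "East"),
  sales  = c(120.5, 98.0, 143.2),
  stringsAsFactors = FALSE
)

# Write a plain CSV. row.names = FALSE avoids an extra unnamed index column,
# which would otherwise show up as a spurious field in Tableau.
csv_path <- file.path(tempdir(), "dataset.csv")
write.csv(dataset, csv_path, row.names = FALSE)

# Read the file back to confirm it contains what you expect.
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(round_trip)
```

In Tableau you would then connect to the file as a text file data source.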
If that gets the effect you want, but you just want to automate more of the steps, then take a look at the Tableau Data Extract API. You could use that library to generate a Tableau Data Extract instead of a CSV file. If you have something in production that needs updates, then you could presumably create a python script or JVM program to read your RDS file periodically and generate a revised extract.
Let us assume your data.frame/ tibble etc (say dataset object) is ready in R/ RStudio and you want to connect it with Tableau
1. In RStudio (or R terminal), execute the following steps:
install.packages("Rserve")
library(Rserve)
Rserve() ##This gets the R connection service up and running
2. Now go to Tableau (I am using 10.3.2):
Help > Settings and Performance > Manage External Service Connection
Enter localhost in the Server field and click on Test Connection.
You have now established a connection between R and Tableau.
3. Come back to RStudio. Now we need an .rdata file that contains our R object(s), in this case dataset. This is the R object that we want to use in Tableau. Enter this in the R console:
save(dataset, file="objectName.rdata")
4. Switch to Tableau now.
Connect To a File > Statistical File
Go to your working directory where the newly created objectName.rdata resides. From the drop down list of file type, select R files (*.rdata, *.rda) and select your object. This will open the object you created in R in Tableau. Alternatively, you can drag and drop your object directly to Tableau's workspace.
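Before switching to Tableau, it can help to confirm that the saved file really contains the object you expect. The following is a small sketch using only base R; the sample `dataset` and the temporary path are placeholders for illustration.

```r
# Hypothetical stand-in for the real dataset object.
dataset <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Save it the same way as in step 3 (written to a temp directory here).
rdata_path <- file.path(tempdir(), "objectName.rdata")
save(dataset, file = rdata_path)

# Load into a fresh environment so nothing in the workspace is overwritten.
# load() returns the names of the objects it restored.
check_env <- new.env()
loaded_names <- load(rdata_path, envir = check_env)

print(loaded_names)
str(check_env$dataset)
```

If `loaded_names` lists more than one object, consider saving only the single data frame you want to open in Tableau.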

Apache log file format analysis by R

I am trying to analyze weblog files in R. I am comfortable dealing with the date and bytes fields, wherever numeric data is present, but I struggle with the strings.
From the log file (in CSV format), I want to identify each particular user (with the help of IP and agent) and that user's total spending on the web page.
There are numerous libraries for this kind of analysis, although I could find none in R. A Google search for parse apache logfile yielded a library in Perl, and python parse apache logfile yields the Scratchy library. Both rely on regular expressions to parse the contents of the file.
From here there are two ways to deal with the apache logfile:
Call perl or python from R, either using a direct link, or using a system call (this is simpler).
Take the idea from the perl or python lib and use it to implement R versions of the functions. This will take a lot of time.
You refer to a csv file, but I think the libraries above work with the original text file with the Apache log, so I'd use those, and not your csv file.
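To give a feel for the second option, here is a rough sketch of regex-based parsing in base R. It assumes the lines follow Apache's combined log format; the sample lines, the pattern, and the column names are illustrative and are not taken from the Perl or Python libraries mentioned above. It groups by IP and agent and sums bytes and request counts, which may or may not be the measure you are after.

```r
# Made-up sample lines in Apache combined log format.
log_lines <- c(
  '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/5.0"',
  '192.0.2.1 - - [10/Oct/2000:13:56:01 -0700] "GET /about.html HTTP/1.0" 200 1024 "-" "Mozilla/5.0"',
  '198.51.100.7 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 304 - "-" "curl/7.0"'
)

# One capture group per field:
# host, ident, user, [time], "request", status, bytes, "referer", "agent"
pattern <- paste0(
  '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" ',
  '(\\d{3}) (\\S+) "([^"]*)" "([^"]*)"$'
)

# regexec() returns the full match plus the 9 groups for each line.
parts <- regmatches(log_lines, regexec(pattern, log_lines, perl = TRUE))
parts <- parts[lengths(parts) == 10]  # drop lines that did not match
fields <- do.call(rbind, parts)

logs <- data.frame(
  ip      = fields[, 2],
  time    = fields[, 5],  # kept as text; parse with as.POSIXct() if needed
  request = fields[, 6],
  status  = as.integer(fields[, 7]),
  bytes   = suppressWarnings(as.integer(fields[, 8])),  # "-" becomes NA
  agent   = fields[, 10],
  stringsAsFactors = FALSE
)
logs$bytes[is.na(logs$bytes)] <- 0L
logs$requests <- 1L

# Treat each (IP, agent) pair as one user and total their traffic.
per_user <- aggregate(cbind(bytes, requests) ~ ip + agent, data = logs, FUN = sum)
print(per_user)
```

For a real file you would replace the sample vector with `readLines()` on the raw access log.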
In addition, this SO post mentions an answer by @doug (profile) where he states that he has created some functions to visualize Apache logfile data parsed by Python. Maybe you could send him a message or mail and see if he is willing to share the code.
Logfile analysis in R is an interesting topic we had before, you can find our discussion right here. Maybe this discussion might also help you to adjust to the SO etiquette in order to get better feedback (not to take anything away from yours, Paul).
