Getting Data in and out of Rhipe [R + Hadoop]

I have been trying out the rhipe and RHadoop [rmr, rhdfs, rhbase, etc.] series of packages.
In both rhipe and rmr I can ingest/read data stored in CSV or text files. Both packages support creating new file formats to some extent, but rmr seems to have more support for it, or at least more resources to get started with. This matters when you want to do some processing on raw data stored in HDFS and then write the result back to HDFS in a format that other Hadoop components such as Hive or Impala can read. By default, both packages write in their own native format, which only the package itself can read; rmr additionally supports a few other formats.
For the rmr side of this, have a look at: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md
For rhipe, however, I could not find any comparable documentation, and the various approaches I tried all failed.
So my question is: how can I write the output back to text (for example; any other recognizable format would also work) after reading a file stored in HDFS and running rhwatch in rhipe?
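To make the rmr side concrete, the pattern from the document linked above looks roughly like the following sketch (the HDFS paths are made up and the map step is just an identity pass-through); what I am after is the rhipe equivalent of this:
library(rmr2)
# Read CSV from HDFS, pass it through, and write CSV back so that
# Hive/Impala can pick the result up as plain delimited text.
csv.in  <- make.input.format("csv", sep = ",")   # how to parse the input splits
csv.out <- make.output.format("csv", sep = ",")  # how to serialize the output
mapreduce(
  input         = "/tmp/raw_data",     # hypothetical HDFS input directory
  output        = "/tmp/clean_data",   # hypothetical HDFS output directory
  input.format  = csv.in,
  output.format = csv.out,
  map           = function(k, v) keyval(NULL, v)  # identity map, no reduce
)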

Related

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS in parquet format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is that nowhere in the R arrow docs can I find information about working with HDFS, and in general there is not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. Reading is fine (though slow), but inserting larger amounts of data into Impala/Hive this way is downright awful (slow, often fails, and Impala isn't really built to ingest data this way).
I know I could probably use pyarrow to work with HDFS, but I'd like to avoid installing Python in my Docker image just for this purpose.
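For reference, the odbc route I mean is roughly the following sketch (the DSN and table names are placeholders):
library(DBI)
library(odbc)
# Connect through an ODBC DSN pointing at Impala/Hive (DSN name is hypothetical)
con <- dbConnect(odbc(), dsn = "impala_dsn")
# Reading works, if slowly
df <- dbGetQuery(con, "SELECT * FROM my_table")
# Writing larger tables back this way is the painful part
dbWriteTable(con, "my_table_copy", df, overwrite = TRUE)
dbDisconnect(con)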
The bindings for this are not currently implemented in R; there is an open ticket on the project JIRA, which at the time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to note that there is user interest in implementing these bindings.

Is it possible to download software using R?

I am writing a user-friendly function to import Access tables using R. I have found that most of the steps will have to be done outside of R, but I want to keep as much of this as possible within the script. The first step is to download a database driver from Microsoft: https://www.microsoft.com/en-US/download/details.aspx?id=13255.
I am wondering whether it is possible to download software from inside R, and what function/package I could use? I have looked into download.file, but it seems to be intended for downloading data files rather than software.
Edit: I have tried
install_url("https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe")
But I get an error:
Downloading package from url: https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe
Installation failed: Don't know how to decompress files with extension exe
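A sketch of one possible direction, assuming base download.file in binary mode and a Windows session (the destination path is hypothetical, and actually running the installer still happens outside of R):
# Download the Access Database Engine installer as a binary file
url  <- "https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe"
dest <- file.path(tempdir(), "AccessDatabaseEngine_X64.exe")
download.file(url, destfile = dest, mode = "wb")  # mode = "wb" keeps the .exe intact
# Then launch the installer from R on Windows (still needs user interaction)
shell.exec(dest)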

Huge XML-Parsing/converting using R or RDotnet

I have a 780 GB XML file (yes, really: a 5 GB pcap file that was converted to XML).
The name of the XML file is tmp.xml.
I am trying to run this operation in RStudio:
require("XML")
xmlfile <<- xmlRoot(xmlParse("tmp.xml"))
When I try to do this in R, I get errors (memory allocation failure, the R session aborts, etc.).
Is there any benefit to using RDotnet instead of plain R?
Is there any way to do this in R at all?
Do you know of another strong tool to convert this huge XML file to CSV or some easier format?
Thank you!
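One direction worth noting: the XML package also offers event (SAX-style) parsing via xmlEventParse, which streams the file instead of building the whole tree in memory. A rough sketch, assuming the records of interest are <packet> elements carrying their fields as attributes (the element name and fields here are hypothetical):
library(XML)
# Collect one record per <packet> element without loading the whole document.
# For a file this size you would write records out in batches rather than
# accumulating them all in memory as done here.
rows <- list()
start_handler <- function(name, attrs, ...) {
  if (name == "packet") {
    rows[[length(rows) + 1]] <<- as.list(attrs)  # keep only the attributes
  }
}
xmlEventParse("tmp.xml", handlers = list(startElement = start_handler))
# Flatten the collected records and write them out as CSV
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
write.csv(df, "tmp.csv", row.names = FALSE)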

R integration with Tableau

I am facing difficulty integrating R with Tableau.
When I first created a calculated field, Tableau asked for the Rserve package in R and would not let me drag the field onto the worksheet. I have installed the package, but it still shows an error saying:
"An error occurred while communicating with the Rserve service. Tableau is unable to connect to the service. Verify that the server is running and that you have access privileges."
Any inputs? Thank you.
You need to start Rserve. If you successfully install Rserve package, simply run this (on RGui, RStudio or wherever you run R scripts)
> library(Rserve)
> Rserve()
You can test your connection to Rserve in Tableau under Help, Settings and Performance, Manage R Connection.
As of Tableau 9, you can use *.rdata files with Tableau. Tableau 9 will read the first item stored in the *.rdata file. Just open an *.rdata file under "Statistical Files" in the Tableau intro screen.
To do this:
save(myDataframe, file = "Myfile.rdata")
This saves the file with the dataframe stored in it. You can save as many objects as you want, but Tableau will only read the first one. It can read vectors and variables as well, if they are the first item. Note that rdata files compress data quite a bit; I recently compressed 900 MB down to 25 MB. However, Tableau will still need to decompress it to use it, so be careful about memory.

Can we write BSON files in R without MongoDB

I am trying to explore alternate serialization/deserialization options in R.
Is it possible to read/write data in the BSON file format in R without actually creating or using a MongoDB database?
I browsed the rmongodb package description, and it appeared that the package might require one to use a MongoDB database for reading/writing BSON files.
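From browsing the package, it does look as though BSON objects themselves can be built and inspected in memory without any database connection; a small sketch of what I mean (I have not verified how, or whether, the resulting object can be written straight to a .bson file, which is the part I am asking about):
library(rmongodb)
# Build a BSON object purely in memory, no MongoDB connection involved
b <- mongo.bson.from.list(list(name = "example", value = 42L, tags = c("a", "b")))
# And convert it back to an R list
mongo.bson.to.list(b)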
