How to access data stored in HBase from Spark in R

I need to get data stored in HBase to analyse in R, but I need to do it through Spark because the data does not fit in memory.
Does anybody know how to access data in HBase through Spark in R?
I've searched both the web and SO but no joy. I've found pages that explain how to access data in HBase from R, but they don't do it through Spark, and all the pages I've seen explaining how to work with Spark from R (with sparklyr) use the iris dataset as their example :(
Any help is much appreciated!

One option seems to be to install rhbase, pull the data out of HBase and save it to a CSV file first, then use SparkR to read the data from the CSV file and proceed with the analysis: blogs.wandisco.com/2014/08/19/experiences-r-big-data/
Is there a better way? One that does not require saving the data to a CSV file?
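
For what it's worth, one route that skips the CSV step entirely is to go through sparklyr together with a Spark-HBase connector such as Hortonworks' shc. The sketch below is just that, a sketch: the connector coordinates, table name, column family and column names are placeholders you would have to adapt to your cluster.

library(sparklyr)

config <- spark_config()
# placeholder: pick the shc build matching your Spark/Scala/HBase versions
config$sparklyr.defaultPackages <- "com.hortonworks:shc-core:1.1.1-2.1-s_2.11"
sc <- spark_connect(master = "yarn-client", config = config)

# catalog describing the HBase table to the connector (all names are placeholders)
catalog <- '{
  "table":  {"namespace": "default", "name": "my_table"},
  "rowkey": "key",
  "columns": {
    "rowkey": {"cf": "rowkey", "col": "key",   "type": "string"},
    "value":  {"cf": "cf1",    "col": "value", "type": "string"}
  }
}'

hbase_tbl <- spark_read_source(
  sc,
  name    = "hbase_table",
  source  = "org.apache.spark.sql.execution.datasources.hbase",
  options = list(catalog = catalog)
)

Once that works, hbase_tbl behaves like any other sparklyr table, so dplyr verbs are pushed down to Spark rather than pulling the data into R.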

Related

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS and store in parquet format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is that nowhere in the R arrow docs can I find information about working with HDFS, and in general there is not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. While reading is fine (but slow), inserting larger amounts of data into Impala/Hive this way is just awful (slow, it often fails, and Impala isn't really built to ingest data this way).
I know I could probably use pyarrow to work with HDFS, but I would like to avoid installing Python in my Docker image just for this purpose.
The bindings for this are not currently implemented in R; there is a ticket open here on the project JIRA, which at time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.
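Not part of the answer above, but if Spark happens to be available on the cluster, a workaround in the meantime is to let Spark do the HDFS I/O through sparklyr. The connection settings and hdfs:// paths below are placeholders:

library(sparklyr)

sc <- spark_connect(master = "yarn-client")   # adjust to your cluster

# read parquet files that live in HDFS into a Spark DataFrame reference
events <- spark_read_parquet(sc, name = "events", path = "hdfs:///data/events/")

# write an R data frame back out to HDFS as parquet
local_tbl <- copy_to(sc, mtcars, "mtcars_tbl", overwrite = TRUE)
spark_write_parquet(local_tbl, path = "hdfs:///data/mtcars_parquet/")

It is heavier than a native arrow binding would be, but it avoids both the odbc insert path and a Python dependency in the image.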

Download excel file from an ArcGIS rest endpoint using R

I'm trying to download the excel file stored at https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data (website: https://www.folkhalsomyndigheten.se/smittskydd-beredskap/utbrott/aktuella-utbrott/covid-19/bekraftade-fall-i-sverige) with COVID-19 data on Sweden. The data is freely available and is stored on an ArcGIS infrastructure.
I tried with download.file(), but what is produced is not the Excel file but some unreadable file whose actual format is not clear to me.
I tried to investigate the ArcGIS REST API but I couldn't find a simple solution.
Do you have any guidance on how to do this with R, or with a general curl/wget based approach?
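
A frequent cause of an "unreadable" download like this is fetching a binary .xlsx in text mode, which is the default for download.file() on Windows. A sketch, assuming that is what is happening here:

url <- "https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data"
destfile <- tempfile(fileext = ".xlsx")
download.file(url, destfile, mode = "wb")   # "wb" keeps the binary xlsx intact

library(readxl)
excel_sheets(destfile)                      # inspect the available sheets
dat <- read_excel(destfile, sheet = 1)

If the file still does not open, checking the Content-Type header of the response will tell you what the endpoint is actually returning.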

Is there an R package to import VSAM files as a tibble or dataframe?

I am looking for ways to process VSAM files with R and export as a csv.
I have been searching the web and have not been able to find any methods of using R to read VSAM files.
A little more information would be of use. How are you going to get the data from the VSAM files? Are you reading directly from an IBM system? What access method will you be using? What is the structure of the file you are reading, since if you want it put into a data.frame, is it something like a CSV file already? Any other particulars would be helpful.

Huge XML-Parsing/converting using R or RDotnet

I have an XML file of 780 GB (yes yes, indeed: a 5 GB pcap file which was converted to XML).
The name of the XML file is tmp.xml.
I am trying to run this operation in RStudio:
require("XML")
xmlfile <<- xmlRoot(xmlParse("tmp.xml"))
When I try to do this in R I get errors (memory allocation failure, R session aborted, etc.).
Is there any benefit to using RDotnet instead of regular R?
Is there any way to use R to do this?
Do you know of another strong tool to convert this huge XML file to CSV or an easier format?
Thank you!
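
xmlParse() builds the entire DOM in memory, so a 780 GB file will fail no matter which R front end is driving it; a streaming (SAX) parser that handles one element at a time is the usual way out. A rough sketch with XML::xmlEventParse, where "packet" is only a placeholder for whichever element you actually want to extract:

library(XML)

out <- file("tmp.csv", "w")

# called once per complete <packet> element; writes its attributes straight
# to the CSV so nothing accumulates in memory
packet_handler <- function(node) {
  attrs <- xmlAttrs(node)
  writeLines(paste(attrs, collapse = ","), out)
}

xmlEventParse("tmp.xml", branches = list(packet = packet_handler))
close(out)

An alternative outside R is a command-line streaming converter, or going back to the original pcap and exporting the fields you need as CSV directly with tshark instead of going through XML at all.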

R dataset connection to Tableau

Recently Tableau added the functionality of an R connection in their 8.1 release. I want to know if there is any way I can bring an entire table created in R into Tableau, or an .rds object which contains the dataset.
There is a tutorial on the Tableau website for this and a blog post on R-bloggers which discusses it. The tutorial has a number of comments, and one of them (in early Dec, I think) asks how to get an .rds file in. You need to start Rserve and then execute a script on it to get your data.
Sorry I can't be of more help, as I only looked into it briefly and put it on the back-burner, but if you get stuck they seem to come back quickly if you post a comment on the page:
http://www.tableausoftware.com/about/blog/2013/10/tableau-81-and-r-25327
Just pointing out that the Tableau Data Extract API might be useful here, even if the current version of R integration doesn't yet meet your needs. (Note, that link is to the version 8.1 docs released in late 2013 - so look for the latest version to see what functionality they've added since)
If what you want to do is manipulate data in R and then send a table of data to Tableau for visualization, you could first try the simple step of exporting the data from R as a CSV file and then visualizing that data in Tableau. I know that's not sexy, but it's always good to make sure you've got a way to get the output you need before investing time in optimizing the process.
If that gets the effect you want, but you just want to automate more of the steps, then take a look at the Tableau Data Extract API. You could use that library to generate a Tableau Data Extract instead of a CSV file. If you have something in production that needs updates, then you could presumably create a Python script or JVM program to read your RDS file periodically and generate a revised extract.
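For the CSV route, the R side is a one-liner (the file name is just an example):

write.csv(dataset, "dataset_for_tableau.csv", row.names = FALSE)

In Tableau you then connect to that file through the same Connect To a File dialogue used for statistical files in the answer below, choosing the text file option instead.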
Let us assume your data.frame/tibble (say, an object called dataset) is ready in R/RStudio and you want to connect it with Tableau.
1. In RStudio (or R terminal), execute the following steps:
install.packages("Rserve")
library(Rserve)
Rserve() ##This gets the R connection service up and running
2. Now go to Tableau (I am using 10.3.2):
Help > Settings and Performances > Manage External Service Connection
Enter localhost in the Server field and click on Test Connection.
You have now established a connection between R and Tableau.
3. Come back to RStudio. Now we need a .rdata file that will contain our R object(s), in this case dataset. This is the R object that we want to use in Tableau. Enter this in the R console:
save(dataset, file="objectName.rdata")
4. Switch to Tableau now.
Connect To a File > Statistical File
Go to the working directory where the newly created objectName.rdata resides. From the drop-down list of file types, select R files (*.rdata, *.rda) and select your object. This will open the object you created in R in Tableau. Alternatively, you can drag and drop the file directly onto Tableau's workspace.
