How to read parquet files from HDFS in R

I need to read parquet files stored on HDFS (a Kerberos-protected Hadoop cluster) from my R program. I came across a couple of packages, but none of them completely satisfies what I need:
rhadoop: It looks like an old project with no further development. The rhdfs package in this collection supports neither parquet files nor Kerberos.
arrow: It seems to be able to read parquet files, but there is no connectivity to HDFS.
Is there any other library that lets me read parquet files from HDFS in R?
I'm aware of sparklyr, but I believe I would need to install Spark on the machine that runs the Spark driver. Is that correct? My R client is a different machine.
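For reference, a sparklyr-based read of parquet from HDFS looks roughly like the sketch below; the master URL and path are placeholders, and Kerberos details are assumed to be handled outside of R (e.g. via kinit and the cluster configuration). A Livy connection (method = "livy"), if the cluster exposes one, may be a way to avoid a full local Spark installation.
library(sparklyr)
# Hypothetical connection; swap in method = "livy" and a Livy endpoint if needed
sc <- spark_connect(master = "yarn")
# Reference a parquet file on HDFS as a Spark DataFrame
events <- spark_read_parquet(sc, name = "events", path = "hdfs:///user/me/events.parquet")
# Pull a subset into the local R session if it is small enough
local_df <- dplyr::collect(head(events, 1000))
spark_disconnect(sc)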

Related

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS and store in parquet format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is that nowhere in the R arrow docs can I find information about working with HDFS, and in general there is not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. Reading is fine (but slow), but inserting larger amounts of data into Impala/Hive this way is downright awful (slow, it often fails, and Impala isn't really built to ingest data this way).
I know I could probably use pyarrow to work with HDFS, but I'd like to avoid installing Python in my docker image just for this purpose.
The bindings for this are not currently implemented in R; there is an open ticket on the project JIRA, which at the time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.

Is it possible to write parquet files to local storage from h2o on hadoop?

I'm working with h2o (latest version 3.26.0.10) on a Hadoop cluster. I've read in a parquet file from HDFS, performed some manipulation on it, built a model, etc.
I've stored some important results in an H2OFrame that I wish to export to local storage instead of HDFS. Is there a way to export this frame as parquet?
I tried using h2o.exportFile, documented here: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html, but all the examples are for writing .csv. I tried using a file path with .parquet as the extension, and that didn't work: it wrote a file, but I think it was basically a .csv, since the file size was identical to the .csv.
example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")
On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write it in parquet format there? I could at least then move that file to local storage.
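If the frame fits into R memory, one hedged workaround (a sketch, not something the h2o docs promise) is to pull the H2OFrame into an ordinary data.frame and write the parquet file with the arrow package; the path below is a placeholder.
library(h2o)
library(arrow)
# Pull the H2OFrame into R memory (assumes the results frame is small enough)
results_df <- as.data.frame(iris_hf)
# Write it as parquet to local storage
write_parquet(results_df, "/path/on/local/filesystem/iris.parquet")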

Using sparklyr in RStudio, can I upload a LOCAL csv file to a Spark cluster?

I'm pretty new to cluster computing, so I'm not sure if this is even possible.
I am successfully creating a spark_context in RStudio (using sparklyr) to connect to our local Spark cluster. Using copy_to I can upload data frames from R to Spark, but I am trying to upload a locally stored CSV file directly to the Spark cluster using spark_read_csv, without importing it into the R environment first (it's a big 5 GB file). It's not working (even when prefixing the location with file:///), and it seems it can only read files that are ALREADY stored on the cluster.
How do I upload a local file directly to Spark without loading it into R first?
Any tips appreciated.
You cannot. The file has to be reachable from every machine in your cluster, either as a local copy on each node or placed on a distributed file system / object storage.
You can load files into Spark using the spark_read_csv() method, as long as you pass the path properly.
Note: it is not necessary to load the data into the R environment first.
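As a rough sketch of both routes (master URL and paths are placeholders): spark_read_csv works when the path is visible to the cluster nodes, and copy_to is the fallback for a file that exists only on the R client machine.
library(sparklyr)
sc <- spark_connect(master = "yarn")
# Works when the path is reachable from the cluster (HDFS, S3, or a shared mount)
flights <- spark_read_csv(sc, name = "flights", path = "hdfs:///user/me/flights.csv")
# For a client-only file, push it to HDFS first (hdfs dfs -put flights.csv /user/me/)
# or fall back to reading it into R and using copy_to():
# flights <- copy_to(sc, readr::read_csv("flights.csv"), name = "flights")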

Getting Data in and out of Rhipe [R + Hadoop]

I was trying out rhipe and the RHadoop [rmr, rhdfs, rhbase, etc.] series of packages.
With both packages [rhipe and rmr] I can ingest / read data stored in csv or text files. Both of them support creating new file formats to some extent, but I find that rmr has more support for it, or at least more resources to get started. This matters when you want to do some processing on raw data stored in HDFS and then store the result back to HDFS in a format recognizable by other components of Hadoop such as Hive or Impala. Both packages can write in their own native format, recognizable only by the package itself; rmr supports a few other formats as well.
For a reference related to rmr, have a look at: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md (a rough sketch of that route appears below).
However, for rhipe I did not find any such document, and the various approaches I tried all failed.
So my question is: after reading a file stored in HDFS and running rhwatch in rhipe, how can I write the result back as text [for example; any other recognizable format would also work]?
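For reference, the rmr2 route linked above looks roughly like the following sketch: read csv from one HDFS directory and write it back out as csv/text with an identity map (the paths are hypothetical).
library(rmr2)
csv_in  <- make.input.format("csv", sep = ",")
csv_out <- make.output.format("csv", sep = ",")
# Identity map: copy records from one HDFS path to another, re-encoded as csv
mapreduce(
  input         = "/user/me/raw_csv",
  input.format  = csv_in,
  output        = "/user/me/processed_csv",
  output.format = csv_out,
  map           = function(k, v) keyval(k, v)
)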

Amazon EMR: Using R code in Amazon EMR

I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up, etc., I just wanted to ask about using R with it.
I have one R module that calls several other modules and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is: can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions below:
Running R on EMR: Yes, you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may only need a single sizable instance; in that case I would rather use an EC2 instance with an R AMI (look for Louis Aslett).
Moving output files:
Yes, you can. It is possible to transfer your program output from EMR to an S3 storage bucket of your choice. You will have to add a step that calls the S3DistCp command to move the files. An example from my project:
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional, etc.). I am still unable to call the R program as a step, so I am having to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
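A minimal sketch of the bootstrap-time package installation mentioned above, run as an R script on each node; the package list mirrors the dependencies named here and is only illustrative.
# install_r_deps.R -- hypothetical script invoked from an EMR bootstrap action
install.packages(
  c("Rcpp", "RJSONIO", "digest", "functional", "reshape2"),
  repos = "https://cloud.r-project.org"
)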
