Is it possible to write parquet files to local storage from h2o on hadoop? - r

I'm working with h2o (latest version 3.26.0.10) on a Hadoop cluster. I've read in a parquet file from HDFS and have performed some manipulation on it, built a model, etc.
I've stored some important results in an H2OFrame that I wish to export to local storage, instead of HDFS. Is there a way to export this file as a parquet?
I tried using h2o.exportFile (documentation here: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html), but all the examples are for writing .csv. I tried using a file path with .parquet as the extension and that didn't work. It wrote a file, but I think it was basically a .csv, as the file size was identical to the .csv.
example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")
On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write that in parquet format? I could at least then move that to local storage.
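One workaround I can think of (just a sketch, assuming the results frame fits in the R client's memory and that the arrow package is installed; the paths and the results_hf name are placeholders) is to pull the H2OFrame into R and write the parquet file locally with arrow. Newer H2O docs mention a format argument on h2o.exportFile, but I haven't verified whether the 3.26 build supports it, so the sketch below sidesteps it.

library(h2o)
library(arrow)

# Pull the H2OFrame into a plain R data.frame on the client
# (only practical if the results frame fits in local memory).
results_df <- as.data.frame(results_hf)

# Write it to local storage as parquet via arrow.
write_parquet(results_df, "/path/on/local/filesystem/results.parquet")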

Related

Attach csv file to mariadb

Is it possible to access a csv file directly with the csv storage engine without going through the trouble of loading it first?
We run a data warehouse where, during load, CSV files are read into a temp table and their content is then inserted into production fact tables. I wonder if we could skip the temp table and "go directly to insert"?

How to read parquet file from AWS S3 bucket using R without downloading it locally?

I'm able to download the parquet file from an AWS S3 bucket to local storage and then read it (see the code below). But is there any way to read the parquet file directly from S3, without storing it locally?
save_object("Financial_Sample.parquet", file = "Financial_Sample.parquet", bucket = 'my-bucket')
df <- read_parquet("Financial_Sample.parquet")
Take a look at the arrow package: https://arrow.apache.org/docs/r/index.html
It can read directly from S3 and even filter before reading using dplyr verbs.
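A minimal sketch of that arrow route, assuming arrow was built with S3 support and reusing the bucket and file names from the question (the Segment filter column is just an illustration):

library(arrow)
library(dplyr)

# Read a single parquet file straight from S3, no local copy.
df <- read_parquet("s3://my-bucket/Financial_Sample.parquet")

# Or open it as a dataset and push a filter down before materializing.
ds <- open_dataset("s3://my-bucket/Financial_Sample.parquet", format = "parquet")
df_small <- ds %>% filter(Segment == "Government") %>% collect()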

How to read parquet files from HDFS in R

I need to read parquet files stored on HDFS (I have a Kerberos-protected Hadoop cluster) in my R program. I came across a couple of packages, but none of them completely satisfies what I need:
rhadoop: It looks like an old project with no further development, and the rhdfs package in that collection does not support parquet files or Kerberos.
arrow: It seems like it can read parquet files, but there is no connectivity to HDFS.
Is there any other library that lets me read parquet files from HDFS in R?
I'm aware of sparklyr, but I believe I need to install Spark on the machine that runs the Spark driver? Is that correct? My R client is a different machine.
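For the sparklyr route mentioned above, the usual shape looks like the sketch below; it assumes Spark is available to the machine running the driver (so yes, the R client needs a Spark installation, or a remote connection method such as Livy), a valid Kerberos ticket in the session, and placeholder master/path values.

library(sparklyr)
library(dplyr)

# Connect to the cluster (master and version here are placeholders).
sc <- spark_connect(master = "yarn-client", version = "2.4.0")

# Read parquet from HDFS into Spark, then pull it into R if it fits.
tbl <- spark_read_parquet(sc, name = "my_table",
                          path = "hdfs:///user/me/data.parquet")
local_df <- collect(tbl)

spark_disconnect(sc)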

using sparklyr in RStudio, can I upload a LOCAL csv file to a spark cluster

I'm pretty new to cluster computing, so not sure if this is even possible.
I am successfully creating a spark_context in RStudio (using sparklyr) to connect to our local Spark cluster. Using copy_to I can upload data frames from R to Spark, but I am trying to upload a locally stored CSV file directly to the Spark cluster using spark_read_csv, without importing it into the R environment first (it's a big 5GB file). It's not working (even when prefixing the location with file:///), and it seems that it can only read files that are ALREADY stored on the cluster.
How do I upload a local file directly to Spark without loading it into R first?
Any tips appreciated.
You cannot. The file has to be reachable from each machine in your cluster, either as a local copy or placed on a distributed file system / object storage.
You can load files from local storage into Spark by using the spark_read_csv() method; just make sure you pass the path properly.
Note: It is not necessary to load the data into the R environment first.
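To make the two answers above concrete, here is a rough sketch of both routes; the paths, table names, and the hdfs dfs -put step are placeholders, not something either answer spelled out.

library(sparklyr)

sc <- spark_connect(master = "yarn-client")

# Route 1: make the file reachable by the cluster first (e.g. copy it to
# HDFS), then point spark_read_csv at the cluster-visible path.
system("hdfs dfs -put /local/path/big_file.csv /user/me/big_file.csv")
big_tbl <- spark_read_csv(sc, name = "big_data",
                          path = "hdfs:///user/me/big_file.csv")

# Route 2 (local-mode Spark only): a file:/// path works because the driver
# and the executors share the same filesystem.
# big_tbl <- spark_read_csv(sc, name = "big_data",
#                           path = "file:///local/path/big_file.csv")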

importing compressed csv into 'h2o' using r

The 'h2o' package is a fun, Java-based ML tool that is accessible via R; the R package for accessing it is called "h2o".
One of the input avenues is to tell 'h2o' where a CSV file is and let 'h2o' upload the raw CSV. It can be more effective to just point at a folder and tell 'h2o' to import "everything in it" using the h2o.importFolder command.
Is there a way to point at a folder of "gzip" or "bzip" CSV files and get 'h2o' to import them?
According to this link (here), h2o can import compressed files. I just don't see a way to specify this for the importFolder approach.
Is it faster or slower to import the compressed form? If I have another program producing the output, does it save me time in the h2o import process if the files are compressed rather than raw text? Guidelines and performance best practices are appreciated.
As always, comments, suggestions, and feedback are solicited.
I took the advice of #screechOwl and asked on the 0xdata.atlassian.net board for h2o and was given a clear answer. It was supplied by user "cliff":
Hi, yes H2O - when importing a folder - takes all the files in the folder; it unzips gzip'd or zip'd files as needed, and parses them all into one large CSV. All the files have to be compatible in the CSV sense - same number and kind of columns.
H2O does not currently handle bzip files.
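A minimal sketch of the folder import under those constraints, assuming a running H2O cluster and a hypothetical directory of gzip'd CSVs with identical column layouts; the pattern argument restricts which files get picked up.

library(h2o)
h2o.init()

# Import every gzip'd CSV in the folder; H2O decompresses gzip/zip while
# parsing and combines the files into one frame, so the column layouts
# must match across files.
combined_hf <- h2o.importFolder(path = "/data/daily_exports",
                                pattern = "\\.csv\\.gz$")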
