Converting R dataframe to H2O Frame without writing to disk - r

I know the as.h2o function from h2o library converts an R data.frame to an H2O frame. Two questions:
Does as.h2o() write data to disk during conversion? How long is this data stored?
Are there other options that avoid the temporary step of writing to disk?

Running as.h2o() on a data.frame, df, does roughly the following:
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
h2o.uploadFile(path)
file.remove(path)
The data.frame is temporarily written to disk and then uploaded (rather than imported) into H2O; as soon as the upload finishes, the temporary file is deleted. There is no cleaner alternative that avoids writing to disk.
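A minimal base-R sketch of that temp-file lifecycle, with the upload step left as a comment so it runs without an H2O cluster. The data frame here is made-up example data, and the sketch mirrors the steps described above rather than the library's actual source.

```r
# Example data frame (hypothetical; stands in for your own df)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# 1. Write the data frame to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
existed_during_upload <- file.exists(path)

# 2. This is where the file would be uploaded to H2O, e.g.
#    hf <- h2o::h2o.uploadFile(path)   # needs a running H2O cluster

# 3. Delete the temporary file as soon as the upload has finished
file.remove(path)
exists_afterwards <- file.exists(path)
```

So the CSV only lives on disk for the duration of the conversion call.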

Related

Convert raw bytes into a NetCDF object

I am pulling in NetCDF data from a remote server using data <- httr::GET(my_url) in an R session. I can writeBin(content(data, "raw"), "my_file.nc") and then nc_open("my_file.nc"), but that is rather cumbersome (I am processing hundreds of NetCDF files).
Is there a way to convert the raw data straight into a ncdf4 object without going through the file system? For instance, would it be possible to pipe the raw data into nc_open()? I looked at the source code and the function prototype expects a named file, so I suppose a named pipe might work but how do I make a named pipe from a raw blob of bytes in R?
Any other suggestions welcome.

Which file format takes less space in sink() function?

I am running a long-running script (it gets information from a server) that runs the whole day, and the sink() function saves the output in .txt format. I have heard that sink() sometimes stops abruptly if a huge file is created. In my case, the file size is approx. 100-200 MB. Which file format is better to use in order to save some space? Or are there any other functions for saving data to my computer?
The first option that comes to mind is the feather package. It stores data frames in a binary format, which makes it easy to write data frames to disk and read them back. The resulting files should also be lightweight compared to traditional options like sink().
An example workflow would be:
#write data
library(feather)
path <- "my_data.feather"
write_feather(df, path)
#read data
df <- read_feather(path)
I don't have your data on hand to benchmark this myself, so try it out and let me know if it is indeed faster.
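One way to check this on your own machine is to write the same data frame both ways and compare file size and write time. This is only a sketch: the data frame is made up, and write.table() is used as a stand-in for plain-text output, since sink() captures printed output rather than a data frame. Results will depend on your data.

```r
library(feather)

# Made-up example data; replace with the data your script collects
df <- data.frame(id = 1:100000, value = rnorm(100000))

txt_path <- tempfile(fileext = ".txt")
feather_path <- tempfile(fileext = ".feather")

# Time each write; "elapsed" is wall-clock seconds
txt_time <- system.time(write.table(df, txt_path, row.names = FALSE))[["elapsed"]]
feather_time <- system.time(write_feather(df, feather_path))[["elapsed"]]

# Compare the size on disk in bytes
sizes <- c(text = file.size(txt_path), feather = file.size(feather_path))
print(sizes)
print(c(text = txt_time, feather = feather_time))
```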

Saving H2o data frame

I am working with a 10GB training data frame. I use the H2O library for faster computation. Each time I load the dataset, I have to convert the data frame into an H2O object, which takes a long time. Is there a way to store the converted H2O object, so that I can skip the as.h2o(trainingset) step each time I run trials on building models?
After the first transformation with as.h2o(trainingset) you can export / save the file to disk and later import it again.
my_h2o_training_file <- as.h2o(trainingset)
path <- "whatever/my/path/is"
h2o.exportFile(my_h2o_training_file , path = path)
And when you want to load it use either h2o.importFile or h2o.importFolder. See the function help for correct usage.
Or save the file as csv / txt before you transform it with as.h2o and load it directly into h2o with one of the above functions.
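Put together, the export-once, import-later workflow might look like the sketch below. The two helper names are hypothetical, the functions assume h2o.init() has already been called, and the path must be somewhere the H2O server can read and write. The h2o:: prefix is used so the definitions can be loaded without attaching the package.

```r
# First session: convert once (the slow step) and export the H2O frame.
# export_training_frame is a hypothetical helper name.
export_training_frame <- function(trainingset, path) {
  hf <- h2o::as.h2o(trainingset)
  h2o::h2o.exportFile(hf, path = path, force = TRUE)  # force = TRUE overwrites an existing file
  invisible(path)
}

# Later sessions: skip as.h2o() and import the exported file directly.
# load_training_frame is a hypothetical helper name.
load_training_frame <- function(path) {
  h2o::h2o.importFile(path)
}
```

In later sessions only load_training_frame(path) is needed, so the data frame never has to be loaded into R at all.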
as.h2o(d) works like this (even when client and server are the same machine):
In R, export d to a csv file in a temp location
Call h2o.uploadFile() which does an HTTP POST to the server, then a single-threaded import.
Returns the handle from that import
Deletes the temp csv file it made.
Instead, prepare your data in advance somewhere(*), then use h2o.importFile() (See http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.importFile.html). This saves messing around with the local file, and it can also do a parallelized read and import.
*: For speediest results, the "somewhere" should be as close to the server as possible. For it to work at all, the "somewhere" has to be somewhere the server can see. If client and server are the same machine, then that is automatic. At the other extreme, if your server is a cluster of machines in an AWS data centre on another continent, then putting the data into S3 works well. You can also put it on HDFS, or on a web server.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/importing-data.html for some examples in both R and Python.

Import .rds file to h2o frame directly

I have a large .rds file saved and I am trying to import it directly into an H2O frame, because it is not feasible for me to read the file into the R environment and then convert it with as.h2o().
I am looking for a fast and efficient way to deal with it.
My attempts:
I tried reading the file and then converting it into an H2O frame, but that is very time-consuming.
I tried saving the file in .csv format and using h2o.importFile() with parse=TRUE, but due to memory constraints I was not able to save the complete data frame.
Please suggest an efficient way to do this.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
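A rough sketch of that second option, split into two hypothetical helpers. It assumes the .rds file fits in R's memory long enough to be written out, that h2o.init() has been called before the import step, and that the CSV path is visible to the H2O server. It falls back to write.csv() if data.table is not installed.

```r
# Step 1: read the .rds file and write it out as CSV.
# rds_to_csv is a hypothetical helper name.
rds_to_csv <- function(rds_path, csv_path) {
  d <- readRDS(rds_path)
  if (requireNamespace("data.table", quietly = TRUE)) {
    data.table::fwrite(d, csv_path)            # fast writer
  } else {
    write.csv(d, csv_path, row.names = FALSE)  # slower base-R fallback
  }
  # d is local to this function, so R can release it once we return
  invisible(csv_path)
}

# Step 2: let H2O read the CSV itself, without going through as.h2o().
# csv_to_h2o is a hypothetical helper name.
csv_to_h2o <- function(csv_path) {
  h2o::h2o.importFile(csv_path)
}
```

Running the two steps separately means R's copy of the data can be freed before H2O loads its own.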

R- set working directory to hdfs

I need to create some data frames from very large data sets in R. Is there a way to change my working directory so that R objects that I create are saved into hdfs? I don't have enough space under /home to save these large data frames, but I need to use a few data frame functions that require a data frame as input.
If we use a data frame to do some operations on data from hdfs, we are technically using memory, not disk space. So the limiting factor will be memory (RAM), not the available disk space in any working directory, and changing the working directory won't make much sense.
You don't need to copy the file from hdfs to local compute context to process it as dataframe.
Use rxReadXdf() to directly convert the xdf dataset to a dataframe in hdfs itself.
Something like this (assuming you are in a Hadoop compute context):
hdfsFS <- RxHdfsFileSystem()
# hdfsFS is the object storing the Hadoop file system details
airDS <- RxTextData(file="/data/revor/AirlineDemoSmall.csv", fileSystem=hdfsFS)
# making a text data source from a csv file at the above hdfs location
airxdf <- RxXdfData(file="/data/AirlineXdf")
# specifying the location to create the composite xdf file in hdfs
# make sure this location exists in hdfs
airXDF <- rxImport(inFile=airDS, outFile=airxdf)
# importing the csv into a composite xdf
airDataFrame <- rxReadXdf(file=airXDF)
# now airDataFrame is a data frame in memory
# use class(airDataFrame) to double check
# do your required operations on this data frame
