R - set working directory to HDFS

I need to create some data frames from very large data sets in R. Is there a way to change my working directory so that R objects that I create are saved into hdfs? I don't have enough space under /home to save these large data frames, but I need to use a few data frame functions that require a data frame as input.

If we use a data frame to operate on data from HDFS, we are technically using memory, not disk space. So the limiting factor will be memory (RAM), not the available disk space in any working directory, and changing the working directory won't make much sense.
You don't need to copy the file from HDFS to the local compute context to process it as a data frame.
Use rxReadXdf() to convert the XDF dataset to a data frame directly in HDFS.
Something like this (assuming you are in the Hadoop compute context):
hdfsFS <- RxHdfsFileSystem()
# hdfsFS is the object storing the Hadoop file system details
airDS <- RxTextData(file="/data/revor/AirlineDemoSmall.csv", fileSystem=hdfsFS)
# making a text data source from a csv file at the above hdfs location
airxdf <- RxXdfData(file="/data/AirlineXdf")
# specifying the location to create the composite xdf file in hdfs
# make sure this location exists in hdfs
airXDF <- rxImport(inFile=airDS, outFile=airxdf)
# importing the csv to a composite xdf
airDataFrame <- rxReadXdf(file=airXDF)
# now airDataFrame is a data frame in memory
# use class(airDataFrame) to double check
# do your required operations on this data frame

Read partitioned parquet directory (all files) in one R dataframe with apache arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)?
The situation
created parquet files with a Spark pipeline and saved them on S3
read them with RStudio/RShiny, with one column as index, to do further analysis
The parquet file structure
The parquet files created by my Spark pipeline consist of several parts
tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc
How do I read this component_mapping.parquet into R?
What I tried
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")
but this fails with the error
IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory
It works if I just read one file from the directory
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")
but I need to load them all in order to query the data
What I found in the documentation
In the apache arrow documentation
https://arrow.apache.org/docs/r/reference/read_parquet.html and
https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html
I found that there are some properties for the read_parquet() command, but I can't get them working and cannot find any examples.
read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
How do I set the properties correctly to read the full directory?
# should be one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)
Help would be very much appreciated.
As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0) this is possible.
I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)
The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html
The example below shows a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself.
library(arrow)
## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
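As a follow-up, the same dataset can also be queried with dplyr verbs, which pushes the column selection and row filtering down before anything is materialised in memory. A hedged sketch (the column names below are placeholders, not from the original question):
library(arrow)
library(dplyr)
## Open the multi-file dataset without loading it into memory
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Select a subset of columns and filter rows, then collect into an R data frame
## (my_index_column and some_value are hypothetical column names)
DF <- DS %>%
  select(my_index_column, some_value) %>%
  filter(some_value > 100) %>%
  collect()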
Solution for: Read partitioned parquet files from local file system into R dataframe with arrow
As I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries like sparklyr, SparkR, or reticulate and dplyr, as described e.g. in How do I read a Parquet in R and convert it to an R DataFrame?
I have now solved my task with your proposal, using arrow together with lapply and rbindlist:
my_df <-data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))
Looking forward to when this functionality is available in Apache Arrow.
Thanks
Reading a directory of files is not something you can achieve by setting an option to the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you also can select/filter on each if you only need a known subset of the data.
In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.
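For illustration, a minimal sketch of that iterate-and-bind approach (the directory name comes from the question; dplyr::bind_rows is just one of several ways to combine the pieces):
library(arrow)
library(dplyr)
# list the individual part files inside the parquet directory
files <- list.files("component_mapping.parquet", pattern = "\\.parquet$", full.names = TRUE)
# read each part (read_parquet's col_select could trim columns here) and bind the rows
my_df <- bind_rows(lapply(files, read_parquet))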
Solution for: Read partitioned parquet files from S3 into R dataframe using arrow
As it took me quite a long time to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned parquet files from S3.
library(arrow)
library(aws.s3)
library(data.table)
bucket <- "mybucket"
prefix <- "my_prefix"
# using the aws.s3 library to get all "part-" files (Key) for one parquet folder from a bucket, for a given prefix pattern for a given component
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key
# apply the aws.s3::s3read_using function to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) { s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket) })
# concatenate all data together into one data.frame
data <- do.call(rbind, data)
What a mess but it works.
@neal-richardson is there a way to use arrow directly to read from S3? I couldn't find anything in the documentation for R.
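For what it's worth, newer arrow releases can also read from S3 directly when the package is built with S3 support. A hedged sketch (the bucket and prefix are the placeholders from above, and credentials are assumed to be available in the environment):
library(arrow)
## Open the partitioned parquet directory straight from S3
## (requires an arrow build with S3 support; credentials are read from the environment)
DS <- arrow::open_dataset("s3://mybucket/my_prefix/component_mapping.parquet")
## Scan it into an Arrow Table and convert to an R data frame, as in the earlier example
my_df <- as.data.frame(Scanner$create(DS)$ToTable())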
I am working on this package to make this easier. https://github.com/mkparkin/Rinvent
Right now it can read parquet files or delta files from local storage, AWS S3, or Azure Blob.
# read parquet from local with where condition in the partition
readparquetR(pathtoread="C:/users/...", add_part_names=F, sample=F, where="sku=1 & store=1", partition="2022")
#read local delta files
readparquetR(pathtoread="C:/users/...", format="delta")
your_connection = AzureStor::storage_container(AzureStor::storage_endpoint(your_link, key=your_key), "your_container")
readparquetR(pathtoread="blobpath/subdirectory/", filelocation = "azure", format="delta", containerconnection = your_connection)

Saving H2o data frame

I am working with a 10GB training data frame. I use the H2O library for faster computation. Each time I load the dataset, I have to convert the data frame into an H2O object, which takes a lot of time. Is there a way to store the converted H2O object (so that I can skip the as.h2o(trainingset) step each time I make trials at building models)?
After the first transformation with as.h2o(trainingset) you can export / save the file to disk and later import it again.
my_h2o_training_file <- as.h2o(trainingset)
path <- "whatever/my/path/is"
h2o.exportFile(my_h2o_training_file, path = path)
And when you want to load it use either h2o.importFile or h2o.importFolder. See the function help for correct usage.
Or save the file as csv / txt before you transform it with as.h2o and load it directly into h2o with one of the above functions.
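For concreteness, a minimal sketch of that round trip, assuming an H2O cluster has been started with h2o.init() and reusing the placeholder paths above (the csv file name is an assumption):
library(h2o)
h2o.init()
# first session: convert once and export the H2O frame to disk
my_h2o_training_file <- as.h2o(trainingset)
h2o.exportFile(my_h2o_training_file, path = "whatever/my/path/is/trainingset.csv")
# later sessions: skip as.h2o() and import the saved file directly
my_h2o_training_file <- h2o.importFile(path = "whatever/my/path/is/trainingset.csv")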
as.h2o(d) works like this (even when client and server are the same machine):
In R, export d to a csv file in a temp location
Call h2o.uploadFile() which does an HTTP POST to the server, then a single-threaded import.
Returns the handle from that import
Deletes the temp csv file it made.
Instead, prepare your data in advance somewhere(*), then use h2o.importFile() (See http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.importFile.html). This saves messing around with the local file, and it can also do a parallelized read and import.
*: For speediest results, the "somewhere" should be as close to the server as possible. For it to work at all, the "somewhere" has to be somewhere the server can see. If client and server are the same machine, then that is automatic. At the other extreme, if your server is a cluster of machines in an AWS data centre on another continent, then putting the data into S3 works well. You can also put it on HDFS, or on a web server.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/importing-data.html for some examples in both R and Python.

How do I export a data frame to a file?

How do I export a data frame from R (the object is in the Global Environment) to some folder on the desktop? I have created some data frames in R and need to export them to a Linux operating system. That's why I want to export the data frame to Desktop/Documents and later transfer it to the Linux machine.
The save function can do that; just specify the path where you want to export the data. Change your working directory in RStudio to that folder on the desktop, or else save it somewhere and use the Linux cp command.
Save Function Information
Save Data Frame (Stack Overflow)
saving a data file in R
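A minimal sketch of that approach (my_data_frame and the paths are placeholders, not from the question):
# save the data frame to an .RData file in a folder on the desktop
save(my_data_frame, file = "~/Desktop/my_folder/my_data_frame.RData")
# later, after copying the file to the Linux machine (e.g. with cp or scp), load it back
load("my_data_frame.RData")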
To write a data frame to a text file, the command is write.table. I'll let you read the help (?write.table) to see all the options, but a sample usage is
write.table(x = mtcars, file = "C:/exported_mtcars.txt")
This will write the data frame called mtcars to a file called exported_mtcars.txt on my C drive. The default is to use spaces to separate columns. If you want tab-separations, specify sep = "\t".
You may want to simply set your working directory to the folder you've created (setwd("C:/Users/...your filepath.../Desktop/your_folder")). Then you can just specify, e.g., file = "file1.txt" for the file names in write.table.
As far as writing multiple data frames to multiple files, I strongly recommend working with lists of data frames. I'll refer you to my answer here on the subject. See especially the section I didn't put my data in a list :( I will next time, but what can I do now?. You can then pretty easily use a for loop to save all your data frames to files using write.table.
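For example, a minimal sketch of that loop, assuming the data frames have been collected in a named list (the list contents and output folder are placeholders):
# a named list of data frames (built-in datasets used as placeholders)
my_list <- list(cars = mtcars, flowers = iris)
# write each data frame in the list to its own tab-separated text file
for (nm in names(my_list)) {
  write.table(my_list[[nm]], file = file.path("~/Desktop/your_folder", paste0(nm, ".txt")), sep = "\t", row.names = FALSE)
}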
You have to specify the path where you want to export a complete image of your current working environment:
save.image("~/User/RstudioFiles/dataset.RData")
Hope it works!

Converting R dataframe to H2O Frame without writing to disk

I know the as.h2o function from the h2o library converts an R data.frame to an H2O frame. Two questions:
Does as.h2o() write data to disk during conversion? How long is this data stored?
Are there other options that avoid the temporary step of writing to disk?
The exact path of running as.h2o on a data.frame, df, is:
path <- tempfile(fileext = ".csv")  # a temporary csv location
write.csv(df, path)                 # export the data.frame to that csv
hf <- h2o.uploadFile(path)          # HTTP POST to the server, single-threaded import
file.remove(path)                   # delete the temporary csv file
We temporarily write the data.frame to disk and then upload (rather than import) the file into H2O, and as soon as the file is uploaded we delete the temporary file. There is no cleaner alternative that avoids writing to disk.

R Converting large CSV files to HDFS

I am currently using R to carry out analysis.
I have a large number of CSV files, all with the same headers, that I would like to process using R. I had originally read each file sequentially into R and row-bound them together before carrying out the analysis.
The number of files that need to be read in is growing, so keeping them all in memory to carry out manipulations of the data is becoming infeasible.
I can combine all of the CSV files together without using R, and thus without keeping them in memory. This leaves a huge CSV file; would converting it to HDFS make sense in order to be able to carry out the relevant analysis? And in addition to this... or would it make more sense to carry out the analysis on each csv file separately and then combine the results at the end?
I am thinking that perhaps a distributed file system and a cluster of machines on Amazon would let me carry out the analysis efficiently.
Looking at rmr here, it converts data to HDFS, but apparently it's not amazing for really big data... how would one convert the csv in a way that allows efficient analysis?
You can build a composite csv file in HDFS. First, create an empty HDFS folder. Then, pull each csv file separately into that folder. In the end, you will be able to treat the folder as a single HDFS file.
In order to pull the files into HDFS, you can either use a for loop in the terminal, use the rhdfs package, or load your files in memory and use to.dfs (although I don't recommend the last option). Remember to take the headers off the files.
If you use rmr2, I advise you to first convert the csv into rmr2's native format on HDFS, then perform your analysis on it. You should be able to deal with big data volumes.
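A hedged sketch of the rhdfs route mentioned above (it assumes the rhdfs package is installed, HADOOP_CMD is set, and the local and HDFS paths are placeholders):
library(rhdfs)
hdfs.init()
# create an empty HDFS folder that will act as the composite csv "file"
hdfs.mkdir("/data/mycsvfolder")
# push each local, header-less csv part into that folder
local_parts <- list.files("/local/csv_parts", pattern = "\\.csv$", full.names = TRUE)
for (f in local_parts) {
  hdfs.put(f, "/data/mycsvfolder")
}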
HDFS is a file system, not a file format. HDFS actually doesn't handle lots of small files well: with a default block size of 64MB, each small file becomes its own block, which adds NameNode metadata overhead and hurts processing efficiency.
Hadoop works best on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS, which your Hadoop tool should have a better time handling.
hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv
