How to read and rbind large CSV files efficiently?

I have 20 large CSV files (100-150 MB each) that I would like to load into R, rbind into one large data frame, and then analyse. Reading each CSV file runs on a single core and takes about 7 minutes. I am on a 64-bit, 8-core Linux machine with 16 GB RAM, so resources should not be an issue.
Is there any way to perform this process more efficiently? I am also open to other (open-source Linux) software, for example binding the CSV files in a different program and then loading the result into R, or anything else that could make this process faster.
Thank you very much

Maybe you want something like paste. It's a shell utility that merges lines of files.
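Note that paste joins files side by side; for stacking rows the way rbind does, a plain concatenation that keeps a single header is the shell-side equivalent. A minimal sketch of both that route and a parallel in-R route, assuming the files live in ./csvs and share an identical header (paths and core count are placeholders):
library(data.table)
library(parallel)

files <- list.files("csvs", pattern = "\\.csv$", full.names = TRUE)

# Option 1: read the files in parallel and stack them (fread is itself
# multi-threaded, so keep mc.cores modest)
dt <- rbindlist(mclapply(files, fread, mc.cores = 4))

# Option 2: concatenate outside R, keeping only the first header, then one fread
system("awk 'FNR > 1 || NR == 1' csvs/*.csv > combined.csv")
dt <- fread("combined.csv")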

How to get data into h2o fast

What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads with ~256 GB RAM
~65 GB of data to upload. (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. There is no special processing, just as.h2o(...).
It takes less than a minute using fread to get the text into the R workspace, and then I make a few row/column transformations (diffs, lags) and try to import.
Total R memory is ~56 GB before trying any sort of as.h2o, so the 128 GB allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping RAM up to 128 GB in h2o.init
using slam, data.table, and options( ...
converting to as.data.frame before as.h2o
writing to a CSV file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
writing to sqlite3: too many columns for a table, which is weird
checking drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working)
Update:
So it looks like my only option is to make a giant text file and then use h2o.importFile(...) on it. I'm up to 15 GB written.
Update2:
It is a hideous CSV file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53 PM until 2:44 PM to write the CSV file. Importing it was substantially faster once it was written.
Think of as.h2o() as a convenience function that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
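To make the distinction concrete, a small sketch (the path is a placeholder; both functions take a file path but resolve it on different sides):
library(h2o)
h2o.init()

# h2o.uploadFile(): the R client reads the file and streams it to the cluster
hf <- h2o.uploadFile("/data/mydata.csv")

# h2o.importFile(): the cluster nodes read the file themselves, in parallel,
# so the path must be visible from every node
hf <- h2o.importFile("/data/mydata.csv")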
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
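A hedged sketch of that column split (both file names are made up, and an H2O cluster is assumed to be running):
library(h2o)
# assumes h2o.init() has already been called

# the 100 columns you actually need in R, and the 2200 you don't
small <- h2o.importFile("cols_needed_in_r.csv")
rest  <- h2o.importFile("other_2200_cols.csv")

# both R and H2O are column-oriented, so binding by column is cheap
full <- h2o.cbind(small, rest)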
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.

Import .rds file to h2o frame directly

I have a large .rds file saved, and I am trying to import it directly into an H2O frame using some functionality, because it is not feasible for me to read that file into the R environment and then use the as.h2o function to convert it.
I am looking for a fast and efficient way to deal with this.
My attempts:
I have tried to read the file and then convert it into an H2O frame, but it is a very time-consuming process.
I tried saving the file in .csv format and using h2o.import() with parse=T.
Due to memory constraints I was not able to save the complete data frame.
Please suggest an efficient way to do this.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
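Both routes, sketched; my_df, the file names, and the ability to hold one copy of the data in R are assumptions:
library(h2o)
library(data.table)
h2o.init()

my_df <- readRDS("big.rds")

# Option 1: let as.h2o() write with data.table::fwrite underneath
options("h2o.use.data.table" = TRUE)
hf <- as.h2o(my_df)

# Option 2: write the CSV yourself, drop the R copy, then use H2O's
# parallel file reader directly
fwrite(my_df, "big.csv")
rm(my_df); gc()
hf <- h2o.importFile("big.csv")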

Read a sample from sas7bdat file in R

I have a sas7bdat file of around 80 GB. Since my PC has only 4 GB of memory, the only option I can see is reading a subset of its rows. I tried the sas7bdat package in R, which gives the error "big endian files are not supported".
The read_sas() function in haven seems to work, but it only supports selecting specific columns, while I need to read a subset of rows with all columns. For example, it would be fine if I could read 1% of the data just to understand it.
Is there any way to do this? Any package which can work?
Later on I plan to read parts of the file and divide it into 100 or so sections
If you have Windows you can use the SAS Universal Viewer, which is free, and export the dataset to CSV. Then you can import the CSV into R in more readable chunks using this method.
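Once you have the CSV export, a minimal sketch of sampled reading with data.table::fread (the file name and row counts are placeholders; skip must stay below the file's total line count):
library(data.table)

# peek at the first 100,000 rows to get a feel for the columns
peek <- fread("big_export.csv", nrows = 1e5)

# pull another 100,000 rows from deeper in the file
slice <- fread("big_export.csv", skip = 5e6, nrows = 1e5,
               header = FALSE, col.names = names(peek))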

fread memory usage is much larger than the file

I am on a 512 GB RAM server. I have an 84 GB CSV (hefty, I know). I am reading only 31 columns of 79; the excluded ones are all floats/decimals.
After comparing many methods, it seems the highest-performance way to do what I want is to fread the file. The file size is 84 GB, but watching top, the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.
I know fread preallocates memory, which is why it's so fast. Just wondering: is this normal, and is there a way to curb the memory consumption?
Edit: it seems that even if I ask fread to read only 10,000 rows (of 300M), it still preallocates 84 GB of memory.
See R FAQ 7.42. If you want to minimize the resources you use on the server, read the csv using fread once, then save the resulting object using save or saveRDS. Then read that binary file when you need the data.
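In other words, pay the fread cost once and cache a binary copy; a sketch with made-up file and column names:
library(data.table)

wanted <- c("id", "date", "price")        # the 31 columns you need (hypothetical names)
dt <- fread("big.csv", select = wanted)   # the expensive read, done once

saveRDS(dt, "big.rds")                    # binary cache on disk

# in later sessions, skip the CSV entirely
dt <- readRDS("big.rds")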
Or you can use a command-line tool like cut, awk, or sed to select only the columns you want and write the output to another file. Then you can use fread on that smaller file.
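For example with cut (assuming the wanted columns happen to be the first 31 and no fields contain quoted commas; file names are made up):
# build a slimmer file outside R, then fread it
system("cut -d',' -f1-31 big.csv > big_31cols.csv")
dt <- data.table::fread("big_31cols.csv")

# or pipe the command straight into fread, skipping the intermediate file
dt <- data.table::fread(cmd = "cut -d',' -f1-31 big.csv")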
Take a look at http://www.r-bloggers.com/efficiency-of-importing-large-csv-files-in-r/ or "Reading 40 GB csv file into R using bigmemory".
Maybe the bigmemory library can help you.

R Converting large CSV files to HDFS

I am currently using R to carry out analysis.
I have a large number of CSV files, all with the same headers, that I would like to process using R. I originally read each file sequentially into R and row-bound them together before carrying out the analysis.
The number of files that need to be read in is growing, so keeping them all in memory while manipulating the data is becoming infeasible.
I can combine all of the CSV files without using R, and thus avoid keeping them in memory. That leaves one huge CSV file; would converting it to HDFS make sense in order to carry out the relevant analysis? And in addition to this... or would it make more sense to carry out the analysis on each CSV file separately and then combine the results at the end?
I am thinking of perhaps using a distributed file system and a cluster of machines on Amazon to carry out the analysis efficiently.
Looking at rmr, it converts data to HDFS, but apparently it's not great for really big data... how would one convert the CSVs in a way that allows efficient analysis?
You can build a composite CSV file in HDFS. First, create an empty HDFS folder. Then push each CSV file separately into that folder. In the end, you will be able to treat the folder as a single HDFS file.
In order to push the files into HDFS, you can use a terminal for loop, the rhdfs package, or load your files in memory and use to.dfs (although I don't recommend the last option). Remember to strip the header off the files.
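A rough sketch of the terminal-for-loop route, driven from R (the local folder and HDFS path are made up; it assumes the hdfs client is on the PATH and the target folder already exists):
csvs <- list.files("local_csvs", pattern = "\\.csv$", full.names = TRUE)

for (f in csvs) {
  # drop the header line, then stream the rest into the shared HDFS folder
  cmd <- sprintf("tail -n +2 %s | hdfs dfs -put - /user/me/myfiles/%s",
                 shQuote(f), basename(f))
  system(cmd)
}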
With rmr2, I advise you to first convert the CSV into the native HDFS format, then perform your analysis on it. You should be able to deal with big data volumes.
HDFS is a file system, not a file format. And HDFS doesn't handle lots of small files well: it usually has a default block size of 64 MB, so every file, however tiny, occupies its own block and adds per-block overhead.
Hadoop works best on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS, which your Hadoop tools will have an easier time handling.
hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv
