I used the dump() command to dump some data frames in R.
The dump files are about 200 MB each, and one is about 1.5 GB. Later I tried to read them back using source(), and it takes a very long time; after 3-4 hours Windows reported that it had stopped working. I am using 64-bit R 3.0.0 (I tried R 2.15.3 too) on Windows 7 with 48 GB of memory. For one of the files, it threw a memory error (I don't have the log now) but loaded 4-5 of the roughly 15 datasets.
Is there any way I can load a particular dataset if I know its name?
Or is there any other way?
I have learned my lesson: next time I will probably save the command used to create the data along with the original data, or put one dataset per dump file (or R image file).
Thank you
Use save() and load() rather than dump() and source().
save() writes out a binary representation of the data to an .Rdata file, which can then be loaded back in using load().
dump() converts everything to a text representation, which source() then has to reconvert back to binary. Both ends of that process are very inefficient.
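For example (a minimal sketch, assuming three data frames named df1, df2 and df3):
# Save several data frames to one binary .Rdata file
save(df1, df2, df3, file = "mydata.Rdata")
# Later, restore them all by name into the current session
load("mydata.Rdata")
# Alternatively, saveRDS()/readRDS() store one object per file,
# which makes it easy to load just the dataset you need
saveRDS(df1, "df1.rds")
df1 <- readRDS("df1.rds")
The saveRDS()/readRDS() route also answers the "load a particular dataset if I know its name" part: keep one object per file and read back only the one you want.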
Related
I have a huge dataset in SAS (about 14 GB) that I am trying to import into RStudio for some analysis. However, nothing I try works for the import. I tried 'haven', 'sas7bdat', and 'foreign', but nothing. I converted the file to csv (it became about 7 GB) and tried read.csv and fread, but again nothing.
RStudio takes a huge amount of time to process the file and at the end says that it doesn't work (something about not being able to allocate space for a vector of a certain size, around 60 Mb depending on the method used).
Does anyone have any ideas about how to solve this situation ?
What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB RAM
~65 GB of data to upload (about 5.6 billion cells)
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to read the text into R, and then I make a few row/col transformations (diffs, lags) and try to import.
The total R memory used is ~56 GB before trying any sort of "as.h2o", so the 128 GB allocated to h2o shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping ram up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
convert to "as.data.frame" before "as.h2o"
write to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
write to sqlite3: too many columns for a table, which is weird
checked drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working on this)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" on it. I'm up to 15 GB written.
Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53 PM until 2:44 PM to write the csv file. Importing it afterwards was substantially faster.
Think of as.h2o() as a convenience function that does these steps:
converts your R data to a data.frame, if not already one.
saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
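A rough sketch of that faster path (the file path and object name here are made up):
library(data.table)
library(h2o)
h2o.init()
# Write the R data to disk with a fast, multi-threaded writer
fwrite(my_dt, "/data/my_big_table.csv")  # my_dt is a hypothetical data.table
# Let the cluster read and parse the file itself (multi-threaded),
# instead of streaming it through the client with h2o.uploadFile()
hf <- h2o.importFile("/data/my_big_table.csv")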
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
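A sketch of that split-and-recombine idea (the file names are hypothetical; both files must contain the same rows in the same order):
# The ~100 columns you actually touch in R live in one file
needed <- h2o.importFile("cols_needed_in_r.csv")
# The remaining ~2200 columns stay in their own file
rest <- h2o.importFile("cols_untouched.csv")
# Recombine column-wise inside H2O
full <- h2o.cbind(needed, rest)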
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
I have a large .rds file saved, and I am trying to import the .rds file directly into an H2O frame using some functionality, because it is not feasible for me to read that file into the R environment and then use the as.h2o() function to convert it.
I am looking for some fast and efficient way to deal with it.
My attempts:
I have tried reading the file and then converting it into an H2O frame, but it is a very time-consuming process.
I tried saving the file in .csv format and using h2o.import() with parse=T.
Due to memory constraints I was not able to save the complete data frame.
Please suggest an efficient way to do it.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
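A minimal sketch of that setup (big_df is a hypothetical data.frame already in memory):
library(data.table)  # must be installed for the fast path
library(h2o)
h2o.init()
options("h2o.use.data.table" = TRUE)  # as.h2o() will use data.table::fwrite() under the hood
hf <- as.h2o(big_df)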
There is another option, which is effectively the same thing, except that your RAM doesn't need to hold two copies of the data at once (one in R and one in H2O), so it may be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
I'm writing a dataset to file in ERMapper format (.ers) using the Raster package in R, but I'm having issues with the resulting .aux.xml auxiliary file (which I'm actually not interested in).
Simple example:
rst <- raster(ncols=15000,nrows=10000)
rst[] <- 1.234
writeRaster(rst, filename='_test.ers', overwrite=TRUE)
The writeRaster() line takes some time to execute, the data file is quite large, about 1.2GB on disk.
When checking what's happening while writeRaster() is executed, I find that the .ers file (header file + associated data file) is typically generated in about 20 sec. Then, it takes writeRaster() another 20 - 25 sec to generate the .aux.xml file, which only contains statistics such as min, max, mean, and st. dev. (which likely explains why it takes so long to compute).
Since I don't care about the .aux.xml file, I would like writeRaster() to not bother with it at all, and save me 20 - 25 sec of exec time (I'm writing lots of these datasets to disk so a 50% speedup in my code is quite substantial).
Does anyone have any idea how to tell writeRaster() not to create an .aux.xml file? I suspect it's a GDAL-related issue, but I haven't been able to find an answer yet despite much research...
Any help most welcome!
Options related to the GDAL file format drivers can be set using the (not so easy to find) rgdal::setCPLConfigOption function.
In your case,
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
should disable the xml file creation.
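Putting it together with the example from the question (a quick sketch; set the option before calling writeRaster()):
library(raster)
library(rgdal)
# Disable GDAL's Persistent Auxiliary Metadata, which is what produces the .aux.xml file
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
rst <- raster(ncols=15000, nrows=10000)
rst[] <- 1.234
writeRaster(rst, filename='_test.ers', overwrite=TRUE)  # no .aux.xml should be written now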
HTH
I am working with a 250 by 250 matrix. However, it takes a very long time to compute: at least an hour.
Is it possible to store this matrix in R, such that every time I open up R, it is already there?
Ideally, I would like to know whether it is possible to run a job in the background in R, so that I don't have to wait an hour to get the matrix out and be able to play around with it.
1) You can save the R workspace when closing R. Usually R asks "Save workspace image?" when you are closing it. If you answer "Yes" it will save the workspace in a file named ".RData" and will load it when starting a new R instance.
2) The better (safer) option is to save the matrix explicitly. There are several ways to do this. One option is to save it as an .Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
if you are on the same working directory.
3) There is no built-in option for background computing in R, but you can open several R instances: do the computation in one instance, and do something else in another.
What would help is to write the matrix to a file once you have computed it, and then read that file every time you open R. Write yourself a computeMatrix() function or script to produce a file with the matrix stored in a sensible format. Also write yourself a loadMatrix() function or script to read in that file and load the matrix into memory for use, then call or run loadMatrix() every time you start R and want to use the matrix.
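For instance, a bare-bones version of those two helpers (the expensive step and file name are placeholders):
# Hypothetical helper: run the slow computation once and cache the result on disk
computeMatrix <- function(path = "matrix.rds") {
  m <- my_long_computation()  # placeholder for your hour-long calculation
  saveRDS(m, path)            # compact binary file, fast to read back
  invisible(m)
}
# Hypothetical helper: restore the cached matrix without recomputing it
loadMatrix <- function(path = "matrix.rds") {
  readRDS(path)
}
# Run computeMatrix() once, then just loadMatrix() in every later session
m <- loadMatrix()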
In terms of running an R job in the background, you can run an R script from the command line with the syntax "R CMD BATCH scriptName" with scriptName replaced by the name of your script.
It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on disk in an efficient manner; then, when you start a new R session, you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only that part will be loaded, so it will be much quicker. Even if you need the entire matrix loaded into memory, it should load faster than reading a text file.
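A rough sketch of the ff approach, assuming the matrix m already exists in memory (ffsave()/ffload() persist the file-backed object between sessions):
library(ff)
# Create a file-backed copy of the in-memory matrix
m_ff <- ff(initdata = m, dim = dim(m), vmode = "double")
# Persist the ff data file plus its R metadata so it survives the session
ffsave(m_ff, file = "matrix_ff")
# In a later R session: reattach without eagerly reading the whole matrix
library(ff)
ffload("matrix_ff")
m_ff[1:10, 1:10]  # only the requested block is pulled from disk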