R raster timeseries: what's the most efficient read and write?

I have the following problem/question:
I've written an R function that smooths values from a time series. The time series consists of a large number of single global raster files, so each pixel holds a series of n time steps (generally more than 500). Even though I have plenty of RAM, I have to rely on blockwise processing because loading the entire dataset at once is just too much. So far so good.
I've written (IMHO) fairly decent code, which leverages parallel processing when possible. I have a processing machine which should be more than well equipped to handle this amount of data and computation. This leads me to believe that most of the time will be spent reading lots of values from the disk and then, after smoothing, writing lots of values to the disk.
So I've tried running the code with the files being on either a normal HDD or a normal SSD.
Against my expectations, it didn't really matter much.
Then I tried running a test function which reads a file, gets the values and writes them back to disk with the raster being on either the HDD, the SSD or a blazing fast SSD. Again, no significant difference.
I've already done a fair share of profiling to find bottlenecks, and spent a good amount of time googling for efficient solutions. There are bits of info here and there, but I decided to post this question here to get a definitive answer and maybe some pointers for me and others on how to manage things efficiently.
So without further ado (and for people who skipped the above), here's my question:
In a setting as described above (high data volume, blockwise processing, reading and writing from/to disk), what's the most efficient (and/or fastest) way to do computation on a long raster time series which involves reading and writing values from/to disk? (especially regarding the read write aspect)
Assuming I have a fast SSD, how can I leverage the speed? Is it done automatically?
What are the influencing factors (filesize, filetype, caching) and the most efficient setting of these factors?
I know that in terms of raster, R works the fastest with .grd, but I would like to avoid this format for flexibility, compatibility and disk space reasons.
Maybe I also have a misconception of how the raster package interacts with the files on disk. In that case, should I use different functions than getValues and writeValues?
-- Some system info and example code: --
OS: Win7 x64
CPU: Xeon E5-1650 @ 3.5 GHz
RAM: 128 GB
R-version: 3.2
Raster file format: .rst
Read/write benchmark function:
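library(raster)

benchfun <- function(x) {
  # x ... path to a raster file
  xr <- raster(x)              # open the file
  x2 <- raster(xr)             # empty raster with the same geometry
  xval <- getValues(xr)        # read all cell values into memory
  x2 <- setValues(x2, xval)    # copy the values into the new raster
  writeRaster(x2, 'testras.tif', overwrite = TRUE)
}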
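One way to check whether disk I/O really dominates is to time the read, compute and write phases separately inside the block loop. The following is only a minimal base R sketch under stated assumptions: a flat binary file of doubles stands in for a raster layer, sqrt() is a placeholder for the smoother, and n_cells and block_size are hypothetical values. It does not use the raster package.

```r
# Stand-in for one raster layer: a flat binary file of doubles (assumption).
n_cells <- 1e5
block_size <- 2e4                      # cells per block (hypothetical value)
in_file <- tempfile(fileext = ".bin")
out_file <- tempfile(fileext = ".bin")
writeBin(runif(n_cells), in_file)

# Accumulated elapsed seconds per phase.
timings <- c(read = 0, compute = 0, write = 0)

in_con <- file(in_file, "rb")
out_con <- file(out_file, "wb")
cells_done <- 0
while (cells_done < n_cells) {
  n_now <- min(block_size, n_cells - cells_done)
  # Read one block from disk.
  timings["read"] <- timings["read"] +
    system.time(vals <- readBin(in_con, what = "double", n = n_now))[["elapsed"]]
  # Placeholder computation standing in for the smoothing step.
  timings["compute"] <- timings["compute"] +
    system.time(res <- sqrt(vals))[["elapsed"]]
  # Write the processed block back to disk.
  timings["write"] <- timings["write"] +
    system.time(writeBin(res, out_con))[["elapsed"]]
  cells_done <- cells_done + n_now
}
close(in_con)
close(out_con)

print(timings)
```

If the read and write entries stay small compared to compute, a faster disk is unlikely to change the total run time much, which would be consistent with the HDD/SSD observation above.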
If needed I can also provide a little example code for the time series processing, but for now I don't think it's needed.
Appreciate all tips!
Thanks,
Val

Related

vector memory exhausted (R) workaround?

I tried what turned out to be a quite memory-intensive operation in R (writing an xlsx file from a dataset with 500k observations and 2000 variables).
I tried the method explained here. (First comment)
I set the max VSIZE to 10 GB because I was afraid that a higher value might damage my computer (I saved money for it for a long time :)), and it still did not work.
I then looked up cloud computing with R, which I found to be quite difficult as well.
So finally, I wanted to ask here whether anyone could tell me how high I can set the VSIZE without damaging my computer, or whether there is another way to solve my problem. (The goal is to convert a SAS file to an xlsx or xls file. The files are between 1.4 GB and 1.6 GB. I have about 8 GB of RAM.) I am open to downloading programs if that's not too complicated.
Cheers.

R - Creating new file takes up too much memory

I'm relatively new to R and not very good at it, and I am trying to do something that is giving me trouble.
I have several large SpatialPolygonsDataFrame objects that I am trying to combine into one. There are 7 of them and they total about 5 GB. My Mac only has 8 GB of RAM.
When I try to create the aggregate SpatialPolygonsDataFrame, R takes an incredibly long time to run and I have to quit. I presume it is because I do not have sufficient RAM.
My code is simple: aggregate <- rbind(file1, file2, ....). Is there a smarter/better way to do this?
Thank you.
I would disagree: a major component of reading large datasets isn't RAM capacity (although I would suggest that you upgrade if you can) but rather read/write speed. In terms of hardware, an HDD at 7200 RPM is substantially slower than an SSD. If you are able to install an SSD and use it for your working directory, I would recommend it.

What are the minimum system requirements for analysing large datasets (30gb) in R?

I tried running the Apriori algorithm on a 30 GB CSV file in which each row is a basket with up to 34 items (columns). RStudio died just after execution started. I want to know the minimum system requirements: how much RAM and what CPU configuration do I need to run algorithms on large datasets?
This question cannot be answered as such. It highly depends on what you want to do with the data.
Example
If you are able to process all lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, and another few GB to store the output (I believe this is even less intense than the most extreme use of Apriori).
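The first case can be sketched in base R by reading the file through a connection in fixed-size chunks, so that only one chunk is in memory at a time. This is only an illustration: count_lines, chunk_lines and the small demo CSV are hypothetical and stand in for the real 30 GB file.

```r
# Hypothetical demo input: a small CSV written to a temp file.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", sprintf("%d,%d,%d", 1:2500, 1:2500, 1:2500)), csv_file)

# Count lines while holding at most chunk_lines lines in memory.
count_lines <- function(path, chunk_lines = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_lines)
    if (length(chunk) == 0L) break   # end of file reached
    total <- total + length(chunk)
  }
  total
}

n_lines <- count_lines(csv_file)
print(n_lines)
```

The same loop shape works for any per-row computation that only needs a running result, such as counting item frequencies.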
Conclusion
As such I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage, as you increase the data size (or other parameters) and extrapolate your results to see what you probably need.
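A rough sketch of that extrapolation in base R, assuming memory grows roughly linearly with the number of rows: it measures only the storage size of stand-in integer matrices with object.size(), not the working memory of the algorithm itself, and sizes and full_rows are made-up values.

```r
sizes <- c(1e3, 1e4, 1e5)   # rows in each test subset (hypothetical)
n_cols <- 34                # up to 34 items per basket, as in the question

# Storage size in bytes of a stand-in integer matrix for each subset size.
mem_bytes <- sapply(sizes, function(n) {
  subset_data <- matrix(0L, nrow = n, ncol = n_cols)
  as.numeric(object.size(subset_data))
})

# Fit a straight line and extrapolate to the assumed full row count.
fit <- lm(mem_bytes ~ sizes)
full_rows <- 5e7            # assumed number of rows in the full file
predicted_gb <- predict(fit, newdata = data.frame(sizes = full_rows)) / 1024^3
print(predicted_gb)
```

For a real run, the measured quantity would be the peak memory of the actual analysis on each subset, and the growth may well not be linear.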

Handling huge simulations in R

I have written an R program that generates a random vector of length 1 million. I need to simulate it 1 million times. Out of the 1 million simulations, I will be using 50K observed vectors (chosen in some random manner) as samples. So the sample size is 50K x 1M. Is there a way to deal with this in R?
There are a few problems and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the RAM. I looked into packages like bigmemory, ffbase etc. that use hard disk space, but data this large can reach TB in size, and I have 200 GB of hard disk available on my machine.
Even if storing it is possible, there is the problem of running time. The code may take more than 100 hours to run!
Can anyone please suggest a way out? Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging storage memory and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS instance on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled and with a high RAM multicore instance you can "spin up" (start) and down (stop) quickly and cheaply
Use sampling and statistical solutions when possible
Consider doing some of your simulations or pre-simulation steps in a relational database, or something like Spark + Scala; some have R integration nowadays, actually
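As an illustration of the chunking and sampling points above, the simulation can be run one vector at a time, keeping only a running summary of the sampled runs instead of storing the full matrix. This is a sketch under assumptions: the sizes are scaled down drastically, rnorm() is a placeholder for the real simulation, and it assumes the sampled runs can be chosen before simulating and that a running mean is the quantity of interest.

```r
set.seed(1)
n_sim <- 1000     # scaled down from 1 million (assumption for the demo)
vec_len <- 100    # scaled down from 1 million
n_keep <- 50      # scaled down from 50K

# Choose in advance which simulation runs belong to the sample.
keep_idx <- sort(sample.int(n_sim, n_keep))

running_sum <- numeric(vec_len)
kept <- 0
for (i in seq_len(n_sim)) {
  v <- rnorm(vec_len)               # placeholder for the real simulation
  if (i %in% keep_idx) {
    running_sum <- running_sum + v  # update the summary, then discard v
    kept <- kept + 1
  }
}
sample_mean <- running_sum / kept
```

Memory use stays at a few vectors of length vec_len regardless of n_sim; the running time problem remains and would need the parallel options listed above.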

restriction on the size of excel file

I need to use R to open an Excel file, which can have 1000~10000 rows and 5000~20000 columns. Is there any restriction in R on the size of this kind of Excel file?
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM to simply load the data set into memory is often a very different thing than having enough RAM to manipulate the data set, which by the very nature of R will often involve a lot of copying of objects. And this in turn leads to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying...
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have been recently importing some large excel files too. I highly recommend trying out the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the available memory for JVM to 1GB or more.
# This option should be always set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")
