R - Changing specific cell values in a large raster layer - r

I am working with R "raster" package and have a large raster layer (62460098 cells, 12 Mb for the object). My cell values range from -1 to 1. I have to replace all negative values with a 0 (example: a cell that has -1 as value has to become a 0). I tried to do this:
raster[raster < 0] <- 0
But it keeps overloading my RAM because of the raster size.
OS: Windows 7 64-bits
RAM size: 8GB

You can do
r <- reclassify(raster, c(-Inf, 0, 0))
This will work on rasters of any size (no memory limitation)
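A small check of what this call does, assuming the raster package is installed. The toy layer `r` and its values are made up for illustration. The vector c(-Inf, 0, 0) is read as one row of a from/to/becomes matrix, so every value in (-Inf, 0] becomes 0:

```r
library(raster)

# Toy 10 x 10 layer with values between -1 and 1 (illustrative data only)
r <- raster(nrows = 10, ncols = 10)
values(r) <- seq(-1, 1, length.out = ncell(r))

# One reclassification row: from = -Inf, to = 0, becomes = 0
r0 <- reclassify(r, c(-Inf, 0, 0))

# Negative cells are now 0; positive cells keep their values
range(values(r0))
```

For a layer too large for RAM you would probably also pass a filename argument so the result is written straight to a file of your choosing.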

Several existing posts discuss memory issues; it's not clear whether you have tried any of their suggestions, but you should. The physical constraints are also unclear, so you should edit your question to include the size of the machine and the name of the OS being tortured. I don't know how to construct a toybox that lets me do any testing, but one approach that might not blow up RAM use (as much) would be to first construct a set of indices marking the locations to be "zeroed":
idxs <- which(raster <0, arr.ind=TRUE)
gc() # may not be necessary
Then incrementally replace some fraction of locations, say a quarter or a tenth at a time.
raster[ idxs[ 1:(nrow(idxs)/10), ] ] <- 0
The likely problem with any of this is that R's approach to replacement is not "in place" but rather involves the creation of a temporary copy of the objects which is then reassigned to the original. Good Luck.
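A different way to get the same "a piece at a time" effect is the block-processing pattern in the raster package (blockSize, writeStart, writeValues, writeStop). This is a swapped-in technique, not the index-based idea above, and is only a minimal sketch: the toy layer and the output path `out_file` are assumptions made for illustration.

```r
library(raster)

# Toy layer standing in for the large raster (illustrative data only)
r <- raster(nrows = 20, ncols = 20)
values(r) <- seq(-1, 1, length.out = ncell(r))

out_file <- tempfile(fileext = ".grd")   # hypothetical output path
out <- raster(r)                         # empty layer with the same geometry
bs  <- blockSize(r)                      # suggested blocks of rows

out <- writeStart(out, filename = out_file, overwrite = TRUE)
for (i in seq_len(bs$n)) {
  # Read one block of rows, fix it, and write it out again
  v <- getValues(r, row = bs$row[i], nrows = bs$nrows[i])
  v[!is.na(v) & v < 0] <- 0              # zero out negatives in this block only
  out <- writeValues(out, v, bs$row[i])
}
out <- writeStop(out)
```

Only one block of values is held in memory at a time, so peak RAM use depends on the block size rather than on the size of the whole layer.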


Why does my computer's memory rapidly disappear when I try to process rasters?

I am working with a set of 13 .tif raster files, 116.7 MB each, containing data on mangrove forest distributions in West Africa. Each file holds the distribution for one year (2000-2012). The rasters load into R without any problems and plot relatively easily as well, taking ~20 seconds using base plot() and ~30 seconds using ggplot().
I am running into problems when I try to do any sort of processing or analysis of the rasters. I am trying to do simple raster math, subtracting the 2000 mangrove distribution raster from the 2012 raster to show deforestation hotspots, but as soon as I do, the memory on my computer starts rapidly disappearing.
I have 48GB of drive space free, but when I start running the raster math, I start to lose a GB of storage every few seconds. This continues until my storage is almost empty, I get a notification from my computer that my storage is critically low, and I have to stop R from running. I am running on a MacBook Pro 121GB storage 8GB ram Big Sur 11.0.1. Does anyone know what could be causing this?
Here's my code:
#import cropped rasters
crop2000 <- raster("cropped2000.tif")
crop2001 <- raster("cropped2001.tif")
crop2002 <- raster("cropped2002.tif")
crop2003 <- raster("cropped2003.tif")
crop2004 <- raster("cropped2004.tif")
crop2005 <- raster("cropped2005.tif")
crop2006 <- raster("cropped2006.tif")
crop2007 <- raster("cropped2007.tif")
crop2008 <- raster("cropped2008.tif")
crop2009 <- raster("cropped2009.tif")
crop2010 <- raster("cropped2010.tif")
crop2011 <- raster("cropped2011.tif")
crop2012 <- raster("cropped2012.tif")
#look at 2000 distribution
#look at 2012 distribution
#subtract 2000 from 2012 to look at change
chg00_12 <- crop2012 - crop2000
If you work with large datasets that cannot all be kept in RAM, raster will save them to temporary files. This can be especially demanding with raster math, as each step creates a new file. For example, with a Raster* object x,
y <- 3 * (x + 2) - 5
would create three temp files. First for (x+2), then for *3, and then for -5. You can avoid that by using functions like calc and overlay
y <- raster::calc(x, function(i) 3 * (i + 2) - 5)
That would only create one temp file. Or none if you provide a filename (which makes it also easier to delete), and perhaps use compression (see ?writeRaster).
Also see ?raster::removeTmpFiles
You can also increase the amount of RAM that raster is allowed to use. See ?raster::rasterOptions.
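Applied to the subtraction in the question, this might look like the sketch below. It assumes the raster package is installed. The two toy layers stand in for crop2000 and crop2012, and `out_file` is a hypothetical output path.

```r
library(raster)

# Two toy layers standing in for crop2000 and crop2012 (illustrative data only)
a <- raster(nrows = 10, ncols = 10)
values(a) <- 1
b <- raster(nrows = 10, ncols = 10)
values(b) <- rep(c(0, 1), length.out = ncell(b))

out_file <- tempfile(fileext = ".grd")   # hypothetical output path

# One pass over the data, written to a file that you control
chg <- overlay(b, a, fun = function(later, earlier) later - earlier,
               filename = out_file, overwrite = TRUE)

# Remove raster temp files older than 24 hours left by earlier operations
removeTmpFiles(h = 24)
```

With real GeoTIFF output you would use a .tif filename and could pass compression options through to writeRaster; see ?writeRaster for what your installation supports.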

Generating 3.000.000 strings of length 11 in R

Apparently if I try this:
# first grab the package
# and then try to generate some serious dummy data
my_try <- as.vector(sample(1111111111:99999999999,3000000,replace=T))
R will say NOPE, sorry:
Error: cannot allocate vector of size 736.8 Gb
Should I buy more RAM*?
*this is a joke, but I seriously appreciate any help!
The desired output is a dataframe of 20 variables, and 3x10^6 rows. Some columns/variables should be strings, some integers. All in lengths ranging from 2 to 12.
The error isn't coming from sampling 3 million values; it comes from trying to create the population 1111111111:99999999999 to sample from, which is roughly 99 billion values (hence the 736.8 Gb). Assuming you want 11-digit values, i.e. the range 11111111111:99999999999, sample from 1:88888888889 and add 11111111110 using
sample(88888888889, 3000000,replace=TRUE) + 11111111110
There's no need for as.vector at the end, it's already a vector.
P.S. I believe in R-devel the range 1111111111:99999999999 will be stored much more efficiently (basically just the limits), but I don't know if sample() will be modified to work with it that way.
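To turn those numbers into strings of length 11, something like the following sketch should work. It uses a small `n` so it runs quickly, and the names `ids` and `id_strings` are made up for illustration. The values are stored as doubles, so they are formatted with sprintf() to avoid scientific notation.

```r
set.seed(42)
n <- 1000   # use 3e6 for the full data set

# Sample 11-digit numbers without ever building the full range
ids <- sample(88888888889, n, replace = TRUE) + 11111111110

# Format as plain digit strings (no scientific notation, no decimals)
id_strings <- sprintf("%.0f", ids)

head(id_strings)
```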

Run the script using hard drive

Given a set of n inputs, I want to generate all permutations of 0's and 1's (essentially the input matrix for a truth table). In order to do so, I am using the permutations command (using the gtools package) in R, as follows:
> permutations(2,n,v=c(0,1),repeats.allowed=TRUE)
where n is the number of inputs.
However, given sufficiently large number of n (let's say 26), the size of the variable becomes very high (if n=26, the variable would be approx. 13GB in size). Given this, I wanted to know if there is any way (in R) of using the hard disk instead of creating the variable on the RAM? (I might actually have to run this with n = 86 which would be an impossible thing to do on the RAM).
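One possible direction is to never store the table at all and instead compute rows on demand, since row i of the table is just the n-bit binary form of i. This is a minimal sketch. The helper `truth_rows` is hypothetical and not part of gtools, and its double-based arithmetic is only exact for n up to about 52. For n = 86 no storage, RAM or disk, could hold all 2^86 rows in any case.

```r
# Rows are numbered from 0 to 2^n - 1; row i is the n-bit binary form of i,
# most significant bit first.
truth_rows <- function(idx, n) {
  place_values <- 2^((n - 1):0)
  outer(idx, place_values, function(i, p) (i %/% p) %% 2)
}

n <- 26
chunk_size <- 4   # tiny here; use something larger in practice

# Only the requested rows are ever held in memory
first_chunk <- truth_rows(0:(chunk_size - 1), n)
last_row    <- truth_rows(2^n - 1, n)
```

Each chunk can be processed and discarded, or appended to a file, before the next one is generated.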

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why is raster::sampleRandom taking so much time? e.g. to extract 3k cells from 30k cells (over 10k timesteps). Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider a R script in which I have to read a big file (usually more than 2-3GB) and perform quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64bit GNU/Linux with 4GB of RAM, 3.5GB available most of the time.
As the files are often too big to fit into memory (even 2GB files for some reason will NOT fit into 3GB of available memory: unable to allocate vector of size 2GB) I cannot always do this, which is what I would do if I had 16GB of RAM:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
But instead I can sample a smaller number of cells in my files using the function sampleRandom() from the raster package, still getting good statistics.
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
files are on three different nfs filesystem, whose performance I'm not sure of. I cannot rule out the fact that the nfs filesystems can greatly vary in performance from one moment to the other.
RAM usage is 100% all of the time when the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
When using the same commands on files with 8393 cells (about 340MB in total) and reading all the cells, the computing time is a fraction of trying to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells:
5 minutes
45 m
30 m
30 m
20 m
20 m
If I read 3000 cells:
15 minutes
18 m
35 m
34 m
60 m
60 m
If I try to read 5000 cells:
2.5 h
22 h
for the remaining files (3 to 6) I had to stop after 18 h because I needed the workstation for other tasks
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
Working workaround is to do:
cells <- sampleRandom(pr[[1]], cnsample, cells=T) #Extract cnsample random cells from the first layer, excluding NAs
prvals <- pr[cells[,1]] #Read those cells from all layers
qs <- quantile(prvals, probs=qprobs, na.rm=T, type=8, names=F) #Compute quantile
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.

How to compute the size of the allocated memory for a general type

I need to work with some databases read with read.table from csv (comma separated values ), and I wish to know how to compute the size of the allocated memory for each type of variable.
How to do it ?
edit -- in other words : how much memory R allocs for a general data frame read from a .csv file ?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
This script might also be helpful- it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes, the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages- it is a numeric vector of length 1). But that doesn't hurt performance, because the overhead does not increase with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
Thus, summary:
For an integer vector of length n (such as 1:n), the size in bytes is typically 40 + 8 * ceiling(n / 2). However, on my version of R and OS there is a single slight discontinuity, where it jumps to 168 bytes faster than you would expect (see plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))
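For the original question about a data frame read from a .csv file, the same function can be applied to the whole data frame and to each column. This is a minimal sketch with made-up data. The file `csv_file` and the columns `id`, `score` and `group` are assumptions for illustration, and exact byte counts will vary with R version and platform.

```r
# Write a small CSV to a temporary file (illustrative data only)
csv_file <- tempfile(fileext = ".csv")
toy <- data.frame(id    = 1:1000,
                  score = rnorm(1000),
                  group = sample(c("a", "b", "c"), 1000, replace = TRUE))
write.csv(toy, csv_file, row.names = FALSE)

df <- read.csv(csv_file, stringsAsFactors = TRUE)

# Total size of the data frame, and the size of each column in bytes
total_size <- object.size(df)
col_sizes  <- sapply(df, object.size)

format(total_size, units = "Kb")
col_sizes
```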
