I have a RasterStack s1 consisting of 400 layers with data from an island. The extent of the raster is cropped to the extent of the island, but due to its irregular shape, only around 20% of the pixels are actually land area and have data values; the other 80% are water and NA.
I also have a land-water-mask lwm (RasterLayer), where land is coded as 1 and water as NA.
I would like to do different kinds of cell-based calculations on s1, but noticed that these take a long time to finish. To speed things up, the calculations should only be carried out for cells that are land area, whereas water areas should always be NA. In pseudo-code:
for each cell:
    if cell is land
        do calculation
    if cell is water
        return(NA)
A further requirement is memory-safety.
Here is some sample data to illustrate the problem:
library(raster)
# generate data
lwm <- raster(nrow = 5, ncol = 5)
lwm[] <- c(rep(NA, 10), rep(1, 5), rep(NA, 10))
r1 <- raster(nrow = 5, ncol = 5)
r1[] <- runif(ncell(r1)) * 10
r2 <- raster(nrow = 5, ncol = 5)
r2[] <- runif(ncell(r2)) * 10
s1 <- stack(r1, r2)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
calc(s1, function(x) {sum(!is.na(x))})
After some playing around, I finally found a solution which works very well in my case, and which shaves a nice amount off the overall processing time:
library(raster)
# generate data
lwm <- raster(nrow = 50, ncol = 50)
lwm[] <- 1
# replace 80% with NA values
lwm[sample(1:ncell(lwm), round(0.8 * ncell(lwm)))] <- NA
r1 <- raster(lwm)
r1[] <- runif(ncell(r1))
r1_list <- replicate(400, r1)
s1 <- stack(r1_list)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
system.time(r_sum1 <- calc(s1, function(x) {sum(x)}))
# user  system elapsed
# 0.14    0.00    0.14
## new approach:
# stack land-water-mask with RasterStack
s1_lwm <- stack(lwm, s1)
# function to check whether the first element of the vector (the mask) is NA;
# if it is, return NA; otherwise drop it and do the calculation
fun1 <- function(y) {
  if (!is.na(y[1])) {
    return(sum(y[-1]))
  } else {
    return(NA)
  }
}
system.time(
r_sum2 <- calc(s1_lwm, fun = fun1)
)
# user system elapsed
# 0.4 0.0 0.4
# results are identical
identical(r_sum1[], r_sum2[])
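If you need this pattern for several different calculations, one option (a small sketch of my own, not tested beyond the example above) is to wrap any cell-level function in a factory that handles the mask layer, so the NA check is not rewritten each time:

# Hypothetical helper: turn any cell function into a mask-aware one.
# It assumes the mask (1 = land, NA = water) is stacked as the FIRST layer.
mask_aware <- function(f) {
  function(y) {
    if (is.na(y[1])) return(NA)  # water: skip the calculation entirely
    f(y[-1])                     # land: drop the mask value, then calculate
  }
}
r_sum3  <- calc(s1_lwm, fun = mask_aware(sum))
r_mean1 <- calc(s1_lwm, fun = mask_aware(mean))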
That's a tricky one, and unfortunately I don't have a straightforward solution for you.
You can either do multiple crops of the island (i.e. 2-3) to minimise NA values, do the calculations separately on each cropped raster, and mosaic the results.
Or another option is to do a parallel calculation, which will speed up the process significantly:
#initialize cluster
#number of cores to use for clusterR function (max recommended: ncores - 1)
beginCluster(3)
#calculation
result <- clusterR(s1, calc, args=list(fun=function(x) {sum(!is.na(x))}))
#end cluster
endCluster()
Since you asked for a memory-safe solution, you should look at how much RAM is allocated when you use only a single core, and then estimate how many cores you can run the calculation on without running out of memory.
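As a rough starting point (my own sketch, not a guaranteed recipe), raster's canProcessInMemory() can tell you whether the stack fits in RAM before you commit to a core count; the n = 2 copies per worker below is an assumption you should adjust for your data:

library(raster)
ncores <- 3
# guess: each worker needs roughly one copy of the values plus one of the result
if (!canProcessInMemory(s1, n = 2 * ncores)) {
  ncores <- 1  # fall back and let raster process the stack in chunks
}
beginCluster(ncores)
result <- clusterR(s1, calc, args = list(fun = function(x) sum(!is.na(x))))
endCluster()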
Good luck! Hope it helps.
I've been trying to calculate the number of pixels of three raster files that satisfy a certain condition with R (in Windows 10 and R version 3.6.3). The idea is to use these three images (which are pre-processed) and count the number of pixels that satisfy the condition applied inside the following code.
library(rgdal)
library(raster)
mydir <- file.path("D:/files")
setwd(mydir)
listado1 <- list.files(pattern = "crNDVI*")
listado2 <- list.files(pattern = "crNBR*")
listado3 <- list.files(pattern = "dNBR*")
df <- data.frame(NULL)
for (i in 1:length(listado1)) {
  r1 <- raster(listado1[i])
  v1 <- values(r1)
  for (j in 1:length(listado2)) {
    r2 <- raster(listado2[j])
    v2 <- values(r2)
    for (k in 1:length(listado3)) {
      r3 <- raster(listado3[k])
      v3 <- values(r3)
      if ((i == j) & (j == k)) {
        df2 <- data.frame(v1, v2, v3)
        burn <- subset(df2, df2$v1 >= 0.3 & df2$v1 <= 1.0 &
                            df2$v2 >= 0.2 & df2$v2 <= 1.3 &
                            df2$v3 > 0.1 & df2$v3 <= 1.3)
        burned_area <- nrow(burn) * (xres(r1) * yres(r1)) * 10^(-4) # Hectares
        name <- paste(substr(listado1[i], 8, 15), sep = "")
        df <- rbind(df, data.frame(name, burned_area))
        print(paste("iteracion:", i, j, k, " ok", sep = " "))
      } else {
        print(paste("iteracion:", i, j, k, " passed", sep = " "))
        invisible()
      }
    }
  }
}
write.csv(df, file = "burned_area.csv")
As you can see, I have three types of pre-processed raster (*.tif) files, starting with crNDVI, crNBR and dNBR; all of them are listed in the corresponding "listado" variables. The idea is to iterate over each type of file and obtain the values of each raster file per day. The code does what it should (I get the .csv file with the burned area, i.e. the pixels that satisfy the condition). However, it takes forever to compute, because each raster file corresponds to a specific day, so burned_area is one value per day, and that only happens when i=j=k. With the code above, though, the loops iterate over every possible combination: it runs from i=1, then j=1, then k=1,2,3... until the last file of listado3, then moves on to j+1 and loops over k all over again, wasting every one of those unnecessary iterations.
Can anyone help me do the same in a more efficient way? Is there any possibility to force "only i=j=k" for the whole set of nested for loops? I'd be very grateful for any advice. Thanks in advance, Jorge.
Note: all the raster files have the same extent, nrow, ncol and ncell.
What is the need for the nested loops? It appears that if you use a single loop, then i=j=k will always be true.
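For instance, a single-loop version of your code might look like this (a sketch, assuming the three file lists are sorted so that the i-th file of each list corresponds to the same day):

library(raster)
df <- data.frame(NULL)
for (i in seq_along(listado1)) {
  r1 <- raster(listado1[i])
  v1 <- values(r1)
  v2 <- values(raster(listado2[i]))
  v3 <- values(raster(listado3[i]))
  # logical vector: TRUE where a pixel satisfies all three conditions
  keep <- v1 >= 0.3 & v1 <= 1.0 &
          v2 >= 0.2 & v2 <= 1.3 &
          v3 > 0.1 & v3 <= 1.3
  burned_area <- sum(keep, na.rm = TRUE) * xres(r1) * yres(r1) * 1e-4 # ha
  name <- substr(listado1[i], 8, 15)
  df <- rbind(df, data.frame(name, burned_area))
}
write.csv(df, file = "burned_area.csv")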
Here is a minimal self-contained example.
library(raster)
set.seed(0)
r1 <- raster(nrow=10, ncol=10, vals=1:100) / 100
r2 <- raster(nrow=10, ncol=10, vals=c(1:50, 50:1)) / 100
r3 <- raster(nrow=10, ncol=10, vals=c(rep(1, 10), 11:100)) / 100
x <- ((r1==r2) & (r2==r3))
s <- stack(r1, r2, r3)
s <- mask(s, x, maskvalue=1)
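From there, counting the cells that meet the condition (the question's actual goal) is just summing the logical layer; a small sketch (the hectare conversion assumes a projected CRS with x/y resolution in metres):

# x is 1 where the condition holds and 0 elsewhere, so the sum is the count
n_cells <- cellStats(x, sum)
area_ha <- n_cells * xres(r1) * yres(r1) * 1e-4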
I am working on spike trains, and my code to generate a spike train like the one in the image (which is representative of 5 trials) for 20 trials is written below.
fr = 100       # firing rate (Hz)
dt = 1/1000    # time step: 1 ms, expressed in seconds
duration = 2   # duration in s
nBins = 2000   # number of bins (duration / dt)
nTrials = 20   # number of simulations
MyPoissonSpikeTrain = function(fr = 100) {
  p = runif(nBins)
  q = ifelse(p < fr * dt, 1, 0)
  return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x = -1, y = -1, xlab = "time (s)", ylab = "Trial",
     main = "Spike trains",
     ylim = c(0.5, nTrials + 1), xlim = c(0, duration))
for (i in 1:nTrials) {
  clip(x1 = 0, x2 = duration, y1 = (i - 0.2), y2 = (i + 0.4))
  abline(h = i, lwd = 1/4)
  abline(v = dt * which(SpikeMat[i, ] == 1))
}
Each trial has spikes occurring at random time points. What I am working towards now is drawing one random sample time point shared by all 20 trials, and getting, for each trial, the length of the inter-spike interval this point falls into. The code to get the time vector of the points where the spikes occur is:
# interspike-interval vector for trial i
ISI <- function(i) {
  spike_times <- dt * which(SpikeMat[i, ] == 1)
  return(diff(spike_times))
}
Then you call ISI(i) for whichever trial you wish to see the Interspike interval vector for. A visual representation of what I want is:
I want to get a vector that has the lengths of the intervals this point falls into, for each trial. I also want to figure out its distribution, but that's for later. Can anybody help me figure out how to code my way to this? Any help is appreciated, even if it's just a pointer on how to start or where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) {
  idx <- which(SpikeMat[i, ] == 1)
  setNames(idx, seq_along(idx))
})
Grab random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains the distances to the 2 nearest spikes - each element of the list is a named vector where the values are the distances (to RT) and their names are their positions in the vector. nearest_columns shows the original timepoint (column number) of each spike in SpikeMat.
bookend_values <- function(vec) {
  # smallest positive difference = nearest spike before RT
  lower_val <- head(sort(vec[sign(vec) == 1]), 1)
  # smallest absolute negative difference = nearest spike after RT
  upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)
  return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
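And since you mentioned wanting the distribution of these interval lengths later, a quick sketch:

# length of the interval containing RT, one value per trial
isi_at_RT <- sapply(distances, sum)
hist(isi_at_RT, main = "ISI containing RT", xlab = "interval length (bins)")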
I have 2 large matrices (typically of dimensions 5000*40 and 20000*40). I am trying to create a correlation matrix, where I would like to calculate the correlation of each row of the first matrix with every row of the second matrix. I have the following minimal code, but it takes an extremely long time. Any recommendations to speed it up or parallelize it? Thanks
-Jaison
nprots <- 50 #usually ca. 5000
ngenes <- 1000 #usually ca. 20000
a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx)
apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> cor_mat)
Thanks Richard, I did see the post you referred to on SO, but passed it over without thinking too much about it, because it did not seem to pertain to my problem: I have 2 matrices across which I perform correlations. A little more thought made it occur to my dim brain that I can simply extend the solution you pointed out to suit my question: rbind the two matrices, follow the solution you referred to, and finally extract the relevant corner of the correlation matrix. (This works because centering each row and scaling it to unit norm reduces the Pearson correlation between two rows to a plain dot product, which tcrossprod computes for all row pairs at once.) This runs faster than my original code using apply and cor. I double-checked the answers and all seem ok. So below is my current solution.
fast_cor <- function(a, b) {
  mat <- rbind(a, b)
  # centre each row and scale it to unit norm, so that correlation
  # reduces to a dot product (note: this assumes no missing values)
  mat <- mat - rowMeans(mat)
  mat <- mat / sqrt(rowSums(mat^2))
  cr <- tcrossprod(mat)
  # keep only the a-vs-b corner of the full correlation matrix
  edge <- nrow(a)
  cr <- t(cr[1:edge, -c(1:edge)])
  return(cr)
}
nprots <- 50 #usually ca. 5000
ngenes <- 10000 #usually ca. 20000
a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx)
apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> c_1)
user system elapsed
20.48 0.00 20.48
system.time(c_2 <- fast_cor(a_mat, b_mat))
user system elapsed
1.97 0.11 2.08
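One caveat (my addition): the two methods agree only up to floating-point rounding, so identical() on the raw matrices may fail; compare with a tolerance instead:

# c_1 (apply/cor) and c_2 (fast_cor) should match to numerical precision
all.equal(c_1, c_2)   # TRUE
max(abs(c_1 - c_2))   # something on the order of 1e-15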
I am trying to develop a function to "synchronise" NAs among layers of a raster stack, i.e. to make sure that for any given pixel of the stack, if one layer has an NA, then all layers are set to NA for that pixel.
This is particularly useful when combining rasters from varying sources for species distribution modelling, because some models do not handle NAs properly.
I have found two ways to do this, but I find neither of them satisfactory. One of them uses the function getValues and thus is not usable for very large stacks or computers with low RAM. The other one is more memory-safe but is much slower. I am therefore here to ask if anyone has an idea to improve my attempts?
Here are the two possibilities:
Using getValues()
syncNA1 <- function(x) {
  val <- getValues(x)
  NA.pos <- unique(which(is.na(val), arr.ind = TRUE)[, 1])
  val[NA.pos, ] <- NA
  x <- setValues(x, val)
  return(x)
}
Using calc()
syncNA2 <- function(y) {
  calc(y, na.rm = TRUE, fun = function(x, na.rm = na.rm) {
    if (any(is.na(x))) {
      rep(NA, length(x))
    } else {
      x
    }
  })
}
Now a demonstration of their respective computing times for the same stack:
> system.time(
+ b1 <- syncNA1(a1)
+ )
user system elapsed
3.04 0.15 3.20
> system.time(
+ b2 <- syncNA2(a1)
+ )
user system elapsed
5.89 0.19 6.08
Many thanks for your help,
Boris
With a stack named "s", I would first use calc(s, fun = sum) to compute a mask layer that records the location of all cells with an NA value in at least one of the stack's layers. mask() will then allow you to apply that mask to every layer in the stack.
Here's an example:
library(raster)
## Construct reproducible data! (Here a raster stack with NA values in each layer)
m <- raster(ncol=10, nrow=10)
n <- raster(ncol=10, nrow=10)
m[] <- runif(ncell(m))
n[] <- runif(ncell(n)) * 10
m[m < 0.5] <- NA
n[n < 5] <- NA
s <- stack(m,n)
## Synchronize the NA values
s2 <- mask(s, calc(s,fun = sum))
## Check that it worked
plot(s2)
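Beyond eyeballing the plot, a programmatic check (my addition) is to compare the NA pattern across layers:

# every layer of s2 should now be NA in exactly the same cells
na_per_layer <- lapply(1:nlayers(s2), function(i) is.na(values(s2[[i]])))
all(sapply(na_per_layer[-1], identical, na_per_layer[[1]]))  # TRUE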
I don't know about speed, but you might try converting to an array, setting the NAs there, and converting back. Pseudocode:
xarray <- as.array(xstack)
ind.na <- which(is.na(xarray), arr.ind = TRUE)
for (j in seq_len(nrow(ind.na))) {
  xarray[ind.na[j, 1], ind.na[j, 2], ] <- NA
}
nastack <- brick(xarray)
I haven't verified the correct choice of indices there, nor have I verified I converted back to raster stack correctly, but I hope you get the idea.
EDIT: I ran a time test, with rasters 1000x1000 but otherwise as Josh created.
microbenchmark(josh(s), syncNA1(s), syncNA2(s), times = 5)
Unit: milliseconds
       expr       min        lq    median        uq        max neval
    josh(s)  774.2363  789.1653  800.2511  806.5364   809.9087     5
 syncNA1(s)  652.3928  659.8327  692.3578  695.8057   743.9123     5
 syncNA2(s) 7951.3918 8291.7917 8604.2226 8606.3432 10254.4739     5
I ended up building a hybrid function between syncNA1 and Josh's solution.
This function is memory-safe when the computer does not have enough RAM, but processes faster when it does:
synchroniseNA <- function(x) {
  if (canProcessInMemory(x, n = 2)) {
    val <- getValues(x)
    NA.pos <- unique(which(is.na(val), arr.ind = TRUE)[, 1])
    val[NA.pos, ] <- NA
    x <- setValues(x, val)
    return(x)
  } else {
    x <- mask(x, calc(x, fun = sum))
    return(x)
  }
}
However, I only empirically determined that the amount of RAM used by a data.frame is about twice the size of a raster file (hence n = 2 in canProcessInMemory()), and I am not exactly sure I am right here.
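A quick usage sketch, reusing the stack s built in Josh's example above:

s_sync <- synchroniseNA(s)
# NA patterns now match across layers
identical(is.na(values(s_sync[[1]])), is.na(values(s_sync[[2]])))  # TRUE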
I would like to crop the no-data part of some rasters (see the example image, where no-data is shown in black) without defining the extent manually.
Any idea?
You can use trim to remove exterior rows and columns that only have NA values:
library(raster)
r <- raster(ncols=18,nrows=18)
r[39:49] <- 1
r[205] <- 6
s <- trim(r)
To change other values to or from NA you can use reclassify. For example, to change NA to 0:
x <- reclassify(r, cbind(NA, 0))
[ subsetting and [<- replacement methods are defined for raster objects, so you can simply do r[r[] == 1] <- NA to get rid of the values where 1 is your nodata value (if you aren't sure, NAvalue(r) tells you what R considers the nodata value to be).
Note you have to use r[] inside the [ subsetting command to access the values. Here is a worked example...
Example
# Make a raster from system file
logo1 <- raster(system.file("external/rlogo.grd", package="raster"))
# Copy to see difference
logo2 <- logo1
# Set all values in logo2 that are > 230 to be NA
logo2[ logo2[] > 230 ] <- NA
# Observe difference
par( mfrow = c( 1,2 ) )
plot(logo1)
plot(logo2)
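To tie this back to the cropping question: once the nodata values are NA, trim() (as in the answer above) removes the all-NA margins; a minimal follow-up:

# drop the outer rows/columns that are now entirely NA
logo3 <- trim(logo2)
plot(logo3)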
I have 2 slightly different solutions. The first requires you to manually identify the extent but uses predefined functions. The second is more automatic, but a bit more handmade.
Create a reproducible raster for which the first 2 rows are NA
library(raster)
# Create a reproducible example
r1 <- raster(ncol=10, nrow=10)
# The first 2 rows are filled with NAs (no value)
r1[] <- c(rep(NA,20),21:100)
Solution #1
Manually get the extent from the plotted figure using drawExtent()
plot(r1)
r1CropExtent <- drawExtent()
Crop the raster using the extent selected from the figure
r2 <- crop(r1, r1CropExtent)
Plot for comparison
layout(matrix(1:2, nrow=1))
plot(r1)
plot(r2)
Solution #2
It identifies the rows and columns of the raster that only have NA values and removes the ones that are on the margin of the raster. It then calculates the extent using extent().
Transform the raster into a matrix that identifies whether the values are NA or not.
r1NaM <- is.na(as.matrix(r1))
Find the columns and rows that are not completely filled by NAs
colNotNA <- which(colSums(r1NaM) != nrow(r1))
rowNotNA <- which(rowSums(r1NaM) != ncol(r1))
Find the extent of the new raster by using the first and last columns and rows that are not completely filled by NAs. Use crop() to crop the new raster.
r3Extent <- extent(r1, rowNotNA[1], rowNotNA[length(rowNotNA)],
colNotNA[1], colNotNA[length(colNotNA)])
r3 <- crop(r1, r3Extent)
Plot the rasters for comparison.
layout(matrix(1:2, nrow=1))
plot(r1)
plot(r3)
I have written a small function based on Marie's answer to quickly plot cropped rasters. However, there may be a memory issue if the raster is extremely large, because the computer may not have enough RAM to load the raster as a matrix.
I therefore wrote a memory-safe function which will use Marie's method if the computer has enough RAM (because it is the fastest way), or a method based on raster functions if the computer does not have enough RAM (it is slower but memory-safe).
Here is the function:
plotCroppedRaster <- function(x, na.value = NA) {
  if (!is.na(na.value)) {
    x[x == na.value] <- NA
  }
  if (canProcessInMemory(x, n = 2)) {
    # fast path: work on an in-memory matrix (Marie's method)
    x.matrix <- is.na(as.matrix(x))
    colNotNA <- which(colSums(x.matrix) != nrow(x))
    rowNotNA <- which(rowSums(x.matrix) != ncol(x))
    croppedExtent <- extent(x,
                            r1 = rowNotNA[1],
                            r2 = rowNotNA[length(rowNotNA)],
                            c1 = colNotNA[1],
                            c2 = colNotNA[length(colNotNA)])
    plot(crop(x, croppedExtent))
  } else {
    # memory-safe path: let raster compute the sums without loading a matrix
    xNA <- is.na(x)
    colNotNA <- which(colSums(xNA) != nrow(x))
    rowNotNA <- which(rowSums(xNA) != ncol(x))
    croppedExtent <- extent(x,
                            r1 = rowNotNA[1],
                            r2 = rowNotNA[length(rowNotNA)],
                            c1 = colNotNA[1],
                            c2 = colNotNA[length(colNotNA)])
    plot(crop(x, croppedExtent))
  }
}
Examples:
library(raster)
r1 <- raster(ncol=10, nrow=10)
r1[] <- c(rep(NA,20),21:100)
# Uncropped
plot(r1)
# Cropped
plotCroppedRaster(r1)
# If the no-data value is different, for example 0
r2 <- raster(ncol=10, nrow=10)
r2[] <- c(rep(0,20),21:100)
# Uncropped
plot(r2)
# Cropped
plotCroppedRaster(r2, na.value = 0)
If you use the rasterVis package (any version after Jun 25, 2021), it will automatically crop the NA values out for terra's SpatRaster objects.
Install the rasterVis development version from GitHub:
if (!require("librarian")) install.packages("librarian")
librarian::shelf(raster, terra, oscarperpinan/rastervis)
# Create a reproducible example
r1 <- raster(ncol = 10, nrow = 10)
# The first 2 rows are filled with NAs (no value)
r1[] <- c(rep(NA, 20), 21:100)
levelplot() for r1
rasterVis::levelplot(r1,
margin = list(axis = TRUE))
Convert to terra's SpatRaster then plot again using levelplot()
r2 <- rast(r1)
rasterVis::levelplot(r2,
margin = list(axis = TRUE))
Created on 2021-06-26 by the reprex package (v2.0.0)