What causes the difference between calc and cellStats in raster calculations in R?

I am working with a dataset of 20 layers, stacked in a RasterBrick (originating from an array). I have used calc to compute the sum of all values across the layers, and cellStats to get the average of the values per layer (useful for a time series).
However, when I sum the averages of the layers, I get half the value of the sum calculated with calc. What causes this difference? What am I overlooking?
The code looks like this:
# 127 x 147 grid with 2904 time steps: 127*147*2904 = 54214776 values
testarray <- array(runif(54214776, 0, 1), dim = c(127, 147, 2904))
x <- stack()
for (i in 1830:1849){
  slice <- testarray[,,i]
  # slice holds 127*147 values, so the raster dimensions must match
  # (ext1 is an extent object defined elsewhere in my script)
  r <- raster(nrow=127, ncol=147, ext=ext1, vals=slice)
  x <- stack(x, r)
}
brickhp2 <- brick(x)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2[r_sumhp2<= 0] <- NA
SWEavgpertimestepM <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
The goal is to compare the sum of the layers calculated with 'calc(x, sum)' with the sum of the mean calculated with 'cellStats(x, mean)'.
The RasterBrick looks like this (600 kB, GTiff): http://www.filedropper.com/brickhp2
*If there is a better way to share this, please let me know.

The confusion comes from using calc, which operates pixel-wise on a brick (i.e. it performs the calculation on the 20 values at each pixel and returns a single raster layer), together with cellStats, which performs the calculation on each raster layer individually and returns a single value per layer. You can see that the results are comparable if you use this code:
library(raster)
##set seed so you get the same runif vals
set.seed(999)
##create example rasters
ls <- list()
for (i in 1:20){
  r <- raster(nrow=(127*5), ncol=(147*5), vals=runif(127*5*147*5))
  ls[[i]] <- r
}
##create raster brick
brickhp2 <- brick(ls)
##calc sum (pixel-wise)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2 ##returns raster layer
##calc mean (layer-wise)
r_meanhp2 <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
r_meanhp2 ##returns vector of length nlayers(brickhp2)
##to get equivalent values you need to divide r_sumhp2 by the number of layers
##and then calculate the mean
cellStats(r_sumhp2/nlayers(brickhp2),stat="mean")
[1] 0.4999381
##and for r_meanhp2 you need to calculate the mean of the means
mean(r_meanhp2)
[1] 0.4999381
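Both checks agree because each reduces to the overall mean of all cell values (every layer has the same number of cells, and there are no NA values here). As a further sanity check, this one-liner (my addition, not in the original answer) gives the same number:
mean(values(brickhp2))
[1] 0.4999381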
You will need to determine for yourself whether the pixel-wise or the layer-wise result is appropriate for your application.

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I either get an error, get only one value (so it's computing the distance between the entire vectors), or end up with a matrix, when all I need is a fourth column with the distance for each pair (columns 'P' and 'S'). I'm a bit confused by matrices, so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Returns a matrix
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this if you would just like to add another column to your dataframe:
testdf$distance <- sqrt(testdf$P^2 + testdf$S^2)
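If you do want all pairwise distances from dist() tied back to IDs, here is one way (a sketch I'm adding, not part of the original answer) to unpack the matrix into a long table of pairs:
# Full symmetric distance matrix from the question's data
m <- as.matrix(stats::dist(testdf[, c("P", "S")], method = "euclidean"))
# Each row of 'pairs' is one (i, j) combination from the upper triangle
pairs <- which(upper.tri(m), arr.ind = TRUE)
pairdist <- data.frame(ID1 = testdf$ID[pairs[, 1]],
                       ID2 = testdf$ID[pairs[, 2]],
                       distance = m[pairs])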

R focal function: Calculate square of difference between raster cell and neighborhood and find the mean value for a 3x3 window

I am trying to calculate the square of the difference between a raster cell i and each of its neighbors j (i.e., (j - i)^2) in a 3 x 3 neighborhood, and then calculate the mean of those squared differences and assign that result to cell i.
I found this answer, given by Forrest R. Stevens, that comes close to what I want to achieve, but I have only one raster (not a stack) with 136710 cells (1 089 130 combinations with the adjacent function), so a for loop is taking forever.
I want to use the function focal from the raster package, so the for loop is only run for the 3x3 matrix, but it is not working for me.
Here is an example using Forrest R. Stevens' code I mentioned above:
r <- raster(matrix(1:25, nrow=5))
r[] <- c(2,3,2,3,2,
         3,2,3,2,NA,
         NA,3,2,3,2,
         NA,2,3,2,3,
         2,3,2,3,NA)
## Calculate adjacent raster cells for each focal cell:
a <- raster::adjacent(r, cell=1:ncell(r), directions=8, sorted=T)
# Function
sq_dff <- function(w){
  ## Create column to store calculation:
  out <- data.frame(a)
  out$sqrd_diff <- NA
  ## Loop over all focal cells and their adjacencies,
  ## extract the values across all layers and calculate
  ## the squared difference, storing it in the appropriate row of
  ## our output data.frame:
  cores <- 8
  beginCluster(cores, type='SOCK')
  for (i in 1:nrow(a)) {
    print(i)
    out$sqrd_diff[i] <- (r[a[i,2]] - r[a[i,1]])^2
    print(Sys.time())
  }
  endCluster()
  ## Take the mean of the squared differences by focal cell ID:
  r_out_vals <- aggregate(out$sqrd_diff, by=list(out$from), FUN=mean, na.rm=TRUE)
  names(r_out_vals) <- c('cell_numb','value')
  return(r_out_vals$value)
}
r1 <- focal(x=r, w=matrix(1,3,3), fun=sq_dff)
The function works well if I apply it directly, as r1 <- sq_dff(r), and use r_out <- r[[1]]; r_out[] <- r_out_vals$value; return(r_out) (as suggested by Forrest R. Stevens in his answer) instead of return(r_out_vals$value).
But when I apply it inside the focal function as written above, it returns a raster with values for only the nine cells in the center, all of them with the same value of 0.67.
Thanks!
You could try this with the terra package; its focal passes each window's values to your function and expects a single number back, which is exactly what your calculation needs:
library(terra)
r <- rast(matrix(1:25, nrow=5))
r[] <- c(2,3,2,3,2,
         3,2,3,2,NA,
         NA,3,2,3,2,
         NA,2,3,2,3,
         2,3,2,3,NA)
f <- function(x) {
  # x holds the nine values of the 3x3 window; x[5] is the focal cell,
  # so x[-5] are its eight neighbours
  mean((x[-5] - x[5])^2, na.rm=TRUE)
}
rr <- focal(r, 3, f)
plot(rr)
text(rr, dig=2)

R: Smoothing Data (LargeDataset - A For Loop is Too Slow)

I'm aware there are many questions related to smoothing data in R; however, my knowledge is far too basic to apply them to the following problem. My key issue is that my data has more than 1.7 million rows.
My Problem
I have a list "df" of 4 equal length vectors.
df[[1]] is a vector containing all uk postcodes
df[[2]] is a vector of latitudes
df[[3]] is a vector of longitudes
df[[4]] contains concentrations of a certain material
What I need to do is create a vector of 'smoothed' concentrations for each postcode, which should be calculated as: "A weighted average of concentrations in all postcodes within a given distance. The weighting is defined as exp(-Distance)"
I currently have the following code. It works perfectly (I've tested on a subset of 100k postcodes). However, it's far too slow, given the fact it loops over almost 2 million entries.
Can anyone help me finding a faster way to do this?
df <- as.list(Import[, c("Postcode", "Latitude", "Longitude", "Concentration")])
n <- length(df[[1]])
Out <- rep(0, n)
for (i in 1:n){
  # Squared Euclidean distance from postcode i to all postcodes
  BaseLat <- df[[2]][i]
  BaseLong <- df[[3]][i]
  Distance <- (df[[2]] - BaseLat)^2 + (df[[3]] - BaseLong)^2
  # Weightings (zero beyond a squared distance of 0.01)
  Weight <- ifelse(Distance < 0.01, exp(-Distance), 0)
  # Take the weighted average and assign it to the output vector
  Out[i] <- sum(df[[4]] * Weight) / sum(Weight)
}
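One possible speed-up (a sketch under the question's assumptions, not a posted answer): since the weight is zero beyond a squared distance of 0.01 (i.e. 0.1 degrees), you can bin the postcodes into 0.1-degree grid cells and compare each point only against points in its own and the eight surrounding cells, instead of against all rows:
# Sketch: grid-binning so each point is only compared with nearby points.
# Assumes the list df built above.
lat  <- df[[2]]; lon <- df[[3]]; conc <- df[[4]]
n    <- length(lat)
# Integer grid-cell coordinates, cell size 0.1 degrees
gx   <- floor(lat / 0.1)
gy   <- floor(lon / 0.1)
key  <- function(a, b) paste(a, b)
# Point indices grouped by grid cell
idx  <- split(seq_len(n), key(gx, gy))
Out  <- numeric(n)
for (i in seq_len(n)){
  # Candidates: points in the 3x3 block of cells around point i
  cells <- key(rep(gx[i] + (-1:1), each = 3), rep(gy[i] + (-1:1), times = 3))
  cand  <- unlist(idx[cells], use.names = FALSE)
  d2    <- (lat[cand] - lat[i])^2 + (lon[cand] - lon[i])^2
  w     <- ifelse(d2 < 0.01, exp(-d2), 0)
  Out[i] <- sum(conc[cand] * w) / sum(w)
}
The loop still visits every postcode, but each iteration now touches only nearby points, so the total work is roughly proportional to the number of neighbour pairs rather than n^2.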

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by assigning to each column the mean value of a window covering n columns around it. I've managed to do it, but I'd like to see what 'the R way' of doing it would be, since I am relying on for loops. Is there a way to do this with apply or some function of the same family?
Example:
# create a toy matrix (start with zero rows so no stray NA row is added)
mat <- matrix(nrow=0, ncol=200)
for (i in 1:100){ mat <- rbind(mat, sample(1:200, 200)) }
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat, which takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- matrix(nrow=nrow(k))
  for (i in k.range){
    if (i %% 10 == 0) cat('\r', round(i/length(k.range), 2))
    k.smooth <- cbind(k.smooth, rowMeans(k[, c((i-1-k.d):(i-1), i, (i+1):(i+1+k.d))]))
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat:
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix, and it still works on my real matrices (around 40,000 x 400), but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
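Note (my addition, for clarity): apply() collects the per-row results as columns, so to keep the original orientation you would wrap the call in t():
t(apply(k, 1, filter, rep(1/k.d, k.d)))
Here filter is stats::filter; the first and last few entries of each row come back as NA, since the running mean is undefined at the edges.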
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean, for a neighbourhood spanning n cells in the row direction (the focal cell plus (n-1)/2 cells on each side). The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, each cell should contribute 1/n, so we fill a one-row matrix of n columns with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)

Recalculating distance matrix

I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?
I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for the Euclidean distance of matrices, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.
Here is some sample data (smaller than yours):
library(fields)
num.points <- 400
num.vars <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)
and the distance matrix you already computed:
d0 <- rdist(original.points)
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:
extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist <- rdist(extra.points)
outer.dist <- rdist(extra.points, original.points)
so you can bind them to your bigger distance matrix:
d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))
Let's check that it matches what a full, long rerun would have produced:
d2 <- rdist(rbind(original.points, extra.points))
identical(d1, d2)
# [1] TRUE
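If downstream code expects a dist object rather than a matrix, you can convert the result as mentioned above (a usage note I'm adding):
d1.dist <- as.dist(d1)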
