R focal function: Calculate square of difference between raster cell and neighborhood and find the mean value for a 3x3 window

I am trying to calculate the square of the difference between a raster cell i and each of its neighbors j (i.e., (j - i)^2) in a 3x3 neighborhood, and then take the mean of those squared differences and assign the result to cell i.
I found this answer, given by Forrest R. Stevens, that comes close to what I want to achieve, but I have only one raster (not a stack) with 136,710 cells (1,089,130 cell pairs from the adjacent function), so a for loop takes forever.
I want to use the focal function from the raster package, so that the loop only runs over each 3x3 window, but it is not working for me.
Here is an example based on Forrest R. Stevens' code mentioned above:
library(raster)
r <- raster(matrix(1:25, nrow=5))
r[] <- c(2,  3,  2,  3,  2,
         3,  2,  3,  2, NA,
         NA, 3,  2,  3,  2,
         NA, 2,  3,  2,  3,
         2,  3,  2,  3, NA)
## Calculate adjacent raster cells for each focal cell:
a <- raster::adjacent(r, cells=1:ncell(r), directions=8, sorted=TRUE)
# Function
sq_dff <- function(w) {
  ## Create column to store calculation:
  out <- data.frame(a)
  out$sqrd_diff <- NA
  ## Loop over all focal cells and their adjacencies,
  ## extract the values and calculate the squared difference,
  ## storing it in the appropriate row of our output data.frame:
  cores <- 8
  beginCluster(cores, type='SOCK')
  for (i in 1:nrow(a)) {
    print(i)
    out$sqrd_diff[i] <- (r[a[i,2]] - r[a[i,1]])^2
    print(Sys.time())
  }
  endCluster()
  ## Take the mean of the squared differences by focal cell ID:
  r_out_vals <- aggregate(out$sqrd_diff, by=list(out$from), FUN=mean, na.rm=TRUE)
  names(r_out_vals) <- c('cell_numb', 'value')
  return(r_out_vals$value)
}
r1 <- focal(x=r, w=matrix(1,3,3), fun=sq_dff)
The function works well if I apply it directly as r1 <- sq_dff(r), replacing return(r_out_vals$value) with r_out <- r[[1]]; r_out[] <- r_out_vals$value; return(r_out) (as suggested by Forrest R. Stevens in his answer).
But when I apply it inside the focal function as written above, it returns a raster with values for only the nine cells in the center, all of them assigned the same value of 0.67.
Thanks!

You could try this with the terra package (the successor to raster):
library(terra)
r <- rast(matrix(1:25, nrow=5))
r[] <- c(2,  3,  2,  3,  2,
         3,  2,  3,  2, NA,
         NA, 3,  2,  3,  2,
         NA, 2,  3,  2,  3,
         2,  3,  2,  3, NA)
f <- function(x) {
  # x holds the nine values of the 3x3 window; x[5] is the focal cell
  mean((x[-5] - x[5])^2, na.rm=TRUE)
}
rr <- focal(r, 3, f)
plot(rr)
text(rr, dig=2)
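As a quick sanity check (an addition, not part of the original answer), the focal result can be compared against the pairwise computation the question builds with adjacent(); terra also has an adjacent() function with a pairs argument, assuming a reasonably recent terra version:
a  <- adjacent(r, cells=1:ncell(r), directions=8, pairs=TRUE)
v  <- values(r)[, 1]
sq <- (v[a[, 2]] - v[a[, 1]])^2
tapply(sq, a[, 1], mean, na.rm=TRUE)  # should broadly match values(rr)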

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, getting only 1 value back (so it's computing a distance over the entire vectors), or ending up with a matrix, when all I need is a 4th column with the distance for each pair of values in columns 'P' and 'S'. I'm a bit confused by matrices so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
The next attempt returns a matrix:
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this out if you would just like to add another column to your dataframe:
testdf$distance <- sqrt(P^2 + S^2)
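The apply() attempt in the question fails because the anonymous function receives each row as a single vector, not as separate P and S arguments. A corrected sketch (computing the same quantity as the line above, i.e. each point's distance from the origin):
testdf$distance <- apply(testdf[, c("P", "S")], 1, function(row) sqrt(sum(row^2)))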

What causes the difference between calc and cellStats in raster calculations in R?

I am working with a dataset that consists of 20 layers, stacked in a RasterBrick (originating from an array). I have looked into the sum of the layers, calculated with both 'calc' and 'cellStats'. I have used calc to calculate the sum of the total values and cellStats to look at the average of the values per layer (useful for a time series).
However, when I sum the average of each layer, it is half the value of the other calculated sum. What causes this difference? What am I overlooking?
Code looks like this:
testarray <- runif(54214776, 0, 1)
# testarray should hold a raster of 127x147 cells with 2904 time layers.
# Not really sure how to create that yet.
for (i in 1830:1849) {
  slice <- array2[,,i]
  r <- raster(nrow=(127*5), ncol=(147*5), resolution=5, ext=ext1, vals=slice)
  x <- stack(x, r)
}
brickhp2 <- brick(x)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2[r_sumhp2 <= 0] <- NA
SWEavgpertimestepM <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
The goal is to compare the sum of the layers calculated with 'calc(x, sum)' with the sum of the mean calculated with 'cellStats(x, mean)'.
Rasterbrick looks like this (600kb, GTiff): http://www.filedropper.com/brickhp2
*If there is a better way to share this, please let me know.
The confusion arises because calc operates pixel-wise on a brick (i.e., it performs the calculation on the 20 values at each pixel and returns a single raster layer), while cellStats performs the calculation on each raster layer individually and returns a single value per layer. You can see that the results are comparable if you use this code:
library(raster)
## set seed so you get the same runif vals
set.seed(999)
## create example rasters
ls <- list()
for (i in 1:20) {
  r <- raster(nrow=(127*5), ncol=(147*5), vals=runif(127*5*147*5))
  ls[[i]] <- r
}
## create raster brick
brickhp2 <- brick(ls)
## calc sum (pixel-wise)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2 ## returns raster layer
## calc mean (layer-wise)
r_meanhp2 <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
r_meanhp2 ## returns vector of length nlayers(brickhp2)
## to get equivalent values you need to divide r_sumhp2 by the number of layers
## and then calculate the mean
cellStats(r_sumhp2/nlayers(brickhp2), stat="mean")
[1] 0.4999381
## and for r_meanhp2 you need to calculate the mean of the means
mean(r_meanhp2)
[1] 0.4999381
You will need to determine for yourself whether the pixel-wise or the layer-wise result is appropriate for your application.
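Equivalently (an added note, following the same logic), summing the layer means matches the pixel-wise mean of the sum layer, which is the comparison the question was after; both equal the grand total divided by the number of pixels:
sum(r_meanhp2)                    # sum of the 20 layer means, ~10
cellStats(r_sumhp2, stat="mean")  # mean over pixels of the per-pixel sums, ~10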

R: Smoothing Data (LargeDataset - A For Loop is Too Slow)

I'm aware there are many questions related to smoothing data in R; however, my knowledge is far too basic to apply them to the following problem. My key issue is that my data has >1.7m rows.
My Problem
I have a list "df" of 4 equal length vectors.
df[[1]] is a vector containing all uk postcodes
df[[2]] is a vector of latitudes
df[[3]] is a vector of longitudes
df[[4]] contains concentrations of a certain material
What I need to do is create a vector of 'smoothed' concentrations for each postcode, which should be calculated as: "A weighted average of concentrations in all postcodes within a given distance. The weighting is defined as exp(-Distance)"
I currently have the following code. It works perfectly (I've tested on a subset of 100k postcodes). However, it's far too slow, given the fact it loops over almost 2 million entries.
Can anyone help me find a faster way to do this?
df <- as.list(Import[, c("Postcode", "Latitude", "Longitude", "Concentration")])
n <- length(df[[1]])
Out <- rep(0, n)
for (i in 1:n) {
  # Calculate squared Euclidean distance
  BaseLat <- df[[2]][i]
  BaseLong <- df[[3]][i]
  Distance <- (df[[2]] - BaseLat)^2 + (df[[3]] - BaseLong)^2
  # Weightings
  Weight <- ifelse(Distance < 0.01, exp(-Distance), 0)
  # Take the weighted average and assign to output vector
  Out[i] <- sum(df[[4]] * Weight) / sum(Weight)
}
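The loop is quadratic: each of the ~1.7m iterations computes a distance to every other row, even though the weight is forced to zero beyond a squared distance of 0.01. A possible speed-up (a sketch, assuming the RANN package is acceptable and that k = 50 neighbours within the radius is enough for your data) is a k-d tree fixed-radius search, so each point only compares against its nearby points:
library(RANN)
coords <- cbind(df[[2]], df[[3]])
# Distance < 0.01 on the squared scale is a Euclidean radius of 0.1:
nn <- nn2(coords, k=50, searchtype="radius", radius=0.1)
Out <- vapply(seq_len(nrow(coords)), function(i) {
  idx  <- nn$nn.idx[i, ]
  keep <- idx > 0                  # slots beyond the radius come back as 0
  d2   <- nn$nn.dists[i, keep]^2   # nn2 returns Euclidean distances
  w    <- exp(-d2)
  sum(df[[4]][idx[keep]] * w) / sum(w)
}, numeric(1))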

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by attributing to each column the mean value of a window covering n columns around it. I've managed to do it, but I'd like to see what 'the R way' of doing it would be, as I am making use of for loops. Is there a way to do this with apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(ncol=200)
for (i in 1:100) { mat <- rbind(mat, sample(1:200, 200)) }
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat, which takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5) {
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- matrix(nrow=nrow(k))
  for (i in k.range) {
    if (i %% 10 == 0) cat('\r', round(i / length(k.range), 2))
    # row means over the k.d+1 columns either side of column i:
    k.smooth <- cbind(k.smooth,
                      rowMeans(k[, c((i - 1 - k.d):(i - 1), i, (i + 1):(i + 1 + k.d))]))
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix, and it still works on my real matrices (around 40,000 x 400), but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
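One caveat (an added note): apply() over rows returns its results as columns, so the filtered matrix comes back transposed; wrap the call in t() to restore the original layout:
k.smooth <- t(apply(k, 1, filter, rep(1/k.d, k.d)))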
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean over a window n columns wide, centred on the focal cell. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we want each of the n cells to contribute 1/n, so we fill a matrix of n columns with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)
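If you need a plain matrix back rather than a raster object (an added note), as.matrix() converts the smoothed layer:
mat.smooth <- as.matrix(smooth_r)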

Applying the pvclust R function to a precomputed dist object

I'm using R to perform hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it in a dist object
I run hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?
I've tested Vincent's suggestion; you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d, 1, eig=TRUE)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
     type="h", lwd=5, las=1,
     xlab="Number of dimensions",
     ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d, nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist(matrix(rnorm(k*n), nc=k), method="manhattan")
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot(sum(abs(dist(x) - d)) < 1e-6)
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x))  # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or whether you computed it beforehand. In the former case, as already suggested by Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints in another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
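For what it's worth (an added note, assuming pvclust >= 2.0, where method.dist may be a user-supplied function), recent CRAN versions let you pass a custom distance function directly, which avoids patching the package. A sketch with a cosine dissimilarity between columns (data_matrix is a placeholder for your data):
library(pvclust)
# Custom distance: cosine dissimilarity between columns, returned
# as the dist object that pvclust expects.
cosine_dist <- function(x) {
  sim <- crossprod(x) / tcrossprod(sqrt(colSums(x^2)))
  as.dist(1 - sim)
}
result <- pvclust(data_matrix, method.dist=cosine_dist)  # data_matrix: placeholder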
