How to input a dissimilarity matrix for spatial analysis in spdep (R)

Aim: I want to create a dissimilarity matrix between pairs of coordinates and use it as the input for calculating local spatial clusters with Moran's I (LISA), and later in a geographically weighted regression (GWR).
Problem: I know I can use dnearneigh{spdep} to build a neighbours object from distances between coordinates. However, I want to use the travel times between polygons that I have already estimated. In practice, I think this amounts to inputting a dissimilarity matrix that gives the distance/difference between polygons based on another characteristic. I've tried passing my matrix to dnearneigh{spdep}, but I get the error Error: ncol(x) == 2 is not TRUE
dist_matrix <- dnearneigh(diss_matrix_invers, d1=0, d2=5, longlat = F, row.names=rn)
Any suggestions? There is a reproducible example below:
EDIT: Digging a bit further, I think I could use mat2listw{spdep}, but I'm still not sure it keeps the correspondence between the matrix and the polygons. If I add row.names = T it returns the error row.names wrong length :(
listw_dissi <- mat2listw(diss_matrix_invers)
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy = TRUE, alternative = "two.sided")
Reproducible example
library(UScensus2000tract)
library(spdep)
library(ggplot2)
library(dplyr)
library(reshape2)
library(magrittr)
library(data.table)
library(reshape)
library(rgeos)
library(geosphere)
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame( gCentroid(oregon.tract, byid=TRUE) )
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# create Origin-destination pairs
od_pairs <- expand.grid.df(centroids, centroids) %>% setDT()
colnames(od_pairs) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")
# calculate dissimilarity between each pair.
# For the sake of this example, let's use ellipsoid distances. In my real case I have travel-time estimates
od_pairs[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
matrix(c(long_dest, lat_dest), ncol = 2))]
# This is the format of my travel-time estimates; there are missing values for origin-destination pairs that are too far apart (more than 2 hours)
od_pairs <- od_pairs[, .(origi_id, dest_id, dist)]
od_pairs$dist[3] <- NA
> origi_id dest_id dist
> 1: oregon_0 oregon_0 0.00000
> 2: oregon_1 oregon_0 NA
> 3: oregon_2 oregon_0 39874.63673
> 4: oregon_3 oregon_0 31259.63100
> 5: oregon_4 oregon_0 33047.84249
# Convert to matrix
diss_matrix <- acast(od_pairs, origi_id~dest_id, value.var="dist") %>% as.matrix()
# get an inverse matrix of distances, make sure diagonal=0
diss_matrix_invers <- 1/diss_matrix
diag(diss_matrix_invers) <- 0
Calculate simple distance matrix
# get row names
rn <- sapply(slot(oregon.tract, "polygons"), function(x) slot(x, "ID"))
# get centroids coordinates
coords <- coordinates(oregon.tract)
# get distance matrix
dist_matrix <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE, row.names = rn)
class(dist_matrix)
> [1] "nb"
Now how to use my diss_matrix_invers here?

You are right about the use of mat2listw{spdep}. By default the function preserves the row names, which keeps the correspondence between the matrix and the polygons. You can also specify the row names explicitly:
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
The listw object that is created contains the appropriate names for the neighbours, along with their inverse distances as weights. You can check this by looking at the neighbours:
listw_dissi$neighbours[[1]][1:5]
And you should be able to use this directly to calculate Moran's I.
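For instance, a minimal sketch that fits the local Moran's I with this weights list and attaches the statistic back to the tracts (it reuses the objects from the question; the column layout of the localmoran output, with Ii first, is the usual spdep layout):
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy = TRUE, alternative = "two.sided")
# attach the local statistic to the polygons, relying on the preserved row order
oregon.tract@data$lmoran_i <- lmoran[, 1]  # Ii, the local Moran statistic
head(lmoran)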
dnearneigh{spdep}
There is no way you can use diss_matrix within dnearneigh{spdep}, as this function expects a two-column matrix of point coordinates, not a distance matrix.
However, if you need to define a set of neighbours given a distance threshold (d1, d2) using your own distance matrix (travel time), a function like the one below can do the trick.
dis.neigh <- function(x, d1 = 0, d2 = 50){
  # x must be a symmetric distance matrix
  style <- "M"  # style unknown
  neighbours <- list()
  weights <- list()
  # set attributes of the neighbours list
  attr(neighbours, "class") <- "nb"
  attr(neighbours, "distances") <- c(d1, d2)
  attr(neighbours, "region.id") <- colnames(x)
  # check each row for neighbours that satisfy the distance threshold
  for(row in seq_len(nrow(x))){
    neighbour <- c()  # reset for every row
    weight <- c()
    j <- 1
    for(col in seq_len(ncol(x))){
      # skip missing travel times, keep cells strictly between d1 and d2
      if(!is.na(x[row, col]) && x[row, col] > d1 && x[row, col] < d2){
        neighbour[j] <- col
        weight[j] <- 1 / x[row, col]  # inverse distance (dissimilarity)
        j <- j + 1
      }
    }
    neighbours[row] <- list(neighbour)
    weights[row] <- list(weight)
  }
  # create the neighbour and weight list
  res <- list(style = style, neighbours = neighbours, weights = weights)
  class(res) <- c("listw", "nb")
  attr(res, "region.id") <- attr(neighbours, "region.id")
  attr(res, "call") <- match.call()
  return(res)
}
And use it like so:
nb_list <- dis.neigh(diss_matrix, d1 = 0, d2 = 10000)
lmoran <- localmoran(oregon.tract@data$white, nb_list, alternative = "two.sided")
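To confirm that the correspondence with the polygons was kept, a quick check of the object the function returns (plain list inspection, nothing spdep-specific):
# region ids should match the column names of the travel-time matrix
head(attr(nb_list, "region.id"))
# neighbour indices and inverse-distance weights of the first tract
nb_list$neighbours[[1]][1:5]
nb_list$weights[[1]][1:5]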

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, returning only 1 value (so it's computing the distance between the entire vector), or I end up with a matrix when all I need is a 4th column with the distance between each pair (columns 'P' and 'S.') I'm a bit confused by matrices so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Returns a matrix
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this out if you would just like to add another column to your dataframe:
testdf$distance <- sqrt((P^2 + S^2))
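If what you actually need are the distances between every pair of rows, pulled out of the dist() result and matched back to the IDs, here is a minimal sketch that only uses the testdf from the question:
# full pairwise Euclidean distance matrix between rows of testdf, labelled by ID
dmat <- as.matrix(stats::dist(testdf[, c("P", "S")], method = "euclidean"))
rownames(dmat) <- colnames(dmat) <- testdf$ID
# reshape the upper triangle into a long table: one row per pair of IDs
idx <- which(upper.tri(dmat), arr.ind = TRUE)
pair_dists <- data.frame(ID1 = rownames(dmat)[idx[, 1]],
                         ID2 = colnames(dmat)[idx[, 2]],
                         distance = dmat[idx])
head(pair_dists)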

Using ppx and nndist to find nearest neighbour of each type

For a dataset with three covariates and a treatment indicator I'm trying to find each individual's nearest neighbour. In particular, I want to find the nearest neighbour in each of the treatment groups.
# Generate a treatment indicator factor
treatment <- factor(data_train[,"a"], levels = c("0", "1"), labels = c("Untreated", "Treated"))
# Put the covariate data into 'points' format
pointpattern <- ppx(data = data.frame(data_train[, c("Z1", "Z2", "Z3")], "Treatment" = treatment), coord.type = c("s", "s", "s", "m"))
# Find the nearest neighbour of each type
dists <- nndist(X = pointpattern, by = marks(pointpattern))
However the object 'dists' is only a vector, which appears to be the nearest neighbour with the treatment group completely ignored.
I've spent nearly a full day trying to figure out what I'm doing wrong - please help!
The argument by is not recognised by the function nndist.ppx which you are using in this example.
This is about "classes and methods" in R. The function nndist is generic; when you call nndist on an object of class "ppx", the system invokes the function nndist.ppx which is the "method" for this class.
You can check the capabilities of nndist.ppx by looking at its help file; it does not support the argument by.
There are other nndist methods which do recognise the argument by, for example nndist.ppp, and I guess you were looking at the documentation for that.
We will update the code in spatstat so that this capability is available for nndist.ppx as well.
In the meantime, you can use the function nncross.ppx to find nearest distances from one group of points to another. Here's how to get the result you wanted:
Y <- split(pointpattern)   # divide into groups
m <- length(Y)             # number of groups
n <- npoints(pointpattern)
result <- matrix(, n, m)   # final results will go here
partresults <- list()      # collect results for each group here
for (i in 1:m) {
  Yi <- Y[[i]]
  ni <- npoints(Yi)
  a <- matrix(, ni, m)
  a[, i] <- nndist(Yi)
  for (j in (1:m)[-i])
    a[, j] <- nncross(Yi, Y[[j]], what = "d")
  partresults[[i]] <- a
}
split(result, marks(pointpattern)) <- partresults
Then result is the matrix of distances you wanted.
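If you want the columns of result labelled by treatment group, a small optional addition (assuming split() returned a list named by the mark levels, which is its usual behaviour):
colnames(result) <- names(Y)  # e.g. "Untreated", "Treated"
head(result)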

Large dataset and autocorrelation computation

I have geographical data at the town level for 35 000 towns.
I want to estimate the impact of my covariates X on a dependent variable Y, taking into account autocorrelation.
I first computed a weight matrix and then used the command spautolm from the spdep package, but it returned an error because my dataset is too large.
Do you have any ideas of how I can fix it? Are there any equivalent commands that would work?
library(haven)
library(tibble)
library(sp)
library(data.table)
myvars <- c("longitude","latitude","Y","X")
newdata2 <- na.omit(X2000[myvars]) # drop observations with missing values
df <- data.frame(newdata2)
newdata3<- unique(df) #drop duplicates in terms of longitude and latitude
coordinates(newdata3) <- c("longitude","latitude") # set the coordinates
coords<-coordinates(newdata3)
Sy4_nb <- knn2nb(knearneigh(coords, k = 4)) # neighbours list from the 4 nearest neighbours
Sy4_lw_idwB <- nb2listw(Sy4_nb, glist = idw, style = "B") # weights list, weighted by inverse distance
When I try to run the model:
spautolm(formula = Y~X, data = newdata3, listw = Sy4_lw_idwB)
it returns: Error: cannot allocate vector of size 8.3 Gb
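For scale: the dense default of spautolm needs to hold an n x n matrix of doubles, and for n around 35,000 towns that alone is close to 10 GB, which is consistent with the allocation error above. A back-of-the-envelope check, plus the kind of sparse-method switch to look for in ?spautolm (the exact method value is an assumption to verify against your installed spdep/spatialreg version):
# memory needed for one dense n x n matrix of doubles
n <- 35000
n^2 * 8 / 1e9  # ~9.8 GB, more than most machines can allocate in one block

# hypothetical sparse-method call -- check ?spautolm for the method values
# your version actually supports before relying on this
# fit <- spautolm(Y ~ X, data = newdata3, listw = Sy4_lw_idwB, method = "Matrix")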

How to draw dendrogram using ape package in R?

I have a distance matrix of ~200 x 200 size. I am unable to plot a readable dendrogram using the BioNJ option of the ape library in R; the tree is too big for the plot to be legible.
What ways can I improve the visibility?
Two options, depending on your data.
If you need to calculate the distance matrix from your data, then use:
set.seed(1) # makes random sampling with rnorm reproducible
# example matrix
m <- matrix(rnorm(100), nrow = 5) # any MxN matrix
distm <- dist(m) # distance matrix
hm <- hclust(distm)
plot(hm)
If your data is already a distance matrix (it must be a square matrix!):
set.seed(1)
# example matrix
m <- matrix(rnorm(25), nrow=5) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
A 200 x 200 distance matrix gives me a reasonable plot
set.seed(1)
# example matrix
m <- matrix(rnorm(200*200), nrow=200) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
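Since the question specifically mentions BioNJ and readability, here is a hedged sketch of one way to make a tree with ~200 tips easier to read using ape itself: build the tree with bionj() and plot it in a fan layout with smaller tip labels (the random matrix below is only a stand-in for your real distance matrix):
library(ape)
set.seed(1)
# stand-in for your 200 x 200 distance matrix: symmetric, non-negative, zero diagonal
m <- abs(matrix(rnorm(200 * 200), nrow = 200))
m <- (m + t(m)) / 2
diag(m) <- 0
rownames(m) <- colnames(m) <- paste0("t", 1:200)
tree <- bionj(as.dist(m))            # BioNJ tree from the distance matrix
plot(tree, type = "fan", cex = 0.4,  # fan layout + smaller labels helps readability
     no.margin = TRUE)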

K-means clustering with my own distance function

I have defined a distance function as follow
jaccard.rules.dist <- function(x, y) {
  # implements a feature distance. The "Aerolinea" (airline) column gets a
  # different treatment; the rest are booleans coded as 1/0. The airline
  # column contributes 0 if both rows have the same airline, 1 otherwise;
  # each remaining attribute's distance is zero iff both are 1, 1 otherwise
  airline.column <- which(colnames(x) == "Aerolinea")
  xmod <- x
  ymod <- y
  xmod[airline.column] <- ifelse(x[airline.column] == y[airline.column], 1, 0)
  ymod[airline.column] <- 1  # if they are the same, they are both ones, else they differ
  andval <- sum(xmod & ymod)
  orval <- sum(xmod | ymod)
  return(1 - andval / orval)
}
which modifies the Jaccard distance a little for data frames of the form
t <- data.frame(Aerolinea = c("A","B","C","A"), atr2 = c(1,1,0,0), atr3 = c(0,0,0,1))
Now I would like to perform some k-means clustering on my dataset, using the distance just defined. If I try to use the function kmeans, there is no way to specify my distance function. I tried to use hclust, which accepts a distance matrix, which I calculated as follows:
distmat <- matrix(nrow = nrow(t), ncol = nrow(t))
for (i in 1:nrow(t))
  for (j in i:nrow(t))
    distmat[j, i] <- jaccard.rules.dist(t[j, ], t[i, ])
distmat <- as.dist(distmat)
and then invoked hclust
hclust(distmat)
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
What am I doing wrong? Is there another way to do clustering that accepts an arbitrary distance function as its input?
Thanks in advance.
I think distmat (from your code) has to be a distance structure (which is different from a matrix). Try this instead:
require(proxy)
d <- dist(t, jaccard.rules.dist)
clust <- hclust(d=d)
clust$centers
[,1] [,2]
[1,] 0.044128322 -0.039518142
[2,] -0.986798495 0.975132418
[3,] -0.006441892 0.001099211
[4,] 1.487829642 1.000431146
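On the second part of the question (clustering that accepts an arbitrary distance): k-means itself needs raw coordinates because it recomputes centroids, so a common substitute is k-medoids (PAM), which works directly from a precomputed distance object. A hedged sketch using cluster::pam on the d built above; the choice of k = 2 is arbitrary for illustration:
library(cluster)
# k-medoids on the custom Jaccard-style distances; k chosen arbitrarily here
fit <- pam(d, k = 2, diss = TRUE)
fit$clustering  # cluster assignment for each row of t
fit$medoids     # which rows act as cluster "centers" (medoids)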
