Let's assume I have done several operations and created cluster vectors of correlation values shown below
D <- matrix(rexp(10*10,rate=.1), ncol=10) #create a randomly filled 10x10 matrix
C <- matrix(rexp(10*10,rate=.1),ncol=10)
DCor <- cor(D) # generate correlation matrix
CCor <- cor(C)
DUpper<- DCor[upper.tri(DCor)] # extract upper triangle
CUpper<- CCor[upper.tri(CCor)]
ClusterD <- kmeans(DUpper,3) # cluster correlations
ClusterC <- kmeans(CUpper,3)
ClusterC <- cbind(c(1:45),matrix(ClusterC$cluster)) # add row numbers as column
ClusterD <- cbind(c(1:45),matrix(ClusterD$cluster))
I would like to generate a matrix that shows the intersection of each pair of cluster groups. In this matrix, for example, 5 rows would belong to both group C1 and group D2.
How can I generate a matrix like this?
Before the cbind lines, you could do:
table(ClusterC$cluster, ClusterD$cluster)
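If you have already run the cbind lines, ClusterC and ClusterD are plain matrices and $cluster is no longer available; a sketch of the same idea is to cross-tabulate their second columns, which hold the cluster assignments:
table(C = ClusterC[, 2], D = ClusterD[, 2])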
I am trying to calculate the squared difference between a raster cell i and each of its neighbors j (i.e., (j - i)^2) in a 3 x 3 neighborhood, and then calculate the mean of those differences and assign that result to cell i.
I found this answer, given by Forrest R. Stevens, that comes close to what I want to achieve, but I have only one raster (not a stack) with 136,710 cells (1,089,130 cell pairs from the adjacent function), so a for loop is taking forever.
I want to use the function focal from the raster package, so that the loop only runs over each 3 x 3 window, but it is not working for me.
Here is an example using Forrest R. Stevens' code I mentioned above:
library(raster)
r <- raster(matrix(1:25, nrow=5))
r[] <- c(2,3,2,3,2,
         3,2,3,2,NA,
         NA,3,2,3,2,
         NA,2,3,2,3,
         2,3,2,3,NA)
## Calculate adjacent raster cells for each focal cell:
a <- raster::adjacent(r, cell=1:ncell(r), directions=8, sorted=T)
# Function
sq_dff <- function(w){
  ## Create column to store calculation:
  out <- data.frame(a)
  out$sqrd_diff <- NA
  ## Loop over all focal cells and their adjacencies,
  ## extract the values across all layers and calculate
  ## the squared difference, storing it in the appropriate row of
  ## our output data.frame:
  cores <- 8
  beginCluster(cores, type='SOCK')
  for (i in 1:nrow(a)) {
    print(i)
    out$sqrd_diff[i] <- (r[a[i,2]] - r[a[i,1]])^2
    print(Sys.time())
  }
  endCluster()
  ## Take the mean of the squared differences by focal cell ID:
  r_out_vals <- aggregate(out$sqrd_diff, by=list(out$from), FUN=mean, na.rm=T)
  names(r_out_vals) <- c('cell_numb','value')
  return(r_out_vals$value)
}
r1 <- focal(x=r, w=matrix(1,3,3), fun=sq_dff)
The function works well if I apply it like this: r1 <- sq_dff(r), and use r_out <- r[[1]]; r_out[] <- r_out_vals$value; return(r_out) (as suggested by Forrest R. Stevens in his answer) instead of return(r_out_vals$value).
But when I apply it inside the focal function as written above, it returns a raster with values for only the nine cells in the center, all of which are assigned the same value of 0.67.
Thanks!
You could try this:
library(terra)
r <- rast(matrix(1:25,nrow=5))
r[] <- c(2,3,2,3,2,
         3,2,3,2,NA,
         NA,3,2,3,2,
         NA,2,3,2,3,
         2,3,2,3,NA)
# x holds the nine values of the 3 x 3 window; x[5] is the focal cell
f <- function(x) {
  mean((x[-5] - x[5])^2, na.rm=TRUE)
}
rr <- focal(r, 3, f)
plot(rr)
text(rr, dig=2)
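If you prefer to stay with the raster package the question uses, a roughly equivalent call is sketched below; it assumes raster::focal passes the nine window values as a vector with the focal cell in position 5, and uses pad = TRUE so edge cells get a full, NA-padded window.
library(raster)
# r is the 5 x 5 example raster from the question
r <- raster(matrix(1:25, nrow=5))
r[] <- c(2,3,2,3,2, 3,2,3,2,NA, NA,3,2,3,2, NA,2,3,2,3, 2,3,2,3,NA)
# mean squared difference between the focal cell (x[5]) and its 8 neighbours
r1 <- focal(r, w = matrix(1, 3, 3),
            fun = function(x) mean((x[-5] - x[5])^2, na.rm = TRUE),
            pad = TRUE, padValue = NA)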
I have a distance matrix of ~200 x 200 size, and I am unable to plot a dendrogram using the BioNJ option of the ape library in R.
The matrix is too big for the plot to be readable.
What ways can I improve the visibility?
Two options, depending on your data.
If you need to calculate the distance matrix from your data, then use:
set.seed(1) # makes random sampling with rnorm reproducible
# example matrix
m <- matrix(rnorm(100), nrow = 5) # any MxN matrix
distm <- dist(m) # distance matrix
hm <- hclust(distm)
plot(hm)
If your data is a distance matrix (must be a square matrix!)
set.seed(1)
# example matrix
m <- matrix(rnorm(25), nrow=5) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
A 200 x 200 distance matrix gives me a reasonable plot
set.seed(1)
# example matrix
m <- matrix(rnorm(200*200), nrow=200) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
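If you specifically want the BioNJ tree from the ape package rather than an hclust dendrogram, a minimal sketch (assuming distm is your 200 x 200 distance matrix, or a dist object) would be:
library(ape)
tree <- bionj(as.dist(distm))   # BioNJ tree built from the distance matrix
# a fan layout with smaller tip labels usually makes a 200-tip tree readable
plot(tree, type = "fan", cex = 0.4, no.margin = TRUE)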
Aim: I want to create a dissimilarity matrix between pairs of coordinates. I want to use this matrix as an input to calculate local spatial clusters using Moran's I (LISA) and later in geographically weighted regression (GWR).
Problem: I know I can use dnearneigh{spdep} to calculate a distance matrix. However, I want to use the travel times between polygons that I have already estimated. In practice, I think this would be like inputting a dissimilarity matrix that tells the distance/difference between polygons based on another characteristic. I've tried inputting my matrix to dnearneigh{spdep}, but I get the error Error: ncol(x) == 2 is not TRUE
dist_matrix <- dnearneigh(diss_matrix_invers, d1=0, d2=5, longlat = F, row.names=rn)
Any suggestions? There is a reproducible example below:
EDIT: Digging a bit further, I think I could use mat2listw{spdep} but I'm still not sure it keeps the correspondence between the matrix and the polygons. If I add row.names = T it returns an error row.names wrong length :(
listw_dissi <- mat2listw(diss_matrix_invers)
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy=T, alternative= "two.sided")
Reproducible example
library(UScensus2000tract)
library(spdep)
library(ggplot2)
library(dplyr)
library(reshape2)
library(magrittr)
library(data.table)
library(reshape)
library(rgeos)
library(geosphere)
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame( gCentroid(oregon.tract, byid=TRUE) )
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# create Origin-destination pairs
od_pairs <- expand.grid.df(centroids, centroids) %>% setDT()
colnames(od_pairs) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")
# calculate dissimilarity between each pair.
# For the sake of this example, let's use ellipsoid distances. In my real case I have travel-time estimates
od_pairs[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
matrix(c(long_dest, lat_dest), ncol = 2))]
# This is the format in which my travel-time estimates are organized. It has some missing values for origin-destination pairs that are too far apart (more than 2 hours)
od_pairs <- od_pairs[, .(origi_id, dest_id, dist)]
od_pairs$dist[3] <- NA
> origi_id dest_id dist
> 1: oregon_0 oregon_0 0.00000
> 2: oregon_1 oregon_0 NA
> 3: oregon_2 oregon_0 39874.63673
> 4: oregon_3 oregon_0 31259.63100
> 5: oregon_4 oregon_0 33047.84249
# Convert to matrix
diss_matrix <- acast(od_pairs, origi_id~dest_id, value.var="dist") %>% as.matrix()
# get an inverse matrix of distances, make sure diagonal=0
diss_matrix_invers <- 1/diss_matrix
diag(diss_matrix_invers) <- 0
Calculate simple distance matrix
# get row names
rn <- sapply(slot(oregon.tract, "polygons"), function(x) slot(x, "ID"))
# get centroids coordinates
coords <- coordinates(oregon.tract)
# get distance matrix
dist_matrix <- dnearneigh(coords, d1=0, d2=5, longlat=TRUE, row.names=rn)
class(dist_matrix)
> [1] "nb"
Now how to use my diss_matrix_invers here?
You are right about the use of mat2listw{spdep}. By default the function preserves the row names to keep the correspondence between the matrix and the polygons. You can also specify the row.names like so:
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
The list that is created will contain the appropriate names for the neighbours along with their distances as weights. You can check this by looking at the neighbours:
listw_dissi$neighbours[[1]][1:5]
And you should be able to use this directly to calculate Moran's I.
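For example, reusing the call from the question:
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy = TRUE, alternative = "two.sided")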
dnearneigh{spdep}
There is no way you can use diss_matrix within dnearneigh{spdep}, as this function takes a matrix of point coordinates.
However, if you need to define a set of neighbours given a distance threshold (d1, d2) using your own distance matrix (travel time), the function below can do the trick:
dis.neigh <- function(x, d1 = 0, d2 = 50){
  # x must be a symmetrical distance matrix
  style <- "M"  # style unknown
  # create empty neighbour and weight lists
  neighbours <- list()
  weights <- list()
  # set attributes of the neighbours list
  attr(neighbours, "class") <- "nb"
  attr(neighbours, "distances") <- c(d1, d2)
  attr(neighbours, "region.id") <- colnames(x)
  # check each row for neighbours that satisfy the distance threshold
  i <- 1
  for(row in c(1:nrow(x))){
    neighbour <- c()  # reset for each row so entries do not carry over
    weight <- c()
    j <- 1
    for(col in c(1:ncol(x))){
      if(x[row, col] > d1 && x[row, col] < d2){
        neighbour[j] <- col
        weight[j] <- 1/x[row, col]  # inverse distance (dissimilarity)
        j <- j + 1
      }
    }
    neighbours[i] <- list(neighbour)
    weights[i] <- list(weight)
    i <- i + 1
  }
  # assemble the neighbour and weight lists into a listw-like object
  res <- list(style = style, neighbours = neighbours, weights = weights)
  class(res) <- c("listw", "nb")
  attr(res, "region.id") <- attr(neighbours, "region.id")
  attr(res, "call") <- match.call()
  return(res)
}
And use it like so:
nb_list <- dis.neigh(diss_matrix, d1=0, d2=10000)
lmoran <- localmoran(oregon.tract@data$white, nb_list, alternative= "two.sided")
Does anyone know how to generate a matrix with a certain rank in R?
I ultimately want to create a data matrix Y = X + E,
where rank(X) = k and the entries of E are i.i.d. N(0, sigma^2).
The easiest is the identity matrix, which always has full rank. So e.g. use:
k <- 10
mymatrix <- diag(k)
Here, the numbers of rows and columns are both equal to the rank you specify.
I suppose you want to mimic a regression model, so you might want to have more rows (i.e. 'observations') than columns (i.e. 'variables'). The following code allows you to specify both:
k <- 5       # rank of your matrix
nobs <- 10   # number of rows (observations) in X
X <- rbind(diag(k), matrix(rep(0, k*(nobs-k)), ncol=k))
y <- X + matrix(rnorm(nobs*k), ncol=k)  # i.i.d. N(0,1) noise added to every entry
Note that X, and therefore also y, now has full column rank, so there is no multicollinearity in this 'model'.
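If you need a rank-k matrix whose entries are not just zeros and ones, one standard construction (a sketch; the dimensions and sigma below are illustrative) is to multiply an n x k by a k x m matrix of random values, which has rank k with probability one:
n <- 100; m <- 20; k <- 5; sigma <- 0.5
X <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * m), k, m)
qr(X)$rank                                       # check: should be k
Y <- X + matrix(rnorm(n * m, sd = sigma), n, m)  # Y = X + E, E ~ i.i.d. N(0, sigma^2)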
I have a matrix with a hundred rows.
Is there a way to obtain a subset of the ten rows which are most similar to the first row?
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)
set1 <- subset(res2, res2 >condition1)
set1[with(set1, order(condition)), ]
set2 <- head(set1,10)
Perhaps:
Generate data:
set.seed(101)
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)
Calculate the distance matrix. This is very inefficient because we're computing all of the pairwise distances, but it's efficiently coded and easy to use and you have lots of choices of distance metric (see ?dist, look for method). For this size problem it's very quick.
dd <- dist(res2)
rr <- rank(as.matrix(dd)[1,])
You'll notice that the rank of the first element of the first row (which is the distance between row 1 and itself) is 1, and its value (as.matrix(dd)[1,1]) is zero. So all we need now are the rows with the next ten smallest distances ...
res2[rr>1 & rr<=11,]
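Equivalently, you can pick the indices with order() instead of rank(); this is just a sketch of an alternative that gives the same rows:
nearest10 <- order(as.matrix(dd)[1, ])[2:11]  # drop position 1, the row itself
res2[nearest10, ]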