How to draw dendrogram using ape package in R?

I have a distance matrix of about 200 x 200. I am unable to plot a readable dendrogram using the BioNJ option of the ape library in R.
The tree is too big for the plot to be legible.
How can I improve the visibility?

Two options, depending on your data.
If you need to calculate the distance matrix from your data, then use:
set.seed(1) # makes random sampling with rnorm reproducible
# example matrix
m <- matrix(rnorm(100), nrow = 5) # any MxN matrix
distm <- dist(m) # distance matrix
hm <- hclust(distm)
plot(hm)
If your data is already a distance matrix (it must be a square matrix!), convert it with as.dist(), which reads the lower triangle:
set.seed(1)
# example matrix
m <- matrix(rnorm(25), nrow=5) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
A 200 x 200 distance matrix gives me a reasonable plot:
set.seed(1)
# example matrix
m <- matrix(rnorm(200*200), nrow=200) # must be square matrix!
distm <- as.dist(m)
hm <- hclust(distm)
plot(hm)
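For the visibility question itself, a minimal sketch under a few assumptions: the labels are shrunk with cex, the plot is written to a wide PDF device, and, if ape is installed, a BioNJ tree is drawn in a fan layout. The data, the output file and the device sizes are illustrative choices, not part of the original question.

```r
set.seed(1)
# Stand-in for the asker's data: a valid 200 x 200 distance object
pts <- matrix(rnorm(200 * 10), nrow = 200)
rownames(pts) <- paste0("s", seq_len(200))
distm <- dist(pts)
hm <- hclust(distm)

# Option 1: wide PDF page with small labels (out_file is a hypothetical path)
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 30, height = 8)
plot(hm, cex = 0.5, hang = -1)
dev.off()

# Option 2: BioNJ tree in a fan layout, only attempted if ape is available
if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::bionj(distm)
  pdf(tempfile(fileext = ".pdf"), width = 12, height = 12)
  ape::plot.phylo(tree, type = "fan", cex = 0.5)
  dev.off()
}
```

A vector device such as pdf() lets you zoom in on individual tips, which often helps more than shrinking the labels further.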

Related

Error in R: Cosine similarity and MDS

I calculate the cosine similarity with cosine() from the package 'lsa'. Here with three test vectors:
library(lsa) # provides cosine()
d <- data.frame(c(-1,1,0,-1,1,1,-1,1,0),c(-1,1,1,1,-1,1,-1,0,1),c(0,0,1,0,-1,-1,0,1,-1))
colnames(d) <- c("vector1","vector2","vector3")
d_dist <- cosine(as.matrix(d))
Now, I want to do dimensionality reduction with cmdscale and after that plot it as a scatterplot:
fit <- cmdscale(d_dist,k=2)
x <- fit[,2]
y <- fit[,1]
plot(x,y)
But I always get the warning "In cmdscale(d_dist, k = 2): only 0 of the first 2 eigenvalues are > 0" [translated from German] and an empty fit object.
What am I doing wrong? Thank you so much for your help!
The input to cmdscale() should be a distance (dissimilarity) matrix, but cosine() returns similarities. Convert them first, e.g.:
d_dist <- 1-d_dist
fit <- cmdscale(d_dist,k=2)
x <- fit[,2]
y <- fit[,1]
plot(x,y)
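The same fix can be checked without the lsa package. The sketch below computes the cosine similarity by hand in base R (an assumption made only to keep the example self-contained), converts it to a dissimilarity with 1 - similarity, and passes that to cmdscale().

```r
d <- data.frame(
  vector1 = c(-1, 1, 0, -1, 1, 1, -1, 1, 0),
  vector2 = c(-1, 1, 1, 1, -1, 1, -1, 0, 1),
  vector3 = c(0, 0, 1, 0, -1, -1, 0, 1, -1)
)
m <- as.matrix(d)

# Cosine similarity between columns: dot products divided by the norms
norms <- sqrt(colSums(m^2))
sim <- crossprod(m) / outer(norms, norms)

# Similarity -> dissimilarity, then classical MDS
d_dist <- as.dist(1 - sim)
fit <- cmdscale(d_dist, k = 2)

plot(fit[, 1], fit[, 2], type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(fit[, 1], fit[, 2], labels = rownames(fit))
```

With only three vectors, two dimensions are the most cmdscale() can return, so k = 2 already reproduces the three pairwise dissimilarities.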

To find intersection of clusters in R

Let's assume I have done several operations and created cluster vectors of correlation values shown below
D <- matrix(rexp(10*10,rate=.1), ncol=10) #create a randomly filled 10x10 matrix
C <- matrix(rexp(10*10,rate=.1),ncol=10)
DCor <- cor(D) # generate correlation matrix
CCor <- cor(C)
DUpper<- DCor[upper.tri(DCor)] # extract upper triangle
CUpper<- CCor[upper.tri(CCor)]
ClusterD <- kmeans(DUpper,3) # cluster correlations
ClusterC <- kmeans(CUpper,3)
ClusterC <- cbind(c(1:45),matrix(ClusterC$cluster)) # add row numbers as column
ClusterD <- cbind(c(1:45),matrix(ClusterD$cluster))
I would like to generate a matrix that shows the intersection of each pair of cluster groups. In such a matrix, for example, 5 rows would belong to both the C1 and the D2 group.
How can I generate a matrix like this?
Before the cbind lines, you could do:
table(ClusterC$cluster, ClusterD$cluster)
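A self-contained sketch of that cross-tabulation follows. The seed and the "C"/"D" label prefixes are assumptions added only for reproducibility and readability; the cluster sizes you get will differ with other data.

```r
set.seed(42) # assumed seed, only to make the sketch reproducible
D <- matrix(rexp(10 * 10, rate = 0.1), ncol = 10)
C <- matrix(rexp(10 * 10, rate = 0.1), ncol = 10)
DCor <- cor(D)
CCor <- cor(C)
DUpper <- DCor[upper.tri(DCor)] # 45 correlations
CUpper <- CCor[upper.tri(CCor)]
ClusterD <- kmeans(DUpper, 3)
ClusterC <- kmeans(CUpper, 3)

# Rows: clusters of C, columns: clusters of D, cells: number of shared rows
overlap <- table(
  C = paste0("C", ClusterC$cluster),
  D = paste0("D", ClusterD$cluster)
)
print(addmargins(overlap))
```

Each cell counts the positions (1 to 45) that fall in the given C cluster and the given D cluster at the same time, so the cells sum to 45.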

How to input dissimilarity matrix in spatial analysis in spdep R

Aim: I want to create a dissimilarity matrix between pairs of coordinates. I want to use this matrix as an input to calculate local spatial clusters using Moran's I (LISA) and later in geographically weighted regression (GWR).
Problem: I know I can use dnearneigh{spdep} to calculate a distance matrix. However, I want to use the travel times between polygons that I have already estimated. In practice, I think this would be like inputting a dissimilarity matrix that gives the distance/difference between polygons based on another characteristic. I've tried passing my matrix to dnearneigh{spdep}, but I get the error Error: ncol(x) == 2 is not TRUE
dist_matrix <- dnearneigh(diss_matrix_invers, d1=0, d2=5, longlat = F, row.names=rn)
Any suggestions? There is a reproducible example below:
EDIT: Digging a bit further, I think I could use mat2listw{spdep} but I'm still not sure it keeps the correspondence between the matrix and the polygons. If I add row.names = T it returns an error row.names wrong length :(
listw_dissi <- mat2listw(diss_matrix_invers)
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy=T, alternative= "two.sided")
Reproducible example
library(UScensus2000tract)
library(spdep)
library(ggplot2)
library(dplyr)
library(reshape2)
library(magrittr)
library(data.table)
library(reshape)
library(rgeos)
library(geosphere)
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame( gCentroid(oregon.tract, byid=TRUE) )
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# create Origin-destination pairs
od_pairs <- expand.grid.df(centroids, centroids) %>% setDT()
colnames(od_pairs) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")
# calculate dissimilarity between each pair.
# For the sake of this example, let's use ellipsoid distances. In my real case I have travel-time estimates
od_pairs[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
matrix(c(long_dest, lat_dest), ncol = 2))]
# This is the format of how my travel-time estimates are organized, it has some missing values which include pairs of origin-destination that are too far (more than 2hours apart)
od_pairs <- od_pairs[, .(origi_id, dest_id, dist)]
od_pairs$dist[3] <- NA
> origi_id dest_id dist
> 1: oregon_0 oregon_0 0.00000
> 2: oregon_1 oregon_0 NA
> 3: oregon_2 oregon_0 39874.63673
> 4: oregon_3 oregon_0 31259.63100
> 5: oregon_4 oregon_0 33047.84249
# Convert to matrix
diss_matrix <- acast(od_pairs, origi_id~dest_id, value.var="dist") %>% as.matrix()
# get an inverse matrix of distances, make sure diagonal=0
diss_matrix_invers <- 1/diss_matrix
diag(diss_matrix_invers) <- 0
Calculate simple distance matrix
# get row names
rn <- sapply(slot(oregon.tract, "polygons"), function(x) slot(x, "ID"))
# get centroids coordinates
coords <- coordinates(oregon.tract)
# get distance matrix
diss_matrix <- dnearneigh(coords, d1=0, d2=5, longlat =T, row.names=rn) # dnearneigh() needs coordinates, not a matrix of distances
class(diss_matrix)
> [1] "nb"
Now how to use my diss_matrix_invers here?
You are right about the use of mat2listw{spdep}. By default the function preserves the row names, which keeps the correspondence between the matrix and the polygons. You can also specify the row.names explicitly, like so:
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
The list that is created will contain the appropriate names for the neighbours along with their distance as weights. You can check this by looking at the neighbours.
listw_dissi$neighbours[[1]][1:5]
And you should be able to use this directly to calculate Moran's I.
dnearneigh{spdep}
There is no way you can use diss_matrix within dnearneigh{spdep}, as this function takes a set of coordinates.
However, if you need to define a set of neighbours given a distance threshold (d1, d2) using your own distance matrix (travel time), I think this function can do the trick:
dis.neigh <- function(x, d1 = 0, d2 = 50) {
  # x must be a symmetric distance matrix; NA entries are treated as non-neighbours
  style <- "M" # for style unknown
  # create empty lists
  neighbours <- list()
  weights <- list()
  # set attributes of neighbours list
  attr(neighbours, "class") <- "nb"
  attr(neighbours, "distances") <- c(d1, d2)
  attr(neighbours, "region.id") <- colnames(x)
  # check each row for neighbours that satisfy the distance threshold
  for (row in seq_len(nrow(x))) {
    # reset per row so entries from the previous row do not carry over
    neighbour <- c()
    weight <- c()
    j <- 1
    for (col in seq_len(ncol(x))) {
      if (!is.na(x[row, col]) && x[row, col] > d1 && x[row, col] < d2) {
        neighbour[j] <- col
        weight[j] <- 1 / x[row, col] # inverse distance (dissimilarity)
        j <- 1 + j
      }
    }
    neighbours[row] <- list(neighbour)
    weights[row] <- list(weight)
  }
  # create neighbour and weight list
  res <- list(style = style, neighbours = neighbours, weights = weights)
  class(res) <- c("listw", "nb")
  attr(res, "region.id") <- attr(neighbours, "region.id")
  attr(res, "call") <- match.call()
  return(res)
}
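And use it like so (pass the original matrix of distances, not the inverted one, because the function inverts it internally):
nb_list <- dis.neigh(diss_matrix, d1=0, d2=10000)
lmoran <- localmoran(oregon.tract@data$white, nb_list, alternative= "two.sided")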
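An alternative that avoids building the listw object by hand is to threshold the travel-time matrix into an inverse-distance weights matrix and let mat2listw() do the conversion. The sketch below uses a toy 4 x 4 matrix with made-up zone names and travel times, and an assumed 30-minute cutoff; the spdep call is only attempted if the package is installed.

```r
# Toy travel times (minutes) between four hypothetical zones; NA = too far apart
ids <- c("A", "B", "C", "D")
tt <- matrix(0, 4, 4, dimnames = list(ids, ids))
tt["A", "B"] <- 10; tt["A", "C"] <- NA; tt["A", "D"] <- 40
tt["B", "C"] <- 15; tt["B", "D"] <- 25; tt["C", "D"] <- 20
tt[lower.tri(tt)] <- t(tt)[lower.tri(tt)] # mirror the upper triangle

max_time <- 30 # assumed threshold, plays the role of d2

# Keep pairs that are observed, distinct, and within the threshold
keep <- !is.na(tt) & tt > 0 & tt <= max_time
w <- matrix(0, 4, 4, dimnames = dimnames(tt))
w[keep] <- 1 / tt[keep] # inverse travel time as weight, 0 elsewhere

# Conversion to a listw object, only if spdep is available
if (requireNamespace("spdep", quietly = TRUE)) {
  lw <- spdep::mat2listw(w, row.names = rownames(w))
}
```

Because the weights matrix keeps the row names, the order of regions in the result follows the matrix, so the matrix rows should be sorted to match the polygon order before calling localmoran().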

Calculating divergence between joint posterior distributions

I wish to calculate the distance between two 3-dimensional posterior distributions. The draws are stored at two 30,000x3 matrices.
So far I have been successful in calculating Total Variation distance between two 2-dimensional posteriors (two 30,000x2 matrices) by splitting the grid into bins. However, I am having trouble calculating the divergence between posteriors with more parameters. Some examples of related distance measures can be found here.
NOTE: I do not wish to calculate the distance between the marginals (column-wise entries), but rather to obtain an overall value after comparing the joint distributions in R.
I would really appreciate it if somebody could point out what I am missing here.
EDIT 1: Some example code for calculating Total variation distance between posterior samples stored in two matrices has been added below:
EDIT 2: This is an R question.
set.seed(123)
comparison.2D <- matrix(rnorm(40000*2,0,1),ncol=2)
ground.truth.2D <- matrix(rnorm(40000*2,0,2),ncol=2)
# Function to calculate TVD between matrices with 2 columns:
Total.Variation.Distance.2D<-function(true,
comparison,
burnin,
window.size){
# Bandwidth for theta.1.
my_bw_x<-window.size
# Bandwidth for theta.2.
my_bw_y<-window.size
range_x<-range(c(true[-c(1:burnin),1],comparison[-c(1:burnin),1]))
range_y<-range(c(true[-c(1:burnin),2],comparison[-c(1:burnin),2]))
xx <- seq(range_x[1],range_x[2],by=my_bw_x)
yy <- seq(range_y[1],range_y[2],by=my_bw_y)
true.pointidxs <- matrix( c( findInterval(true[-c(1:burnin),1], xx),
findInterval(true[-c(1:burnin),2], yy) ), ncol=2)
comparison.pointidxs <- matrix( c( findInterval(comparison[-c(1:burnin),1], xx),
findInterval(comparison[-c(1:burnin),2], yy) ), ncol=2)
# Count the frequencies in the corresponding cells:
square.mat.dims <- max(length(xx), length(yy))
frequencies.true <- frequencies.comparison <- matrix(0, ncol=square.mat.dims, nrow=square.mat.dims)
for (i in 1:dim(true.pointidxs)[1]){
frequencies.true[true.pointidxs[i,1], true.pointidxs[i,2]] <- frequencies.true[true.pointidxs[i,1],
true.pointidxs[i,2]] + 1
frequencies.comparison[comparison.pointidxs[i,1], comparison.pointidxs[i,2]] <- frequencies.comparison[comparison.pointidxs[i,1],
comparison.pointidxs[i,2]] + 1
}# End for
# Normalize frequencies matrix:
frequencies.true <- frequencies.true/dim(true.pointidxs)[1]
frequencies.comparison <- frequencies.comparison/dim(comparison.pointidxs)[1]
TVD <-0.5*sum(abs(frequencies.comparison-frequencies.true))
return(TVD)
}# End function
TVD.2D <- Total.Variation.Distance.2D(true=ground.truth.2D, comparison=comparison.2D,burnin=10000,window.size=0.05)
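The binning idea above extends to any number of columns if each draw is mapped to a cell key instead of a position in a 2-D matrix. The function below is a sketch of that generalisation, not an established estimator: tvd_binned and its arguments are names made up for illustration, and a binned estimate becomes increasingly sensitive to window.size and sample size as the dimension grows.

```r
tvd_binned <- function(true, comparison, burnin = 0, window.size = 0.5) {
  stopifnot(ncol(true) == ncol(comparison))
  if (burnin > 0) {
    true <- true[-seq_len(burnin), , drop = FALSE]
    comparison <- comparison[-seq_len(burnin), , drop = FALSE]
  }
  # Common lower corner of the grid, per dimension
  lo <- apply(rbind(true, comparison), 2, min)
  # Map every draw to the key of the grid cell it falls in
  cell_key <- function(m) {
    idx <- floor(sweep(m, 2, lo) / window.size)
    do.call(paste, c(as.data.frame(idx), sep = "_"))
  }
  key_t <- cell_key(true)
  key_c <- cell_key(comparison)
  cells <- union(key_t, key_c)
  # Relative frequency of each occupied cell in both samples
  p <- table(factor(key_t, levels = cells)) / length(key_t)
  q <- table(factor(key_c, levels = cells)) / length(key_c)
  0.5 * sum(abs(p - q))
}

set.seed(123)
comparison.3D <- matrix(rnorm(40000 * 3, 0, 1), ncol = 3)
ground.truth.3D <- matrix(rnorm(40000 * 3, 0, 2), ncol = 3)
TVD.3D <- tvd_binned(ground.truth.3D, comparison.3D,
                     burnin = 10000, window.size = 0.5)
```

Only occupied cells are stored, so memory grows with the number of draws rather than with the full grid, which is what makes three or more dimensions feasible.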

applying the pvclust R function to a precomputed dist object

I'm using R to perform a hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it into a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?
I've tested the suggestion of Vincent, you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
type="h", lwd=5, las=1,
xlab="Number of dimensions",
ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
# (nb_dimensions is the number of dimensions you chose from the plot above)
x <- cmdscale(d,nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
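The choice of nb_dimensions can also be made numerically instead of by eye. The sketch below counts the eigenvalues that are clearly above zero; the simulated points and the tolerance of 1e-8 times the largest eigenvalue are assumptions for illustration.

```r
set.seed(1)
# Example data: 20 points that truly live in 3 dimensions
pts <- matrix(rnorm(20 * 3), nrow = 20)
d <- dist(pts)

# Eigenvalues of the doubly centred distance matrix
mds <- cmdscale(d, k = 1, eig = TRUE)

# Count eigenvalues clearly above zero (assumed relative tolerance)
tol <- 1e-8 * max(mds$eig)
nb_dimensions <- sum(mds$eig > tol)

# Coordinates whose Euclidean distances reproduce d
x <- cmdscale(d, k = nb_dimensions)
max_error <- max(abs(dist(x) - d))
```

When passing t(x) to pvclust, it is probably worth setting method.dist = "euclidean", since the default correlation distance would not reproduce the original distances.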
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), ncol=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(t(x)) # pvclust() clusters columns, so transpose to cluster the n points
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or you computed it beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example on how to use a custom distance function, although that means you will have to install their "unofficial version".
