ggtree setting height scale - r

I'm doing microsatellite analysis to understand the genetic relationships between fungal isolates. For that I first calculated Jaccard's coefficient and now want to generate a dendrogram using UPGMA cluster analysis.
I did the following:
Distance matrix computation
jacc_coef <- vegdist(HC_df, method = "jaccard") * 100  # vegdist() is from the vegan package
(HC_df is my data frame)
Hierarchical clustering
afu_clin.hc <- hclust(d = jacc_coef, method = "average")
When I plot afu_clin.hc with another package (fviz_dend from factoextra) I obtain a height scale corresponding to the % dissimilarity calculated above.
HC <- fviz_dend(x = afu_clin.hc, cex = 0.7, lwd = 0.7, horiz = TRUE)
print(HC)
I obtain the following plot:
[figure: fviz_dend dendrogram with the height axis in % dissimilarity]
However, when I try to use the ggtree the scale is different. I'm wondering how can I use ggtree to display my dendrogram with the height scale as % dissimilarity calculated with Jaccard’s coefficient.
I used this code:
hc_tree <- ggtree(afu_clin.hc, size = 0.8) +
  geom_tiplab(angle = 90, hjust = 1, offset = -0.05) +
  layout_dendrogram() +
  theme_dendrogram()
and obtained this plot (I don't understand this scale; where does it come from?):
[figure: ggtree dendrogram showing a different, unexplained height scale]
How can I use ggtree to plot a similar dendrogram to the one that I showed first?
Thank you,
Best
Daryna


How autoplot (ggplot) gets scores and loadings from prcomp

I know that there are lots of discussions out there that treat this subject matter... but every time I encounter this, I never find a consistent, satisfying answer.
I'm trying to create a very basic graphical depiction of a principal components analysis model. I always aim to not use packages that automatically generate plots for me because I want the control.
Every time I try to make a PCA plot with loadings, I am stumped by how the canned functions relate the site-specific scores to the model's loading vectors. This is despite the myriad posts out there treating this matter, most of which just use the canned functions without explaining how the numbers get from a basic PCA model to the biplot.
For the example code below, I'll use autoplot. If I make a PCA model and use autoplot, I get a very cool graph. But I want to know how it gets these numbers: the scores get rescaled, and I have no idea how the loading vectors are relativized the way they are on the plot. Can anyone walk me through how to get this relativized data into data frames of my own (both scores and vectors) so I can make the aesthetic changes I want without using autoplot?
d <- iris
m1 <- prcomp(d[,1:4], scale=T)
scores <- data.frame(m1$x[,1:2])
library(ggplot2)
#Scores range from about -2.5 to +3
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point()
#Scores range from about -0.15 to 0.22, no clue where the relativized loadings come from
autoplot(m1, loadings = T)
I'll attempt to walk you through and simplify the steps that autoplot uses to draw a PCA plot, so you can do this yourself quite easily in ggplot.
autoplot is actually an S3 generic function, so it's more accurate to talk about the method ggfortify:::autoplot.prcomp uses, since this is the function that is dispatched when you call autoplot on a prcomp object.
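If you want to see exactly what that method does, you can print its source once ggfortify is loaded:
ggfortify:::autoplot.prcomp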
Let's start with your own example:
library(ggfortify)
library(ggplot2)
d <- iris
m1 <- prcomp(d[, 1:4], scale = TRUE)
scores <- data.frame(m1$x[, 1:2])
The scores are normalized by dividing each column by the square root of its sum of squared deviations from the mean:
scores[] <- lapply(scores, function(x) x / sqrt(sum((x - mean(x))^2)))
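You can sanity-check the rescaling against the range quoted in the question (roughly -0.15 to 0.22 across the two axes):
range(scores$PC1)
range(scores$PC2)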
The loadings are simply obtained from the rotation member of the prcomp object:
loadings <- as.data.frame(m1$rotation)[1:2]
There is some internal scaling to ensure that the loadings appear on the same scale as the adjusted PC scores, but as far as I can tell this is simply for visual effect. The scaling amounts to about 0.2 here, and is calculated as follows:
scale <- min(max(abs(scores$PC1)) / max(abs(loadings$PC1)),
             max(abs(scores$PC2)) / max(abs(loadings$PC2))) * 0.8
scale
#> [1] 0.1987812
We now have enough to recreate the autoplot using vanilla ggplot code.
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = loadings * scale,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               color = "red", arrow = arrow(angle = 25, length = unit(4, "mm")))
Aside from the axis titles, this is identical to the autoplot:
autoplot(m1, loadings = TRUE)
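Because the scores and loadings are now in ordinary data frames, you can restyle the plot however you like. For example, to label the loading arrows with the variable names (a sketch reusing the objects created above; the var column is added here purely for illustration):
loadings$var <- rownames(m1$rotation)
ggplot(scores, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = loadings,
               aes(x = 0, y = 0, xend = PC1 * scale, yend = PC2 * scale),
               color = "red", arrow = arrow(angle = 25, length = unit(4, "mm"))) +
  geom_text(data = loadings,
            aes(x = PC1 * scale, y = PC2 * scale, label = var),
            color = "red", vjust = -0.5)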

How to remove colour scale legend from plot() of spp density in R

I am plotting the density of a two-dimensional, weighted spatial point pattern. I'd like to make the plot without a colour scale legend, and save it with no (or minimal) borders on all sides. My problem is that I can't remove the colour scale legend. Reproducible code below:
## Install libraries:
library(spatstat) #spatial package
library(RColorBrewer) #create custom colour ramps
## Create reproducible data:
data <- data.frame(matrix(ncol = 3, nrow = 50))
x <- c("x", "y", "weight")
colnames(data) <- x
data$x <- runif(50, 0, 20)
data$y <- runif(50, 0, 20)
data$weight <- sample(1:200, 50)
## Set plotting window and colours:
plot.win <- owin(c(0,20), c(0,20)) # plot window as 20x20m
spat.bat.frame <- NULL # create a frame to store values in
cols1<-colorRampPalette(brewer.pal(9,"Blues"))(100) #define colour ramp for density plots
## Create and save plots:
jpeg(filename = "Bad plot.jpeg", res = 300, units = "cm", width = 20, height = 20)
par(mar=c(0,0,0,0),oma=c(0,0,0,0),lwd=1)
ppp_01 <- ppp(x = data$x, y = data$y, window = plot.win)
ppp_02 <- ppp(x = data$x, y = data$y, window = plot.win)
plot(density(ppp_01, weights = data$weights), main=NULL, col=cols1, sigma = 1)
plot(ppp_02, add=TRUE) #add spp points to density plot
dev.off()
I've tried legend=FALSE, auto.key=FALSE, colorkey=FALSE, which don't seem to be compatible with plot() (i.e. they don't give an error but don't change anything). I've also tried some work-arounds like saving a cropped image with dev.off.crop() or by adjusting margins with par(), but haven't been able to completely remove the legend. Does anyone have any suggestions on how to remove a colour scale legend of a density spp (real-valued pixel image) using plot()?
I specifically need to plot the density of the spatial point pattern, to specify a custom colour ramp, and to overlay the spp points onto the density image. I could try plotting with spplot() instead, but I'm not sure this will allow for these three things, and I feel like I'm missing a simple fix with plot(). I can't crop the figures manually after saving from R because there are 800 of them, and I need them all to be exactly the same size and in the exact same position.
Thank you!
Since plot is a generic function, the options available for controlling the plot will depend on the class of object that is being plotted. You want to plot the result of density(ppp_01, weights = data$weights). Let's call this Z:
Z <- density(ppp_01, weights = data$weight, sigma = 1)
Note: the smoothing bandwidth sigma should be given inside the call to density. (Also, the weights column in the reproducible data is named weight, not weights, so it must be referred to as data$weight for the weights to have any effect.)
To find out about Z, you can just print it, or type class(Z).
The result is that Z is an object of class "im" (a pixel image).
So you need to look up the help file for plot.im, the plot method for class "im". Typing ?plot.im shows that there is an argument ribbon that controls whether or not the colour ribbon is displayed. Simply set ribbon=FALSE in the call to plot:
plot(Z, ribbon=FALSE, main="", col=cols1)
Or, in your original one-line style:
plot(density(ppp_01, weights = data$weight, sigma = 1), ribbon = FALSE, main = "", col = cols1)
However, I strongly recommend separating this into two steps, one that creates the image object and one that plots the image, as sketched below. This makes it much easier to spot mistakes like the misplacement of the sigma argument.
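Putting it together with the save-to-file code from the question, a corrected version would look roughly like this (a sketch; the output filename is arbitrary):
jpeg(filename = "Good plot.jpeg", res = 300, units = "cm", width = 20, height = 20)
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0), lwd = 1)
# Step 1: create the image object (sigma and weights belong in density())
Z <- density(ppp_01, weights = data$weight, sigma = 1)
# Step 2: plot the image without the colour ribbon, then overlay the points
plot(Z, ribbon = FALSE, main = "", col = cols1)
plot(ppp_02, add = TRUE)
dev.off()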

How to create grid of kernel density plots in R

I have some samples from a high-dimensional density that I would like to plot. I would like to create a grid of plots in which the bivariate density of each pair of dimensions is plotted where they cross. For example, Bayes and Big Data - The Consensus Monte Carlo Algorithm, Scott et al. (2016), has the following plot:
In this plot, the panels above the diagonal show the distributions on a scale just large enough to fit each plot, while below the diagonal the bivariate densities are plotted on a common scale.
Does anyone know how I can achieve such a plot?
For instance if I have just generated a 5-dimensional Gaussian distribution using:
library(MASS)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
This is relatively easy using facet_matrix() from the ggforce package. You just have to specify which layer goes on which part of the plot (e.g. layer.upper = 1 says that the first layer, geom_density2d(), should go in the upper triangular part of the matrix). geom_autodensity() makes sure that the KDE on the diagonal touches the bottom of each panel.
library(MASS)
library(ggforce)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
df <- as.data.frame(data)
ggplot(df) +
  geom_density2d(aes(x = .panel_x, y = .panel_y)) +
  geom_autodensity() +
  geom_point(aes(x = .panel_x, y = .panel_y)) +
  facet_matrix(vars(V1:V5), layer.upper = 1, layer.diag = 2)
More details about facet_matrix() are posted here.

Fitting a curve in the points

This is my data:
y<-c(1.8, 2, 2.8, 2.9, 2.46, 1.8,0.3,1.1,0.664,0.86,1,1.9)
x<- c(1:12)
data<-as.data.frame(cbind(y,x))
plot(data$y ~ data$x)
I want to fit a curve through these points so that I can generate the intermediate predicted values. I need a curve that goes through the points. I don't care what function it fits.
I consulted this link.
Fitting a curve to specific data
install.packages("rgp")
library(rgp)
result <- symbolicRegression(y ~ x,data=data,functionSet=mathFunctionSet,
stopCondition=makeStepsStopCondition(2000))
# inspect results, they'll be different every time...
(symbreg <- result$population[[which.min(sapply(result$population,
result$fitnessFunction))]])
function (x)
exp(sin(sqrt(x)))
# inspect visual fit
ggplot() + geom_point(data=data, aes(x,y), size = 3) +
geom_line(data=data.frame(symbx=data$x, symby=sapply(data$x, symbreg)),
aes(symbx, symby), colour = "red")
If I repeat this analysis, the function above produces a different curve every time. Does anyone know why this is happening, and whether this is the right way to fit a curve to these points? Also, this function does not pass through each point, so I cannot obtain the intermediate points.
A standard approach is to fit a spline; this gives a nice curve that goes through all the points. See ?spline. Concretely, you would use a call like:
spline(x = myX, y = myY, xout=whereToInterpolate)
or, just calculating 100 points for your example:
ss <- spline(x,y, n=100)
plot(x,y)
lines(ss)
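If you want predicted values only at particular intermediate positions rather than a whole curve, pass them via xout (a small sketch on the data above; the positions are arbitrary):
# Predicted y values half-way between each pair of observed x values
pred <- spline(x, y, xout = seq(1.5, 11.5, by = 1))
pred$y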
Note there is also a smoothing spline (smooth.spline), which may help for noisy data.
If the curve doesn't need to be smooth there is the simpler approx which does linear interpolation.
approx(x = myX, y = myY, xout=whereToInterpolate)
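For comparison, here is how both look on the same data (a quick sketch; note that smooth.spline does not pass exactly through the points, which is the price of smoothing):
plot(x, y)
# Piecewise-linear interpolation through every point
ap <- approx(x, y, xout = seq(1, 12, by = 0.1))
lines(ap, col = "blue")
# Smoothing spline: a smooth fit that tolerates noise in y
sm <- smooth.spline(x, y)
lines(predict(sm, seq(1, 12, by = 0.1)), col = "darkgreen")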

In R, how can I plot a similarity matrix (like a block graph) after clustering data?

I want to produce a graph that shows the correlation between clustered data and a similarity matrix.
How can I do this in R? Is there a function in R that creates a graph like the picture at this link?
http://bp0.blogger.com/_VCI4AaOLs-A/SG5H_jm-f8I/AAAAAAAAAJQ/TeLzUEWbb08/s400/Similarity.gif (just googled and got the link that shows a graph that I want to produce)
Thanks, in advance.
The general solutions suggested in the comments by @Chase and @bill_080 need a little bit of enhancement to (partially) fulfil the needs of the OP.
A reproducible example:
require(MASS)
set.seed(1)
dat <- data.frame(mvrnorm(100, mu = c(2, 6, 3),
                          Sigma = matrix(c(10,   2,   4,
                                            2,   3, 0.5,
                                            4, 0.5,   2), ncol = 3)))
Compute the dissimilarity matrix of the standardised data using Euclidean distances:
dij <- dist(scale(dat, center = TRUE, scale = TRUE))
and then calculate a hierarchical clustering of these data using the group average method
clust <- hclust(dij, method = "average")
Next we compute the ordering of the samples on the basis of forming 3 ('k') groups from the dendrogram, but we could have chosen something else here.
ord <- order(cutree(clust, k = 3))
Next compute the dissimilarities between samples based on dendrogram, the cophenetic distances:
coph <- cophenetic(clust)
Here are three image plots, plus a Shepard-like plot:
1. The original dissimilarity matrix, sorted on the basis of the cluster analysis groupings
2. The cophenetic distances, again sorted as above
3. The difference between the original dissimilarities and the cophenetic distances
4. A Shepard-like plot comparing the original and cophenetic distances; the better the clustering captures the original distances, the closer the points lie to the 1:1 line
Here is the code that produces the above plots
layout(matrix(1:4, ncol = 2))
image(as.matrix(dij)[ord, ord], main = "Original distances")
image(as.matrix(coph)[ord, ord], main = "Cophenetic distances")
image((as.matrix(coph) - as.matrix(dij))[ord, ord],
      main = "Cophenetic - Original")
plot(coph ~ dij, ylab = "Cophenetic distances", xlab = "Original distances",
     main = "Shepard Plot")
abline(0, 1, col = "red")
box()
layout(1)
This produces the four panels described above on the active device.
Having said all that, however, only the Shepard plot shows the "correlation between clustered data and [dis]similarity matrix", and that is not an image plot (levelplot). How would you propose to compute the correlation between two numbers for all pairwise comparisons of cophenetic and original [dis]similarities?
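If a single summary number is wanted, the usual choice is the cophenetic correlation, i.e. the ordinary correlation between the two sets of pairwise distances (a one-line addition to the example above):
# Cophenetic correlation: how faithfully the dendrogram preserves the original distances
cor(as.vector(dij), as.vector(coph))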
