vegan::meandist() has a really nice plot method that creates a dendrogram of the mean dissimilarities. How can I incorporate the output into ggplot to have full control over the aesthetics? Here is some sample code using the dune data. As an example, I'd like to recreate the dendrogram in ggplot and color each Management level by 'Use' (see the factors in dune.env).
# Species and environmental data
require(vegan)
dune <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune)
data(dune.env)
dune_dist <- vegdist(dune, method = "bray", na.rm=T)
dissim <- meandist(dune_dist, grouping = dune.env$Management)
plot(dissim)
From ?vegan:::plot.meandist it is clear that hclust() is used for kind = "dendrogram". To recreate the tree:
zz <- hclust(as.dist(dissim), method = "average") #use desired method, "average" is the default in vegan:::plot.meandist
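One detail that is easy to check on a toy matrix: as.dist() keeps only the lower triangle, so the within-group values on the diagonal of a meandist matrix never reach hclust(). A minimal sketch with made-up numbers (the matrix `m` is hypothetical, not the dune result):

```r
# Toy symmetric matrix with a non-zero diagonal, mimicking meandist() output
# (all values are made up for illustration)
m <- matrix(c(0.4, 0.7, 0.6,
              0.7, 0.5, 0.8,
              0.6, 0.8, 0.3),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

d <- as.dist(m)                       # the diagonal is dropped here
hc <- hclust(d, method = "average")   # same linkage as used above

as.vector(d)   # only the three between-group values remain
hc$height      # merge heights of the resulting tree
```

The tree therefore only carries the between-group heights, which is why the diagonal has to be brought back in separately when the leaves are drawn.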
Now to visualize the tree using ggplot:
library(ggdendro)
Create a data.frame from the tree:
dd <- as.dendrogram(zz)
dd <- dendro_data(dd)
Get the diagonal elements from the dissimilarity matrix, since they represent within-cluster variability (see @Jari Oksanen's comments below):
library(tidyverse) # provides %>%, rownames_to_column(), left_join(), mutate() and ggplot()
data.frame(diag = diag(dissim)) %>%
rownames_to_column("label") -> dissim_diag
dissim_diag
label diag
1 BF 0.4159972
2 HF 0.4418115
3 NM 0.6882438
4 SF 0.5813015
Now the segment data needs to be changed so that the leaves do not end at 0 but at the appropriate distance.
segment(dd)
x y xend yend
1 1.875 0.7412760 1.00 0.7412760
2 1.000 0.7412760 1.00 0.0000000
3 1.875 0.7412760 2.75 0.7412760
4 2.750 0.7412760 2.75 0.5960416
5 2.750 0.5960416 2.00 0.5960416
6 2.000 0.5960416 2.00 0.0000000
7 2.750 0.5960416 3.50 0.5960416
8 3.500 0.5960416 3.50 0.4736637
9 3.500 0.4736637 3.00 0.4736637
10 3.000 0.4736637 3.00 0.0000000
11 3.500 0.4736637 4.00 0.4736637
12 4.000 0.4736637 4.00 0.0000000
In other words, where x is a whole number and yend is 0, we need to change yend to the appropriate distance. The following code accomplishes this in two joins: the first join adds the label(dd) data and the second adds the dissim_diag data to the segment data:
segment_data <- segment(dd) %>%
left_join(
label(dd),
by = c("xend" = "x",
"yend" = "y")) %>%
left_join(dissim_diag, by = "label") %>%
mutate(yend = pmax(yend, diag, na.rm = TRUE)) #use as yend whichever is higher yend or diag, ignoring NA.
segment_data
x y xend yend label diag
1 1.875 0.7412760 1.00 0.7412760 <NA> NA
2 1.000 0.7412760 1.00 0.6882438 NM 0.6882438
3 1.875 0.7412760 2.75 0.7412760 <NA> NA
4 2.750 0.7412760 2.75 0.5960416 <NA> NA
5 2.750 0.5960416 2.00 0.5960416 <NA> NA
6 2.000 0.5960416 2.00 0.5813015 SF 0.5813015
7 2.750 0.5960416 3.50 0.5960416 <NA> NA
8 3.500 0.5960416 3.50 0.4736637 <NA> NA
9 3.500 0.4736637 3.00 0.4736637 <NA> NA
10 3.000 0.4736637 3.00 0.4159972 BF 0.4159972
11 3.500 0.4736637 4.00 0.4736637 <NA> NA
12 4.000 0.4736637 4.00 0.4418115 HF 0.4418115
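The pmax() step is what moves the leaf ends. A minimal base R sketch of that one call, with made-up vectors standing in for the yend and diag columns (the names `diag_val` and `new_yend` are only for this illustration):

```r
# Toy vectors (made up): branch ends from the tree and within-group values
yend     <- c(0.74, 0.00, 0.60, 0.00)
diag_val <- c(NA,   0.69, NA,   0.58)   # NA where the segment is not a leaf

# Element-wise maximum; with na.rm = TRUE an NA on one side is ignored,
# so non-leaf segments keep their original yend
new_yend <- pmax(yend, diag_val, na.rm = TRUE)
new_yend
```

Only the entries that were 0 and have a matching diagonal value change; everything else is left alone.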
A similar manipulation is needed to create appropriate label coordinates:
text_data <- label(dd) %>%
left_join(dissim_diag) %>%
mutate(y = diag,
group = factor(rep(c("one", "two"), 2))) #just some random groups to color by
Now the actual plot:
ggplot(segment_data) +
geom_segment(aes(x = x,
y = y,
xend = xend,
yend = yend)) +
theme_dendro() +
theme(axis.line.y = element_line(),
axis.ticks.y = element_line(),
axis.text.y = element_text()) +
geom_text(aes(x = x,
y = y,
label = label,
color = group),
angle = -90, hjust = 0,
data = text_data)
Kudos to @Jari Oksanen for his comments!
The dendextend package was designed to assist with such tasks (it is based on code from ggdendro, and at this point it is a more flexible and simpler-to-use codebase than ggdendro).
You can see a section on ggplot2 integration in the vignette.
Because of issues with my own computer (old linux mint), I can't install vegan to reproduce your example.
But if I use similar data, here is an example of usage:
dune <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
dissim <- dist(dune) # a hack, just because I can't get vegan to work
zz <- hclust(as.dist(dissim), method = "average") #use desired method, "average" is the default in vegan:::plot.meandist
The code to get a dendrogram:
library(dendextend)
library(ggplot2)
ggplot(as.dendrogram(zz))
Output (again, not the same analysis as above, but you can use your own code to fix it)
This also supports coloring of branches, labels, rotations, and more.
You can go to the vignette here, to learn more.
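As a rough sketch of the label coloring, here is how it could look with the built-in USArrests data and a made-up two-level grouping. Both are stand-ins rather than the dune analysis, and `leaf_groups` and `group_cols` are hypothetical names:

```r
library(dendextend)
library(ggplot2)

# Small example tree from a built-in dataset (not the dune data)
hc <- hclust(dist(USArrests[1:6, ]), method = "average")
dend <- as.dendrogram(hc)

# Hypothetical grouping: alternate two groups across the leaves,
# in the left-to-right order in which the leaves are drawn
leaf_groups <- rep(c("one", "two"), length.out = length(labels(dend)))
group_cols <- c(one = "firebrick", two = "steelblue")

# Store one color per leaf label on the dendrogram itself
dend <- set(dend, "labels_col", unname(group_cols[leaf_groups]))

# Convert to dendextend's ggplot-ready structure and draw it
p <- ggplot(as.ggdend(dend))
p
```

With real data, the grouping would be looked up from the labels (for example by matching labels(dend) against a column of the environmental table) rather than alternated.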
I want to achieve the same end goal as this question: Create a single heatmap based on two symmetric matrices in R, but take it further than the answer currently provided.
The answer given does not explain how to use different colours and different scales for the upper and lower sections of the matrix.
Here is the example dataset:
library(Matrix)
set.seed(123)
s1<-forceSymmetric(matrix(round(rnorm(25),2),5))
colnames(s1)<-LETTERS[1:5]
rownames(s1)<-LETTERS[6:10]
diag(s1)<-1
s2<-forceSymmetric(matrix(round(rbinom(25,25,0.3),2),5))
colnames(s2)<-LETTERS[1:5]
rownames(s2)<-LETTERS[6:10]
diag(s2)<-1
s1
# 5 x 5 Matrix of class "dsyMatrix"
# A B C D E
# F 1.00 1.72 1.22 1.79 -1.07
# G 1.72 1.00 0.36 0.50 -0.22
# H 1.22 0.36 1.00 -1.97 -1.03
# I 1.79 0.50 -1.97 1.00 -0.73
# J -1.07 -0.22 -1.03 -0.73 1.00
s2
# 5 x 5 Matrix of class "dsyMatrix"
# A B C D E
# F 1 6 8 7 9
# G 6 1 5 9 8
# H 8 5 1 10 9
# I 7 9 10 1 1
# J 9 8 9 1 1
The suggested answer adds the two matrices together like so:
#Get upper diagonal
reverse = s1[,ncol(s1):1]
diag(reverse) = 0
reverse[lower.tri(reverse, diag = FALSE)] <- 0
upper = reverse[,ncol(reverse):1]
# Get lower diagonal
reverse1 = s2[,ncol(s2):1]
diag(reverse1) = 0
reverse1[upper.tri(reverse1, diag = FALSE)] <- 0
upper1 = reverse1[,ncol(reverse1):1]
# Add them together
merged = as.matrix(upper+upper1)
merged
A B C D E
F 1.00 1.72 1.22 1.79 0
G 1.72 1.00 0.36 0.00 8
H 1.22 0.36 0.00 10.00 9
I 1.79 0.00 10.00 1.00 1
J 0.00 8.00 9.00 1.00 1
It then suggests using heatmap(merged) - however, how would you go about having different colours and scales for the upper and lower part of the matrix?
I am happy with using any package including ggplot2 to make this work.
Thanks in advance!
You can subset the relevant parts of the matrix in the data argument of a layer, and use {ggnewscale} to assign different fill scales to different layers. The trick is to declare a fill scale before adding new_scale_fill(), otherwise the order of operations goes wrong (which usually doesn't matter a lot, but here it does).
You can then tweak every individual scale. In the example below I just tweaked the palettes, but you can also adjust limits, breaks, labels etc.
# Assuming code from question has been executed and we have a 'merged' in memory
library(ggplot2)
library(ggnewscale)
# Wide matrix to long dataframe
# Later, we'll be relying on the notion that the dimnames have been
# converted to factor variables to separate out the upper from the lower
# matrix.
df <- reshape2::melt(merged)
ggplot(df, aes(Var1, Var2)) +
# The first layer, with its own fill scale
geom_raster(
data = ~ subset(.x, as.numeric(Var1) > as.numeric(Var2)),
aes(fill = value)
) +
scale_fill_distiller(palette = "Blues") +
# Declare new fill scale for the second layer
new_scale_fill() +
geom_raster(
data = ~ subset(.x, as.numeric(Var1) < as.numeric(Var2)),
aes(fill = value)
) +
scale_fill_distiller(palette = "Reds") +
# I'm not sure what to do with the diagonal. Make it grey?
new_scale_fill() +
geom_raster(
data = ~ subset(.x, as.numeric(Var1) == as.numeric(Var2)),
aes(fill = value)
) +
scale_fill_distiller(palette = "Greys", guide = "none")
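The subset() conditions rely on the factor codes of Var1 and Var2 behaving like row and column indices. A small base R check of that idea, using a made-up 3 x 3 matrix and as.table() in place of reshape2::melt() (so the value column is called Freq here, not value):

```r
# Toy matrix (made up) with the same kind of dimnames as 'merged'
m <- matrix(1:9, nrow = 3,
            dimnames = list(c("F", "G", "H"), c("A", "B", "C")))

# Wide to long; Var1 and Var2 become factors in dimnames order
long <- as.data.frame(as.table(m))

# Comparing factor codes separates the two triangles
lower <- subset(long, as.numeric(Var1) > as.numeric(Var2))
upper <- subset(long, as.numeric(Var1) < as.numeric(Var2))

lower$Freq          # values picked by the first condition
m[lower.tri(m)]     # the same cells selected directly from the matrix
```

If the factor levels were reordered after melting, the codes would no longer line up with matrix positions, so it is worth keeping the levels in their original order.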
In my opinion @teunbrand's answer is what you're looking for, but another potential option is to use the ComplexHeatmap package, e.g. based on one of the examples in the docs:
library(Matrix)
set.seed(123)
s1<-forceSymmetric(matrix(round(rnorm(25),2),5))
colnames(s1)<-LETTERS[1:5]
rownames(s1)<-LETTERS[6:10]
diag(s1)<-1
s2<-forceSymmetric(matrix(round(rbinom(25,25,0.3),2),5))
colnames(s2)<-LETTERS[1:5]
rownames(s2)<-LETTERS[6:10]
diag(s2)<-1
#Get upper diagonal
reverse = s1[,ncol(s1):1]
diag(reverse) = 0
reverse[lower.tri(reverse, diag = FALSE)] <- 0
upper = reverse[,ncol(reverse):1]
# Get lower diagonal
reverse1 = s2[,ncol(s2):1]
diag(reverse1) = 0
reverse1[upper.tri(reverse1, diag = FALSE)] <- 0
upper1 = reverse1[,ncol(reverse1):1]
# Add them together
m = as.matrix(upper+upper1)
m
#> A B C D E
#> F 1.00 1.72 1.22 1.79 0
#> G 1.72 1.00 0.36 0.00 8
#> H 1.22 0.36 0.00 10.00 9
#> I 1.79 0.00 10.00 1.00 1
#> J 0.00 8.00 9.00 1.00 1
library(ComplexHeatmap)
#> Loading required package: grid
#> ========================================
#> ComplexHeatmap version 2.8.0
#> Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
#> Github page: https://github.com/jokergoo/ComplexHeatmap
#> Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
#>
#> If you use it in published research, please cite:
#> Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
#> genomic data. Bioinformatics 2016.
#>
#> The new InteractiveComplexHeatmap package can directly export static
#> complex heatmaps into an interactive Shiny app with zero effort. Have a try!
#>
#> This message can be suppressed by:
#> suppressPackageStartupMessages(library(ComplexHeatmap))
#> ========================================
library(circlize)
#> ========================================
#> circlize version 0.4.13
#> CRAN page: https://cran.r-project.org/package=circlize
#> Github page: https://github.com/jokergoo/circlize
#> Documentation: https://jokergoo.github.io/circlize_book/book/
#>
#> If you use it in published research, please cite:
#> Gu, Z. circlize implements and enhances circular visualization
#> in R. Bioinformatics 2014.
#>
#> This message can be suppressed by:
#> suppressPackageStartupMessages(library(circlize))
#> ========================================
col1 = colorRamp2(c(-1, 10), c("white", "red"))
col2 = colorRamp2(c(-1, 10), c("white", "blue3"))
# here reordering the symmetric matrix is necessary
od = hclust(dist(m))$order
m = m[od, od]
ht = Heatmap(m, rect_gp = gpar(type = "none"), show_heatmap_legend = FALSE,
cluster_rows = FALSE, cluster_columns = FALSE,
layer_fun = function(j, i, x, y, w, h, fill) {
l = i > j
grid.rect(x[l], y[l], w[l], h[l],
gp = gpar(fill = col1(pindex(m, i[l], j[l])), col = NA))
l = i < j
grid.rect(x[l], y[l], w[l], h[l],
gp = gpar(fill = col2(pindex(m, i[l], j[l])), col = NA))
})
draw(ht, heatmap_legend_list = list(
Legend(title = "Group_A", col_fun = col1),
Legend(title = "Group_B", col_fun = col2)
))
Created on 2022-03-07 by the reprex package (v2.0.1)
I am a little confused about row scaling in pheatmap. This is my data frame
gene s1 s2 s3
1 -3.83 -8.17 -8.59
2 0.33 -4.51 -7.27
3 0.15 -5.26 -6.2
4 -0.08 -6.13 -5.95
5 -1.15 -4.82 -5.75
6 -0.99 -4.11 -4.85
7 0.42 -4.18 -4.54
8 -0.32 -3.43 -4.4
9 -0.72 -3.37 -4.39
I need to extract the row z-score values that pheatmap computes from this data frame when it draws the heatmap with scale = "row":
library(pheatmap)
my_colors <- c(min(d),seq(-4,4,by=0.01),max(d))
my_palette <- c("green",colorRampPalette(colors = c("green", "red"))
(n = length(my_colors)-2), "red")
pheatmap(as.matrix(d),
scale = "row",
cluster_cols=FALSE,
cluster_rows = FALSE,
treeheight_row=0,
show_rownames=FALSE,
main = "test.txt",
color = my_palette,
breaks = my_colors)
How can I get the new matrix which pheatmap uses to make the heatmap?
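For reference, a minimal base R sketch of the row scaling, assuming scale = "row" simply centers each row on its mean and divides by its standard deviation, as the pheatmap documentation describes. It uses the first three genes from the table above; `row_z` is a name chosen for this illustration:

```r
# First three genes from the table above
d <- data.frame(s1 = c(-3.83, 0.33, 0.15),
                s2 = c(-8.17, -4.51, -5.26),
                s3 = c(-8.59, -7.27, -6.20))
m <- as.matrix(d)

# scale() standardises columns, so transpose, scale, and transpose back
row_z <- t(scale(t(m)))
row_z

# Each row should now have mean 0 and standard deviation 1
rowMeans(row_z)
apply(row_z, 1, sd)
```

Note that this only reproduces the scaling. The breaks passed to pheatmap still decide how those values are mapped to colors.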
I am making a function which receives three vectors, interpolates them using akima and plots them using plot_ly(). Although the general code works, I am encountering issues with scaling of the z-matrix that interp() outputs.
Let me give you an example:
x is a non-NA numeric containing some duplicate values.
y is a non-NA numeric containing some duplicate values.
z is a non-NA continuous vector
Some summary statistics:
> unique(x)
[1] 60 48 36 32 18 24 30 15 12 28 21 19 54 20 16 27 10 39 14 17 9 6 50 8 13
> range(x)
[1] 6 60
> unique(y)
[1] 10.00 10.50 13.50 12.50 14.00 12.00 11.00 9.00 11.50 9.25 13.00 10.25 6.50 6.75 8.25 9.50
[17] 8.00 8.85 9.75 7.90 7.00 8.60 8.75 7.50 8.90 8.50 7.49 7.40 5.50 7.60 7.25 8.35
[33] 6.00 5.00 7.75 7.35 6.30 4.50 5.75 8.40 5.60 5.90 7.74 9.90 6.20 5.80
> range(y)
[1] 4.5 14.0
> head(z)
[1] 2.877272 3.267328 3.175478 3.843326 4.809792 2.827825
> range(z)
[1] 2.316529 28.147808
I implement the baseline function below:
labs = list(x = 'x', y = 'y', z = 'z')
mat = interp(x, y, z, duplicate = 'mean', extrap = T, xo = sort(unique(x)))
plot_ly(x = mat$x, y = mat$y, z = mat$z, type = 'surface') %>%
layout(title = title,
scene = list(xaxis = list(title = labs$x),
yaxis = list(title = labs$y),
zaxis = list(title = labs$z)))
When I run this, the output is the following:
The issue is that a portion of the data is not covered in this picture. For instance, there is a sizeable data portion around x > 50, y < 11 that is omitted by the interpolation (and hence not plotted).
> length(x[x > 50])
[1] 304
> length(y[x > 50 & y < 11])
[1] 290
> length(z[x > 50 & y < 11])
[1] 290
I suspected that this has to do with the duplicate x values. Hence, I configured the xo argument in interp() such that:
mat = interp(x, y, z, duplicate = 'mean', xo = sort(unique(x)), decreasing = T)
In which case the previously omitted region is partially plotted. It looks like the following:
Nonetheless, the x and y axes still do not correspond to their respective data ranges (despite data availability). Bottom line: How do I tweak the function such that the surface always extends the full range of x and y?
Best
It turns out that the error arose from plot_ly(). Apparently, the z-matrix cannot be passed straight through from interp() to plot_ly(), because the axes are then mapped incorrectly in the graph. Hence, the interpolated z-matrix needs to be transformed.
If you use these two functions in combination, make sure to transform z as shown below:
library(akima)
library(plotly)
mat = interp(x, y, z, duplicate = 'mean')
x = mat$x
y = mat$y
# interp() returns z with one row per x value; the surface trace expects one row per y value
z = matrix(mat$z, nrow = length(mat$y), byrow = TRUE)
# name the arguments: the first positional argument of plot_ly() is 'data', not x
plot_ly(x = x, y = y, z = z, type = 'surface')
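The matrix(..., byrow = TRUE) call amounts to a transpose. A small base R sketch without akima or plotly, where outer() builds a stand-in for the interp() result (rows follow x, columns follow y; `xo`, `yo` and `z_grid` are hypothetical names):

```r
# Stand-in grid: 3 x values and 4 y values
xo <- 1:3
yo <- 1:4

# z_grid[i, j] belongs to xo[i] and yo[j], the layout interp() returns
z_grid <- outer(xo, yo, function(x, y) 10 * x + y)

# The reshaping used above: refill row by row with one row per y value
z_by_y <- matrix(z_grid, nrow = length(yo), byrow = TRUE)

dim(z_grid)                    # 3 rows (x) by 4 columns (y)
dim(z_by_y)                    # 4 rows (y) by 3 columns (x)
identical(z_by_y, t(z_grid))   # the reshaping is the same as t()
```

So t(mat$z) should be an equivalent and shorter way to write the same step.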
Here is the code:
barplot(colMeans(sample_data, na.rm = TRUE),
las = 1,
main = "Main Title",
xlab = "Variable",
ylab = "How Characteristic",
col = rainbow(20),
cex.names = 0.9,
horiz = FALSE)
A sample data set is available here:
https://github.com/akaEmma/public_data/blob/master/sample_data.csv
Or you can type some of it in yourself. These are the variable names:
Love of Chocolate,Asian Knowledge,Stable Cleanliness,Love of People,Attention,Ethics,Aggression,Swimming,Style Points,Felinity
And here are some of the data that go with the names:
8.67 9 6.25 7.33 6.33 5 6.67 5 5.25
8 3 6 6.67 8 7 7.67 4.5 5.25
7.33 7.5 5.75 8.67 8.67 8 5.33 2.5 3
8 6.5 6 6.33 8.33 5.33 5.67 6 6.5
6 5.5 5.25 5.33 5 4.67 4 4 3.5
7.67 7 6 4.67 7.33 5.67 7.67 5 3.75
8.67 8 7.5 5.67 7.33 5 8.33 7 7.75
If I use the code above I get the following (ignore the periods; they aren't important):
If I create a larger plot (like fill my screen with it) I get this:
(ignore the missing label; I accidentally left it off and it's supposed to be "Felinity," whatever that is)
This sort of bar chart is for a PowerPoint on a huge screen, so I can go very small with the labels.
Here is what I want: I want clean pretty labels, one per bar, and since this is a wish list, I want the labels to adjust their own sizes so that they are small enough to fit one per bar, and I want them to be at the right vertical point so that they do not overlap with the bars. Any ideas?
Go crazy. I want beautiful bar charts and I have to make a lot of them, so twiddling for each one is simply not an option. This has to work every time with data files of this type regardless of the length of the variable names.
Thanks!
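Please note that theme_set and theme_tufte only work with ggplot2 plots (theme_set is a ggplot2 function, theme_tufte comes from ggthemes), so they have no effect on base barplot().
Using ggplot2 you can do something like this: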
df <- read.csv("https://raw.githubusercontent.com/akaEmma/public_data/master/sample_data.csv")
library(tidyverse)
library(ggthemes)
df %>%
gather(key, value) %>%
group_by(key) %>%
summarise(mean.value = mean(value, na.rm = T)) %>%
mutate(key = factor(key, levels = key[rev(order(mean.value))])) %>%
ggplot(aes(key, mean.value, fill = as.numeric(key))) +
geom_col() +
theme_tufte() +
scale_fill_gradientn(colours = rainbow(5), guide = "none") +
theme(axis.text.x = element_text(size = 6)) +
labs(x = "", y = "How characteristic")
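To help long names fit under their bars, one option is to wrap them onto several lines. A minimal base R sketch; the helper `wrap_label` and the width of 10 characters are assumptions made for this illustration, not part of the code above:

```r
# A few of the variable names from the question
labels <- c("Love of Chocolate", "Asian Knowledge", "Stable Cleanliness")

# Hypothetical helper: break a label at word boundaries and
# rejoin the pieces with newlines
wrap_label <- function(s, width = 10) {
  paste(strwrap(s, width = width), collapse = "\n")
}

wrapped <- vapply(labels, wrap_label, character(1), USE.NAMES = FALSE)
cat(wrapped, sep = "\n---\n")
```

In the ggplot2 code above, a vectorised version of this function could be supplied to scale_x_discrete(labels = ...), after replacing the periods that read.csv() puts into the column names with spaces.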
I am trying to write code that will compute the autocorrelation for multiple subsets. For example, I have health data for multiple countries over time, and I want to get each country's autocorrelation for each variable. Any help would be great!
Here are some things I have tried, unsuccessfully:
require(plyr)
POP_ACF=acf(PhD_data_list_view$POP, lag.max=NULL, type=c("correlation"),
plot=TRUE, na.action=na.pass, demean=TRUE)
dlply(PhD_data_list_view, .(Country), function(x) POP_ACF %+% x)
POP_ACF=function(PhD_data_list_view$POP) c(acf(PhD_data_list_view$POP, plot=TRUE)$acf)
acf is a function that takes a vector and returns a list. That makes it a natural fit for the purrr package, which maps functions over lists, but it can also be done using base R.
I'll use the beaver1 dataset from the datasets package since you didn't provide yours. I'll use different days of observations as the analogue to your different countries, and temperature for your POP variable.
Base R:
split turns the vector beaver1$temp into a list of vectors along the second argument, beaver1$day.
Then mapply runs the function acf on each element of that list.
Since we're using mapply instead of lapply, we can also provide another list of arguments, here the titles for each plot, main = unique(beaver1$day).
The last argument, SIMPLIFY = F, tells it to return a plain list rather than attempting to coerce the result into a vector or matrix.
par(mfrow = c(1,2))
mapply(acf,
split(beaver1$temp, beaver1$day),
main = unique(beaver1$day),
SIMPLIFY = F)
# $`346`
#
# Autocorrelations of series ‘dots[[1L]][[1L]]’, by lag
#
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 1.000 0.838 0.698 0.593 0.468 0.355 0.265 0.167 0.113 0.069 0.028 0.037 0.087 0.108 0.145 0.177 0.151 0.125 0.123 0.106
# $`347`
#
# Autocorrelations of series ‘dots[[1L]][[2L]]’, by lag
#
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 1.000 0.546 0.335 0.130 0.080 0.024 -0.025 -0.103 -0.090 -0.032 0.168 0.036 -0.089 -0.306
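If only the numbers are needed, the same split can feed sapply() with plotting switched off. A small sketch that pulls out just the lag-1 autocorrelation per day (the name `lag1` is chosen for this illustration):

```r
# One vector of temperatures per day, as in the mapply() call above
temps_by_day <- split(beaver1$temp, beaver1$day)

# acf()$acf holds the autocorrelations starting at lag 0,
# so the second element is lag 1
lag1 <- sapply(temps_by_day, function(v) acf(v, plot = FALSE)$acf[2])
lag1
```

The two values should agree with the lag-1 entries in the printed output above.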
purrr and the tidy way:
This way is a bit more flexible depending what you want to do with the output. We can use purrr::map as a direct drop-in for mapply:
library(purrr)
beaver1 %>%
split(.$day) %>%
map(~acf(.$temp, main = unique(.$day)))
Which returns the exact same output. But we can also go fully tidy and return the data from acf as a dataframe so that we can explore it further with ggplot2.
The first map is returning a list of outputs, each of which is a list containing, among other things, variables lag, acf, and n.used.
The map_dfr is running the function data.frame, assigning each of those variables to a new column.
We also make a column to calculate the CIs. Refer to: How is the confidence interval calculated for the ACF function?
Then we can use ggplot to make any kind of plot we want, and we still have the data for any other analysis you want to do.
library(ggplot2)
beaver_acf <-
beaver1 %>%
split(.$day) %>%
map(~acf(.$temp, plot = F)) %>%
map_dfr(
~data.frame(lag = .$lag,
acf = .$acf,
ci = qnorm(0.975)/sqrt(.$n.used)
), .id = "day")
head(beaver_acf)
# day lag acf ci
# 1 346 0 1.0000000 0.2054601
# 2 346 1 0.8378889 0.2054601
# 3 346 2 0.6983476 0.2054601
# 4 346 3 0.5928198 0.2054601
# 5 346 4 0.4680912 0.2054601
# 6 346 5 0.3554939 0.2054601
ggplot(beaver_acf, aes(lag, acf)) +
geom_segment(aes(xend = lag, yend = 0)) +
geom_hline(aes(yintercept = ci), linetype = "dashed", color = "blue") +
geom_hline(aes(yintercept = -ci), linetype = "dashed", color = "blue") +
facet_wrap(~day)
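The ci column is the usual approximate 95% bound for a white-noise series, qnorm(0.975) / sqrt(n), so it depends only on the number of observations in each group. A short sketch that recomputes it directly from the group sizes, assuming no missing values so that n.used equals the row count per day:

```r
# Number of observations per day
n_used <- table(beaver1$day)

# Approximate 95% bound under the white-noise assumption
ci <- qnorm(0.975) / sqrt(as.numeric(n_used))
names(ci) <- names(n_used)
ci
```

The value for day 346 should match the ci column shown by head(beaver_acf), and the smaller group gets a wider band.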