Scaled plotting of multiple pairwise Venn diagrams in R

I want to plot >50 Venn/Euler diagrams of two sets each to scale.
Not only should the overlap of the two sets and the set sizes themselves be drawn to scale, but also the sizes of the individual diagrams relative to each other.
Since I know of no R package that allows the plotting of >50 pairwise Venn diagrams at the same time, I was planning to plot them first individually (e.g., using eulerr) and then put all of them together using the gridExtra package or something similar.
However, done this way, the sizes of the individual pairwise diagrams are not comparable:
require(gridExtra)
require(eulerr)
fit1 <- euler(c(A=300, B=500, "A&B"=100))
fit2 <- euler(c(A=40, B=70, "A&B"=30))
grid.arrange(plot(fit1), plot(fit2), nrow=1)
Does anyone know of an R package or a combination of packages that would allow size-appropriate plotting of several pairwise Venn diagrams?

You could try using the widths argument of grid.arrange. You would have to determine the ratio of the Venn diagrams' totals. In your example, the ratio of totals is 800:110, or about 7.27, so if you do grid.arrange(plot(fit1), plot(fit2), ncol = 2, widths = c(7.27, 1)) then fit2 will be drawn much smaller than fit1. The ggarrange() function from ggpubr should also work.
fit1 <- euler(c(A=300, B=500, "A&B"=100))
fit2 <- euler(c(A=40, B=70, "A&B"=30))
tot1 <- 800
tot2 <- 110
ratio_v <- tot1/tot2
grid.arrange(plot(fit1), plot(fit2), ncol = 2, widths = c(ratio_v, 1))
ggpubr::ggarrange(plotlist = list(plot(fit1), plot(fit2)), ncol = 2, widths = c(ratio_v, 1))
Edit: you want each pairwise set to have its own size ratio, rather than everything being scaled relative to a global maximum. Below is a simple example, but you can write a function to do this automatically for each pair (see the sketch after the code). Basically, set the maximum number of columns (I chose 100), convert each ratio to be out of 100, make a row for each Venn diagram pair, then rbind the rows into a matrix and use the layout_matrix argument.
### Make fits
fit1 <- euler(c(A=300, B=500, "A&B"=100))
fit2 <- euler(c(A=40, B=70, "A&B"=30))
fit3 <- euler(c(C=100, D=300, "C&D"=50))
fit4 <- euler(c(C=50, D=80, "C&D"=30))
### Assign totals
tot1 <- 800
tot2 <- 110
tot3 <- 400
tot4 <- 130
### Find ratios
ratioAB_v <- round(tot1/tot2)
ratioCD_v <- round(tot3/tot4)
### Convert ratios
smallAB_v <- round(1/ratioAB_v*100)
smallCD_v <- round(1/ratioCD_v*100)
### Make rows
row1_v <- c(rep(1, (100-smallAB_v)), rep(2, smallAB_v))
row2_v <- c(rep(3, (100-smallCD_v)), rep(4, smallCD_v))
### Make matrix
mat <- rbind(row1_v, row2_v)
### Plot
grid.arrange(plot(fit1), plot(fit2), plot(fit3), plot(fit4), layout_matrix = mat)
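The code above handles two pairs explicitly; as mentioned, you can wrap the ratio-to-row step in a helper function. A minimal sketch (the function name and arguments are mine, not from the answer), assuming the totals of each pair are known:
### Hypothetical helper: build one layout row per diagram pair
makeLayoutRow <- function(totBig, totSmall, ids, nCols = 100) {
  small <- round(totSmall / totBig * nCols)          # columns for the smaller plot
  c(rep(ids[1], nCols - small), rep(ids[2], small))  # fill the row with the plot ids
}
mat <- rbind(makeLayoutRow(tot1, tot2, ids = c(1, 2)),
             makeLayoutRow(tot3, tot4, ids = c(3, 4)))
The resulting mat can then be passed to layout_matrix exactly as above; with >50 pairs you would build the rows in a loop or with lapply() and combine them with do.call(rbind, ...).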

Related

Understanding Sample Sizes in qqplots with R

I'm trying to plot QQ graphs with the MASS Boston data set and to compare how the plots change with an increasing number of random data points. I'm looking at the R documentation for qqnorm(), but it doesn't seem to let me set an n value as a parameter. I'd like to plot the QQ plots of "random" samples of size 10, 100, and 1000 of the same variable, all from a normal distribution, in a 3x1 layout.
For example, if I wanted to look at the QQ plot for Boston crime, how would I get
qqnorm(Boston$crim) #find how to set n = 10
qqnorm(Boston$crim) #find how to set n = 100
qqnorm(Boston$crim) #find how to set n = 1000
Also if someone could elaborate when to use qqplot() vs qqnorm(), I'd appreciate it.
I'm inclined to believe that I should use qqplot() as such, as it does seem to give me the output I want, but I want to make sure that using rnorm(n) and then using that variable as a second argument is okay to do:
x <- rnorm(10)
y <- rnorm(100)
z <- rnorm(1000)
par(mfrow = c(1,3))
qqplot(Boston$crim, x)
qqplot(Boston$crim, y)
qqplot(Boston$crim, z)
The question is not entirely clear, but to plot samples of a vector, define a vector N of sample sizes and loop over it. The lapply loop samples from the vector, plots with the Q-Q line, and returns each qqnorm plot.
data(Boston, package = "MASS")
set.seed(2021)
N <- c(10, 100, nrow(Boston))
qq_list <- lapply(N, function(n){
subtitle <- paste("Sample size:", n)
i <- sample(nrow(Boston), n, replace = FALSE)
qq <- qqnorm(Boston$crim[i], sub = subtitle)
qqline(Boston$crim[i])
qq
})
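If you want the three plots side by side as in the question, the same loop can be run inside a 1 x 3 layout. This is a small variation on the code above, not part of the original answer:
op <- par(mfrow = c(1, 3))   # 1 x 3 layout for the three Q-Q plots
invisible(lapply(N, function(n){
  i <- sample(nrow(Boston), n, replace = FALSE)
  qqnorm(Boston$crim[i], sub = paste("Sample size:", n))
  qqline(Boston$crim[i])
}))
par(op)                      # restore the previous layout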

R pheatmap scale different than scale before pheatmap

The heatmap when scaling before plotting:
mat_scaled <- scale(t(mat))
pheatmap(t(mat_scaled), show_rownames=F, show_colnames=F,
border_color=F, color=colorRampPalette(brewer.pal(6,name="PuOr"))(12))
with the colour scale going from [-2, 6], is completely different from the one produced when using the scaling within the pheatmap function
pheatmap(t(mat_scaled), scale="row", show_rownames=F,
show_colnames=F, border_color=F, color=colorRampPalette(brewer.pal(6,name="PuOr"))(12))
where the scale is set from [-6,6].
Why is there this difference, and how could I obtain the matrix represented in the second figure?
In the second figure you plot the heatmap of the already-scaled matrix mat_scaled, scaled a second time by the scale="row" option of pheatmap.
This is not the right way to compare external and internal scaling.
Here is the solution:
library(gridExtra)
library(pheatmap)
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(6,name="PuOr"))(12)
brks <- seq(-3,3,length.out=12)
data(attitude)
mat <- as.matrix(attitude)
# Scale by row
mat_scaled <- t(scale(t(mat)))
p1 <- pheatmap(mat_scaled, show_rownames=F, show_colnames=F,
breaks=brks, border_color=F, color=cols)
p2 <- pheatmap(mat, scale="row", show_rownames=F, show_colnames=F,
breaks=brks, border_color=F, color=cols)
grid.arrange(grobs=list(p1$gtable, p2$gtable))
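As a sanity check (my own addition, assuming pheatmap's scale="row" standardises rows the same way scale() does), you can verify that the manual row scaling is just (x - row mean) / row sd:
# Row scaling by hand: subtract each row's mean and divide by its sd
manual <- (mat - rowMeans(mat)) / apply(mat, 1, sd)
all.equal(as.vector(manual), as.vector(t(scale(t(mat)))))  # should be TRUE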

How to add labels to original data given clustering result using hclust

Say I have some unlabeled data that I know should be clustered into six categories, for example this dataset:
library(tidyverse)
ts <- read_table(url("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data"), col_names = FALSE)
If I create an hclust object with a sample of 60 from the original dataset like so:
n <- 10
s <- sample(1:100, n)
idx <- c(s, 100+s, 200+s, 300+s, 400+s, 500+s)
ts.samp <- ts[idx,]
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
# compute DTW distances
library(dtw)#Dynamic Time Warping (DTW)
distMatrix <- dist(ts.samp, method= 'DTW')
# hierarchical clustering
hc <- hclust(distMatrix, method='average')
I know that I can then add the labels to the dendrogram for viewing like this:
observedLabels <- c(rep(1,n), rep(2,n), rep(3,n), rep(4,n), rep(5,n), rep(6,n))
plot(hc, labels=observedLabels, main="")
However, I would like to add the cluster labels to the initial data frame that was clustered. So for ts.samp I would like to add an extra column with the cluster that each observation has been assigned to.
It would seem that ts.samp$cluster <- hc$labels should add the cluster to the data frame; however, hc$labels returns NULL.
Can anyone help with extracting this information?
You need to define a level at which to cut your dendrogram; this is what forms the groups.
Use:
labels <- cutree(hc, k = 3) # choose the k that is most appropriate; see how to read a dendrogram
ts.samp$grouping <- labels
Let's look at the dendrogram in order to find the best number for k:
plot(hc, main="")
abline(h=500, col = "red") # cut at height 500 forms 2 groups
abline(h=300, col = "blue") # cut at height 300 forms 3/4 groups
It looks like either 2 or 3 might be good. Find the largest jump in the vertical lines (the Height axis), draw a horizontal line at that height, and count the clusters formed.
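Since the question says the data should fall into six categories, a minimal sketch (using hc, ts.samp and observedLabels from above) is to cut at k = 6, attach the result, and cross-tabulate it against the known labels:
clusters <- cutree(hc, k = 6)                         # cut the dendrogram into 6 groups
ts.samp$cluster <- clusters                           # add the group as an extra column
table(observed = observedLabels, cluster = clusters)  # compare with the known labels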

Plotting quantile regression by variables in a single page

I am running quantile regressions for several independent variables separately (same dependent). I want to plot only the slope estimates over several quantiles of each variable in a single plot.
Here's a toy data:
set.seed(1988)
y <- rnorm(50, 5, 3)
x1 <- rnorm(50, 3, 1)
x2 <- rnorm(50, 1, 0.5)
# Running Quantile Regression
require(quantreg)
fit1 <- summary(rq(y~x1, tau=1:9/10), se="boot")
fit2 <- summary(rq(y~x2, tau=1:9/10), se="boot")
I want to plot only the slope estimates over quantiles. Hence, I am giving parm=2 in plot.
plot(fit1, parm=2)
plot(fit2, parm=2)
Now, I want to combine both these plots in a single page.
What I have tried so far;
I tried setting par(mfrow=c(2,2)) and plotting them. But it's producing a blank page.
I have tried using gridExtra and gridGraphics without success. Tried to convert base graphs into Grob objects as stated here
Tried using the layout function as in this document
I am trying to look into the source code of plot.rqs, but I can't understand how it plots the confidence bands (I'm able to plot only the coefficients over quantiles) or how to change the mfrow parameter there.
Can anybody point out where I am going wrong? Should I look into the source code of plot.rqs and change any parameters there?
While quantreg::plot.summary.rqs has an mfrow parameter, it uses it to override par('mfrow') so as to facet over parm values, which is not what you want to do.
One alternative is to parse the objects and plot manually. You can pull the tau values and coefficient matrix out of fit1 and fit2, which are just lists of values for each tau, so in tidyverse grammar,
library(tidyverse)
c(fit1, fit2) %>% # concatenate lists, flattening to one level
# iterate over list and rbind to data.frame
map_dfr(~cbind(tau = .x[['tau']], # from each list element, cbind the tau...
coef(.x) %>% # ...and the coefficient matrix,
data.frame(check.names = TRUE) %>% # cleaned a little
rownames_to_column('term'))) %>%
filter(term != '(Intercept)') %>% # drop intercept rows
# initialize plot and map variables to aesthetics (positions)
ggplot(aes(x = tau, y = Value,
ymin = Value - Std..Error,
ymax = Value + Std..Error)) +
geom_ribbon(alpha = 0.5) +
geom_line(color = 'blue') +
facet_wrap(~term, nrow = 2) # make a plot for each value of `term`
Pull more out of the objects if you like, add the horizontal lines of the original, and otherwise go wild.
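For example, to mimic the horizontal OLS reference lines that plot.summary.rqs draws, one option (the name ols_df is my own, not from the original answer) is to build a small data frame of lm() slopes and add a geom_hline() layer per facet:
# One OLS slope per term, drawn as a dashed reference line in each facet
ols_df <- data.frame(term = c("x1", "x2"),
                     ols  = c(coef(lm(y ~ x1))[["x1"]],
                              coef(lm(y ~ x2))[["x2"]]))
# Then append to the ggplot chain above:
#   + geom_hline(data = ols_df, aes(yintercept = ols), linetype = "dashed")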
Another option is to use magick to capture the original images (or save them with any device and reread them) and manually combine them:
library(magick)
plots <- image_graph(height = 300) # graphics device to capture plots in image stack
plot(fit1, parm = 2)
plot(fit2, parm = 2)
dev.off()
im1 <- image_append(plots, stack = TRUE) # attach images in stack top to bottom
image_write(im1, 'rq.png')
The plot function used by the quantreg package has its own mfrow parameter. If you do not specify it, it picks a layout on its own (and thus overrides your par(mfrow = c(2,2))).
Using the mfrow parameter within plot.rqs:
# make one plot, change the layout
plot(fit1, parm = 2, mfrow = c(2,1))
# add a new plot
par(new = TRUE)
# create a second plot
plot(fit2, parm = 2, mfrow = c(2,1))

New outliers appear after I remove existing ones using QQ Plot Results

I'm working on the PCA section from Michael Faraway's Linear Models with R (chapter 11, page 164).
PCA analysis is sensitive to outliers and the Mahalanobis distance helps us identify them.
The author checks for outliers by plotting the Mahalanobis distance against the quantiles of a chi-squared distribution.
if (!require(faraway)) install.packages("faraway")
library(faraway)
library(MASS)   # for cov.rob()
data(fat, package='faraway')
cfat <- fat[,9:18]
n <- nrow(cfat); p <- ncol(cfat)
# robust centre/covariance and Mahalanobis distances, as in the later block
robfat <- cov.rob(cfat)
md <- mahalanobis(cfat, center=robfat$center, cov=robfat$cov)
plot(qchisq(1:n/(n+1),p), sort(md),
     xlab=expression(paste(chi^2, " quantiles")),
     ylab="Sorted Mahalanobis distances")
abline(0,1)
I identify the points:
identify(qchisq(1:n/(n+1),p), sort(md))
It appears that the outliers are in rows 242:252. I remove these outliers and re-create the QQ Plot:
cfat.mod <- cfat[-c(242:252),] #remove outliers
robfat <- cov.rob(cfat.mod)
md <- mahalanobis(cfat.mod, center=robfat$center, cov=robfat$cov)
n <- nrow(cfat.mod); p <- ncol(cfat.mod)
plot(qchisq(1:n/(n+1),p), sort(md),
     xlab=expression(paste(chi^2, " quantiles")),
     ylab="Sorted Mahalanobis distances")
abline(0,1)
identify(qchisq(1:n/(n+1),p), sort(md))
Alas, a new set of points (rows 234:241) now appear to be outliers. This keeps happening every time I remove additional outliers.
I look forward to understanding what I'm doing wrong.
To identify the points correctly, make sure the labels correspond to the positions of the points in the data. The functions order or sort with index.return=TRUE will give the sorted indices. Here is an example, arbitrarily removing the points with md greater than a threshold.
## Your data
data(fat, package='faraway')
cfat <- fat[, 9:18]
n <- nrow(cfat)
p <- ncol(cfat)
md <- sort(mahalanobis(cfat, colMeans(cfat), cov(cfat)), index.return=TRUE)
xs <- qchisq(1:n/(n+1), p)
plot(xs, md$x, xlab=expression(paste(chi^2, 'quantiles')))
## Use indices in data as labels for interactive identify
identify(xs, md$x, labels=md$ix)
## remove those with md>25, for example
inds <- md$x > 25
cfat.mod <- cfat[-md$ix[inds], ]
nn <- nrow(cfat.mod)
md1 <- mahalanobis(cfat.mod, colMeans(cfat.mod), cov(cfat.mod))
## Plot the new data
par(mfrow=c(1, 2))
plot(qchisq(1:nn/(nn+1), p), sort(md1), xlab='chisq quantiles', ylab='')
abline(0, 1, col='red')
car::qqPlot(md1, distribution='chisq', df=p, line='robust', main='With car::qqPlot')
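If you want to identify points interactively on the reduced data as well, the same advice applies: label them with the row names carried over from the original data rather than with plot positions. A sketch building on the code above:
## Re-plot the reduced data; cfat.mod keeps the row names of fat,
## so identified points refer back to rows of the full data set
xs1 <- qchisq(1:nn/(nn+1), p)
plot(xs1, sort(md1), xlab='chisq quantiles', ylab='Sorted Mahalanobis distances')
abline(0, 1, col='red')
identify(xs1, sort(md1), labels=rownames(cfat.mod)[order(md1)])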
