I am trying to plot the contents of a Principal Component Analysis using ggplot2. I would like to generate a single PDF containing all pairwise plots for a user-specified number of principal components (so if the user says 3, it would plot PC 1 vs 2, 1 vs 3, and 2 vs 3).
I've looked into gridExtra, but I'm not entirely sure how to add multiple plots when I don't know in advance how many components will be selected.
Here is a start: get the user input (input_nPC), build the pairwise combinations, then loop through them and plot the matching subset of the data:
library(ggplot2)
# example data
myPC <- data.frame(
pc1 = runif(10),
pc2 = runif(10),
pc3 = runif(10),
pc4 = runif(10))
# user input, e.g.: 3 out of 4 PCs
input_nPC <- 3
# check: must be at least 2 PCs
input_nPC <- max(c(2, input_nPC))
# get combination
combo <- combn(input_nPC, 2)
pdf("myOutput.pdf")
for(i in seq(ncol(combo))){
d <- myPC[, combo[, i]]
d_cols <- colnames(d)
gg <- ggplot(d, aes_string(x = d_cols[1], y = d_cols[2])) +
geom_point() +
ggtitle(paste(d_cols, collapse = " vs "))
print(gg)
}
dev.off()
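The code above checks only the lower bound, so a request for more components than the data contains would fail when subsetting. A minimal sketch of clamping the input before calling combn(); the helper name clamp_npc is hypothetical and not part of the original answer:

```r
# Hypothetical helper: clamp the requested number of PCs to [2, n_available].
clamp_npc <- function(n_requested, n_available) {
  stopifnot(n_available >= 2)
  min(max(2L, as.integer(n_requested)), as.integer(n_available))
}

n_pc  <- clamp_npc(3, 4)   # user asks for 3 of 4 PCs
combo <- combn(n_pc, 2)    # one column per pair of component indices
combo
```

Each column of combo is one pair, so the loop draws choose(n_pc, 2) plots.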
If the output needs to be a single-page PDF, use the cowplot package:
ggList <-
apply(combo, 2, function(i){
d <- myPC[, i]
d_cols <- colnames(d)
ggplot(d, aes_string(x = d_cols[1], y = d_cols[2])) +
geom_point() +
ggtitle(paste(d_cols, collapse = " vs "))
})
pdf("myOutput.pdf")
cowplot::plot_grid(plotlist = ggList)
dev.off()
Or we can use GGally::ggpairs as below:
library(GGally)
ggpairs(myPC[, 1:input_nPC])
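Note that aes_string() has been soft-deprecated since ggplot2 3.0.0. Here is a sketch of the same plot list built with the .data pronoun instead, assuming ggplot2 >= 3.0.0; plot_pair is a hypothetical helper name used only for illustration:

```r
library(ggplot2)

# Small example data, as in the answer above
myPC <- data.frame(pc1 = runif(10), pc2 = runif(10), pc3 = runif(10))

# Hypothetical helper: scatter plot of two columns given by name
plot_pair <- function(df, cols) {
  ggplot(df, aes(x = .data[[cols[1]]], y = .data[[cols[2]]])) +
    geom_point() +
    labs(x = cols[1], y = cols[2], title = paste(cols, collapse = " vs "))
}

# All pairs of column names, one list element per pair
pairs  <- combn(names(myPC), 2, simplify = FALSE)
ggList <- lapply(pairs, plot_pair, df = myPC)
```

The resulting list can be passed to cowplot::plot_grid(plotlist = ggList) in the same way as before.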
Users of a Shiny app can test data sets for Poisson, normality, and exponentiality. I am returning the results of the statistical test they chose. In addition, I thought it would be nice to plot the density from the data along with the theoretical distribution. They could be testing multiple sets of data at once, so I am faceting the plot.
From the question "ggplot add Normal Distribution while using `facet_wrap`" I found the really great ggh4x package. However, since this could be industry data, the minimum may not be zero.
The problem is that stat_theodensity(distri = "exp") uses dexp, which doesn't account for a nonzero minimum, so the theoretical distribution plot doesn't match the data.
How can I tell the stat_theodensity that there is an xmin for each facet, which is the min of the data in the facet? I see that fitdistrplus can use different methods to fit an exponential curve, and that, for example, method="mse" would work. Is there a way to pass this through stat_theodensity?
library(ggh4x)
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density()+
stat_theodensity(distri = "exp")+
facet_wrap(facets = ~ID,scales = "free")
p
#what the first point of the graphs should be
dexp(x = 100-100,rate = 1/100)
#[1] 0.01
dexp(x = 500-500,rate = 1/250)
#[1] 0.004
EDIT
OK I am getting closer. The following code works, but only for the second pass through the loop. If I change the numbers around for data1 and data2, it is always only the second one that plots the theoretical distribution.
I did ggplot_build after the first loop and it gives an error in fitdist(), which is code 100. I don't know why it would always fail on the first one but not on the second one, even with the same data.
Any ideas?
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density(color="red")
#loop through sets and add facets
for (set in unique(plot_dat$ID)){
xmin <- min(plot_dat$data[ID == set])
p<-p+
stat_theodensity(
data = ~subset(.x, ID == set),
aes(x = stage(data - xmin, after_stat = x + xmin)),
distri = "exp"
)
}
#stat_theodensity(distri = "exp")+
p<-p+
facet_wrap(facets = ~ID,scales = "free")
p
I don't know about the statistics of your problem, but if the issue is subtracting a number before calculating the density and afterwards adding it, you might do that with stage(). I couldn't find a more elegant way than hardcoding these values for each set separately, but I'd be happy to hear about more creative solutions.
library(ggh4x)
#> Loading required package: ggplot2
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
ggplot(data = plot_dat, aes(x=data))+
geom_density() +
stat_theodensity(
data = ~ subset(.x, ID == "Set 1"),
aes(x = stage(data - 100, after_stat = x + 100)),
distri = "exp"
) +
stat_theodensity(
data = ~ subset(.x, ID == "Set 2"),
aes(x = stage(data - 500, after_stat = x + 500)),
distri = "exp"
) +
facet_wrap(facets = ~ID,scales = "free")
Created on 2022-09-26 by the reprex package (v2.0.1)
EDIT
I think OP's update had a problem with non-standard evaluation. It should work when you use an lapply() loop instead of a for-loop, because then xmin is not a global variable that might be mistakenly looked up.
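The underlying behavior can be reproduced in base R without any plotting. In this sketch (an illustration, not the OP's code), functions created in a for-loop all see the final value of the loop variable, while functions created inside lapply() each keep their own value:

```r
# Functions defined in a for-loop share one enclosing environment,
# so they all see the loop variable's final value.
fs <- list()
for (i in 1:2) fs[[i]] <- function() i
from_for <- sapply(fs, function(f) f())

# Each lapply() call has its own environment holding its own `i`.
gs <- lapply(1:2, function(i) function() i)
from_lapply <- sapply(gs, function(f) f())

from_for     # 2 2
from_lapply  # 1 2
```

As far as I understand, aes() and the formula passed to data behave similarly: they are evaluated when the plot is built rather than when the layer is added, so inside a for-loop every layer ends up using the last set and xmin.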
library(ggh4x)
#> Loading required package: ggplot2
library(ggplot2)
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density(color="red") +
facet_wrap(facets = ~ ID, scales = "free")
#loop through sets and add facets
p + lapply(unique(plot_dat$ID), function(i) {
xmin <- min(plot_dat$data[plot_dat$ID == i])
stat_theodensity(
data = ~ subset(.x, ID == i),
aes(x = stage(data - xmin, after_stat = x + xmin)),
distri = "exp"
)
})
Created on 2022-09-27 by the reprex package (v2.0.1)
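An alternative that avoids the per-layer bookkeeping is to compute the shifted exponential density for each facet yourself and draw it with geom_line(). This is only a sketch, assuming the shift is the facet minimum and the rate is estimated as 1 / mean(data - xmin); the name theo is made up for illustration:

```r
library(ggplot2)
set.seed(1)

plot_dat <- data.frame(
  ID   = rep(c("Set 1", "Set 2"), each = 500),
  data = c(rexp(500, rate = 1/100) + 100, rexp(500, rate = 1/250) + 500)
)

# One theoretical curve per facet: shift by the facet minimum,
# estimate the rate from the shifted data, then evaluate dexp().
theo <- do.call(rbind, lapply(split(plot_dat, plot_dat$ID), function(d) {
  xmin <- min(d$data)
  rate <- 1 / mean(d$data - xmin)
  x    <- seq(xmin, max(d$data), length.out = 200)
  data.frame(ID = d$ID[1], x = x, density = dexp(x - xmin, rate = rate))
}))

p <- ggplot(plot_dat, aes(x = data)) +
  geom_density(color = "red") +
  geom_line(data = theo, aes(x = x, y = density)) +
  facet_wrap(~ID, scales = "free")
p
```

Because theo carries the ID column, facet_wrap() places each curve in the matching panel without a loop.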
When we take the following example from ggplot2 docs
df <- data.frame(x = c(rnorm(100, 0, 3), rnorm(100, 0, 10)),
g = gl(2, 100))
library(ggplot2)
p <- ggplot(df, aes(x, colour = g)) +
stat_ecdf(geom = "step", na.rm = T) + # interchange point and step
theme_bw()
p
This creates a standard CDF plot. However, when I convert the plot with plotly's ggplotly(), I get a very confusing image with the step geom, whereas the point geom behaves as it should. What is happening with the step geom? Why can't I reproduce the image I get from ggplot2 alone?
library(plotly)
ggplotly(p)
I found the solution here https://community.plotly.com/t/bug-with-ggplot2-stat-ecdf-function/1187/3.
You should reorder the dataframe along x.
df <- dplyr::arrange(df, x)
library(ggplot2)
p <- ggplot(df, aes(x, colour = g)) +
stat_ecdf(geom = "step", na.rm = T) +
theme_bw()
p
library(plotly)
ggplotly(p)
This can also be solved by computing the CDF yourself with the ecdf() function.
## ecdf function to get y and 1-y
rcdf <- function (x) {
cdf <- ecdf(x)
y1 <- cdf(x)
y <- unique(y1)
# xrcdf <- 1-y ## to get reverse cdf
xrcdf <- y ## to get cdf
}
ug <- unique(df$g)
ng <- length(ug)
xll <- min(df$x)
xul <- max(df$x)
adr <- data.frame(myxx=c(), myyy=c(), mygg=c())
lapply(1:ng, function(i){
ad2r <- subset(df, g==ug[i])
myx1 <- unique(ad2r$x)
myxx <- c(xll,myx1,xul) ## add lowest value - dummy to assign 100%
myy1 <- rcdf(ad2r$x)
# myyy <- c(1.0,myy1,0.0) ## add 100% to get reverse cdf
myyy <- c(0.0,myy1,1.0) ## add 0% to get cdf
mygg <- ug[i]
ad2rf <- data.frame(myxx,myyy,mygg)
adr <<- rbind(adr,ad2rf)
})
adf <- adr[order(adr$myxx),]
pp <- ggplot(data=adf,
aes_(x=adf$myxx, y=100*adf$myyy, col=adf$mygg, group=adf$mygg)) +
geom_step() +
labs(title="CDF", y = "Y", x = "X", col=NULL)
ppp <- ggplotly(pp, tooltip=c("x","y"))
ppp
This gives the following output: a step plot titled "CDF".
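A shorter variant of the same idea, as a sketch: evaluate ecdf() on the sorted unique x values of each group, so the rows are already sorted by x within each group. The data frame name ecdf_df is an assumption for illustration:

```r
library(ggplot2)
set.seed(42)

df <- data.frame(x = c(rnorm(100, 0, 3), rnorm(100, 0, 10)),
                 g = gl(2, 100))

# Per group: sorted unique x values and the ECDF evaluated at them
ecdf_df <- do.call(rbind, lapply(split(df, df$g), function(d) {
  xs <- sort(unique(d$x))
  data.frame(g = d$g[1], x = xs, y = ecdf(d$x)(xs))
}))

pp <- ggplot(ecdf_df, aes(x = x, y = y, colour = g)) +
  geom_step() +
  theme_bw()
pp
# plotly::ggplotly(pp)  # optional, if plotly is installed
```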
In the example below I have a dataset containing two experiments, F1 and F2. A classification is performed based on the F1 signal, and both F1 and F2 values are ordered accordingly. In this diagram, each facet has the same dimensions although the number of rows is not the same (e.g. class #7 contains only a few elements compared to the other classes). I would like to modify the code to force the row height to be the same across facets (facets would thus have varying amounts of blank space below). Any hints would be greatly appreciated.
Thank you
library(ggplot2)
library(reshape2)
set.seed(123)
# let's create a fake dataset
nb.experiment <- 4
n.row <- 200
n.col <- 5
nb.class <- 7
d <- matrix(round(runif(n.row * n.col),2), nc=n.col)
colnames(d) <- sprintf("%02d", 1:5)
# These strings will be the row names of each heatmap
# in the subsequent facet plot
elements <- sample(replicate(n.row/2, rawToChar(as.raw(sample(65:90, 6, replace=T)))))
# let's create a data.frame d
d <- data.frame(d,
experiment = sort(rep(c("F1","F2"), n.row/2)),
elements= elements)
# Now we split the dataset by experiments
d.split <- split(d, d$experiment)
# Now we create classes (here using hierarchical clustering )
# based on F1 experiment
dist.mat <- as.dist(1-cor(t(d.split$F1[,1:5]), method="pearson"))
hc <- hclust(dist.mat)
cuts <- cutree(hc, nb.class)
levels(cuts) <- sprintf("Class %02d", 1:nb.experiment)
# We split F1 and F2 based on classification result
for(s in names(d.split)){
d.split[[s]] <- split(d.split[[s]], cuts)
}
# Data are melted (there is perhaps a better solution...)
# in order to use the ggplot function
dm <- melt(do.call('rbind',lapply(d.split, melt)), id.var=c( "experiment", "elements", "variable", "L1"))
dm <- dm[, -5]
colnames(dm) <- c("experiment","elements", "pos", "class", "exprs")
dm$class <- as.factor(dm$class)
levels(dm$class) <- paste("Class", levels(dm$class))
# Now we plot the data
p <- ggplot(dm, aes(x = pos, y = elements, fill = exprs))
p <- p + geom_raster()
p <- p + facet_wrap(~class +experiment , scales = "free", ncol = 2)
p <- p + theme_bw()
p <- p + theme(text = element_text(size=4))
p <- p + theme(text = element_text(family = "mono", face = "bold"))
print(p)
Use facet_grid instead of facet_wrap and set the space argument:
ggplot(dm, aes(x = pos, y = elements, fill = exprs)) +
geom_raster() +
facet_grid(class ~ experiment , scales = "free", space = "free_y") +
theme_bw()
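To see the effect in isolation, here is a toy sketch (made-up data, not the dataset above) in which one class has six rows and the other has two. With scales and space both set to "free_y", panel heights should follow the number of rows per class:

```r
library(ggplot2)

# Toy data: 8 elements x 2 positions; "Class 1" has 6 elements, "Class 2" has 2
toy <- expand.grid(pos = c("01", "02"), element = paste0("e", 1:8))
toy$class <- ifelse(toy$element %in% paste0("e", 1:6), "Class 1", "Class 2")
toy$exprs <- seq(0, 1, length.out = nrow(toy))

p <- ggplot(toy, aes(x = pos, y = element, fill = exprs)) +
  geom_tile() +
  facet_grid(class ~ ., scales = "free_y", space = "free_y") +
  theme_bw()
p
```

With space left at its default of "fixed", both panels would get the same height and the two rows of "Class 2" would be stretched.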
I'm having a problem with faceted heatmap rendering in ggplot2. The idea is that I have several elements (these are genes in real life) and several experiments (F1 and F2 in the example below). Using the F1 experiment, I'm able to create classes of elements/genes based on their mean expression (high, ..., moderate, ..., low). In the heatmap produced by the example below, I would like to order the elements within each class (01, 02, 03, 04) by their mean expression value in F1. Unfortunately, the elements appear in alphabetical order. I would be very happy to get some hints...
Best
library(ggplot2)
library(reshape2)
set.seed(123)
# let's create a fake dataset
nb.experiment <- 4
n.row <- 200
n.col <- 5
d <- matrix(round(runif(n.row * n.col),2), nc=n.col)
colnames(d) <- sprintf("%02d", 1:5)
# These strings will be the row names of each heatmap
# in the subsequent facet plot
elements <- sample(replicate(n.row/2, rawToChar(as.raw(sample(65:90, 6, replace=T)))))
# let's create a data.frame d
d <- data.frame(d,
experiment = sort(rep(c("F1","F2"), n.row/2)),
elements= elements)
# For elements related to experiment F1
# we artificially produce a gradient of values that will
# create elements with increasing row means
d[d$experiment =="F1",1:5] <- round(sweep(d[d$experiment =="F1",1:5],
1,
seq(from=1, 10, length.out = 100),
"+"), 2)
# For elements related to experiment F2
# we artificially produce a gradient of values that will
# create elements with decreasing row means
d[d$experiment =="F2",1:5] <- round(sweep(d[d$experiment =="F2",1:5],
1,
seq(from=10, 1, length.out = 100),
"+"), 2)
#print(d[d$experiment =="F1",1:5])
# Now we split the dataset by experiments
d.split <- split(d, d$experiment)
# For all experiments, we order elements based on the mean expression signal in
# F1.
row.means.F1 <- rowMeans(d.split$F1[,1:5])
pos <- order(row.means.F1)
for(s in names(d.split)){
d.split[[s]] <- d.split[[s]][pos,]
}
# We create several classes of elements based on their
# mean expression signal in F1.
cuts <- cut(1:nrow(d.split$F1), nb.experiment)
levels(cuts) <- sprintf("%02d", 1:nb.experiment)
for(s in names(d.split)){
d.split[[s]] <- split(d.split[[s]], cuts)
}
# Data are melted (there is perhaps a better solution...)
# in order to use the ggplot function
dm <- melt(do.call('rbind',lapply(d.split, melt)), id.var=c( "experiment", "elements", "variable", "L1"))
dm <- dm[, -5]
colnames(dm) <- c("experiment","elements", "pos", "rowMeanClass", "exprs")
# Now we plot the data
p <- ggplot(dm, aes(x = pos, y = elements, fill = exprs))
p <- p + geom_raster()
p <- p + facet_wrap(~rowMeanClass +experiment , scales = "free", ncol = 2)
p <- p + theme_bw()
p <- p + theme(text = element_text(size=4))
p <- p + theme(text = element_text(family = "mono", face = "bold"))
ggsave("RPlot_test.jpeg", p)
Using your advice I was able to find a solution (which requires explicitly specifying the order of levels for the 'elements' factor). Thank you @hrbrmstr (and all others).
NB: I only added a few lines compared to the original code; they are marked below with 'Added: begin' and 'Added: end' flags.
library(ggplot2)
library(reshape2)
set.seed(123)
# let's create a fake dataset
nb.experiment <- 4
n.row <- 200
n.col <- 5
d <- matrix(round(runif(n.row * n.col),2), nc=n.col)
colnames(d) <- sprintf("%02d", 1:5)
# These strings will be the row names of each heatmap
# in the subsequent facet plot
elements <- sample(replicate(n.row/2, rawToChar(as.raw(sample(65:90, 6, replace=T)))))
# let's create a data.frame d
d <- data.frame(d,
experiment = sort(rep(c("F1","F2"), n.row/2)),
elements= elements)
# For elements related to experiment F1
# we artificially produce a gradient of values that will
# create elements with increasing row means
d[d$experiment =="F1",1:5] <- round(sweep(d[d$experiment =="F1",1:5],
1,
seq(from=1, 10, length.out = 100),
"+"), 2)
# For elements related to experiment F2
# we artificially produce a gradient of values that will
# create elements with decreasing row means
d[d$experiment =="F2",1:5] <- round(sweep(d[d$experiment =="F2",1:5],
1,
seq(from=10, 1, length.out = 100),
"+"), 2)
#print(d[d$experiment =="F1",1:5])
# Now we split the dataset by experiments
d.split <- split(d, d$experiment)
# For all experiments, we order elements based on the mean expression signal in
# F1.
row.means.F1 <- rowMeans(d.split$F1[,1:5])
pos <- order(row.means.F1)
for(s in names(d.split)){
d.split[[s]] <- d.split[[s]][pos,]
}
## Added: begin ###
#Get the list of elements in proper order (based on row mean)
mean.order <- as.character(d.split$F1$elements)
## Added: end###
# We create several classes of elements based on their
# mean expression signal in F1.
cuts <- cut(1:nrow(d.split$F1), nb.experiment)
levels(cuts) <- sprintf("%02d", 1:nb.experiment)
for(s in names(d.split)){
d.split[[s]] <- split(d.split[[s]], cuts)
}
# Data are melted (there is perhaps a better solution...)
# in order to use the ggplot function
dm <- melt(do.call('rbind',lapply(d.split, melt)), id.var=c( "experiment", "elements", "variable", "L1"))
dm <- dm[, -5]
colnames(dm) <- c("experiment","elements", "pos", "rowMeanClass", "exprs")
## Added: begin###
#Ensure that dm$elements is an ordered factor with levels
# ordered as expected
dm$elements <- factor(dm$elements, levels = mean.order, ordered = TRUE)
## Added: end###
# Now we plot the data
p <- ggplot(dm, aes(x = pos, y = elements, fill = exprs))
p <- p + geom_raster()
p <- p + facet_wrap(~rowMeanClass +experiment , scales = "free", ncol = 2)
p <- p + theme_bw()
p <- p + theme(text = element_text(size=4))
p <- p + theme(text = element_text(family = "mono", face = "bold"))
ggsave("RPlot_test.jpeg", p)
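The key step is that a discrete axis follows the factor's level order, and levels default to alphabetical. A small base R sketch with made-up element names and means (not the data above) shows the difference:

```r
# Made-up elements with their mean signal in F1
scores <- data.frame(
  element = c("QWERTY", "ABCDEF", "MNOPQR"),
  mean_f1 = c(1.2, 7.5, 3.1)
)

# Default: levels are sorted alphabetically
default_levels <- levels(factor(scores$element))

# Explicit: levels follow increasing mean_f1
by_mean <- scores$element[order(scores$mean_f1)]
scores$element <- factor(scores$element, levels = by_mean)

default_levels          # "ABCDEF" "MNOPQR" "QWERTY"
levels(scores$element)  # "QWERTY" "MNOPQR" "ABCDEF"
```

When the ordering variable sits in the same data frame, reorder(element, mean_f1) gives the same level order in one call.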
Below is a plot which should contain lines of different types (solid and dashed) according to two groups (specified by the variable "m"). As you can see, it's close to being correct, but the lines kind of randomly connect the points instead of only connecting the points in the corresponding group.
This looks quite similar to what's in Hadley's book (page 50), but although I used the "group"-variable, it's still not as expected.
## new minimal example
require(ggplot2)
require(reshape2)
require(plyr)
set.seed(1)
## parameters
m <- c("g1", "g2")
x <- c(10, 20, 50, 100)
z <- c(5, 20, 50)
N <- 1000
## lengths
lm <- length(m)
lx <- length(x)
lz <- length(z)
## build result array containing the measurements
arr <- array(rep(NA, lm*lx*lz*N), dim=c(lm, lx, lz, N),
dimnames=list(
m=m,
x=x,
z=z,
N=1:N))
## fill with dummy data
for(i in 1:lx){
for(j in 1:lz){
arr[1,i,j,] <- 0+i+j+runif(N, min=-4, max=4)
}
}
arr[2,,,] <- arr[1,,,] + 2
names(dimnames(arr)) # "m" "x" "z" "N"
## compute a (dummy) summary statistic
means <- apply(arr, MARGIN=1:3, FUN=mean)
## create molten data
mdf <- reshape2:::melt.array(means, formula = . ~ m + x + z, value.name="Mean")
mdf. <- mutate(mdf, xz=x*z) # add x*z
mdf.$x <- as.factor(mdf.$x) # change to factor (for grouping with different shapes)
## trial
sym <- c(1, 2, 4, 20) # plot symbols
ggplot(mdf., aes(x=xz, y=Mean, shape=x)) +
geom_line(aes(group=x)) + geom_point() + # indicate group 1 by solid lines
geom_line(aes(group=m), linetype=2) + # indicate group 2 by dashed lines
scale_shape_manual(values=sym, breaks=x) +
labs(x="x times z", y="Mean")
## => Each of the two groups specified by m should be depicted by a special line type
## (solid for "g1", dashed for "g2"), but the lines are not correctly drawn...
## The goal is to connect the dots of the second group by a dashed line and to
## highlight the nodes by the same plot symbols (sym) as for the first group.
As @lselzer says, try removing the first geom_line and moving the linetype argument into the aes:
ggplot(mdf., aes(x=xz, y=Mean, shape=x)) +
geom_point() +
geom_line(aes(group=m,linetype = m)) + # one line per group in m, line type mapped to m
scale_shape_manual(values=sym, breaks=x) +
labs(x="x times z", y="Mean")
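If the solid/dashed assignment should be explicit rather than left to the default scale, scale_linetype_manual() can pin it. A toy sketch with made-up numbers (not the simulated means above):

```r
library(ggplot2)

# Made-up summary data: two groups measured at the same four x*z values
toy <- data.frame(
  m    = rep(c("g1", "g2"), each = 4),
  xz   = rep(c(50, 400, 1000, 5000), times = 2),
  Mean = c(2, 3, 4, 5, 4, 5, 6, 7)
)

p <- ggplot(toy, aes(x = xz, y = Mean, group = m, linetype = m)) +
  geom_line() +
  geom_point() +
  scale_linetype_manual(values = c(g1 = "solid", g2 = "dashed")) +
  labs(x = "x times z", y = "Mean")
p
```

Mapping linetype inside aes() also produces a legend for the two groups, which a fixed linetype = 2 outside aes() does not.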