How to ggplot the CDF of multiple variables in R?

I have the data frame DF and want to plot the cumulative distribution function (CDF) of its variables using ggplot. The following code produces the plot, but because the variables span very different ranges, most of the curves are hard to see. I don't want to use multiple facets; I would like all of the variables plotted in a single panel.
library(ggplot2)
library(reshape2) # provides melt()
set.seed(123)
DF <- melt(data.frame(p1 = runif(200,1,10), p2 = runif(200,-2,1), p3 = runif(200,0,0.05),p4 = runif(200,100,4000)))
ggplot(DF, aes(x = value, col = variable))+
stat_ecdf(lwd = 1.2)

We can use facet_wrap with free x scales so that each variable gets its own panel and axis range:
ggplot(DF, aes(x = value, col = variable))+
stat_ecdf(lwd = 1.2) +
facet_wrap(~ variable, scales = "free_x")

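If you don't want to use facets, you could use a log scale. Note that non-positive values (such as the negative values in p2) have no logarithm and are dropped from the plot with a warning: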
library(ggplot2)
set.seed(123)
DF <- reshape::melt(data.frame(p1 = runif(200,1,10), p2 = runif(200,-2,1), p3 = runif(200,0,0.05),p4 = runif(200,100,4000)))
ggplot(DF, aes(x = value, col = variable))+
stat_ecdf(lwd = 1.2)+
scale_x_log10()
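A plain log scale cannot show the negative values in p2. A minimal sketch of one possible workaround, assuming the scales package is installed: a pseudo-log transformation (scales::pseudo_log_trans) behaves like a log scale for large magnitudes but stays defined at zero and below. The sigma value of 0.01 is an illustrative choice, not something from the question, and the long-format data frame is built in base R here only to avoid depending on melt(). Newer ggplot2 releases call the trans argument transform.

```r
library(ggplot2)

set.seed(123)
wide <- data.frame(p1 = runif(200, 1, 10), p2 = runif(200, -2, 1),
                   p3 = runif(200, 0, 0.05), p4 = runif(200, 100, 4000))

# Long format built in base R: one row per (variable, value) pair
DF <- data.frame(variable = rep(names(wide), each = nrow(wide)),
                 value = unlist(wide, use.names = FALSE))

# Pseudo-log transformation; sigma = 0.01 is an assumed, illustrative setting
pl <- scales::pseudo_log_trans(sigma = 0.01, base = 10)

p <- ggplot(DF, aes(x = value, colour = variable)) +
  stat_ecdf(lwd = 1.2) +
  scale_x_continuous(trans = pl) # called `transform` in newer ggplot2
p
```

Smaller sigma values give the region around zero more room; the default axis breaks may need to be set by hand.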

Related

ggplotly: unable to add a frame in PCA score plot in ggplot2

I would like to make a PCA score plot using ggplot2, and then convert the plot into interactive plot using plotly.
What I want to do is add a frame around each group (not an ellipse via stat_ellipse, which I know works).
My problem is that when I try to use sample name as tooltip in ggplotly, the frame will disappear. I don't know how to fix it.
Below is my code
library(ggplot2)
library(plotly)
library(dplyr)
## Demo data
dat <- iris[1:4]
Group <- iris$Species
## Calculate PCA
df_pca <- prcomp(dat, center = TRUE, scale. = FALSE)
df_pcs <- data.frame(df_pca$x, Group = Group)
percentage <- round(df_pca$sdev^2 / sum(df_pca$sdev^2) * 100, 2)
# Axis labels of the form "PC1 (xx.xx%)"; use only the PC column names, not the added Group column
percentage <- paste0(colnames(df_pca$x), " (", percentage, "%)")
## Visualization
Sample_Name <- rownames(df_pcs)
p <- ggplot(df_pcs, aes(x = PC1, y = PC2, color = Group, label = Sample_Name)) +
xlab(percentage[1]) +
ylab(percentage[2]) +
geom_point(size = 3)
ggplotly(p, tooltip = "label")
Up to here it works: the sample names are shown properly in the ggplotly plot.
Next I tried to add a frame:
## add frame
hull_group <- df_pcs %>%
dplyr::mutate(Sample_Name = Sample_Name) %>%
dplyr::group_by(Group) %>%
dplyr::slice(chull(PC1, PC2))
p2 <- p +
ggplot2::geom_polygon(data = hull_group, aes(fill = Group), alpha = 0.1)
The static plot still works and the frame is added properly.
However, when I convert it to an interactive plotly plot, the frame disappears:
ggplotly(p2, tooltip = "label")
Thanks a lot for your help.
It works if you move the data and mapping from the ggplot() call to the geom_point() call:
p2 <- ggplot() +
geom_point(data = df_pcs, mapping = aes(x = PC1, y = PC2, color = Group, label = Sample_Name), size = 3) +
ggplot2::geom_polygon(data = hull_group, aes(x = PC1, y = PC2, fill = Group, group = Group), alpha = 0.2)
ggplotly(p2, tooltip = "label")
You might want to change the order of the geom_point and geom_polygon to make sure that the points are on top of the polygon (this also affects the tooltip location).

Plot different parts of a vector with different colors on the same graph

As in the title, suppose we have this vector and plot:
plot(rnorm(200,5,2),type="l")
What I would like to know is whether there is a way to draw the first half in blue (col = "blue") and the rest in red (col = "red").
A similar question has been asked for Matlab, but not for R.
You could simply use lines for the second half:
dat <- rnorm(200, 5, 2)
plot(1:100, dat[1:100], col = "blue", type = "l", xlim = c(0, 200), ylim = c(min(dat), max(dat)))
lines(101:200, dat[101:200], col = "red")
Not a base R solution, but I think this is how to plot it using ggplot2. It is necessary to prepare a data frame to plot the data.
set.seed(1234)
vec <- rnorm(200,5,2)
dat <- data.frame(Value = vec)
dat$Group <- as.character(rep(c(1, 2), each = 100))
dat$Index <- 1:200
library(ggplot2)
ggplot(dat, aes(x = Index, y = Value)) +
geom_line(aes(color = Group)) +
scale_color_manual(values = c("blue", "red")) +
theme_classic()
We can also use the lattice package with the same data frame.
library(lattice)
xyplot(Value ~ Index, data = dat, type = 'l', groups = Group, col = c("blue", "red"))
Notice that the blue line and red line are disconnected. Not sure if this is important, but if you want to plot a continuous line, here is a workaround in ggplot2. The idea is to subset the data frame for the second half, plot the entire data frame with color as blue, and then plot the second data frame with color as red.
dat2 <- dat[dat$Index %in% 101:200, ]
ggplot(dat, aes(x = Index, y = Value)) +
geom_line(color = "blue") +
geom_line(data = dat2, aes(x = Index, y = Value), color = "red") +
theme_classic()
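The same overplotting idea can be sketched in base R as well: draw the whole series in blue, then draw the second half again in red on top. This is only an illustration of the workaround described above; the names idx and second are made up for the example.

```r
set.seed(1234)
vec <- rnorm(200, 5, 2)
idx <- seq_along(vec)

# Indices of the second half (101 to 200)
second <- (length(vec) / 2 + 1):length(vec)

# Whole series in blue, so the line stays connected
plot(idx, vec, type = "l", col = "blue", xlab = "Index", ylab = "Value")
# Second half drawn again on top in red
lines(idx[second], vec[second], col = "red")
```

The segment joining points 100 and 101 stays blue, as in the ggplot2 version.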

ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y returned by fun.data is not the y aesthetic. stat_summary complains because it cannot find y, which must be mapped either globally in ggplot(df, aes(x = val, group = ab.class, y = ...)) or in the layer via stat_summary(aes(y = ...)). fun.data then computes where to display the point or text at each x from the y values supplied through aes.
Even if you map y through aes, you won't get the desired result, because stat_summary computes a summary of y at each x.
However, you can add text at the desired positions with geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# 'scaled' is the density rescaled to a maximum of 1, so its maximum marks the peak;
# extract the peak row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
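If the label colour should follow the group via aes(), one option is to compute the peaks yourself and pass them to geom_text as a separate data frame. The sketch below assumes that stats::density() with its default settings is close enough to what geom_density draws; the peak positions may differ slightly from the plotted curves. The peaks data frame and its column names are made up for the example.

```r
library(ggplot2)

set.seed(100)
df <- data.frame(ab.class = rep(c("A", "B"), each = 200),
                 val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))

# One row per group: location and height of the density peak, plus the label
peaks <- do.call(rbind, lapply(split(df, df$ab.class), function(g) {
  d <- density(g$val)            # default bandwidth, assumed to match geom_density
  i <- which.max(d$y)
  data.frame(ab.class = g$ab.class[1], x = d$x[i], y = d$y[i],
             label = paste0("n = ", nrow(g)))
}))

ggplot(df, aes(x = val)) +
  geom_density(aes(fill = ab.class), alpha = 0.4) +
  geom_text(data = peaks,
            aes(x = x, y = y, label = label, colour = ab.class),
            vjust = -0.5, show.legend = FALSE)
```

Because the labels come from an ordinary data frame, any aesthetic (colour, size, and so on) can be mapped in the usual way.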

How to create a heatmap with continuous scale using ggplot2 in R

I have got a data frame with several 1000 rows in the form of
group = c("gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3")
pos = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
color = c(2,2,2,2,3,3,2,2,3,2,1,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,2,2)
df = data.frame(group, pos, color)
and would like to make a kind of heatmap in which one axis has a continuous scale (position). The color column is categorical; however, due to the large number of data points I want to use binning, i.e. treat it as a continuous variable.
I can't think of a way to create such a plot using ggplot2/R. I have tried several geometries, e.g. geom_point():
ggplot(data=df, aes(x=group, y=pos, color=color)) +
geom_point() +
scale_colour_gradientn(colors=c("yellow", "black", "orange"))
Thanks for your help in advance.
Does this help you?
library(ggplot2)
group = c("gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3")
pos = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
color = c(2,2,2,2,3,3,2,2,3,2,1,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,2,2)
df = data.frame(group, pos, color)
ggplot(data = df, aes(x = group, y = pos)) + geom_tile(aes(fill = color))
An improved version with a three-color gradient, if you like:
library(scales)
ggplot(data = df, aes(x = group, y = pos)) +
geom_tile(aes(fill = color)) +
scale_fill_gradientn(colours = c("orange", "black", "yellow"),
values = rescale(c(1, 2, 3)), guide = "colorbar")
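The question also mentions binning because of the large number of positions. A rough sketch of one way to do that before plotting: group positions into fixed-width bins, average color within each group and bin, and draw one tile per bin. The bin width of 2 and the names bin_width, bin and binned are assumptions for illustration only.

```r
library(ggplot2)

group <- rep(c("gr1", "gr2", "gr3"), each = 10)
pos <- rep(1:10, times = 3)
color <- c(2,2,2,2,3,3,2,2,3,2, 1,2,2,2,1,1,1,1,1,1, 2,2,2,2,2,2,1,1,2,2)
df <- data.frame(group, pos, color)

bin_width <- 2 # assumed bin size; choose to suit the real data

# Centre of the bin each position falls into (positions 1-2 -> 1.5, 3-4 -> 3.5, ...)
df$bin <- floor((df$pos - 1) / bin_width) * bin_width + (bin_width + 1) / 2

# Mean color per group and bin, treating color as a continuous variable
binned <- aggregate(color ~ group + bin, data = df, FUN = mean)

ggplot(binned, aes(x = group, y = bin, fill = color)) +
  geom_tile(height = bin_width) +
  scale_fill_gradientn(colours = c("orange", "black", "yellow"))
```

With several thousand rows, a larger bin_width keeps the number of tiles manageable.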

Individual binwidths in faceted histogram on ggplot2

I make a series of histograms with facet_grid and I want every histogram in the grid to have the same number of classes, e.g. 6 classes in the example below. The problem is that binwidth = diff(range(x$data))/6 defines the classes according to the overall range of a, b and c, i.e. it defines one binwidth for all three facets.
How do I define binwidth individually for the facets a, b and c?
require("ggplot2")
a <- c(1.21,1.57,1.21,0.29,0.36,0.29,0.93,0.26,0.28,0.48,
0.12,0.38,0.83,0.82,0.41,0.69,0.25,0.98,0.52,0.11)
b <- c(0.42,0.65,0.17,0.38,0.44,0.01,0.01,0.03,0.15,0.01)
c <- c(1.09,3.55,1.07,4.55,0.55,0.11,0.72,0.66,1.22,3.04,
2.01,0.64,0.47,1.33,3.44)
x <- data.frame(data = c(a,b,c), variable = c(rep("a",20),rep("b",10),rep("c",15)),area="random")
qplot(data, data = x, geom = "histogram", binwidth = diff(range(x$data))/6) +
facet_grid(area~variable, scales = "free")
This is not optimal but you can do the histogram in different layers:
ggplot(x, aes(x=data)) +
geom_histogram(data=subset(x, variable=="a"), binwidth=.1) +
geom_histogram(data=subset(x, variable=="b"), binwidth=.2) +
geom_histogram(data=subset(x, variable=="c"), binwidth=.5) +
facet_grid(area~variable, scales="free")
One way is to pre-summarize your data in the way you want it, then to create the plot.
In your case, you need to bin your variables using the function cut(). The package dplyr is convenient for this, because it allows you to specify a mutate function for each group of your data:
library(dplyr)
zz <- x %>%
group_by(variable) %>%
mutate(
bins = cut(data, breaks=6)
)
# bins is a factor, so count its levels with geom_bar rather than geom_histogram
ggplot(zz, aes(x = bins)) +
geom_bar(fill = "blue") +
facet_grid(area~variable, scales = "free") +
theme(axis.text.x = element_text(angle=90))
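Another possibility, in reasonably recent ggplot2 versions, is to pass a function as binwidth. As far as I understand it, the function is evaluated on the x values of each facet separately, so every panel gets a bin width of one sixth of its own range. Depending on how the bins line up with the data, a panel may end up with six or seven bars. This is a sketch, not a tested replacement for the answers above.

```r
library(ggplot2)

a <- c(1.21,1.57,1.21,0.29,0.36,0.29,0.93,0.26,0.28,0.48,
       0.12,0.38,0.83,0.82,0.41,0.69,0.25,0.98,0.52,0.11)
b <- c(0.42,0.65,0.17,0.38,0.44,0.01,0.01,0.03,0.15,0.01)
c <- c(1.09,3.55,1.07,4.55,0.55,0.11,0.72,0.66,1.22,3.04,
       2.01,0.64,0.47,1.33,3.44)
x <- data.frame(data = c(a, b, c),
                variable = rep(c("a", "b", "c"), times = c(20, 10, 15)),
                area = "random")

# binwidth as a function: computed from the values of each facet
p <- ggplot(x, aes(x = data)) +
  geom_histogram(binwidth = function(v) diff(range(v)) / 6) +
  facet_grid(area ~ variable, scales = "free")
p
```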
