ggplot boxplot + jitter plot showing random sampling of data - r

I'd like to use ggplot to generate a series of boxplots derived from all data within a dataset, but then with jittered points showing a random sampling of the respective data (e.g., 100 data points) to avoid over-plotting (there are thousands of data points). Can anyone please help me with the code for this? The basic framework I have now is below, but I don't know what if any arguments can be added to draw a random sampling of data to display as the jittered points. Thanks for any help.
ggplot(datafile, aes(x=factor(var1), y=var2, fill=var3)) + geom_jitter(size=0.1, position=position_jitter(width=0.3, height=0.2)) + geom_boxplot(alpha=0.5) + facet_grid(.~var3) + theme_bw() + scale_fill_manual(values=c("red", "green", "blue"))

You could take a random subset of your data using dplyr:
library(dplyr)
library(ggplot2)
ggplot(data = datafile, aes(x = factor(var1), y = var2, fill = var3)) +
geom_jitter(
# use a random subset of the data: 100 rows per level of var1
# (sample_n() errors if a group has fewer than 100 rows)
data = datafile %>% group_by(var1) %>% sample_n(100),
aes(x = factor(var1), y = var2, fill = var3),
size = 0.1,
position = position_jitter(width = 0.3, height = 0.2)) +
geom_boxplot(alpha = 0.5) +
facet_grid(. ~ var3) +
theme_bw() +
scale_fill_manual(values = c("red", "green", "blue"))
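If some groups might hold fewer than 100 rows, or if you want the cap to apply per facet as well, a variation along these lines may help. It is only a sketch: the simulated `datafile`, the cap `n_max`, and the choice to group by both var1 and var3 are assumptions made for illustration, and slice_sample() needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for `datafile` (the real data is not shown in the question)
datafile <- data.frame(
  var1 = sample(1:3, 3000, replace = TRUE),
  var2 = rnorm(3000),
  var3 = sample(c("a", "b", "c"), 3000, replace = TRUE)
)

# Keep at most n_max rows per (var1, var3) cell; slice_sample() silently
# returns the whole group when it has fewer than n_max rows
n_max <- 100
jitter_data <- datafile %>%
  group_by(var1, var3) %>%
  slice_sample(n = n_max) %>%
  ungroup()

# Boxplots use all rows; only the jitter layer uses the subsample
p <- ggplot(datafile, aes(x = factor(var1), y = var2, fill = var3)) +
  geom_jitter(data = jitter_data, size = 0.1,
              position = position_jitter(width = 0.3, height = 0.2)) +
  geom_boxplot(alpha = 0.5) +
  facet_grid(. ~ var3) +
  theme_bw() +
  scale_fill_manual(values = c("red", "green", "blue"))
```

Because the boxplot layer still inherits the full `datafile`, the summary statistics are unaffected by the subsampling.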

Related

How to create legend with differing alphas for multiple geom_line plots in ggplot2 (R)

I have the following data on school enrollment for two years. I want to highlight data from school H in my plot and in the legend by giving it a different alpha.
library(tidyverse)
schools <- c("A","B","C","D","E",
"F","G","H","I","J")
yr2010 <- c(601,809,604,601,485,485,798,662,408,451)
yr2019 <- c(971,1056,1144,933,732,833,975,617,598,822)
data <- data.frame(schools,yr2010,yr2019)
I did some data management to get the data ready for plotting.
data2 <- data %>%
gather(key = "year", value = "students", 2:3)
data2a <- data2 %>%
filter(schools != "H")
data2b <- data2 %>%
filter(schools == "H")
Then I tried to graph the data using two separate geom_line plots, one for school H with default alpha and size=1.5, and one for the remaining schools with alpha=.3 and size=1.
ggplot(data2, aes(x=year,y=students,color=schools,group=schools)) +
theme_classic() +
geom_line(data = data2a, alpha=.3, size=1) +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","brown","black")) +
geom_line(data = data2b, color="blue", size=1.5)
However, the school I want to highlight is not included in the legend. So I tried to include the color of school H in scale_color_manual instead of in the geom_line call.
ggplot(data2, aes(x=year,y=students,color=schools,group=schools)) +
theme_classic() +
geom_line(data = data2a, alpha=.3, size=1) +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","blue","brown","black")) +
geom_line(data = data2b, size=1.5)
However, now the alphas in the legend are all the same, which doesn't highlight school H as much as I'd like.
How can I call the plot so that the legend matches the alpha of the line itself for all schools?
You need to map alpha and size inside aes(), just as you did with color. You can then set the values with scale_alpha_manual and scale_size_manual as needed. This also removes the need to create data2a and data2b.
See below code:
ggplot(data2, aes(x=year,y=students,color=schools,group=schools,
alpha=schools, size = schools)) +
theme_classic() +
geom_line() +
scale_color_manual(values=c("red","orange","green","skyblue","aquamarine","purple",
"pink","blue","brown","black")) +
scale_alpha_manual(values = c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,NA, 0.3, 0.3)) +
#for the default alpha, you can write 1 or NA
scale_size_manual(values= c(1,1,1,1,1,1,1,1.5,1,1))
This code produces a plot in which the legend reflects each line's alpha and size.
I hope it will be useful.
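The positional vectors above depend on H being the eighth level. As a rough alternative, the values can be built as named vectors keyed by school, so the highlighted school is picked by name. In this sketch the names `highlight`, `alpha_vals` and `size_vals` are made up for illustration, the long data is assembled in base R rather than with gather, and the default color palette is used. Recent ggplot2 versions (3.4.0 and later) prefer the linewidth aesthetic over size for lines; size is kept here to match the answer.

```r
library(ggplot2)

schools <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
yr2010 <- c(601, 809, 604, 601, 485, 485, 798, 662, 408, 451)
yr2019 <- c(971, 1056, 1144, 933, 732, 833, 975, 617, 598, 822)

# Long format built by hand: one row per school and year
data2 <- data.frame(
  schools  = rep(schools, times = 2),
  year     = rep(c("yr2010", "yr2019"), each = length(schools)),
  students = c(yr2010, yr2019)
)

# Named vectors: manual scales match values to levels by name
highlight <- "H"  # hypothetical variable naming the school to emphasise
lv <- sort(unique(data2$schools))
alpha_vals <- setNames(ifelse(lv == highlight, 1, 0.3), lv)
size_vals  <- setNames(ifelse(lv == highlight, 1.5, 1), lv)

p <- ggplot(data2, aes(x = year, y = students, group = schools,
                       color = schools, alpha = schools, size = schools)) +
  theme_classic() +
  geom_line() +
  scale_alpha_manual(values = alpha_vals) +
  scale_size_manual(values = size_vals)
```

Changing `highlight` to another school name is then enough to move the emphasis, without recounting positions.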

Scale density plots in ggpairs based on total datapoints?

I'm plotting correlations in ggpairs and am splitting the data based on a filter.
The density plots are normalising themselves on the number of data points in each filtered group. I would like them to normalise on the total number of data points in the entire data set. Essentially, I would like to be able to have the sum of the individual density plots be equal to the density plot of the entire dataset.
I know this probably breaks the definition of "density plot", but this is a presentation style I'd like to explore.
In plain ggplot, I can do this by adding y=..count.. to the aesthetic, but ggpairs doesn't accept x or y aesthetics.
Some sample code and plots:
library(tidyverse)
library(GGally)
set.seed(1234)
group = as.numeric(cut(runif(100),c(0,1/2,1),c(1,2)))
x = rnorm(100,group,1)
x[group == 1] = (x[group == 1])^2
y = (2 * x) + rnorm(100,0,0.1)
data = data.frame(group = as.factor(group), x = x, y = y)
#plot of everything
data %>%
ggplot(aes(x)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I want
data %>%
ggplot(aes(x,y=..count.., fill=group)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I get
data %>%
ggplot(aes(x, fill=group)) +
geom_density(color = "black", alpha = 0.7)
data %>% ggpairs(., columns = 2:3,
mapping = ggplot2::aes(colour=group),
lower = list(continuous = wrap("smooth", alpha = 0.5, size=1.0)),
diag = list(continuous = wrap("densityDiag", alpha=0.5 ))
)
Are there any suggestions that don't involve reformatting the entire dataset?
I am not sure I understand the question, but if the densities of both groups plus the density of the entire data set are to be plotted, it can easily be done by:
1. Getting rid of the grouping aesthetic (in this case, fill) for the overall density.
2. Adding another call to geom_density, this time with inherit.aes = FALSE so that the previous aesthetics are not inherited.
3. Plotting the densities.
library(tidyverse)
data %>%
ggplot(aes(x, y=..count.., fill = group)) +
geom_density(color = "black", alpha = 0.7) +
geom_density(mapping = aes(x, y = ..count..),
inherit.aes = FALSE)
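For the ggpairs part of the question, one possible route is to hand ggpairs a custom function for the diagonal panels. The sketch below assumes that GGally passes `data` and `mapping` to such a function, as it does for its built-in panels; the function name `count_density_diag` is made up here, and after_stat() requires ggplot2 3.3.0 or later. Note that each group gets its own bandwidth, so the group curves will only approximately add up to the overall curve.

```r
library(ggplot2)
library(GGally)

set.seed(1234)
group <- as.numeric(cut(runif(100), c(0, 1/2, 1), c(1, 2)))
x <- rnorm(100, group, 1)
x[group == 1] <- (x[group == 1])^2
y <- (2 * x) + rnorm(100, 0, 0.1)
dat <- data.frame(group = as.factor(group), x = x, y = y)

# Hypothetical custom diagonal panel: density scaled by group size
# (count = density * number of points in the group)
count_density_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(aes(y = after_stat(count)), ...)
}

pm <- ggpairs(dat, columns = 2:3,
              mapping = aes(colour = group),
              lower = list(continuous = wrap("smooth", alpha = 0.5, size = 1.0)),
              diag = list(continuous = wrap(count_density_diag, alpha = 0.5)))
```

The panel function can also be called on its own, for example `count_density_diag(dat, aes(x = x, colour = group))`, to check the scaling before using it in ggpairs.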

How to create a heatmap with continuous scale using ggplot2 in R

I have got a data frame with several 1000 rows in the form of
group = c("gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3")
pos = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
color = c(2,2,2,2,3,3,2,2,3,2,1,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,2,2)
df = data.frame(group, pos, color)
and would like to make a kind of heatmap in which one axis has a continuous scale (position). The color column is categorical. However, due to the large number of data points, I want to use binning, i.e. treat it as a continuous variable.
This is more or less what the plot should look like:
I can't think of a way to create such a plot using ggplot2/R. I have tried several geometries, e.g. geom_point()
ggplot(data=df, aes(x=group, y=pos, color=color)) +
geom_point() +
scale_colour_gradientn(colors=c("yellow", "black", "orange"))
Thanks for your help in advance.
Does this help you?
library(ggplot2)
group = c("gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3","gr3")
pos = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
color = c(2,2,2,2,3,3,2,2,3,2,1,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,2,2)
df = data.frame(group, pos, color)
ggplot(data = df, aes(x = group, y = pos)) + geom_tile(aes(fill = color))
Looks like this
An improved version with a three-color gradient, if you like:
library(scales)
ggplot(data = df, aes(x = group, y = pos)) +
geom_tile(aes(fill = color)) +
scale_fill_gradientn(colours = c("orange", "black", "yellow"),
values = rescale(c(1, 2, 3)),
guide = "colorbar")
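The question also mentions binning because of the large number of positions. One way this could be sketched is to average the color value within fixed-width position bins before tiling. The simulated data, the `bin_width` of 50 and the use of the mean as the summary are all assumptions for illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(1)
# Hypothetical larger data set: 3 groups x 1000 positions
df <- data.frame(
  group = rep(c("gr1", "gr2", "gr3"), each = 1000),
  pos   = rep(1:1000, times = 3),
  color = sample(1:3, 3000, replace = TRUE)
)

# Assign each position to a bin of bin_width consecutive positions
bin_width <- 50
df$bin <- (df$pos - 1) %/% bin_width + 1

# Mean color per group and bin turns the categorical code into a continuous value
binned <- aggregate(color ~ group + bin, data = df, FUN = mean)

p <- ggplot(binned, aes(x = group, y = bin, fill = color)) +
  geom_tile() +
  scale_fill_gradientn(colours = c("orange", "black", "yellow"),
                       limits = c(1, 3)) +
  labs(y = paste0("bin index (", bin_width, " positions each)"))
```

Averaging only makes sense if the numeric codes are ordered in a meaningful way; otherwise a per-bin proportion of one category may be a better summary.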

overlay colored boxplots on parallel coordinate plots with faceting in ggplot2

I have the following example.
require(ggplot2)
# Example Data
x <- data.frame(var1=rnorm(800,0,1),
var2=rnorm(800,0,1),
var3=rnorm(800,0,1),
type=factor(rep(c("x", "y"), length.out=800)),
set=factor(rep(c("A","B","C","D"), each=200))
)
Now, I would like to plot (thin) parallel coordinate plots of these lines, with points for each of the variable values. I would like to overlay a boxplot (each of a different color for each method) on these parallel coordinate plots at the variables values. On top of this, I would like to facet for the groups and types, say using set~type. Is this possible to do using ggplot2?
Any suggestions? Thanks!
You need to put data in long format first. I didn't put in points, since the graph is already cluttered enough, but you can do so by adding a geom_point.
require(tidyr)
x$id <- 1:nrow(x)
x2 <- gather(x, var, value, var1:var3)
Boxplots
ggplot(x2, aes(var, value)) +
geom_line(aes(group = id), size = 0.05, alpha = 0.3) +
geom_boxplot(aes(fill = var), alpha = 0.5) +
facet_grid(set ~ type) +
theme_bw()
Or perhaps violins
Replacing the boxplots with violins looks pretty cool as well.
ggplot(x2, aes(var, value)) +
geom_line(aes(group = id), size = 0.05, alpha = 0.3) +
geom_violin(aes(fill = var), col = NA, alpha = 0.6) +
facet_grid(set ~ type) +
theme_bw()

How to scale density plots (for several variables) in ggplot having melted data

I have a melted data set which also includes data generated from normal distribution. I want to plot empirical density function of my data against normal distribution but the scales of the two produced density plots are different. I could find this post for two separate data sets:
Normalising the x scales of overlaying density plots in ggplot
but I couldn't figure out how to apply it to melted data. Suppose I have a data frame like this:
df<-data.frame(type=rep(c('A','B'),each=100),x=rnorm(200,1,2)/10,y=rnorm(200))
df.m<-melt(df)
using the code below:
qplot(value,data=df.m,col=variable,geom='density',facets=~type)
produces this graph:
How can I make the two densities comparable given the fact that normal distribution is the reference plot? (I prefer to use qplot instead of ggplot)
UPDATE:
I want to produce something like this (i.e. in terms of plot-comparison) but with ggplot2:
plot(density(rnorm(200,1,2)/10),col='red',main=NA) #my data
par(new=T)
plot(density(rnorm(200)),axes=F,main=NA,xlab=NA,ylab=NA) # reference data
which generates this:
Is this what you had in mind?
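There's a built-in computed variable, ..scaled.., that does this automatically: it rescales each density estimate so that its maximum is 1.
library(ggplot2)
library(reshape2)
set.seed(1)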
df<-data.frame(type=rep(c('A','B'),each=100),x=rnorm(200,1,2)/10,y=rnorm(200))
df.m<-melt(df)
ggplot(df.m) +
stat_density(aes(x=value, y=..scaled..,color=variable), position="dodge", geom="line")
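The same idea can be written with the newer after_stat() notation (ggplot2 3.3.0 or later) and with the facets from the question. This is a sketch only: the long data frame is assembled in base R instead of with melt, and `peaks` is a helper added here to inspect the result.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(type = rep(c("A", "B"), each = 100),
                 x = rnorm(200, 1, 2) / 10,
                 y = rnorm(200))

# Long format built by hand (equivalent in shape to melt(df))
df.m <- data.frame(
  type     = rep(df$type, times = 2),
  variable = rep(c("x", "y"), each = nrow(df)),
  value    = c(df$x, df$y)
)

# Each curve is divided by its own maximum, so both variables peak at 1
p <- ggplot(df.m, aes(x = value, colour = variable)) +
  geom_line(aes(y = after_stat(scaled)), stat = "density") +
  facet_wrap(~ type)

# Inspect the computed values: maximum height per panel and variable
ld <- layer_data(p)
peaks <- tapply(ld$y, list(ld$PANEL, ld$group), max)
```

Since every curve is normalised separately, the heights show shape rather than relative frequency, which is what makes the two variables comparable on one axis.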
df<-data.frame(type=rep(c('A','B'),each=100),x = rnorm(200,1,2)/10, y = rnorm(200))
df.m<-melt(df)
require(data.table)
DT <- data.table(df.m)
Insert a new column with the scaled value into DT, then plot.
This is the code that produces the image:
DT <- DT[, scaled := scale(value), by = "variable"]
str(DT)
ggplot(DT) +
geom_density(aes(x = scaled, color = variable)) +
facet_grid(. ~ type)
qplot(data = DT, x = scaled, color = variable,
facets = ~ type, geom = "density")
# Using fill (inside aes) and alpha outside(so you don't get a legend for it)
ggplot(DT) +
geom_density(aes(x = scaled, fill = variable), alpha = 0.2) +
facet_grid(. ~ type)
qplot(data = DT, x = scaled, fill = variable, geom = "density", alpha = 0.2, facets = ~type)
# Histogram
ggplot(DT, aes(x = scaled, fill = variable)) +
geom_histogram(binwidth=.2, alpha=.5, position="identity") +
facet_grid(. ~ type, scales = "free")
qplot(data = DT, x = scaled, fill = variable, alpha = 0.2, facets = ~type)
