Scale density plots in ggpairs based on total datapoints? - r

I'm plotting correlations in ggpairs and am splitting the data based on a filter.
The density plots are normalising themselves on the number of data points in each filtered group. I would like them to normalise on the total number of data points in the entire data set. Essentially, I would like to be able to have the sum of the individual density plots be equal to the density plot of the entire dataset.
I know this probably breaks the definition of "density plot", but this is a presentation style I'd like to explore.
In plain ggplot, I can do this by adding y=..count.. to the aesthetic, but ggpairs doesn't accept x or y aesthetics.
Some sample code and plots:
set.seed(1234)
group = as.numeric(cut(runif(100),c(0,1/2,1),c(1,2)))
x = rnorm(100,group,1)
x[group == 1] = (x[group == 1])^2
y = (2 * x) + rnorm(100,0,0.1)
data = data.frame(group = as.factor(group), x = x, y = y)
#plot of everything
data %>%
ggplot(aes(x)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I want
data %>%
ggplot(aes(x,y=..count.., fill=group)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I get
data %>%
ggplot(aes(x, fill=group)) +
geom_density(color = "black", alpha = 0.7)
data %>% ggpairs(., columns = 2:3,
mapping = ggplot2::aes(colour=group),
lower = list(continuous = wrap("smooth", alpha = 0.5, size=1.0)),
diag = list(continuous = wrap("densityDiag", alpha=0.5 ))
)
Are there any suggestions that don't involve reformatting the entire dataset?

I am not sure I understand the question but if the densities of both groups plus the density of the entire data is to be plotted, it can easily be done by
Getting rid of the grouping aesthetics, in this case, fill.
Placing another call to geom_density but this time with inherit.aes = FALSE so that the previous aesthetics are not inherited.
And then plot the densities.
library(tidyverse)
data %>%
ggplot(aes(x, y=..count.., fill = group)) +
geom_density(color = "black", alpha = 0.7) +
geom_density(mapping = aes(x, y = ..count..),
inherit.aes = FALSE)

Related

ggplot boxplot + jitter plot showing random sampling of data

I'd like to use ggplot to generate a series of boxplots derived from all data within a dataset, but then with jittered points showing a random sampling of the respective data (e.g., 100 data points) to avoid over-plotting (there are thousands of data points). Can anyone please help me with the code for this? The basic framework I have now is below, but I don't know what if any arguments can be added to draw a random sampling of data to display as the jittered points. Thanks for any help.
ggplot(datafile, aes(x=factor(var1), y=var2, fill=var3)) + geom_jitter(size=0.1, position=position_jitter(width=0.3, height=0.2)) + geom_boxplot(alpha=0.5) + facet_grid(.~var3) + theme_bw() + scale_fil_manual(values=c("red", "green", "blue")
You could take a random subset of your data using dplyr:
library(dplyr)
library(ggplot)
ggplot(data = datafile, aes(x = factor(var1), y = var2, fill = var3)) +
geom_jitter(
# use random subset of data
data = datafile %>% group_by(var1) %>% sample_n(100),
aes(x = factor(var1), y = var2, fill = var3)),
size = 0.1,
position = position_jitter(width = 0.3, height = 0.2)) +
geom_boxplot(alpha = 0.5) +
facet_grid(.~var3) +
theme_bw() +
scale_fill_manual(values = c("red", "green", "blue")

Smoothen Heatmap in ggplot

I have a dataframe that looks as follows:
X = c(6,6.2,6.4,6.6,6.8,5.6,5.8,6,6.2,6.4,6.6,6.8,7,7.2,7.4,7.6,7.8,8,2.8,3,3.2,3.4,3.6,3.8,4,4.2,4.4,4.6,4.8,5)
Y = c(2.2,2.2,2.2,2.2,2.2,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8,2.8)
Value = c(0,0.00683254,0,0.007595654,0.015517884,0,0,0,0,0,0,0,0,0,0.005219395,0,0,0,0,0,0,0,0,0,0,0,0.002892342,0,0.002758141,0)
table = data.frame(X, Y, Value)
I have put together a heatmap in R, based on the following command:
ggplot(data = table, mapping = aes(x = X, y = Y)) +
geom_tile(aes(fill = Value), colour = 'black') +
theme_void() +
scale_fill_gradient2(low = "white", high = "black") + xlab(label = "X") + ylab(label = "Y")
Since there is not a value for every X and Y, it leads to plots that appear as follows.
I am attempting to smoothen the plot and have the following question:
As there are small white spaces between the plotted values, how could one color these white spaces to be the median intensity? Said differently, how would I first create an initial layer with non-zero median 'Value' before plotting the non-zero 'Value' on top (overlayed)?
A sample is shown below, which has been 'smoothed', which looks closer to the desired output.
I'm not sure if it will totally fit your need but from my understanding you have some missing values and combination of X and Y.
So, you can use complete function from tidyr to get all different combinations of X and Y (those without values will be filled with NA) and then by using na.value argument in scale_fill_gradient2 function, you can set the values of these NA values to the same color of the midpoint value:
library(tidyr)
library(dplyr)
library(ggplot2)
table %>% complete(X,Y) %>%
ggplot(aes(x = X, y = Y))+
geom_raster(aes(fill = Value), interpolate = TRUE)+
scale_fill_gradient2(low = "white", mid = "grey",high = "black",
na.value = "grey")
Does it answer your question ?

Add legend using geom_point and geom_smooth from different dataset

I really struggle to set the correct legend for a geom_point plot with loess regression, while there is 2 data set used
I got a data set, who is summarizing activity over a day, and then I plot on the same graph, all the activity per hours and per days recorded, plus a regression curve smoothed with a loess function, plus the mean of each hours for all the days.
To be more precise, here is an example of the first code, and the graph returned, without legend, which is exactly what I expected:
# first graph, which is given what I expected but with no legend
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = 20, size = 3) +
geom_smooth(method = "loess", span = 0.2, color = "red", fill = "blue")
and the graph (in grey there is all the data, per hours, per days. the red curve is the loess regression. The blue dots are the means for each hours):
When I tried to set the legend I failed to plot one with the explanation for both kind of dots (data in grey, mean in blue), and the loess curve (in red). See below some example of what I tried.
# second graph, which is given what I expected + the legend for the loess that
# I wanted but with not the dot legend
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = "blue", size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_identity(name = "legend model", guide = "legend",
labels = "loess regression \n with confidence interval")
I obtained the good legend for the curve only
and another trial :
# I tried to combine both date set into a single one as following but it did not
# work at all and I really do not understand how the legends works in ggplot2
# compared to the normal plots
A <- rbind(dat1, dat2)
p <- ggplot(A, aes(x = Heure, y = value, color = variable)) +
geom_point(data = subset(A, variable == "data"), size = 1) +
geom_point(data = subset(A, variable == "Moy"), size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_manual(name = "légende",
labels = c("Data", "Moy", "loess regression \n with confidence interval"),
values = c("darkgray", "royalblue", "red"))
It appears that all the legend settings are mixed together in a "weird" way, the is a grey dot covering by a grey line, and then the same in blue and in red (for the 3 labels). all got a background filled in blue:
If you need to label the mean, might need to be a bit creative, because it's not so easy to add legend manually in ggplot.
I simulate something that looks like your data below.
dat1 = data.frame(
Hour = rep(1:24,each=10),
value = c(rnorm(60,0,1),rnorm(60,2,1),rnorm(60,1,1),rnorm(60,-1,1))
)
# classify this as raw data
dat1$Data = "Raw"
# calculate mean like you did
dat2 <- dat1 %>% group_by(Hour) %>% summarise(value=mean(value))
# classify this as mean
dat2$Data = "Mean"
# combine the data frames
plotdat <- rbind(dat1,dat2)
# add a dummy variable, we'll use it later
plotdat$line = "Loess-Smooth"
We make the basic dot plot first:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide=FALSE)
Note with the size, we set guide to FALSE so it will not appear. Now we add the loess smooth, one way to introduce the legend is to introduce a linetype, and since there's only one group, you will have just one variable:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide=FALSE)+
geom_smooth(data=subset(plotdat,Data="Raw"),
aes(linetype=line),size=1,alpha=0.3,
method = "loess", span = 0.2, color = "red", fill = "blue")

ggplot Loess Line Color Scale from 3rd Variable

I am trying to apply a color scale to a loess line based on a 3rd variable (Temperature). I've only been able to get the color to vary based on either the variable in the x or y axis.
set.seed(1938)
a2 <- data.frame(year = seq(0, 100, length.out = 1000),
values = cumsum(rnorm(1000)),
temperature = cumsum(rnorm(1000)))
library(ggplot2)
ggplot(a2, aes(x = year, y = values, color = values)) +
geom_line(size = 0.5) +
geom_smooth(aes(color = ..y..), size = 1.5, se = FALSE, method = 'loess') +
scale_colour_gradient2(low = "blue", mid = "yellow", high = "red",
midpoint = median(a2$values)) +
theme_bw()
This code produces the following plot, but I would like the loess line color to vary based on the temperature variable instead.
I tried using
color = loess(temperature ~ values, a2)
but I got an error of
"Error: Aesthetics must be either length 1 or the same as the data (1000): colour, x, y"
Thank you for any and all help! I appreciate it.
You can't do that when you calculate the loess with a geom_smooth since it only has access to:
..y.. which is the vector of y-values internally calculated by geom_smooth to create the regression curve"
Is it possible to apply color gradient to geom_smooth with ggplot in R?
To do this, you should calculate the loess curve manually with loess and then plot it with geom_line:
set.seed(1938)
a2 <- data.frame(year = seq(0,100,length.out=1000),
values = cumsum(rnorm(1000)),
temperature = cumsum(rnorm(1000)))
# Calculate loess curve and add values to data.frame
a2$predict <- predict(loess(values~year, data = a2))
ggplot(a2, aes(x = year, y = values)) +
geom_line(size = 0.5) +
geom_line(aes(y = predict, color = temperature), size = 2) +
scale_colour_gradient2(low = "blue", mid = "yellow" , high = "red",
midpoint=median(a2$values)) +
theme_bw()
The downside of this is that it won't fill in gaps in your data as nicely as geom_smooth

Plot Gaussian Mixture in R using ggplot2

I'm approximating a distribution with gaussian mixtures and was wondering whether there was an easy way to automatically plot the estimated kernel density of the whole (uni-dimensional) dataset as the sum of the component densities in a nice fashion like this using ggplot2:
Given the following example data, my approach in ggplot2 would be to manually plot the subset densities into the scaled overall density like this:
#example data
a<-rnorm(1000,0,1) #component 1
b<-rnorm(1000,5,2) #component 2
d<-c(a,b) #overall data
df<-data.frame(d,id=rep(c(1,2),each=1000)) #add group id
##ggplot2
require(ggplot2)
ggplot(df) +
geom_density(aes(x=d,y=..scaled..)) +
geom_density(data=subset(df,id==1), aes(x=d), lty=2) +
geom_density(data=subset(df,id==2), aes(x=d), lty=4)
Note that this does not work out regarding the scales. It also does not work when you scale all 3 densities or no density at all. So I was not able to replicate above plot.
In addition, I am not able to automatically generate this plot without having to subset manually. I tried using position = "stacked" as parameter in geom_density.
I usually have around 5-6 Components per dataset, so manually subsetting would be possible. However, I would like to have different colors or line-types per component density which are displayed in the legend of ggplot, so doing all subsets manually would increase the workload quite a bit.
Any ideas?
Thanks!
Here is a possible solution by specifying each density in the aes call with position = "identity" in one layer and in the second layer using stacked density without the legend.
ggplot(df) +
stat_density(aes(x = d, linetype = as.factor(id)), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = as.factor(id)), position = "identity", geom = "line")
Do note that when using more then two groups:
a <- rnorm(1000, 0, 1)
b <- rnorm(1000, 5, 2)
c <- rnorm(1000, 3, 2)
d <- rnorm(1000, -2, 1)
d <- c(a, b, c, d)
df <- data.frame(d, id = as.factor(rep(c(1, 2, 3, 4), each = 1000)))
curves for each stack appear (this is a problem with the two group example but linetype in first layer disguised it - use group instead to check) :
gplot(df) +
stat_density(aes(x = d, group = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = id), position = "identity", geom = "line")
A relatively easy fix to this is to add alpha mapping and manually set it to 0 for unwanted curves:
ggplot(df) +
stat_density(aes(x=d, alpha = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x=d, linetype = id), position = "identity", geom = "line")+
scale_alpha_manual(values = c(1,0,0,0))

Resources