GGplot with multiple/conditional scale_fill_manual for stat_density_ridges - r

Here's a question for ggplot experts...
My dataset has 432000 observations of 4 variables (one is numeric, the others are factors). Predictors has 6 levels, Estimate has 4 levels, Model has 2 levels. Value has a max of 2.6 and a min of -3. (I hope you can create data with that information.)
The plot set-up is a 4x6 faceted plot; here is a 2x3 example of the plot:
each row is a different level of a factor (Predictors)
each column a different level of another factor (Estimate)
there are two distributions within each mini-plot which is another factor (Model)
The goal is to plot:
the distributions in each column in a different color (blue, green, red, yellow) (according to Estimate)
within each mini-plot, the shade/hue of that color should be different (e.g., within the green column, repeat the order of colors according to Model)
fill the tails of two quantiles on each distribution of each mini-plot (as the tail lines in the picture indicate; color the tail from each line to the end of the tail in black/gray). The tails can be the same throughout the entire plot.
Here's an example of the code that I'm using. It doesn't plot the quantiles in a separate color:
pp <- ggplot(dd, aes(x = Value, y = as.factor(Model), fill = factor(Model))) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = c(0.05, 0.95),
                      alpha = 0.95, vline_size = 0.5) +
  scale_fill_manual(values = c("red", "white")) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.5) +
  facet_grid(Predictors ~ Estimate, scales = "free") +
  labs(x = "Parameter value", y = " ") +
  theme(text = element_text(size = 16)) +
  theme(axis.title = element_text(face = "bold"),
        strip.text = element_text(size = 16)) +
  theme(legend.position = "none")
To color the quantiles, you can swap fill=factor(Model) with fill=factor(..quantile..), but getting both "fills" in the same plot has been impossible thus far. Among many other things, I tried entering multiple factors into "fill", like this: fill=c(factor(Model), factor(Estimate), ..quantile..) , but it didn't work.
Any ideas?

I think from your description your data looks a bit like this (though I've limited it to 6000 rows):
set.seed(69)
Value <- rnorm(6000)
Predictors <- factor(rep(LETTERS[1:6], each = 1000))
Estimate <- factor(rep(rep(letters[1:4], each = 250), 6))
Model <- factor(rep(rep(c("Model1", "Model2"), each = 125), 24))
Value <- Value + rep(rnorm(6), each = 1000)
Value <- Value + rep(rep(rnorm(4), each = 250), 6)
Value <- Value + rep(rep(rnorm(2), each = 125), 24)
dd <- data.frame(Value, Predictors, Estimate, Model)
It sounds like most of what you want to do can be achieved by creating a new factor variable that is a conjunction of two other factors:
dd$fill_factor <- as.factor(paste0(Model, Estimate))
Which means that we should get close to the desired effect with minimal changes to your code:
library(ggplot2)
library(ggridges)
my_colors <- c("#0000FF", "#00FF00", "#FF0000", "#FFFF00")
ggplot(dd, aes(x = Value, y = Model, fill = fill_factor)) +
  stat_density_ridges(quantile_lines = TRUE,
                      quantiles = c(0.05, 0.95),
                      alpha = 0.95,
                      vline_size = 0.5) +
  scale_fill_manual(values = c(gsub("0", "6", my_colors),
                               gsub("F", "A", my_colors))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.5) +
  facet_grid(Predictors ~ Estimate, scales = "free") +
  labs(x = "Parameter value", y = " ") +
  theme(text = element_text(size = 16),
        axis.title = element_text(face = "bold"),
        strip.text = element_text(size = 16),
        legend.position = "none")
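The colours passed to scale_fill_manual() in this answer are matched to the levels of fill_factor by position, so they depend on the alphabetical order of the pasted labels. A minimal sketch of a more explicit alternative is a named colour vector. It assumes the Estimate levels a to d and the Model1/Model2 labels from the example data; lighten() is a hypothetical helper written for this illustration, not part of ggplot2 or ggridges.

```r
# Base colour per Estimate level (levels a-d come from the example data)
base_colors <- c(a = "#0000FF", b = "#00FF00", c = "#FF0000", d = "#FFFF00")

# Hypothetical helper: mix a colour with white.
# amount = 0 keeps the colour unchanged, amount = 1 gives pure white.
lighten <- function(col, amount) {
  rgb_vals <- grDevices::col2rgb(col) / 255
  mixed <- rgb_vals + (1 - rgb_vals) * amount
  grDevices::rgb(mixed[1, ], mixed[2, ], mixed[3, ])
}

# One entry per Model/Estimate combination, named like the levels of fill_factor
fill_values <- c(
  setNames(lighten(base_colors, 0.3), paste0("Model1", names(base_colors))),
  setNames(lighten(base_colors, 0.6), paste0("Model2", names(base_colors)))
)
fill_values
```

The result can then be supplied as scale_fill_manual(values = fill_values), in which case each level picks up its colour by name rather than by position.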

Related

How do I Facet_wrap without repeating coordinate values?

I am trying to create a plot that has a discrete variable on one of the axes. I also want to group the values of this variable by another variable and show the groups in a graph using ggplot2. The code I have used is as follows:
size_vs_paper %>%
ggplot(aes(x=Reference,y=S_Max)) +
theme_classic()+
geom_segment(aes(xend =Reference,yend = S_Min),size=0.5) +
#geom_text(size = 5, vjust=-3) +
geom_point(aes(group = Environment), size = 3, shape = "|", color = "black", alpha = 0.7)+
geom_point(aes(y=S_Min, group = Environment), size = 3, shape = "|", color = "black", alpha = 0.7) +
geom_point(aes(y=S_Mean, group = Environment), size = 3, color = "black", alpha = 0.7) +
facet_wrap(Environment ~ ., ncol = 1) +
scale_colour_brewer(palette="Set2") +
scale_y_log10(breaks=c(0.01,1,10,100,1000, 5000, 10000)) +
theme(axis.text = element_text(size = 10),
legend.position = "none") +
coord_flip() +
labs(y = "Size (µm)")
This yields the graph as below:
As you can see, the Y axis representing the references repeats itself for all the facets. I am looking to facet them without the references repeating themselves. Any guidance will be appreciated!
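One possible approach, offered as a sketch rather than a tested answer for this data set: map Reference to y directly instead of using coord_flip(), and let the discrete axis vary per facet with scales = "free_y". The data frame below is a hypothetical stand-in for size_vs_paper, with invented references and sizes; only the column names are taken from the question.

```r
library(ggplot2)

# Hypothetical stand-in for size_vs_paper (values invented for illustration)
size_vs_paper <- data.frame(
  Reference   = c("Paper A", "Paper B", "Paper C", "Paper D"),
  Environment = c("Marine", "Marine", "Freshwater", "Freshwater"),
  S_Min  = c(0.1, 1, 5, 20),
  S_Mean = c(1, 10, 50, 200),
  S_Max  = c(10, 100, 500, 2000)
)

# Reference goes on y directly, so no coord_flip() is needed;
# scales = "free_y" lets each facet show only its own references
p <- ggplot(size_vs_paper, aes(y = Reference)) +
  geom_segment(aes(x = S_Min, xend = S_Max, yend = Reference)) +
  geom_point(aes(x = S_Mean), size = 3, alpha = 0.7) +
  facet_wrap(~ Environment, ncol = 1, scales = "free_y") +
  scale_x_log10() +
  theme_classic() +
  labs(x = "Size (µm)", y = NULL)
p
```

With a free discrete scale, levels that do not occur in a facet are left off that facet's axis, so each panel lists only the references belonging to its environment.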

How to remove zig-zag pattern in marginal distribution plot of integer values in R?

I am including marginal distribution plots on a scatterplot of a continuous and an integer variable. However, in the marginal distribution plot of the integer variable (y-axis) a zig-zag pattern shows up because the y-values are all integers. Is there any way to increase the "width" (not sure that's the right term) of the bins/values the function calculates the distribution density over?
The goal is to get rid of that zig-zag pattern that develops because the y-values are integers.
library(GlmSimulatoR)
library(ggplot2)
library(patchwork)
### Create right-skewed dataset that has one continuous variable and one integer variable
set.seed(123)
df1 <- data.frame(matrix(ncol = 2, nrow = 1000))
x <- c("int","cont")
colnames(df1) <- x
df1$int <- round(rgamma(1000, shape = 1, scale = 1),0)
df1$cont <- round(rgamma(1000, shape = 1, scale = 1),1)
p1 <- ggplot(data = df1, aes(x = cont, y = int)) +
geom_point(shape = 21, size = 2, color = "black", fill = "black", stroke = 1, alpha = 0.4) +
xlab("Continuous Value") +
ylab("Integer Value") +
theme_bw() +
theme(panel.grid = element_blank(),
text = element_text(size = 16),
axis.text.x = element_text(size = 16, color = "black"),
axis.text.y = element_text(size = 16, color = "black"))
dens1 <- ggplot(df1, aes(x = cont)) +
geom_density(alpha = 0.4) +
theme_void() +
theme(legend.position = "none")
dens2 <- ggplot(df1, aes(x = int)) +
geom_density(alpha = 0.4) +
theme_void() +
theme(legend.position = "none") +
coord_flip()
dens1 + plot_spacer() + p1 + dens2 +
plot_layout(ncol = 2, nrow = 2, widths = c(6,1), heights = c(1,6))
From ?geom_density:
adjust: A multiplicate [sic] bandwidth adjustment. This makes it possible
to adjust the bandwidth while still using the a bandwidth
estimator. For example, ‘adjust = 1/2’ means use half of the
default bandwidth.
So as a start try e.g. geom_density(..., adjust = 2) (bandwidth twice as wide as default) and go from there.
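To make the effect of adjust concrete, here is a small sketch that reuses the integer variable from the question. It compares the bandwidth chosen by stats::density() with and without adjust = 2, then applies the same setting to the marginal plot. The value 2 is only a starting point, and the object names are made up for the example.

```r
library(ggplot2)

set.seed(123)
int_vals <- round(rgamma(1000, shape = 1, scale = 1), 0)

# density() multiplies the estimated bandwidth by `adjust`
d_default <- density(int_vals)
d_wide    <- density(int_vals, adjust = 2)
c(default = d_default$bw, adjust_2 = d_wide$bw)

# Same setting in the marginal plot from the question
dens2_wide <- ggplot(data.frame(int = int_vals), aes(x = int)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme_void() +
  coord_flip()
dens2_wide
```

If the curve is still wavy, increase adjust further; if it becomes too smooth and hides the skew, reduce it.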

Kernel Density Estimate (Probability Density Function) is wrong?

I've created a histogram to show the density of the age at which serial killers first killed and have tried to superimpose a probability density function on this. However, when I use the geom_density() function in ggplot2, I get a density function that looks far too small (area < 1). What is strange is that changing the bin width of the histogram also seems to change the density function (the smaller the bin width, the better the density function appears to fit). I was wondering if anyone had some guidance on how to make this function fit better, and why its area is so far below 1?
#Histograms for Age of First Kill:
library(ggplot2)
AFKH <- ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
geom_histogram(aes(y=..count../sum(..count..)), show.legend = FALSE, binwidth = 3) + # density wasn't working, so had to use the ..count/../sum(..count..)
scale_fill_discrete(h = c(200, 10), c = 100, l = 60) + # h = hue range (sets the color gradient), c = chroma, l = luminance
theme(axis.title=element_text(size=22,face="bold"),
plot.title = element_text(size=30, face = "bold"),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14)) +
labs(title = "Age of First kill",x = "Age of First Kill", y = "Density")+
geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7, fill = "white",lwd =1, stat = "density")
AFKH
We don't have your data set, so let's make one that's reasonably close to it:
set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)
The first thing to notice is that the density curve doesn't change. Look carefully at the y axis on your plot. You will notice that the peak of the density curve doesn't change, but remains at about 0.06. It's the height of the histogram bars that change, and the y axis changes accordingly.
The reason for this is that you aren't dividing the height of the histogram bars by their width to preserve their area. Your y aesthetic should be ..count../sum(..count..)/binwidth to keep this constant.
To show this, let's wrap your plotting code in a function that allows you to specify the bin width but also takes the binwidth into account when plotting:
draw_it <- function(bw) {
ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
geom_histogram(aes(y=..count../sum(..count..)/bw), show.legend = FALSE,
binwidth = bw) +
scale_fill_discrete(h = c(200, 10), c = 100, l = 60) +
theme(axis.title=element_text(size=22,face="bold"),
plot.title = element_text(size=30, face = "bold"),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14)) +
labs(title = "Age of First kill",x = "Age of First Kill", y = "Density") +
geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7,
fill = "white",lwd =1, stat = "density")
}
And now we can do:
draw_it(bw = 1)
draw_it(bw = 3)
draw_it(bw = 7)
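The area argument can also be checked numerically without plotting. This is a sketch in base R on the simulated data above; area_for() is a helper name invented for this example, and the bins are built with seq() rather than taken from ggplot2, so the bin edges will not match the plots exactly.

```r
set.seed(3)
age <- rgamma(100, 3, 0.2) + 10

# Total area of the histogram bars for a given bin width.
# per_width = FALSE uses count / sum(count) as the bar height;
# per_width = TRUE additionally divides by the bin width.
area_for <- function(bw, per_width) {
  breaks  <- seq(floor(min(age)), ceiling(max(age)) + bw, by = bw)
  counts  <- hist(age, breaks = breaks, plot = FALSE)$counts
  heights <- counts / sum(counts)
  if (per_width) heights <- heights / bw
  sum(heights * bw)
}

sapply(c(1, 3, 7), area_for, per_width = FALSE)  # area grows with the bin width
sapply(c(1, 3, 7), area_for, per_width = TRUE)   # area stays at 1
```

Without the division, the bars cover an area equal to the bin width, which is why the fixed-area density curve looks too small next to wide bins.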

Error message when trying to add extra geom_point layer to ggplot lineplot

I am trying to make a line graph using the following code:
ggplot(out2, aes(factor(out2$term, levels=unique(as.character(out2$term)) ),estimate, group = 1)) +
geom_line(aes(group = 1), size = 1.2) +
mytheme2 +
geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 2) +
scale_shape(solid = FALSE) +
theme(axis.text.x = element_text(angle = 50, hjust = 1, size = 15, family = "serif")) +
scale_x_discrete(labels = labels1) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_ribbon(data=out2,aes(ymin=conf.low,ymax=conf.high),alpha=0.1)
Which gives me this graph:
However, based on a variable in the data frame called p.val I would like to add one asterisk if the value of p.val is less than .05, and two asterisks if the value is less than .001.
I tried to add a line at the bottom of the code to achieve this:
ggplot(out2, aes(factor(out2$term, levels=unique(as.character(out2$term)) ),estimate, group = 1)) +
geom_line(aes(group = 1), size = 1.2) +
mytheme2 +
geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 2) +
scale_shape(solid = FALSE) +
theme(axis.text.x = element_text(angle = 50, hjust = 1, size = 15, family = "serif")) +
scale_x_discrete(labels = labels1) +
#labs(y= "Standardized regression coefficient", x = "TAT threshold (Lux) minutes") +
#labs(title = "Sensitivity Analyses showing standardized regression coefficients for models with a range of \nTAT Light Thresholds (lux), Sleep Quality, Activity Level and BMI predicting T1 Hyperactivity.") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_ribbon(data=out2,aes(ymin=conf.low,ymax=conf.high),alpha=0.1) +
geom_point(data=out2[out2$p.value > 0.05,], color="red", size=3)
However, this gives me the error message:
Error: Aesthetics must be either length 1 or the same as the data (6): x, y, group
The last geom_point() layer receives a data frame that is a smaller subset of out2, but the x aesthetic is still written as out2$term, so it always has as many values as the full out2. ggplot cannot match those values to the rows of the subset, hence the error.
It might be easier if you built a column in your data frame for the significance label first and then used geom_text() to layer it on instead of geom_point().
out2$signif_label <- ifelse(out2$p.value < .05, "*", "")
out2$signif_label <- ifelse(out2$p.value < .001, "**", out2$signif_label)
Then add this instead of the last geom_point():
geom_text(aes(label = signif_label), color = "red", size = 3)
If you assign data in the initial ggplot(data = ,...) call then all subsequent layers will try to inherit the same data, so we don't need to assign it again, unless it's different.
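Putting the pieces together, a minimal sketch might look like the following. The out2 data frame here is invented for illustration (only the column names come from the question), mytheme2 and labels1 are left out because they are not shown, and the aesthetics use bare column names instead of out2$term so that every layer can inherit them.

```r
library(ggplot2)

# Hypothetical stand-in for out2 (values invented for illustration)
out2 <- data.frame(
  term      = c("t10", "t100", "t500", "t1000"),
  estimate  = c(0.05, 0.20, 0.35, 0.10),
  conf.low  = c(-0.10, 0.02, 0.15, -0.08),
  conf.high = c(0.20, 0.38, 0.55, 0.28),
  p.value   = c(0.2, 0.03, 0.0005, 0.6)
)

# Fix the display order once, in the data, rather than inside aes()
out2$term <- factor(out2$term, levels = unique(out2$term))

# "**" below .001, "*" below .05, otherwise no label
out2$signif_label <- ifelse(out2$p.value < .001, "**",
                            ifelse(out2$p.value < .05, "*", ""))

p <- ggplot(out2, aes(term, estimate, group = 1)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1) +
  geom_line() +
  geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 2) +
  geom_text(aes(label = signif_label), color = "red", size = 6, vjust = -1)
p
```

Because the labels live in the same data frame as the estimates, no layer needs its own data argument.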

R - ggplot histogram with colors based on range

I've written a function that produces a histogram with two vertical lines indicating a range of values. I'd like to modify this function so that the bars within the specified range are a different color.
Here's my function and a quick demonstration:
require(ggplot2)
niceHist <- function(data, cutpoint1, cutpoint2, title = "Supply a title, genius") {
temp_dat = data.frame(Data = data, Col = 0)
temp_dat = temp_dat[! is.na(temp_dat$Data),]
temp_dat[temp_dat$Data >= cutpoint1 & temp_dat$Data <= cutpoint2,]$Col = 1
my_hist = qplot(data) +
geom_histogram(fill = "forestgreen") +
geom_vline(xintercept = cutpoint1) +
geom_vline(xintercept = cutpoint2) +
ggtitle(paste(title)) +
theme_minimal() +
theme(text = element_text(size = 16), axis.line.y = element_line(color = "black", size = 0.5), axis.line.x = element_line(color = "black", size = 0.5))
my_hist
}
u = rnorm(100)
c1 = mean(u) - sd(u)
c2 = mean(u) + sd(u)
niceHist(u, c1, c2)
I've seen a similar question whose accepted solution is not well-suited to my needs because I want to maintain the shape of the original histogram. I'd also prefer not to change the number of bins and, if at all possible, apply the color difference such that a single bar in the histogram can be two colors if the vertical black lines happen to bisect it.
My main goal is to clearly display how much of the distribution is captured within the supplied range without changing the shape of the distribution. So an alternative but less desirable solution would be to simply color the background with respect to the range provided. Also, I need my function to return a ggplot object because it will occasionally be further modified using ggplot syntax.
UPDATE:
At the suggestion of a comment, I have tried using scale_fill_gradientn but this doesn't work:
niceHist <- function(data, cutpoint1, cutpoint2, title = "Supply a title, genius") {
temp_dat = data.frame(Data = data, Col = 0)
temp_dat = temp_dat[! is.na(temp_dat$Data),]
temp_dat[temp_dat$Data >= cutpoint1 & temp_dat$Data <= cutpoint2,]$Col = 1
my_hist = qplot(data) +
geom_histogram() +
scale_fill_gradientn(colours = c("blue", "red", "red", "blue"), values = c(min(data, na.rm = TRUE), cutpoint1, cutpoint2, max(data, na.rm = TRUE))) +
geom_vline(xintercept = cutpoint1) +
geom_vline(xintercept = cutpoint2) +
ggtitle(paste(title)) +
theme_minimal() +
theme(text = element_text(size = 16), axis.line.y = element_line(color = "black", size = 0.5), axis.line.x = element_line(color = "black", size = 0.5))
my_hist
}
My solution is to recreate the data frame of bin counts and add a factor indicating which region each bin falls in. For example:
df <- ggplot_build(niceHist(u,c1,c2))$data[[1]] #recreate the count df
require(dplyr)
df <- df %>% mutate(col=cut(x,c(min(x)-0.001,c1,c2,max(x)+0.001)))
ggplot(df,aes(x,count))+ geom_col(aes(fill=col)) + geom_vline(xintercept = c(c1,c2))
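The same idea can be wrapped in a function that returns a ggplot object, as the question asks. This is only a sketch under a few assumptions: binRange() and rangeHist() are names invented here, the bins are computed with hist() on 30 equal-width intervals (so the edges may differ slightly from ggplot2's default binning), and each bar is coloured as a whole according to its midpoint, so a bar crossed by a cutpoint is not split into two colours.

```r
library(ggplot2)

# Bin the data and flag the bins whose midpoint lies inside the range
binRange <- function(data, cutpoint1, cutpoint2, bins = 30) {
  data <- data[!is.na(data)]
  breaks <- seq(min(data), max(data), length.out = bins + 1)
  h <- hist(data, breaks = breaks, plot = FALSE)
  data.frame(
    x        = h$mids,
    count    = h$counts,
    width    = diff(h$breaks),
    in_range = h$mids >= cutpoint1 & h$mids <= cutpoint2
  )
}

# Draw the pre-binned counts; the result is an ordinary ggplot object
rangeHist <- function(data, cutpoint1, cutpoint2, title = "", bins = 30) {
  bars <- binRange(data, cutpoint1, cutpoint2, bins)
  ggplot(bars, aes(x, count, fill = in_range)) +
    geom_col(width = bars$width[1], show.legend = FALSE) +
    scale_fill_manual(values = c("TRUE" = "forestgreen", "FALSE" = "grey70")) +
    geom_vline(xintercept = c(cutpoint1, cutpoint2)) +
    ggtitle(title) +
    theme_minimal()
}

set.seed(1)
u <- rnorm(100)
rangeHist(u, mean(u) - sd(u), mean(u) + sd(u), title = "Within one SD")
```

Further ggplot layers or theme changes can be added to the returned object with + as usual.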
