ggplot2 density plot for data sets of different sizes in R

I have two data sets, of sizes 500 and 1000, and I want to plot the density of both in one plot.
I searched Google and found these threads:
r-geom-density-values-in-y-axis
ggplot2-plotting-two-or-more-overlapping-density-plots-on-the-same-graph/
The data sets in the threads above all have the same size:
df <- data.frame(x = rnorm(1000, 0, 1), y = rnorm(1000, 0, 2), z = rnorm(1000, 2, 1.5))
But if the data sets have different sizes, I assume I need to normalize the data first in order to compare their densities.
Is it possible to make a density plot from data sets of different sizes in ggplot2?

By default, all densities are scaled to unit area. If you have two datasets with different amounts of data, you can plot them together like so:
library(ggplot2)
df1 <- data.frame(x = rnorm(1000, 0, 2))
df2 <- data.frame(y = rnorm(500, 1, 1))
# each layer gets its own data set; both curves integrate to 1
ggplot() +
  geom_density(data = df1, aes(x = x),
               fill = "#E69F00", color = "black", alpha = 0.7) +
  geom_density(data = df2, aes(x = y),
               fill = "#56B4E9", color = "black", alpha = 0.7)
However, from your latest comment, I take it that's not what you want. Instead, you want the area under each density curve to be scaled relative to the amount of data in each group. You can do that by mapping y to the ..count.. computed variable:
df1 <- data.frame(x = rnorm(1000, 0, 2), label=rep('df1', 1000))
df2 <- data.frame(x = rnorm(500, 1, 1), label=rep('df2', 500))
df=rbind(df1, df2)
ggplot(df, aes(x, y=..count.., fill=label)) +
geom_density(color = "black", alpha = 0.7) +
scale_fill_manual(values = c("#E69F00", "#56B4E9"))
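In ggplot2 3.4.0 and later, the ..count.. notation triggers a deprecation warning, and after_stat(count) is the documented replacement. Below is a minimal sketch of the same plot in that notation, followed by a rough numeric check that the area under each curve comes out close to the group size. The seed, the trapezoid_area helper, and the check itself are my own additions for illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only for reproducibility
df1 <- data.frame(x = rnorm(1000, 0, 2), label = "df1")
df2 <- data.frame(x = rnorm(500, 1, 1), label = "df2")
df <- rbind(df1, df2)

# after_stat(count) is density multiplied by the number of points in the group
p <- ggplot(df, aes(x, y = after_stat(count), fill = label)) +
  geom_density(color = "black", alpha = 0.7) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"))

# hypothetical helper: trapezoid rule over the evaluated curve
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

curves <- layer_data(p, 1)  # the data ggplot2 actually draws for layer 1
areas <- sapply(split(curves, curves$group),
                function(g) trapezoid_area(g$x, g$y))
print(areas)  # should be roughly the group sizes (1000 and 500)
```

The areas are only approximate because the kernel estimate is evaluated on a finite grid over the observed data range.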

Related

Overlaid histograms in R (ggplot2) with percentage value within each group

The code below produces three overlaid histograms for each of the groups A, B, C in my dataset:
library(ggplot2)
set.seed(97531)
data <- data.frame(values = c(rnorm(1000, 5, 3),
rnorm(1000, 7, 2),
runif(1000, 8, 11)),
group = c(rep("A", 1000),
rep("B", 1000),
rep("C", 1000)))
ggplot(data, aes(x = values, y=100*(..count..)/sum(..count..), fill = group)) +
geom_histogram(position = "identity", alpha = 0.3, bins = 50)+
ylab("percent")
However, the y axis measures the frequency of a given x value within the entire sample (i.e. groups A + B + C), while I want the y axis to measure the frequency within each subgroup. In other words, I would like to obtain the same result of three overlaid histograms for three different dataframes, one for each group A, B and C.
We could subset the data:
ggplot(data,aes(x=values)) +
geom_histogram(data=subset(data,group == 'A'),fill = "red", alpha = 0.2) +
geom_histogram(data=subset(data, group == 'B'),fill = "blue", alpha = 0.2) +
geom_histogram(data=subset(data, group == 'C'),fill = "green", alpha = 0.2)
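The subsetting approach draws counts per group rather than percentages. If percentages within each group are wanted, one possible sketch (my assumption, not from the answer above) uses the density and width variables that stat_bin computes per group: density times bin width is the share of that group's observations falling in the bin. It assumes ggplot2 3.3.0 or later for after_stat().

```r
library(ggplot2)

set.seed(97531)
data <- data.frame(
  values = c(rnorm(1000, 5, 3), rnorm(1000, 7, 2), runif(1000, 8, 11)),
  group  = rep(c("A", "B", "C"), each = 1000)
)

# density * width = proportion of the group's points in each bin
p <- ggplot(data, aes(x = values,
                      y = after_stat(100 * density * width),
                      fill = group)) +
  geom_histogram(position = "identity", alpha = 0.3, bins = 50) +
  ylab("percent within group")

# sanity check: the bars of each group should add up to 100
bars <- layer_data(p, 1)
totals <- tapply(bars$y, bars$group, sum)
print(totals)
```

Because the normalization happens per group inside the stat, this keeps a single layer and a single legend.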

Overlay a Normal Density Plot On Top of Data ggplot2

I plot a density curve using ggplot2. After I plot the data, I would like to add a normal density plot right on top of it with a fill.
Currently, I am using rnorm() to create the data but this is not efficient and would work poorly on small data sets.
library(tidyverse)
library(reshape2)  # provides melt(), which tidyverse does not load
#my data that I want to plot
my.data = rnorm(1000, 3, 10)
#create the normal density plot to overlay the data
overlay.normal = rnorm(1000, 0, 5)
all = tibble(my.data = my.data, overlay.normal = overlay.normal)
all = melt(all)
ggplot(all, aes(value, fill = variable))+geom_density()
The goal would be to plot my data and overlay a normal distribution on top of it (with a fill). Something like:
ggplot(my.data)+geom_density()+add_normal_distribution(mean = 0, sd = 5, fill = "red")
Here's an approach using stat_function to define a normal curve and draw it within the ggplot call.
ggplot(my.data %>% enframe(), aes(value)) +
geom_density(fill = "mediumseagreen", alpha = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = 0, sd = 5),
color = "red", linetype = "dotted", size = 1)
I figured out the solution from mixing Jon's answer and an answer from Hadley.
my.data = rnorm(1000, 3, 10)
ggplot(my.data %>% enframe(), aes(value)) +
geom_density(fill = "mediumseagreen", alpha = 0.1) +
geom_area(stat = "function", fun = function(x) dnorm(x, mean = 0, sd = 5), fill = "red", alpha = .5)
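For comparison, here is a sketch of what I believe is an equivalent spelling, using stat_function() with geom = "area" and passing the distribution parameters through args. It also illustrates why no simulated rnorm() sample is needed for the overlay: the curve is just dnorm() evaluated on a grid of x values. The seed and the use of a plain data.frame instead of enframe() are my own choices for the example.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only for reproducibility
my.data <- rnorm(1000, 3, 10)

p <- ggplot(data.frame(value = my.data), aes(value)) +
  geom_density(fill = "mediumseagreen", alpha = 0.1) +
  # filled theoretical normal curve; no random draws involved
  stat_function(fun = dnorm, args = list(mean = 0, sd = 5),
                geom = "area", fill = "red", alpha = 0.5)

# the overlay layer holds grid points (x) and the exact dnorm values (y)
curve_pts <- layer_data(p, 2)
print(nrow(curve_pts))
print(all.equal(curve_pts$y, dnorm(curve_pts$x, mean = 0, sd = 5)))
```

The number of grid points is controlled by the n argument of stat_function(), so the overlay stays smooth even when my.data is small.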

Possible for tooltips to show original vs transformed values in ggplotly with a transformed scale?

I'm a longtime ggplot2 user, but new to plotly. I'm currently working on a plot of ratios (eg, from a Cox proportional hazards model). I'd like to log-transform the scale, but when I do this and then use ggplotly(), the tooltips show the transformed values rather than the original value (eg, for an odds ratio of 1, it would show 0).
Is there a way to ask ggplotly() to show the original values instead? The originalData argument doesn't seem to be what I want, and I'm not finding any other leads.
Example:
library(plotly)
library(ggplot2)
df <- data.frame(x = 1:10,
ref.level = 1,
or = seq(from = 1, by = 0.25, length.out = 10),
lcl = c(1, seq(from = 1.1, by = 0.1, length.out = 9)),
ucl = c(1, seq(from = 1.35, by = 0.35, length.out = 9)))
ggobj <- ggplot(data = df, aes(x = x, y = or)) +
geom_pointrange(aes(ymin = lcl, ymax = ucl))
ggobj %>% ggplotly()
## Plot ratios on log scale
ggobj <- ggobj +
scale_y_continuous(trans = 'log')
## Plot with ggplotly(); y, ymin, ymax tooltips are on transformed scale
ggobj %>% ggplotly()
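No answer is recorded here. One approach that is commonly used with ggplotly(), sketched below as an assumption rather than a confirmed fix for this exact case, is to build the tooltip string yourself from the untransformed values, map it to the unofficial text aesthetic, and restrict the tooltip to it with tooltip = "text". The label column and its wording are hypothetical, and tooltip behaviour can differ between geoms and plotly versions.

```r
library(ggplot2)
library(plotly)

df <- data.frame(x = 1:10,
                 or  = seq(from = 1, by = 0.25, length.out = 10),
                 lcl = c(1, seq(from = 1.1, by = 0.1, length.out = 9)),
                 ucl = c(1, seq(from = 1.35, by = 0.35, length.out = 9)))

# hypothetical tooltip text built from the original (untransformed) values
df$label <- sprintf("x: %d<br>OR: %.2f<br>95%% CI: %.2f to %.2f",
                    df$x, df$or, df$lcl, df$ucl)

ggobj <- ggplot(df, aes(x = x, y = or, text = label)) +
  geom_pointrange(aes(ymin = lcl, ymax = ucl)) +
  scale_y_continuous(trans = "log")

# show only the custom text instead of the transformed y, ymin, ymax
w <- ggplotly(ggobj, tooltip = "text")
w
```

The axis stays on the log scale; only the hover text is replaced.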

Producing a "fuzzy" RD plot with ggplot2

My question is similar to this but the answers there will not work for me. Basically, I'm trying to produce a regression discontinuity plot with a "fuzzy" design that uses all the data for the treatment and control groups, but only plots the regression line within the "range" of the treatment and control groups.
Below, I've simulated some data and produced the fuzzy RD plot with base graphics. I'm hoping to replicate this plot with ggplot2. Note that the most important part of this is that the light blue regression line is fit using all the blue points, while the peach colored regression line is fit using all the red points, despite only being plotted over the ranges in which individuals were intended to receive treatment. That's the part I'm having a hard time replicating in ggplot.
I'd like to move to ggplot because I'd like to use faceting to produce this same plot across various units in which participants were nested. In the code below, I show a non-example using geom_smooth. When there's no fuzziness within a group, it works fine, but otherwise it fails. If I could get geom_smooth to be limited to only specific ranges, I think I'd be set. Any and all help is appreciated.
Simulate data
library(MASS)
mu <- c(0, 0)
sigma <- matrix(c(1, 0.7, 0.7, 1), ncol = 2)
set.seed(100)
d <- as.data.frame(mvrnorm(1e3, mu, sigma))
# Create treatment variable
d$treat <- ifelse(d$V1 <= 0, 1, 0)
# Introduce fuzziness
d$treat[d$treat == 1][sample(100)] <- 0
d$treat[d$treat == 0][sample(100)] <- 1
# Treatment effect
d$V2[d$treat == 1] <- d$V2[d$treat == 1] + 0.5
# Add grouping factor
d$group <- gl(9, 1e3/9)
Produce regression discontinuity plot with base
library(RColorBrewer)
pal <- brewer.pal(5, "RdBu")
color <- d$treat
color[color == 0] <- pal[1]
color[color == 1] <- pal[5]
plot(V2 ~ V1,
data = d,
col = color,
bty = "n")
abline(v = 0, col = "gray", lwd = 3, lty = 2)
# Fit model
m <- lm(V2 ~ V1 + treat, data = d)
# predicted achievement for treatment group
pred_treat <- predict(m,
newdata = data.frame(V1 = seq(-3, 0, 0.1),
treat = 1))
# predicted achievement for control group
pred_no_treat <- predict(m,
newdata = data.frame(V1 = seq(0, 4, 0.1),
treat = 0))
# Add predicted achievement lines
lines(seq(-3, 0, 0.1), pred_treat, col = pal[4], lwd = 3)
lines(seq(0, 4, 0.1), pred_no_treat, col = pal[2], lwd = 3)
# Add legend
legend("bottomright",
legend = c("Treatment", "Control"),
lty = 1,
lwd = 2,
col = c(pal[4], pal[2]),
box.lwd = 0)
non-example with ggplot
d$treat <- factor(d$treat, labels = c("Control", "Treatment"))
library(ggplot2)
ggplot(d, aes(V1, V2, group = treat)) +
geom_point(aes(color = treat)) +
geom_smooth(method = "lm", aes(color = treat)) +
facet_wrap(~group)
Notice the regression lines extending past the treatment range for groups 1 and 2.
There's probably a more graceful way to make the lines with geom_smooth, but it can be hacked together with geom_segment. Munge the data.frames outside of the plotting call if you like.
ggplot(d, aes(x = V1, y = V2, color = factor(treat, labels = c('Control', 'Treatment')))) +
geom_point(shape = 21) +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
geom_segment(data = data.frame(t(predict(m, data.frame(V1 = c(-3, 0), treat = 1)))),
aes(x = -3, xend = 0, y = X1, yend = X2), color = pal[4], size = 1) +
geom_segment(data = data.frame(t(predict(m, data.frame(V1 = c(0, 4), treat = 0)))),
aes(x = 0, xend = 4, y = X1, yend = X2), color = pal[2], size = 1)
Another option is geom_path:
df <- data.frame(V1 = c(-3, 0, 0, 4), treat = c(1, 1, 0, 0))
df <- cbind(df, V2 = predict(m, df))
ggplot(d, aes(x = V1, y = V2, color = factor(treat, labels = c('Control', 'Treatment')))) +
geom_point(shape = 21) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
geom_path(data = df, size = 1)
For the edit with facets, if I understand what you want correctly, you can fit a model for each group with lapply and predict for each group. Here I recombine the results with dplyr::bind_rows instead of do.call(rbind, ...) so that the .id parameter inserts the group number from the list element name, though there are other ways to do the same thing.
df <- data.frame(V1 = c(-3, 0, 0, 4), treat = c('Treatment', 'Treatment', 'Control', 'Control'))
m_list <- lapply(split(d, d$group), function(x){lm(V2 ~ V1 + treat, data = x)})
df <- dplyr::bind_rows(lapply(m_list, function(x){cbind(df, V2 = predict(x, df))}), .id = 'group')
ggplot(d, aes(x = V1, y = V2, color = treat)) +
geom_point(shape = 21) +
geom_vline(aes(xintercept = 0), color = 'grey', size = 1, linetype = 'dashed') +
geom_path(data = df, size = 1) +
scale_color_brewer(NULL, type = 'qual', palette = 6) +
facet_wrap(~group)
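For readers who prefer to avoid dplyr, here is a rough base R sketch of the do.call(rbind, ...) alternative mentioned above, with the group id attached by hand. The simulated data is a simplified stand-in for the question's data (the fuzziness step and the group assignment are my own shortcuts), and the names new_pts, pred_list and pred are hypothetical.

```r
library(MASS)  # mvrnorm()

set.seed(100)
d <- as.data.frame(mvrnorm(1e3, mu = c(0, 0),
                           Sigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)))
treat <- ifelse(d$V1 <= 0, 1, 0)
flip <- sample(nrow(d), 200)          # simplified fuzziness: flip 200 assignments
treat[flip] <- 1 - treat[flip]
d$V2[treat == 1] <- d$V2[treat == 1] + 0.5
d$treat <- factor(treat, levels = c(0, 1), labels = c("Control", "Treatment"))
d$group <- factor(rep(1:9, length.out = nrow(d)))  # nine groups, assigned in rotation

# end points of the two line segments to draw in every facet
new_pts <- data.frame(
  V1 = c(-3, 0, 0, 4),
  treat = factor(c("Treatment", "Treatment", "Control", "Control"),
                 levels = c("Control", "Treatment"))
)

# one model per group, each fit on all of that group's points
m_list <- lapply(split(d, d$group), function(g) lm(V2 ~ V1 + treat, data = g))

# predict per group and tag each block of rows with its group name
pred_list <- Map(function(m, id) {
  data.frame(group = id, new_pts, V2 = predict(m, new_pts))
}, m_list, names(m_list))

pred <- do.call(rbind, pred_list)
rownames(pred) <- NULL
print(head(pred, 4))
```

The resulting pred data frame has the same role as df in the answer: it can be passed to geom_path(data = pred, ...) together with facet_wrap(~group).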

How can I use different color palettes for different layers in ggplot2?

Is it possible to plot two sets of data on the same plot, but use different color palettes for each set?
testdf <- data.frame( x = rnorm(100),
y1 = rnorm(100, mean = 0, sd = 1),
y2 = rnorm(100, mean = 10, sd = 1),
yc = rnorm(100, mean = 0, sd = 3))
ggplot(testdf, aes(x, y1, colour = yc)) + geom_point() +
geom_point(aes(y = y2))
What I would like to see is one set of data, say y1, in blues (color set by yc), and the other set in reds (again color set by yc).
The legend should then show 2 color scales, one in blue, the other red.
Thanks for your suggestions.
If you translate the "blues" and "reds" to varying transparency, then it is not against ggplot's philosophy. So, using Thierry's Molten version of the data set:
ggplot(Molten, aes(x, value, colour = variable, alpha = yc)) + geom_point()
That should do the trick.
That's not possible with ggplot2. I think it is against the philosophy of ggplot2 because it complicates the interpretation of the plot.
Another option is to use different shapes to separate the points.
testdf <- data.frame( x = rnorm(100),
y1 = rnorm(100, mean = 0, sd = 1),
y2 = rnorm(100, mean = 10, sd = 1),
yc = rnorm(100, mean = 0, sd = 3))
library(reshape2)  # provides melt()
Molten <- melt(testdf, id.vars = c("x", "yc"))
ggplot(Molten, aes(x, value, colour = yc, shape = variable)) + geom_point()
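A workaround that stays within plain ggplot2, sketched here as my own suggestion rather than something from the answers above, is to put the two layers on two different aesthetics: colour for y1 and fill for y2 (drawn with the fillable point shape 21). Each aesthetic has its own scale, so each layer can get its own gradient and its own legend. The specific colour names and legend titles are arbitrary choices.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
testdf <- data.frame(x  = rnorm(100),
                     y1 = rnorm(100, mean = 0, sd = 1),
                     y2 = rnorm(100, mean = 10, sd = 1),
                     yc = rnorm(100, mean = 0, sd = 3))

p <- ggplot(testdf, aes(x = x)) +
  # layer 1: yc drives the colour scale (blues)
  geom_point(aes(y = y1, colour = yc)) +
  # layer 2: yc drives the fill scale (reds); shape 21 is a fillable circle
  geom_point(aes(y = y2, fill = yc), shape = 21, colour = "transparent") +
  scale_colour_gradient(name = "yc (y1)", low = "lightblue", high = "darkblue") +
  scale_fill_gradient(name = "yc (y2)", low = "mistyrose", high = "darkred")

blues <- layer_data(p, 1)$colour  # colours actually assigned to layer 1
reds  <- layer_data(p, 2)$fill    # fills actually assigned to layer 2
p
```

The trade-off is that both layers must be point-like geoms where colour and fill can stand in for each other; it does not generalize to layers that already use both aesthetics.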
