Fit curve to histogram ggplot [duplicate]

This question already has answers here:
"Density" curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency?
(3 answers)
Closed 7 years ago.
I know that I can fit a density curve to my histogram in ggplot in the following way.
df = data.frame(x=rnorm(100))
ggplot(df, aes(x=x, y=..density..)) + geom_histogram() + geom_density()
However, I want my y-axis to show frequency (counts) instead of density, while retaining a curve that fits the distribution. How do I do that?

Depending on your goals, something like this may work by just scaling the density curve using multiplication:
ggplot(df, aes(x=x)) + geom_histogram() + geom_density(aes(y=..density..*10))
or
ggplot(df, aes(x=x)) + geom_histogram() + geom_density(aes(y=..count../10))
Choose other values (instead of 10) if you want to scale things differently.
Edit:
If your scaling factor is a variable (here n) defined in the global environment, you can pass it in through aes:
ggplot(df, aes(x=x)) + geom_histogram() + geom_density(aes(n=n, y=..density..*n))
# or
ggplot(df, aes(x=x, n=n)) + geom_histogram() + geom_density(aes(y=..density..*n))
or another, less nice way using get:
ggplot(df, aes(x=x)) +
geom_histogram() +
geom_density(aes(y=..density.. * get("n", pos = .GlobalEnv)))
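If the goal is for the curve to sit on the same scale as the bars, one option is to derive the multiplier from the bin width instead of choosing it by eye: a bar of width binwidth holds roughly density × n × binwidth observations. A minimal sketch, assuming ggplot2 3.3.0 or later (for after_stat()) and an arbitrarily chosen bin width of 0.5; count is the variable that stat_density computes as density times the number of points:

```r
library(ggplot2)

set.seed(1)                      # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(100))
binwidth <- 0.5                  # assumed bin width; choose what suits the data

p <- ggplot(df, aes(x = x)) +
  geom_histogram(binwidth = binwidth) +
  # count = density * n, so count * binwidth is on the histogram's count scale
  geom_density(aes(y = after_stat(count * binwidth)))
```

The same binwidth value has to be passed to geom_histogram, otherwise the curve and the bars drift apart again.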


ggplot2 - why does changing axis scale affect summary statistics of variables? [duplicate]

This question already has an answer here:
R ggplot boxplot: change y-axis limit
(1 answer)
Closed last month.
I have the following data:
x <- data.frame('myvar'=c(10,10,9,9,8,8, runif(100)), 'mygroup' = c(rep('a', 26), rep('b', 80)))
I want to describe the data using a box-and-whiskers plot in ggplot2. I have also included the mean using a stat_summary.
library(ggplot2)
ggplot(x, aes(x=myvar, y=mygroup)) +
geom_boxplot() +
stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red')
This is fine, but for some of my graphs the outliers are so large that it is hard to make sense of the overall distribution. In these cases, I have cut the x-axis:
ggplot(x, aes(x=myvar, y=mygroup)) +
geom_boxplot() +
stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red') +
scale_x_continuous(limits=c(0,5))
Note that the means (and medians?) are now calculated using only the subset of data that is visible on the graph. Is there a ggplot way to include the outlier observations in the calculation but drop them from the visualisation?
My desired output would be a graph with x limits at c(0,5) and a red dot at 2.48 for group mygroup='a'.
scale_x_continuous removes the points that lie outside the limits before any statistics are computed. Use coord_cartesian instead to "zoom in" without removing your data:
ggplot(x, aes(x=myvar, y=mygroup)) +
geom_boxplot() +
stat_summary(fun=mean, geom='point', shape=20, color='red', fill='red') +
coord_cartesian(xlim=c(0,5))
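To see the difference in numbers rather than on the plot, the computed layer data can be inspected with layer_data(). This is a small sketch, assuming a ggplot2 version in which stat_summary accepts fun and detects the horizontal orientation; the seed and the object names (base, clipped, zoomed) are made up for illustration:

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the sketch is reproducible
x <- data.frame(
  myvar   = c(10, 10, 9, 9, 8, 8, runif(100)),
  mygroup = c(rep("a", 26), rep("b", 80))
)

base <- ggplot(x, aes(x = myvar, y = mygroup)) +
  stat_summary(fun = mean, geom = "point")

# Scale limits: values outside 0..5 become NA before the mean is computed
clipped <- base + scale_x_continuous(limits = c(0, 5))
# Coordinate limits: the mean uses every row, only the view is cropped
zoomed <- base + coord_cartesian(xlim = c(0, 5))

mean_clipped <- suppressWarnings(layer_data(clipped)$x)
mean_zoomed  <- layer_data(zoomed)$x

# Reference values computed directly from the data
true_means <- tapply(x$myvar, x$mygroup, mean)
```

With the scale limits, the mean for group a is based only on its 20 uniform values; with coord_cartesian it matches the group mean over all 26 values.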

ggplot2 different facet width for categorical x-axis [duplicate]

This question already has an answer here:
different size facets proportional of x axis on ggplot 2 r
(1 answer)
Closed 5 years ago.
I am plotting different facets of categorical data:
df <- as.data.frame(as.factor(c("A","B","C","D","E","F")))
names(df) <- "Xvar"
df$Yvar <- c(2,1,4,5,3,7)
df$facet <- c(rep("facet 1",2),rep("facet 2",4))
ggplot(df, aes(x=Xvar, y=Yvar, group=1)) +
geom_line() +
facet_wrap(~facet, scales="free_x")
How can I make facet 1, which has only two categories, half the width of facet 2, which has four? That is, how can I make the width of each facet proportional to its number of categorical x-axis values? I tried scales="free_x" to no avail.
If you're willing to use facet_grid instead of facet_wrap, you can do this with the space parameter.
ggplot(df, aes(x=Xvar, y=Yvar, group=1)) +
geom_line() +
facet_grid(~facet, scales="free_x", space = "free_x")

In ggplot2, how can I limit the range of geom_hline?

Taking a simple plot from the ggplot2 manual:
p <- ggplot(mtcars, aes(x = wt, y=mpg)) + geom_point()
p + geom_hline(yintercept=20)
I get a horizontal line at value 20, as advertised.
Is there a way to limit the range of this line on the x-axis, say to the range 2 to 4?
You can use geom_segment() instead of geom_hline() and provide the x= and xend= values you need.
p+geom_segment(aes(x=2,xend=4,y=20,yend=20))
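One thing to be aware of: because the constants are wrapped in aes(), the layer inherits mtcars and the same segment is drawn once per row, and recent ggplot2 versions may warn about this. A possible alternative, sketched here with the hypothetical name p_line, is annotate(), which takes the positions as plain values:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# annotate() builds its own one-row data set, so a single segment is drawn
p_line <- p + annotate("segment", x = 2, xend = 4, y = 20, yend = 20)
```

Both versions look the same on screen; the annotate() form just avoids the overplotted copies.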

Splitting distribution visualisations on the y-axis in ggplot2 in r

The most commonly cited example of how to visualize a logistic fit using ggplot2 seems to be something very much like this:
data("kyphosis", package="rpart")
ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"))
This visualisation works great if you don't have too much overlapping data, and the first suggestion for crowded data seems to be to use injected jitter in the x and y coordinates of the points then adjust the alpha value of the points. When you get to the point where individual points aren't useful but distributions of points are, is it possible to use geom_density(), geom_histogram(), or something else to visualise the data but continue to split the categorical variable along the y-axis as it is done with geom_point()?
From what I have found, geom_density() and geom_histogram() can easily be split/grouped by the categorical variable and both levels can easily be reversed using scale_y_reverse() but I can't figure out if it is even possible to move only one of the categorical variable distributions to the top of the plot. Any help/suggestions would be appreciated.
The annotate() function in ggplot2 allows you to add geoms to a plot with properties that "are not mapped from the variables of a data frame, but are instead passed in as vectors," meaning that you can add layers that are unrelated to your data frame. In this case your two density curves are related to the data frame (since the variables are in it), but because you're trying to position them differently, using annotate() is useful.
Here's one way to go about it:
data("kyphosis", package="rpart")
model.only <- ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) +
stat_smooth(method="glm", method.args=list(family="binomial"))
absents <- subset(kyphosis, Kyphosis=="absent")
presents <- subset(kyphosis, Kyphosis=="present")
dens.absents <- density(absents$Age)
dens.presents <- density(presents$Age)
scaling.factor <- 10 # Make the density plots taller
model.only + annotate("line", x=dens.absents$x, y=dens.absents$y*scaling.factor) +
annotate("line", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1)
This adds two annotated layers with scaled density curves, one for each of the kyphosis groups. For the presents group, y is scaled and then increased by 1 to shift the curve up.
You can also fill the density plots instead of just using a line. Instead of annotate("line"...) you need to use annotate("polygon"...), like so:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green", colour="black", alpha=0.4)
Technically you could use annotate("density"...), but that won't work when you shift the present plot up by one. Instead of shifting, it fills the whole plot:
model.only + annotate("density", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red") +
annotate("density", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green")
The only way around that problem is to use a polygon instead of a density geom.
One final variant: flipping the top density plot about the line y = 1:
model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) +
annotate("polygon", x=dens.presents$x, y=(1 - dens.presents$y*scaling.factor), fill="green", colour="black", alpha=0.4)
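The same scale-and-shift idea can also be written with an ordinary data frame and geom_line, which may be easier to extend to more groups. This is only a sketch: shifted_density is a hypothetical helper, the default scale of 10 mirrors scaling.factor above, and it assumes a ggplot2 version that takes the glm family through method.args:

```r
library(ggplot2)
data("kyphosis", package = "rpart")

# Hypothetical helper: kernel density of `values`, stretched by `scale`
# and moved up by `offset` so it can sit at y = 0 or y = 1
shifted_density <- function(values, offset, scale = 10) {
  d <- density(values)
  data.frame(Age = d$x, y = d$y * scale + offset)
}

dens <- rbind(
  cbind(shifted_density(kyphosis$Age[kyphosis$Kyphosis == "absent"], 0),
        Kyphosis = "absent"),
  cbind(shifted_density(kyphosis$Age[kyphosis$Kyphosis == "present"], 1),
        Kyphosis = "present")
)

p <- ggplot(kyphosis, aes(x = Age, y = as.numeric(Kyphosis) - 1)) +
  stat_smooth(method = "glm", method.args = list(family = "binomial")) +
  # inherit.aes = FALSE keeps the density layer independent of the plot's y mapping
  geom_line(data = dens, aes(x = Age, y = y, group = Kyphosis),
            inherit.aes = FALSE)
```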
I am not sure I get your point, but here is an attempt:
dat <- rbind(kyphosis,kyphosis)
dat$grp <- factor(rep(c('smooth','dens'),each = nrow(kyphosis)),
levels = c('smooth','dens'))
ggplot(dat,aes(x=Age)) +
facet_grid(grp~.,scales = "free_y") +
#geom_point(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1)) +
stat_smooth(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1),
method="glm", method.args=list(family="binomial")) +
geom_density(data=subset(dat,grp=='dens'))

Coloring density plot in ggplot2

When I use following code to generate a density plot:
require(ggplot2)
set.seed(seed=10)
n <- 10000
s.data <- data.frame(score = rnorm(n,500,100),
gender = sample(c("Male","Female","No Response"),size=n,replace=T,prob=c(.4,.55,.05)),
major = sample(c("A","B","C","D"),size=n,replace=T,prob=c(.02,.25,.05,.68)))
ggplot(s.data, aes(major,..density..,fill=major,group=1)) +
geom_histogram() + facet_wrap(~ gender)
I cannot distinguish between categories of "major" by color.
What I want to get is density plot similar to this frequency plot in the sense of colors and legend:
ggplot(s.data, aes(major,fill=major)) +
geom_histogram() + facet_wrap(~ gender)
This question follows up on an earlier question of mine, which has already been answered.
You can still use a frequency plot with the facet_wrap argument scales="free_y" (current ggplot2 needs geom_bar rather than geom_histogram for a discrete x):
ggplot(s.data, aes(major,..count..,fill=major)) +
geom_bar() + facet_wrap(~ gender, scales="free_y")
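If "density" here means the share of each major within a gender, another option is to compute those shares beforehand and plot them with geom_col, which keeps one fill colour per major and the usual legend. A minimal sketch using base R for the counting; the counts data frame and its prop column are names introduced only for this example, and the score column is left out because it is not needed:

```r
library(ggplot2)

set.seed(10)
n <- 10000
s.data <- data.frame(
  gender = sample(c("Male", "Female", "No Response"), size = n,
                  replace = TRUE, prob = c(.4, .55, .05)),
  major  = sample(c("A", "B", "C", "D"), size = n,
                  replace = TRUE, prob = c(.02, .25, .05, .68))
)

# Count each gender/major combination, then divide by the gender total
counts <- as.data.frame(table(gender = s.data$gender, major = s.data$major))
counts$prop <- counts$Freq / ave(counts$Freq, counts$gender, FUN = sum)

p <- ggplot(counts, aes(major, prop, fill = major)) +
  geom_col() +
  facet_wrap(~ gender)
```

Because the proportions are computed per gender, the bars in each facet add up to 1 and the facets can share one y-axis.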
