How to add a fitted distribution to a histogram - R

I am trying to add a fitted distribution to a histogram, but when I run the code below, the curve shows up as just a straight line. How can I get a density line?
hist(data$price)
lines(density(data$price), lwd = 2, col = "red")

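You are using the graphics function hist, which plots counts on the y-axis by default, so the density curve sits flat near zero. Use the MASS function truehist instead, which plots on the density scale:
MASS::truehist(data$price)
lines(density(data$price), lwd = 2, col = "red")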
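If you would rather stay in base graphics, a minimal sketch of the same idea is to ask hist for the density scale with freq = FALSE. The price vector here is simulated purely for illustration, since the question did not include any data.

```r
# Simulated stand-in for data$price (hypothetical; the question gave no data)
set.seed(1)
price <- rlnorm(500, meanlog = 3, sdlog = 0.4)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1),
# so the kernel density curve is drawn on the same scale as the bars
h <- hist(price, freq = FALSE, main = "Price", xlab = "price")
d <- density(price)
lines(d, lwd = 2, col = "red")
```

With the default freq = TRUE the bars reach into the tens or hundreds while the density values stay far below 1, which is why the curve looks like a flat line along the bottom.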

@Chriss gave a good solution: it does produce a density curve on top of the histogram. However, it changes the y-axis so that you only see the density values (losing the count values).
Here is an alternate solution that places the frequency counts on the left-side y-axis and adds density as a right-side y-axis. Tweak the code as needed for things like bins, color, etc. I'm using the mtcars data as an example since no code or data were provided in the question to replicate. In addition to the two libraries used here (ggpubr and cowplot), you may need some ggplot2 functions to better customize these plot options.
Code for this solution was modified from https://www.datanovia.com/en/blog/ggplot-histogram-with-density-curve-in-r-using-secondary-y-axis/
# packages needed
library(ggpubr)
library(cowplot)
# load data (none provided in the original question)
data("mtcars")
# create histogram (I have 10 bins here, but you may need a different amount)
phist <- gghistogram(mtcars, x="hp", bins=10, fill="blue", ylab="Count (blue)") + ggtitle("Car Horsepower Histogram")
# create density plot, removing many plot elements
pdens <- ggdensity(mtcars, x="hp", col="red", size=2, alpha = 0, ylab="Density (red)") +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)), position = "right") +
theme_half_open(11, rel_small = 1) +
rremove("x.axis")+
rremove("xlab") +
rremove("x.text") +
rremove("x.ticks") +
rremove("legend")
# overlay and display the plots
aligned_plots <- align_plots(phist, pdens, align="hv", axis="tblr")
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])
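A possible alternative, sketched here with ggplot2 alone, is to rescale the density onto the count axis and label a secondary axis with the original density values. This is not the method from the linked post; the bin width of 25 and the name scale_factor are choices made only for this illustration. It relies on the fact that the expected count in a bin is roughly density times n times bin width.

```r
library(ggplot2)
data("mtcars")

binwidth <- 25                     # assumed bin width, adjust for your data
n <- nrow(mtcars)
scale_factor <- n * binwidth       # count is approximately density * n * binwidth

p <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = binwidth, fill = "blue", colour = "black") +
  # draw the density on the count scale so it overlays the bars
  geom_density(aes(y = after_stat(density) * scale_factor), colour = "red") +
  # the right-hand axis converts counts back to density units
  scale_y_continuous(name = "Count (blue)",
                     sec.axis = sec_axis(~ . / scale_factor,
                                         name = "Density (red)")) +
  ggtitle("Car Horsepower Histogram")
p
```

Because both layers share one panel, no plot alignment step is needed, at the cost of tying the density axis to a fixed bin width.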

Related

How to add a boxplot to a histogram using ggMarginal in R

I would like to draw a histogram with a density curve and then put a boxplot above the top margin. I know how to do this using the hist(), boxplot() and layout() functions, or using functions from the ggplot2 and grid packages. However, I am looking for a specific solution using ggplot2 and the ggMarginal() function within the ggExtra package. Let's simulate some data before I present my problem:
library(ggplot2)
library(ggExtra)
set.seed(1234)
vdat = data.frame(V1 = c(sample(1:10, 100, T), 99))
vname = colnames(vdat)[1]
boxplot(vdat[[vname]], horizontal = T)
To note, I explicitly insert an outlier 99 into a sample of numbers from 1 to 10. Hence, when I draw the boxplot, 99 should be displayed as an outlier.
I can easily draw a histogram using ggplot2.
p = ggplot(data=vdat, aes_string(x=vname)) +
geom_histogram(aes(y=stat(density)),
bins=nclass.Sturges(vdat[[vname]])+1,
color="black", fill="steelblue", na.rm=T) +
geom_density(na.rm=T) +
theme_bw()
p
When I try to use ggMarginal to add a marginal boxplot, the added boxplots are not right.
p1 = ggMarginal(p, type="boxplot")
p1
The boxplot on the right might be right. But the one on top, which is the very one I need, is definitely wrong. The outlier 99 is not there and the median is clearly not right.
When I do not pass p, but instead supply the original data, x, and y as suggested by the help documentation, I get the right boxplot, but the histogram is now gone.
p2 = ggMarginal(data=vdat, x=vname, y=NA, type="boxplot", margins="x")
p2
How can I combine the correct parts of p1 and p2 such that I have the histogram from p1 and the boxplot from p2?
I am trying something like
p1 + p2
or
ggMarginal(p1, data=vdat, x=vname, y=NA, type="boxplot", margins="x")
But they are not working.
According to ggMarginal's documentation, p is expected to be a ggplot scatterplot. We can insert the following line as the first geom layer in p:
geom_point(aes(y = 0.01), alpha = 0)
y = 0.01 was chosen as a value within the existing plot's y-axis range, and alpha = 0 ensures this layer isn't visible.
Running your code with this p should give you the boxplot with outlier.
p <- ggplot(data=vdat, aes_string(x=vname)) +
geom_point(aes(y = 0.01), alpha = 0) +
geom_histogram(aes(y=stat(density)),
bins=nclass.Sturges(vdat[[vname]])+1,
color="black", fill="steelblue", na.rm=T) +
geom_density(na.rm=T) +
theme_bw()
p1 = ggMarginal(p, type="boxplot", margins = "x")
p1
By the way, I don't think it really makes sense to plot a boxplot to the right in this instance, since you have not assigned any variable to y.
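As a quick sanity check on what the marginal boxplot ought to show, base R's boxplot.stats can report the five-number summary and the flagged outliers for the same simulated data. This only inspects the data and is not part of the ggMarginal fix itself.

```r
# Same simulated data as in the question
set.seed(1234)
vdat <- data.frame(V1 = c(sample(1:10, 100, TRUE), 99))

# boxplot.stats returns the whisker/hinge/median values and the outliers
bs <- boxplot.stats(vdat$V1)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers; 99 should appear here
```

If the top marginal boxplot does not show a point at 99 and a median between 1 and 10, it is not being drawn from V1.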

ggplot scatterplot and lines

I have some biological data for two individuals, and I graph it using R as a scatterplot using ggplot like this:
p1<-ggplot(data, aes(meth_matrix$sample1, meth_matrix$sample3)) +
geom_point() +
theme_minimal()
which works perfectly, but I want to add lines to it, starting with the abline that divides the scatterplot in half:
p1 + geom_abline(color="blue")
My question is: how can I draw two red lines parallel to that diagonal (the y-intercept would be 0.2, and the slope would be the same as the blue line)?
Also, how can I draw the difference of both samples in a similar scatterplot (it will look like a horizontal scatterplot) with ggplot? Right now I can only do it with plot, like this:
dif_samples<-meth_matrix$sample1- meth_matrix$sample3
plot(dif_samples, main="difference",
xlab="CpGs ", ylab="Methylation ", pch=19)
(I'd also like to add the horizontal blue line and the red lines parallel to it.)
Please help!!!
Thank you very much.
You can specify slopes and intercepts in the geom_abline() function. I'll use the iris dataset that ships with base R (in the datasets package) to illustrate:
# I'll use the iris dataset. I normalise by dividing the variables by their max so that
# a line through the origin will be visible
library(ggplot2)
p1 <- ggplot(iris, aes(Sepal.Length/max(Sepal.Length), Sepal.Width/max(Sepal.Width))) +
geom_point() + theme_minimal()
# Draw lines by specifying their slopes and intercepts. Since all lines
# share a slope, I give just one instead of a vector of slopes
p1 + geom_abline(intercept = c(0, .2, -.2), slope = 1,
color = c("blue", "red", "red"))
I'm not as clear on exactly what you want for the second plot, but you can plot differences directly in the call to ggplot() and you can add horizontal lines with geom_hline():
# Now let's plot the difference between sepal length and width
# for each observation
p2 <- ggplot(iris, aes(x = 1:nrow(iris),
y = (Sepal.Length - Sepal.Width) )) +
geom_point() + theme_minimal()
# we'll add horizontal lines -- you can pick values that make sense for your problem
p2 + geom_hline(yintercept = c(3, 3.2, 2.8),
color = c("blue", "red", "red"))
Created on 2018-03-21 by the reprex package (v0.2.0).
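Closer to the original question, here is a rough sketch with simulated values in the 0 to 1 range standing in for the two samples. The data frame meth and its columns are made up for illustration; the intercepts of 0.2 and -0.2 follow the question.

```r
library(ggplot2)

# Simulated stand-in for meth_matrix (hypothetical values in [0, 1])
set.seed(1)
meth <- data.frame(sample1 = runif(200))
meth$sample3 <- pmin(pmax(meth$sample1 + rnorm(200, sd = 0.1), 0), 1)
meth$index <- seq_len(nrow(meth))
meth$diff <- meth$sample1 - meth$sample3

# Scatterplot with the diagonal in blue and two parallel red lines
p_scatter <- ggplot(meth, aes(sample1, sample3)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  geom_abline(intercept = c(-0.2, 0.2), slope = 1, colour = "red") +
  theme_minimal()

# Difference plot with the matching horizontal lines
p_diff <- ggplot(meth, aes(index, diff)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "blue") +
  geom_hline(yintercept = c(-0.2, 0.2), colour = "red") +
  labs(title = "difference", x = "CpGs", y = "Methylation") +
  theme_minimal()

p_scatter
p_diff
```

Putting the columns in a data frame and referring to them by name inside aes() avoids the meth_matrix$sample1 style, which can misbehave once facets or grouping are added.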

Dividing long time series in multiple panels with ggplot2

I have a rather long timeseries that I want to plot in ggplot, but it's sufficiently long that even using the full width of the page it's barely readable.
What I want to do instead is to divide the plot into 2 (or more, in the general case) panels one on top of each other.
I could do it manually, but that is cumbersome and makes it hard to give the axes the same scale. Ideally I would like to have something like this:
ggplot(data, aes(time, y)) +
geom_line() +
facet_time(time, n = 2)
And then get something like this:
(This plot was made using facet_wrap(~(year(as.Date(time)) > 2000), ncol = 1, scales = "free_x"), which messes up x axis scale, it works only for 2 panels, and doesn't work well with geom_smooth())
Also, ideally it would also handle summary statistics correctly. For example, using the correct data for geom_smooth() (so facetting wouldn't do it, because at the beginning of every facet it would not use the data in the last chunk of the previous one).
Is there a way to do this?
Thank you!
Below I create two separate plots, one for the period 1982-1999 and one for 1999-2016 and then lay them out using grid.arrange from the gridExtra package. The horizontal axes are scaled equivalently in both plots.
I also generate regression lines outside of ggplot using the loess function so that it can be added using geom_line (you can of course use any regression function here, such as lm, gam, splines, etc). With this approach the regression can be run on the entire time series, ensuring continuity of the regression line across the two panels, even though we break the time series into two halves for plotting.
library(ggplot2) # For the plots themselves
library(dplyr) # For the chaining (%>%) operator
library(purrr) # For the map function
library(gridExtra) # For the grid.arrange function
Function to extract a legend from a ggplot. We'll use this to get one legend across two separate plots.
# http://stackoverflow.com/questions/12539348/ggplot-separate-legend-and-plot
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
legend
}
# Fake data
set.seed(255)
dat = data.frame(time=rep(seq(1982,2016,length.out=500),2),
value= c(arima.sim(list(ar=c(0.4, 0.05, 0.5)), n=500),
arima.sim(list(ar=c(0.3, -0.3, 0.6)), n=500)),
group=rep(c("A","B"), each=500))
Generate smoother lines using loess: We want a separate regression line for each level of group, so we use group_by with the chaining operator from dplyr:
dat = dat %>% group_by(group) %>%
mutate(smooth = predict(loess(value ~ time, span=0.1)))
Create a list of two plots, one for each time period: We use map to create separate plots for each time period and return a list with the two plot objects as elements (you can also use base lapply for this instead of map):
pl = map(list(c(1982,1999), c(1999,2016)),
~ ggplot(dat %>% filter(time >= .x[1], time <= .x[2]),
aes(colour=group)) +
geom_line(aes(time, value), alpha=0.5) +
geom_line(aes(time, smooth), size=1) +
scale_x_continuous(breaks=1982:2016, expand=c(0.01,0)) +
scale_y_continuous(limits=range(dat$value)) +
theme_bw() +
labs(x="", y="", colour="") +
theme(strip.background=element_blank(),
strip.text=element_blank(),
axis.title=element_blank()))
# Extract legend as a separate graphics object
leg = g_legend(pl[[1]])
Finally, we lay out both plots (after removing legends) plus the extracted legend:
grid.arrange(arrangeGrob(grobs=map(pl, function(p) p + guides(colour="none")), ncol=1),
leg, ncol=2, widths=c(10,1), left="Value", bottom="Year")
You can do this by storing the plot object and then printing it twice, each time adding coord_cartesian with a different x range:
orig_plot <- ggplot(data, aes(time, y)) +
geom_line()
early <- orig_plot + coord_cartesian(xlim = c(1982, 2000))
late <- orig_plot + coord_cartesian(xlim = c(2000, 2016))
That makes sure that both plots use all the data.
To plot them on the same page, use grid (I got this from the ggplot2 book, which is probably around as a pdf somewhere):
library(grid)
vp1 <- viewport(width = 1, height = .5, just = c("center", "bottom"))
vp2 <- viewport(width = 1, height = .5, just = c("center", "top"))
print(early, vp = vp1)
print(late, vp = vp2)
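The coord_cartesian approach extends to any number of panels. The sketch below wraps it in a small helper; split_panels is a hypothetical name invented here (it is not a ggplot2 function), and the series is simulated for illustration.

```r
library(ggplot2)
library(grid)

# Hypothetical helper: returns one zoomed copy of `plot` per panel.
# coord_cartesian only zooms, so every copy still uses the full data.
split_panels <- function(plot, x_range, n = 2) {
  edges <- seq(x_range[1], x_range[2], length.out = n + 1)
  lapply(seq_len(n), function(i) {
    plot + coord_cartesian(xlim = c(edges[i], edges[i + 1]))
  })
}

# Simulated series for illustration only
set.seed(255)
dat <- data.frame(time = seq(1982, 2016, length.out = 500),
                  y = cumsum(rnorm(500)))

orig_plot <- ggplot(dat, aes(time, y)) + geom_line()
n_panels <- 3
panels <- split_panels(orig_plot, range(dat$time), n = n_panels)

# Stack the panels top to bottom, each in its own viewport
grid.newpage()
for (i in seq_along(panels)) {
  vp <- viewport(x = 0.5, y = 1 - (i - 0.5) / n_panels,
                 width = 1, height = 1 / n_panels)
  print(panels[[i]], vp = vp)
}
```

Since the y scale is trained on the whole series, all panels share the same vertical axis, and a smoother added to orig_plot would be fitted once on all the data.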

R: Density plot with colors by group?

I have data from 2 populations.
I'd like to get the histogram and density plot of both on the same graphic.
With one color for one population and another color for the other one.
I've tried this (example):
library(ggplot2)
AA <- rnorm(100000, 70,20)
BB <- rnorm(100000,120,20)
valores <- c(AA,BB)
grupo <- c(rep("AA", 100000),c(rep("BB", 100000)))
todo <- data.frame(valores, grupo)
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram(aes(y=..density..), binwidth=3)+ geom_density(aes(color=grupo))
But I'm just getting a graphic with a single line and a single color.
I would like to have different colors for the two density lines. And if possible the histograms as well.
I've done it with ggplot2 but base R would also be OK.
or I don't know what I've changed and now I get this:
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram( position="identity", binwidth=3, alpha=0.5)+
geom_density(aes(color=grupo))
but the density lines were not plotted.
or even strange things like
I suggest this ggplot2 solution:
ggplot(todo, aes(valores, color=grupo)) +
geom_histogram(position="identity", binwidth=3, aes(y=..density.., fill=grupo), alpha=0.5) +
geom_density()
@skan: Your attempt was close, but you plotted the frequencies instead of density values in the histogram.
A base R solution could be:
hist(AA, probability = T, col = rgb(1,0,0,0.5), border = rgb(1,0,0,1),
xlim=range(AA,BB), breaks= 50, ylim=c(0,0.025), main="AA and BB", xlab = "")
hist(BB, probability = T, col = rgb(0,0,1,0.5), border = rgb(0,0,1,1), add=T)
lines(density(AA))
lines(density(BB), lty=2)
For alpha I used rgb, but there are other ways to set it; see alpha() in the scales package, for instance. I also added the breaks parameter to the plot of AA to increase the number of bins (giving narrower bins) compared to the BB group.
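If you want both groups to share the same bins, one option is to compute a common set of breaks first and pass it to both hist calls. The sketch below does that on smaller simulated samples and uses adjustcolor from grDevices as another way to get semi-transparent fills; the target of about 40 bins is an arbitrary choice for illustration.

```r
set.seed(42)
AA <- rnorm(1000, 70, 20)
BB <- rnorm(1000, 120, 20)

# One set of breaks covering both groups, so the bars line up
common_breaks <- pretty(range(AA, BB), n = 40)
h_aa <- hist(AA, breaks = common_breaks, plot = FALSE)
h_bb <- hist(BB, breaks = common_breaks, plot = FALSE)

# Semi-transparent fills without spelling out rgb() components
red_half  <- adjustcolor("red",  alpha.f = 0.5)
blue_half <- adjustcolor("blue", alpha.f = 0.5)

# Draw both histograms on the density scale, then the density curves
plot(h_aa, freq = FALSE, col = red_half, border = "red",
     ylim = c(0, max(h_aa$density, h_bb$density)),
     main = "AA and BB", xlab = "")
plot(h_bb, freq = FALSE, col = blue_half, border = "blue", add = TRUE)
lines(density(AA))
lines(density(BB), lty = 2)
```

Computing the histograms with plot = FALSE first also lets you derive ylim from the data instead of hard-coding it.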

How to draw a clipped density plot in ggplot2 without missing sections

I would like to use ggplot2 to draw a lattice plot of densities produced from different methods, in which the same yaxis scale is used throughout.
I would like to set the upper limit of the y axis to a value below the highest density value for any one method. However ggplot by default removes sections of the geom that are outside of the plotted region.
For example:
# Toy example of problem
xval <- rnorm(10000)
#Base1
plot(density(xval))
#Base2
plot(density(xval), ylim=c(0, 0.3)) # densities > 0.3 not removed from plot
library(ggplot2)
xval <- as.data.frame(xval)
ggplot(xval, aes(x=xval)) + geom_density() #gg1 - looks like Base1
ggplot(xval, aes(x=xval)) + geom_density() + ylim(0, 0.3)
#gg2: does not look like Base2 due to removal of density values > 0.3
These produce the images below:
How can I make the ggplot image not have the missing section?
Using xlim() or ylim() directly drops every value that falls outside the specified range, which is what causes the gap in the density curve. Use coord_cartesian() to zoom in without losing any values.
ggplot(xval, aes(x=xval)) +
geom_density() +
coord_cartesian(ylim = c(0, 0.3))
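To see the difference side by side, the following sketch builds both versions from one base plot and checks that the zoomed version still carries the full curve. The object names are made up for this example.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(xval = rnorm(10000))

p_base <- ggplot(df, aes(x = xval)) + geom_density()
p_cut  <- p_base + ylim(0, 0.3)                        # censors values above 0.3
p_zoom <- p_base + coord_cartesian(ylim = c(0, 0.3))   # only changes the view

# The zoomed plot still holds the whole density curve, including the peak
peak_zoom <- max(layer_data(p_zoom)$y)
peak_zoom > 0.3

p_zoom
```

The peak of the curve lies above the visible range in p_zoom, so the line simply runs off the top of the panel instead of breaking.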

Resources