R: Density plot with colors by group?

I have data from 2 populations.
I'd like to get the histogram and density plot of both on the same graphic.
With one color for one population and another color for the other one.
I've tried this (example):
library(ggplot2)
AA <- rnorm(100000, 70,20)
BB <- rnorm(100000,120,20)
valores <- c(AA,BB)
grupo <- c(rep("AA", 100000),c(rep("BB", 100000)))
todo <- data.frame(valores, grupo)
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram(aes(y=..density..), binwidth=3)+ geom_density(aes(color=grupo))
But I'm just getting a graphic with a single line and a single color.
I would like to have different colors for the two density lines, and, if possible, for the histograms as well.
I've done it with ggplot2 but base R would also be OK.
Or, after some change I can't identify, I now get this:
ggplot(todo, aes(x=valores, fill=grupo, color=grupo)) +
geom_histogram( position="identity", binwidth=3, alpha=0.5)+
geom_density(aes(color=grupo))
but the density lines were not plotted.
or even strange things like

I suggest this ggplot2 solution:
ggplot(todo, aes(valores, color=grupo)) +
geom_histogram(position="identity", binwidth=3, aes(y=..density.., fill=grupo), alpha=0.5) +
geom_density()
@skan: Your attempt was close, but you plotted the frequencies instead of the density values in the histogram.
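As a hedged sketch of the same idea in current ggplot2 syntax: the `..density..` notation still works but is deprecated in recent releases in favor of `after_stat(density)`. The version below assumes ggplot2 3.3.0 or later and uses a smaller simulated sample (1000 per group, an arbitrary choice) so it runs quickly.

```r
library(ggplot2)

# Simulated data, same shape as in the question but smaller
set.seed(42)
todo <- data.frame(
  valores = c(rnorm(1000, 70, 20), rnorm(1000, 120, 20)),
  grupo   = rep(c("AA", "BB"), each = 1000)
)

# Histogram on the density scale plus one density line per group;
# color is mapped globally, fill only for the histogram bars
p <- ggplot(todo, aes(valores, color = grupo)) +
  geom_histogram(aes(y = after_stat(density), fill = grupo),
                 position = "identity", binwidth = 3, alpha = 0.5) +
  geom_density()
p
```

Because both layers are on the density scale, the bars of each group and its curve share the same y-axis units.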

A base R solution could be:
hist(AA, probability = T, col = rgb(1,0,0,0.5), border = rgb(1,0,0,1),
xlim=range(AA,BB), breaks= 50, ylim=c(0,0.025), main="AA and BB", xlab = "")
hist(BB, probability = T, col = rgb(0,0,1,0.5), border = rgb(0,0,1,1), add=T)
lines(density(AA))
lines(density(BB), lty=2)
For the transparency I used rgb, but there are other ways to set it; see alpha() in the scales package, for instance. I also added the breaks parameter to the AA histogram to increase its number of bins compared to the BB group.
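A possible variant, sketched under a few assumptions: it uses adjustcolor() from base grDevices for the transparency (instead of rgb or scales::alpha) and a shared vector of breaks so both groups get the same bin width. The sample size of 1000 and the bin width of 3 are arbitrary choices for illustration.

```r
set.seed(1)
AA <- rnorm(1000, 70, 20)
BB <- rnorm(1000, 120, 20)

# One set of breaks covering both samples, so the bins line up
breaks <- seq(floor(min(AA, BB)), ceiling(max(AA, BB)) + 3, by = 3)

# Semi-transparent fill colors
col_AA <- adjustcolor("red",  alpha.f = 0.5)
col_BB <- adjustcolor("blue", alpha.f = 0.5)

# Compute the histograms first to find a y-limit that fits both
hA <- hist(AA, breaks = breaks, plot = FALSE)
hB <- hist(BB, breaks = breaks, plot = FALSE)
ymax <- max(hA$density, hB$density)

hist(AA, breaks = breaks, probability = TRUE, col = col_AA, border = "red",
     ylim = c(0, ymax), main = "AA and BB", xlab = "")
hist(BB, breaks = breaks, probability = TRUE, col = col_BB, border = "blue",
     add = TRUE)
lines(density(AA), col = "red")
lines(density(BB), col = "blue", lty = 2)
```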

Related

how to add a fitted distribution to a histogram

I am trying to add a fitted distribution to a histogram, but when I run the code below the curve is just a straight line. How can I get a density line?
hist(data$price)
lines(density(data$price), lwd = 2, col = "red")
You are using the graphics function hist, which plots counts by default. Use the MASS function truehist instead:
MASS::truehist(data$price)
lines(density(data$price), lwd = 2, col = "red")
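If adding a dependency on MASS is not wanted, a base R sketch of the same fix is to ask hist for densities with freq = FALSE. Since the question provides no data, the example assumes the built-in mtcars data, with hp standing in for price.

```r
# Stand-in for data$price (the question's data is not available)
x <- mtcars$hp

# freq = FALSE puts the bars on the density scale, so the kernel
# density curve and the bars share the same y-axis units
h <- hist(x, freq = FALSE, main = "Histogram with density curve", xlab = "hp")
lines(density(x), lwd = 2, col = "red")
```

With the default freq = TRUE the bars are counts, which are far larger than density values, so the curve looks like a flat line near zero.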
@Chriss gave a good solution: it does produce a density curve on top of the histogram. However, it changes the y-axis so that you only see the density values (losing the count values).
Here is an alternate solution that will place the frequency counts on the left-side y-axis and add density as a right-side y-axis. Tweak code as needed for things like bins, color, etc. I'm using the mtcars data as an example since there was no code or data provided in the question to replicate. In addition to the two libraries used here (ggpubr and cowplot), you may need to use some ggplot functions to better customize these plot options.
Code for this solution was modified from https://www.datanovia.com/en/blog/ggplot-histogram-with-density-curve-in-r-using-secondary-y-axis/
# packages needed
library(ggpubr)
library(cowplot)
# load data (none provided in the original question)
data("mtcars")
# create histogram (I have 10 bins here, but you may need a different amount)
phist <- gghistogram(mtcars, x="hp", bins=10, fill="blue", ylab="Count (blue)") + ggtitle("Car Horsepower Histogram")
# create density plot, removing many plot elements
pdens <- ggdensity(mtcars, x="hp", col="red", size=2, alpha = 0, ylab="Density (red)") +
scale_y_continuous(expand = expansion(mult = c(0, 0.05)), position = "right") +
theme_half_open(11, rel_small = 1) +
rremove("x.axis")+
rremove("xlab") +
rremove("x.text") +
rremove("x.ticks") +
rremove("legend")
# overlay and display the plots
aligned_plots <- align_plots(phist, pdens, align="hv", axis="tblr")
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])
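A ggplot2-only alternative, offered as a sketch rather than a replacement for the code above: rescale the density to the count scale and use sec_axis() to label the right-hand axis in density units. The names n_obs and bin_w are illustrative, and the bin width of 25 is an arbitrary choice; after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

n_obs <- nrow(mtcars)  # number of observations
bin_w <- 25            # histogram bin width (arbitrary choice)

# A count in a bin is roughly density * n_obs * bin_w, so multiplying the
# density by that factor puts the curve on the count scale; the secondary
# axis divides by the same factor to show density units
p <- ggplot(mtcars, aes(hp)) +
  geom_histogram(binwidth = bin_w, fill = "blue", alpha = 0.5) +
  geom_density(aes(y = after_stat(density) * n_obs * bin_w),
               color = "red", linewidth = 1) +
  scale_y_continuous(
    name = "Count (blue)",
    sec.axis = sec_axis(~ . / (n_obs * bin_w), name = "Density (red)")
  )
p
```

The linewidth argument assumes ggplot2 3.4.0 or later; on older versions size plays the same role.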

How to add a boxplot to a histogram using ggMarginal in R

I would like to draw a histogram with a density curve and then put a boxplot above the top margin. I know how to do this using the hist(), boxplot() and layout() functions, or using functions from the ggplot2 and grid packages. However, I am looking for a specific solution using ggplot2 and the ggMarginal() function within the ggExtra package. Let's simulate some data before I present my problem:
library(ggplot2)
library(ggExtra)
set.seed(1234)
vdat = data.frame(V1 = c(sample(1:10, 100, T), 99))
vname = colnames(vdat)[1]
boxplot(vdat[[vname]], horizontal = T)
To note, I explicitly insert an outlier 99 into a sample of numbers from 1 to 10. Hence, when I draw the boxplot, 99 should be displayed as an outlier.
I can easily draw a histogram using ggplot2.
p = ggplot(data=vdat, aes_string(x=vname)) +
geom_histogram(aes(y=stat(density)),
bins=nclass.Sturges(vdat[[vname]])+1,
color="black", fill="steelblue", na.rm=T) +
geom_density(na.rm=T) +
theme_bw()
p
When I try to use ggMarginal to add a marginal boxplot, the added boxplots are not right.
p1 = ggMarginal(p, type="boxplot")
p1
The boxplot on the right might be right. But the one on top, which is the very one I need, is definitely wrong. The outlier 99 is not there and the median is clearly not right.
When I try not to provide p1, but the original data, x, and y as suggested by the help documentation, I get the right boxplot but the histogram is now gone.
p2 = ggMarginal(data=vdat, x=vname, y=NA, type="boxplot", margins="x")
p2
How can I combine the correct parts of p1 and p2 such that I have the histogram from p1 and the boxplot from p2?
I am trying something like
p1 + p2
or
ggMarginal(p1, data=vdat, x=vname, y=NA, type="boxplot", margins="x")
But they are not working.
According to ggMarginal's documentation, p is expected to be a ggplot scatterplot. We can insert the following line as the first geom layer in p:
geom_point(aes(y = 0.01), alpha = 0)
y = 0.01 was chosen as a value within the existing plot's y-axis range, and alpha = 0 ensures this layer isn't visible.
Running your code with this p should give you the boxplot with outlier.
p <- ggplot(data=vdat, aes_string(x=vname)) +
geom_point(aes(y = 0.01), alpha = 0) +
geom_histogram(aes(y=stat(density)),
bins=nclass.Sturges(vdat[[vname]])+1,
color="black", fill="steelblue", na.rm=T) +
geom_density(na.rm=T) +
theme_bw()
p1 = ggMarginal(p, type="boxplot", margins = "x")
p1
By the way, I don't think it really makes sense to plot a boxplot to the right in this instance, since you have not assigned any variable to y.
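To check that the marginal boxplot shows what it should, one option is to compare it against the numbers base R computes for the same data. This is only a sanity-check sketch using boxplot.stats() from grDevices on the simulated data from the question.

```r
set.seed(1234)
vdat <- data.frame(V1 = c(sample(1:10, 100, TRUE), 99))

# Five-number summary and outliers as used by boxplot()
stats <- boxplot.stats(vdat$V1)
stats$stats       # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out         # points drawn individually as outliers
median(vdat$V1)   # should match the middle line of the marginal boxplot
```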

How to shade part of a density curve in ggplot (with no y axis data)

I'm trying to create a density curve in R using a set of random numbers between 1000, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1,aes(amount_spent)) +
geom_density( size=1.05, color="gray64", alpha=.5, fill="gray77") +
geom_vline(xintercept=mean1, alpha=.7, linetype="dashed", size=1.1, color="cadetblue4")+
geom_vline(xintercept=rand1, alpha=.7, linetype="dashed",size=1.1, color="red3")+
geom_area(mapping=aes(ifelse(amount_spent1$amount_spent > rand1,amount_spent1$amount_spent,0)), ymin=0, ymax=.03,fill="red",alpha=.3)+
ylab("")+
xlab("Amount spent on lobbying (in Millions USD)")+
scale_x_continuous(breaks=seq(0,1000,100))
There are a couple of questions that show this ... here and here, but they calculate the density prior to plotting.
This is another way, probably more complicated than required, that lets ggplot do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
geom_density(fill="grey") +
geom_vline(xintercept=mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation was used to get the y value at x=rand1
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)
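For comparison, here is a sketch of the other approach mentioned above, computing the density before plotting with density(). It shades the region at or below the cutoff, as the question asks. The cutoff value of 400 is a hypothetical stand-in for rand1.

```r
library(ggplot2)

set.seed(100)
amount_spent <- rnorm(1000, 500, 150)
cutoff <- 400  # hypothetical threshold, stands in for rand1

# Kernel density estimate as a data frame of x/y points
dens <- density(amount_spent)
d <- data.frame(x = dens$x, y = dens$y)

# Height of the curve at the cutoff, by linear interpolation
y_cut <- approx(d$x, d$y, xout = cutoff)$y

p <- ggplot(d, aes(x, y)) +
  geom_line() +
  # shade only the part of the curve at or below the cutoff
  geom_area(data = subset(d, x <= cutoff), fill = "red", alpha = 0.3) +
  # vertical line that stops at the curve instead of spanning the panel
  annotate("segment", x = cutoff, xend = cutoff, y = 0, yend = y_cut,
           linetype = "dashed") +
  labs(x = "Amount spent", y = NULL)
p
```

Using annotate() draws a single segment, whereas a geom_segment() layer that inherits the plot data draws one identical segment per row.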

R ggplot with two series: points and errorbars with legends

If I have a dataframe like this:
obs<-rnorm(20)
d<-data.frame(year=2000:2019,obs=obs,pred=obs+rnorm(20,.1))
d$pup<-d$pred+.5
d$plow<-d$pred-.5
d$obs[20]<-NA
d
And I want the observation and model prediction error bars to look something like:
(p1<-ggplot(data=d)+aes(x=year)
+geom_point(aes(y=obs),color='red',shape=19)
+geom_point(aes(y=pred),color='blue',shape=3)
+geom_errorbar(aes(ymin=plow,ymax=pup))
)
How do I add a legend/scale/key identifying the red points as observations and the blue plusses with error bars as point predictions with ranges?
Here is one solution melting pred/obs into one column. Can't post image due to rep.
library(ggplot2)
obs <- rnorm(20)
d <- data.frame(dat=c(obs,obs+rnorm(20,.1)))
d$pup <- d$dat+.5
d$plow <- d$dat-.5
d$year <- rep(2000:2019,2)
d$lab <- c(rep("Obs", 20), rep("Pred", 20))
p1<-ggplot(data=d, aes(x=year)) +
geom_point(aes(y = dat, colour = factor(lab), shape = factor(lab))) +
geom_errorbar(data = d[21:40,], aes(ymin=plow,ymax=pup), colour = "blue") +
scale_shape_manual(name = "Legend Title", values=c(6,1)) +
scale_colour_manual(name = "Legend Title", values=c("red", "blue"))
p1
edit: Thanks for the rep. Image added
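A variation on the long-format idea, sketched with an explicit reshaping step: start from the wide data frame in the question, stack obs and pred into one value column, and let a single series column drive both color and shape. The column names series and value are illustrative choices.

```r
library(ggplot2)

set.seed(1)
obs  <- rnorm(20)
wide <- data.frame(year = 2000:2019, obs = obs, pred = obs + rnorm(20, 0.1))

# Stack the two series into long format using base R only
long <- rbind(
  data.frame(year = wide$year, series = "obs",  value = wide$obs),
  data.frame(year = wide$year, series = "pred", value = wide$pred)
)

p <- ggplot(long, aes(year, value, colour = series, shape = series)) +
  geom_point() +
  # error bars only for the predictions
  geom_errorbar(data = subset(long, series == "pred"),
                aes(ymin = value - 0.5, ymax = value + 0.5)) +
  scale_colour_manual("Legend", values = c(obs = "red", pred = "blue")) +
  scale_shape_manual("Legend", values = c(obs = 19, pred = 3))
p
```

Giving both scales the same title lets ggplot merge them into one legend.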
Here is a ggplot solution that does not require melting and grouping.
set.seed(1) # for reproducible example
obs <- rnorm(20)
d <- data.frame(year=2000:2019,obs,pred=obs+rnorm(20,.1))
d$obs[20]<-NA
library(ggplot2)
ggplot(d,aes(x=year))+
geom_point(aes(y=obs,color="obs",shape="obs"))+
geom_point(aes(y=pred,color="pred",shape="pred"))+
geom_errorbar(aes(ymin=pred-0.5,ymax=pred+0.5))+
scale_color_manual("Legend",values=c(obs="red",pred="blue"))+
scale_shape_manual("Legend",values=c(obs=19,pred=3))
This creates a color scale and a shape scale with two components each ("obs" and "pred"). It then uses scale_*_manual(...) to set the values for those scales: ("red","blue") for color, and (19,3) for shape.
Generally, if you have only two categories, like "obs" and "pred", then this is a reasonable way to use ggplot, and it avoids merging everything into one data frame. If you have more than two categories, or if they are integral to the dataset (e.g., actual categorical variables), then you are much better off doing this as in the other answer.
Note that your example left out the column year so your code does not run.

Conditional graphing and fading colors

I am trying to create a graph in which, because there are so many points, the green region fades to black at its edges while its center stays green. The code I am currently using to create this graph is:
plot(snb$px,snb$pz,col=snb$event_type,xlim=c(-2,2),ylim=c(1,6))
I looked into contour plotting but that did not work for this. The coloring variable is a factor variable.
Thanks!
This is a great problem for ggplot2.
First, read the data in:
snb <- read.csv('MLB.csv')
With your data frame you could try plotting points that are partly transparent, and setting them to be colored according to the factor event_type:
require(ggplot2)
p1 <- ggplot(data = snb, aes(x = px, y = pz, color = event_type)) +
geom_point(alpha = 0.5)
print(p1)
and then you get this:
Or, you might want to think about plotting this as a heatmap using geom_bin2d(), and plotting facets (subplots) for each different event_type, like this:
p2 <- ggplot(data = snb, aes(x = px, y = pz)) +
geom_bin2d(binwidth = c(0.25, 0.25)) +
facet_wrap(~ event_type)
print(p2)
which makes a plot for each level of the factor, where the color shows the number of data points in each bin, and the bins are 0.25 units on each side. But if you have more than about 5 or 6 levels, this might look pretty bad. From the small data sample you supplied, I got this
If the levels of the factors don't matter, there are some nice examples here of plots with too many points. You could also try looking at some of the examples on the ggplot website or the R cookbook.
Transparency could help, and it is easily achieved, as @BenBolker points out, with adjustcolor:
colvec = adjustcolor(c("black", "green"), alpha = 0.2)
plot(snb$px, snb$pz,
col = colvec[snb$event_type],
xlim = c(-2,2),
ylim = c(1,6))
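Since MLB.csv is not available here, the following sketch uses simulated data to show how the indexing works: subsetting the color vector by a factor picks colors by the factor's integer codes, in level order. The data frame snb_demo and its two levels are made up for illustration.

```r
set.seed(1)
n <- 500

# Simulated stand-in for snb; event_type levels sort as "ball", "strike"
snb_demo <- data.frame(
  px = rnorm(n, 0, 0.8),
  pz = rnorm(n, 3.5, 0.8),
  event_type = factor(sample(c("ball", "strike"), n, replace = TRUE))
)

# One semi-transparent color per factor level, in level order
colvec <- adjustcolor(c("black", "green"), alpha.f = 0.2)

# Indexing by a factor uses its integer codes: level 1 -> black, level 2 -> green
point_cols <- colvec[snb_demo$event_type]

plot(snb_demo$px, snb_demo$pz, col = point_cols, pch = 19,
     xlim = c(-2, 2), ylim = c(1, 6), xlab = "px", ylab = "pz")
```

Where many translucent points overlap the color builds up, so dense regions look saturated and sparse edges look faint.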
It's built in to ggplot:
require(ggplot2)
p <- ggplot(data = snb, aes(x = px, y = pz, color = event_type)) +
geom_point(alpha = 0.2)
print(p)
