I want to graphically show the assumptions of linear (and later other types of) regression. How can I add small Gaussian densities (or any type of density) along a regression line in my plot, just like in this figure:
You can compute the empirical densities of the residuals for sections along a fitted line. Then it is just a matter of drawing the curves at positions of your choosing in each interval using geom_path. To add a theoretical distribution, generate densities along the range of the residuals for each section (here using the normal density). For the normal densities below, the standard deviation is estimated separately for each section from its residuals, but you could instead choose a single standard deviation and use it for all of them.
## Sample data
library(ggplot2)
set.seed(0)
x <- runif(100, 0, 50)
dat <- data.frame(x=x,
y=rnorm(100, 10*x, 100))
## breaks: where you want to compute densities
breaks <- seq(0, max(dat$x), len=5)
dat$section <- cut(dat$x, breaks)
## Get the residuals
dat$res <- residuals(lm(y ~ x, data=dat))
## Compute densities for each section, and flip the axes, and add means of sections
## Note: the densities need to be scaled in relation to the section size (2000 here)
dens <- do.call(rbind, lapply(split(dat, dat$section), function(x) {
d <- density(x$res, n=50)
res <- data.frame(x=max(x$x)- d$y*2000, y=d$x+mean(x$y))
res <- res[order(res$y), ]
## Get some data for normal lines as well
xs <- seq(min(x$res), max(x$res), len=50)
res <- rbind(res, data.frame(y=xs + mean(x$y),
x=max(x$x) - 2000*dnorm(xs, 0, sd(x$res))))
res$type <- rep(c("empirical", "normal"), each=50)
res
}))
dens$section <- rep(levels(dat$section), each=100)
## Plot both empirical and theoretical
ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method="lm", fill=NA, lwd=2) +
geom_path(data=dens, aes(x, y, group=interaction(section,type), color=type), lwd=1.1) +
theme_bw() +
geom_vline(xintercept=breaks, lty=2)
Or, just the Gaussian curves:
## Just normal
ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method="lm", fill=NA, lwd=2) +
geom_path(data=dens[dens$type=="normal",], aes(x, y, group=section), color="salmon", lwd=1.1) +
theme_bw() +
geom_vline(xintercept=breaks, lty=2)
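The answer mentions that a single standard deviation could be used for every curve. A minimal sketch of that variant follows, assuming the residual standard error of the fit (`sigma(fit)`) is the shared value and that the curves are anchored on the fitted line rather than on section means. The `centres` positions and the `width` scale factor are hypothetical choices made for illustration.

```r
library(ggplot2)

set.seed(0)
x <- runif(100, 0, 50)
dat <- data.frame(x = x, y = rnorm(100, 10 * x, 100))
fit <- lm(y ~ x, data = dat)
s <- sigma(fit)                # one residual SD shared by every curve

centres <- c(10, 25, 40)       # hypothetical x positions for the curves
width <- 2000                  # ad hoc horizontal scale, as in the answer

## One sideways normal curve per centre, centred on the fitted value
curves <- do.call(rbind, lapply(centres, function(x0) {
  mu <- unname(predict(fit, newdata = data.frame(x = x0)))
  e <- seq(-3 * s, 3 * s, length.out = 50)
  data.frame(x = x0 - width * dnorm(e, 0, s), y = mu + e, centre = x0)
}))

ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_path(data = curves, aes(x, y, group = centre), colour = "salmon") +
  geom_vline(xintercept = centres, lty = 2) +
  theme_bw()
```

Because every curve uses the same `s`, all curves have the same shape, which is what the constant-variance assumption of ordinary linear regression states.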
When I plot densities with ggplot, it seems to be very wrong around the limits. I see that geom_density and other functions allow specifying various density kernels, but none of them seem to fix the issue.
How do you correctly plot densities around the limits with ggplot?
As an example, let's plot the Chi-square distribution with 2 degrees of freedom. Using the builtin probability densities:
library(ggplot2)
u = seq(0, 2, by=0.01)
v = dchisq(u, df=2)
df = data.frame(x=u, p=v)
p = ggplot(df) +
geom_line(aes(x=x, y=p), size=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 0.5))
show(p)
We get the expected plot:
Now let's try simulating it and plotting the empirical distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_density(aes(x=x)) +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
We get an incorrect plot:
We can try to visualize the actual distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_point(aes(x=x, y=0.5), position=position_jitter(height=0.2), shape='.', alpha=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 1))
show(p)
And it seems to look correct, contrary to the density plot:
It seems like the problem has to do with kernels, and geom_density does allow using different kernels. But they don't really correct the limit problem. For example, the code above with triangular looks about the same:
Here's an idea of what I'm expecting to see (of course, I want a density, not a histogram):
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_histogram(aes(x=x), center=0.1, binwidth=0.2, fill='white', color='black') +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
The usual kernel density methods have trouble when the support is constrained, as in this case, where the density is supported only above zero. The usual recommendation for handling this has been to use the logspline package:
install.packages("logspline")
library(logspline)
png(); fit <- logspline(rchisq(10000, 3))
plot(fit) ; dev.off()
If this needed to be done in the ggplot2 environment there is a dlogspline function:
densdf <- data.frame( y=dlogspline(seq(0,12,length=1000), fit),
x=seq(0,12,length=1000))
ggplot(densdf, aes(y=y,x=x))+geom_line()
Perhaps you were insisting on one with 2 degrees of freedom?
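A different workaround, not part of the answer above, is the reflection trick: mirror the sample about the boundary, run an ordinary kernel estimate, and double it on the valid side. The sketch below assumes the boundary is at 0 and uses only base R's `density()`; the 201-point grid is an arbitrary choice.

```r
library(ggplot2)

set.seed(1)
u <- rchisq(10000, df = 2)

## Plain kernel estimate: part of the mass leaks below 0
plain <- density(u, from = 0, to = 2, n = 201)

## Reflection: mirror the sample about 0, estimate, then fold back by doubling
mirrored <- density(c(u, -u), from = 0, to = 2, n = 201)

est <- rbind(
  data.frame(x = plain$x, y = plain$y, method = "plain kernel"),
  data.frame(x = mirrored$x, y = 2 * mirrored$y, method = "reflected"),
  data.frame(x = plain$x, y = dchisq(plain$x, df = 2), method = "true density")
)

ggplot(est, aes(x, y, colour = method)) +
  geom_line() +
  theme_classic()
```

The reflected estimate is forced to be flat at the boundary, so it still underestimates a density with a nonzero slope at 0, but it no longer drops toward half the true value there.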
I'm trying to create a density curve in R using a set of random numbers between 0 and 1000, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1,aes(amount_spent)) +
geom_density( size=1.05, color="gray64", alpha=.5, fill="gray77") +
geom_vline(xintercept=mean1, alpha=.7, linetype="dashed", size=1.1, color="cadetblue4")+
geom_vline(xintercept=rand1, alpha=.7, linetype="dashed",size=1.1, color="red3")+
geom_area(mapping=aes(ifelse(amount_spent1$amount_spent > rand1,amount_spent1$amount_spent,0)), ymin=0, ymax=.03,fill="red",alpha=.3)+
ylab("")+
xlab("Amount spent on lobbying (in Millions USD)")+
scale_x_continuous(breaks=seq(0,1000,100))
There are a couple of questions that show this ... here and here, but they calculate the density prior to plotting.
This is another way, more complicated than required I'm sure, that lets ggplot do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
geom_density(fill="grey") +
geom_vline(xintercept=mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation was used to get the y value at x=rand1
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)
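For comparison, here is a sketch of the other route mentioned above, computing the density before plotting. It shades the region at or below a threshold, as the question asks, and draws a vertical line only up to the curve. The `cutoff` value is a hypothetical threshold chosen for illustration.

```r
library(ggplot2)

set.seed(100)
amount_spent <- rnorm(1000, 500, 150)
cutoff <- 400                        # hypothetical threshold

d <- density(amount_spent)           # kernel density on a 512-point grid
dens <- data.frame(x = d$x, y = d$y)
left <- subset(dens, x <= cutoff)    # the part of the curve to shade
h <- approx(dens$x, dens$y, xout = cutoff)$y   # curve height at the cutoff

ggplot(dens, aes(x, y)) +
  geom_line() +
  geom_area(data = left, fill = "red", alpha = 0.3) +
  annotate("segment", x = cutoff, xend = cutoff, y = 0, yend = h,
           linetype = "dashed")
```

Using `annotate("segment", ...)` draws one segment, whereas a `geom_segment` layer that inherits the plot data would draw the same segment once per row.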
I have a dataframe with two columns x and y that each contain values between 0 and 100 (the data are paired). I want to correlate them to each other using binned scatter plots. If I were to use a regular scatter plot, it would be easy to do:
geom_point(aes(x=x, y=y))
but I'd like to instead bin the points into N bins from 0 to 100, get the average value of x in each bin and the average value of y for the points in that bin, and show that as a scatter plot - so correlate the binned averages instead of the raw data points.
Is there a clever/quick way to do this in ggplot2, using some combination of geom_smooth() and geom_point()? Or does it have to be pre-computed manually and then plotted?
Yes, you can use stat_summary_bin.
set.seed(42)
x <- runif(1e4)
y <- x^2 + x + 4 * rnorm(1e4)
df <- data.frame(x=x, y=y)
library(ggplot2)
(ggplot(df, aes(x=x,y=y)) +
geom_point(alpha = 0.4) +
## 'fun' replaces 'fun.y', which is deprecated since ggplot2 3.3.0
stat_summary_bin(fun='mean', bins=20,
color='orange', size=2, geom='point'))
I suggest geom_bin2d.
DF <- data.frame(x=1:100,y=1:100+rnorm(100))
library(ggplot2)
p <- ggplot(DF,aes(x=x,y=y)) + geom_bin2d()
print(p)
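If the x position of each point should be the mean of x within the bin, as the question describes, pre-computing the averages is straightforward. As far as I can tell, stat_summary_bin summarises only y and places the result by bin rather than at the mean of x. The sketch below uses base R's `cut()` and `aggregate()`; the simulated data and `n_bins` are made up for illustration.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = runif(1e4, 0, 100))
df$y <- df$x + rnorm(1e4, sd = 20)

n_bins <- 20
## Assign each point to one of n_bins equal-width bins on [0, 100]
df$bin <- cut(df$x, breaks = seq(0, 100, length.out = n_bins + 1),
              include.lowest = TRUE)

## Mean of x and mean of y within each bin
binned <- aggregate(cbind(x, y) ~ bin, data = df, FUN = mean)

ggplot(binned, aes(x, y)) +
  geom_point(size = 2, colour = "orange")
```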
Is there any way to plot the cumulative probability from a frequency table? I mean a "smooth" version of it, similar to the way geom_density() plots.
So far, I managed to plot the individually calculated probabilities as points joined by lines, but it doesn't look very good.
I generate some test data:
set.seed(1)
x <- sort(sample(1:100, 20))
p <- runif(x); p <- cumsum(p)/sum(p)
table <- data.frame(x=x, prob=p)
You can use geom_smooth from the ggplot2 package.
require("ggplot2")
qplot(x=x, y=p, data=table, aes(ymin=0, ymax=1)) + ylab("ecf") +
geom_smooth(se=F, stat="smooth", method="loess", fullrange=T, fill="lightgrey", size=1)
As an alternative, for an easy way to control the smoothing with a single parameter, try DeconCdf from the decon package:
require("decon")
plot(DeconCdf(x, sig=1))
If you want to use ggplot, you first have to transform the Decon function object in a data.frame.
f <- DeconCdf(x, sig=1)
m <- ggplot(data=data.frame(x=f$x, p=f$y), aes(x=x, y=p, ymin=0, ymax=1)) + ylab("ecf")
m + geom_line(size=1)
Use the sig parameter as your smoothing parameter:
f <- DeconCdf(x, sig=0.3)
m <- ggplot(data=data.frame(x=f$x, p=f$y), aes(x=x, y=p, ymin=0, ymax=1)) + ylab("ecf")
m + geom_line(size=1)
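A loess fit is not guaranteed to be non-decreasing, which a cumulative probability curve should be. One base R option, offered here as a sketch rather than as what the answers above use, is monotone cubic interpolation with `splinefun(method = "monoH.FC")`. The 200-point grid is an arbitrary choice.

```r
library(ggplot2)

set.seed(1)
x <- sort(sample(1:100, 20))
p <- runif(x); p <- cumsum(p) / sum(p)

## Monotone cubic interpolant through the (x, p) points
cdf <- splinefun(x, p, method = "monoH.FC")
grid <- seq(min(x), max(x), length.out = 200)
smooth_cdf <- data.frame(x = grid, prob = cdf(grid))

ggplot(smooth_cdf, aes(x, prob)) +
  geom_line() +
  geom_point(data = data.frame(x = x, prob = p)) +
  ylab("ecf")
```

The curve passes through every tabulated point, so this interpolates rather than smooths; it suits a frequency table that is already trusted.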
This version plots a histogram with a smoothed line from geom_density:
# Generate some data:
set.seed(28986)
x2 <- rweibull(100, 1, 1/2)
# Plot the points:
library(ggplot2)
library(scales)
ggplot(data.frame(x=x2),aes(x=x, y=1-cumsum(..count..)/sum(..count..))) +
geom_histogram(aes(fill=..count..)) +
geom_density(fill=NA, color="black", adjust=1/2) +
scale_y_continuous("Percent of units\n(equal to or larger than x)",labels=percent) +
theme_grey(base_size=18)
Note that I've used 1 - "cumulative probability" due to individual preference (I think it looks better and I'm accustomed to dealing with "reliability" metrics), but obviously that's just a preference that you could ignore by removing the 1- part in the aes.
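If an unsmoothed version of the same "equal to or larger than x" curve is enough, it can be computed directly with base R's `ecdf()`. This is a minimal sketch using the same simulated Weibull data; the `surv` data frame is a name introduced here for illustration.

```r
library(ggplot2)

set.seed(28986)
x2 <- rweibull(100, 1, 1/2)

Fn <- ecdf(x2)                     # empirical cumulative distribution function
xs <- sort(x2)
surv <- data.frame(x = xs, s = 1 - Fn(xs))   # share of units larger than x

ggplot(surv, aes(x, s)) +
  geom_step() +
  ylab("Share of units larger than x")
```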
Is there any way to add a reduced major axis line (and ideally a CI) to a ggplot? I know I can use method="lm" to get an OLS fit, but there doesn't seem to be a default method for RMA. I can get the RMA coefs and the CI from package lmodel2, but adding them with geom_abline() doesn't seem to work. Here's dummy data and code. I just want to replace the OLS line and CI with an RMA line and CI:
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
ggplot(dat, aes(x=a, y=b) ) +
geom_point(shape=1) +
geom_smooth(method="lm")
Edit1: the code below gets the RMA (here called SMA - standardized major axis) coefs and CIs. Package lmodel2 provides more detailed output, while package smatr returns just the coefs and CIs, if that's any help:
library(lmodel2)
fit1 <- lmodel2(b ~ a, data=dat)
library(smatr)
fit2 <- line.cis(b, a, data=dat)
Building off Joran's answer, I think it's a little easier to pass the whole data frame to geom_abline:
library(ggplot2)
library(lmodel2)
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
mod <- lmodel2(a ~ b, data=dat,"interval", "interval", 99)
reg <- mod$regression.results
names(reg) <- c("method", "intercept", "slope", "angle", "p-value")
ggplot(dat) +
geom_point(aes(b, a)) +
geom_abline(data = reg, aes(intercept = intercept, slope = slope, colour = method))
As Chase commented, the actual lmodel2() code and the ggplot code you are using would be helpful. But here's an example that may point you in the right direction:
dat <- data.frame(a=log10(rnorm(50, 30, 10)), b=log10(rnorm(50, 20, 2)))
mod <- lmodel2(a ~ b, data=dat,"interval", "interval", 99)
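#EDIT: mod is a list, with components (data.frames) regression.results and
# confidence.intervals containing the intercepts+slopes for different
# estimation methods (rows: OLS, MA, SMA, RMA); just put the right values
# into geom_abline. Note: row 4, lmodel2's "RMA", is ranged major axis;
# reduced (standard) major axis is the "SMA" row (row 3).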
ggplot(dat,aes(x=b,y=a)) + geom_point() +
geom_abline(intercept=mod$regression.results[4,2],
slope=mod$regression.results[4,3],colour="blue") +
geom_abline(intercept=mod$confidence.intervals[4,2],
slope=mod$confidence.intervals[4,4],colour="red") +
geom_abline(intercept=mod$confidence.intervals[4,3],
slope=mod$confidence.intervals[4,5],colour="red") +
xlim(c(-10,10)) + ylim(c(-10,10))
Full disclosure: I know nothing about RMA regression, so I just plucked out the relevant slopes and intercepts and plopped them into geom_abline(), using some example code from lmodel2 as a guide. The CIs produced in this toy example don't seem to make much sense, since I had to force ggplot to zoom out using xlim() and ylim() in order to see the CI lines (red).
But maybe this will help you construct a working example in ggplot().
EDIT2: With the OP's added code to extract the coefficients, the ggplot() call would be something like this:
ggplot(dat,aes(x=b,y=a)) + geom_point() +
geom_abline(intercept=fit2[1,1],slope=fit2[2,1],colour="blue") +
geom_abline(intercept=fit2[1,2],slope=fit2[2,2],colour="red") +
geom_abline(intercept=fit2[1,3],slope=fit2[2,3],colour="red")
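As a cross-check on whichever package is used, the reduced (standard) major axis line itself can be computed by hand: its slope is sign(r) * sd(y) / sd(x) and it passes through the point of means. The sketch below uses made-up data and does not compute a confidence interval.

```r
library(ggplot2)

set.seed(1)
a <- log10(runif(50, 10, 50))
dat <- data.frame(a = a, b = 0.5 * a + rnorm(50, sd = 0.05))

r <- cor(dat$a, dat$b)
slope <- sign(r) * sd(dat$b) / sd(dat$a)         # SMA slope for b on a
intercept <- mean(dat$b) - slope * mean(dat$a)   # line passes through the means

ols <- coef(lm(b ~ a, data = dat))               # OLS fit for comparison

ggplot(dat, aes(x = a, y = b)) +
  geom_point(shape = 1) +
  geom_abline(intercept = intercept, slope = slope, colour = "blue") +
  geom_abline(intercept = ols[1], slope = ols[2], colour = "grey50", lty = 2)
```

The OLS slope equals r * sd(b) / sd(a), so the SMA line is always at least as steep as the OLS line.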
I found myself in the same situation.
Obtain fitted values and their confidence intervals using the ggpmisc package:
library(ggpmisc)
ci <- predict.lmodel2(fit1, method= 'RMA', interval= "confidence")
Add the model predictions to your data:
datci <- cbind(dat, ci)
Plot the fitted line with geom_line, setting arguments such as transparency and line width (you can customize them, of course):
# y = fit assumes predict() returned columns named fit, lwr and upr
p <- ggplot(datci, aes(x= b, y= a)) + geom_point() + geom_line(aes(x= b, y= fit), lwd= 1.1, alpha= 0.6)
Use geom_ribbon if you want to add confidence intervals:
p + geom_ribbon(aes(ymin= lwr, ymax= upr), alpha= 0.3, color= NA)