ggplot2: overlaying stat_function() with geom_col() - r

When executing the following piece of code, the output plot shows a blue line of f(x) = 0, instead of the Gamma pdf (see the blue line in this picture).
analyzeGamma <- function(csvPath, alpha, beta) {
dfSamples <- read.csv(file = csvPath,
header = TRUE,
sep = ",")
base <- ggplot(dfSamples, aes(x = value, y = quantity))
base +
geom_col(color = "red") +
geom_vline(xintercept = qgamma(seq(0.1, 0.9, by = 0.1), alpha, beta)) +
stat_function(
fun = dgamma,
args = list(shape = alpha, rate = beta),
colour = "blue"
)
}
path = "/tmp/data.csv"
alpha = 1.2
beta = 0.01
analyzeGamma(path, alpha, beta)
When I comment out the line:
geom_col(color = "red") +
The Gamma pdf is drawn correctly, as can be seen here.
Any idea why it happens and how to resolve?
Thanks.

It's because your geom_col() goes up to 25 and probability density functions have an integral of 1. If I'm correct in assuming your columns resemble a histogram with count data as quantities, you would have to scale your density to match the columns as follows:
density * number of samples * width of columns
If you've precomputed the columns, 'number of samples' would be the sum of all your y-values.
An example with some toy data, notice the function in the stat:
alpha = 1.2
beta = 0.01
df <- data.frame(x = rgamma(1000, shape = alpha, rate = beta))
binwidth <- 5
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * length(df$x) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
The following example with geom_col() gives the same picture:
x <- table(cut_width(df$x, binwidth, boundary = 0))
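# Bin midpoints, one per table entry (assumes the first bin starts at 0);
# this keeps x and y the same length whatever max(df$x) is
newdf <- data.frame(x = (seq_along(x) - 0.5) * binwidth,
y = as.numeric(x))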
ggplot(newdf, aes(x, y)) +
geom_col(width = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * sum(newdf$y) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
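The scaling rule above can be sanity-checked without plotting. The sketch below borrows the parameter values from the question; the bin range of 0 to 500 is an arbitrary choice for illustration. It compares the expected count per bin, computed from the Gamma CDF, with density * number of samples * width of columns evaluated at the bin midpoints.

```r
alpha <- 1.2
beta <- 0.01
n <- 1000        # number of samples
binwidth <- 5

# Bin edges and midpoints; the upper limit of 500 is arbitrary
breaks <- seq(0, 500, by = binwidth)
mids <- head(breaks, -1) + binwidth / 2

# Expected count per bin, from the CDF
expected_counts <- n * diff(pgamma(breaks, shape = alpha, rate = beta))

# Approximation used for the overlay: density * n * binwidth
scaled_density <- dgamma(mids, shape = alpha, rate = beta) * n * binwidth

# Relative disagreement between the two, per bin
rel_err <- abs(scaled_density - expected_counts) / expected_counts
max(rel_err)
```

The two should agree closely; the approximation is least accurate where the density curves sharply, which for these parameters is the bin next to zero.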

ggplot scales the y-axis to show all of the data. The blue curve looks like a flat line because of the scale. Compare the y-axis in the two charts: with geom_col() the y-axis maximum is somewhere around 25, so the stat_function() curve appears to be a straight line at zero; without geom_col(), the y-axis maximum is somewhere around 0.006.
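To put numbers on this, the peak of the density can be computed directly. The sketch below uses the question's parameters and the standard result that a Gamma density with shape > 1 has its mode at (shape - 1) / rate; the column height of 25 is taken from the answer above.

```r
alpha <- 1.2
beta <- 0.01

# Mode of the Gamma distribution (valid for shape > 1)
mode_x <- (alpha - 1) / beta

# Height of the density at its mode
peak <- dgamma(mode_x, shape = alpha, rate = beta)
peak

# Fraction of a 0-25 y-axis that the unscaled curve would occupy
peak / 25
```

The peak is far below 1% of the column height, which is why the unscaled curve is indistinguishable from the x-axis.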

Related

Plot logistic regression using parameters in ggplot2

I would like to plot a logistic regression directly from the parameter estimates using ggplot2, but not quite sure how to do it.
For example, if I had 1500 draws of alpha and beta parameter estimates, I could plot each of the lines thus:
alpha_post = rnorm(n=1500,mean=1.1,sd = .15)
beta_post = rnorm(n=1500,mean=1.8,sd = .19)
X_lim = seq(from = -3,to = 2,by=.01)
for (i in 1:length(alpha_post)){
print(i)
y = exp(alpha_post[i] + beta_post[i]*X_lim)/(1+ exp(alpha_post[i] + beta_post[i]*X_lim) )
if (i==1){plot(X_lim,y,type="l")}
else {lines(X_lim,y)}
}
How would I do this in ggplot2? I know how to use geom_smooth(), but this is a little different.
As always in ggplot, you want to make a data.frame with all data that needs to be plotted:
d <- data.frame(
alpha_post = alpha_post,
beta_post = beta_post,
X_lim = rep(seq(from = -3,to = 2,by=.01), each = length(alpha_post))
)
d$y <- with(d, exp(alpha_post + beta_post * X_lim) / (1 + exp(alpha_post + beta_post * X_lim)))
Then the plotting itself becomes quite easy:
ggplot(d, aes(X_lim, y, group = alpha_post)) + geom_line()
If you want to be more fancy, add a summary line with e.g. the mean:
ggplot(d, aes(X_lim, y)) +
geom_line(aes(group = alpha_post), alpha = 0.3) +
geom_line(size = 1, color = 'firebrick', stat = 'summary', fun = 'mean')
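A variation on the same idea, offered as a sketch rather than a replacement: build the grid with expand.grid() and an explicit draw index, and use base R's plogis() for the inverse logit. The column name `draw` and the reduced number of draws (50) are choices made here for illustration. Grouping by `draw` avoids relying on the alpha values being unique.

```r
set.seed(1)
n_draws <- 50  # fewer draws than the question's 1500, to keep the example light
alpha_post <- rnorm(n_draws, mean = 1.1, sd = 0.15)
beta_post <- rnorm(n_draws, mean = 1.8, sd = 0.19)
X_lim <- seq(from = -3, to = 2, by = 0.01)

# One row per (draw, x) combination
d <- expand.grid(draw = seq_len(n_draws), X_lim = X_lim)

# plogis(q) is exp(q) / (1 + exp(q))
eta <- alpha_post[d$draw] + beta_post[d$draw] * d$X_lim
d$y <- plogis(eta)

# The explicit formula from the question, for comparison
y_explicit <- exp(eta) / (1 + exp(eta))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(X_lim, y, group = draw)) + geom_line(alpha = 0.3)
  print(p)
}
```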

How to Add a Legend to a ggplot without plotting the raw data?

I have made a plot of a polynomial function: y = x^2 - 6*x + 9
with a series of several points in a sequence + minor standard error in y. I used these points to construct a spline model for that function from the raw data points, and then I calculated the derivative from the spline model with R's predict() function and then I added both of the spline curves to the plot.
By the way, the expected derivative function is this: dy / dx = 2*x - 6
The original function I colored blue and the 1st derivative function I colored red. I wish to add legends to these plots, but I'm finding that difficult since I did not assign any points to the plots, as I declared the data-frames within the geom_smooth() functions.
The code I'm using is this:
library(ggplot2)
# Plot the function: f(x) = x^2 - 6x + 9
# with a smooth spline:
# And then the derivative of that function from predicted values of the
# smoothed spline: f ' (x) = 2*x - 6
# Get a large sequence of x-values:
x <- seq(from = -10, to = 10, by = 0.01)
# The y-values are a function of each x value.
y <- x^2 - 6*x + 9 + rnorm(length(x), 0, 0.5)
# Fit the curve to a model which is a smoothing spline.
model <- smooth.spline(x = x, y = y)
# Predict the 1st derivative of this smoothed spline.
f_x <- predict(model, x = seq(from = min(x), to = max(x), by = 1), deriv = 1)
# Plot the smoothed spline of the original function and the derivative with respect to x.
p <- ggplot() + theme_bw() + geom_smooth(data = data.frame(x,y), aes(x = x, y = y), method = "loess", col = "blue", se = TRUE) + geom_smooth(data = data.frame(f_x$x, f_x$y), aes(x = f_x$x, y = f_x$y), method = "loess", col = "red", se = TRUE)
# Set the bounds of the plot.
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) + scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
# Add some axis labels
p <- p + labs(x = "x-axis", y = "y-axis", title = "Original Function and predicted derivative function")
p <- p + scale_fill_manual(values = c("blue", "red"), labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
I was hoping that I could add the legend with scale_fill_manual(), but my attempt does not add a legend to the plot. Essentially, the plot I get generally looks like this, minus the messy legend that I added in paint. I would like that legend, thank you.
I did this because I want to show my chemistry instructor that I can accurately measure heat capacity from differential scanning calorimetry data alone; I believe the heat capacity is just the first derivative of the heat flow vs. temperature curve with respect to temperature.
So I made a plot showing the original function overlaid with its first derivative with respect to x, to show that a first derivative computed only from a spline fitted to raw data points reliably reproduces the expected line dy / dx = 2 * x - 6, which it does.
I just want to add that legend.
Creating a single data frame with your data and mapping color within the aesthetics is the most common way of doing this.
df <- rbind(
data.frame(data='f(x)', x=x, y=y),
data.frame(data='f`(x)', x=f_x$x, y=f_x$y))
p <- ggplot(df, aes(x,y, color=data)) + geom_smooth(method = 'loess')
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) + scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
p <- p + labs(x = "x-axis", y = "y-axis", title = "Original Function and predicted derivative function")
p <- p + scale_color_manual(name = "Functions", values = c("blue", "red"), labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
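If you would rather keep the two data frames separate, as in the question, another common ggplot2 idiom is to map colour to a constant string inside aes() for each layer and assign the actual colours in scale_colour_manual(). The following is a minimal sketch of that idea; the coarser x grid (step 0.1), the seed, and the names `orig` and `deriv` are assumptions made here to keep the example quick.

```r
library(ggplot2)

set.seed(42)
x <- seq(from = -10, to = 10, by = 0.1)
y <- x^2 - 6 * x + 9 + rnorm(length(x), 0, 0.5)

# Spline fit and its first derivative, as in the question
model <- smooth.spline(x = x, y = y)
f_x <- predict(model, x = seq(from = min(x), to = max(x), by = 1), deriv = 1)

orig <- data.frame(x = x, y = y)
deriv <- data.frame(x = f_x$x, y = f_x$y)

# A constant string inside aes() creates a legend entry for that layer
p <- ggplot(mapping = aes(x = x, y = y)) +
  theme_bw() +
  geom_smooth(data = orig, aes(colour = "Original Function"), method = "loess") +
  geom_smooth(data = deriv, aes(colour = "Derivative Function"), method = "loess") +
  scale_colour_manual(name = "Functions",
                      values = c("Original Function" = "blue",
                                 "Derivative Function" = "red"))
print(p)
```

Naming the entries of `values` ties each colour to its label explicitly, so the result does not depend on the order in which the labels are sorted.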

Nonparametric regression ggplot

I'm trying to plot some nonparametric regression curves with ggplot2. I managed to do it with the base plot() function:
library(KernSmooth)
set.seed(1995)
X <- runif(100, -1, 1)
G <- X[which (X > 0)]
L <- X[which (X < 0)]
u <- rnorm(100, 0 , 0.02)
Y <- -exp(-20*L^2)-exp(-20*G^2)/(X+1)+u
m <- lm(Y~X)
plot(Y~X)
abline(m, col="red")
m2 <- locpoly(X, Y, bandwidth = 0.05, degree = 0)
lines(m2$x, m2$y, col = "red")
m3 <- locpoly(X, Y, bandwidth = 0.15, degree = 0)
lines(m3$x, m3$y, col = "black")
m4 <- locpoly(X, Y, bandwidth = 0.3, degree = 0)
lines(m4$x, m4$y, col = "green")
legend("bottomright", legend = c("NW(bw=0.05)", "NW(bw=0.15)", "NW(bw=0.3)"),
lty = 1, col = c("red", "black", "green"), cex = 0.5)
With ggplot2 I have managed to plot the linear regression with this code:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25))
But I don't know how to add the other lines to this plot, as in the base R plot. Any suggestions? Thanks in advance.
Solution
The shortest solution (though not the most beautiful one) is to add the lines using the data= argument of the geom_line function:
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = as.data.frame(m2), mapping = aes(x=x,y=y))
Beautiful solution
To get beautiful colors and legend, use
# Need to convert lists to data.frames, ggplot2 needs data.frames
m2 <- as.data.frame(m2)
m3 <- as.data.frame(m3)
m4 <- as.data.frame(m4)
# Colnames are used as names in ggplot legend. There's nothing wrong with using
# column names which contain symbols or whitespace, you just have to use
# backticks, e.g. m2$`NW(bw=0.05)` if you want to work with them
colnames(m2) <- c("x","NW(bw=0.05)")
colnames(m3) <- c("x","NW(bw=0.15)")
colnames(m4) <- c("x","NW(bw=0.3)")
# To give the different kernel regression estimates different colors, they must all be in one data frame.
# For merging to work, all x columns of m2-m4 must be the same!
# The merge function will automatically detect columns with the same name
# (that is, x) in m2-m4 and use them to identify y values which belong
# together (to the same x value)
mm <- Reduce(x=list(m2,m3,m4), f=function(a,b) merge(a,b))
# The above line is the same as:
# mm <- merge(m2,m3)
# mm <- merge(mm,m4)
# ggplot needs data in long (tidy) format
mm <- tidyr::gather(mm, kernel, y, -x)
ggplot(m, aes(x = X, y = Y)) +
geom_point(shape = 1) +
geom_smooth(method = lm, se = FALSE) +
theme(axis.line = element_line(colour = "black", size = 0.25)) +
geom_line(data = mm, mapping = aes(x=x,y=y,color=kernel))
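An alternative to merging and gathering is to build the long-format data frame directly by stacking one data frame per bandwidth, which does not require the x grids to match. The sketch below is self-contained and makes several assumptions: the toy X and Y stand in for the question's data, and `nw` is a hypothetical hand-written Nadaraya-Watson estimator with a Gaussian kernel, used here in place of locpoly(..., degree = 0), so its values need not match locpoly's exactly.

```r
set.seed(1995)
# Toy data standing in for the question's X and Y (assumed for illustration)
X <- runif(100, -1, 1)
Y <- -exp(-20 * X^2) + rnorm(100, 0, 0.02)

# Nadaraya-Watson estimate with a Gaussian kernel (hypothetical helper):
# a kernel-weighted average of y around each evaluation point
nw <- function(x0, x, y, bw) {
  vapply(x0, function(pt) {
    w <- dnorm((pt - x) / bw)
    sum(w * y) / sum(w)
  }, numeric(1))
}

grid <- seq(-1, 1, length.out = 201)
bandwidths <- c(0.05, 0.15, 0.3)

# One data frame per bandwidth, stacked row-wise into long format
curves <- do.call(rbind, lapply(bandwidths, function(bw) {
  data.frame(x = grid,
             y = nw(grid, X, Y, bw),
             kernel = paste0("NW(bw=", bw, ")"))
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(X, Y), aes(x = X, y = Y)) +
    geom_point(shape = 1) +
    geom_smooth(method = lm, se = FALSE) +
    geom_line(data = curves, aes(x = x, y = y, colour = kernel))
  print(p)
}
```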
Solution which will settle this for everyone and for eternity
The most elegant and reproducible way, though, would be to create a custom stat in ggplot2 (see the stats included with ggplot2).
The ggplot2 team has a vignette on this topic: Extending ggplot2. I have never undertaken such a heroic endeavour, though.

Scatter plot in R with large overlap and 3000+ points

I am making a scatter plot in R with ggplot2. I am comparing the fraction of votes Hillary and Bernie received in the primary against education level. There is a lot of overlap and way too many points. I tried to use transparency so I could see the overlap, but it still looks bad.
Code:
demanalyze <- function(infocode, n = 1){
infoname <- filter(infolookup, column_name == infocode)$description
infocolumn <- as.vector(as.matrix(mydata[infocode]))
ggplot(mydata) +
aes(x = infocolumn) +
ggtitle(infoname) +
xlab(infoname) +
ylab("Fraction of votes each candidate received") +
geom_point(aes(y = sanders_vote_fraction, colour = "Bernie Sanders")) +#, color = alpha("blue",0.02), size=I(1)) +
stat_smooth(aes(y = sanders_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkblue", se = F) +
geom_point(aes(y = clinton_vote_fraction, colour = "Hillary Clinton")) +#, color = alpha("red",0.02), size=I(1)) +
stat_smooth(aes(y = clinton_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkred", se = F) +
scale_colour_manual("",
values = c("Bernie Sanders" = alpha("blue",0.02), "Hillary Clinton" = alpha("red",0.02))
) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
}
What could I change to make the overlap areas look less messy?
A common way to plot a large number of points in two dimensions is to use 2D density plots.
Here is a reproducible example:
x1 <- rnorm(1000, mean=10)
x2 <- rnorm(1000, mean=10)
y1 <- rnorm(1000, mean= 5)
y2 <- rnorm(1000, mean = 7)
mydat <- data.frame(xaxis=c(x1, x2), yaxis=c(y1, y2), lab=rep(c("H","B"),each=1000))
head(mydat)
library(ggplot2)
##Dots and density plots (kinda messy, but can play with alpha)
p1 <-ggplot(mydat) + geom_point(aes(x=xaxis, y = yaxis, color=lab),alpha=0.4) +
stat_density2d(aes(x=xaxis, y = yaxis, color=lab))
p1
## just density
p2 <-ggplot(mydat) + stat_density2d(aes(x=xaxis, y = yaxis, color=lab))
p2
There are many parameters to play with; see the ggplot2 documentation for stat_density2d() for the full details.
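For context, stat_density2d() performs its 2D kernel density estimate with MASS::kde2d(), so the same estimate can be inspected directly. The sketch below uses simulated data similar to one group in the example above; the grid size of 50 and the seed are arbitrary choices. MASS is a recommended package bundled with most R installations.

```r
library(MASS)

set.seed(1)
xaxis <- rnorm(1000, mean = 10)
yaxis <- rnorm(1000, mean = 5)

# Density estimate on a 50 x 50 grid spanning the range of the data
dens <- kde2d(xaxis, yaxis, n = 50)
dim(dens$z)

# Grid location of the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_xy <- c(x = dens$x[peak[1, "row"]], y = dens$y[peak[1, "col"]])
peak_xy
```

The contour lines drawn by stat_density2d() are level curves of a matrix like `dens$z`, so the densest region should sit near the means used to simulate the data.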

Computing weighted average using lowess method in R

I am trying to use the lowess method from R to compute the weighted average of a data set which is not uniformly distributed along x axis. For example, the first 5 data points are like this, where the first column is the x and the second is the y.
375.0 2040.0
472.0 5538.0
510.0 4488.0
573.0 2668.0
586.0 7664.0
I used the following command in R:
x<-read.table(add,header=FALSE,sep="\t")
y<-lowess(x[,1],x[,2],f=0.01)
write.table(y, file = results , sep = "\t", col.names =FALSE, row.names =FALSE)
The output looks like this:
The green line shows the average computed by the smooth function in matlab (tri-cubic kernel), and the red line is the average line computed by lowess method in R. The blue dots are the data points.
I can't find why the method in R does not work. Do you have any idea?
Here is a link to part of the data.
Thanks a lot for your help.
The smooth function in MATLAB works like a moving-average filter:
yy = smooth(y)
yy(1) = y(1)
yy(2) = (y(1) + y(2) + y(3))/3
yy(3) = (y(1) + y(2) + y(3) + y(4) + y(5))/5 ## convolution of size 5
yy(4) = (y(2) + y(3) + y(4) + y(5) + y(6))/5
I think it is better to do a simple smooth here.
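The endpoint behaviour listed above, where the window shrinks near the ends so that it stays centred, can be written as a small R function. This is a sketch of a plain moving average in the style described, not MATLAB's implementation; `moving_average` is a hypothetical helper, and like the listing it ignores the x spacing, which matters when the data are not uniformly spaced.

```r
# Centred moving average with a window that shrinks at the ends
moving_average <- function(y, span = 5) {
  stopifnot(span %% 2 == 1)  # the window must be centred, so span is odd
  n <- length(y)
  half <- (span - 1) / 2
  vapply(seq_len(n), function(i) {
    # Largest half-width that fits on both sides of position i
    h <- min(half, i - 1, n - i)
    mean(y[(i - h):(i + h)])
  }, numeric(1))
}

# The five y-values quoted in the question
y <- c(2040, 5538, 4488, 2668, 7664)
yy <- moving_average(y)
yy
```

For these five values the result is 2040, 4022, 4479.6, 4940, 7664: the first and last points are unchanged, the second and fourth are averages of three points, and the middle one is the average of all five.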
Here are some attempts using loess, lowess with f = 0.2 (1/5), and smooth.spline.
I am using ggplot2 to plot (so that I can use geom_jitter with some alpha).
library(ggplot2)
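# `data` is assumed to hold the linked file, read with read.table()
# (two unnamed columns, hence V1 and V2)
dat <- subset(data, V2 < 5000)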
#dat <- data
xy <- lowess(dat$V1,dat$V2,f = 0.8)
xy <- as.data.frame(do.call(cbind,xy))
p1<- ggplot(data = dat, aes(x= V1, y = V2))+
geom_jitter(position = position_jitter(width = .2), alpha= 0.1)+
geom_smooth()
xy <- lowess(dat$V1,dat$V2,f = 0.2)
xy <- as.data.frame(do.call(cbind,xy))
xy.smooth <- smooth.spline(dat$V1,dat$V2)
xy.smooth <- data.frame(x= xy.smooth$x,y = xy.smooth$y)
p2 <- ggplot(data = dat, aes(x= V1, y = V2))+
geom_jitter(position = position_jitter(width = .2), alpha= 0.1)+
geom_line(data = xy, aes(x=x, y = y, group = 1 ), color = 'red')+
geom_line(data = xy.smooth, aes(x=x, y = y, group = 1 ), color = 'blue')
library(gridExtra)
grid.arrange(p1,p2)
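In lowess(), f is the smoother span: the proportion of points that influence the fit at each x value. A very small span, such as the f = 0.01 used in the question, lets the curve follow individual points, which may explain the jagged red line. The sketch below uses simulated data (an assumption, since the question's file is not reproduced here) to show how the span changes the wiggliness of the fit; `roughness` is a hypothetical helper that sums squared second differences of the fitted values.

```r
set.seed(1)
# Simulated, non-uniformly spaced data (assumed for illustration)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.5)

fit_small <- lowess(x, y, f = 0.05)  # each local fit uses about 5% of the points
fit_large <- lowess(x, y, f = 0.8)   # each local fit uses about 80% of the points

# Crude wiggliness measure: larger means a more jagged curve
roughness <- function(fit) sum(diff(fit$y, differences = 2)^2)

r <- c(small_f = roughness(fit_small), large_f = roughness(fit_large))
r
```

The small span gives a much rougher curve than the large one; a value between the two extremes is usually what is wanted for a running average.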
