I can get a single power curve (shown below), but I want to create a power analysis graph: I want to vary my delta value (to .6, .7, and .8) and plot those 3 other lines on the same curve in different colors. I have provided an example of what I roughly want it to look like.
n_participants <- c(5, 10, 20, 30, 40)
npercluster <- 20
n_tot <- n_participants*npercluster
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot / deff
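# Worked example of the design-effect correction (numbers assume the
# values above): deff = 1 + 0.6*(20 - 1) = 12.4, so 5 participants
# (100 total volumes) give an effective sample size of 100/12.4 ~ 8.06.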
mydelt <- 0.5
mypowers <- power.t.test(n=ess, delta=mydelt)$power
plot(n_participants, mypowers, type='l',
main=paste('Power based on', npercluster, 'volumes per participant'),
xlab='Number of participants', ylim=c( 0, 1),
ylab='Power')
If you are planning to use R a lot, I would recommend investing in learning ggplot2; base R plotting solutions become limiting very quickly.
To solve your problem I would make a data frame with every combination of effect size and sample size.
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), ess=n_tot / deff)
Then add a column for the power:
dat$mypowers <- power.t.test(n=dat$ess, delta=dat$mydelt)$power
Then I can use ggplot to easily make a nice graph of the power curves:
library(ggplot2)
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) + geom_point() + geom_line()
You can easily change the overall graph look and add appropriate labels:
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) +
geom_point() +
geom_line() +
theme_bw() +
labs(x="Effective sample size", y="Power", color="Effect size" )
In response to the comment: there was a mistake in the code above, in that I plotted the effective total sample size on the x-axis rather than the number of participants. So instead we should make sure we have n_participants in the dataset for plotting, then calculate the powers and plot.
So the whole script is now:
n_participants <- 5:40
npercluster <- 20
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), npart=n_participants)
dat$n_tot <- dat$npart*npercluster
dat$ess <- dat$n_tot / deff
dat$mypowers <- power.t.test(n=dat$ess, delta=dat$mydelt)$power
library(ggplot2)
ggplot(dat, aes(x=npart, y=mypowers, color=factor(mydelt))) +
geom_line()+
theme_bw() +
labs(x="Number of participants", y="Power", color="Effect size" )
Which gives this graph:
You may put the logic in a function f, sapply over the desired deltas and, as also suggested in the comments, use matplot, without having to bother with any new packages.
f <- \(mydelt=.5, n_participants=c(5, 10, 20, 30, 40), npercluster=20, icc=.6) {
n_tot <- n_participants*npercluster
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot/deff
power.t.test(n=ess, delta=mydelt)$power
}
deltas <- seq(.5, .8, .1)
res <- sapply(deltas, f)  # one column of powers per delta
matplot(c(5, 10, 20, 30, 40), res, type='l',
        main='Power based on 20 volumes per participant',
        xlab='Number of participants',
        ylab='Power')
legend('topleft', legend=deltas, col=seq_along(deltas), lty=seq_along(deltas),
title='delta', cex=.8)
It's also possible to pipe it directly into matplot:
sapply(deltas, f) |>
  matplot(...)
See ?matplot for easy customizing of colors, linetypes etc.
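For instance, a minimal sketch (the colors and line types here are arbitrary choices, not from the original answer):
matplot(c(5, 10, 20, 30, 40), res, type='l', lty=1, lwd=2,
        col=c('black', 'red', 'blue', 'darkgreen'),
        xlab='Number of participants', ylab='Power')
legend('bottomright', legend=deltas, lty=1, lwd=2,
       col=c('black', 'red', 'blue', 'darkgreen'), title='delta')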
Note: R >= 4.1 used.
Currently, the plot is not useful. How would I plot this distribution, since the range is too high?
I have 50 years of data and have to see which activity is most harmful.
The data contain about 1000 unique activities (column1).
I am using group_by(column1) and summarise(total = sum(column2, column3)),
but the problem is that a few total values run to 6 or 7 digits.
Because of these two facts my plot looks bad: so many names are clustered together on the x-axis due to lack of space, and the few very high values push most of the y values down near the x-axis.
I think a log transformation might help you gain some better insight out of your data:
Set up some fake data that resembles your situation:
set.seed(1776) # reproducible random numbers
num_obs <- 10000 # set number of observations
options(scipen = 999) # don't use scientific notation
# don't worry about this code, just creating a reproducible example
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers)) + 2) *
  abs(rnorm(sum(make_these_outliers)) * 50000)
Recreate the plot you have now to show the issue you're facing:
# recreating your current situation
plot(y, main='Ugly Plot')
Log10 transformation
Now we'll use the log10 transformation on your data and visualize the result. So a value of "10" is now "1", a value of "100" is now "2", a value of "1000" is now "3", etc.
# log10
plot(log10(y), col= rgb(0, 0, 0, alpha=0.3), pch=16, main='Log Scale and Transparency - Slightly Better')
The pch = 16 argument fills in the points and alpha = 0.3 sets the opacity of each point. An alpha of 0.3 means an opacity of 30% (you can also think of this as 70% transparent).
ggplot2
I'll also show this in ggplot2, because with scale transformations ggplot2 is smart enough to put the true values on the y-axis, saving you the mental gymnastics of undoing log10 transforms in your head.
# now with ggplot2
# install.packages("ggplot2") # <-- run this if you haven't installed ggplot2 yet
library(ggplot2)
# ggplot2 prefers your data to be in a data.frame (makes it easier to work with)
data_df <- data.frame(
index = 1:num_obs,
y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
geom_point(alpha=0.2) +
scale_y_continuous(trans="log10") +
ggtitle("Y-axis reflects values of the datapoints", "even better?") +
theme_bw(base_size = 12)
At this point, you can start to tell how I've constructed the fake data, which is why there is such a high concentration of points in the 10-1000 range.
Hopefully this helps! I definitely recommend taking PauloH's advice and asking around on stats.stackexchange.com as well to make sure you aren't misrepresenting your data.
Using ggplot2 instead and setting alpha may solve your problem, but if that is not enough you may want to add facet_zoom() from the ggforce package.
set.seed(1776)
num_obs <- 10000
options(scipen = 999)
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers)) + 2) *
  abs(rnorm(sum(make_these_outliers)) * 50000)
# install.packages('ggplot2')
library(ggplot2)
# install.packages('ggforce')
library(ggforce)
data_df <- data.frame(
index = 1:num_obs,
y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
geom_point(alpha=0.05) +
facet_zoom(y = (y <= 500), zoom.size = .8) +
theme_bw()
The result would look more or less like the following:
Hope it helps. Check ggforce's GitHub:
https://github.com/thomasp85/ggforce
I am trying to plot lattice type data with ggplot2 and then superimpose a normal distribution over the sample data to illustrate how far off normal the underlying data is. I would like to have the normal dist on top to have the same mean and stdev as the panel.
here's an example:
library(ggplot2)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
#This works
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + facet_wrap(~State_CD)
print(pg)
That all works great and produces a nice three panel graph of the data. How do I add the normal dist on top? It seems I would use stat_function, but this fails:
#this fails
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + stat_function(fun=dnorm) + facet_wrap(~State_CD)
print(pg)
It appears that the stat_function is not getting along with the facet_wrap feature. How do I get these two to play nicely?
------------EDIT---------
I tried to integrate ideas from two of the answers below and I am still not there:
using a combination of both answers I can hack together this:
library(ggplot2)
library(plyr)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
DevMeanSt <- ddply(dd, c("State_CD"), function(df)mean(df$Predicted_value))
colnames(DevMeanSt) <- c("State_CD", "mean")
DevSdSt <- ddply(dd, c("State_CD"), function(df)sd(df$Predicted_value) )
colnames(DevSdSt) <- c("State_CD", "sd")
DevStatsSt <- merge(DevMeanSt, DevSdSt)
pg <- ggplot(dd, aes(x=Predicted_value))
pg <- pg + geom_density()
pg <- pg + stat_function(fun=dnorm, colour='red', args=list(mean=DevStatsSt$mean, sd=DevStatsSt$sd))
pg <- pg + facet_wrap(~State_CD)
print(pg)
which is really close... except something is wrong with the normal dist plotting:
what am I doing wrong here?
stat_function is designed to overlay the same function in every panel. (There's no obvious way to match up the parameters of the function with the different panels).
As Ian suggests, the best way is to generate the normal curves yourself, and plot them as a separate dataset (this is where you were going wrong before - merging just doesn't make sense for this example and if you look carefully you'll see that's why you're getting the strange sawtooth pattern).
Here's how I'd go about solving the problem:
library(plyr)  # for ddply() below; ggplot2 is loaded in the question's code
dd <- data.frame(
predicted = rnorm(72, mean = 2, sd = 2),
state = rep(c("A", "B", "C"), each = 24)
)
grid <- with(dd, seq(min(predicted), max(predicted), length = 100))
normaldens <- ddply(dd, "state", function(df) {
data.frame(
predicted = grid,
density = dnorm(grid, mean(df$predicted), sd(df$predicted))
)
})
ggplot(dd, aes(predicted)) +
geom_density() +
geom_line(aes(y = density), data = normaldens, colour = "red") +
facet_wrap(~ state)
Originally posted as an answer to this question, I was encouraged to share my solution here too.
I too became frustrated with overlaying theoretical densities over empirical data, so I wrote a function that automates this process. Since 2009, when this question was first posed, ggplot2 has greatly expanded its extensibility, so I've put the function in an extension package on github (EDIT: you can find it on CRAN now).
library(ggplot2)
library(ggh4x)
set.seed(0)
# Make the example data
dd <- data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),
c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
ggplot(dd, aes(Predicted_value)) +
geom_density() +
stat_theodensity(colour = "red") +
facet_wrap(~ State_CD)
Created on 2021-01-28 by the reprex package (v0.3.0)
If you are willing to use ggformula, then this is pretty easy. (It is also possible to mix and match, using ggformula just for the distribution overlay, but I'll illustrate the full-on ggformula approach.)
library(ggformula)
theme_set(theme_bw())
gf_dens( ~ Sepal.Length | Species, data = iris) %>%
gf_fitdistr(color = "red") %>%
gf_fitdistr(dist = "gamma", color = "blue")
Created on 2019-01-15 by the reprex package (v0.2.1)
I think you need to provide more information. This seems to work:
pg <- ggplot(dd, aes(Predicted_value)) ## need aesthetics in the ggplot
pg <- pg + geom_density()
## gotta provide the arguments of the dnorm
pg <- pg + stat_function(fun=dnorm, colour='red',
args=list(mean=mean(dd$Predicted_value), sd=sd(dd$Predicted_value)))
## wrap it!
pg <- pg + facet_wrap(~State_CD)
pg
We are providing the same mean and sd parameters for every panel. Getting panel-specific means and standard deviations is left as an exercise for the reader* ;)
'*' In other words, I'm not sure how it can be done...
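For what it's worth, here is a hedged base-R sketch of one way to get panel-specific curves. It follows the separate-dataset idea from the accepted answer, just without plyr, and assumes the dd data frame from the question:
library(ggplot2)
# build per-state normal curves as a separate dataset (no plyr needed)
curves <- do.call(rbind, lapply(split(dd, dd$State_CD), function(df) {
  grid <- seq(min(df$Predicted_value), max(df$Predicted_value), length.out = 100)
  data.frame(State_CD = df$State_CD[1],
             Predicted_value = grid,
             density = dnorm(grid, mean(df$Predicted_value), sd(df$Predicted_value)))
}))
ggplot(dd, aes(Predicted_value)) +
  geom_density() +
  geom_line(aes(y = density), data = curves, colour = "red") +
  facet_wrap(~State_CD)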
If you don't want to generate the normal-distribution line graph by hand, still want to use stat_function, and are happy showing the graphs side by side, then you could consider using the "multiplot" function published in the "Cookbook for R" as an alternative to facet_wrap. You can copy the multiplot code to your project from here.
After you copy the code, do the following:
# Some fake data (copied from hadley's answer)
dd <- data.frame(
predicted = rnorm(72, mean = 2, sd = 2),
state = rep(c("A", "B", "C"), each = 24)
)
# Split the data by state, apply a function to each slice that converts it into a
# plot object, and return the result as a list of plots.
plots <- lapply(split(dd,dd$state),FUN=function(state_slice){
# The code here is the plot code generation. You can do anything you would
# normally do for a single plot, such as calling stat_function, and you do this
# one slice at a time.
ggplot(state_slice, aes(predicted)) +
geom_density() +
stat_function(fun=dnorm,
args=list(mean=mean(state_slice$predicted),
sd=sd(state_slice$predicted)),
color="red")
})
# Finally, present the plots on 3 columns.
multiplot(plotlist = plots, cols=3)
I think your best bet is to draw the line manually with geom_line.
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
dd$State_CD <- factor(dd$State_CD) # ensure a factor (R >= 4.0 no longer converts strings automatically)
dd$Predicted_value<-dd$Predicted_value*as.numeric(dd$State_CD) #make different by state
##Calculate means and standard deviations by level
means<-as.numeric(by(dd[,2],dd$State_CD,mean))
sds<-as.numeric(by(dd[,2],dd$State_CD,sd))
##Create evenly spaced evaluation points +/- 3 standard deviations away from the mean
dd$vals<-0
for(i in 1:length(levels(dd$State_CD))){
dd$vals[dd$State_CD==levels(dd$State_CD)[i]]<-seq(from=means[i]-3*sds[i],
to=means[i]+3*sds[i],
length.out=sum(dd$State_CD==levels(dd$State_CD)[i]))
}
##Create normal density points
dd$norm<-with(dd,dnorm(vals,means[as.numeric(State_CD)],
sds[as.numeric(State_CD)]))
pg <- ggplot(dd, aes(Predicted_value))
pg <- pg + geom_density()
pg <- pg + geom_line(aes(x=vals,y=norm),colour="red") #Add in normal distribution
pg <- pg + facet_wrap(~State_CD,scales="free")
pg
Can anyone tell me how to create a plot which features 3 different matrices of data? In general, I have 3 matrices, each of dimension 1*1001, and I wish to plot all 3 on the same graph.
I have managed to get one matrix to plot, and to assemble the code that creates the other 2 matrices, but not to plot them. B[i,] is randomly generated data. What I would like to know is the code to get all 3 plots together on one graph.
Code for one matrix:
ntime<-1000
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
}
matplot(t, t(average.price.at.each.timestep), type="l", lty=1, main="MC Price of a Zero Coupon Bond", ylab="Price", xlab = "Option Exercise Date")
Code for 3:
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
s.e.at.each.time <-matrix(0,nrow=1,ncol=ntime+1)
upper.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
lower.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
std <- function(x) sd(x)/sqrt(length(x))
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
s.e.at.each.time[i] <- std(B[i,])
upper.c.l.at[i] <- average.price.at.each.timestep[i]+1.96*s.e.at.each.time[i]
lower.c.l.at[i] <- average.price.at.each.timestep[i]-1.96*s.e.at.each.time[i]
}
I'm still struggling with this, as I cannot get the solutions given to match my data sets. I have now included below the code that generates the matrix B, as a working example, so you can see the data I am dealing with. As you can see, it produces a plot of the different prices; I would like a plot of the average price with confidence intervals around that average.
# Define Bond Price Parameters
#
P<-1 #par value
# Define Vasicek Model Parameters
#
rev.rate<-0.3 #speed of reversion
long.term.mean<-0.1 #long term level of the mean
sigma<-0.05 #volatility
r0<-0.03 #spot interest rate
Strike<-0.05
# Define Simulation Parameters
#
T<-50 #time to expiry
ntime<-1000 #number of timesteps
yearstep<-ntime/T #yearstep
npaths<-1000 #number of paths
dt<-T/ntime #timestep
R <- matrix(0,nrow=ntime+1,ncol=npaths) #matrix of Vasicek interest rate values
B <- matrix(0,nrow=ntime+1,ncol=npaths) # matrix of Bond Prices
R[1,]<-r0 #specifies that all paths start at specified spot rate
B[1,]<-P
# do loop which generates values to fill matrix R with multiple paths of Interest Rates as they evolve over time.
# stochastic process based on standard normal distribution
for (j in 1:npaths) {
for (i in 1:ntime) {
dZ <-rnorm(1,mean=0,sd=1)*sqrt(dt)
Rij<-R[i,j]
Bij<-B[i,j]
dr <-rev.rate*(long.term.mean-Rij)*dt+sigma*dZ
R[i+1,j]<-Rij+dr
B[i+1,j]<-Bij*exp(-R[i+1,j]*dt)
}
}
t<-seq(0,T,dt)
par(mfcol = c(3,3))
matplot(t, B[,1:pmin(20,npaths)], type="l", lty=1, main="Price of a Zero Coupon Bond", ylab="Price", xlab = "Time to Expiry")
Your example isn't reproducible, so I created some fake data that I hope is structured similarly to yours. If this isn't what you were looking for, let me know and I'll update as needed.
# Fake data
ntime <- 100
mat1 <- matrix(rnorm(ntime+1, 10, 2), nrow=1, ncol=ntime+1)
mat2 <- matrix(rnorm(ntime+1, 20, 2), nrow=1, ncol=ntime+1)
mat3 <- matrix(rnorm(ntime+1, 30, 2), nrow=1, ncol=ntime+1)
matplot(1:(ntime+1), t(mat1), type="l", lty=1, ylim=c(0, max(c(mat1,mat2,mat3))),
main="MC Price of a Zero Coupon Bond",
ylab="Price", xlab = "Option Exercise Date")
# Add lines for mat2 and mat3
lines(1:101, mat2, col="red")
lines(1:101, mat3, col="blue")
UPDATE: Is this what you're trying to do?
matplot(t, t(average.price.at.each.timestep), type="l", lty=1,
main="MC Price of a Zero Coupon Bond", ylab="Price",
xlab = "Option Exercise Date")
matlines(t, t(upper.c.l.at), lty=2, col="red")
matlines(t, t(lower.c.l.at), lty=2, col="green")
See the plot below. If you have multiple columns that you want to plot (as in your updated example, where you plot 20 separate paths) and you want to add lower and upper CIs for all of them (though this would make the plot unreadable), just use matrices of upper and lower CI values, with one column per path in average.price.at.each.timestep, and use matlines to add them to your existing plot of the multiple paths, as in the sketch below.
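For illustration, a minimal sketch, assuming B and t come from the question's simulation code; ci.upper and ci.lower are hypothetical matrices, here built with a crude per-timestep standard-error band just to show the mechanics:
npaths.shown <- 20
width <- 1.96 * apply(B, 1, sd) / sqrt(ncol(B))  # rough per-timestep CI half-width
ci.upper <- B[, 1:npaths.shown] + width  # width recycles down each column
ci.lower <- B[, 1:npaths.shown] - width
matplot(t, B[, 1:npaths.shown], type="l", lty=1, ylab="Price", xlab="Time to Expiry")
matlines(t, ci.upper, lty=2, col="red")
matlines(t, ci.lower, lty=2, col="green")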
This is doable using ggplot2 and reshape2. The structures you have are a little awkward, which you could improve by using a data frame instead of a matrix.
#Dummy data
average.price.at.each.timestep <- rnorm(1000, sd=0.01)
s.e.at.each.time <- rnorm(1000, sd=0.0005, mean=1)
#CIs (note you can vectorise this):
upper.c.l.at <- average.price.at.each.timestep+1.96*s.e.at.each.time
lower.c.l.at <- average.price.at.each.timestep-1.96*s.e.at.each.time
#create a data frame:
prices <- data.frame(time = 1:length(average.price.at.each.timestep),
                     price = average.price.at.each.timestep,
                     upperCI = upper.c.l.at,
                     lowerCI = lower.c.l.at)
library(reshape2)
#turn the data frame into time, variable, value triplets
prices.t <- melt(prices, id.vars=c("time"))
#plot
library(ggplot2)
ggplot(prices.t, aes(time, value, colour=variable)) + geom_line()
This produces the following plot:
This can be improved somewhat by using geom_ribbon instead:
ggplot(prices, aes(time, price)) + geom_ribbon(aes(ymin=lowerCI, ymax=upperCI), alpha=0.1) + geom_line()
Which produces this plot:
Here's another, slightly different ggplot solution that does not require you to calculate the confidence limits first - ggplot does it for you.
# create sample dataset
set.seed(1) # for reproducible example
B <- matrix(rnorm(1000,mean=rep(10+1:10/2,each=10)),nc=10)
library(ggplot2)
library(reshape2) # for melt(...)
gg <- melt(data.frame(date=1:nrow(B),B), id="date")
ggplot(gg, aes(x=date,y=value)) +
stat_summary(fun.y = mean, geom="line")+
stat_summary(fun.y = function(y)mean(y)-1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
stat_summary(fun.y = function(y)mean(y)+1.96*sd(y)/sqrt(length(y)), geom="line",linetype="dotted", color="blue")+
theme_bw()
stat_summary(...) summarizes the y-values for a given value of x (the date). So in the first call, it calculates the mean, in the second the lowerCL, and in the third the upperCL.
You could also create a CL(...) function, and call that:
CL <- function(x,level=0.95,type=c("lower","upper")) {
fact <- c(lower=-1,upper=1)
mean(x) - fact[type]*qnorm((1-level)/2)*sd(x)/sqrt(length(x))
}
ggplot(gg, aes(x=date,y=value)) +
stat_summary(fun.y = mean, geom="line")+
stat_summary(fun.y = CL, type="lower", geom="line",linetype="dotted", color="blue")+
stat_summary(fun.y = CL, type="upper", geom="line",linetype="dotted", color="blue")+
theme_bw()
This produces a plot identical to the one above.
Can someone point me in the right direction for making a plot like this one with ggplot2? Even just the function type would help.
I've been looking around in ggplot2 and can't find anything like this.
I'm assuming that the essential features of the plot are that:
a.) the x-axis is categorical, and
b.) the x-positions of the points are varied slightly,
c.) some summary statistic (I used medians).
If that's what you're looking for,
require(ggplot2)
require(plyr)
#define the data
lev <- gl(2, 10, 20, labels=c("I", "II"))
y <- runif(20)
df <- data.frame(lev, y)
#calculate the medians - I'm guessing that's what the horiz lines are?
meds <- ddply(df, .(lev), summarise, med = median(y))
ggplot(df, aes(x=lev, y=y, colour=lev)) +
geom_point(position="jitter") +
theme_bw() +
scale_colour_manual(values=c("red", "darkblue")) +
geom_errorbar(data=meds, aes(x=lev, y=med, ymin=med, ymax=med))
You can use annotate() to add the numbers and the little bracket if that is important.
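If it helps, here is a hedged sketch with annotate(); the bracket position and label text are made up for illustration:
ggplot(df, aes(x=lev, y=y, colour=lev)) +
  geom_point(position="jitter") +
  theme_bw() +
  scale_colour_manual(values=c("red", "darkblue")) +
  geom_errorbar(data=meds, aes(x=lev, y=med, ymin=med, ymax=med)) +
  annotate("segment", x = 1, xend = 2, y = 1.05, yend = 1.05) +  # the little bracket
  annotate("text", x = 1.5, y = 1.1, label = "n.s.")             # made-up label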