I am trying to plot lattice type data with ggplot2 and then superimpose a normal distribution over the sample data to illustrate how far off normal the underlying data is. I would like to have the normal dist on top to have the same mean and stdev as the panel.
here's an example:
library(ggplot2)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
#This works
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + facet_wrap(~State_CD)
print(pg)
That all works great and produces a nice three panel graph of the data. How do I add the normal dist on top? It seems I would use stat_function, but this fails:
#this fails
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + stat_function(fun=dnorm) + facet_wrap(~State_CD)
print(pg)
It appears that the stat_function is not getting along with the facet_wrap feature. How do I get these two to play nicely?
------------EDIT---------
I tried to integrate ideas from two of the answers below and I am still not there:
using a combination of both answers I can hack together this:
library(ggplot2)
library(plyr)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
DevMeanSt <- ddply(dd, c("State_CD"), function(df)mean(df$Predicted_value))
colnames(DevMeanSt) <- c("State_CD", "mean")
DevSdSt <- ddply(dd, c("State_CD"), function(df)sd(df$Predicted_value) )
colnames(DevSdSt) <- c("State_CD", "sd")
DevStatsSt <- merge(DevMeanSt, DevSdSt)
pg <- ggplot(dd, aes(x=Predicted_value))
pg <- pg + geom_density()
pg <- pg + stat_function(fun=dnorm, colour='red', args=list(mean=DevStatsSt$mean, sd=DevStatsSt$sd))
pg <- pg + facet_wrap(~State_CD)
print(pg)
which is really close... except something is wrong with the normal dist plotting:
what am I doing wrong here?
stat_function is designed to overlay the same function in every panel. (There's no obvious way to match up the parameters of the function with the different panels).
As Ian suggests, the best way is to generate the normal curves yourself, and plot them as a separate dataset (this is where you were going wrong before - merging just doesn't make sense for this example and if you look carefully you'll see that's why you're getting the strange sawtooth pattern).
Here's how I'd go about solving the problem:
library(ggplot2)
library(plyr)
dd <- data.frame(
predicted = rnorm(72, mean = 2, sd = 2),
state = rep(c("A", "B", "C"), each = 24)
)
grid <- with(dd, seq(min(predicted), max(predicted), length = 100))
normaldens <- ddply(dd, "state", function(df) {
data.frame(
predicted = grid,
density = dnorm(grid, mean(df$predicted), sd(df$predicted))
)
})
ggplot(dd, aes(predicted)) +
geom_density() +
geom_line(aes(y = density), data = normaldens, colour = "red") +
facet_wrap(~ state)
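For readers who would rather not depend on plyr, the same per-panel curves can be sketched with base R's split() and lapply(). This is only a sketch under the same assumptions as the answer above (columns named predicted and state). The helper name normal_curve is made up for illustration, and the plotting step runs only if ggplot2 is installed.

```r
set.seed(1)
dd <- data.frame(
  predicted = rnorm(72, mean = 2, sd = 2),
  state = rep(c("A", "B", "C"), each = 24)
)

# Common x grid spanning the observed range of the data
grid <- seq(min(dd$predicted), max(dd$predicted), length.out = 100)

# Hypothetical helper: normal density evaluated on the grid,
# using the mean and sd of a single group's data
normal_curve <- function(df) {
  data.frame(
    state = df$state[1],
    predicted = grid,
    density = dnorm(grid, mean(df$predicted), sd(df$predicted))
  )
}

# One curve per state, stacked into a single data frame
normaldens <- do.call(rbind, lapply(split(dd, dd$state), normal_curve))

# Plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dd, aes(predicted)) +
    geom_density() +
    geom_line(aes(y = density), data = normaldens, colour = "red") +
    facet_wrap(~ state)
  print(p)
}
```

Because normaldens carries its own state column, facet_wrap() sends each curve to the matching panel.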
Originally posted as an answer to this question, I was encouraged to share my solution here too.
I too became frustrated with overlaying theoretical densities over empirical data, so I wrote a function that automates this process. Since 2009, when this question was first posed, ggplot2 has greatly expanded its extensibility, so I've put it in an extension package on GitHub (EDIT: you can find it on CRAN now).
library(ggplot2)
library(ggh4x)
set.seed(0)
# Make the example data
dd <- data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),
c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
ggplot(dd, aes(Predicted_value)) +
geom_density() +
stat_theodensity(colour = "red") +
facet_wrap(~ State_CD)
Created on 2021-01-28 by the reprex package (v0.3.0)
If you are willing to use ggformula, then this is pretty easy. (It is also possible to mix and match and use ggformula just for the distribution overlay, but I'll illustrate the full on ggformula approach.)
library(ggformula)
theme_set(theme_bw())
gf_dens( ~ Sepal.Length | Species, data = iris) %>%
gf_fitdistr(color = "red") %>%
gf_fitdistr(dist = "gamma", color = "blue")
Created on 2019-01-15 by the reprex package (v0.2.1)
I think you need to provide more information. This seems to work:
pg <- ggplot(dd, aes(Predicted_value)) ## need aesthetics in the ggplot
pg <- pg + geom_density()
## gotta provide the arguments of the dnorm
pg <- pg + stat_function(fun=dnorm, colour='red',
args=list(mean=mean(dd$Predicted_value), sd=sd(dd$Predicted_value)))
## wrap it!
pg <- pg + facet_wrap(~State_CD)
pg
We are providing the same mean and sd parameter for every panel. Getting panel specific means and standard deviations is left as an exercise to the reader* ;)
'*' In other words, not sure how it can be done...
If you don't want to generate the normal distribution line-graph "by hand", still use stat_function, and show graphs side-by-side -- then you could consider using the "multiplot" function published on "Cookbook for R" as an alternative to facet_wrap. You can copy the multiplot code to your project from here.
After you copy the code, do the following:
# Some fake data (copied from hadley's answer)
dd <- data.frame(
predicted = rnorm(72, mean = 2, sd = 2),
state = rep(c("A", "B", "C"), each = 24)
)
# Split the data by state, apply a function on each member that converts it into a
# plot object, and return the result as a vector.
plots <- lapply(split(dd,dd$state),FUN=function(state_slice){
# The code here is the plot code generation. You can do anything you would
# normally do for a single plot, such as calling stat_function, and you do this
# one slice at a time.
ggplot(state_slice, aes(predicted)) +
geom_density() +
stat_function(fun=dnorm,
args=list(mean=mean(state_slice$predicted),
sd=sd(state_slice$predicted)),
color="red")
})
# Finally, present the plots on 3 columns.
multiplot(plotlist = plots, cols=3)
I think your best bet is to draw the line manually with geom_line.
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
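dd$State_CD <- factor(dd$State_CD) #as.numeric() and levels() below need a factor (strings are not converted automatically in R >= 4.0)
dd$Predicted_value<-dd$Predicted_value*as.numeric(dd$State_CD) #make different by state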
##Calculate means and standard deviations by level
means<-as.numeric(by(dd[,2],dd$State_CD,mean))
sds<-as.numeric(by(dd[,2],dd$State_CD,sd))
##Create evenly spaced evaluation points +/- 3 standard deviations away from the mean
dd$vals<-0
for(i in 1:length(levels(dd$State_CD))){
dd$vals[dd$State_CD==levels(dd$State_CD)[i]]<-seq(from=means[i]-3*sds[i],
to=means[i]+3*sds[i],
length.out=sum(dd$State_CD==levels(dd$State_CD)[i]))
}
##Create normal density points
dd$norm<-with(dd,dnorm(vals,means[as.numeric(State_CD)],
sds[as.numeric(State_CD)]))
pg <- ggplot(dd, aes(Predicted_value))
pg <- pg + geom_density()
pg <- pg + geom_line(aes(x=vals,y=norm),colour="red") #Add in normal distribution
pg <- pg + facet_wrap(~State_CD,scales="free")
pg
Related
I can get the single power curve shown below, but I want to create a power analysis graph. I want to vary my delta value (to .6, .7, and .8) and plot those 3 additional lines on the same graph, each in a different color. I provided an example of roughly what I want it to look like.
n_participants <- c(5, 10, 20, 30, 40)
npercluster <- 20
n_tot <- n_participants*npercluster
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot / deff
mydelt <- 0.5
mypowers <- power.t.test(n=ess, delta=mydelt)$power
plot(n_participants, mypowers, type='l',
main=paste('Power based on', npercluster, 'volumes per participants'),
xlab='Number of participants', ylim=c( 0, 1),
ylab='Power')
If you are planning to use R a lot I would recommend investing in learning ggplot2. Base R plotting solutions get very limited very quickly.
To solve your problem I would make a data frame with every combination of effect size and sample size.
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), ess=n_tot / deff)
Then add a column for the power:
dat$mypowers = power.t.test(n=dat$ess, delta=dat$mydelt)$power
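The single call above relies on power.t.test() accepting vectors for n and delta and working element by element. One way to convince yourself of that, sketched here with the same numbers as the question, is to compare it against a row-by-row computation via mapply(). The names vectorised and row_by_row are just illustrative.

```r
n_participants <- c(5, 10, 20, 30, 40)
npercluster <- 20
icc <- 0.6                                  # assumption carried over from the question
deff <- 1 + icc * (npercluster - 1)         # design effect: 1 + 0.6 * 19 = 12.4
ess <- n_participants * npercluster / deff  # effective sample sizes

# Every combination of effect size and effective sample size
dat <- expand.grid(mydelt = c(0.5, 0.6, 0.7, 0.8), ess = ess)

# One vectorised call ...
vectorised <- power.t.test(n = dat$ess, delta = dat$mydelt)$power

# ... versus one call per row
row_by_row <- mapply(
  function(n, d) power.t.test(n = n, delta = d)$power,
  dat$ess, dat$mydelt
)

stopifnot(isTRUE(all.equal(vectorised, row_by_row)))
```

If the stopifnot() passes silently, the two approaches agree and the shorter vectorised form is safe to use.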
Then I can use ggplot to easily make a nice graph of the power curves:
library(ggplot2)
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) + geom_point() + geom_line()
You can easily change the overall graph look and add appropriate labels:
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) +
geom_point() +
geom_line() +
theme_bw() +
labs(x="Effective sample size", y="Power", color="Effect size" )
In response to the comment: there was a mistake in the code above, in that I plotted the effective sample size on the x axis rather than the number of participants. So instead we should make sure we have n_participants in the dataset for plotting, then calculate the powers and plot.
So the whole script is now:
n_participants <- 5:40
npercluster <- 20
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), npart=n_participants)
dat$n_tot <- dat$npart*npercluster
dat$ess <- dat$n_tot / deff
dat$mypowers <- power.t.test(n=dat$ess, delta=dat$mydelt)$power
library(ggplot2)
ggplot(dat, aes(x=npart, y=mypowers, color=factor(mydelt))) +
geom_line()+
theme_bw() +
labs(x="Number of participants", y="Power", color="Effect size" )
Which gives this graph:
You may put the logic in a function f, sapply over desired deltas and - as also suggested in comments - use matplot without having to bother with any new packages.
f <- \(mydelt=.5, n_participants=c(5, 10, 20, 30, 40), npercluster=20, icc=.6) {
n_tot <- n_participants*npercluster
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot/deff
power.t.test(n=ess, delta=mydelt)$power
}
deltas <- seq(.5, .8, .1)
n_participants <- c(5, 10, 20, 30, 40)
res <- sapply(deltas, f, n_participants=n_participants)  # one column per delta
matplot(n_participants, res, type='l', main='Power based on 20 volumes per participant',
xlab='Number of participants',
ylab='Power')
legend('topleft', legend=deltas, col=seq_along(deltas), lty=seq_along(deltas),
title='delta', cex=.8)
It's also possible to pipe the result directly into matplot:
sapply(deltas, f, n_participants=n_participants) |>
matplot(x=n_participants, type='l')
See ?matplot for easy customizing of colors, linetypes etc.
Note: R >= 4.1 used.
I am curious whether it is possible to add a legend, or otherwise mark which species each circle corresponds to, in the observed-expected plot. I am using an example dataset called finches. The package is called "cooccur", and it creates a ggplot object. How can I edit this object to put species labels on the plot?
An alternative is to extract the labels and co-occurrences and use base graphics, but this is not ideal.
CODE SNIPPET BELOW
library(devtools)
#install_github("griffithdan/cooccur")
library(cooccur)
options(stringsAsFactors = FALSE)
data(finches)
cooccur.finches <- cooccur(mat=finches,
type="spp_site",
thresh=TRUE,
spp_names=TRUE)
summary(cooccur.finches)
plot(cooccur.finches)
p <- obs.v.exp(cooccur.finches)
# the ggplot2 object can be edited directly and then replotted
p
# alternatively, use base graphics, This is what I am currently doing but it is not correct
cooc.exp <- cooccur.finches$results$exp_cooccur
cooc.obs <- cooccur.finches$results$obs_cooccur
sp1 <- cooccur.finches$results$sp1_name
sp2 <- cooccur.finches$results$sp2_name
plot(cooc.obs ~ cooc.exp)
text(x = cooc.exp[1], y = cooc.obs[1], labels = sp1[1]) # plots only one name
I installed cooccur_1.3, and running your code gives this plot:
library(cooccur)
options(stringsAsFactors = FALSE)
data(finches)
cooccur.finches <- cooccur(mat=finches,
type="spp_site",
thresh=TRUE,
spp_names=TRUE)
plot(cooccur.finches)
Anyway, if you want to get a scatter plot, you can go to the dataframe and do a ggplot, below I only label the points where species 1 is Geospiza magnirostris, otherwise 80 points to label is quite insane:
library(ggrepel)
library(ggplot2)
df = cooccur.finches$results
df$type = "random"
df$type[df$p_lt<0.05] = "negative"
df$type[df$p_gt<0.05] = "positive"
ggplot(df,aes(x=exp_cooccur,y=obs_cooccur)) +
geom_point(aes(color=type)) + geom_abline(linetype="dashed") +
geom_label_repel(data=subset(df,sp1_name=="Geospiza magnirostris"),
aes(label=paste(sp1_name,sp2_name,sep="\n")),
size=2,nudge_x=-1,nudge_y=-1) +
scale_color_manual(values=c("#FFCC66","light blue","dark gray")) +
theme_bw()
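If base graphics are preferred, note that text() is vectorised, so passing whole columns labels every point in one call (the attempt in the question indexed only the first element). The sketch below uses a small made-up data frame in place of cooccur.finches$results so that it runs without the package. The column names follow the ones used above, and the species names and values are invented.

```r
# Made-up stand-in for cooccur.finches$results
res <- data.frame(
  sp1_name = c("Species A", "Species A", "Species B"),
  sp2_name = c("Species B", "Species C", "Species C"),
  exp_cooccur = c(2.1, 4.8, 7.5),
  obs_cooccur = c(1, 6, 7)
)

# One label per species pair
pair_labels <- paste(res$sp1_name, res$sp2_name, sep = " / ")

plot(obs_cooccur ~ exp_cooccur, data = res,
     xlab = "Expected co-occurrences", ylab = "Observed co-occurrences")
abline(0, 1, lty = 2)  # observed == expected reference line

# Vectorised call: labels all points, placed above each one
text(res$exp_cooccur, res$obs_cooccur, labels = pair_labels,
     pos = 3, cex = 0.7, xpd = NA)
```

With many pairs the labels will overlap, which is why the ggrepel approach above labels only a subset.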
I often find the following type of scatter + histogram + correlation plot very useful for understanding the nature of my data before getting into the analysis.
Can anyone help me to generate this kind of combined plot in R?
Example
If you don't care how exactly the plot looks, the GGally package offers a ready-made solution for the kind of plot you indicated.
library(ggplot2)
library(GGally)
# build a data frame of random values
d <- data.frame(a = rnorm(1000, 10))
d$b = rnorm(1000, 10)
d$c = rnorm(1000, 10)
d$d = d$b + rnorm(1000, sd=.1)
# prepare plot
p <- ggpairs(d, diag=list(continuous='barDiag')) # 'bar' in old GGally versions
# show plot
print(p)
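If adding a package is not an option, a similar layout can be sketched with base R's pairs() and custom panel functions, along the lines of the examples in ?pairs. The function names panel_hist and panel_cor are just illustrative, and the data are random values like those above.

```r
set.seed(42)
d <- data.frame(a = rnorm(200, 10), b = rnorm(200, 10))
d$c <- d$b + rnorm(200, sd = 0.1)  # strongly correlated with b

# Diagonal panels: histogram of each variable
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))  # restore the panel's coordinates afterwards
  h <- hist(x, plot = FALSE)
  par(usr = c(usr[1:2], 0, 1.5 * max(h$counts)))
  rect(head(h$breaks, -1), 0, tail(h$breaks, -1), h$counts, col = "grey80")
}

# Upper panels: Pearson correlation printed as text
panel_cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)), cex = 1.5)
}

# Lower panels keep the default scatter plots
pairs(d, diag.panel = panel_hist, upper.panel = panel_cor)
```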
I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(y1 = rnorm(100), x = rnorm(100), z = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is good in itself, but it does not let you control colors as easily as you would like. To get the desired colors, you need a variable in your dataset that color-codes the values, and you need to map that variable to color in the aes() of the layer that draws the dots with CE. Apparently, this is impossible to do with the output of the coefplot.lm function. You may be able to change the colors using ggplot2's ggplot_build() function, but I would say it's easier to write your own function for this task.
I've done this once to plot odds. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot. First, we extract coefficients from a model object and prepare the data set for plotting; second, actually plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
tmp<-data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
odds<-tmp[-1,]
names(odds)<-c('OR', 'lower', 'upper')
odds$vars<-row.names(odds)
odds$col<-odds$OR>1
odds$col[odds$col==TRUE] <-'blue'
odds$col[odds$col==FALSE] <-'red'
odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
require(ggplot2)
p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper),asp=asp) +
geom_errorbar(aes(color=col),width=0.1) +
geom_point(aes(color=col),size=3)+
geom_hline(yintercept = 1, linetype=2) +
scale_color_manual('Effect', labels=c('Positive','Negative'),
values=c('blue','red'))+
coord_flip() +
theme_bw() +
theme(legend.position="none",aspect.ratio = asp)+
ylab(xlab) +
xlab(ylab) #switch because of the coord_flip() above
return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can change it quite a lot.
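The functions above work on the odds-ratio scale. If you want raw coefficients coloured by sign, as the question asks, the same idea can be sketched more directly. This is a minimal sketch, not what coefplot does internally. The data frame cf and the column name direction are made up for illustration, and the plot is drawn only if ggplot2 is installed.

```r
set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)

# Coefficients and 95% confidence intervals in one data frame
ci <- confint(mod1)
cf <- data.frame(
  term = names(coef(mod1)),
  estimate = unname(coef(mod1)),
  lower = ci[, 1],
  upper = ci[, 2]
)
cf <- cf[cf$term != "(Intercept)", ]

# Variable that colour-codes the sign of the point estimate
cf$direction <- ifelse(cf$estimate < 0, "Negative", "Positive")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cf, aes(x = estimate, y = term, colour = direction)) +
    geom_vline(xintercept = 0, linetype = 2) +
    geom_segment(aes(x = lower, xend = upper, yend = term)) +
    geom_point(size = 3) +
    scale_colour_manual(values = c(Negative = "red", Positive = "blue")) +
    theme_bw()
  print(p)
}
```

Naming the values in scale_colour_manual() ties each colour to its label, so the mapping does not depend on which signs happen to occur in the model.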