R customizing scatter plot - plot

I often find the following type of scatter + histogram + correlation plot very useful for understanding the nature of my data before getting into the analysis.
Can anyone help me generate this kind of combined plot in R?
Example

If you don't care how exactly the plot looks, the GGally package offers a ready-made solution for the kind of plot you indicated.
library(ggplot2)
library(GGally)
# build a data frame of random values
d <- data.frame(a = rnorm(1000, 10))
d$b = rnorm(1000, 10)
d$c = rnorm(1000, 10)
d$d = d$b + rnorm(1000, sd=.1)
# prepare plot
# current GGally versions name the histogram panel 'barDiag' (older releases accepted 'bar')
p <- ggpairs(d, diag=list(continuous='barDiag'))
# show plot
print(p)
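If you would rather stay in base graphics, pairs() can give a similar layout. This is only a minimal sketch, following the pattern shown in the pairs() help page; panel_hist and panel_cor are helper names made up for this example, and the data mimic the toy data frame above.

```r
# Toy data as in the answer above: column d is nearly a copy of column b
set.seed(1)
d <- data.frame(a = rnorm(1000, 10), b = rnorm(1000, 10), c = rnorm(1000, 10))
d$d <- d$b + rnorm(1000, sd = 0.1)

# Diagonal panel: histogram rescaled so the tallest bar has height 1
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  n <- length(h$breaks)
  rect(h$breaks[-n], 0, h$breaks[-1], h$counts / max(h$counts), col = "grey80")
}

# Upper panel: Pearson correlation printed as text
panel_cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, format(cor(x, y), digits = digits), cex = 1.5)
}

# Scatter plots below the diagonal, histograms on it, correlations above it
pairs(d, diag.panel = panel_hist, upper.panel = panel_cor, lower.panel = points)
```

The look is plainer than the GGally output, but every panel is an ordinary function, so each one can be changed independently.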

Related

R ggplot: series of frequency histogram with normal density line [duplicate]

I am trying to plot lattice type data with ggplot2 and then superimpose a normal distribution over the sample data to illustrate how far off normal the underlying data is. I would like to have the normal dist on top to have the same mean and stdev as the panel.
here's an example:
library(ggplot2)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
#This works
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + facet_wrap(~State_CD)
print(pg)
That all works great and produces a nice three panel graph of the data. How do I add the normal dist on top? It seems I would use stat_function, but this fails:
#this fails
pg <- ggplot(dd) + geom_density(aes(x=Predicted_value)) + stat_function(fun=dnorm) + facet_wrap(~State_CD)
print(pg)
It appears that the stat_function is not getting along with the facet_wrap feature. How do I get these two to play nicely?
------------EDIT---------
I tried to integrate ideas from two of the answers below and I am still not there:
using a combination of both answers I can hack together this:
library(ggplot2)
library(plyr)
#make some example data
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
DevMeanSt <- ddply(dd, c("State_CD"), function(df)mean(df$Predicted_value))
colnames(DevMeanSt) <- c("State_CD", "mean")
DevSdSt <- ddply(dd, c("State_CD"), function(df)sd(df$Predicted_value) )
colnames(DevSdSt) <- c("State_CD", "sd")
DevStatsSt <- merge(DevMeanSt, DevSdSt)
pg <- ggplot(dd, aes(x=Predicted_value))
pg <- pg + geom_density()
pg <- pg + stat_function(fun=dnorm, colour='red', args=list(mean=DevStatsSt$mean, sd=DevStatsSt$sd))
pg <- pg + facet_wrap(~State_CD)
print(pg)
which is really close... except something is wrong with the normal dist plotting:
what am I doing wrong here?
stat_function is designed to overlay the same function in every panel. (There's no obvious way to match up the parameters of the function with the different panels).
As Ian suggests, the best way is to generate the normal curves yourself, and plot them as a separate dataset (this is where you were going wrong before - merging just doesn't make sense for this example and if you look carefully you'll see that's why you're getting the strange sawtooth pattern).
Here's how I'd go about solving the problem:
library(ggplot2)
library(plyr)  # for ddply()
dd <- data.frame(
  predicted = rnorm(72, mean = 2, sd = 2),
  state = rep(c("A", "B", "C"), each = 24)
)
# common grid of x values at which every normal curve is evaluated
grid <- with(dd, seq(min(predicted), max(predicted), length.out = 100))
# one normal curve per state, using that state's own mean and sd
normaldens <- ddply(dd, "state", function(df) {
  data.frame(
    predicted = grid,
    density = dnorm(grid, mean(df$predicted), sd(df$predicted))
  )
})
ggplot(dd, aes(predicted)) +
  geom_density() +
  geom_line(aes(y = density), data = normaldens, colour = "red") +
  facet_wrap(~ state)
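If you would rather not depend on plyr, the same per-state curves can be built with base R. This is a sketch of the same idea (precompute the curves, then draw them with geom_line), not a different method; the seed and the explicit state column are choices made for this example.

```r
library(ggplot2)

set.seed(42)
dd <- data.frame(
  predicted = rnorm(72, mean = 2, sd = 2),
  state = rep(c("A", "B", "C"), each = 24)
)

# Common grid of x values shared by all panels
grid <- seq(min(dd$predicted), max(dd$predicted), length.out = 100)

# One data frame of normal-density values per state, stacked into one
normaldens <- do.call(rbind, lapply(split(dd, dd$state), function(df) {
  data.frame(
    state = df$state[1],  # facet_wrap() needs this column to route each curve to its panel
    predicted = grid,
    density = dnorm(grid, mean(df$predicted), sd(df$predicted))
  )
}))

p <- ggplot(dd, aes(predicted)) +
  geom_density() +
  geom_line(aes(y = density), data = normaldens, colour = "red") +
  facet_wrap(~ state)
print(p)
```

The key point is that normaldens carries the faceting variable, so each red curve lands in the panel whose mean and standard deviation it was computed from.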
Originally posted as an answer to this question, I was encouraged to share my solution here too.
I too became frustrated with overlaying theoretical densities over empirical data, so I wrote a function that automates this process. Since 2009, when this question was first posed, ggplot2 has greatly expanded its extensibility, so I've put the function in an extension package on GitHub (EDIT: you can find it on CRAN now).
library(ggplot2)
library(ggh4x)
set.seed(0)
# Make the example data
dd <- data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),
c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
ggplot(dd, aes(Predicted_value)) +
geom_density() +
stat_theodensity(colour = "red") +
facet_wrap(~ State_CD)
Created on 2021-01-28 by the reprex package (v0.3.0)
If you are willing to use ggformula, then this is pretty easy. (It is also possible to mix and match and use ggformula just for the distribution overlay, but I'll illustrate the full on ggformula approach.)
library(ggformula)
theme_set(theme_bw())
gf_dens( ~ Sepal.Length | Species, data = iris) %>%
gf_fitdistr(color = "red") %>%
gf_fitdistr(dist = "gamma", color = "blue")
Created on 2019-01-15 by the reprex package (v0.2.1)
I think you need to provide more information. This seems to work:
pg <- ggplot(dd, aes(Predicted_value)) ## need aesthetics in the ggplot
pg <- pg + geom_density()
## gotta provide the arguments of the dnorm
pg <- pg + stat_function(fun=dnorm, colour='red',
args=list(mean=mean(dd$Predicted_value), sd=sd(dd$Predicted_value)))
## wrap it!
pg <- pg + facet_wrap(~State_CD)
pg
We are providing the same mean and sd parameter for every panel. Getting panel specific means and standard deviations is left as an exercise to the reader* ;)
'*' In other words, not sure how it can be done...
If you don't want to generate the normal distribution line-graph "by hand", still use stat_function, and show graphs side-by-side -- then you could consider using the "multiplot" function published on "Cookbook for R" as an alternative to facet_wrap. You can copy the multiplot code to your project from here.
After you copy the code, do the following:
# Some fake data (copied from hadley's answer)
dd <- data.frame(
predicted = rnorm(72, mean = 2, sd = 2),
state = rep(c("A", "B", "C"), each = 24)
)
# Split the data by state, apply a function to each slice that converts it into a
# plot object, and return the result as a list.
plots <- lapply(split(dd,dd$state),FUN=function(state_slice){
  # The code here is the plot code generation. You can do anything you would
  # normally do for a single plot, such as calling stat_function, and you do this
  # one slice at a time.
  ggplot(state_slice, aes(predicted)) +
    geom_density() +
    stat_function(fun=dnorm,
                  args=list(mean=mean(state_slice$predicted),
                            sd=sd(state_slice$predicted)),
                  color="red")
})
# Finally, present the plots in 3 columns.
multiplot(plotlist = plots, cols=3)
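If copying multiplot() into your project is inconvenient, the gridExtra package can arrange the same list of plots. This is a sketch that assumes gridExtra is installed; it swaps in grid.arrange() for multiplot() and adds a title per plot, since the facet strips are gone.

```r
library(ggplot2)
library(gridExtra)

set.seed(42)
dd <- data.frame(
  predicted = rnorm(72, mean = 2, sd = 2),
  state = rep(c("A", "B", "C"), each = 24)
)

# One complete plot per state, each with its own fitted normal curve
plots <- lapply(split(dd, dd$state), function(state_slice) {
  ggplot(state_slice, aes(predicted)) +
    geom_density() +
    stat_function(fun = dnorm,
                  args = list(mean = mean(state_slice$predicted),
                              sd = sd(state_slice$predicted)),
                  colour = "red") +
    ggtitle(state_slice$state[1])
})

# Lay the three plots out side by side
grid.arrange(grobs = plots, ncol = 3)
```

Unlike facet_wrap(), each plot gets its own axis ranges here, so the panels are not directly comparable unless you set common limits yourself.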
I think your best bet is to draw the line manually with geom_line.
dd<-data.frame(matrix(rnorm(144, mean=2, sd=2),72,2),c(rep("A",24),rep("B",24),rep("C",24)))
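colnames(dd) <- c("x_value", "Predicted_value", "State_CD")
dd$State_CD <- factor(dd$State_CD) #needed in R >= 4.0, where data.frame() no longer converts strings to factors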
dd$Predicted_value<-dd$Predicted_value*as.numeric(dd$State_CD) #make different by state
##Calculate means and standard deviations by level
means<-as.numeric(by(dd[,2],dd$State_CD,mean))
sds<-as.numeric(by(dd[,2],dd$State_CD,sd))
##Create evenly spaced evaluation points +/- 3 standard deviations away from the mean
dd$vals<-0
for(i in 1:length(levels(dd$State_CD))){
dd$vals[dd$State_CD==levels(dd$State_CD)[i]]<-seq(from=means[i]-3*sds[i],
to=means[i]+3*sds[i],
length.out=sum(dd$State_CD==levels(dd$State_CD)[i]))
}
##Create normal density points
dd$norm<-with(dd,dnorm(vals,means[as.numeric(State_CD)],
sds[as.numeric(State_CD)]))
pg <- ggplot(dd, aes(Predicted_value))
pg <- pg + geom_density()
pg <- pg + geom_line(aes(x=vals,y=norm),colour="red") #Add in normal distribution
pg <- pg + facet_wrap(~State_CD,scales="free")
pg

Change colors of select lines in ggplot2 coefficient plot in R

I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(x = rnorm(100), z = rnorm(100), y1 = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is good in itself, but it does not let you control colors as easily as you would like. To get the desired colors, the data set being plotted needs a variable that color-codes the values, and that variable has to be mapped to color inside aes() in the layer that draws the points for the coefficient estimates (CE). Apparently this is not possible with the output of the coefplot.lm function. You might be able to change the colors through ggplot2's ggplot_build() function, but I would say it is easier to write your own function for this task.
I've done this once to plot odds. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot. First, we extract coefficients from a model object and prepare the data set for plotting; second, actually plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
  tmp<-data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
  odds<-tmp[-1,]
  names(odds)<-c('OR', 'lower', 'upper')
  odds$vars<-row.names(odds)
  odds$col<-odds$OR>1
  odds$col[odds$col==TRUE] <-'blue'
  odds$col[odds$col==FALSE] <-'red'
  odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
  return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
  require(ggplot2)
  p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper),asp=asp) +
    geom_errorbar(aes(color=col),width=0.1) +
    geom_point(aes(color=col),size=3)+
    geom_hline(yintercept = 1, linetype=2) +
    scale_color_manual('Effect', labels=c('Positive','Negative'),
                       values=c('blue','red'))+
    coord_flip() +
    theme_bw() +
    theme(legend.position="none",aspect.ratio = asp)+
    ylab(xlab) +
    xlab(ylab) #switch because of the coord_flip() above
  return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can customize it quite a lot.
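If you want the raw coefficients rather than odds ratios, the same idea (a column that encodes the sign, mapped to colour) can be sketched with plain ggplot2. The column names term, estimate, lower, upper and sign are made up for this example, and confint() is used here for the intervals, which is an assumption about what you want drawn.

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)

# Collect estimates and 95% confidence intervals in one data frame
ci <- confint(mod1)
coefs <- data.frame(
  term = names(coef(mod1)),
  estimate = unname(coef(mod1)),
  lower = ci[, 1],
  upper = ci[, 2]
)

# The colour-coding variable: sign of the point estimate
coefs$sign <- ifelse(coefs$estimate < 0, "negative", "positive")

p <- ggplot(coefs, aes(x = estimate, y = term, colour = sign)) +
  geom_vline(xintercept = 0, linetype = 2) +
  geom_segment(aes(x = lower, xend = upper, yend = term)) +
  geom_point(size = 3) +
  scale_colour_manual(values = c(negative = "red", positive = "blue"))
print(p)
```

Because the colours are given as a named vector, each sign keeps its colour even when only one of the two occurs in a model.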

Circular density plot using ggplot2

I'm working with circular data and I wanted to reproduce this kind of plot using ggplot2:
library(circular)
data1 <- rvonmises(1000, circular(0), 10, control.circular=list(units="radians")) ## sample
quantile.circular(data1,c(0.05,.95)) ## for interval
data2 <- mean(data1)
dens <- density(data1, bw=27)
p<-plot(dens, points.plot=TRUE, xlim=c(-1,2.1),ylim=c(-1.0,1.2),
main="Circular Density", ylab="", xlab="")
points(circular(0), plot.info=p, col="blue",type="o")
arrows.circular(c(5.7683795,0.5151433 )) ## confidence interval
arrows.circular(data2, lwd=3) ## circular mean
The thinnest arrows are the extremes of my interval
I suppose the blue point is the forecast
The third arrow is the circular mean
I need the circular density
I've been looking for something similar, but I did not find anything.
Any suggestions?
Thanks
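To avoid running in the wrong direction, could you quickly check whether this code goes in the right direction? The arrows can be added easily afterwards, for example with + geom_segment(..., arrow = arrow(...)); in older ggplot2 versions arrow() requires loading the grid package first.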
EDIT: One remark to the complicated way of attaching density values - ggplot's geom_density does not seem to like coord_polar (at least the way I tried it).
#create some dummy radial data and wrap it in a dataframe
d1<-runif(100,min=0,max=120)
df = NULL
df$d1 <- d1
df <- as.data.frame(df)
#estimate kernel density and then derive an approximate function to attach density values to the radial values in the dataframe
data_density <- density(d1)
density_function <- with(data_density, approxfun(x, y, rule=1))
df$density <- density_function(df$d1)
#order dataframe to facilitate geom_line in polar coordinates
df <- df[order(df$density,df$d1),]
#ggplot object
require(ggplot2)
g = ggplot(df,aes(x=d1,y=density))
#Radial observations on unit circle
g = g + geom_point(aes(x=d1,y=min(df$density)))
#Density function
g = g + geom_line()
g = g + ylim(0,max(df$density))
g = g + xlim(0,360)
#polar coordinates
g = g + coord_polar()
g
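One limitation of the code above is that an ordinary kernel density does not wrap around at 0/360 degrees. A rough workaround, sketched here under the assumption that the angles are in degrees, is to copy the sample one full turn to either side before estimating the density. This is a data-replication trick, not the estimator used by the circular package, and the names wrapped, kde and circ are made up for the example.

```r
library(ggplot2)

set.seed(1)
angles <- runif(100, min = 0, max = 120)  # degrees, as in the answer above

# Copy the sample one turn to either side so mass near 0 leaks over to 360 and back
wrapped <- c(angles - 360, angles, angles + 360)

# Use the bandwidth of the original sample; the default would be inflated by the copies
kde <- density(wrapped, bw = bw.nrd0(angles), from = 0, to = 360, n = 361)

# Multiply by 3 to undo the dilution caused by tripling the sample
circ <- data.frame(angle = kde$x, density = 3 * kde$y)

p <- ggplot(circ, aes(angle, density)) +
  geom_line() +
  # raw observations drawn at density 0, i.e. at the centre end of the radial axis
  geom_point(data = data.frame(angle = angles, density = 0), alpha = 0.4) +
  scale_x_continuous(limits = c(0, 360), breaks = seq(0, 315, by = 45)) +
  coord_polar()
print(p)
```

With this construction the curve has (almost) the same height at 0 and at 360 degrees, so it closes on itself in polar coordinates.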
Uniform random variables sampled from (0,120):

Plotting CCDF of walking durations

I have plotted the CCDF as mentioned in question part of the maximum plot points in R? post to get a plot(image1) with this code:
ccdf<-function(duration,density=FALSE)
{
freqs = table(duration)
X = rev(as.numeric(names(freqs)))
Y =cumsum(rev(as.list(freqs)));
data.frame(x=X,count=Y)
}
qplot(x,count,data=ccdf(duration),log='xy')
Now, on the basis of answer by teucer on Howto Plot “Reverse” Cumulative Frequency Graph With ECDF I tried to plot a CCDF using the commands below:
f <- ecdf(duration)
plot(1-f(duration),duration)
I got a plot like image2.
I also read in the comments on one of the answers to Plotting CDF of a dataset in R? that the CCDF is nothing but 1 - ECDF.
I am totally confused about how to get the CCDF of my data.
Image1
Image2
Generate some data and find the ecdf function.
x <- rlnorm(1e5, 5)
ecdf_x <- ecdf(x)
Generate vector at regular intervals over range of x. (EDIT: you want them evenly spaced on a log scale in this case; if you have negative values, then use sample over a linear scale.)
xx <- seq(min(x), max(x), length.out = 1e4)
#or
log_x <- log(x)
xx <- exp(seq(min(log_x), max(log_x), length.out = 1e3))
Create data with x and y coordinates for plot.
dfr <- data.frame(
x = xx,
ecdf = ecdf_x(xx),
ccdf = 1 - ecdf_x(xx)
)
Draw plot.
p_ccdf <- ggplot(dfr, aes(x, ccdf)) +
geom_line() +
scale_x_log10()
p_ccdf
(Also take a look at aes(x, ecdf).)
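For comparison, here is a base-graphics sketch that evaluates the CCDF exactly at the observed values instead of on a grid, with the values on the x axis and 1 - ECDF on the y axis. The names xs, ccdf_df and ccdf_pos are made up for this example, and the log-normal sample only stands in for real durations.

```r
set.seed(1)
x <- rlnorm(1e4, 5)
ecdf_x <- ecdf(x)

# Evaluate at the sorted distinct observations: these are the only points where the ECDF jumps
xs <- sort(unique(x))
ccdf_df <- data.frame(x = xs, ccdf = 1 - ecdf_x(xs))

# The largest observation has CCDF 0, which cannot be drawn on a log axis, so drop it
ccdf_pos <- ccdf_df[ccdf_df$ccdf > 0, ]

plot(ccdf_pos$x, ccdf_pos$ccdf, log = "xy", type = "s",
     xlab = "x", ylab = "P(X > x)")
```

Swapping the two arguments of plot() puts the probabilities on the x axis instead, which gives a very different looking picture from the same numbers.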
I used ggplot to get the desired CCDF plot of my data, as shown below:
ecdf_x <- ecdf(x)
dfr <- data.frame(x = x,
                  ecdf = ecdf_x(x),
                  ccdf = 1 - ecdf_x(x))
p_ccdf <- ggplot(dfr, aes(x, ccdf)) + geom_line() + scale_x_log10()
p_ccdf
Sorry for posting it so late.
Thank you all!
