r : ecdf over histogram - r

in R, with ecdf I can plot a empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How I can plot the histogram and the ecdf in the same plot?
EDIT
I try make something like that
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf

Also a bit late, here's another solution that extends #Christoph 's Solution with a second y-Axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: You don't add a line to your plot, but plot another plot on top, that's why we need par(new = T). Then you have to add the y-axis later on (otherwise it will be plotted over the y-axis on the left).
Credits go here (#tim_yates Answer) and there.

There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly #hadley 's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_bar(aes(y = (..count..)/sum(..count..)))
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')

you can try a ggplot approach with a second axis
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong I added a second solution, calculatin everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))

As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
to get

Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
lines(x = c(which.min(abs(ec(h$mids) - 0.9)), which.min(abs(ec(h$mids) - 0.9))), # indicates where 90% is reached
y = c(0, max(h$counts)), col ='black', lty = 3)
(Only the second y-axis is not working yet...)

In addition to previous answers, I wanted to have ggplot do the tedious calculation (in contrast to #Roman's solution, which was kindly enough updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (pseudo code):
# 1. Prepare the plot
plot <- ggplot() + geom_hist(...)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggpplot. Just lay back and let it do everything for you!

Related

Is there a way to use R to break chart axis and break linear regression line?

I'm trying to figure out how to modify a scatter-plot that contains two groups of data along a continuum separated by a large gap. The graph needs a break on the x-axis as well as on the regression line.
This R code using the ggplot2 library accurately presents the data, but is unsightly due to the vast amount of empty space on the graph. Pearson's correlation is -0.1380438.
library(ggplot2)
p <- ggplot(, aes(x = dis, y = result[, 1])) + geom_point(shape = 1) +
xlab("X-axis") +
ylab("Y-axis") + geom_smooth(color = "red", method = "lm", se = F) + theme_classic()
p + theme(plot.title = element_text(hjust = 0.5, size = 14))
This R code uses gap.plot to produce the breaks needed, but the regression line doesn't contain a break and doesn't reflect the slope properly. As you can see, the slope of the regression line isn't as sharp as the graph above and there needs to be a visible distinction in the slope of the line between those disparate groups.
library(plotrix)
gap.plot(
x = dis,
y = result[, 1],
gap = c(700, 4700),
gap.axis = "x",
xlab = "X-Axis",
ylab = "Y-Axis",
xtics = seq(0, 5575, by = 200)
)
abline(v = seq(700, 733) , col = "white")
abline(lm(result[, 1] ~ dis), col = "red", lwd = 2)
axis.break(1, 716, style = "slash")
Using MS Paint, I created an approximation of what the graph should look like. Notice the break marks on the top as well as the discontinuity between on the regression line between the two groups.
One solution is to plot the regression line in two pieces, using ablineclip to limit what's plotted each time. (Similar to #tung's suggestion, although it's clear that you want the appearance of a single graph rather than the appearance of facets.) Here's how that would work:
library(plotrix)
# Simulate some data that looks roughly like the original graph.
dis = c(rnorm(100, 300, 50), rnorm(100, 5000, 100))
result = c(rnorm(100, 0.6, 0.1), rnorm(100, 0.5, 0.1))
# Store the location of the gap so we can refer to it later.
x.axis.gap = c(700, 4700)
# gap.plot() works internally by shifting the location of the points to be
# plotted based on the gap size/location, and then adjusting the axis labels
# accordingly. We'll re-compute the second half of the regression line in the
# same way; these are the new values for the x-axis.
dis.alt = dis - x.axis.gap[1]
# Plot (same as before).
gap.plot(
x = dis,
y = result,
gap = x.axis.gap,
gap.axis = "x",
xlab = "X-Axis",
ylab = "Y-Axis",
xtics = seq(0, 5575, by = 200)
)
abline(v = seq(700, 733), col = "white")
axis.break(1, 716, style = "slash")
# Add regression line in two pieces: from 0 to the start of the gap, and from
# the end of the gap to infinity.
ablineclip(lm(result ~ dis), col = "red", lwd = 2, x2 = x.axis.gap[1])
ablineclip(lm(result ~ dis.alt), col = "red", lwd = 2, x1 = x.axis.gap[1] + 33)

ggplot2: solid line for one group, points for the other

I have four series that I would like to plot.
There are 2 models : xg and algo30.
There are two types of data: predicted and observed.
This means we have the following 4 series: "predicted xg","observed xg", "predicted 30", "observed 30".
I want "xg" to be blue, "algo30" to be red.
I also want predicted to be a solid line and observed to be points.
Here is what I mean, using base plot:
library(magrittr)
library(ggplot2)
library(dplyr)
set.seed(123)
gr <- 1:10
obs.xg <- sort(runif(10, 0.5, 1))
obs.30 <- sort(runif(10, 0.5, 1))
pred.xg <- lm(obs.xg~gr) %>% predict() %>% add(rnorm(10,0,.01))
pred.30 <- lm(obs.30~gr) %>% predict() %>% add(rnorm(10,0,.01))
plot(gr, obs.xg, col="darkblue", ylim=range(c(obs.xg,obs.30)), pch=20)
lines(gr, pred.xg, col="darkblue", lwd=2)
points(gr, obs.30, col="firebrick", pch=20)
lines(gr, pred.30, col="firebrick", lwd=2)
legend("bottomright",
pch=c(20,NA,NA,NA,NA),
lty=c(NA,1,NA,1,1),
lwd=c(NA,1,NA,2,2),
col = c("black","black",NA, "darkblue","firebrick"),
legend=c("observé","prédit",NA,"xgboost","algo30"),
bty='n')
Here is my best attempt using ggplot. Notice that the legend doesnt work as I want.
xg.data <- data.frame(model= "xg", decile = seq(1:10), observed = obs.xg, predicted = pred.xg)
algo30.data <- data.frame(model = "algo30",decile = seq(1:10), observed = obs.30, predicted = pred.30)
ggplotdata <- bind_rows(xg.data, algo30.data)
ggplotdata %>%
ggplot( aes(x=decile, y= predicted, color= model))+ geom_line()+
geom_point(aes(x=decile, y= observed, color = model))
Most of the time when making a legend like this I look to override.aes in guide_legend().
The idea here is to make a legend using an additional aesthetic that you don't want mapped onto the plot itself and then using constants instead of a variable for that aesthetic. I used alpha, since both points and lines use that aesthetic.
Then the heavy lifting is done in scale_alpha_manual: removing the legend name, making sure the plot still looks right by setting the values, and then, finally, picking the correct point type and lines along with blanks for the legend.
ggplot(ggplotdata, aes(x=decile, y= predicted, color= model))+
geom_line( aes(alpha = "prédit") )+
geom_point(aes(x=decile, y= observed, alpha = "observé")) +
scale_alpha_manual(name = NULL, values = c(1, 1),
guide = guide_legend(override.aes = list(linetype = c(0, 1), shape = c(16, NA)))) +
scale_color_manual(name = NULL, values = c("firebrick", "darkblue"))

Plot two densities in the same figure in R

I have two densities
N(µ = 1, σ2 = 1) and
N(µ = −3.5, σ2 = 3/4). I know I am have to use plot() and lines() but I am not sure how to convert the densities into functions. I am not even sure if that's what I have to do.
Any help would be appreciated. Thank you
You can use the dnorm() function, along with a sequence of numbers generated with seq() to get values to plot a pdf:
Get 5000 values between -10 and 10
x<-seq(-10,10,length=5000)
Calculate values - notice that dnorm() uses standard deviation and not the variance, so you need to take the square root of 0.75.
y<-dnorm(x,mean=0, sd=1)
z<-dnorm(x, mean = -3.5, sd = sqrt(0.75))
Plot first density in red with plot():
plot(x, y, type="l" , ylim = c(0,1), xlim = c(-8,8), col = "red")
Plot second one on top of first using the lines() function in blue:
lines(x,z, type = "l", col = "blue")
This code will plot you two densities in the same figure ;)
library(tidyverse)
seq(-10, 10, by = 0.1) %>%
tibble(x = .) %>%
mutate(D1 = dnorm(x, 1, 1),
D2 = dnorm(x, -3.5, 3/4)) %>%
gather(-x, key = Distribution, value = Value) %>%
ggplot(aes(x, Value)) +
geom_line(aes(color = Distribution))

How to superimpose bar plots in R?

I'm trying to create a figure similar to the one below (taken from Ro, Russell, & Lavie, 2001). In their graph, they are plotting bars for the errors (i.e., accuracy) within the reaction time bars. Basically, what I am looking for is a way to plot bars within bars.
I know there are several challenges with creating a graph like this. First, Hadley points out that it is not possible to create a graph with two scales in ggplot2 because those graphs are fundamentally flawed (see Plot with 2 y axes, one y axis on the left, and another y axis on the right)
Nonetheless, the graph with superimposed bars seems to solve this dual sclaing problem, and I'm trying to figure out a way to create it in R. Any help would be appreciated.
It's fairly easy in base R, by using par(new = T) to add to an existing graph
set.seed(54321) # for reproducibility
data.1 <- sample(1000:2000, 10)
data.2 <- sample(seq(0, 5, 0.1), 10)
# Use xpd = F to avoid plotting the bars below the axis
barplot(data.1, las = 1, col = "black", ylim = c(500, 3000), xpd = F)
par(new = T)
# Plot the new data with a different ylim, but don't plot the axis
barplot(data.2, las = 1, col = "white", ylim = c(0, 30), yaxt = "n")
# Add the axis on the right
axis(4, las = 1)
It is pretty easy to make the bars in ggplot. Here is some example code. No two y-axes though (although look here for a way to do that too).
library(ggplot2)
data.1 <- sample(1000:2000, 10)
data.2 <- sample(500:1000, 10)
library(ggplot2)
ggplot(mapping = aes(x, y)) +
geom_bar(data = data.frame(x = 1:10, y = data.1), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = 1:10, y = data.2), width = 0.4, stat = 'identity', fill = 'white') +
theme_classic() + scale_y_continuous(expand = c(0, 0))

How to overlay density plots in R?

I would like to overlay 2 density plots on the same device with R. How can I do that? I searched the web but I didn't find any obvious solution.
My idea would be to read data from a text file (columns) and then use
plot(density(MyData$Column1))
plot(density(MyData$Column2), add=T)
Or something in this spirit.
use lines for the second one:
plot(density(MyData$Column1))
lines(density(MyData$Column2))
make sure the limits of the first plot are suitable, though.
ggplot2 is another graphics package that handles things like the range issue Gavin mentions in a pretty slick way. It also handles auto generating appropriate legends and just generally has a more polished feel in my opinion out of the box with less manual manipulation.
library(ggplot2)
#Sample data
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
#Plot.
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)
Adding base graphics version that takes care of y-axis limits, add colors and works for any number of columns:
If we have a data set:
myData <- data.frame(std.nromal=rnorm(1000, m=0, sd=1),
wide.normal=rnorm(1000, m=0, sd=2),
exponent=rexp(1000, rate=1),
uniform=runif(1000, min=-3, max=3)
)
Then to plot the densities:
dens <- apply(myData, 2, density)
plot(NA, xlim=range(sapply(dens, "[", "x")), ylim=range(sapply(dens, "[", "y")))
mapply(lines, dens, col=1:length(dens))
legend("topright", legend=names(dens), fill=1:length(dens))
Which gives:
Just to provide a complete set, here's a version of Chase's answer using lattice:
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
densityplot(~dens,data=dat,groups = lines,
plot.points = FALSE, ref = TRUE,
auto.key = list(space = "right"))
which produces a plot like this:
That's how I do it in base (it's actually mentionned in the first answer comments but I'll show the full code here, including legend as I can not comment yet...)
First you need to get the info on the max values for the y axis from the density plots. So you need to actually compute the densities separately first
dta_A <- density(VarA, na.rm = TRUE)
dta_B <- density(VarB, na.rm = TRUE)
Then plot them according to the first answer and define min and max values for the y axis that you just got. (I set the min value to 0)
plot(dta_A, col = "blue", main = "2 densities on one plot"),
ylim = c(0, max(dta_A$y,dta_B$y)))
lines(dta_B, col = "red")
Then add a legend to the top right corner
legend("topright", c("VarA","VarB"), lty = c(1,1), col = c("blue","red"))
I took the above lattice example and made a nifty function. There is probably a better way to do this with reshape via melt/cast. (Comment or edit if you see an improvement.)
multi.density.plot=function(data,main=paste(names(data),collapse = ' vs '),...){
##combines multiple density plots together when given a list
df=data.frame();
for(n in names(data)){
idf=data.frame(x=data[[n]],label=rep(n,length(data[[n]])))
df=rbind(df,idf)
}
densityplot(~x,data=df,groups = label,plot.points = F, ref = T, auto.key = list(space = "right"),main=main,...)
}
Example usage:
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1),main='BN1 vs BN2')
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1))
You can use the ggjoy package. Let's say that we have three different beta distributions such as:
set.seed(5)
b1<-data.frame(Variant= "Variant 1", Values = rbeta(1000, 101, 1001))
b2<-data.frame(Variant= "Variant 2", Values = rbeta(1000, 111, 1011))
b3<-data.frame(Variant= "Variant 3", Values = rbeta(1000, 11, 101))
df<-rbind(b1,b2,b3)
You can get the three different distributions as follows:
library(tidyverse)
library(ggjoy)
ggplot(df, aes(x=Values, y=Variant))+
geom_joy(scale = 2, alpha=0.5) +
scale_y_discrete(expand=c(0.01, 0)) +
scale_x_continuous(expand=c(0.01, 0)) +
theme_joy()
Whenever there are issues of mismatched axis limits, the right tool in base graphics is to use matplot. The key is to leverage the from and to arguments to density.default. It's a bit hackish, but fairly straightforward to roll yourself:
set.seed(102349)
x1 = rnorm(1000, mean = 5, sd = 3)
x2 = rnorm(5000, mean = 2, sd = 8)
xrng = range(x1, x2)
#force the x values at which density is
# evaluated to be the same between 'density'
# calls by specifying 'from' and 'to'
# (and possibly 'n', if you'd like)
kde1 = density(x1, from = xrng[1L], to = xrng[2L])
kde2 = density(x2, from = xrng[1L], to = xrng[2L])
matplot(kde1$x, cbind(kde1$y, kde2$y))
Add bells and whistles as desired (matplot accepts all the standard plot/par arguments, e.g. lty, type, col, lwd, ...).

Resources