Plot two densities in the same figure in R - r

I have two densities
N(µ = 1, σ2 = 1) and
N(µ = −3.5, σ2 = 3/4). I know I am have to use plot() and lines() but I am not sure how to convert the densities into functions. I am not even sure if that's what I have to do.
Any help would be appreciated. Thank you

You can use the dnorm() function, along with a sequence of numbers generated with seq() to get values to plot a pdf:
Get 5000 values between -10 and 10
x<-seq(-10,10,length=5000)
Calculate values - notice that dnorm() uses standard deviation and not the variance, so you need to take the square root of 0.75.
y<-dnorm(x,mean=0, sd=1)
z<-dnorm(x, mean = -3.5, sd = sqrt(0.75))
Plot first density in red with plot():
plot(x, y, type="l" , ylim = c(0,1), xlim = c(-8,8), col = "red")
Plot second one on top of first using the lines() function in blue:
lines(x,z, type = "l", col = "blue")

This code will plot you two densities in the same figure ;)
library(tidyverse)
seq(-10, 10, by = 0.1) %>%
tibble(x = .) %>%
mutate(D1 = dnorm(x, 1, 1),
D2 = dnorm(x, -3.5, 3/4)) %>%
gather(-x, key = Distribution, value = Value) %>%
ggplot(aes(x, Value)) +
geom_line(aes(color = Distribution))

Related

Density plot of the F-distribution (df1=1). Theoretical or simulated?

I am plotting the density of F(1,49) in R. It seems that the simulated plot does not match the theoretical plot when values approach the zero.
set.seed(123)
val <- rf(1000, df1=1, df2=49)
plot(density(val), yaxt="n",ylab="",xlab="Observation",
main=expression(paste("Density plot (",italic(n),"=1000, ",italic(df)[1],"=1, ",italic(df)[2],"=49)")),
lwd=2)
curve(df(x, df1=1, df2=49), from=0, to=10, add=T, col="red",lwd=2,lty=2)
legend("topright",c("Theoretical","Simulated"),
col=c("red","black"),lty=c(2,1),bty="n")
Using density(val, from = 0) gets you much closer, although still not perfect. Densities near boundaries are notoriously difficult to calculate in a satisfactory way.
By default, density uses a Gaussian kernel to estimate the probability density at a given point. Effectively, this means that at each point an observation was found, a normal density curve is placed there with its center at the observation. All these normal densities are added up, then the result is normalized so that the area under the curve is 1.
This works well if observations have a central tendency, but gives unrealistic results when there are sharp boundaries (Try plot(density(runif(1000))) for a prime example).
When you have a very high density of points close to zero, but none below zero, the left tail of all the normal kernels will "spill over" into the negative values, giving a Gaussian-type which doesn't match the theoretical density.
This means that if you have a sharp boundary at 0, you should remove values of your simulated density that are between zero and about two standard deviations of your smoothing kernel - anything below this will be misleading.
Since we can control the standard deviation of our smoothing kernel with the bw parameter of density, and easily control which x values are plotted using ggplot, we will get a more sensible result by doing something like this:
library(ggplot2)
ggplot(as.data.frame(density(val), bw = 0.1), aes(x, y)) +
geom_line(aes(col = "Simulated"), na.rm = TRUE) +
geom_function(fun = ~ df(.x, df1 = 1, df2 = 49),
aes(col = "Theoretical"), lty = 2) +
lims(x = c(0.2, 12)) +
theme_classic(base_size = 16) +
labs(title = expression(paste("Density plot (",italic(n),"=1000, ",
italic(df)[1],"=1, ",italic(df)[2],"=49)")),
x = "Observation", y = "") +
scale_color_manual(values = c("black", "red"), name = "")
The kde1d and logspline packages are not bad for such densities.
sims <- rf(1500, 1, 49)
library(kde1d)
kd <- kde1d(sims, bw = 1, xmin = 0)
plot(kd, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
library(logspline)
fit <- logspline(sims, lbound = 0, knots = c(0, 0.5, 1, 1.5, 2))
plot(fit, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)

Simulation Poisson Process using R and ggplot2

Using a simulation of a Poisson process with rate lambda = 0.7. Show a sample run of a Poisson process with N(t) on the vertical axis and time t on the horizontal axis. The simulation is from the range t[0:100]. Generate first graph with 10 trajectories and second graph with 100 trajectories.
I have tried the following code but I cannot generate both graphs.
library(plyr)
library(ggplot2)
Process_poisson<- function(t, lambda){
distr_poisson<- rpois(1, t*lambda)
s_poisson<- sort(runif(distr_poisson, 0, t))
data.frame(x = c(0, 0, s_poisson),y = c(0, 0:distr_poisson))
}
N_simulations<- function(n,t,lambda){
s_poisson<- lapply (1:n, function(n) data.frame(Process_poisson(t, lambda), simulation = n))
s_poisson<- ldply (s_poisson, data.frame)
s_poisson$simulation<- factor(s_poisson$simulation)
}
t<- 0:100
lambda<- 0.7
N_simulations(10, t, lambda)
N_simulations(100, t, lambda)
par(mfrow = c(1,2))
matplot(x, y, type = "l", lty = 0:5, lwd = 1, lend = par("lend"),
pch = NULL, col = simulation, cex = 0.5, bg = NA, main =sprintf("Nº simulations of trajectories of Poisson Process",10,lambda), xlab = "Time", ylab = "N(t)",
xlim = c(0,100), ylim = c(-10,0))
matplot(Proceso_poisson(t, lambda), n, y, type = "l", lty = 0:5, lwd = 1, lend = par("lend"),
pch = NULL, col = simulacion, cex = 0.5, bg = NA, main =sprintf("Nº simulations of trajectories of Poisson Process",10,lambda), xlab = "Time", ylab = "N(t)",
xlim = c(0,100), ylim = c(-10,0))
How could I do it?
Thanks so much!
I think you could make this simpler. Here's a ggplot solution.
First, create a function that will simulate a Poisson process by taking samples drawn from an exponential distribution with the appropriate lambda. In this example I have used a while loop that starts with a vector x whose first element is 0. The function grows this vector by adding random samples until its sum reaches the target duration tmax. This is not the most efficient way to do it, but should make the example clearer.
When the target is reached, the function returns the cumulative sum of the vector, which represents the arrival times of a Poisson process of the appropriate lambda. Note that to make plotting easier, it actually returns a data frame with the cumulative times, the cumulative count, and a grouping variable run that will allow us to plot several runs easily on a single plot.
make_sample_df <- function(run, tmax, lambda)
{
x <- 0
while(sum(x) < tmax) x <- c(x, rexp(1, lambda))
data.frame(t = cumsum(x), N = seq_along(x), run = rep(run, length(x)))
}
We can now use this function inside our actual plotting function:
plot_poisson <- function(runs, tmax, lambda)
{
# Creates one data frame for each run, this sticks them all together:
df <- do.call("rbind", lapply(seq(runs), make_sample_df, tmax, lambda))
ggplot2::ggplot(df, aes(t, N, group = run)) +
geom_step(alpha = 0.25) +
labs( title = paste(runs, "runs of Poisson process with lambda", lambda)) +
theme(legend.position = "none") +
coord_cartesian(xlim = c(0, tmax))
}
So you can do:
plot_poisson(runs = 10, tmax = 100, lambda = 0.7)
plot_poisson(runs = 100, tmax = 100, lambda = 0.7)

r : ecdf over histogram

in R, with ecdf I can plot a empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How I can plot the histogram and the ecdf in the same plot?
EDIT
I try make something like that
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends #Christoph 's Solution with a second y-Axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: You don't add a line to your plot, but plot another plot on top, that's why we need par(new = T). Then you have to add the y-axis later on (otherwise it will be plotted over the y-axis on the left).
Credits go here (#tim_yates Answer) and there.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly #hadley 's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_bar(aes(y = (..count..)/sum(..count..)))
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')
you can try a ggplot approach with a second axis
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong I added a second solution, calculatin everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
to get
Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
lines(x = c(which.min(abs(ec(h$mids) - 0.9)), which.min(abs(ec(h$mids) - 0.9))), # indicates where 90% is reached
y = c(0, max(h$counts)), col ='black', lty = 3)
(Only the second y-axis is not working yet...)
In addition to previous answers, I wanted to have ggplot do the tedious calculation (in contrast to #Roman's solution, which was kindly enough updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (pseudo code):
# 1. Prepare the plot
plot <- ggplot() + geom_hist(...)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggpplot. Just lay back and let it do everything for you!

Combining 2 datasets in a single plot in R

I have two columns of data, f.delta and g.delta that I would like to produce a scatter plot of in R.
Here is how I am doing it.
plot(f.delta~x, pch=20, col="blue")
points(g.delta~x, pch=20, col="red")
The problem is this: the values of f.delta vary from 0 to -7; the values of g.delta vary from 0 to 10.
When the plot is drawn, the y axis extends from 1 to -7. So while all the f.delta points are visible, any g.delta point that has y>1 is cut-off from view.
How do I stop R from automatically setting the ylims from the data values. Have tried, unsuccessfully, various combinations of yaxt, yaxp, ylims.
Any suggestion will be greatly appreciated.
Thanks,
Anjan
In addition to Gavin's excellent answer, I also thought I'd mention that another common idiom in these cases is to create an empty plot with the correct limits and then to fill it in using points, lines, etc.
Using Gavin's example data:
with(df,plot(range(x),range(f.delta,g.delta),type = "n"))
points(f.delta~x, data = df, pch=20, col="blue")
points(g.delta~x, data = df, pch=20, col="red")
The type = "n" causes plot to create only the empty plotting window, based on the range of x and y values we've supplied. Then we use points for both columns on this existing plot.
You need to tell R what the limits of the data are and pass that as argument ylim to plot() (note the argument is ylim not ylims!). Here is an example:
set.seed(1)
df <- data.frame(f.delta = runif(10, min = -7, max = 0),
g.delta = runif(10, min = 0, max = 10),
x = rnorm(10))
ylim <- with(df, range(f.delta, g.delta)) ## compute y axis limits
plot(f.delta ~ x, data = df, pch = 20, col = "blue", ylim = ylim)
points(g.delta ~ x, data = df, pch = 20, col = "red")
Which produces

How to overlay density plots in R?

I would like to overlay 2 density plots on the same device with R. How can I do that? I searched the web but I didn't find any obvious solution.
My idea would be to read data from a text file (columns) and then use
plot(density(MyData$Column1))
plot(density(MyData$Column2), add=T)
Or something in this spirit.
use lines for the second one:
plot(density(MyData$Column1))
lines(density(MyData$Column2))
make sure the limits of the first plot are suitable, though.
ggplot2 is another graphics package that handles things like the range issue Gavin mentions in a pretty slick way. It also handles auto generating appropriate legends and just generally has a more polished feel in my opinion out of the box with less manual manipulation.
library(ggplot2)
#Sample data
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
#Plot.
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)
Adding base graphics version that takes care of y-axis limits, add colors and works for any number of columns:
If we have a data set:
myData <- data.frame(std.nromal=rnorm(1000, m=0, sd=1),
wide.normal=rnorm(1000, m=0, sd=2),
exponent=rexp(1000, rate=1),
uniform=runif(1000, min=-3, max=3)
)
Then to plot the densities:
dens <- apply(myData, 2, density)
plot(NA, xlim=range(sapply(dens, "[", "x")), ylim=range(sapply(dens, "[", "y")))
mapply(lines, dens, col=1:length(dens))
legend("topright", legend=names(dens), fill=1:length(dens))
Which gives:
Just to provide a complete set, here's a version of Chase's answer using lattice:
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
, lines = rep(c("a", "b"), each = 100))
densityplot(~dens,data=dat,groups = lines,
plot.points = FALSE, ref = TRUE,
auto.key = list(space = "right"))
which produces a plot like this:
That's how I do it in base (it's actually mentionned in the first answer comments but I'll show the full code here, including legend as I can not comment yet...)
First you need to get the info on the max values for the y axis from the density plots. So you need to actually compute the densities separately first
dta_A <- density(VarA, na.rm = TRUE)
dta_B <- density(VarB, na.rm = TRUE)
Then plot them according to the first answer and define min and max values for the y axis that you just got. (I set the min value to 0)
plot(dta_A, col = "blue", main = "2 densities on one plot"),
ylim = c(0, max(dta_A$y,dta_B$y)))
lines(dta_B, col = "red")
Then add a legend to the top right corner
legend("topright", c("VarA","VarB"), lty = c(1,1), col = c("blue","red"))
I took the above lattice example and made a nifty function. There is probably a better way to do this with reshape via melt/cast. (Comment or edit if you see an improvement.)
multi.density.plot=function(data,main=paste(names(data),collapse = ' vs '),...){
##combines multiple density plots together when given a list
df=data.frame();
for(n in names(data)){
idf=data.frame(x=data[[n]],label=rep(n,length(data[[n]])))
df=rbind(df,idf)
}
densityplot(~x,data=df,groups = label,plot.points = F, ref = T, auto.key = list(space = "right"),main=main,...)
}
Example usage:
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1),main='BN1 vs BN2')
multi.density.plot(list(BN1=bn1$V1,BN2=bn2$V1))
You can use the ggjoy package. Let's say that we have three different beta distributions such as:
set.seed(5)
b1<-data.frame(Variant= "Variant 1", Values = rbeta(1000, 101, 1001))
b2<-data.frame(Variant= "Variant 2", Values = rbeta(1000, 111, 1011))
b3<-data.frame(Variant= "Variant 3", Values = rbeta(1000, 11, 101))
df<-rbind(b1,b2,b3)
You can get the three different distributions as follows:
library(tidyverse)
library(ggjoy)
ggplot(df, aes(x=Values, y=Variant))+
geom_joy(scale = 2, alpha=0.5) +
scale_y_discrete(expand=c(0.01, 0)) +
scale_x_continuous(expand=c(0.01, 0)) +
theme_joy()
Whenever there are issues of mismatched axis limits, the right tool in base graphics is to use matplot. The key is to leverage the from and to arguments to density.default. It's a bit hackish, but fairly straightforward to roll yourself:
set.seed(102349)
x1 = rnorm(1000, mean = 5, sd = 3)
x2 = rnorm(5000, mean = 2, sd = 8)
xrng = range(x1, x2)
#force the x values at which density is
# evaluated to be the same between 'density'
# calls by specifying 'from' and 'to'
# (and possibly 'n', if you'd like)
kde1 = density(x1, from = xrng[1L], to = xrng[2L])
kde2 = density(x2, from = xrng[1L], to = xrng[2L])
matplot(kde1$x, cbind(kde1$y, kde2$y))
Add bells and whistles as desired (matplot accepts all the standard plot/par arguments, e.g. lty, type, col, lwd, ...).

Resources