Plotting log normal density in R has wrong height - r

I have a log-normal density with a mean of -0.4 and standard deviation of 2.5 on the log scale (meanlog = -0.4, sdlog = 2.5).
At x = 0.001 the height is over 5 (I double checked this value with the formula for the log-normal PDF):
dlnorm(0.001, -0.4, 2.5)
5.389517
When I plot it using the curve function over the input range 0-6, the peak appears to have a height of just over 1.5:
curve(dlnorm(x, -.4, 2.5), xlim = c(0, 6), ylim = c(0, 6))
When I adjust the input range to 0-1 the height is nearly 4:
curve(dlnorm(x, -.4, 2.5), xlim = c(0, 1), ylim = c(0, 6))
Similarly with ggplot2 (output not shown, but looks like the curve plots above):
library(ggplot2)
ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
stat_function(fun = function(x) dlnorm(x, -0.4, 2.5)) +
xlim(0, 6) +
ylim(0, 6)
ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
stat_function(fun = function(x) dlnorm(x, -0.4, 2.5)) +
xlim(0, 1) +
ylim(0, 6)
Does someone know why the density height is changing when the x-axis scale is adjusted? And why neither attempt above seems to reach the correct height? I tried this with just the normal density and this doesn't happen.

curve evaluates the function at a set of discrete points in the range you give it. By default it uses n = 101 equally spaced points, so for a density that peaks this sharply near zero the grid steps right over the peak. If you increase the number of points you will get almost the correct value:
curve(dlnorm(x, -.4, 2.5), xlim = c(0, 1), ylim = c(0, 6), n = 1000)
In your first case curve generates 101 points on the interval [0, 6] (a step of 0.06), while in the second case it generates 101 points on the interval [0, 1] (a step of 0.01), so the grid is denser and its first nonzero point lands closer to the peak.
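To see the effect of the grid directly, here is a minimal sketch that evaluates the density on the same kind of equally spaced grids and reports the largest value found on each. The helper names f and grid_max are made up for this illustration; the mode formula exp(meanlog - sdlog^2) is the standard analytic mode of a log-normal.

```r
# Density from the question: meanlog = -0.4, sdlog = 2.5
f <- function(x) dlnorm(x, meanlog = -0.4, sdlog = 2.5)

# Hypothetical helper: largest density value seen on an equally spaced grid
grid_max <- function(from, to, n) {
  x <- seq(from, to, length.out = n)
  max(f(x))
}

max_0_6_101  <- grid_max(0, 6, 101)   # grid step 0.06
max_0_1_101  <- grid_max(0, 1, 101)   # grid step 0.01
max_0_1_1000 <- grid_max(0, 1, 1000)  # grid step of about 0.001

# Analytic mode of the log-normal and the true peak height
mode_x <- exp(-0.4 - 2.5^2)
peak   <- f(mode_x)

print(c(max_0_6_101, max_0_1_101, max_0_1_1000, peak))
```

The finer the grid near zero, the closer the plotted maximum gets to the true peak. Another option, if a logarithmic x-axis is acceptable, is to call curve with log = "x" and a strictly positive lower limit, which spaces the evaluation points logarithmically.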

Related

Density plot of the F-distribution (df1=1). Theoretical or simulated?

I am plotting the density of F(1,49) in R. It seems that the simulated plot does not match the theoretical plot as values approach zero.
set.seed(123)
val <- rf(1000, df1=1, df2=49)
plot(density(val), yaxt="n",ylab="",xlab="Observation",
main=expression(paste("Density plot (",italic(n),"=1000, ",italic(df)[1],"=1, ",italic(df)[2],"=49)")),
lwd=2)
curve(df(x, df1=1, df2=49), from=0, to=10, add=T, col="red",lwd=2,lty=2)
legend("topright",c("Theoretical","Simulated"),
col=c("red","black"),lty=c(2,1),bty="n")
Using density(val, from = 0) gets you much closer, although still not perfect. Densities near boundaries are notoriously difficult to calculate in a satisfactory way.
By default, density uses a Gaussian kernel to estimate the probability density at a given point. Effectively, this means that at each point an observation was found, a normal density curve is placed there with its center at the observation. All these normal densities are added up, then the result is normalized so that the area under the curve is 1.
This works well if observations have a central tendency, but gives unrealistic results when there are sharp boundaries (Try plot(density(runif(1000))) for a prime example).
When you have a very high density of points close to zero, but none below zero, the left tails of the normal kernels "spill over" into the negative values, giving a Gaussian-type shape near the boundary which doesn't match the theoretical density.
This means that if you have a sharp boundary at 0, you should remove values of your simulated density that lie between zero and about two standard deviations of your smoothing kernel; anything closer to the boundary than this will be misleading.
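As a rough check of the spill-over, the following sketch estimates how much of the kernel density estimate's area sits below zero for the simulated F(1, 49) sample from the question. The variable names are illustrative, and the area is approximated with a simple Riemann sum over the evaluation grid returned by density.

```r
set.seed(123)
val <- rf(1000, df1 = 1, df2 = 49)   # all simulated values are positive

d  <- density(val)                   # default Gaussian kernel
dx <- d$x[2] - d$x[1]                # density() returns an equally spaced grid

# Approximate probability mass the estimate places on impossible negative values
mass_below_zero <- sum(d$y[d$x < 0]) * dx
print(mass_below_zero)

# Starting the evaluation grid at 0 hides the negative part,
# but the kernels that straddle the boundary still bias the estimate near it
d0 <- density(val, from = 0)
print(min(d0$x))
print(d$bw)                          # standard deviation of the smoothing kernel
```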
Since we can control the standard deviation of our smoothing kernel with the bw parameter of density, and easily control which x values are plotted using ggplot, we will get a more sensible result by doing something like this:
library(ggplot2)
ggplot(as.data.frame(density(val, bw = 0.1)[c("x", "y")]), aes(x, y)) +
geom_line(aes(col = "Simulated"), na.rm = TRUE) +
geom_function(fun = ~ df(.x, df1 = 1, df2 = 49),
aes(col = "Theoretical"), lty = 2) +
lims(x = c(0.2, 12)) +
theme_classic(base_size = 16) +
labs(title = expression(paste("Density plot (",italic(n),"=1000, ",
italic(df)[1],"=1, ",italic(df)[2],"=49)")),
x = "Observation", y = "") +
scale_color_manual(values = c("black", "red"), name = "")
The kde1d and logspline packages are not bad for such densities.
sims <- rf(1500, 1, 49)
library(kde1d)
kd <- kde1d(sims, bw = 1, xmin = 0)
plot(kd, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
library(logspline)
fit <- logspline(sims, lbound = 0, knots = c(0, 0.5, 1, 1.5, 2))
plot(fit, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)

How can I individually change the decimal place of a select few axis labels in ggplot?

I have a simple plot below. I log scaled the x-axis and I want the graph to show 0.1, 1, 10. I can't figure out how to override the default of 0.1, 1.0, 10.0.
Is there a way I could change only two of the x-axis labels?
library(ggplot2)
x <- c(0.1, 1, 10)
y <- c(1, 5, 10)
ggplot()+
geom_point(aes(x,y)) +
scale_x_log10()
You could specify labels and breaks in scale_x_log10:
library(ggplot2)
x <- c(0.1, 1, 10)
y <- c(1, 5, 10)
ggplot() + geom_point(aes(x,y)) + scale_x_log10(labels = x, breaks = x)
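A variation on the same idea, sketched under the assumption that you want the breaks fixed but the labels derived from them: the labels argument also accepts a function, and as.character drops the trailing zeros for these values. The object names dat, brks and p are made up for the example.

```r
library(ggplot2)

dat  <- data.frame(x = c(0.1, 1, 10), y = c(1, 5, 10))
brks <- c(0.1, 1, 10)

# labels can be a function that receives the breaks and returns strings
p <- ggplot(dat, aes(x, y)) +
  geom_point() +
  scale_x_log10(breaks = brks, labels = as.character)
p
```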

Revise the number of ticks in the x-axis?

I only have a series of numbers, and I want to count the occurrences of each element. Here is what I have done so far. The x-axis shows the elements and the y-axis shows the count of each element.
My question is, how can I change how the x-axis is presented? I only want to see 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9 on the axis, while keeping the same number of bars in the figure (nothing else changed). Any suggestions, please?
d1 <- ggplot(TestData, aes(factor(TestData$Col1)))
d2 <- d1 + geom_bar() + xlab("") + ylab("")
Create data with mean of 0.5, std of 0.2:
data<- rnorm(1000,0.5,0.2)
dataf <- data.frame(data)
Make histogram for all data range:
ggplot(aes(x = data),data = dataf) +
geom_histogram()
Limit the x-axis to the range 0.4 to 0.9 (note that limits drops the observations outside that range):
ggplot(aes(x = data),data = dataf) +
geom_histogram() +
scale_x_continuous(limits = c(0.4,0.9),
breaks= scales::pretty_breaks(n=5))
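If the bars themselves must stay unchanged, one option is to set only the breaks and leave the limits alone, so that no observations are dropped. This is a sketch with simulated data; the choice of 30 bins and the names dataf, p and built are assumptions made for the example.

```r
library(ggplot2)

set.seed(1)
dataf <- data.frame(data = rnorm(1000, 0.5, 0.2))

# Only the tick positions change; the x-range and the bars stay as they were
p <- ggplot(dataf, aes(x = data)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(breaks = seq(0.4, 0.9, by = 0.1))
p

# All 1000 observations are still counted
built <- ggplot_build(p)$data[[1]]
print(sum(built$count))
```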
In base graphics, you can just omit the axes when generating the plot, then add them manually using the axis function:
set.seed(1234)
dat <- rnorm(1000, 0.5, 0.1)
hist(dat, axes = FALSE, xlim = c(0, 1))
axis(side = 2)
axis(side = 1, at = seq(0.4, 0.9, 0.1))

r : ecdf over histogram

In R, with ecdf I can plot an empirical cumulative distribution function
plot(ecdf(mydata))
and with hist I can plot a histogram of my data
hist(mydata)
How can I plot the histogram and the ecdf in the same plot?
EDIT
I am trying to make something like this:
https://mathematica.stackexchange.com/questions/18723/how-do-i-overlay-a-histogram-with-a-plot-of-cdf
Also a bit late, here's another solution that extends @Christoph's solution with a second y-axis.
par(mar = c(5,5,2,5))
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
par(new = T)
ec <- ecdf(dt)
plot(x = h$mids, y=ec(h$mids)*max(h$counts), col = rgb(0,0,0,alpha=0), axes=F, xlab=NA, ylab=NA)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
axis(4, at=seq(from = 0, to = max(h$counts), length.out = 11), labels=seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')
The trick is the following: You don't add a line to your plot, but plot another plot on top, that's why we need par(new = T). Then you have to add the y-axis later on (otherwise it will be plotted over the y-axis on the left).
Credits go here (@tim_yates' answer) and there.
There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram. This results in a harder to read histogram. The second way is to alter the scale of one or the other element.
I suspect this question will soon become interesting to you, particularly @hadley's answer.
ggplot2 single scale
Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome though because the CDF and histograms (count or relative) are on quite different visual scales. Note this solution has the data in a dataframe called mydata with the desired variable in x.
library(ggplot2)
set.seed(27272)
mydata <- data.frame(x= rexp(333, rate=4) + rnorm(333))
ggplot(mydata, aes(x)) +
stat_ecdf(color="red") +
geom_histogram(aes(y = after_stat(count / sum(count))))
base R multi scale
Here I will rescale the empirical CDF so that instead of a max value of 1, its maximum value is whatever bin has the highest relative frequency.
h <- hist(mydata$x, freq=F)
ec <- ecdf(mydata$x)
lines(x = knots(ec),
y=(1:length(mydata$x))/length(mydata$x) * max(h$density),
col ='red')
You can try a ggplot approach with a second axis:
library(dplyr)
library(ggplot2)
set.seed(15)
a <- rnorm(500, 50, 10)
# calculate ecdf with binsize 30
binsize=30
df <- tibble(x=seq(min(a), max(a), diff(range(a))/binsize)) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(a))
# plot
ggplot() +
geom_histogram(aes(a), bins = binsize) +
geom_line(data = df, aes(x=x, y=Ecdf_scaled), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(a), name = "Ecdf"))
Edit
Since the scaling was wrong I added a second solution, calculating everything in advance:
binsize=30
a_range= floor(range(a)) +c(0,1)
b <- seq(a_range[1], a_range[2], round(diff(a_range)/binsize)) %>% floor()
df_hist <- tibble(a) %>%
mutate(gr = cut(a,b, labels = floor(b[-1]), include.lowest = T, right = T)) %>%
count(gr) %>%
mutate(gr = as.character(gr) %>% as.numeric())
# calculate ecdf with binsize 30
df <- tibble(x=b) %>%
bind_cols(Ecdf=with(.,ecdf(a)(x))) %>%
mutate(Ecdf_scaled=Ecdf*max(df_hist$n))
ggplot(df_hist, aes(gr, n)) +
geom_col(width = 2, color = "white") +
geom_line(data = df, aes(x=x, y=Ecdf*max(df_hist$n)), color=2, size = 2) +
scale_y_continuous(name = "Density",sec.axis = sec_axis(trans = ~./max(df_hist$n), name = "Ecdf"))
As already pointed out, this is problematic because the plots you want to merge have such different y-scales. You can try
set.seed(15)
mydata<-runif(50)
hist(mydata, freq=F)
lines(ecdf(mydata))
Although a bit late... Another version which is working with preset bins:
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
dt,
breaks = seq(0, 100, 1),
xlim = c(0,100))
ec <- ecdf(dt)
lines(x = h$mids, y=ec(h$mids)*max(h$counts), col ='red')
lines(x = c(0,100), y=c(1,1)*max(h$counts), col ='red', lty = 3) # indicates 100%
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))] # bin midpoint where the ECDF is closest to 90%
lines(x = c(x90, x90), # indicates where 90% is reached
y = c(0, max(h$counts)), col ='black', lty = 3)
(Only the second y-axis is not working yet...)
In addition to the previous answers, I wanted to have ggplot do the tedious calculation (in contrast to @Roman's solution, which was kindly updated upon my request), i.e., calculate and draw the histogram and calculate and overlay the ECDF. I came up with the following (pseudo code):
# 1. Prepare the plot
plot <- ggplot() + geom_histogram(...)
# 2. Get the max value of Y axis as calculated in the previous step
maxPlotY <- max(ggplot_build(plot)$data[[1]]$y)
# 3. Overlay scaled ECDF and add secondary axis
plot +
stat_ecdf(aes(y=..y..*maxPlotY)) +
scale_y_continuous(name = "Density", sec.axis = sec_axis(trans = ~./maxPlotY, name = "ECDF"))
This way you don't need to calculate everything beforehand and feed the results to ggplot. Just sit back and let it do everything for you!
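For reference, here is one way the three steps could look as runnable code. It is a sketch with simulated data, not the answerer's exact method: the data, the 30 bins and the names dat, max_count and ecdf_df are assumptions, and the ECDF is computed with base R's ecdf() and drawn with geom_step rather than relying on stat_ecdf's computed variable.

```r
library(ggplot2)

set.seed(15)
dat <- data.frame(x = rnorm(500, 50, 10))   # example data, not from the answer

# 1. Prepare the histogram
p <- ggplot(dat, aes(x)) + geom_histogram(bins = 30)

# 2. Tallest bar as computed by ggplot
max_count <- max(ggplot_build(p)$data[[1]]$y)

# 3. ECDF rescaled so that 1 maps onto the tallest bar
ecdf_df <- data.frame(x = sort(dat$x))
ecdf_df$y <- ecdf(dat$x)(ecdf_df$x) * max_count

p_ecdf <- p +
  geom_step(data = ecdf_df, aes(x, y), colour = "red") +
  scale_y_continuous(name = "Count",
                     sec.axis = sec_axis(~ . / max_count, name = "ECDF"))
p_ecdf
```

The secondary axis simply divides by the same factor the ECDF was multiplied by, so it reads from 0 to 1.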

zoom in a CDF figure

I have the figure below, which shows CDFs. How can I zoom in to better show the difference between the four lines in the upper-left part of the figure?
You can use coord_cartesian to zoom in. I don't know what you mean by having the zoom part and the whole in the same figure. If you want to have them side by side, you can use the multiplot function found in the Cookbook for R page. For example:
df <- data.frame(x = c(rnorm(100, 0, 3), rnorm(100, 0, 10)),
g = gl(2, 100))
p <- ggplot(df, aes(x, colour = g)) + stat_ecdf()
p1 <- p
p2 <- p + coord_cartesian(ylim = c(.75, 1))
multiplot(p1, p2)
Edit
Based on @Paul Lemmens' comment, you can use grid's viewport function in the following way:
library(grid)
pdf("~/Desktop/foo.pdf", width = 6, height = 6)
# inset viewport: 40% of the page in each direction, centred in the lower-right quadrant
subvp <- viewport(width = .4, height = .4, x = .75, y = .25)
print(p1)
print(p2, vp = subvp)
dev.off()
This draws the zoomed plot p2 as an inset in the lower-right part of p1; adjust the details for your specific example.
