ggplot Poisson density curve: why zigzag lines? - r

I would like to plot the density function of a Poisson distribution. I am not sure why I get a jagged line (in blue). On the sample plot, the normal density curve (in red) looks smooth. Is it because the Poisson density function doesn't accept non-integer values? How can I eliminate the zigzag in the Poisson density plot? Thanks very much for any help.
library(ggplot2)
ggplot(data.frame(X = seq(5, 30)), aes(x = X)) +
stat_function(fun=dpois, geom="line", size=2, color="blue3", args = list(lambda = 15)) +
stat_function(fun=dnorm, geom="line", size=2, color="red4", args = list(mean=20, sd=2))
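A likely explanation, sketched in base R: stat_function() evaluates the function on an evenly spaced grid (101 points by default), and dpois() returns 0, with a warning, for any non-integer x. Assuming the x range is 5 to 30 as in the question, three out of every four grid points are non-integers, so the line alternates between the true probabilities and 0. The variable names below are illustrative.

```r
# Default grid: 101 evenly spaced points between 5 and 30 (step 0.25).
x_default <- seq(5, 30, length.out = 101)
# dpois() warns about non-integer x and returns 0 there.
y_default <- suppressWarnings(dpois(x_default, lambda = 15))

# Grid with one point per integer: 26 points between 5 and 30 (step 1).
x_int <- seq(5, 30, length.out = 26)
y_int <- dpois(x_int, lambda = 15)

sum(y_default == 0)  # number of grid points where the "curve" drops to zero
all(y_int > 0)       # no zeros once the grid hits integers only
```

In the plot itself, this would correspond to passing n = 26 to the dpois layer, for example stat_function(fun = dpois, n = 26, args = list(lambda = 15)). Since the Poisson distribution is discrete, geom = "point" may represent it more faithfully than a line.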

Related

Log-scale transformation of histogram and fitting gamma curve

I have a bunch of data, that I've fitted with a gamma distribution. I've got the histogram and the fitting curve just fine, but now I wish to draw a histogram and the curve with x in log scale. Using scale_x_log10 works just fine for the histogram, but I can't make it work for the stat_function/geom_line.
I understand that's because stat_function now takes the log values, but I'm not sure how to transform the gamma function beforehand for it to work properly. Here are the relevant pictures and code snippets:
This is the original graph:
library(fitdistrplus)  # provides fitdist()
library(ggplot2)
fit.gamma2 <- fitdist(myvalues[,1],distr="gamma",method="mme")
ggplot(myvalues, aes(x = V1)) +
geom_histogram(aes(y =..density..),
boundary = 0,
binwidth = sirina,
col="black",
fill="blue",
alpha=.2) +
stat_function(fun=dgamma,
args=list(shape = fit.gamma2$estimate["shape"],
rate = fit.gamma2$estimate["rate"])) +
labs(title="Histogram žarkov + Gama porazdelitev (MM)",
x = "Medprihodni časi (s)",
y = "Gostota")
This is that same graph after using scale_x_log10. The red curve is supposed to be the fitting curve, but it's obviously way off.
ggplot(myvalues, aes(x = V1)) +
geom_histogram(aes(y =..density..),
boundary = 0,
binwidth = sirina_log,
col="black",
fill="blue",
alpha=.2) +
stat_function(fun=dgamma,
args=list(shape = fit.gamma2$estimate["shape"],
rate = fit.gamma2$estimate["rate"])) +
# geom_line(aes(x=V1,y=dgamma(V1,fit.gamma2$estimate["shape"], fit.gamma2$estimate["rate"])), color="red", size = 1) +
scale_x_log10()
I have tried applying the values in 10**x form, but as my original data ranges between 0.1 and 800, some values then escape to Inf.
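You need to transform your PDF based on the derivative of log10. The histogram density is now measured per unit of log10(x), so if Y = log10(X), the density of Y is x * ln(10) * f(x), where f is the original PDF and x = 10^y. First create a function for the transformed PDF: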
dgammalog10 <- function(x, shape, rate) {
return(x*log(10)*dgamma(x, shape, rate))
}
Then you can use fun=dgammalog10 where you had fun=dgamma.
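As a quick sanity check of the transformation, the transformed PDF should integrate to 1 over the log10 axis. A minimal base-R sketch, where shape = 2 and rate = 0.5 are arbitrary illustrative values rather than the fitted estimates:

```r
# Transformed PDF from the answer: gamma density times the Jacobian x * ln(10).
dgammalog10 <- function(x, shape, rate) {
  x * log(10) * dgamma(x, shape, rate)
}

# Density of Y = log10(X), written as a function of y by plugging in x = 10^y.
dens_log10 <- function(y, shape, rate) {
  dgammalog10(10^y, shape, rate)
}

# Integrate over y in [-6, 3], i.e. x from 1e-6 to 1000, which covers
# essentially all of the probability mass for these parameter values.
area <- integrate(dens_log10, lower = -6, upper = 3, shape = 2, rate = 0.5)$value
area
```

A result very close to 1 indicates that the rescaled curve is a proper density on the log10 scale, which is what the histogram's density is computed against.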

General rule of overlaying density plot using ggplot2

I'm a novice with the R programming language. What is the standard/general method for overlaying a density curve on a histogram using ggplot2?
It depends on whether you want an empirical density estimate or a fitted theoretical density. In both cases, you need to put the density on the same scale as the histogram counts, which means accounting for the width of the histogram bins.
For the empirical kernel density estimates:
library(ggplot2)
# dummy data
df <- data.frame(
x = rnorm(1000)
)
binwidth <- 0.1
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
geom_density(aes(y = after_stat(count * binwidth)),
color = "red")
Theoretical density estimates don't live in ggplot2 but in extension packages. Disclaimer: I'm the author of the following package, so I'm biased:
library(ggh4x)
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_theodensity(aes(y = after_stat(count * binwidth)),
color = "red")
Alternatively, if you don't want to bother with setting binwidths you can also scale the histogram to density instead:
ggplot(df, aes(x)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density(color = "red")
Note: after_stat() requires ggplot2 v3.3.0 or later; earlier versions use stat().
You need to make sure to multiply the value of ..count.. in the density plot call by the binwidth used in the histogram call.
You can do it as follows:
set.seed(100)
a = data.frame(z = rnorm(10000))
binwidthVal=0.1
ggplot(a, aes(x=z)) +
geom_histogram(binwidth = binwidthVal) +
geom_density(colour='red', aes(y=binwidthVal * ..count..))
Credit to Brian Diggs for the idea.
EDIT: Seems like there is already a perfectly good answer here
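The count * binwidth factor follows from how a histogram relates to a density: a bin of width w around x is expected to hold roughly n * w * f(x) observations. A minimal base-R sketch checking this on simulated normal data (the seed and variable names are illustrative):

```r
set.seed(100)
z <- rnorm(10000)
binwidth <- 0.1

# Bin the data without plotting; the breaks cover the full data range.
breaks <- seq(floor(min(z)), ceiling(max(z)), by = binwidth)
h <- hist(z, breaks = breaks, plot = FALSE)

# Expected count per bin: sample size * bin width * density at the bin midpoint.
expected <- length(z) * binwidth * dnorm(h$mids)

# Observed and expected counts should track each other closely.
cor(h$counts, expected)
```

This is why a density curve has to be multiplied by the number of observations and the bin width before it lines up with a count histogram, and why no rescaling is needed once the histogram itself is drawn on the density scale.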

Is there a way to plot a density function over a histogram that was plotted using the PlotRelativeFrequency() function in R

I have a vector of sample means and I've been trying to plot a probability histogram using hist(x) and ggplot, but the bins exceed 1 (which is very unusual for a probability distribution). I then used PlotRelativeFrequency(hist(x)) to force R to plot a histogram of probabilities. It worked, but my problem is that I cannot plot a density function over the histogram. When I use lines(density(x)), it plots a density curve that goes way off the graph.
Since your question is tagged with ggplot, I'll give a ggplot answer.
To make histograms relative you have to set aes(y = stat(density)) such that it integrates to 1. Then, you could give the stat_function() the relevant density function for any theoretical distribution. The downside is that you'll have to pre-compute the parameters.
df <- data.frame(x = rnorm(500, 10, 2))
pars <- list(mean = mean(df$x), sd = sd(df$x))
library(ggplot2)
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
stat_function(fun = function(x) {dnorm(x, mean = pars$mean, sd = pars$sd)})
Next up, we can plot the empirical density using kernel density estimates, which does everything pretty much automatically:
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
geom_density()
Lastly, you can have a look at this stat function, which essentially automates the first version. Full disclaimer: I'm the author of that GitHub repo.
library(ggnomics)
ggplot(df, aes(x)) +
geom_histogram(binwidth = 1, aes(y = stat(density))) +
stat_theodensity()

How to make density plot correctly show area near the limits?

When I plot densities with ggplot, it seems to be very wrong around the limits. I see that geom_density and other functions allow specifying various density kernels, but none of them seem to fix the issue.
How do you correctly plot densities around the limits with ggplot?
As an example, let's plot the Chi-square distribution with 2 degrees of freedom. Using the builtin probability densities:
library(ggplot2)
u = seq(0, 2, by=0.01)
v = dchisq(u, df=2)
df = data.frame(x=u, p=v)
p = ggplot(df) +
geom_line(aes(x=x, y=p), size=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 0.5))
show(p)
We get the expected plot:
Now let's try simulating it and plotting the empirical distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_density(aes(x=x)) +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
We get an incorrect plot:
We can try to visualize the actual distribution:
library(ggplot2)  # library() attaches one package per call; only ggplot2 is used here
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_point(aes(x=x, y=0.5), position=position_jitter(height=0.2), shape='.', alpha=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 1))
show(p)
And it seems to look correct, contrary to the density plot:
It seems like the problem has to do with kernels, and geom_density does allow using different kernels. But they don't really correct the limit problem. For example, the code above with triangular looks about the same:
Here's an idea of what I'm expecting to see (of course, I want a density, not a histogram):
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_histogram(aes(x=x), center=0.1, binwidth=0.2, fill='white', color='black') +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
The usual kernel density methods have trouble when the support is constrained, as in this case, where the density is supported only above zero. The usual recommendation for handling this has been to use the logspline package:
install.packages("logspline")
library(logspline)
png(); fit <- logspline(rchisq(10000, 3))
plot(fit) ; dev.off()
If this needed to be done in the ggplot2 environment there is a dlogspline function:
densdf <- data.frame( y=dlogspline(seq(0,12,length=1000), fit),
x=seq(0,12,length=1000))
ggplot(densdf, aes(y=y,x=x))+geom_line()
Perhaps you were insisting on one with 2 degrees of freedom?
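Another common boundary correction, different from the logspline approach above, is reflection: mirror the sample around the bound, estimate the density of the mirrored sample, and double the estimate on the valid side. A rough base-R sketch, assuming a lower bound of 0 and no upper bound (the names refl and dens_df are illustrative):

```r
set.seed(1)
x <- rchisq(10000, df = 2)

# Kernel density of the sample together with its mirror image around 0,
# evaluated only on the valid side of the boundary.
refl <- density(c(x, -x), from = 0, to = 12, n = 512)

# The mirrored sample spreads the mass over both sides of 0,
# so double the estimate to get a density on [0, Inf).
dens_df <- data.frame(x = refl$x, y = 2 * refl$y)

dens_df$y[1]      # estimate at the boundary
dchisq(0, df = 2) # true density at 0, which is 0.5
```

dens_df can then be drawn with ggplot(dens_df, aes(x, y)) + geom_line(). The estimate at 0 still tends to fall somewhat short of 0.5, because reflection implicitly assumes the density is flat at the bound. Recent ggplot2 releases also document a bounds argument for geom_density() that is meant to apply this kind of correction; whether it is available depends on the installed version.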

Linear model diagnostics: smoothing line obtained in ggplot2 is different from the one obtained with base plot

I am trying to reproduce the diagnostics plots for a linear regression model using ggplot2. The smoothing line that I get is different from the one obtained using base plots or ggplot2::autoplot.
library(survival)
library(ggplot2)
model <- lm(wt.loss ~ meal.cal, data=lung)
## Fitted vs. residuals using base plot:
plot(model, which=1)
## Fitted vs. residuals using ggplot
model.frame <- fortify(model)
ggplot(model.frame, aes(.fitted, .resid)) + geom_point() + geom_smooth(method="loess", se=FALSE)
The smoothing line is different: the influence of the first few points is much larger with the loess method provided by ggplot. My question is: how can I reproduce the smoothing line obtained with plot() using ggplot2?
You can calculate the lowess curve, which is what draws the red line in the original diagnostic plot, using the base function of the same name, lowess().
smoothed <- as.data.frame(with(model.frame, lowess(x = .fitted, y = .resid)))
ggplot(model.frame, aes(.fitted, .resid)) +
theme_bw() +
geom_point(shape = 1, size = 2) +
geom_hline(yintercept = 0, linetype = "dotted", col = "grey") +
geom_path(data = smoothed, aes(x = x, y = y), col = "red")
And the original:
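As background on why the two lines differ: lowess() and loess() are separate smoothers with different defaults. lowess() fits local lines with f = 2/3 and 3 robustness iterations, while loess(), which geom_smooth(method = "loess") calls, fits local quadratics with span = 0.75 and no robustness iterations by default. A small base-R sketch comparing them, using the built-in cars data as a stand-in for the model residuals:

```r
# Older smoother used by the base diagnostic plot (defaults: f = 2/3, iter = 3).
fit_lowess <- lowess(cars$speed, cars$dist)

# Smoother behind geom_smooth(method = "loess") (defaults: span = 0.75, degree = 2).
fit_loess <- loess(dist ~ speed, data = cars)

# Evaluate both at the same x values and compare the fitted curves.
cmp <- data.frame(
  speed  = fit_lowess$x,
  lowess = fit_lowess$y,
  loess  = predict(fit_loess, newdata = data.frame(speed = fit_lowess$x))
)
max(abs(cmp$lowess - cmp$loess))  # largest gap between the two curves
```

Because lowess() downweights outlying points through its robustness iterations, extreme observations at the edges tend to pull its curve less than they pull the default loess() fit.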
