Fitting the Poisson distribution in R

I was unable to calculate the maximum likelihood estimator and BIC for the Poisson distribution. I was able to get the histogram but couldn't superimpose a kernel density estimate on it.
Can you please tell me where I went wrong?
x.pois<-rpois(Y1, 20)
hist(x.pois, breaks=100,freq=FALSE)
lines(density(Y1, bw=0.8), col="red")
library(MASS)
fitdistr(Y1,densfun="pois")
my.mle<-fitdistr(Y1, densfun="poison")
print(my.mle)
BIC(my.mle)

You need to (1) spell "poisson" correctly; (2) use x.pois (the Poisson sample), not Y1 (which should be the number of points you're trying to sample, based on your code example).
Note that kernel density estimates, and histograms, of discrete distributions don't necessarily make a lot of sense.
Y1 <- 100
set.seed(101) ## for reproducibility
x.pois<-rpois(Y1, 20)
hist(x.pois, breaks=100,freq=FALSE)
lines(density(x.pois, bw=0.8), col="red")
library(MASS)
(my.mle<-fitdistr(x.pois, densfun="poisson"))
## lambda
## 20.6700000
## ( 0.4546427)
BIC(my.mle)
## [1] 572.7861
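For a graphical check that respects the discreteness, one rough sketch (reusing x.pois and the fitted lambda in my.mle from above; the unit-width bins are my own choice so the bar heights are comparable to a probability mass function) is to overlay the fitted Poisson PMF instead of a kernel density estimate:
## histogram with unit-width bins so bar heights approximate probabilities
hist(x.pois, breaks=seq(min(x.pois)-0.5, max(x.pois)+0.5, by=1), freq=FALSE)
## overlay the fitted Poisson PMF at the estimated lambda
points(min(x.pois):max(x.pois),
       dpois(min(x.pois):max(x.pois), lambda=my.mle$estimate["lambda"]),
       col="red", pch=16)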
Update: your other question makes it clear that Y1 really is your sample, in which case the whole rpois() sampling step is just a red herring. In that case you should just leave out the first three lines and substitute Y1 for x.pois in the code above.

Related

Gamma Likelihood in R

I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming alpha=2 is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100,shape=2,rate=3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  posterior <- (((beta_1^alpha)^n)/gamma(alpha)^n) *
    prod(dat^(alpha-1)) * exp(-beta_1*sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from=2, to=6)
A tricky question, and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but then realised that one has to be more careful with the posterior and the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so these two parts determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n+1 and rate parameter sum(dat). The mode of a gamma distribution is (shape - 1)/rate, and sum(dat) is about 66 for this seed, so we get a mode of 300/66 (about 4.55). This coincides perfectly with the "posterior plot" produced by your code (again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1).
I hope LifeisBetter now =).
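As a quick numerical check (an illustrative sketch, reusing the data-generating code from the question), the flat-prior posterior can be drawn directly with dgamma() and its mode marked:
set.seed(101)
n <- 100
dat <- rgamma(n, shape=2, rate=3)
alpha <- 3
## flat-prior posterior for beta_1 is Gamma(shape = 3*n + 1, rate = sum(dat))
curve(dgamma(x, shape=alpha*n + 1, rate=sum(dat)), from=2, to=6,
      xlab="beta_1", ylab="posterior density")
abline(v=alpha*n/sum(dat), lty=2)  ## mode = (shape - 1)/rate, about 4.55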
But a graph of my posterior for different values of the rate parameter
centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution that includes the rate parameter from the prior distribution of beta (betaPrior = 3):
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the mean of beta is at ~4.39 rather than ~4.55 because of the informative prior that had a mean of 1.
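For completeness, a sketch of the deferred derivation (assuming, as in the question, a Gamma(alphaPrior = 3, betaPrior = 3) prior on beta and known shape alpha):
p(\beta \mid x) \propto \left[\prod_{i=1}^{n} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x_i^{\alpha-1} e^{-\beta x_i}\right] \beta^{\alpha_0-1} e^{-\beta_0 \beta} \propto \beta^{\,n\alpha + \alpha_0 - 1}\, e^{-\beta\left(\sum_i x_i + \beta_0\right)},
so the posterior is Gamma(shape = n*alpha + alphaPrior, rate = sum(dat) + betaPrior); with alphaPrior = alpha = 3 the shape is alpha*(n + 1) = 303, which is what the dgamma() call above uses.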

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean is the mean of your scores; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log=TRUE))
}
m0 <- mle2(mtmp, start=list(theta=100),
           data=list(size=10, prob=mean(x1)/10), control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually OK with using a normal approximation, and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
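If you also want the 95% interval mentioned in the question, one rough sketch based on the normal fit above (not the beta-binomial) is to take normal quantiles on the doubled 0-10 scale and divide by 2 to get back to the original 0-5 body condition scale:
## central 95% interval on the 0-10 scale, converted back to the 0-5 scale
qnorm(c(0.025, 0.975), mean=MEAN, sd=SD) / 2
## roughly 2.2 and 4.4 with the estimates above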

Coefficients of my polynomial model in R don't match graph

Using Greg's helpful answer here, I fit a second order polynomial regression line to my dataset:
poly.fit<-lm(y~poly(x,2),df)
When I plot the line, I get the graph below:
The coefficients are:
# Coefficients:
# (Intercept) poly(x, 2)1 poly(x, 2)2
# 727.1 362.4 -269.0
I then wanted to find the x-value of the peak. I assume there is an easy way to do so in R but I did not know it,* so I went to Wolfram Alpha. I entered the equation:
y=727.1+362.4x-269x^2
Wolfram Alpha returned the following:
As you can see, the function intersects the x-axis at approximately x=2.4. This is obviously different from my plot in R, which ranges from 0≤x≤80. Why are these different? Does R interpret my x-values as a fraction of some backroom variable?
*I would also appreciate answers on how to find this peak. Obviously I could take the derivative, but how do I set it to zero?
Use predict. The coefficients you printed are for the orthogonal polynomial basis that poly() constructs by default, not for raw powers of x, which is why typing them into Wolfram Alpha gives a completely different curve.
plot(40:90, predict(poly.fit, list(x=40:90)))
In the case of a quadratic polynomial, you can of course use a little calculus and algebra (once you have friendly coefficients).
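For example, a rough sketch of that route: refitting with raw polynomial terms (poly(x, 2, raw=TRUE), which is not in the original code) gives coefficients on the original x scale, and the peak of b0 + b1*x + b2*x^2 is where the derivative is zero, at x = -b1/(2*b2):
raw.fit <- lm(y ~ poly(x, 2, raw=TRUE), df)
b <- coef(raw.fit)
## set the derivative b[2] + 2*b[3]*x to zero and solve for x
-b[2] / (2 * b[3])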
Somewhat more generally, you can get an estimate by evaluating your model over a range of candidate values and determining which one gives you the maximum response value.
Here is a function (only moderately robust) which will work here.
xmax <- function(fit, startx, endx, x='x', within=NA){
  ## find approximate value of variable x where model
  ## specified by fit takes maximum value, inside interval
  ## [startx, endx]; precision specified by within
  within <- ifelse(is.na(within), (endx - startx)/100, within)
  testx <- seq(startx, endx, by=within)
  testlist <- list(testx)
  names(testlist)[1] <- x
  testy <- predict(fit, testlist)
  testx[which.max(testy)]
}
Note that if your predictor variable is called something other than x, you have to pass its name as a string in the x parameter.
So to find the x value where your curve has its peak:
xmax(poly.fit, 50, 80, within=0.1)
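A related sketch, as an alternative to the grid search above, is to let base R's optimize() search for the peak of the predicted curve (the interval c(50, 80) is just the same candidate range as before):
## maximise the predicted response over the interval
f <- function(x) predict(poly.fit, list(x=x))
optimize(f, interval=c(50, 80), maximum=TRUE)$maximum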

Chi-square distribution in R

Trying to fit a chi-squared distribution using fitdistr() in R. Documentation on this is here (and not very useful to me): https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
Question 1: chi_df below has the following output: 3.85546875 (0.07695236). What is the second number, the variance or the standard deviation?
Question 2: fitdistr generates 'k' defined by the chi-squared distribution. How do I fit the data so I get the scaling constant 'A'? I am dumbly using lines 14-17 below. Obviously not good.
Question 3: Is the chi-squared distribution only defined for a certain x-range? (Variance is defined as 2k, while the mean = k. This must require some constrained x-range... Stats question, not programming...)
nnn = 1000;
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0);
## Plotting Histogram
chi_hist <- hist(chii);
## Fitting. Gives probability density which must be scaled.
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3));
chi_k <- chi_df[[1]][1];
## Plotting a fitted line:
## Spanning x-length of chi-sq data
x_chi_fit <- 1:nnn*((max(chi_hist[[1]][])-min(chi_hist[[1]][]))/nnn);
## Y data using eqn for probability function
y_chi_fit <- (1/(2^(chi_k/2)*gamma(chi_k/2)) * x_chi_fit^(chi_k/2-1) * exp(-x_chi_fit/2));
## Normalizing to the peak of the histogram
y_chi_fit <- y_chi_fit*(max(chi_hist[[2]][]/max(y_chi_fit)));
## Plotting the line
lines(x_chi_fit,y_chi_fit,lwd=2,col="green");
Thanks for your help!
As commented above, ?fitdistr says
An object of class ‘"fitdistr"’, a list with four components,
...
sd: the estimated standard errors,
... so that parenthesized number is the standard error of the parameter.
The scale parameter doesn't need to be estimated; you need either to scale the fitted density by the number of observations times the width of your histogram bins, or just use freq=FALSE when drawing your histogram. See code below.
The chi-squared distribution is defined on the non-negative reals, which makes sense since it's the distribution of a squared standard Normal (this is a statistical, not a programming question).
Set up data:
nnn <- 1000
## ensure reproducibility; not a big deal in this case,
## but good practice
set.seed(101)
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0)
Fitting.
library(MASS)
## use method="Brent" based on warning
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3),
method="Brent",lower=0.1,upper=100)
chi_k <- chi_df[[1]][1]
(For what it's worth, it looks like there might be a bug in the print method for fitdistr when method="Brent" is used. You could also use method="BFGS" and wouldn't need to specify bounds ...)
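As a quick sanity check (an extra sketch, not needed for the fit): the mean of a chi-squared distribution equals its degrees of freedom, so the fitted value should be close to the sample mean:
mean(chii)  ## sample mean; the true df here is 4
chi_k       ## fitted df, should be similar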
Histograms
chi_hist <- hist(chii,breaks=50,col="gray")
## scale by N and width of histogram bins
curve(dchisq(x,df=chi_k)*nnn*diff(chi_hist$breaks)[1],
add=TRUE,col="green")
## or plot histogram already scaled to a density
chi_hist <- hist(chii,breaks=50,col="gray",freq=FALSE)
curve(dchisq(x,df=chi_k),add=TRUE,col="green")

How to get confidence interval for smooth.spline?

I have used smooth.spline to estimate a cubic spline for my data. But when I calculate the 90% point-wise confidence interval using the equation, the results seem to be a little bit off. Can someone please tell me if I did it wrong? I am also wondering if there is a function that can automatically calculate a point-wise interval band associated with the smooth.spline function.
boneMaleSmooth = smooth.spline( bone[males,"age"], bone[males,"spnbmd"], cv=FALSE)
error90_male = qnorm(.95)*sd(boneMaleSmooth$x)/sqrt(length(boneMaleSmooth$x))
plot(boneMaleSmooth, ylim=c(-0.5,0.5), col="blue", lwd=3, type="l", xlab="Age",
ylab="Relative Change in Spinal BMD")
points(bone[males,c(2,4)], col="blue", pch=20)
lines(boneMaleSmooth$x,boneMaleSmooth$y+error90_male, col="purple",lty=3,lwd=3)
lines(boneMaleSmooth$x,boneMaleSmooth$y-error90_male, col="purple",lty=3,lwd=3)
Because I am not sure if I did it correctly, I then used the gam() function from the mgcv package.
It instantly gave a confidence band, but I am not sure if it is a 90% or 95% CI or something else. It would be great if someone could explain.
males=gam(bone[males,c(2,4)]$spnbmd ~s(bone[males,c(2,4)]$age), method = "GCV.Cp")
plot(males,xlab="Age",ylab="Relative Change in Spinal BMD")
I'm not sure smooth.spline has "nice" confidence intervals like those from lowess do. But I found a code sample from a CMU Data Analysis course to make Bayesian bootstrap confidence intervals.
Here are the functions used and an example. The main function is spline.cis, where the first parameter is a data frame whose first column holds the x values and whose second column holds the y values. The other important parameter is B, which indicates the number of bootstrap replications to do. (See the linked PDF above for the full details.)
# Helper functions
resampler <- function(data) {
  n <- nrow(data)
  resample.rows <- sample(1:n, size=n, replace=TRUE)
  return(data[resample.rows,])
}
spline.estimator <- function(data, m=300) {
  fit <- smooth.spline(x=data[,1], y=data[,2], cv=TRUE)
  eval.grid <- seq(from=min(data[,1]), to=max(data[,1]), length.out=m)
  return(predict(fit, x=eval.grid)$y) # We only want the predicted values
}
spline.cis <- function(data, B, alpha=0.05, m=300) {
  spline.main <- spline.estimator(data, m=m)
  spline.boots <- replicate(B, spline.estimator(resampler(data), m=m))
  cis.lower <- 2*spline.main - apply(spline.boots, 1, quantile, probs=1-alpha/2)
  cis.upper <- 2*spline.main - apply(spline.boots, 1, quantile, probs=alpha/2)
  return(list(main.curve=spline.main, lower.ci=cis.lower, upper.ci=cis.upper,
              x=seq(from=min(data[,1]), to=max(data[,1]), length.out=m)))
}
#sample data
data<-data.frame(x=rnorm(100), y=rnorm(100))
#run and plot
sp.cis <- spline.cis(data, B=1000,alpha=0.05)
plot(data[,1],data[,2])
lines(x=sp.cis$x,y=sp.cis$main.curve)
lines(x=sp.cis$x,y=sp.cis$lower.ci, lty=2)
lines(x=sp.cis$x,y=sp.cis$upper.ci, lty=2)
And that gives something like
Actually it looks like there might be a more parametric way to calculate confidence intervals using the jackknife residuals. This code comes from the S+ help page for smooth.spline
fit <- smooth.spline(data$x, data$y) # smooth.spline fit
res <- (fit$yin - fit$y)/(1-fit$lev) # jackknife residuals
sigma <- sqrt(var(res)) # estimate sd
upper <- fit$y + 2.0*sigma*sqrt(fit$lev) # upper 95% conf. band
lower <- fit$y - 2.0*sigma*sqrt(fit$lev) # lower 95% conf. band
matplot(fit$x, cbind(upper, fit$y, lower), type="plp", pch=".")
And that results in
And as far as the gam confidence intervals go, if you read the plot.gam help file, there is an se= parameter with default TRUE and the docs say
when TRUE (default) upper and lower lines are added to the 1-d plots at 2 standard errors above and below the estimate of the smooth being plotted while for 2-d plots, surfaces at +1 and -1 standard errors are contoured and overlayed on the contour plot for the estimate. If a positive number is supplied then this number is multiplied by the standard errors when calculating standard error curves or surfaces. See also shade, below.
So you can adjust the width of the band by adjusting this parameter. (This would go in the plot() call.)
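For instance (a sketch using the gam fit called males from the question): the default band is about +/- 2 standard errors, i.e. roughly a 95% pointwise interval, so passing se = qnorm(0.95) (about 1.645) gives an approximately 90% band:
## approximately 90% pointwise band instead of the default ~95% (2 SE)
plot(males, se=qnorm(0.95), xlab="Age", ylab="Relative Change in Spinal BMD")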
The R package mgcv calculates smoothing splines and Bayesian "confidence intervals." These are not confidence intervals in the usual (frequentist) sense, but numerical simulations have shown that there is almost no difference; see the linked paper by Marra and Wood in the help file of mgcv.
library(SemiPar)
data(lidar)
require(mgcv)
fit=gam(range~s(logratio), data = lidar)
plot(fit)
with(lidar, points(logratio, range-mean(range)))
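If you want the band as numbers rather than just a plot, a rough sketch using predict() with se.fit=TRUE (pointwise standard errors for the fitted values, turned into an approximate 95% band with the same 2-SE convention as plot.gam):
pr <- predict(fit, se.fit=TRUE)
upper <- pr$fit + 2*pr$se.fit  ## approximate 95% pointwise band
lower <- pr$fit - 2*pr$se.fit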
