Maximum Likelihood Estimator for a Gamma density in R

I just simulated 100 observations from a gamma density with alpha (shape parameter) = 5 and lambda (rate parameter) = 5:
x <- rgamma(100, shape = 5, rate = 5)
Now I want to find the maximum likelihood estimates of alpha and lambda, using a function that takes these observations and returns both parameters.
Any hints would be appreciated. Thank you.

You can use fitdistr(...) from the MASS package for this.
set.seed(1) # for reproducible example
x <- rgamma(100,shape=5,rate=5)
library(MASS)
fitdistr(x, "gamma", start=list(shape=1, rate=1))$estimate
# shape rate
# 6.603328 6.697338
Notice that with a small sample like this you don't get great estimates.
x <- rgamma(10000,shape=5,rate=5)
library(MASS) # already attached above; repeated so this snippet is self-contained
fitdistr(x, "gamma", start=list(shape=1, rate=1))$estimate
# shape rate
# 4.984220 4.971021
fitdistr(...) also returns the standard error of the estimates and the log-likelihood.
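For example, to extract those, keep the fitted object (these are documented components of the value fitdistr returns):
fit <- fitdistr(x, "gamma", start=list(shape=1, rate=1))
fit$sd     # standard errors of the estimates
fit$loglik # maximized log-likelihood
fit$vcov   # estimated variance-covariance matrix of the estimates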

Related

Parameters and AUC and IC50 of a dose response curve

I have these dose response data:
df <- data.frame(viability=c(14,81,58,78,71,83,64,16,32,100,100,81,86,83,100,90,15,100,38,100,91,84,92,100),
dose=c(10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061),
stringsAsFactors=F)
I then use the drc package's drm function to fit a log-logistic curve to these data:
library(drc)
fit <- drm(viability~dose,data=df,fct=LL.4(names=c("slope","low","high","ED50")),type="continuous")
> summary(fit)
Model fitted: Log-logistic (ED50 as parameter) (4 parms)
Parameter estimates:
Estimate Std. Error t-value p-value
slope:(Intercept) 5.15328 18.07742 0.28507 0.7785
low:(Intercept) 20.19430 12.61122 1.60130 0.1250
high:(Intercept) 83.33181 4.96736 16.77586 0.0000
ED50:(Intercept) 2.98733 1.99685 1.49602 0.1503
Residual standard error:
21.0743 (20 degrees of freedom)
I then generate predictions so I'll be able to plot the curve:
pred.df <- expand.grid(dose=exp(seq(log(max(df$dose)),log(min(df$dose)),length=100)))
pred <- predict(fit,newdata=pred.df,interval="confidence")
# clamp the fitted curve and its confidence bounds to the [0, 100] viability range
pred.df$viability <- pmin(pmax(pred[,1], 0), 100)
pred.df$viability.low <- pmin(pmax(pred[,2], 0), 100)
pred.df$viability.high <- pmin(pmax(pred[,3], 0), 100)
I also use the PharmacoGx Bioconductor package to compute AUC and IC50 for both the curve and its high and low bounds:
library(PharmacoGx)
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))/((max(pred.df$viability)-min(pred.df$viability))*(max(pred.df$dose)-min(pred.df$dose)))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))/((max(pred.df$viability.low)-min(pred.df$viability.low))*(max(pred.df$dose)-min(pred.df$dose)))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))/((max(pred.df$viability.high)-min(pred.df$viability.high))*(max(pred.df$dose)-min(pred.df$dose)))
ic50.mid <- computeIC50(rev(pred.df$dose),rev(pred.df$viability))
ic50.low <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.low))
ic50.high <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.high))
Creating a table with all the parameters so I can plot everything together:
ann.df <- data.frame(param = c("slope","low","high","ED50","auc.mid","auc.high","auc.low","ic50.mid","ic50.high","ic50.low"),
                     value = signif(c(summary(fit)$coefficient[,1], auc.mid, auc.high, auc.low, ic50.mid, ic50.high, ic50.low), 2),
                     stringsAsFactors = F)
And finally plotting it all:
library(ggplot2)
library(grid)
library(gridExtra)
library(cowplot) # ggdraw() and draw_grob() come from cowplot
pl <- ggplot(df,aes(x=dose,y=viability))+geom_point()+geom_ribbon(data=pred.df,aes(x=dose,y=viability,ymin=viability.low,ymax=viability.high),alpha=0.2)+labs(y="viability")+
geom_line(data=pred.df,aes(x=dose,y=viability))+coord_trans(x="log")+theme_bw()+scale_x_continuous(name="dose",breaks=sort(unique(df$dose)),labels=format(signif(sort(unique(df$dose)),3),scientific=T))
ggdraw(pl)+draw_grob(tableGrob(ann.df,rows=NULL),x=0.1,y=0.175,width=0.3,height=0.4)
Which gives the plot below (image not reproduced here):
My questions are:
I thought that slope should be negative. How come it's 5.2?
The auc.mid, auc.low, and auc.high values, computed as:
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))
give 21.47818, 37.52389, and 2.678228, respectively.
Since these are not in the [0,1] range, I thought that dividing them by the area under the highest corresponding viability would give what I'm looking for, i.e., relative AUC, but these values seem too low relative to what the figure shows. What are these AUCs, then?
Also, how come auc.low > auc.mid > auc.high? I would think that it should be auc.high > auc.mid > auc.low.
The IC50 values also seem a little low. Do they make sense?
Bonus question: how do I avoid the trailing zeros in slope, low, high, ED50, ic50.mid, and ic50.high in the figure?
The parameter you are pulling out is the Hill slope parameter, i.e. the coefficient in front of the (log) concentration variable inside the exponential, not the actual slope of the curve. In drc's LL.4 parameterization, f(x) = c + (d - c)/(1 + exp(b*(log(x) - log(e)))), so a positive b actually corresponds to a decreasing curve.
The AUC provided is on the [0, 100] scale and is the area above the viability curve; traditionally the area under the response curve, i.e. 1 - viability, is what was reported. I ran the code and got the order auc.low > auc.mid > auc.high, which is consistent with that.
It is important to note that the PharmacoGx package uses a 3-parameter Hill slope model, similar to LL.3 in drc. Therefore, the plot will not correspond to the function PharmacoGx fits to calculate the IC50 or AUC.
Source: PharmacoGx dev.
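To see how different a 3-parameter fit is, you could refit with drc's LL.3, which fixes the lower asymptote at 0 (a sketch using the df from the question; the parameter names are just labels):
library(drc)
fit3 <- drm(viability ~ dose, data = df, fct = LL.3(names = c("slope", "high", "ED50")), type = "continuous")
summary(fit3) # compare against the LL.4 estimates above
As for the bonus question about trailing zeros: base R's format() has a drop0trailing argument that strips them, e.g.
ann.df$value <- format(ann.df$value, drop0trailing = TRUE) # prints 3 rather than 3.0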

Generating confidence intervals for predicted probabilities after running mlogit function in R

I have been struggling with the following problem for some time and would be very grateful for any help. I am running a logit model in R using the mlogit function and am able to generate the predicted probability of choosing each alternative for a given value of the predictors as follows:
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
Fish_fit<-Fish[-(1:4),]
Fish_test<-Fish[1:4,]
m <- mlogit(mode ~price+ catch | income, data = Fish_fit)
predict(m, newdata = Fish_test)
I cannot, however, work out how to add confidence intervals to the predicted probability estimates. I have already tried adding arguments to the predict function, but none seem to generate them. Any ideas on how it can be achieved would be much appreciated.
One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.
For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.
library(MASS) # for mvrnorm()
est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)
# 1000 draws from the estimated sampling distribution of the coefficients
sim_betas <- mvrnorm(1000, est_betas, vcov(m))
# recompute the predicted probabilities under each simulated coefficient vector
sim_preds <- apply(sim_betas, 1, function(x) {
  m$coefficients <- x
  predict(m, newdata = Fish_test)
})
# the empirical 2.5% and 97.5% quantiles give the confidence intervals
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))
cbind(prob = est_preds, t(sim_ci))
# prob 2.5% 97.5%
# beach 0.1414336 0.10403634 0.1920795
# boat 0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier 0.1352363 0.09858375 0.1823240

How to determine the initial points of the maximum likelihood method

I'm currently working on distribution fitting. I used the fitdistr function, but I am having problems determining the initial points for the MLE. For example, I want to fit my data (rainfall, a 13149-by-1 matrix) with a gamma distribution.
fit.gamma = fitdistr(rainfall,dgamma,start=list(shape = ?, scale = ?),method="Nelder-Mead")
The fitdistrplus package is very good for this. It will guess gamma starting values for you if you don't have any. Also, you can use the method of moments if your guesses fail; see the sketch after the output below.
x <- rgamma(100, 0.5, 0.5)
library(fitdistrplus)
(pars <- fitdist(x, "gamma"))
# Fitting of the distribution ' gamma ' by maximum likelihood
# Parameters:
# estimate Std. Error
# shape 0.4443304 0.05131369
# rate 0.5622472 0.10644511
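As a sketch of those fallbacks (method = "mme" is fitdist's method-of-moments option; the hand-computed starting values use the standard gamma moment estimators shape = mean^2/var and rate = mean/var):
# method-of-moments fit instead of MLE
fitdist(x, "gamma", method = "mme")
# or moment-based starting values for MASS::fitdistr
library(MASS)
mn <- mean(x); v <- var(x)
fitdistr(x, dgamma, start = list(shape = mn^2/v, rate = mn/v), method = "Nelder-Mead")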

Chi square goodness of fit for random numbers generated

I have used Inverse CDF method to generate 1000 samples from an exponential and a Cauchy random variable.
Now, to verify that these samples come from their respective distributions, I have to perform a chi-squared goodness-of-fit test.
I have tried two approaches (as below):
chisq.test(y) # y holds the 1000 samples from the supposed exponential distribution
chisq.test(z) # z holds the Cauchy samples
I am getting the following output and errors:
data: y
X-squared = 234.0518, df = 999, p-value = 1
Warning message:
In chisq.test(y) : Chi-squared approximation may be incorrect
chisq.test(z)
Error in chisq.test(z) :
all entries of 'x' must be nonnegative and finite
I downloaded the vcd library to use goodfit(), and typed:
t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
summary(t1)
In this case, the error message:
Error: could not find function "goodfit"
Can somebody please give some guidance on how to implement the chi-squared GOF test properly?
Note: the samples are not from a normal distribution (they are exponential and Cauchy, respectively).
I am also trying to work out whether it is possible to get the observed and expected counts instead, with no luck so far.
Edit: I did type library(vcd) before running the rest of the code. Apologies for assuming that was obvious.
The chisq.test(...) function is designed primarily for use with counts, so it expects its arguments to be either countable (using table(...) for example), or to be counts already. It basically creates a contingency table for x and y (the first two arguments) and then uses the chisq test to determine if they are from the same distribution.
You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
set.seed(1)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100))
ks.test(df$y,"pexp")
# One-sample Kolmogorov-Smirnov test
#
# data: df$y
# D = 0.0387, p-value = 0.1001
# alternative hypothesis: two-sided
ks.test(df$z,"pcauchy",100,100)
# One-sample Kolmogorov-Smirnov test
#
# data: df$z
# D = 0.0296, p-value = 0.3455
# alternative hypothesis: two-sided
Note that even here the K-S test is equivocal: the p-value for df$y is only 0.10, uncomfortably close to rejecting the (true) hypothesis that the sample came from an exponential distribution, even though it clearly did.
You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.
breaks <- seq(0, 10, by = 1)           # bin edges; values of df$y above 10 fall outside and are dropped
O <- table(cut(df$y, breaks = breaks)) # observed counts per bin
p <- diff(pexp(breaks))                # expected bin probabilities under Exp(1)
chisq.test(O, p = p, rescale.p = TRUE)
# Chi-squared test for given probabilities
#
# data: O
# X-squared = 7.9911, df = 9, p-value = 0.535
In this case the chi-squared test gives a p-value of 0.535, so it finds no evidence against the hypothesis that the sample came from an exponential distribution.
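To see that bin-dependence concretely, here is a minimal sketch that just halves the bin width; the same data and the same null give a different p-value:
breaks2 <- seq(0, 10, by = 0.5)
O2 <- table(cut(df$y, breaks = breaks2))
chisq.test(O2, p = diff(pexp(breaks2)), rescale.p = TRUE)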
Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.
par(mfrow=c(1,2))
plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
xlim=c(0,5),ylim=c(0,5))
plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
xlim=c(-1000,1000),ylim=c(-1000,1000))
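To make the y = x comparison explicit, you can overlay a reference line on the current panel right after each plot() call:
abline(0, 1, col = "red", lty = 2) # the y = x reference line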
Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
# Simulation
set.seed(123)
df <- data.frame(y = rexp(1000),
                 z = rcauchy(1000, 100, 100))
# This result differs from yours, probably because of how you simulated the data:
chisq.test(df$y)
# Chi-squared test for given probabilities
#
# data: df$y
# X-squared = 978.485, df = 999, p-value = 0.6726
#
# Warning message:
# In chisq.test(df$y) : Chi-squared approximation may be incorrect
3 details:
1) You need to load the package: library(vcd)
2) There is no "exponential" type of distribution in the goodfit function.
3) The method is MinChisq, not MinChiSq.
library(vcd)
t1 <- goodfit(df$y, type= "binomial", method= "MinChisq")
summary(t1)
# Goodness-of-fit test for binomial distribution
#
# X^2 df P(> X^2)
# Pearson 31.00952 6 2.524337e-05
# Warning message:
# In summary.goodfit(t1) : Chi-squared approximation may be incorrect

Bootstrapping CI for a Logistic Regression Model

I have a logistic regression model that I am using to predict size at maturity for king crab, but I am having trouble setting up the code for bootstrapping using the boot package. This is what I have:
#FEMALE GKC SAM#
LowerChatham<-read.table(file=file.choose(),header=TRUE)
#LOGISTIC REGRESSION FIT#
glm.out<-glm(Mature~CL,family=binomial(link=logit),data=LowerChatham)
plot(Mature~CL,data=LowerChatham)
lines(LowerChatham$CL,glm.out$fitted,col="red")
title(main="Lower Chatham")
summary(glm.out)
segments(98.9,0,98.9,0.5,col=1,lty=3,lwd=3)
SAM<-data.frame(CL=98.97)
predict(glm.out,SAM,type="response")
I would like to bootstrap the statistic CL = 98.97, since I am interested in the size at which 50% of crab are mature, but I have no idea how to set up my function to compute that statistic, let alone how to set up the bootstrap in general to get my 95% C.I. Any help would be greatly appreciated! Thanks!
In each bootstrap iteration, you want to do something like
range <- 1:100 # this could be any substantively meaningful range
p <- predict(glm.out, newdata = data.frame(CL=range), "response")
range[match(TRUE,p>.5)] # first CL at which predicted maturity exceeds 50%
where you specify a range of values of CL to whatever precision you need, calculate the predicted probability of maturity at each of those values, and then find the value in the range where the predicted probability crosses 0.5. This is the statistic it sounds like you want to bootstrap.
You also don't need the boot package to do this. If you define a function that resamples the data and returns that statistic, you can just call replicate(1000, myfun()) to get your bootstrap distribution, as follows:
myfun <- function(){
  srows <- sample(1:nrow(LowerChatham), nrow(LowerChatham), TRUE) # resample rows with replacement
  glm.out <- glm(Mature ~ CL, family = binomial(link = logit), data = LowerChatham[srows,])
  range <- 1:100 # this could be any substantively meaningful range
  p <- predict(glm.out, newdata = data.frame(CL = range), "response")
  return(range[match(TRUE, p > .5)]) # first CL at which predicted maturity exceeds 50%
}
bootdist <- replicate(1000, myfun()) # your bootstrap distribution
quantile(bootdist, c(.025, .975)) # 95% CI
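If you do want to use the boot package, as in the original question, the same statistic just needs the (data, indices) signature that boot expects. A sketch (the helper name sam50 is illustrative):
library(boot)
sam50 <- function(data, i) { # i holds the resampled row indices supplied by boot
  fit <- glm(Mature ~ CL, family = binomial(link = logit), data = data[i, ])
  range <- 1:100
  p <- predict(fit, newdata = data.frame(CL = range), type = "response")
  range[match(TRUE, p > .5)] # first CL at which predicted maturity exceeds 50%
}
b <- boot(LowerChatham, sam50, R = 1000)
boot.ci(b, type = "perc") # percentile 95% CI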
