Error in fitdist with gamma distribution - r

Below is my code:
library(fitdistrplus)
s <- c(11, 4, 2, 9, 3, 1, 2, 2, 3, 2, 2, 5, 8, 3, 15, 3, 9, 22, 0, 4, 10, 1, 9, 10, 11,
       2, 8, 2, 6, 0, 15, 0, 2, 11, 0, 6, 3, 5, 0, 7, 6, 0, 7, 1, 0, 6, 4, 1, 3, 5,
       2, 6, 0, 10, 6, 4, 1, 17, 0, 1, 0, 6, 6, 1, 5, 4, 8, 0, 1, 1, 5, 15, 14, 8, 1,
       3, 2, 9, 4, 4, 1, 2, 18, 0, 0, 10, 5, 0, 5, 0, 1, 2, 0, 5, 1, 1, 2, 3, 7)
o <- fitdist(s, "gamma", method = "mle")
summary(o)
plot(o)
and the error says:
Error in fitdist(s, "gamma", method = "mle") : the function mle
failed to estimate the parameters,
with the error code 100

The Gamma distribution doesn't allow zero values (the likelihood will evaluate to zero, and the log-likelihood will be infinite, for a response of 0) unless the shape parameter is exactly 1.0 (i.e., an exponential distribution - see below). That's a statistical/mathematical problem, not a programming problem. You're going to have to find something sensible to do about the zero values. Depending on what makes sense for your application, you could (for example):
choose a different distribution to test (e.g. pick a censoring point and fit a censored Gamma, or fit a zero-inflated Gamma distribution, or ...) - see the sketch after this list
exclude the zero values (fitdist(s[s>0], ...))
set the zero values to some sensible non-zero value (fitdist(replace(s, which(s==0), 0.1), ...))
Which (if any) of these is best depends on your application.
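To see the zero problem concretely, check the Gamma density at 0 for a few shape values (a small demo of my own, not part of the original answer):
dgamma(0, shape = 2,   scale = 3)  ## 0   -> log-density is -Inf
dgamma(0, shape = 1,   scale = 3)  ## 1/3 -> finite (the exponential case)
dgamma(0, shape = 0.5, scale = 3)  ## Inf -> degenerate
And here is a minimal sketch of the censoring option, assuming fitdistcens() from fitdistrplus and an arbitrary detection limit of 0.5 below which the zeros are treated as left-censored (both the limit and the approach are illustrative assumptions, not part of the original answer):
library(fitdistrplus)
## each row describes an interval; left = NA flags a left-censored observation
cens <- data.frame(left  = ifelse(s == 0, NA, s),
                   right = ifelse(s == 0, 0.5, s))
oc <- fitdistcens(cens, "gamma")  ## may need explicit start values
summary(oc)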
@Sandipan Dey's first answer (leaving the zeros in the data set) appears to make sense, but in fact it gets stuck at a shape parameter equal to 1.
o <- fitdist(s, "exp", method = "mle")
gives the same answer as @Sandipan's code (except that it estimates rate=0.2161572, the inverse of the scale parameter 4.626262 that's estimated for the Gamma distribution - this is just a change of parameterization). If you choose to fit an exponential instead of a Gamma, that's fine - but you should do it on purpose, not by accident ...
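A quick check of the parameterization point (my addition, using the estimates quoted above):
o_exp <- fitdist(s, "exp", method = "mle")
1 / o_exp$estimate["rate"]  ## ~4.626, the 'scale' reported by the Gamma fit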
To illustrate that the zeros-included fit may not be working as expected, I'll construct my own negative log-likelihood function and display the likelihood surface for each case.
## negative log-likelihood for a Gamma(shape = sh, scale = sc) fit to data dd
mfun <- function(sh, sc, dd = s) {
  -sum(dgamma(dd, shape = sh, scale = sc, log = TRUE))
}
library(emdbook) ## for curve3d() helper function
Zeros-included surface:
cc1 <- curve3d(mfun(x, y),
               ## set up "shape" limits so we evaluate
               ## exactly shape = 1.000 ...
               xlim = c(0.55, 3.55),
               n = c(41, 41),
               ylim = c(2, 5),
               sys3d = "none")
png("gammazero1.png")
with(cc1, image(x, y, z))
dev.off()
In this case the surface is only defined at shape=1 (i.e. an exponential distribution); the white regions represent infinite log-likelihoods. It's not that shape=1 is the best fit, it's that it's the only fit ...
Zeros-excluded surface:
cc2 <- curve3d(mfun(x, y, dd = s[s > 0]),
               ## set up "shape" limits so we evaluate
               ## exactly shape = 1.000 ...
               xlim = c(0.55, 3.55),
               n = c(41, 41),
               ylim = c(2, 5),
               sys3d = "none")
png("gammazero2.png")
with(cc2, image(x, y, z))
with(cc2, contour(x, y, z, add = TRUE))
abline(v = 1.0, lwd = 2, lty = 2)
dev.off()

Just provide initial values for the gamma distribution parameters (scale, shape) to be estimated by mle via optim, along with lower bounds for the parameters, and it should work.
o <- fitdist(s, "gamma", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 4.626262 NA
#shape 1.000000 NA
#Loglikelihood: -250.6432 AIC: 505.2864 BIC: 510.4766
As per the comments by @Ben Bolker, we may want to exclude the zero points first:
o <- fitdist(s[s!=0], "gamma", method = "mle", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 3.401208 NA
#shape 1.622378 NA
#Loglikelihood: -219.6761 AIC: 443.3523 BIC: 448.19

Related

Confidence interval of episode duration frequencies

I have the episode duration data (in days)
dur<-c(1, 2, 1, 2, 1, 3, 11, 2, 2, 3, 2, 4, 1, 2, 2, 1, 2, 10, 1, 1, 2, 2, 18, 2, 2, 2, 1, 7, 1, 1, 11, 25, 17, 2, 2, 9, 3, 3, 2, 5, 3, 2, 3, 2, 5, 363, 1, 1, 2, 2)
That means the first episode lasted 1 day, the second 2 days, the third 1 day, and so on.
table(dur) summarizes the duration data (12 instances of 1 day, 20 instances of 2 days, etc.).
freq.table <- table(dur)/sum(table(dur)) gives me the frequencies of the observed episode durations (point estimates).
How can I get confidence intervals of freq.table in R? What would be the most appropriate way for this kind of data?
Edit: I am interested in estimating the CI of the frequency of episode durations of 1, 2, ..., n days
A fast and easy way to get CIs for proportions in R is the function binom.test as in
dur <- c(1, 2, 1, 2, 1, 3, 11, 2, 2, 3, 2, 4,
         1, 2, 2, 1, 2, 10, 1, 1, 2, 2, 18, 2,
         2, 2, 1, 7, 1, 1, 11, 25, 17, 2, 2, 9,
         3, 3, 2, 5, 3, 2, 3, 2, 5, 363, 1, 1, 2, 2)
t <- table(dur)
n <- length(dur)
ci <- sapply(t, function(x) binom.test(x, n, conf.level = .95)$conf.int)
rownames(ci) <- c("lower", "upper")
print(ci)
That is supposing that the data-generating process for each episode duration is something like a binomial process.
Edit after first comment
As Roland has pointed out in an earlier comment above, you have not stated the problem in unambiguous statistical terms, so I made some assumptions. I suppose Roland would suggest trying to find a distribution for all the possible durations as a whole system. Considering a mode at 2 and the existence of an observation with value 363, this is unlikely to be a common distribution like the Poisson or binomial. Knowing nothing about the data-generating process, I estimated a confidence interval for each observed outcome on its own, not regarding the distribution as a whole. For each observed outcome I assumed a binomial distribution, which you should look up before you use my proposition as an answer for anything serious.
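If you do want simultaneous intervals that treat the whole table as one multinomial sample (closer to what Roland seems to have in mind), one option - my suggestion, not part of the original answer - is MultinomCI() from the DescTools package:
library(DescTools)
## simultaneous ~95% confidence intervals for all duration categories at once
MultinomCI(table(dur), conf.level = 0.95)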

How to make predictions even with NAs using predict()?

I want to use predict() with a polr() model to predict variable z, as per the following code. The first data frame is used to train the model; the second is the test data.
df <- data.frame(x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
                 y = c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67, 16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
                 z = as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 1, 2, 1, 2, 1, 2)))
test <- data.frame(x = c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
                   y = c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48),
                   z = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
This is the polr() model (polr() comes from the MASS package):
library(MASS)
mod <- polr(z ~ x + y, data = df, Hess = TRUE)
And this is the predict() function with its outcome:
predict(mod, newdata = test)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
My problem is that I want the model to make predictions even when there are NAs, as in the 2nd and 4th cases. I have tried the following, with the same result:
predict(mod, newdata = test, na.action = "na.exclude")
predict(mod, newdata = test, na.action = "na.pass")
predict(mod, newdata = test, na.action = "na.omit")
predict(mod, newdata = test, na.rm=T)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
How can I get the model to make predictions even when there's some missing data?
This is more of a statistical or mathematical problem than a programming problem. To simplify things a little bit (and to show that the issue is general), I'll illustrate with a linear regression, but the concept extends to ordinal regression as well.
Suppose I've estimated a linear relationship, say z = 1 + 2*x + 3*y, and I want to predict a response when the predictors are {x=3, y=NA}. I get 1 + 2*3 + 3*NA, which is clearly NA.
If you want predictions when some of the predictor variables are unknown, you have to make some kind of assumption/decision about what to do; this is a question of interpretation, not mathematics. For example, you could set unknown values of y to the mean of the original data set, or the mean of the new data set, or some sensible reference value, or you could do multiple imputation - i.e., make several predictions based on several different draws from a reasonable distribution, then average the results. (For a linear regression model this will give you the same point estimate as using the mean of the distribution, but (1) the results will differ if you have an effectively nonlinear model like an ordinal or generalized linear regression; (2) multiple imputation will allow you to get sensible standard errors on the prediction.)
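A minimal sketch of the simplest option, imputing the training-set mean (my addition; whether the mean is a sensible fill-in is an application-level judgment):
test_imp <- test
## replace missing y values with the mean of y in the training data
test_imp$y[is.na(test_imp$y)] <- mean(df$y)
predict(mod, newdata = test_imp)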

Plotting an Exponential Best Fit Curve to ggplot2 using Stat_smooth [duplicate]

I am trying to fit data to an exponential decay function (RC-like system) with this equation:
fold = 1 + Vmax * (1 - exp(-t / tau))
My data are in the following data frame:
dataset <- data.frame(
  Exp = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
          5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
          6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6),
  t = c(0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10,
        0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10,
        0, 0.33, 0.67, 1, 1.33, 1.67, 2, 4, 6, 8, 10),
  fold = c(1, 0.957066345654286, 1.24139015724819, 1.62889151698633,
           1.72008539595879, 1.82725412314402, 1.93164365299958,
           1.9722929538061, 2.15842019312484, 1.9200507796933, 1.95804730344453,
           1, 0.836176542548747, 1.07077717914707, 1.45471712491441,
           1.61069357875771, 1.75576377806756, 1.89280913889538,
           2.00219054189937, 1.87795513639311, 1.85242493827193, 1.7409346372629,
           1, 0.840498729335292, 0.904130905000499, 1.23116185602517,
           1.41897551928886, 1.60167656534099, 1.72389226836308,
           1.80635095956481, 1.76640786872057, 1.74327897001172, 1.63581509884482))
I have data from 3 experiments (Exp: 4, 5 and 6), and I want to fit each experiment to the given equation.
I have managed to do it for one of the experiments by subsetting my data and using the parameters calculated by nls:
test <- subset(dataset,Exp==4)
fit1 <- nls(fold ~ 1 + (Vmax*(1 - exp(-t/tau))),
            data = test,
            start = c(tau = 0.2, Vmax = 2))
library(ggplot2)
ggplot(test, aes(t, fold)) +
  stat_function(fun = function(t){1 + coef(fit1)[[2]]*(1 - exp(-t/coef(fit1)[[1]]))}) +
  geom_point()
But if I try to use the geom_smooth function directly on the full dataset with this code
d <- ggplot(test, aes(t, fold)) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = 'fold~1+Vmax*(1-exp(-t/tau))',
              start = c(tau = 0.2, Fmax = 2))
print(d)
I get the following error:
Error in model.frame.default(formula = ~fold, data = data, weights = weight) :
variable lengths differ (found for '(weights)')
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Is there anything wrong with my syntax? I would like to get this working in order to use the same function on the whole dataset, using group to have one fit per Exp level.
There are several problems:
formula is a parameter of nls and you need to pass a formula object to it and not a character.
ggplot2 passes y and x to nls and not fold and t.
By default, stat_smooth tries to get the confidence interval. That isn't implemented in predict.nls.
In summary:
d <- ggplot(test, aes(x = t, y = fold)) +
  ## to make it obvious, I use argument names instead of positional matching
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ 1 + Vmax*(1 - exp(-x/tau)), # an nls argument, but
                                        # stat_smooth passes the parameter along
              start = c(tau = 0.2, Vmax = 2),           # this too
              se = FALSE)  # an argument to stat_smooth; switches off
                           # drawing confidence intervals
Edit:
After the major ggplot2 update to version 2, you need:
geom_smooth(method="nls",
formula=y~1+Vmax*(1-exp(-x/tau)), # this is an nls argument
method.args = list(start=c(tau=0.2,Vmax=2)), # this too
se=FALSE)
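To address the original goal of one fit per Exp level, here is a sketch of my own (assuming ggplot2 >= 2.0): mapping Exp to a grouping aesthetic makes the smoother fit the model separately within each group.
library(ggplot2)
ggplot(dataset, aes(x = t, y = fold, colour = factor(Exp))) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = y ~ 1 + Vmax*(1 - exp(-x/tau)),
              method.args = list(start = c(tau = 0.2, Vmax = 2)),
              se = FALSE)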

Getting (maybe manually) confidence interval of fits after using multi-way clustering package (multiwayvcov)

I am interested in plotting fits with confidence intervals after using the two-way clustering package (multiwayvcov).
Here is my reproducible data.
rm(list=ls(all=TRUE))
library(lmtest)
library(multiwayvcov)
dv<-c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1<-c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062, 0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2<-c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont<-c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
cluster1<-c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
cluster2<-c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2)
mydata<-as.data.frame(cbind(dv, int1, int2, cont, cluster1, cluster2))
This is my non-clustered model:
result_lm <- lm(dv~int1+int2+cont,data=mydata)
To get clustered results using "cluster1" and "cluster2", I use functions from the "lmtest" and "multiwayvcov" packages as follows.
cluster_vcov<-cluster.vcov(result_lm, ~cluster1+cluster2)
result_2c<-coeftest(result_lm, cluster_vcov)
Here, "cluster_vcov" is just a variance-covariance matrix and "result_2c" is just an atomic vector. Thus, I am not able to use "predict" function to plot fits on a new dataset ("datagrid") such as
grid <- seq(0, 1, .2)
datagrid <- data.frame(int1 = rep(grid, 2),
                       int2 = c(rep(0, length(grid)),
                                rep(1, length(grid))))
datagrid$cont <- mean(mydata$cont, na.rm = TRUE)
Before moving to what I have done, here is something similar to what I would like to have eventually.
fits <- predict(result_lm,newdata=datagrid,interval="confidence")
plotdata <- data.frame(fits,datagrid)
plotdata$int2 <- plotdata$int2==1
ggplot(plotdata, aes(x = int1, y = fit, ymin = lwr, ymax = upr, color = int2)) +
  geom_line(aes(linetype = int2)) +
  geom_ribbon(alpha = .2) +
  theme(legend.position = "none") +
  scale_color_manual(values = c("red", "darkgreen")) +
  scale_linetype_manual(values = c("dashed", "solid"))
The result is a plot of the two fitted lines with confidence ribbons (figure not shown).
To address the problem that "result_2c" does not give a data frame that can be used directly with "predict", I decided to construct one myself as follows.
d_twc_result <- data.frame(matrix(0, nrow = 4, ncol = 4))
colnames(d_twc_result) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(d_twc_result) <- c("(Intercept)", "int1", "int2", "cont")
for (j in 1:4){
  for (i in 1:4){
    d_twc_result[i, j] <- result_2c[i, j]
  }
}
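(As an aside - my note, not the poster's: a coeftest object is already matrix-like, so the nested loops can be collapsed to a one-liner.)
## same result as the double loop above
d_twc_result <- as.data.frame(unclass(result_2c))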
Then, using "d_twc_result$Estimate", I generate a vector that corresponds to "fits" that one could get after running "predict".
fits<-c(1:12)
for (i in 1:12){
fits[i]<-d_twc_result$Estimate[1]+
d_twc_result$Estimate[2]*datagrid$int1[i]+
d_twc_result$Estimate[3]*datagrid$int2[i]+
d_twc_result$Estimate[4]*datagrid$cont[i]
}
Yet, I was still not able to construct vectors for "lwr" and "upr", which require residuals or standard errors. Where I actually got stuck is that it seems impossible to get residuals or standard errors, because there are no observations of "dv" in the dataset "datagrid".
Nevertheless, "predict" works with the dataset "datagrid", so I guess I am misunderstanding how "predict" works, or the concept of a fit.
It would be highly appreciated if you could help me get "lwr" and "upr" (or correct my understanding of the concept of fit, if it is wrong). Thanks for any comments in advance.
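One way to construct "lwr" and "upr" by hand (a sketch of mine under the usual large-sample assumptions, not an answer from the thread): the variance of a fitted value x'b is x'Vx, where V is the clustered variance-covariance matrix, so the standard errors of all the fitted values come from the diagonal of XVX'.
## model matrix for the prediction grid (same formula as result_lm)
X <- model.matrix(~ int1 + int2 + cont, data = datagrid)
fit <- drop(X %*% coef(result_lm))
se.fit <- sqrt(diag(X %*% cluster_vcov %*% t(X)))
t_crit <- qt(0.975, df = df.residual(result_lm))  ## ~95% interval
plotdata <- data.frame(datagrid,
                       fit = fit,
                       lwr = fit - t_crit * se.fit,
                       upr = fit + t_crit * se.fit)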

How to get the order of the model used in auto.arima?

I am trying to use auto.arima on a time series, and I need to know the order of the ARIMA model that has been selected. The return value is of class ARIMA, which doesn't seem to hold the order anywhere (or am I missing something?). The code snippet and the output attributes are given below. (This is the same as in the R documentation.)
double[] list1 = {0, 0, 2, 1, 2, 10, 21, 0, 0, 3, 6, 5, 11, 51, 0, 11, 8, 6, 24, 25, 104, 0, 0, 6, 4, 5, 25, 71};
rconnection.assign("myData1", list1);
rconnection.eval("timeSeries1 <- ts(myData1,start=1,frequency="+staticBookingStage+")");
REXP fc = rconnection.eval("fitModel1 <- auto.arima(timeSeries1)");
System.out.println( fc.asList().names);
Output
[coef, sigma2, var.coef, mask, loglik, aic, arma, residuals, call, series, code, n.cond, nobs, model, bic, aicc, x, fitted]
Use the arimaorder() function:
library(forecast)
fit <- auto.arima(WWWusage)
arimaorder(fit)
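arimaorder() returns a named vector giving the non-seasonal order (p, d, q), plus P, D, Q and the period for seasonal models. A quick sketch of pulling the pieces out (my addition):
ord <- arimaorder(fit)
ord["p"]; ord["d"]; ord["q"]  ## individual components by name
## the raw specification is also stored in the fit itself:
fit$arma                      ## c(p, q, P, Q, period, d, D)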