Fitting a zero-inflated Poisson distribution in R

I have a vector of count data that is strongly overdispersed and zero-inflated.
The vector looks like this:
i.vec=c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
m=mean(i.vec)
# 3.040816
sig=sd(i.vec)
# 10.86078
I would like to fit a distribution to this, which I strongly suspect will be a zero-inflated Poisson (ZIP). But I need to perform a significance test to demonstrate that a ZIP distribution fits the data.
If I had a normal distribution, I could do a chi-squared goodness-of-fit test using the function goodfit() in the package vcd, but I don't know of any tests that I can perform for zero-inflated data.

Here is one approach:
# LOAD LIBRARIES
library(fitdistrplus) # fits distributions using maximum likelihood
library(gamlss) # defines pdf, cdf of ZIP
# FIT DISTRIBUTION (mu = Poisson mean, sigma = zero-inflation probability)
fit_zip = fitdist(i.vec, 'ZIP', start = list(mu = 2, sigma = 0.5))
# VISUALIZE FIT AND COMPUTE GOODNESS-OF-FIT STATISTICS
plot(fit_zip)
gofstat(fit_zip, print.test = TRUE)
Based on this, it does not look like ZIP is a good fit.
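A possible follow-up, sketched here with the ZINBI (zero-inflated negative binomial) family from gamlss.dist and rough starting values that may need tuning, is to compare the ZIP fit against a zero-inflated negative binomial, which can absorb the overdispersion that the ZIP cannot:
# LOAD ZERO-INFLATED NEGATIVE BINOMIAL (d/p/q ZINBI FUNCTIONS) FROM gamlss.dist
library(gamlss.dist)
# FIT ZINB WITH ROUGH STARTING VALUES (mu = mean, sigma = dispersion, nu = zero-inflation probability)
fit_zinb = fitdist(i.vec, 'ZINBI', start = list(mu = 2, sigma = 1, nu = 0.5))
# COMPARE GOODNESS-OF-FIT STATISTICS; LOWER AIC/BIC SUGGESTS THE BETTER CANDIDATE
gofstat(list(fit_zip, fit_zinb), fitnames = c('ZIP', 'ZINB'))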

Simulating likelihood ratio test (LRT) p-value using the Monte Carlo method

I'm trying to figure out my assignment: simulating the p-value of a likelihood ratio test (LRT) using the Monte Carlo method. As far as I understand, the LRT is supposed to test which of two nested models is the "better", more accurate one.
I know how to perform such a test:
library(lmtest)
nested  <- glm(finalgrade ~ absences, data = grades)
complex <- glm(finalgrade ~ absences + age, data = grades)
lrtest(nested, complex)
From there I can return the p-value and perform calculations such as type I and type II error rates or the power of the test, and see how they change with the number of simulations.
My question is how I am supposed to simulate the random data. It doesn't have to be grades or school-related data; that was just a showcase of my understanding.
I was thinking about making a data frame with 3 to 4 columns, with one column being a dependent variable (0, 1) and the rest being random numbers generated from the normal distribution or some other distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function, but it didn't really help me to understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
dependent <- sample(c(0, 1), replace = TRUE, size = n)
pvalue <- c()
for (i in 1:1000) {
  independent_x <- rnorm(n, mean = 2, sd = 0.2)
  independent_y <- rnorm(n, mean = 7, sd = 0.5)
  nested  <- lm(dependent ~ independent_x)
  complex <- lm(dependent ~ independent_x + independent_y)
  # store the p-value of the likelihood ratio test
  pvalue <- c(pvalue, lrtest(nested, complex)[["Pr(>Chisq)"]][2])
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of computing or estimating it with formulas. The trick is to use simulation instead of computations.
Your problem does not seem to be so much how to perform the simulations, but rather which distribution you are interested in and want to simulate, and which boundary conditions you need to fix. Which computation or estimation is it that you want to replace/estimate with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihoods $\mathcal{L}$, where one of the hypotheses is a composite hypothesis and you select the highest likelihood within it.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\sup_{\theta_{age} \neq 0}\mathcal{L}(\theta_{age} \mid \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\mathcal{L}(\hat\theta_{age} \mid \text{some data})}$$ where the supremum is found by plugging in the maximum likelihood estimator $\hat\theta_{age}$.
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on a distribution and link function) or the simpler lm (which assumes a Gaussian conditional distribution for the data).
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute the distribution of the outcomes of $\Lambda$. You need some more data and parameters:
- The independent variables. These you probably want to fix at some values that relate to your practical problem: you want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is an error in these independent variables; in that case you may also simulate them.
- The variance/dispersion/noise level of the conditional distribution. This you may vary to see how it influences the statistic, or you may have a particular value of interest, for instance if you have data for which you estimated the noise.
- The other coefficients. These you may likewise vary or keep fixed, depending on whether you want to model a particular situation or a broader range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5000 simulations because I am using an online R editor and compiler; on your own computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
  ### compute model
  Z = X %*% theta1
  p = 1/(1+exp(-Z))
  ### simulate dependent variable
  Y = rbinom(length(p), 1, p)
  ### compute (log)likelihood ratio
  mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
  mod0 = glm(Y ~ 1 + X[,2], family = binomial)
  logratio = -2*(logLik(mod0) - logLik(mod1))
  return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim, 1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform: we had an estimate based on a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known and one uses the data and some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out what exact computation (for which information/conditions are given) it is that you want to replace by using simulations.
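To make that concrete with the example above, here is a small sketch of the computation the simulation replaces, reusing Lsim and X from the code and assuming a hypothetical observed response vector Y_obs:
### observed statistic computed from the real data (hypothetical Y_obs)
mod1_obs = glm(Y_obs ~ 1 + X[,2] + X[,3], family = binomial)
mod0_obs = glm(Y_obs ~ 1 + X[,2], family = binomial)
L_obs = as.numeric(-2*(logLik(mod0_obs) - logLik(mod1_obs)))
### Monte Carlo p-value: fraction of simulated statistics at least as large as the observed one
### (the +1 correction keeps the estimate away from exactly zero)
p_mc = (sum(Lsim >= L_obs) + 1) / (length(Lsim) + 1)
### chi-squared approximation, for comparison with the simulated p-value
p_chisq = 1 - pchisq(L_obs, df = 1)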

is there an R function to obtain the minimal depth distribution from a conditional random forest estimated with the party package?

I ran a conditional random forest regression model using the cforest function from the party package because I have both categorical and continuous predictor variables that are correlated with each other, and a continuous outcome variable.
Here is my code to run the conditional random forest model, obtain out-of-bag estimates, and estimate the permutation variable importance.
# 1. fit the random forest
crf <- party::cforest(Y ~ ., data = df,
                      controls = party::cforest_unbiased(ntree = 10000, mtry = 7))
# 2. obtain out-of-bag estimates
pred_oob <- as.data.frame(predict(crf, OOB = TRUE, newdata = NULL))
# 3. estimate permutation variable importance
vi <- permimp::permimp(crf, condition = TRUE, threshold = 0.5, nperm = 1000, OOB = TRUE,
                       mincriterion = 0)
I would like to visualize the minimal depth distribution and calculate mean minimal depth similar to the output from the RandomForestExplainer package. However, the RandomForestExplainer package only takes in objects from the randomForest function in the randomForest package. It's not an option for me to use this function due to the nature of my data (described above).
I have been combing the internet and have not been able to find a solution. Can someone point me to a way to visualize the minimal depth distribution for all predictors and calculate the mean minimal depth?

Using ROC curve to find optimum cutoff for my weighted binary logistic regression (glm) in R

I have built a binary logistic regression model for churn prediction in RStudio. Due to the unbalanced data used for this model, I also included weights. I then tried to find the optimum cutoff by trial and error; however, to complete my research I have to use ROC curves to find the optimum cutoff. Below is the script I used to build the model (fit2). The weight is stored in W. It states that the cost of wrongly identifying a churner is 14 times as large as the cost of wrongly identifying a non-churner.
#CH1 logistic regression
library(caret)
W = 14
lvl = levels(trainingset$CH1)
print(lvl)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trainingset$CH1==lvl[2],W,1)
fit2 = glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP, data = trainingset, weight=fit_wts, family=binomial(link='logit'))
# we test it on the test set
predlog1 = ifelse(predict(fit2,testset,type="response")>0.5,lvl[2],lvl[1])
predlog1 = factor(predlog1,levels=lvl)
predlog1
confusionMatrix(predlog1, testset$CH1, positive=lvl[2])
For this research I have also built ROC curves for decision trees using the pROC package. However, the same script does not work directly for a logistic regression. I have created a ROC curve for the logistic regression using the script below.
prob = predict(fit2, testset, type = "response")
testset$prob = prob
library(pROC)
g <- roc(CH1 ~ prob, data = testset)
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cut off from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now, the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic, and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should have a look at the "Best thresholds" section of the documentation of the coords function for some basic options.
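Since your weights encode a 14:1 cost ratio, one way to fold that into the threshold search is the best.weights argument of coords, which takes the relative cost of a false negative (versus a false positive) and the prevalence of the positive class. A sketch, assuming prev is computed from your test set and that 14 is the right relative cost for your definition of "optimal":
# churn prevalence in the test set
prev <- mean(testset$CH1 == lvl[2])
# cost-weighted Youden threshold: best.weights = c(cost of false negative, prevalence)
coords(g, "best", best.method = "youden", best.weights = c(14, prev), transpose = FALSE)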

rstanarm for adaptive trials

I started exploring the rstanarm package and was curious how this package could potentially be used in an adaptive trial scenario. The example scenario given in the vignette provides a posterior estimate of -0.622 with a credible interval from -0.69 to -0.56.
What would my script look like if I wanted to use this posterior as a prior for my next model when I have additional data from the adaptive trial?
# Code from vignette
t_prior <- student_t(df = 7, location = 0, scale = 2.5)
fit1 <- stan_glm(switch ~ dist100, data = wells,
                 family = binomial(link = "logit"),
                 prior = t_prior, prior_intercept = t_prior,
                 chains = 10, cores = 2, seed = 3245, iter = 100)
Your question is not so easily answered within the rstanarm framework because it only offers limited choices for the priors.
It is entirely valid to use your original prior with the total data from phase I and phase II combined to obtain a posterior distribution (essentially ignoring the intermediate posterior distribution you had after phase I). Alternatively, you could do as you suggest in phase I, then call
draws <- as.matrix(fit1)
mu <- colMeans(draws)
Sigma <- cov(draws)
and use these (estimated) mu and Sigma values as the hyperparameters of a multivariate normal prior over the coefficients in phase II. Unfortunately, such a prior is not supported by rstanarm, so you would need to write your own model with a Bernoulli likelihood, a logit link, and a multivariate normal prior in the Stan language. Alternatively, I think you could accomplish all of that using the brm function in the brms package, which generates Stan code from R syntax and draws from the corresponding posterior distribution.
Both approaches conceptually should give you the same posterior distribution after phase II. However, with a finite number of posterior draws they will differ a little bit numerically and the multivariate normal prior might not be a complete description of the posterior distribution you obtained after phase I.
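For the brms route, a minimal sketch (the phase II data frame wells2 is a placeholder; the 0 + Intercept formula puts the intercept into the "b" class so that the multivariate normal prior covers intercept and slope together, with M and V ordered as in the rstanarm draws):
library(brms)
draws <- as.matrix(fit1)   # posterior draws from the phase I rstanarm fit
M <- colMeans(draws)       # prior means for phase II (intercept, dist100)
V <- cov(draws)            # prior covariance for phase II
bprior <- set_prior("multi_normal(M, V)", class = "b")
stanvars <- stanvar(M, name = "M") + stanvar(V, name = "V")
fit2 <- brm(switch ~ 0 + Intercept + dist100, data = wells2,
            family = bernoulli(link = "logit"),
            prior = bprior, stanvars = stanvars)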

how to define a GEV (generalized extreme value) distribution to a copula?

I am trying to fit a copula for two variables which have an extreme value distribution. For the mvdc class, I need to define margins and paramMargins. Since the GEV is not included in the default distribution functions of Rcopula, I obtained these two values by using the evd package, with these two functions:
# pgev gives the Generalized Extreme Value distribution function
GEVmarginU1<-pgev(U1, loc=0, scale=1, shape=0, lower.tail = TRUE)
GEVmarginV2<-pgev(V2, loc=0, scale=1, shape=0, lower.tail = TRUE)
#fit a generalised extreme value distribution to my data
MU1 <- fgev(U1, scale = 1, shape = 0)
MV2 <- fgev(V2, scale = 1, shape = 0)
but when I pass these values to the mvdc function, I get an error:
myMvd <- mvdc(copula = ellipCopula(family = "Frank", param = 0), margins = c(pgev, pgev),
paramMargins = list(list(MU1), list(MV2))
Most importantly, I want to be sure whether I am on the right track. Since the two variables are obtained from a discrete choice model, I have an extreme value distribution. Also, the marginals have a GEV distribution, right? So I need to define the GEV for mvdc, otherwise my fitted copula will not work well.
(1) $U_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$
(2) $V_i = \gamma_1 Y_{j1} + \gamma_2 Y_{j2} + \gamma_3 Y_{j3} + \eta_i$
in summary:
(1) $U_i = \beta' X_i + \varepsilon_i$
(2) $V_i = \gamma' Y_j + \eta_i$
Since these models are built with a discrete choice modelling approach, the distribution function follows an "extreme value" distribution. First step: I estimate the coefficients $\beta_1, \beta_2, \beta_3, \gamma_1, \gamma_2, \gamma_3$ separately for each of $U_i$ and $V_j$ using a multinomial logit model in the Biogeme software. But intuitively I know that they are dependent variables, so I try to fit a copula and estimate the coefficients again, taking the dependency into account. So, the joint probability that $U_i$ and $V_i$ are chosen by decision-maker n is:
These marginals are transformed to continuous, but still have an extreme value distribution, am I right?
1) How can I define the GEV when using the mvdc copula class in Rcopula?
Second, assume I used fitCopula instead of mvdc and obtained param (the dependency parameter of the copula). If I understood correctly, fitCopula is for the parametric case and mine is non-parametric, am I right?
2) Now, how should I update the coefficients using the joint distribution and the dependency parameter?
For the first question, I found out that my marginals are logistically distributed: they are the difference between two error terms in the utility model, the error terms follow a type 1 extreme value (Gumbel) distribution, and the difference between two Gumbel-distributed variables follows a logistic distribution, according to Wikipedia.
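To make that concrete, here is a minimal sketch of how the mvdc object could then be set up with logistic margins in the copula package; the Frank copula parameter and the margin parameters are only illustrative placeholders:
library(copula)
# mvdc() expects margins as character names of distributions that have
# d/p/q functions, e.g. "logis" for dlogis/plogis/qlogis
myMvd <- mvdc(copula = frankCopula(param = 2),
              margins = c("logis", "logis"),
              paramMargins = list(list(location = 0, scale = 1),
                                  list(location = 0, scale = 1)))
# the margin and copula parameters can then be estimated jointly from the
# observed pairs, e.g. with fitMvdc(cbind(U1, V2), myMvd, start = ...)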
