How to define a GEV (generalized extreme value) distribution for a copula? - r

I am trying to fit a copula to two variables that follow an extreme value distribution. For the "mvdc" class, I need to define margins and paramMargins. Since GEV is not among the default distribution functions available to the "copula" package in R, I obtained these two values with the "evd" package, using these two functions:
# pgev gives the Generalized Extreme Value distribution function
GEVmarginU1<-pgev(U1, loc=0, scale=1, shape=0, lower.tail = TRUE)
GEVmarginV2<-pgev(V2, loc=0, scale=1, shape=0, lower.tail = TRUE)
#fit a generalised extreme value distribution to my data
MU1 <- fgev(U1, scale = 1, shape = 0)
MV2 <- fgev(V2, scale = 1, shape = 0)
But when I pass these values to the "mvdc" function, I get an error:
myMvd <- mvdc(copula = ellipCopula(family = "Frank", param = 0), margins = c(pgev, pgev),
paramMargins = list(list(MU1), list(MV2))
Most importantly, I want to be sure that I am on the right track. Since the two variables are obtained from a discrete choice model, I have an extreme value distribution. The marginals also have a GEV distribution, right? So I need to define GEV for "mvdc", otherwise my fitted copula will not work well.
(1) Ui = β1Xi1 + β2Xi2 + β3Xi3 + εi
(2) Vi = γ1Yj1 + γ2Yj2 + γ3Yj3 + ηi
in summary:
(1) Ui = β'Xi' + εi
(2) Vi = γ'Yj' + ηi
Since these models come from a discrete choice modelling approach, the distribution function follows an “extreme value” distribution. First step: I estimate the coefficients β1, β2, β3, γ1, γ2, γ3 separately for each of Ui and Vj with a multinomial logit model in the Biogeme software. But intuitively I know that they are dependent variables, so I try to fit a copula and estimate the coefficients again while accounting for the dependency. So, the joint probability that Ui and Vi are chosen by decision-maker n is:
These marginals are transformed to continuous, but still have an extreme value distribution, am I right?
1) How can I define GEV when using the “mvdc” class in the copula package?
Second, assume I used “fitCopula” instead of “mvdc” and got param (the dependency parameter of the copula). If I understood correctly, “fitCopula” is for the parametric case and my case is non-parametric, am I right?
2) Now, how should I update the coefficients using the joint distribution and the dependency parameter?

For the first question, I found out that my marginals follow a logistic distribution: they are the difference between two error terms in the utility model, the error terms follow a type 1 extreme value (Gumbel) distribution, and the difference between two independent Gumbel variables with the same scale follows a logistic distribution, according to Wikipedia.
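The Gumbel-difference claim can be checked numerically. The following is a minimal base-R sketch, not part of the original answer; the helper `rgumbel` and the sample size are assumptions chosen for illustration. It draws standard Gumbel variates by inverse-CDF sampling and compares the empirical CDF of their difference with the standard logistic CDF.

```r
set.seed(1)
n <- 1e5

# Hypothetical helper: standard Gumbel draws via the inverse CDF,
# F(x) = exp(-exp(-x))  =>  x = -log(-log(u))
rgumbel <- function(n) -log(-log(runif(n)))

# Difference of two independent standard Gumbel error terms
d <- rgumbel(n) - rgumbel(n)

# Compare the empirical CDF with the standard logistic CDF
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
max_gap <- max(abs(ecdf(d)(qlogis(probs)) - probs))
print(max_gap)  # should be close to 0 if the difference is logistic
```

If the margins are indeed logistic, base R already provides dlogis, plogis and qlogis, so no extra package should be needed to describe them.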


Simulating likelihood ratio test (LRT) pvalue using Monte Carlo method [migrated]

I'm trying to figure out my assignment to simulate the LRT p-value using the Monte Carlo method. As far as I understand, the LRT is supposed to test which model is "better", i.e. more accurate.
I know how to perform such a test:
nested <- glm(finalgrade~absences,data=grades)
complex <- glm(finalgrade~absences+age,data=grades)
lrtest(nested, complex)
From there I can extract the p-value and compute things like type I and type II error rates or the power of the test, and see how they change depending on the number of simulations.
My question is how am I supposed to simulate the random data. It doesn't have to be grades or school related stuff this was just a showcase of my understanding.
I was thinking about making data frame with 3 to 4 columns with 1 column being a dependent value (0,1) and the rest being random numbers generated from the normal distribution or some different distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function, but it didn't really help me understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
dependent <- sample(c(0, 1), replace = TRUE, size = n)
pvalue <- numeric(1000)
for (i in 1:1000) {
  independent_x <- rnorm(n, mean = 2, sd = 0.2)
  independent_y <- rnorm(n, mean = 7, sd = 0.5)
  nested <- lm(dependent ~ independent_x)
  complex <- lm(dependent ~ independent_x + independent_y)
  # row 2, column 5 of the lrtest table holds the p-value
  pvalue[i] <- lrtest(nested, complex)[2, 5]
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of computing or estimating it with formulas. The trick is to use simulation instead of computation.
Your problem does not seem to be so much how to perform the simulations, but more like what is the distribution that you are interested in and want to simulate and what are the boundary conditions that you need to fix. Which computation or estimation is it that you want to replace/estimate with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihood $\mathcal{L}$ where one of the hypotheses is a composite hypothesis and you select the highest likelihood among them.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\text{sup}_{\theta_{age} \neq 0}\mathcal{L}( \theta_{age} | \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0| \text{some data})}{\mathcal{L}( \hat\theta_{age} | \text{some data})} $$ where the supremum is found by using the likelihood for the maximum likelihood estimator $ \hat\theta_{age} $
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on some distribution and link function) or, more simply, with lm (which assumes a Gaussian conditional distribution for the data).
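To make the ratio concrete, here is a minimal sketch that computes $-2\log\Lambda$ by hand from two nested glm fits and compares it with the chi-squared approximation. The simulated data, the variable names (`x1`, `x2`) and the coefficients are assumptions for illustration only.

```r
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                       # plays the role of "age"; true coefficient is 0
y <- rbinom(n, 1, plogis(0.5 + x1))

mod0 <- glm(y ~ x1, family = binomial)        # null model
mod1 <- glm(y ~ x1 + x2, family = binomial)   # model with the extra term

# -2 log(Lambda) = 2 * (logLik of larger model - logLik of null model)
lr_stat <- as.numeric(2 * (logLik(mod1) - logLik(mod0)))

# Large-sample approximation: chi-squared with 1 degree of freedom
p_chisq <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

# Cross-check against the built-in likelihood ratio test for nested glms
lrt_table <- anova(mod0, mod1, test = "LRT")
print(c(lr_stat = lr_stat, p_chisq = p_chisq))
```

The hand-computed statistic should match the deviance difference reported by anova, which is a useful check before replacing the chi-squared step with simulation.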
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute the distribution of the outcomes of $\Lambda$. For this you need some more data and parameters:
The independent variables. These you probably want to fix at some values that relate to your practical problem. You want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is an error in these independent variables; in that case you may also simulate these variables.
The variance/dispersion/noise level of the conditional distribution. This you may vary to see how it influences the statistic. Or you may have some value of interest, for instance if you have data for which you estimated the noise.
The other coefficients. These you may likewise vary or keep fixed depending on the situation, that is, whether you want to model a particular situation or a wider range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5000 simulations because I am using an online R editor and compiler; on a computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
  ### compute model
  Z = X %*% theta1
  p = 1/(1+exp(-Z))
  ### simulate dependent variable
  Y = rbinom(length(p), 1, p)
  ### compute (log)likelihood ratio
  mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
  mod0 = glm(Y ~ 1 + X[,2], family = binomial)
  logratio = -2*(logLik(mod0)-logLik(mod1))
  return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim,1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform. We had an estimate with a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known and one uses the data and some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out what exact computation (for which information/conditions are given) it is that you want to replace by using simulations.
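As one possible way to turn a simulated null distribution into a p-value, the sketch below uses a parametric bootstrap: it plugs the fitted null-model probabilities in for the unknown true ones, which is an assumption and not something the answer above prescribes. The data, the helper `lr_stat` and the number of replicates `B` are hypothetical.

```r
set.seed(2024)
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x1))   # stand-in for observed data

# Hypothetical helper: -2 log(Lambda) for the nested pair of logistic models
lr_stat <- function(y, x1, x2) {
  mod0 <- glm(y ~ x1, family = binomial)
  mod1 <- glm(y ~ x1 + x2, family = binomial)
  as.numeric(2 * (logLik(mod1) - logLik(mod0)))
}

observed <- lr_stat(y, x1, x2)

# Simulate responses under the fitted null model, keeping the regressors fixed
p_null <- fitted(glm(y ~ x1, family = binomial))
B <- 500
simulated <- suppressWarnings(
  replicate(B, lr_stat(rbinom(n, 1, p_null), x1, x2))
)

# Monte Carlo p-value (with the usual +1 correction) next to the chi-squared one
p_mc <- (1 + sum(simulated >= observed)) / (B + 1)
p_chisq <- pchisq(observed, df = 1, lower.tail = FALSE)
print(c(p_mc = p_mc, p_chisq = p_chisq))
```

Repeating this whole procedure over many datasets generated under the null (or under an alternative) would give the type I error rate (or the power) mentioned in the question.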

LASSO-type regressions with non-negative continuous dependent variable (dependent var)

I am using "glmnet" package (in R) mostly to perform regularized linear regression.
However, I am wondering if it can perform LASSO-type regressions with a non-negative (integer) continuous outcome (dependent) variable.
I can use family = poisson, but the outcome variable is not specifically a "count" variable. It is just a continuous variable with a lower limit of 0.
I am aware of the "lower.limits" argument, but I guess it is for covariates (independent variables). (Please correct me if my understanding of this argument is not right.)
I look forward to hearing from you all! Thanks :-)
You are right that lower.limits in glmnet applies to the covariates (it bounds their coefficients), not to the response. With family = poisson, predictions are bounded below by zero because the linear predictor is exponentiated to get back the "counts".
Going along those lines, it will most likely work if you transform your response variable. One quick way is to take the log of your response variable, do the fit and transform it back; this ensures that the prediction is always positive, though you then have to deal with zeros, since log(0) is undefined.
An alternative is a power transformation. There's a lot to think about, and since you did not provide your data I can only try a two-parameter Box-Cox with an example dataset:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas=as.numeric(data$chas)
# change it to min 0 and max 1
data$medv = (data$medv-min(data$medv))/diff(range(data$medv))
Then here I use a quick approximation via pca (without fitting all the variables) to get the suitable lambda1 and lambda2 :
bcfit = boxcoxfit(object = data[,14],
xmat = prcomp(data[,-14],scale=TRUE,center=TRUE)$x[,1:2],
lambda2=TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it is the one that's critical for deciding whether you can get a negative value. It should be rather small.
Create the functions to power transform:
bct = function(y,l1,l2){((y+l2)^l1 -1)/l1}
bctinverse = function(y,l1,l2){(y*l1+1)^(1/l1) -l2}
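As a quick sanity check that is not part of the original answer, the two functions should invert each other for any response value above -l2. A minimal sketch, with hypothetical values for l1 and l2 close to the fitted ones:

```r
# Same definitions as above, repeated so the check is self-contained
bct <- function(y, l1, l2) ((y + l2)^l1 - 1) / l1
bctinverse <- function(y, l1, l2) (y * l1 + 1)^(1 / l1) - l2

l1 <- 0.43    # assumed value, close to the fitted lambda
l2 <- 1e-5    # assumed value, close to the fitted lambda2

y <- seq(0, 1, by = 0.1)                      # responses rescaled to [0, 1]
roundtrip <- bctinverse(bct(y, l1, l2), l1, l2)
max_err <- max(abs(roundtrip - y))
print(max_err)                                # should be numerically zero
```

Because the back-transform subtracts l2 from a non-negative power, its output cannot fall below -l2, which is why a small lambda2 keeps predictions essentially non-negative.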
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas, and you can see there's no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

Using ROC curve to find optimum cutoff for my weighted binary logistic regression (glm) in R

I have built a binary logistic regression for churn prediction in RStudio. Because the data used for this model are unbalanced, I also included weights. I then tried to find the optimum cutoff by trial and error; however, to complete my research I have to use ROC curves to find the optimum cutoff. Below is the script I used to build the model (fit2). The weight is stored in 'W'. It states that the cost of wrongly identifying a churner is 14 times as large as the cost of wrongly identifying a non-churner.
#CH1 logistic regression
library(caret)
W = 14
lvl = levels(trainingset$CH1)
print(lvl)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trainingset$CH1==lvl[2],W,1)
fit2 = glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP, data = trainingset, weight=fit_wts, family=binomial(link='logit'))
# we test it on the test set
predlog1 = ifelse(predict(fit2,testset,type="response")>0.5,lvl[2],lvl[1])
predlog1 = factor(predlog1,levels=lvl)
predlog1
confusionMatrix(predlog1,testset$CH1,positive=lvl[2])
For this research I have also built ROC curves for decision trees using the pROC package. However, of course the same script does not work the same way for a logistic regression. I have created a ROC curve for the logistic regression using the script below.
prob=predict(fit2, testset, type=c("response"))
testset$prob=prob
library(pROC)
g <- roc(CH1 ~ prob, data = testset)
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cut off from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should have a look at the Best Thresholds section of the documentation of the coords function for some basic options.
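To illustrate how the definition of "optimal" changes the answer, here is a hand-rolled base-R sketch that does not use pROC. It compares the threshold that maximizes sensitivity + specificity with one that minimizes a misclassification cost in which a missed churner costs 14 times as much as a false alarm. The simulated scores and all variable names are assumptions for illustration.

```r
set.seed(42)
n <- 500
truth <- rbinom(n, 1, 0.2)                       # 1 = churner (simulated)
prob <- plogis(-1.5 + 2 * truth + rnorm(n))      # simulated predicted probabilities

# Candidate cutoffs: classify as churner when prob >= threshold
thresholds <- sort(unique(prob))
sens <- sapply(thresholds, function(t) mean(prob[truth == 1] >= t))
spec <- sapply(thresholds, function(t) mean(prob[truth == 0] < t))

# Criterion 1: Youden's J, i.e. maximize sensitivity + specificity
best_youden <- thresholds[which.max(sens + spec - 1)]

# Criterion 2: minimize total cost, false negatives weighted 14:1
cost_fn <- 14
cost_fp <- 1
total_cost <- sapply(thresholds, function(t) {
  cost_fn * sum(truth == 1 & prob < t) + cost_fp * sum(truth == 0 & prob >= t)
})
best_cost <- thresholds[which.min(total_cost)]

print(c(youden = best_youden, cost_weighted = best_cost))
```

With false negatives weighted this heavily, the cost-based cutoff is pushed towards lower values, so more customers are flagged as churners.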

What are the differences between directly plotting the fit function and plotting the predicted values(they have same shape but different ranges)?

I am trying to learn gam() in R for a logistic regression using a spline on a predictor. The two methods of plotting in my code give the same shape but different ranges of the response on the logit scale; it seems like an intercept is missing in one. Both are supposed to be correct, so why do the ranges differ?
library(ISLR)
attach(Wage)
library(gam)
gam.lr = gam(I(wage >250) ~ s(age), family = binomial(link = "logit"), data = Wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
pred=predict(gam.lr, newdata = list(age = age.grid), type = "link")
par(mfrow = c(2,1))
plot(gam.lr)
plot(age.grid, pred)
I expected that both methods would give exactly the same plot. plot(gam.lr) plots the additive effect of each component, and since there is only one here, it is supposed to give the predicted logit function. The predict method also gives me estimates on the link scale. But the actual outputs are on different ranges. The minimum value of the first method is -4, while that of the second is less than -7.
The first plot is of the estimated smooth function s(age) only. Smooths are subject to identifiability constraints as in the basis expansion used to parametrise the smooth, there is a function or combination of functions that are entirely confounded with the intercept. As such, you can't fit the smooth and an intercept in the same model as you could subtract some value from the intercept and add it back to the smooth and you have the same fit but different coefficients. As you can add and subtract an infinity of values you have an infinite supply of models, which isn't helpful.
Hence identifiability constraints are applied to the basis expansions, and the one that is most useful is to ensure that the smooth sums to zero over the range of the covariate. This involves centering the smooth at 0, with the intercept then representing the overall mean of the response.
So, the first plot is of the smooth, subject to this sum to zero constraint, so it straddles 0. The intercept in this model is:
> coef(gam.lr)[1]
(Intercept)
-4.7175
If you add this to values in this plot, you get the values in the second plot, which is the application of the full model to the data you supplied, intercept + f(age).
This is all also happening on the link scale, the log odds scale, hence all the negative values.
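The same decomposition can be seen in base R with an ordinary glm, used here only as an analogy for the gam fit: predict(type = "terms") returns centered term contributions plus a "constant" attribute, and adding the two recovers the link-scale prediction. The simulated data and the polynomial term are assumptions for illustration.

```r
set.seed(1)
n <- 500
age <- runif(n, 18, 80)
y <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 40) - 0.002 * (age - 40)^2))

# A polynomial in age stands in for the smooth s(age)
fit <- glm(y ~ poly(age, 3), family = binomial)

# Centered contribution of the age term (what a term plot would show)
terms_centered <- predict(fit, type = "terms")
constant <- attr(terms_centered, "constant")

# Full prediction on the link (log-odds) scale
link_pred <- predict(fit, type = "link")

# constant + centered term reproduces the link-scale prediction
gap <- max(abs(rowSums(terms_centered) + constant - link_pred))
print(c(constant = constant, gap = gap))
```

The centered term averages to zero over the data, so the two curves have the same shape and differ only by a vertical shift.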

Fitting a zero inflated poisson distribution in R

I have a vector of count data that is strongly over dispersed and zero inflated.
The vector looks like this:
i.vec=c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)
m=mean(i.vec)
# 3.040816
sig=sd(i.vec)
# 10.86078
I would like to fit a distribution to this, which I strongly suspect will be a zero inflated poisson (ZIP). But I need to perform a significance test to demonstrate that a ZIP distribution fits the data.
If I had a normal distribution, I could do a chi square goodness of fit test using the function goodfit() in the package vcd, but I don't know of any tests that I can perform for zero inflated data.
Here is one approach
# LOAD LIBRARIES
library(fitdistrplus) # fits distributions using maximum likelihood
library(gamlss) # defines pdf, cdf of ZIP
# FIT DISTRIBUTION (mu = mean of the Poisson component, sigma = probability of an extra zero)
fit_zip = fitdist(i.vec, 'ZIP', start = list(mu = 2, sigma = 0.5))
# VISUALIZE TEST AND COMPUTE GOODNESS OF FIT
plot(fit_zip)
gofstat(fit_zip, print.test = TRUE)
Based on this, it does not look like ZIP is a good fit.
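To see one reason why the fit is poor, the following base-R sketch (an alternative to the packages above, with a hand-written likelihood) fits a ZIP by maximum likelihood with optim and compares the variance implied by the fitted model with the sample variance. The function name `zip_negloglik` and the starting values are assumptions.

```r
i.vec <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
           0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

# Negative log-likelihood of a zero-inflated Poisson
# par[1] = logit of the extra-zero probability, par[2] = log of the Poisson mean
zip_negloglik <- function(par, y) {
  pi0 <- plogis(par[1])
  mu <- exp(par[2])
  ll_zero <- log(pi0 + (1 - pi0) * exp(-mu))          # P(Y = 0)
  ll_pos <- log(1 - pi0) + dpois(y, mu, log = TRUE)   # P(Y = k), k > 0
  -sum(ifelse(y == 0, ll_zero, ll_pos))
}

opt <- optim(c(0, log(2)), zip_negloglik, y = i.vec)
pi_hat <- plogis(opt$par[1])
mu_hat <- exp(opt$par[2])

# Moments implied by the fitted ZIP versus the sample moments
zip_mean <- (1 - pi_hat) * mu_hat
zip_var <- zip_mean * (1 + pi_hat * mu_hat)
print(c(pi_hat = pi_hat, mu_hat = mu_hat,
        zip_var = zip_var, sample_var = var(i.vec)))
```

The fitted ZIP implies a variance well below the sample variance, which is driven by the two large counts (63 and 44); zero inflation alone does not account for that overdispersion.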
