Wald Test for Multinomial Reg. in R - r

I asked this question before but never got an answer, so I am trying again and providing a sample data set so someone can tell me why I'm getting the errors I'm getting when I try implementing the Wald Test from aod and lmtest packages.
Sample data:
marital <- sample(1:5, 64614, replace = T)
race <- sample(1:3, 64614, replace = T)
educ <- sample(1:20, 64614, replace = T)
test <- data.frame(educ, marital, race)
test$marital <- as.factor(test$marital)
test$race <- as.factor(test$race)
test$marital <- relevel(test$marital, ref = "3")
require(nnet)
require(aod)
require(lmtest)
testmod <- multinom(marital ~ race*educ, data = test)
testnull <- multinom(marital ~ 1, data = test) #null model for the global test
waldtest(testnull, testmod)
wald.test(b = coef(testmod), Sigma = vcov(testmod), Terms = 1:24) #testing all terms for the global test
As you can see, when I use the waldtest function from lmtest package I get the following error:
Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml
When I use the wald.test function from aod, I get the following error:
Error in L %*% b : non-conformable arguments
I assume these are related errors as they both seem to do with the variance matrix. I'm not sure why I'm getting these errors though as the data set has no missing values.

Just as heads up when using nnet package with multinom: You can also use the package broom to tidy things a bit by doing this:
tidy(multinom_model, conf.int= True, conf.level = 0.95, exponentiate = T)
This will return a tibble with the coefficients exponentiated, confidence intervals (similiar to confint used in lm), as well as the Z-scores, Standard errors and the respective p-value for the Wald Z test (essentially doing z = summary(multinom_model)$coefficients/summary(multinom_model)$standard.errors and the round((1 - pnorm(abs(z), 0, 1)) * 2,digits=5) already

Related

Can glmmLasso be used with the Tweedie distribution?

I have a linear mixed effects model and I am trying to do variable selection. The model is testing the level of forest degradation in 1000 sampled points. Most points have no degradation, and so the dependent variable is highly skewed with many zeros. Therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here tweedie comes from the statmod package. A simple example:
library(tweedie)
library(tidyverse)
library(glmmLasso)
library(statmod)
power <- 2
mu <- 1
phi <- seq(2, 8, by=0.1)
set.seed(10000)
y <- rtweedie( 100, mu=mu, power=power, phi=3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2,50))
df = as.data.frame(cbind(y,x,z))
df$z = as.factor(df$z)
f = y ~ x
varSelect = glmmLasso(fix = f, rnd = list(z=~1), data = df,
lambda = 5, family = tweedie(var.power=1.2,link.power=0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on Github. I had to change two aspects of the code:
add a clause to compute the log-likelihood if family$family == "Tweedie"
in a number of places where the code was essentially if (family$family in list_of_families) ..., add "Tweedie" as an option.
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model using a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this using the function gam() however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option, and then try to make predictions with my model on new data, while ignoring the random effect, I consistent get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.gauranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata and then mgcv looking for random.x and finding it in the global environment. You should really gather that variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.

Error: $ operator not defined for this S4 class while running hoslem.test

I'm working on an optimization of a logistic regression model made with glm, the optimization is a lasso regression using glmnet. I want to compare both models using the output of a Hosmer Lemeshow test and I get this output.
For the glm I get
> hl <- hoslem.test(trainingDatos$Exited, fitted(logit.Mod))
> hl
Hosmer and Lemeshow goodness of fit (GOF) test
data: trainingDatos$Exited, fitted(logit.Mod)
X-squared = 2.9161, df = 8, p-value = 0.9395
And when I try to run the test for the lasso regression I get
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model), g=10)
Error in cut.default(yhat, breaks = qq, include.lowest = TRUE) :
'x' must be numeric
I also tried to use the coefficients of the lasso regression to make it numeric and I get
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model$beta), g=10)
Error: $ operator not defined for this S4 class
But when I treat it as an S4
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model#beta), g=10)
Error in fitted(lasso.model#beta) :
trying to get slot "beta" from an object (class "lognet") that is not an S4 object
Any way to run the test for my lasso regression?
Here is my full code for the lasso regression, can't share the database right now sorry
#Creation of Training Data Set
input_ones <- Datos[which(Datos$Exited == 1), ] #All 1s
input_zeros <- Datos[which(Datos$Exited == 0), ] #All 0s
set.seed(100)
#Training 1s
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))
#Training 0s
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingDatos <- rbind(training_ones, training_zeros)
library(glmnet)
#Conversion of training data into matrix form
x <- model.matrix(Exited ~ CreditScore + Geography + Gender
+ Age + Tenure + Balance + IsActiveMember
+ EstimatedSalary, trainingDatos)[,-1]
#Defining numeric response variable
y <- trainingDatos$Exited
sed.seed(100)
#Grid search to find best lambda
cv.lasso<-cv.glmnet(x, y, alpha = 1, family = "binomial")
#Creation of the model
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.1se)
coef(cv.lasso, cv.lasso$lambda.1se)
#Now trying to run the test
library(ResourceSelection)
set.seed(12657)
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model), g=10)#numeric value error
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model$beta), g=10)#$ not defined for S4
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model#beta), g=10)#saying that beta is nos S4
glmnet uses a unique predict() method for obtaining fitted values. As rightly mentioned, the errors came from using fitted(). Meanwhile, running such tests could be easier with the gofcat package. Supported objects are passed directly to the functions. Your glm model, for instance, goes hosmerlem(logit.Mod).

R: obtain coefficients&CI from bootstrapping mixed-effect model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
fac1 = as.factor(sample(c("A","B","C","D","E"),30, replace = T)),
fac2 = as.factor(sample(c("NY","NC","CA"),30,replace = T)),
x = rnorm(1:30))
The lme model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrapping and I can successfully obtain the coefficients (intercept) and SEs for fixed&random effects:
mixed_boot_sum <- function(data){s <- sigma(data)
c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients(slope) of each individual levels of the two random effects from the bootstrapping results mixed_boot ?
I have no problem extracting the coefficients(slope) from mixed model by using augment function from broom package, see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with boot class object. I can't use above functions directly on mixed_boot.
I also tried to modify the mixed_boot_sum by adding mmList( I thought this would be what I am looking for), but R complains as:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CI of both fixed&random effects by specifying FUN as well?
Now, I am very confused about the correct specifications for the FUN in order to achieve my needs. Any help regarding to my question would be greatly appreciated!
My first question is how to obtain the coefficients(slope) of each individual levels of the two random effects from the bootstrapping results mixed_boot ?
I'm not sure what you mean by "coefficients(slope) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level I would try
mixed_boot_coefs <- function(fit){
unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
outer(rownames(cc),colnames(cc),
function(x,y) paste0(y,x)))
mixed_boot_coefs <- function(fit){
unlist(lapply(coef(fit),flatten))
}
When run through bootMer/confint/boot::boot.ci these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict] ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.

Why parametric bootstrapping bias and standard error are zero here?

I'm performing parametric bootstrapping in R for a simple problem and getting Bias and Standard Error zero always. What am I doing wrong?
set.seed(12345)
df <- rnorm(n=10, mean = 0, sd = 1)
Boot.fun <-
function(data) {
m1 <- mean(data)
return(m1)
}
Boot.fun(data = df)
library(boot)
out <- boot(df, Boot.fun, R = 20, sim = "parametric")
out
PARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = Boot.fun, R = 20, sim = "parametric")
Bootstrap Statistics :
original bias std. error
t1* -0.1329441 0 0
You need to add line of code to do the sampling, ie.
Boot.fun <-
function(data) {
data <- sample(data, replace=T)
m1 <- ...
since you didn't supply a function to the argument rand.gen to generate random values. This is discussed in the documentation for ?boot. If sim = "parametric" and you don't supply a generating function, then the original data is passed to statistic and you need to sample in that function. Since your simulation was run on the same data, there is no standard error or bias.

Resources