Multiple imputation and mlogit for a multinomial regression - r

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?

Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))

Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As #slamballais mentioned in their answer, the issue is with dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})

Related

Apply logistic regression in a function in R

I want to run logistic regression for multiple parameters and store the different metrics i.e AUC.
I wrote the function below but I get an error when I call it: Error in eval(predvars, data, env) : object 'X0' not found even if the variable exists in both my training and testing dataset. Any idea?
new.function <- function(a) {
model = glm(extry~a,family=binomial("logit"),data = train_df)
pred.prob <- predict(model,test_df, type='response')
predictFull <- prediction(pred.prob, test_df$extry)
auc_ROCR <- performance(predictFull, measure = "auc")
my_list <- list("AUC" = auc_ROCR)
return(my_list)
}
# Call the function new.function supplying 6 as an argument.
les <- new.function(X0)
The main reason why your function didn't work is that you are trying to call an object into a formula. You can fix it with paste formula function, but that is ultimately quite limiting.
I suggest instead that you consider using update. This allow you more flexibility to change with multiple variable combination, or change a training dataset, without breaking the function.
model = glm(extry~a,family=binomial("logit"),data = train_df)
new.model = update(model, .~X0)
new.function <- function(model){
pred.prob <- predict(model, test_df, type='response')
predictFull <- prediction(pred.prob, test_df$extry)
auc_ROCR <- performance(predictFull, measure = "auc")
my_list <- list("AUC" = auc_ROCR)
return(my_list)
}
les <- new.function(new.model)
The function can be further improved by calling the test_df as a separate argument, so that you can fit it with an alternative testing data.
To run the function in the way you intended, you would need to use non-standard evaluation to capture the symbol and insert it in a formula. This can be done using match.call and as.formula. Here's a fully reproducible example using dummy data:
new.function <- function(a) {
# Convert symbol to character
a <- as.character(match.call()$a)
# Build formula from character strings
form <- as.formula(paste("extry", a, sep = "~"))
model <- glm(form, family = binomial("logit"), data = train_df)
pred.prob <- predict(model, test_df, type = 'response')
predictFull <- ROCR::prediction(pred.prob, test_df$extry)
auc_ROCR <- ROCR::performance(predictFull, "auc")
list("AUC" = auc_ROCR)
}
Now we can call the function in the way you intended:
new.function(X0)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
new.function(X1)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
If you want to see the actual area under the curve you would need to do:
new.function(X0)$AUC#y.values[[1]]
#> [1] 0.6599759
So you may wish to modify your function so that the list contains auc_ROCR#y.values[[1]] rather than auc_ROCR
Data used
set.seed(1)
train_df <- data.frame(X0 = sample(100), X1 = sample(100))
train_df$extry <- rbinom(100, 1, (train_df$X0 + train_df$X1)/200)
test_df <- data.frame(X0 = sample(100), X1 = sample(100))
test_df$extry <- rbinom(100, 1, (test_df$X0 + test_df$X1)/200)
Created on 2022-06-29 by the reprex package (v2.0.1)

How to cluster standard error in clubSandwich's vcovCR()?

I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.

Passing strings into 'contrasts' argument of lme/lmer

I am writing a sub-routine to return output of longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to be able to specify contrasts within these mixed-effects models, however I am having trouble with getting the contrasts() argument to recognise the strings as the variable names referred to in the model specification within the same lme/lme4 call.
Here's some toy data,
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the strings specified to in the contrasts argument refer to the variables passed into the reformulate() function?
You should be able to use setNames() on the list of contrasts to apply the full names to the list:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"),
random = ~1|id,
data = df,
contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
setNames(c(g, t))
)

Convincing R to exclude single level factors when using lm() in a for loop in some subsets (but not all)

I am working on a large dataset with 19 subcohorts for which I want to run a lineair regression model to estimate BMI.
One of the covariates I am using is sex, but some subcohorts consist only of men, which causes problems in my loop.
If I try to run a linear regression model, I get the following error:
tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I have found a solution for this problem, by running seperate loops for subcohorts with men and subcohorts with men and women by the following (simplified) code:
men <- c(1,6,15) # Cohort nrs that only contain men
menandwomen <- c(2,3,4,5,7,8,9,10,11,12,13,14,16,17,18,19)
trenddpmodelm <-list()
for(i in men) {
trenddpmodelm[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodelmw <-list()
for(i in menandwomen) {
trenddpmodelmw[[i]] <- lm(BMI ~ age + sex,
data=subcohort[subcohort$centre_a==i, ],)
}
trenddpmodel <- c(list(trenddpmodelm[[1]]), list(trenddpmodelmw[[2]]), list(trenddpmodelmw[[3]]), list(trenddpmodelmw[[4]]), list(trenddpmodelmw[[5]]), list(trenddpmodelm[[6]]), list(trenddpmodelmw[[7]]), list(trenddpmodelmw[[8]]), list(trenddpmodelmw[[9]]), list(trenddpmodelmw[[10]]), list(trenddpmodelmw[[11]]), list(trenddpmodelmw[[12]]), list(trenddpmodelmw[[13]]), list(trenddpmodelmw[[14]]), list(trenddpmodelm[[15]]), list(trenddpmodelmw[[16]]), list(trenddpmodelmw[[17]]), list(trenddpmodelmw[[18]]), list(trenddpmodelmw[[19]]))
After this step, I extract relevant information from the summaries and put this in a df to export to excel.
My problem is that I will be running quite a lot of analyses, which will result in pages and pages of code.
My question is therefore: Is there a setting in R that I could use that allows non varying factors to be dropped from my lineair regression model in subcohorts where this is applicable? (similar to what happens in coxph; R gives a warning that the factor does not always vary, but the loop does run)
It is not like I cannot continue working without a solution, but I have been trying to find an answer to this question for days without succes and I think it must be possible somehow. Any advice is much appreciated :)
I would recommend building your formula dynamically within the loop.
DF <- list(Cohort1 = data.frame(bmi = rnorm(25, 24, 1),
age = rnorm(25, 50, 3),
sex = sample(c("F", "M"), 25, replace = TRUE)),
Cohort2 = data.frame(bmi = rnorm(15, 24, 1),
age = rnorm(15, 55, 4),
sex = rep("M", 15)))
candidate_vars <- c("age", "sex")
Models <- vector("list", length(DF))
for (i in seq_along(DF)){
# Determine if the variables are either numeric, or factor with more than 1 level
indep <- vapply(X = DF[[i]][candidate_vars],
FUN = function(x){
if (is.numeric(x)) return(TRUE)
else return(nlevels(x) > 1)
},
FUN.VALUE = logical(1))
# Write the formula
form <- paste("bmi ~ ", paste(candidate_vars[indep], collapse = " + "))
# Create the model
Models[[i]] <- lm(as.formula(form), data = DF[[i]])
}

R code: Rolling regression After stepwise

I have spent whole day today for resolving this.. please help me.
Although I just write very simple example here, my original data has a huge number of variables- about 2,000. Therefore, to run regression I need to pick certain variables.
I do need to develop many models, so I should do this procedure automatically.
I run stepwie.
I don't know how many variables are selected by stepwise.
after selecting variables, I run rolling regression for prediction.
library(car)
library(zoo)
# run regression
m <- lm(mpg~., data=mtcars)
# run stepwise
s<-step(m, direction="both")
# select variables
variable<- attr(s$terms,"term.labels")
b<-paste(dep,paste(s, collapse="+"),sep = "~")
rollapply(mtcars, width = 2,
FUN = function(z) coef(lm(b, data = as.data.frame(z))),
by.column = FALSE, align = "right")
# Here is the automatic model I developed..
models2 <- lapply(1:11, function(x) {
dep<-names(mtcars)[x]
ind<-mtcars[-x]
w<-names(ind)
indep<-paste(dep,paste(w, collapse="+"),sep = "~")
m<-lm(indep,data=mtcars)
s<-step(m, direction="both")
b<-paste(dep,paste(s, collapse="+"),sep = "~")
rollapply(mtcars, width = 2,
FUN = function(z) coef(lm(b, data = as.data.frame(z))),
by.column = FALSE, align = "right")})
I want to calculate prediction from rolling regression..
However, it is very hard to set up
data.frame without pre-knowldege about independent variables..
There is a similar one here, but in this model independent variables are known already.
You do not need to know the independent variables! If you provide a data.frame that contains all variables, the predict function will select the necessary ones. Similar to the post you have linked, you could to this:
mtcars[,"int"] <- seq(nrow(mtcars)) # add variable used to choose newdata
models2 <- lapply(1:11, function(x) {
dep <- names(mtcars)[x]
ind <- mtcars[-x]
w <- names(ind)
form <- paste(dep,paste(w, collapse="+"),sep = "~")
m <- lm(form, data=mtcars)
s <- step(m, direction="both", trace=0) # model selection (don't print trace)
b <- formula(s) # This is clearer than your version
rpl <- rollapply(mtcars, width = 20, # if you use width=2, your model will always be overdetermined
FUN = function(z) {
nextD <- max(z[,'int'])+1 # index of row for new data
fit <- lm(b, data = as.data.frame(z)) # fit the model
c(coef=coef(fit), # coefficients
predicted=predict(fit, newdata=mtcars[nextD,])) # predict using the next row
},
by.column = FALSE, align = "right")
rpl
})

Resources