predict.lm() with an unknown factor level in test data - r

I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.
Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?
Example code:
foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)
I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".

You have to remove the extra levels before any calculation, like:
> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
1 2 3 4
-0.1676941 -0.6454521 0.4524391 NA
This is a more general way of doing it, it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't
Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :
model.matrix(~predictor,data=foo) %*% coef(model)
[,1]
1 -0.1676941
2 -0.6454521
3 0.4524391
At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.
> model.matrix(~predictor,data=foo)
(Intercept) predictorB predictorC
1 1 0 0
2 1 1 0
3 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"
> model.matrix(~predictor,data=foo.new)
(Intercept) predictorB predictorC predictorD
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"
You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it would naively drop the last dummy variable.
On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.
Additional features
drops unused factor levels rather than just setting the missing values to NA.
issues a message to the user that factor levels have been dropped
checks for existence of factor variables in test_data and returns original data.frame if non are present
works not only for lm, glm and but also for glmmPQL
Note: The function shown here may change (improve) over time.
#' #title remove_missing_levels
#' #description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' #import magrittr
#' #importFrom gdata unmatrix
#' #importFrom stringr str_split
#'
#' #param fit fitted model on training data
#'
#' #param test_data data to make predictions for
#'
#' #return data.frame with matching factor levels to fitted model
#'
#' #keywords internal
#'
#' #export
remove_missing_levels <- function(fit, test_data) {
# https://stackoverflow.com/a/39495480/4185785
# drop empty factor levels in test data
test_data %>%
droplevels() %>%
as.data.frame() -> test_data
# 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
# account for it
if (any(class(fit) == "glmmPQL")) {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$contrasts))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
map(fit$contrasts, function(x) names(unmatrix(x))) %>%
unlist() -> factor_levels
factor_levels %>% str_split(":", simplify = TRUE) %>%
extract(, 1) -> factor_levels
model_factors <- as.data.frame(cbind(factors, factor_levels))
} else {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$xlevels))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
factor_levels <- unname(unlist(fit$xlevels))
model_factors <- as.data.frame(cbind(factors, factor_levels))
}
# Select column names in test data that are factor predictors in
# trained model
predictors <- names(test_data[names(test_data) %in% factors])
# For each factor predictor in your data, if the level is not in the model,
# set the value to NA
for (i in 1:length(predictors)) {
found <- test_data[, predictors[i]] %in% model_factors[
model_factors$factors == predictors[i], ]$factor_levels
if (any(!found)) {
# track which variable
var <- predictors[i]
# set to NA
test_data[!found, predictors[i]] <- NA
# drop empty factor levels in test data
test_data %>%
droplevels() -> test_data
# issue warning to console
message(sprintf(paste0("Setting missing levels in '%s', only present",
" in test data but missing in train data,",
" to 'NA'."),
var))
}
}
return(test_data)
}
We can apply this function to the example in the question as follows:
predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))
While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.
A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.

If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly what levels might be missing beforehand) here is function I've built to set all levels not in the model to NA - the prediction will also then give NA and you can then use an alternative method to predict these values.
object will be your lm output from lm(...,data=trainData)
data will be the data frame you want to create predictions for
missingLevelsToNA<-function(object,data){
#Obtain factor predictors in the model and their levels ------------------
factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))
factorLevels<-unname(unlist(object$xlevels))
modelFactors<-as.data.frame(cbind(factors,factorLevels))
#Select column names in your data that are factor predictors in your model -----
predictors<-names(data[names(data) %in% factors])
#For each factor predictor in your data if the level is not in the model set the value to NA --------------
for (i in 1:length(predictors)){
found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels
if (any(!found)) data[!found,predictors[i]]<-NA
}
data
}

Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there's little information to use when estimating them. Warning, though, that you'll have to do prediction yourself, rather than using predict().
Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a variable 0/1 for Monday, one for Tuesday, one for Wednesday, etc. Sunday will be automatically removed from the model if it contains all 0's. But having a 1 in the Sunday column in the other data won't fail the prediction step. It will just assume that Sunday has an effect that's average the other days (which may or may not be true).

One of the assumptions of Linear/Logistic Regressions is to little or no multi-collinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor, and can be set to NA without affecting the predicting ability of the remaining factors A,B,C. This is why the model should still be able to make predictions. But addition of the new level D throws off the expected schema. That's the whole issue. Setting NA fixes that.

The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.
Example: if your day of week factor is in a variable dow and a categorical outcome b_fail, you could run
M0 <- lmer(b_fail ~ x + (1 | dow), data=df.your.data, family=binomial(link='logit'))
M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)
This is an example with a random effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.

A quick-and-dirty solution for split testing, is to recode rare values as "other". Here is an implementation:
rare_to_other <- function(x, fault_factor = 1e6) {
# dirty dealing with rare levels:
# recode small cells as "other" before splitting to train/test,
# assuring that lopsided split occurs with prob < 1/fault_factor
# (N.b. not fully kosher, but useful for quick and dirty exploratory).
if (is.factor(x) | is.character(x)) {
min.cell.size = log(fault_factor, 2) + 1
xfreq <- sort(table(x), dec = T)
rare_levels <- names(which(xfreq < min.cell.size))
if (length(rare_levels) == length(unique(x))) {
warning("all levels are rare and recorded as other. make sure this is desirable")
}
if (length(rare_levels) > 0) {
message("recoding rare levels")
if (is.factor(x)) {
altx <- as.character(x)
altx[altx %in% rare_levels] <- "other"
x <- as.factor(altx)
return(x)
} else {
# is.character(x)
x[x %in% rare_levels] <- "other"
return(x)
}
} else {
message("no rare levels encountered")
return(x)
}
} else {
message("x is neither a factor nor a character, doing nothing")
return(x)
}
}
For example, with data.table, the call would be something like:
dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcol = xcols] # recode rare levels as other
where xcols is a any subset of colnames(dt).

Related

Looping GLM model and Printing Results

I have a data set that I want to run a number of univariate regressions with 20 variables or so. Listed below is a truncated data frame showing just two of the variables of interest (Age and Anesthesia). The regression works fine but the issue I am running into is how to store the data. Since Age has only coefficient but Anesthesia has >4 and the error I get is:
Error in results[i, i] <- summary.glm(logistic.model)$coefficients :
number of items to replace is not a multiple of replacement length
Here is the truncated data.
COMPLICATION Age Anesthesia
0 45 General
1 23 Local
1 33 Lumbar
0 21 Other
varlist <- c("Age", “Anesthesia")
univars <- data.frame() # create an empty data frame
for (i in seq_along(varlist))
{
mod <- as.formula(sprintf("COMPLICATION ~ %s", varlist[i]))
logistic.model<- glm(formula = mod, family = binomial, data=data.clean)
results[i,i] <-summary.glm(logistic.model)$coefficients
}
Filling the dataframe in a loop is not a good idea and it can be inefficient. Moreover, summary.glm(logistic.model)$coefficients returns more than 1 row hence you get the error. Try using this lapply approach.
varlist <- c("Age", "Anesthesia")
lapply(varlist, function(x) {
mod <- glm(reformulate(x, 'COMPLICATION'), data.clean, family = binomial)
summary.glm(mod)$coefficients
}) -> result
result
If you want to combine the result in one dataframe you can do :
result <- do.call(rbind, result)

vegan dbrda species scores are empty despite community matrix provided

I have performed "Distance based redundancy analysis" (dbRDA) in R using the vegan community ecology package. I would like to show the relative contribution of (fish) trophic groups to the dissimilarity between samples (abundance data of trophic level fish assemblages) in an ordination plot of the dbRDA results. I.e. Overlay arrows and trophic level group names onto the ordination plot, where the length of the arrow line indicates the relative contribution to dissimilarity. This should be accessible via the vegan::scores() function, or stored with the dbrda.model$CCA$v object, as I understand.
However, species scores are empty (NA) when using dbrda(). I understand that the dbrda function requires the community matrix to be defined within the function in order to provide species-scores. I have defined it as such, but still unable to produce the species scores. What puzzles me is that when I use capscale() in the vegan package, with the same species-community and environmental variable data, and formulated the same within the respective functions, species scores are produced. Is dbrda in vegan able to produce species scores? How are these scores different to that produced by capscale (when the same data and formula are used)? I provide an example of my data, and the formula used. (I am fairly confident in actually plotting the species-scores once obtained - so limit the code to producing the species-scores.)
#Community data matrix (comm.dat): site names = row names, trophic level = column names
>head(comm.dat[1:5,1:4])
algae/invertebrates corallivore generalist carnivore herbivore
h_m_r_3m_18 1 0 3 0
h_m_r_3m_22 6 4 8 26
h_m_r_3s_19 0 0 4 0
h_m_r_3s_21 3 0 7 0
l_pm_r_2d_7 1 0 5 0
> str(comm.dat)
num [1:47, 1:8] 1 6 0 3 1 8 11 2 6 9 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:47] "h_m_r_3m_18" "h_m_r_3m_22" "h_m_r_3s_19" "h_m_r_3s_21" ...
..$ : chr [1:8] "algae/invertebrates" "corallivore" "generalist carnivore" "herbivore" ...
# environmental data (env.dat): Already standardised env data.
>head(env.dat[1:5,1:3])
depth water.level time.in
-0.06017376 1.3044232 -1.7184415
-0.67902862 1.3044232 -1.7907181
-0.99619174 1.3044232 -1.7569890
-1.06581291 1.3044232 -1.7762628
2.39203863 -0.9214933 0.1703884
# Dissimilarity distance: Modified Gower (AltGower) with logbase 10 transformation of the community data matrix
> dis.comm.mGow <- vegan::vegdist(x = decostand(comm.dat, "log", logbase = 10), method = "altGower")
# Distance based RDA model: Trophic level data logbase transformed modified Gower distance, constrained against the interaction of dept and water level (tide), and the interaction of depth and time of day sampled`
> m.dbrda <- dbrda(formula = dis.comm.mGow ~ depth*water.level + depth*time.in, data = env.dat, scaling = 2, add = "lingoes", comm = decostand(comm.dat, "log", logbase = 10), dfun = "altGower")
# Check species scores: PROBLEM: No species level scores available
> m.dbrda$CCA$v
dbRDA1 dbRDA2 dbRDA3 dbRDA4 dbRDA5
[1,] NA NA NA NA NA
# OR pull species scores using scores(): Also does not show species scores...
>scrs <- scores(m.dbrda,display="species"); scrs
dbRDA1 dbRDA2
spe1 NA NA
attr(,"const")
[1] 6.829551
# when replacing dbrda with capscale, species scores are produced, e.g.
> m.cap <- capscale(formula = dis.comm.mGow ~ depth*water.level + depth*time.in, data = env.dat, scaling = 2, add = "lingoes", comm = decostand(comm.dat, "log", logbase = 10), dfun = "altGower")
> m.cap$CCA$v[1:5,1:3]
CAP1 CAP2 CAP3
algae/invertebrates 0.2044097 -0.04598088 -0.37200097
corallivore 0.3832594 0.06416886 -0.27963122
generalist carnivore 0.1357668 -0.08566365 -0.06789812
herbivore 0.5745226 -0.45647341 0.73085661
invertebrate carnivore 0.1987651 0.68036211 -0.19174283
The dbrda() function really does not get the species scores. However, if we had the following function in vegan, you could add those scores:
#' Add Species Scores to Ordination Results
#'
#' #param object Ordination object
#' #param comm Community data
#'
#' #return Function returns the ordination object amended with species
#' scores.
#'
#' #rdname specscores
#' #export
`specscores` <-
function(object, comm)
{
UseMethod("specscores")
}
#' importFrom vegan decostand
#'
#' #rdname specscores
#' #export
`specscores.dbrda` <-
function(object, comm)
{
comm <- scale(comm, center = TRUE, scale = FALSE)
if (!is.null(object$pCCA) && object$pCCA$rank > 0) {
comm <- qr.resid(object$pCCA$QR, comm)
}
if (!is.null(object$CCA) && object$CCA$rank > 0) {
v <- crossprod(comm, object$CCA$u)
v <- decostand(v, "normalize", MARGIN = 2)
object$CCA$v <- v
comm <- qr.resid(object$CCA$QR, comm)
}
if (!is.null(object$CA) && object$CA$rank > 0) {
v <- crossprod(comm, object$CA$u)
v <- decostand(v, "normalize", MARGIN = 2)
object$CA$v <- v
}
object
}
I don't know if we are going to have this function (with other methods), but you can use it if you source it in your session.
This looks like either a bug in documentation or missing feature and I'll raise this with Jari on Github.
However, note that dbrda() doesn't have argument comm and hence passing this doesn't actually do anything. comm is for capscale only.
If this is a documentation bug, then it arises because we document both capscale() and dbrda() on the same page, but capscale() existed first and dbrda() only arrived later and the documentation was updated to accommodate the latter, but this may not have have been done perfectly, hence confusion about what dbrda() is or is not documented to do.
As it stands, the code behind dbrda() makes no attempt to compute species scores; it could, if I follow the code correctly. As such, this may just be missing functionality.
For the moment, the NAs are the expected result from the underlying ordConstrained function that implements all these models, and is what we return if the analysis was on a dissimilarity matrix, which as far as this function is concerned, it was.

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set and the below piece of code lets me check some of the models, but not all. Models like column 1+6, 1+2+5 will not be covered by this loop. Is there any better way to accomplish this?
best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
for(j in 1:8){
for(x in k){
diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
if( best_model[1] < accuracy[x] ){
best_model[1] = accuracy[x]
best_model[2] = i
best_model[3] = j
}
}
}
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can for the power set (set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list",length(my.power.set)-1)
for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
Solved it, below is the code with explanatory comments:
# find out the best model for this data
number_of_columns_to_model <- ncol(diabetes_training)-1
best_model <- c()
best_model_accuracy = 0
for(i in 2:2^number_of_columns_to_model-1){
# ignoring the first case i.e. i=1, as it doesn't represent any model
# convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
combination = as.binary(i, n=number_of_columns_to_model) # from the binaryLogic package
model <- c()
for(i in 1:length(combination)){
# choose which columns to consider depending on the combination
if(combination[i])
model <- c(model, i)
}
for(x in k){
# for the columns decides by model, find out the accuracies of model for k=1:27
diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
if( best_model_accuracy < accuracy[x] ){
best_model_accuracy = accuracy[x]
best_model = model
print(model)
}
}
}
I trained with Pima.tr and tested with Pima.te. KNN Accuracy for pre-processed predictors was 78% and 80% without pre-processing (and this because of the large influence of some variables).
The 80% performance is at par with a Logistic Regression model. You don't need to preprocess variables in Logistic Regression.
RandomForest, and Logistic Regression provide a hint on which variables to drop, so you don't need to go and perform all possible combinations.
Another way is to look at a matrix Scatter plot
You get a sense that there is difference between type 0 and type 1 when it comes to npreg, glu, bmi, age
You also notice the highly skewed ped and age, and you notice that there may be an anomaly data point between skin and and and other variables (you may need to remove that observation before going further)
Skin Vs Type box plot shows that for type Yes, an extreme outlier exist (try removing it)
You also notice that most of the boxes for Yes type are higher than No type=> the variables may add prediction to the model (you can confirm this through a Wilcoxon Rank Sum Test)
The high correlation between Skin and bmi means that you can use one or the other or an interact of both.
Another approach to reducing the number of predictors is to use PCA

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly produce a multinomial logistic model with mlogit here (Formatting data for mlogit)
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social and Responsive) Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesnt work. I confess although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item ItemK, then we're modeling
log[ Pr(Itemk is 1st) / Pr(ItemK is 1st) ] = αk + xTβk
for k = 1,...,K-1, where Itemk is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with Ranked Data, not just Multinomial Choice Data. The structure for the Ranked data in mlogit is that first set of records for a person are all options, then the second is all options except the one ranked first, and so on. But the index assumes equal number of options each time. So a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

Manually conduct leave-one-out cross validation for a GLMM using a for() loop in R

I am trying to build a for() loop to manually conduct leave-one-out cross validations for a GLMM fit using the lmer() function from the lme4 pkg. I need to remove an individual, fit the model and use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
Response IndID Time Binary
1 0 AAA 31 1
2 1 BBB 34 1
3 1 CCC 6 1
4 0 DDD 48 1
5 1 AAA 36 1
6 0 BBB 46 1
#Build model with all IndID's
fit <- lmer(Response ~ Time + Binary + (1|IndID ), data = Data,
family=binomial)
summary(fit)
As stated above, my hope is to get four model fits – one with each IndID left out in a for() loop. This is a new type of application of the for() command for me and I quickly reached my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID){
fit[[i]] <- lmer(Response ~ Time + Binary + (1|IndID), data = Data[-i],
family=binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] conditional to the data=Data argument the code runs four fits, but data for each individual is not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefs, 2) apply them to the X matrix of the individual that was withheld and lastly, 3) compare the predicted values (after a logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more intimidate question of leave-one-out model fits.
Thanks as always!
The problem you are having is because Data[-i] is expecting i to be an integer index. Instead, i is either AAA, BBB, CCC or DDD. To fix the loop, set
data = Data[Data$IndID != i, ]
in you model fit.

Resources