random model formula object - r

I want to put formula in random model, but I think following error is due to wrong formula object (?), but could not fix it.
set.seed(1234)
mydata <- data.frame (A = rep(1:3, each = 20), B = rep(1:2, each = 30),
C = rnorm(60, 10, 5))
mydata$A <- as.factor(mydata$A)
mydata$B <- as.factor(mydata$B)
myfunction <- function (mydata, yvars, genovar, replication) {
require("lme4")
formula = paste ("yvars" ~ 1|"genovar" + 1|"replication")
model1 <- lmer(formula, data = dataframe, REML = TRUE)
return(ranef(model2))
}
myfunction(mydata=dataf, yvars = "C", genovar = "A", replication = "B")
Error: length(formula <- as.formula(formula)) == 3 is not TRUE

There were several wonky things in here, but this is I think close to what you want.
set.seed(1234)
mydata <- data.frame (A = factor(rep(1:3, each = 20)),
B = factor(rep(1:2, each = 30)),
C = rnorm(60, 10, 5))
require("lme4")
myfunction <- function (mydata, yvars, genovar, replication) {
formula <- paste (yvars,"~ (1|",genovar,") + (1|",replication,")")
model1 <- lmer(as.formula(formula), data = mydata, REML = TRUE)
return(ranef(model1))
}
myfunction(mydata=mydata, yvars = "C", genovar = "A", replication = "B")
Beware, however, that lmer doesn't work the way that classical random-effects ANOVA does -- it may perform very badly with such small numbers of replicates. (In the example I tried it set the variance of A to zero, which is at least not unreasonable.) The GLMM FAQ has some discussion of this issue. (Random-effects ANOVA would have exceedingly low power in that case but might not be quite as bad.) If you really want to do random-effects models on such small samples you might want to consider reconstructing the classical method-of-moments approach (as I recall there is/was a raov formula in S-PLUS that did random-effects ANOVA, but I don't know if it was ever implemented in R).
Finally, for future questions along these lines you may do better on the r-sig-mixed-models#r-project.org mailing list -- Stack Overflow is nice but there is more R/mixed-model expertise over there.

Related

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
age = runif(500, 12, 18),
var1 = sample(1:2, 500, replace = T),
var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2,
data = data.idx,
reflevel = "1",
nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
data.i <- complete(df4, i)
data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
# work with a list, where each top-element is a different imputation run (imp_n)
map(~as.list(.x)) %>%
transpose %>%
# for each run, impute and return the full (imputed) data set
map(function(imp_n.x) {
df_out <- df4$data
df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
df_out$age[is.na(df_out$age)] <- imp_n.x$age
df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
return(df_out)
}) %>%
# No errors with dfidx() now
map(function(imp_n.x) {
dfidx(data = imp_n.x, choice = "vax", shape = "wide")
})
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As #slamballais mentioned in their answer, the issue is with dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
map(function(imp_n.x) {
# form as specified in the comments
mlogit(vax ~ 1 | age + var1 + var2,
data = imp_n.x,
reflevel = "1",
nests = list(type1 = c('1', '2'),
type2 = c('3','4'),
type3 = c('5','6')))
})

Does caret::train() in r have a standardized output across different fit methods/models?

I'm working with the train() function from the caret package to fit multiple regression and ML models to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. Biggest issue is that caret doesn't provide all the model fit statistics that I'd like so they need to be derived from the raw output. Based on my exploration there doesn't seem to be a standardized way caret outputs each models fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
The real question is is there a standardized output of caret::train() for x variables or either y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducability
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c(y1, y2), c(x1,x2,x3), linear_models)

biglm - Error: $ operator is invalid for atomic vectors

I am trying to run a generalized linear model on a very large dataset (several million rows). R doesn't seem able to handle the analysis, however, as I keep getting memory allocation errors (unable to allocate vector of size...etc.).
The data fit in RAM, but seem to be too large to estimate complex models. As a solution, I'm exploring using the ff package to replace r's in-RAM storage mechanism with on-disk storage.
I have successfully (I think) off-loaded the data to my hard drive, but when I attempt to estimate the glm (via the biglm package) I get the following error:
Error: $ operator is invalid for atomic vectors
I'm not sure why I'm getting this specific error when I use the bigglm function. When I run the glm on the full dataset, it doesn't give me this specific error, though perhaps r is running out of memory before it gets far enough for the "operator is invalid" error to trigger.
I've provided an example data set and code below. Note that the standard glm runs just fine on this sample data. The problem arises when using biglm.
Please let me know if you have any questions.
Thank you in advance!
#Load required packages
library(readr)
library(ff)
library(ffbase)
library(LaF)
library(biglm)
#Create sample data
df <- data.frame("id" = as.character(1:20), "group" = rep(seq(1:5), 4),
"x1" = as.character(rep(c("a", "b", "c", "d"), 5)),
"x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace=T),
stringsAsFactors = FALSE)
#Write data to file
write_csv(df, "df.csv")
#Create connection to sample data using laf
con <- laf_open_csv(filename = "df.csv",
column_types = c("string", "string", "string",
"double", "string"),
column_names = c("id", "group", "x1", "x2", "y"),
skip = 1)
#Use LaF to import data into ffdf object
ff <- laf_to_ffdf(laf = con)
#Fit glm on data stored in RAM (note this model runs fine)
fit.glm <- glm(y ~ factor(x1) + x2 + factor(group), data=df,
family="binomial")
#Fit glm on data stored on hard-drive (note this model fails)
fit.big <- bigglm(y ~ factor(x1) + x2 + factor(group), data=ff,
family="binomial")
You are using the wrong family argument.
library(ffbase)
library(biglm)
df <- data.frame("id" = factor(as.character(1:20)), "group" = factor(rep(seq(1:5), 4)),
"x1" = factor(as.character(rep(c("a", "b", "c", "d"), 5))),
"x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace=T),
stringsAsFactors = FALSE)
d <- as.ffdf(df)
fit.big <- bigglm.ffdf(y ~ x1 + x2 , data = d,
family = binomial(link = "logit"), chunksize = 3)

Consistency of categorical encodings in h2o (and R) for training and new test sample

I'm having trouble understanding whether I need to be consistent with the categorical / factor encodings of variables. With consistency I mean that I need to assure that the encodings from integers and levels should be the same in the training and the new testing sample.
This answer seems to suggest that it is not necessary. On the contrary, this answer suggests that IT is indeed necessary.
Suppose I have a training sample with an xcat that can take values a, b, c. The expected result is that the y variable will tend to take values close to 1 when xcat is a, 2when xcat is b, and 3 when xcat is c.
First I'll create the dataframe, pass it to h2o and then encode with the function as.factor:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
n = 20
y <- sample(1:3, size = n, replace = T)
xcat <- letters[y]
xnum <- sample(1:10, size = n, replace = T)
y <- dep + rnorm(0, 0.3, n = 20)
df <- data.frame(xcat=xcat, xnum=xnum , y=y)
df.hex <- as.h2o(df, destination_frame="df.hex")
#Encode as factor. You will get: a=1, b=2, c=3
df.hex[ , "xcat"] = as.factor(df.hex[, "xcat"])
Now I'll estimate it with an glm model and predict on the same sample:
x = c("xcat", "xnum")
glm <- h2o.glm( y = c("y"), x = x, training_frame=df.hex,
family="gaussian", seed=1234)
glm.fit <- h2o.predict(object=glm, newdata=df.hex)
glm.fit gives the expected results (no surprises here).
Now I'll create a new test dataset that only has a and c, no b value:
xcat2 = c("c", "c", "a")
xnum2 = c(2, 3, 1)
y = c(1, 2, 1) #not really needed
df.test = data.frame(xcat=xcat2, xnum=xnum2, y=y)
df.test.hex <- as.h2o(df.test, destination_frame="df.test.hex")
df.test.hex[ , "xcat"] = as.factor(df.test.hex[, "xcat"])
Running str(df.test.hex$xcat) shows that this time the factor encoding has assigned 2 to c and 1 to a. This looked like it could be trouble, but then the fitting works as expected:
test.fit = h2o.predict(object=glm, newdata=df.test.hex)
test.fit
#gives 2.8, 2.79, 1.21 as expected
What's going on here? Is it that the glm model carries around the information of levels of the x variables so it doesn't mind if the internal encoding is different in the training and the new test data? Is that the general case for all h2o models?
From looking at one of the answers I linked above, it seems that at least some R models do require consistency.
Thanks and best!

Scoping-related (?): anova() on list of created mixed-effects models

In a project where I'm performing mixed-effects modelling using lme, I'm trying to compare models with different correlation structures and equal fixed parts. As I'll be building a lot of these models (for different dependent variables), I tried to write a function to generate a list of models with different correlation structures, as in the example below (I really tried to keep it to a minimum working example).
If I run an anova() on the elements of this list, this works, but only if fixedPart is in my global environment. Why is this the case? Is there a way to circumvent this problem, so that I can just keep m and re-use/delete fixedPart?
I presume this problem is related to the (lexical) scoping in R, but I cannot find a way to actually fix it.
Thanks in advance!
#Dependencies
library(multilevel)
library(multcomp)
#Generate sample data
nVals = 100
sData = rnorm(nVals, mean = 1, sd = 1)
dF <- data.frame(nSubject = 1:nVals,
v1data = sData + rnorm(nVals, mean = 0, sd = 0.1),
v2data = sData + rnorm(nVals, mean = 0, sd = 0.1),
v3data = sData + rnorm(nVals, mean = 0, sd = 0.4))
dLongF = reshape(data=dF, varying=c("v1data","v2data","v3data"), v.names='data', direction="long", idvar="nSubject", times=1:3)
#Define function to assess different covariance structures
doAllCorrModels <- function(dataF, subjVarName, visitVarName, fixedPart){
mList <- vector("list",2)
mList[[1]] <- lme(fixedPart, #Random intercept, homogeneous variance
random=as.formula(paste("~1|", subjVarName)),
data=dataF,
weights=NULL)
mList[[2]] <- lme(fixedPart, #Random intercept, heterogeneous variance
random=as.formula(paste("~1|", subjVarName)),
data=dataF,
weights=varIdent(form = as.formula(paste("~1|", visitVarName)))
)
mList
}
#Get different covariance structures
dataF <- dLongF
subjVarName <- "nSubject"
visitVarName <- "time"
fixedPart <- data ~ time
m <- doAllCorrModels(dataF, subjVarName, visitVarName, fixedPart)
#This works:
a1 <- anova(m[[1]], m[[2]])
#But this does not:
rm(fixedPart)
a2 <- anova(m[[1]], m[[2]])
You can avoid this by using do.call:
doAllCorrModels <- function(dataF, subjVarName, visitVarName, fixedPart){
mList <- vector("list",2)
mList[[1]] <- do.call(lme, list(fixed = fixedPart,
random=as.formula(paste("~1|", subjVarName)),
data=dataF,
weights=NULL))
mList[[2]] <- do.call(lme, list(fixed = fixedPart,
random=as.formula(paste("~1|", subjVarName)),
data=dataF,
weights=varIdent(form = as.formula(paste("~1|", visitVarName)))))
mList
}

Resources