biglm - Error: $ operator is invalid for atomic vectors - bigdata

I am trying to run a generalized linear model on a very large dataset (several million rows). R doesn't seem able to handle the analysis, however, as I keep getting memory allocation errors (unable to allocate vector of size...etc.).
The data fit in RAM, but they seem to be too large for estimating complex models. As a workaround, I'm exploring the ff package to replace R's in-RAM storage mechanism with on-disk storage.
I have successfully (I think) off-loaded the data to my hard drive, but when I attempt to estimate the glm (via the biglm package) I get the following error:
Error: $ operator is invalid for atomic vectors
I'm not sure why I'm getting this specific error when I use the bigglm function. When I run glm on the full dataset it doesn't give me this error, though perhaps R is running out of memory before it gets far enough for the "operator is invalid" error to trigger.
I've provided an example data set and code below. Note that the standard glm runs just fine on this sample data. The problem arises when using biglm.
Please let me know if you have any questions.
Thank you in advance!
#Load required packages
library(readr)
library(ff)
library(ffbase)
library(LaF)
library(biglm)
#Create sample data
df <- data.frame("id" = as.character(1:20), "group" = rep(seq(1:5), 4),
                 "x1" = as.character(rep(c("a", "b", "c", "d"), 5)),
                 "x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace = T),
                 stringsAsFactors = FALSE)
#Write data to file
write_csv(df, "df.csv")
#Create connection to sample data using laf
con <- laf_open_csv(filename = "df.csv",
                    column_types = c("string", "string", "string",
                                     "double", "string"),
                    column_names = c("id", "group", "x1", "x2", "y"),
                    skip = 1)
#Use LaF to import data into ffdf object
ff <- laf_to_ffdf(laf = con)
#Fit glm on data stored in RAM (note this model runs fine)
fit.glm <- glm(y ~ factor(x1) + x2 + factor(group), data = df,
               family = "binomial")
#Fit glm on data stored on hard drive (note this model fails)
fit.big <- bigglm(y ~ factor(x1) + x2 + factor(group), data = ff,
                  family = "binomial")

You are using the wrong family argument. Unlike glm(), bigglm() does not accept the character string "binomial"; it needs an actual family object such as binomial(link = "logit"). The example below also converts the categorical columns to factors before creating the ffdf:
library(ffbase)
library(biglm)
df <- data.frame("id" = factor(as.character(1:20)), "group" = factor(rep(seq(1:5), 4)),
"x1" = factor(as.character(rep(c("a", "b", "c", "d"), 5))),
"x2" = rnorm(20, 50, 1), y = sample(0:1, 20, replace=T),
stringsAsFactors = FALSE)
d <- as.ffdf(df)
fit.big <- bigglm.ffdf(y ~ x1 + x2 , data = d,
family = binomial(link = "logit"), chunksize = 3)
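For what it's worth, here is a minimal sketch (my own addition, not part of the answer above) of where the error message itself comes from: glm() converts the string "binomial" into a family object internally, whereas bigglm() appears to use components of the family argument directly, so applying $ to the plain string fails:
# Minimal reproduction of the error message (assumption: bigglm accesses
# components of `family` with `$`, which glm() avoids by converting strings first)
fam <- "binomial"
fam$family                      # Error: $ operator is invalid for atomic vectors
fam <- binomial(link = "logit")
fam$family                      # "binomial" -- a proper family object works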

Related

Error when using regTermTest from R survey() package and MIResult object

I'm trying to perform a Wald test on an interaction term in a survey-weighted model with imputed data - see below for a toy reprex.
When I run the last three lines of code with regTermTest, using syntax modeled directly on the function documentation, I get the following error: Error in terms.default(model) : no terms component nor attribute.
A quick Google search suggests this error means I'm passing an unsupported object type to the function; I read this as regTermTest perhaps not supporting an MIResult passed to model. However, it seems that as of version 3.6 of the survey package, regTermTest supports MIResult models? This page seems to suggest that as well.
I'd appreciate any guidance on what I'm doing wrong. Alternatively, I'd be happy if someone knows how to get p-values on individual model terms from an MIResult object (e.g. as shown in this post for a regular model object).
# load packages
library(tidyverse)
library(survey)
library(mi)
library(mitools)
# load data on school performance included in survey package
# documentation available here: https://r-survey.r-forge.r-project.org/survey/html/api.html
data(api)
# remove problematic variables that are unnecessary for this example
apisub <- apiclus1 %>% select(-c("name", "sname", "dname", "cname", "flag",
                                 "acs.46", "acs.core"))
# create and update variable types in missing_data.frame
mdf <- missing_data.frame(apisub)
mdf <- change(mdf, "cds", what = "type", to = "irrelevant")
mdf <- change(mdf, "stype", what = "type", to = "irrelevant")
mdf <- change(mdf, "snum", what = "type", to = "irrelevant")
mdf <- change(mdf, "dnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "cnum", what = "type", to = "irrelevant")
mdf <- change(mdf, "fpc", what = "type", to = "irrelevant")
mdf <- change(mdf, "pw", what = "type", to = "irrelevant")
# summarize the missing_data.frame
show(mdf)
# impute missing data
imputations <- mi(mdf)
# create imputation list to pass to svydesign
imp_list <- complete(imputations, m = 5)
# create complex survey design using imputed data
dsn <- svydesign(id = ~dnum,
                 weights = ~pw,
                 data = imputationList(imp_list),
                 fpc = ~fpc)
# subset the survey design to remove schools that did not meet both targets
# just as an example of subsetting
dsn_sub <- subset(dsn, both == "No")
# specify analytic model
anl <- with(dsn_sub,
            svyglm(api99 ~ enroll + meals + avg.ed*ell,
                   family = gaussian(),
                   design = dsn))
# combine results into a single output
res <- MIcombine(anl)
# perform wald test for main and ixn terms
regTermTest(res, ~meals)
regTermTest(res, ~avg.ed:ell)
regTermTest(res, ~avg.ed*ell)
I didn't figure out how to do this with regTermTest, but I learned that you can run the following to get a global test for the interaction instead:
library(aod)
> aod::wald.test(Sigma = vcov(res),
+ b = coef(res),
+ Terms = 6)
Wald test:
----------
Chi-squared test:
X2 = 0.031, df = 1, P(> X2) = 0.86
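As a small addition of my own (not part of the answer above): aod::wald.test() selects coefficients by position, so rather than hard-coding Terms = 6 you can look up the position of the interaction coefficient by name, assuming coef(res) returns a named vector for the MIResult:
# look up the interaction coefficient's position by name (hypothetical helper)
ixn_pos <- grep("avg.ed:ell", names(coef(res)), fixed = TRUE)
aod::wald.test(Sigma = vcov(res), b = coef(res), Terms = ixn_pos)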

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package; however, I want to use mlogit. With the mlogit package I keep getting the following error: "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
                 age = runif(500, 12, 18),
                 var1 = sample(1:2, 500, replace = T),
                 var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But throws up an error when I try to reshape the data into mlogit format
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
Answer
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2,
         data = data.idx,
         reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
1. It loops over all imputed datasets.
2. It uses complete to get one dataset.
3. It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
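To make the three steps concrete, here is a rough, simplified sketch of what with.mids does (my paraphrase; see mice:::with.mids for the real code):
# simplified sketch of with.mids (paraphrased, not the actual mice source)
with_mids_sketch <- function(data, expr) {
  call <- substitute(expr)                      # capture the unevaluated expression
  analyses <- lapply(seq_len(data$m), function(i) {
    data.i <- mice::complete(data, i)           # (2) extract the i-th completed dataset
    eval(call, envir = data.i)                  # (3) evaluate the expression with data.i as the environment
  })
  structure(list(call = match.call(), nmis = data$nmis, analyses = analyses),
            class = c("mira", "matrix"))
}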
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! That is because the model is now actually fitted to the imputed datasets, which is what you were trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
  # work with a list, where each top-element is a different imputation run (imp_n)
  map(~as.list(.x)) %>%
  transpose %>%
  # for each run, impute and return the full (imputed) data set
  map(function(imp_n.x) {
    df_out <- df4$data
    df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
    df_out$age[is.na(df_out$age)] <- imp_n.x$age
    df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
    df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
    return(df_out)
  }) %>%
  # No errors with dfidx() now
  map(function(imp_n.x) {
    dfidx(data = imp_n.x, choice = "vax", shape = "wide")
  })
However, I'm not too familiar with mlogit(), so can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? That is probably why you are seeing the same coefficients: you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), in that you apply a function to the elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
  map(function(imp_n.x) {
    # form as specified in the comments
    mlogit(vax ~ 1 | age + var1 + var2,
           data = imp_n.x,
           reflevel = "1",
           nests = list(type1 = c('1', '2'),
                        type2 = c('3', '4'),
                        type3 = c('5', '6')))
  })

lm formula with variable names in it

I want to write a function that takes an lm model, tries to add some feature, and tests its statistical significance. I've given it a go with the code below:
library(rlang)
library(tidyverse)
dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- dataset %>%
  mutate(x10 = rnorm(100, 20, 9),
         x11 = rnorm(100, 3, 3))
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(y_name ~ ., data = dataset_new)
  return(summary(model_new))
}
As you can see, to create a new model from the available dataset I need to specify which variable should be the dependent variable. I don't know this name directly, however; I have to pull it out of the original model. I did that in the function above, but it results in an error:
Error in model.frame.default(formula = y_name ~ ., data = dataset_new, :
variable lengths differ (found for 'y')
Correct me if I'm wrong, but I believe this is because y_name is a string, not a symbol. So I have tried the following modification:
test_var <- function(data, var, model){
  y_name <- sym(names(model$model)[1])
  dataset_new <- data %>%
    select_at(vars(!!y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(eval(y_name) ~ ., data = dataset_new)
  return(summary(model_new))
}
Although it seems to work, the resulting model is a perfect fit, because y is used not only as the dependent variable but also as one of the features. Specifying the formula as eval(y_name) ~ . - eval(y_name) doesn't help here. So my question is: how should I pass the dependent variable name into the lm formula to build a correct model?
Since dataset_new contains the dependent variable in the first column, you may in fact use simply
lm(dataset_new)
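If you prefer an explicit formula, a common alternative (my own sketch, not part of the answer above) is to build the formula from the stored name with as.formula(), reusing the helper from the question:
# build the formula from the stored name so lm() sees a real formula object
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name, str_remove_all(labels(model), '`'), var))
  model_new <- lm(as.formula(paste(y_name, "~ .")), data = dataset_new)
  summary(model_new)
}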

Passing strings into 'contrasts' argument of lme/lmer

I am writing a sub-routine to return output from longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to be able to specify contrasts within these mixed-effects models; however, I am having trouble getting the contrasts argument to recognise the strings as the names of the variables referred to in the model specification within the same lme/lmer call.
Here's some toy data,
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the strings specified to in the contrasts argument refer to the variables passed into the reformulate() function?
You should be able to use setNames() on the list of contrasts to apply the full variable names to the list. The problem is that in list(g = contr.treatment(2), t = contr.treatment(4)) the element names are literally "g" and "t", not the strings stored in those variables, so lme cannot match them to factor columns in the data (hence "contrasts apply only to factors"). setNames() replaces them with the values of g and t:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g, "*", t), response = "score"),
             random = ~1|id,
             data = df,
             contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
               setNames(c(g, t)))
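Equivalently, without the magrittr pipe (a minor variant of the same idea, not a separate answer):
# build the named contrasts list up front, then pass it to lme
contr_list <- setNames(list(contr.treatment(2), contr.treatment(4)), c(g, t))
mod2R <- lme(reformulate(paste0(g, "*", t), response = dv),
             random = ~1|id, data = df, contrasts = contr_list)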

random model formula object

I want to pass a formula into a random-effects model, but I think the following error is due to a malformed formula object (?), and I could not fix it.
set.seed(1234)
mydata <- data.frame(A = rep(1:3, each = 20), B = rep(1:2, each = 30),
                     C = rnorm(60, 10, 5))
mydata$A <- as.factor(mydata$A)
mydata$B <- as.factor(mydata$B)
myfunction <- function (mydata, yvars, genovar, replication) {
  require("lme4")
  formula = paste ("yvars" ~ 1|"genovar" + 1|"replication")
  model1 <- lmer(formula, data = dataframe, REML = TRUE)
  return(ranef(model2))
}
myfunction(mydata=dataf, yvars = "C", genovar = "A", replication = "B")
Error: length(formula <- as.formula(formula)) == 3 is not TRUE
There were several wonky things in here, but I think this is close to what you want.
set.seed(1234)
mydata <- data.frame(A = factor(rep(1:3, each = 20)),
                     B = factor(rep(1:2, each = 30)),
                     C = rnorm(60, 10, 5))
require("lme4")
myfunction <- function (mydata, yvars, genovar, replication) {
  formula <- paste(yvars, "~ (1|", genovar, ") + (1|", replication, ")")
  model1 <- lmer(as.formula(formula), data = mydata, REML = TRUE)
  return(ranef(model1))
}
myfunction(mydata=mydata, yvars = "C", genovar = "A", replication = "B")
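A small variant (my own sketch, not part of the answer) builds the same formula with reformulate() instead of pasting one string together:
# build the random-effects formula from the variable names with reformulate()
myfunction2 <- function(mydata, yvars, genovar, replication) {
  form <- reformulate(c(sprintf("(1 | %s)", genovar), sprintf("(1 | %s)", replication)),
                      response = yvars)
  model <- lmer(form, data = mydata, REML = TRUE)
  ranef(model)
}
myfunction2(mydata = mydata, yvars = "C", genovar = "A", replication = "B")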
Beware, however, that lmer doesn't work the way that classical random-effects ANOVA does -- it may perform very badly with such small numbers of replicates. (In the example I tried, it set the variance of A to zero, which is at least not unreasonable.) The GLMM FAQ has some discussion of this issue. (Random-effects ANOVA would have exceedingly low power in that case but might not be quite as bad.) If you really want to fit random-effects models on such small samples, you might want to reconstruct the classical method-of-moments approach (as I recall there is/was a raov function in S-PLUS that did random-effects ANOVA, but I don't know whether it was ever implemented in R).
Finally, for future questions along these lines you may do better on the r-sig-mixed-models@r-project.org mailing list -- Stack Overflow is nice but there is more R/mixed-model expertise over there.
