I want to write a function that takes an lm model, tries adding a feature to it, and tests that feature's statistical significance. I've given it a go with the following code:
library(rlang)
library(tidyverse)
dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- dataset %>%
  mutate(x10 = rnorm(100, 20, 9),
         x11 = rnorm(100, 3, 3))
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(y_name ~ ., data = dataset_new)
  return(summary(model_new))
}
As you can see, to build a new model from the available dataset I need to specify which variable is the dependent variable. However, I don't know its name in advance; I have to pull it out of the original model. That is what the function above does, but it results in an error:
Error in model.frame.default(formula = y_name ~ ., data = dataset_new, :
variable lengths differ (found for 'y')
Correct me if I'm wrong, but I believe this is due to y_name being a string, not a symbol. So I tried the following edits:
test_var <- function(data, var, model){
  y_name <- sym(names(model$model)[1])
  dataset_new <- data %>%
    select_at(vars(!!y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(eval(y_name) ~ ., data = dataset_new)
  return(summary(model_new))
}
Although it seems to work, the resulting model is a perfect fit, as y is taken not only as dependent variable, but also as one of the features. Specifying formula with eval(y_name) ~ . - eval(y_name) doesn't help here. So my question is: how should I pass the dependent variable name to lm formula to build a correct model?
Since dataset_new contains the dependent variable in its first column, you can simply use
lm(dataset_new)
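If you would rather keep an explicit formula, one option is to build it from strings with base R's reformulate(). The following is only a sketch, not necessarily what the answer above has in mind: test_var_formula is a hypothetical name, the data are regenerated with base R only, and it assumes the response of the original model is a plain column (not a transformed term such as log(y)).

```r
set.seed(1)
dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- transform(dataset,
                      x10 = rnorm(100, 20, 9),
                      x11 = rnorm(100, 3, 3))

# Hypothetical helper: refit the original model with one extra predictor.
test_var_formula <- function(data, var, model) {
  # Name of the response, taken from the model frame of the fitted model
  y_name <- names(model.frame(model))[1]
  # Predictors already in the model (the dot in y ~ . is stored expanded)
  old_terms <- attr(terms(model), "term.labels")
  # Build y ~ x1 + x2 + x3 + <var> from character strings
  new_formula <- reformulate(c(old_terms, var), response = y_name)
  summary(lm(new_formula, data = data))
}

res <- test_var_formula(dataset2, "x10", model1)
rownames(coef(res))
```

Because the response appears only on the left-hand side of the generated formula, it is not also used as a predictor, which avoids the perfect fit described in the question.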
Related
Let's consider the following data:
set.seed(42)
y <- runif(100)
df <- data.frame("Exp" = rexp(100), "Norm" = rnorm(100), "Wei" = rweibull(100, 1))
I want to perform a linear regression, but the right-hand side of the formula is given as a string of the form:
form <- "Exp + Norm + Wei"
I thought that I only have to use:
as.formula(lm(y~form, data = df))
However, this doesn't work: the error says that the variable lengths differ. (It seems that form is still treated as a character vector of length 1, but I have no idea why.)
Do you know how I can do this?
We can use paste to construct the formula, and use it directly on lm
lm(paste('y ~', form), data = df)
-output
#Call:
#lm(formula = paste("y ~", form), data = df)
#Coefficients:
#(Intercept) Exp Norm Wei
# 0.495861 0.026988 0.046689 0.003612
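As a small sketch of the same idea, the string can also be converted with as.formula() first, so that the formula is available as an object in its own right. The names f, fit and fit_direct are illustrative only; the comparison simply checks that the string-built formula gives the same coefficients as the formula written out by hand.

```r
set.seed(42)
y <- runif(100)
df <- data.frame(Exp = rexp(100), Norm = rnorm(100), Wei = rweibull(100, 1))
form <- "Exp + Norm + Wei"

# Build the full formula text, then convert it to a formula object
f <- as.formula(paste("y ~", form))
fit <- lm(f, data = df)

# Reference fit with the formula typed out directly
fit_direct <- lm(y ~ Exp + Norm + Wei, data = df)

all.equal(coef(fit), coef(fit_direct))
```

In the original attempt, y ~ form treats form as a variable in the model, and that variable is a character vector of length 1 while y has length 100, hence the complaint about differing lengths.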
I know variations of this question have been asked before but I haven't yet seen an answer on how to implement the multinomial Poisson transformation with multilevel models.
I decided to make a fake dataset and follow the method outlined here, also consulting the notes the poster mentions as well as the Baker paper on MP transformation.
In order to check if I'm doing the coding correctly, I decided to create a binary outcome variable as a first step; because glmer can handle binary response variables, this will let me check I'm correctly recasting the logit regression as multiple Poissons.
The context of this problem is running multilevel regressions with survey data where the outcome variable is response to a question and the possible predictors are demographic variables. As I mentioned above, I wanted to see if I could properly code the binary outcome variable as a Poisson regression before moving on to multi-level outcome variables.
library(dplyr)
library(lme4)
key <- expand.grid(sex = c('Male', 'Female'),
                   age = c('18-34', '35-64', '45-64'))
set.seed(256)
probs <- runif(nrow(key))
# Make a fake dataset with 1000 responses
n <- 1000
df <- data.frame(sex = sample(c('Male', 'Female'), n, replace = TRUE),
                 age = sample(c('18-34', '35-64', '45-64'), n, replace = TRUE),
                 obs = seq_len(n), stringsAsFactors = FALSE)
age <- model.matrix(~ age, data = df)[, -1]
sex <- model.matrix(~ sex, data = df)[, -1]
beta_age <- matrix(c(0, 1), nrow = 2, ncol = 1)
beta_sex <- matrix(1, nrow = 1, ncol = 1)
# Create class probabilities as a function of age and sex
probs <- plogis(
  -0.5 +
    age %*% beta_age +
    sex %*% beta_sex +
    rnorm(n)
)
id <- ifelse(probs > 0.5, 1, 0)
df$y1 <- id
df$y2 <- 1 - df$y1
# First run the regular hierarchical logit, just with a varying intercept for age
glm_out <- glmer(y1 ~ (1|age), family = 'binomial', data = df)
summary(glm_out)
#Next, two Poisson regressions
glm_1 <- glmer(y1 ~ (1|obs) + (1|age), data = df, family = 'poisson')
glm_2 <- glmer(y2 ~ (1|obs) + (1|age), data = df, family = 'poisson')
coef(glm_1)$age - coef(glm_2)$age
coef(glm_out)$age
The outputs for the last two lines are:
> coef(glm_1)$age - coef(glm_2)$age
(Intercept)
18-34 0.14718933
35-64 0.03718271
45-64 1.67755129
> coef(glm_out)$age
(Intercept)
18-34 0.13517758
35-64 0.02190587
45-64 1.70852847
These estimates are close, but they are not exactly the same. I suspect I have mis-specified the intercept in one of the equations.
I am writing a sub-routine to return output of longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to be able to specify contrasts within these mixed-effects models, however I am having trouble with getting the contrasts() argument to recognise the strings as the variable names referred to in the model specification within the same lme/lme4 call.
Here's some toy data,
library(nlme) # provides lme()
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the names given in the contrasts argument refer to the variables passed into the reformulate() function?
You should be able to use setNames() on the list of contrasts to apply the full names to the list:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"),
             random = ~1|id,
             data = df,
             contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
               setNames(c(g, t))
)
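To see why the renaming matters, here is a minimal base-R sketch (bad and good are illustrative names only). Inside list(), g = and t = are taken as literal element names rather than as the values stored in the variables g and t, so the contrasts list ends up with names that match no column in the data.

```r
g <- "group"
t <- "time"

# Element names are the literal symbols written in the call
bad <- list(g = contr.treatment(2), t = contr.treatment(4))
names(bad)   # "g" "t"

# setNames() replaces them with the strings held in g and t
good <- setNames(list(contr.treatment(2), contr.treatment(4)), c(g, t))
names(good)  # "group" "time"
```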
In the modelr package, the function gather_predictions can be used to add predictions from multiple models to a data frame. However, I'm unsure how to specify these models in the function call. The help documentation gives the following example:
df <- tibble::data_frame(
x = sort(runif(100)),
y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)
m1 <- lm(y ~ x, data = df)
grid <- data.frame(x = seq(0, 1, length = 10))
grid %>% add_predictions(m1)
m2 <- lm(y ~ poly(x, 2), data = df)
grid %>% spread_predictions(m1, m2)
grid %>% gather_predictions(m1, m2)
Here the models are named explicitly in the function call. That works fine if we have a few models we want predictions for, but what if we have a large or unknown number of models? In that case, manually specifying the models isn't really workable anymore.
The way the help documentation phrases the arguments section seems to suggest that you need to add every model as a separate argument.
gather_predictions and spread_predictions take multiple models. The
name will be taken from either the argument name of the name of the
model.
And for example inputting a list of models into gather_predictions doesn't work.
Is there some easy way to input a list / large amount of models to gather_predictions?
Example for 10 models in a list:
modelslist <- list()
for (N in 1:10) {
  modelslist[[N]] <- lm(y ~ poly(x, N), data = df)
}
If having the models stored some other way than a list works better, that's fine as well.
m <- grid %>% gather_predictions(lm(y ~ poly(x, 1), data = df))
for (N in 2:10) {
  m <- rbind(m, grid %>% gather_predictions(lm(y ~ poly(x, N), data = df)))
}
There are workarounds to solve this problem. My approach was to:
1. build a list of models with specific names
2. use a tweaked version of modelr::gather_predictions() to apply all models in the list to data
# prerequisites
library(tidyverse)
set.seed(1363)
# I'll use generic name 'data' throughout the code, so you can easily try other datasets.
# for this example I'll use your data df
data=df
# data visualization
ggplot(data, aes(x, y)) +
geom_point(size=3)
(Figure: your sample data)
# build a list of models
models <- vector("list", length = 5)
model_names <- vector("character", length = 5)
for (i in 1:5) {
  modelformula <- str_c("y ~ poly(x,", i, ")", sep = "")
  models[[i]] <- lm(as.formula(modelformula), data = data)
  model_names[[i]] <- str_c('model', i) # remember we name the models here sequentially
}
# apply names to the models list
names(models) <- model_names
# this is a modified version of modelr::gather_predictions() that accepts a list of models
gather.predictions <- function (data, models, .pred = "pred", .model = "model")
{
  df <- map2(models, .pred, modelr::add_predictions, data = data)
  names(df) <- names(models)
  bind_rows(df, .id = .model)
}
# the rest is the same as modelr's function...
grids <- gather.predictions(data = data, models = models, .pred = "y")
ggplot(data, aes(x, y)) +
geom_point() +
geom_line(data = grids, colour = "red") +
facet_wrap(~ model)
(Figure: example of polynomial models (degree 1:5) applied to your sample data)
side note: there are good reasons why I chose strings to build the model...to discuss.
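For comparison, the same "one row per grid point per model" layout can be sketched without modelr, using only base R's predict(). This is an illustrative alternative under stated assumptions, not the modelr implementation: the data are regenerated here to keep the block self-contained, and the names grid, models and preds are made up for the example.

```r
set.seed(1363)
df <- data.frame(x = sort(runif(100)))
df$y <- 5 * df$x + 0.5 * df$x^2 + 3 + rnorm(nrow(df))
grid <- data.frame(x = seq(0, 1, length.out = 10))

# Named list of polynomial models; formulas are built from strings
models <- lapply(1:5, function(n) {
  lm(as.formula(paste0("y ~ poly(x, ", n, ")")), data = df)
})
names(models) <- paste0("model", 1:5)

# Predict on the grid for each model and stack the results in long format
preds <- do.call(rbind, lapply(names(models), function(nm) {
  data.frame(model = nm, grid, pred = predict(models[[nm]], newdata = grid))
}))

head(preds)
```

The result has a model column identifying which fit each prediction came from, so it can be passed to ggplot2 and faceted by model in the same way as the grids object above.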
I want to take a list of matched data sets (where observations are being matched on their propensity scores, using the MatchIt Package) for subsequent modelling in the Zelig Package.
In this example, there are two treatments I'll match on (t1 and t2), two independent variables (x1 and x2), and an outcome (y1).
library(Zelig)
library(MatchIt)
library(plyr)
d1 <- data.frame(y1 = rbinom(100, 1, .5),
                 x1 = runif(100),
                 x2 = runif(100),
                 t1 = rbinom(100, 1, .5),
                 t2 = rbinom(100, 1, .5))
First, I'll make a list of matched data frames:
list.dfs <- llply(c("t1", "t2"),
                  function(i)
                    matchit(as.formula(paste0(i, "~ x1 + x2")), data = d1))
Just a check--each element of list.dfs has the right class:
class(list.dfs[[1]])
[1] "matchit"
Next, I want to take each matched data frame from this list and make a list of Zelig model objects:
list.mods <- llply(list.dfs,
                   function(i)
                     zelig(y1 ~ x1 + x2, model = "logit", data = match.data(i)))
Which provides the following error:
Error in match.data(i) : object 'i' not found
But this is clearly something to do with the list, since everything works if I do the same function to a single element of list.dfs:
class(zelig(y1 ~ x1 + x2, model = "logit", data = match.data(list.dfs[[1]])))
[1] "zelig" "logit"
What am I missing? How can I get Zelig to work on separate items in this list?
There seems to be some weird stuff inside zelig that looks for the value of data by name. Looks like you're going to have to do an explicit loop:
list.mods <- list()
for(i in seq_along(list.dfs)) {
  list.mods[[i]] <- zelig(y1 ~ x1 + x2, model = "logit", data = match.data(list.dfs[[i]]))
}
list.mods