How to run regression when formula is given by a string?

Consider the following data:
set.seed(42)
y <- runif(100)
df <- data.frame("Exp" = rexp(100), "Norm" = rnorm(100), "Wei" = rweibull(100, 1))
I want to run a linear regression where the right-hand side of the formula is given as a string:
form <- "Exp + Norm + Wei"
I thought I only had to use:
as.formula(lm(y~form, data = df))
However, this does not work. The error says that the variable lengths differ; it seems form is still treated as a character vector of length 1, but I have no idea why.
Do you know how I can do this?

We can use paste to build the full formula string and pass it directly to lm, which coerces the string to a formula
lm(paste('y ~', form), data = df)
Output:
#Call:
#lm(formula = paste("y ~", form), data = df)
#Coefficients:
#(Intercept) Exp Norm Wei
# 0.495861 0.026988 0.046689 0.003612
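A variant of the same idea, as a minimal sketch: building an explicit formula object with as.formula or reformulate (both in base R's stats package) and keeping the response inside the data frame, so the fit does not depend on where y happens to live. The data frame df2 with a y column is an assumption made for this sketch; the question keeps y as a separate vector.

```r
set.seed(42)
y <- runif(100)
df <- data.frame(Exp = rexp(100), Norm = rnorm(100), Wei = rweibull(100, 1))
form <- "Exp + Norm + Wei"

# Assumption for this sketch: store the response next to the predictors
df2 <- data.frame(y = y, df)

# Two equivalent ways to turn the string into a formula object
f1 <- as.formula(paste("y ~", form))
f2 <- reformulate(form, response = "y")

fit1 <- lm(f1, data = df2)
fit2 <- lm(f2, data = df2)

# Both fits estimate the same coefficients
all.equal(coef(fit1), coef(fit2))
```

An explicit formula object also makes the printed call easier to read than a paste(...) expression.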

Related

Dummies not included in summary

I want to create a function that performs a panel regression with three-level dummies included.
Consider a within model with time effects:
library(plm)
fit_panel_lr <- function(y, x) {
  x[, length(x) + 1] <- y
  # adding dummies
  mtx <- matrix(0, nrow = nrow(x), ncol = 3)
  mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
  colnames(mtx) <- paste0("dummy_", 1:3)
  # converting to pdata.frame and adding dummy variables
  x <- pdata.frame(x)
  x <- cbind(x, mtx)
  # performing panel regression
  varnames <- names(x)[3:(length(x))]
  varnames <- varnames[!(varnames == names(y))]
  form <- paste0(varnames, collapse = "+")
  x_copy <- data.frame(x)
  form <- as.formula(paste0(names(y), "~", form, '-1'))
  params <- list(
    formula = form, data = x_copy, model = "within",
    effect = "time"
  )
  pglm_env <- list2env(params, envir = new.env())
  model_plm <- do.call("plm", params, envir = pglm_env)
  model_plm
}
However, if I use this data:
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
the output is:
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
Why do the dummies appear in the formula but not among the coefficients? Is there a rational explanation, or did I do something wrong?

The reason you do not see the dummies in the output is that they are linearly dependent on the other data after the fixed-effects time transformation. They are dropped, so only what is estimable is estimated and printed.
Below is some (not readily executable) code that picks up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep, included in package plm) has more examples of linear dependence before and after the FE transformation.
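Why the dummies vanish can be illustrated without plm. A time fixed-effects ("within") transformation subtracts the per-period mean from every variable. The dummies in the question are computed from the year alone, so they are constant within each period and turn into all zeros after demeaning. The following is a minimal base-R sketch with made-up data (the names toy, wage and dummy_1 are only illustrative), and it is not meant to reproduce what plm does internally:

```r
set.seed(1)
# Toy panel: 4 individuals observed in 6 years
toy <- data.frame(
  id   = rep(1:4, each = 6),
  year = rep(2001:2006, times = 4),
  wage = rnorm(24)
)
# Dummy built from the year only, in the spirit of the question's code
toy$dummy_1 <- as.numeric((toy$year - min(toy$year)) %% 3 == 0)

# Time-demeaning: subtract the per-year mean from a variable
demean_by_year <- function(v) v - ave(v, toy$year)

wage_within  <- demean_by_year(toy$wage)
dummy_within <- demean_by_year(toy$dummy_1)

# The dummy carries no within-year variation, wage still does
all(dummy_within == 0)
any(wage_within != 0)
```

A column of zeros cannot get a coefficient, which is why such regressors are dropped.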
A suggestion:
As for constructing dummy variables, I suggest using an R factor with three levels rather than building the dummy matrix yourself. Using a factor is typically more convenient and less error-prone. It is converted to binary dummies (treatment coding) by your typical estimation function via the model.frame/model.matrix framework.
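To make the factor suggestion concrete, here is a small sketch with toy data (the names toy, y and group are made up) showing how model.matrix expands a three-level factor into treatment-coded dummies:

```r
# Toy data: a three-level grouping stored as a factor
toy <- data.frame(
  y     = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# model.matrix expands the factor into treatment-coded dummies;
# the first level ("a") is the reference and gets no column of its own
mm <- model.matrix(y ~ group, data = toy)
colnames(mm)

# lm does the same expansion internally
fit <- lm(y ~ group, data = toy)
names(coef(fit))
```

With an intercept in the model only two of the three levels get a dummy column, which avoids the dummy-variable trap automatically.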

lm formula with variable names in it

I want to write a function that takes an lm model, tries adding a feature, and tests its statistical significance. I gave it a go with the following code:
library(rlang)
library(tidyverse)
dataset <- data.frame(y = rnorm(100, 2, 3),
                      x1 = rnorm(100, 0, 4),
                      x2 = rnorm(100, 2, 1),
                      x3 = rnorm(100, 9, 1))
model1 <- lm(y ~ ., data = dataset)
dataset2 <- dataset %>%
  mutate(x10 = rnorm(100, 20, 9),
         x11 = rnorm(100, 3, 3))
test_var <- function(data, var, model){
  y_name <- names(model$model)[1]
  dataset_new <- data %>%
    select_at(vars(y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(y_name ~ ., data = dataset_new)
  return(summary(model_new))
}
As you can see, to create a new model from the available dataset I need to specify which variable is the dependent variable. However, I don't know its name directly; I have to pull it out of the original model. I did that in the function above, but it results in an error:
Error in model.frame.default(formula = y_name ~ ., data = dataset_new, :
variable lengths differ (found for 'y')
Correct me if I'm wrong, but I believe this is due to y_name being a string, not a symbol. So I tried the following changes:
test_var <- function(data, var, model){
  y_name <- sym(names(model$model)[1])
  dataset_new <- data %>%
    select_at(vars(!!y_name,
                   str_remove_all(labels(model), '`'),
                   var))
  model_new <- lm(eval(y_name) ~ ., data = dataset_new)
  return(summary(model_new))
}
Although it seems to work, the resulting model is a perfect fit, because y is used not only as the dependent variable but also as one of the features: the response term eval(y_name) does not match the column name y, so the . still expands to include y. Specifying the formula as eval(y_name) ~ . - eval(y_name) doesn't help here. So my question is: how should I pass the dependent variable name to the lm formula to build a correct model?

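Since dataset_new contains the dependent variable in its first column, you can in fact simply use the following; when lm is given a data frame, it treats the first column as the response and the remaining columns as predictors: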
lm(dataset_new)
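If an explicit formula is preferred, the response name can stay a plain string and be handed to reformulate. Below is a minimal sketch; test_var2 is a hypothetical rewrite of the question's test_var, and the toy data only mimics the question's setup:

```r
set.seed(1)
dataset2 <- data.frame(
  y   = rnorm(100, 2, 3),
  x1  = rnorm(100, 0, 4),
  x2  = rnorm(100, 2, 1),
  x10 = rnorm(100, 20, 9)
)
model1 <- lm(y ~ x1 + x2, data = dataset2)

# Hypothetical rewrite of test_var: build the formula from names
test_var2 <- function(data, var, model) {
  y_name  <- names(model$model)[1]  # response name as a string
  x_names <- c(attr(terms(model), "term.labels"), var)
  lm(reformulate(x_names, response = y_name), data = data)
}

fit_a <- test_var2(dataset2, "x10", model1)

# Same model via the data-frame shortcut: first column is the response
fit_b <- lm(dataset2[c("y", "x1", "x2", "x10")])

all.equal(coef(fit_a), coef(fit_b))
```

The explicit formula has the side benefit that summary(fit_a) prints a readable call.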

Dynamically create model formula in a loop [duplicate]

Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some of the other columns. Usually a formula and a model are used for that purpose. For example:
linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)
That does the job well if the formula is coded statically. If one wants to loop over several models with a constant number of explanatory variables (say, 2), it can be handled like this:
for (i in seq_len(factor_number)) {
  for (j in seq(i + 1, factor_number)) {
    linear_model <- lm(Y ~ F1 + F2, list(Y=foo_data_frame$Y,
                                         F1=foo_data_frame[[i]],
                                         F2=foo_data_frame[[j]]))
    # linear_model further analyzing...
  }
}
My question is how to achieve the same effect when the number of variables changes dynamically while the program is running?
for (number_of_factors in seq_len(5)) {
  # Then loop over subsets of cardinality number_of_factors.
  for (factors_subset in all_subsets_with_fixed_cardinality) {
    # Here I want to fit model with factors from factors_subset.
    linear_model <- lm(Does R provide smth to write here?)
  }
}

See ?as.formula, e.g.:
factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2
where factors is a character vector containing the names of the factors you want to use in the model. You can then pass this formula to lm, e.g.:
set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))
# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))
# Coefficients:
# (Intercept) factor1 factor2
# 0.542471 -0.002525 -0.147433
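To connect this back to the loop in the question, here is a minimal sketch that enumerates predictor subsets with combn and fits one model per subset. The data frame and its column names A, B and C are made up for illustration.

```r
set.seed(0)
foo_data_frame <- data.frame(
  Y = rnorm(50),
  A = rnorm(50),
  B = rnorm(50),
  C = rnorm(50)
)
candidates <- setdiff(names(foo_data_frame), "Y")

models <- list()
for (k in seq_along(candidates)) {
  # combn(..., simplify = FALSE) returns a list of character vectors
  for (subset_k in combn(candidates, k, simplify = FALSE)) {
    f <- as.formula(paste("Y ~", paste(subset_k, collapse = " + ")))
    models[[paste(subset_k, collapse = "+")]] <- lm(f, data = foo_data_frame)
  }
}

length(models)  # one model per non-empty subset of the 3 candidates
names(models)
```

Passing data = foo_data_frame means the variables are looked up in the data frame rather than in the calling environment, which matters once this code moves into a function.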

An oft forgotten function is reformulate. From ?reformulate:
reformulate creates a formula from a character vector.
A simple example:
listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')
will yield this formula:
y ~ factor1 + factor2
Although not explicitly documented, you can also add interaction terms:
listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors),
response = 'y')
will yield:
y ~ factor1 + factor2 + (factor3 + factor4)^2
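As a small sketch of how this plugs into an actual fit (toy data; the column names are borrowed from the example above), an interaction can also be passed as a single term label, and the intercept argument of reformulate drops the intercept:

```r
set.seed(0)
d <- data.frame(
  y       = rnorm(40),
  factor1 = rnorm(40),
  factor3 = rnorm(40),
  factor4 = rnorm(40)
)

# "factor3 * factor4" expands to both main effects plus their interaction
f <- reformulate(c("factor1", "factor3 * factor4"), response = "y")
fit <- lm(f, data = d)
attr(terms(f), "term.labels")

# Same idea without an intercept
f0 <- reformulate("factor1", response = "y", intercept = FALSE)
attr(terms(f0), "intercept")
```

Writing the interaction as one string avoids splitting the parentheses across two vector elements.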

Another option could be to use a matrix in the formula:
Y = rnorm(10)
foo = matrix(rnorm(100),10,10)
factors=c(1,5,8)
lm(Y ~ foo[,factors])

You don't actually need a formula. This works:
lm(data_frame[c("Y", "factor1", "factor2")])
as does this:
v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
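The two variants fit the same model; the difference, as far as I can tell, is only in the call that lm records. A short sketch with made-up data:

```r
set.seed(0)
data_frame <- data.frame(Y = rnorm(30), factor1 = rnorm(30), factor2 = rnorm(30))
v <- c("Y", "factor1", "factor2")

fit_plain  <- lm(data_frame[v])
fit_bquote <- do.call("lm", list(bquote(data_frame[.(v)])))

fit_plain$call   # records the symbol v
fit_bquote$call  # records the column names themselves

all.equal(coef(fit_plain), coef(fit_bquote))
```

The bquote version is handy in loops, where a call that only mentions v would say nothing about which columns were used.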

I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.
model_response <- "response_field_name"
setnames(model_data_train, c(model_response), "response") #if using data.table
model_gbm <- gbm(response ~ ., data=model_data_train, ...)
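The same renaming trick works in base R without data.table. A minimal sketch, using lm in place of gbm and invented column names (sales, price, promo):

```r
set.seed(0)
model_data_train <- data.frame(
  sales = rnorm(30),
  price = rnorm(30),
  promo = rnorm(30)
)
model_response <- "sales"

# Work on a copy so the original column names stay untouched
train <- model_data_train
names(train)[names(train) == model_response] <- "response"

# The formula is now static, whatever the response was called
fit <- lm(response ~ ., data = train)
names(coef(fit))
```

Note that data.table's setnames renames by reference, so with that approach the original table is modified, whereas the copy here leaves it alone.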

