Dropping variable in lm formula still triggers contrast error - r

I'm trying to run lm() on only a subset of my data, and running into an issue.
library(data.table)
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data
lm( y ~ ., dt) # Use all x: Works
lm( y ~ ., dt[x3 == 'men']) # Use all x, limit to men: doesn't work (as expected)
The above doesn't work because the dataset now has only men, and we therefore can't
include x3, the gender variable, in the model. BUT...
lm( y ~ . -x3, dt[x3 == 'men']) # Exclude x3, limit to men: STILL doesn't work
lm( y ~ x1 + x2, dt[x3 == 'men']) # Exclude x3, with different notation: works great
Is this an issue with the "minus sign" notation in the formula? Please advise. Note: Of course I can do it a different way; for example, I could exclude the variables before passing the data to lm(). But I'm teaching a class on this, and I don't want to confuse the students, having already told them they can exclude a variable using a minus sign in the formula.

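The error occurs because x3 is still part of the model frame, where it has only one level, "men" (see the comment from @Artem Sokolov). The minus sign removes x3 from the model terms, but lm() still evaluates every variable named in the expanded formula, so the contrast check fails on the single-level factor.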
One way to solve it is to subset ahead of time:
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data
dmen<-dt[x3 == 'men'] # create a new subsetted dataset with just men
lm( y ~ ., dmen[,-"x3"]) # now drop the x3 column from the dataset (just for the model)
Or you can do both in the same step:
lm( y ~ ., dt[x3 == 'men',-"x3"])
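A minimal sketch of why the minus sign does not help, using a base data.frame instead of data.table so that it runs without extra packages. The names d, men, and rhs are illustrative and not from the original post. It shows that `. - x3` removes x3 from the model terms while the fit still fails, and that building the formula from the remaining column names is one way to stay with formula notation:

```r
set.seed(1)
d <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = factor(rep(c("men", "women"), each = 50))
)
men <- d[d$x3 == "men", ]

# `. - x3` drops x3 from the model terms...
tt <- terms(y ~ . - x3, data = men)
attr(tt, "term.labels")

# ...but the fit still fails on the men-only subset, as in the question
res <- try(lm(y ~ . - x3, data = men), silent = TRUE)
inherits(res, "try-error")

# Workaround: build the formula from the columns that should stay
rhs <- setdiff(names(men), c("y", "x3"))
fit <- lm(reformulate(rhs, response = "y"), data = men)
names(coef(fit))
```

For teaching purposes, this keeps the idea of "all columns except x3" while making sure x3 never appears in the formula at all.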

How to do rolling regression against multiple independent variables using the roll_lm function

I'm trying to regress returns against FF 3-factors with a rolling window.
To do so, I have found the function roll_lm in R, but the function is only producing regression output for one of the 3 variables.
The code is described here:
Y <- as.matrix(Portfolio_returns[,2])
X1 <- as.matrix(Mydata[,2])
X2 <- as.matrix(Mydata[,3])
X3 <- as.matrix(Mydata[,4])
Five_years_Rolling_reg <- roll_lm(X1 + X2 + X3,Y,60)
When I apply the coef function, I only get output for X1 and not X2 nor X3.
What am I doing wrong?
Your problem seems to be a basic misunderstanding of how the function works. Looking at ?roll_lm:
Arguments
x
matrix or xts object. Rows are observations and columns are the independent variables.
Currently you seem to be using a formula-style input (X1 + X2 + X3), which is not what the help page describes. As a result, the columns are added together element-wise, as if it were: x1 = 2; x2 = 3; x1 + x2 = 5
Instead you should bind the columns together into one matrix.
Y <- as.matrix(Portfolio_returns[,2])
X <- as.matrix(Mydata[,2:4])
roll_lm(X, Y, 60)
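The difference between adding and binding can be checked in base R without the roll package. This is a small sketch with made-up single-column matrices standing in for X1, X2, and X3:

```r
set.seed(1)
n <- 5
X1 <- matrix(rnorm(n), ncol = 1)
X2 <- matrix(rnorm(n), ncol = 1)
X3 <- matrix(rnorm(n), ncol = 1)

summed <- X1 + X2 + X3       # element-wise sum: still a single column
bound  <- cbind(X1, X2, X3)  # three separate columns, one per regressor

dim(summed)
dim(bound)
```

Only the bound matrix gives a regression function three independent variables to work with.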
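Alternatively, use the model.frame, model.response, and model.matrix functions from base R, which let you keep the familiar formula interface. Note that model.matrix adds an "(Intercept)" column; since roll_lm appears to add its own intercept by default, that column is dropped here.
names(Mydata)[1:4] <- c("Y", "X1", "X2", "X3")
frame <- model.frame(Y ~ X1 + X2 + X3, data = Mydata)
X <- model.matrix(Y ~ X1 + X2 + X3, data = Mydata)[, -1, drop = FALSE] # drop the "(Intercept)" column
roll_lm(X, model.response(frame), 60)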
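To see what the base R helpers return before handing the result to any rolling function, here is a sketch on a made-up data frame (Mydata below is simulated, not the asker's data). It shows the intercept column that model.matrix() includes and how to separate the response:

```r
set.seed(1)
Mydata <- data.frame(Y = rnorm(10), X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10))

frame  <- model.frame(Y ~ X1 + X2 + X3, data = Mydata)
X_full <- model.matrix(Y ~ X1 + X2 + X3, data = Mydata)
colnames(X_full)  # the first column is the intercept

# Keep only the regressors if the fitting function adds its own intercept
X <- X_full[, -1, drop = FALSE]
y <- model.response(frame)
```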

Dynamically create model formula in a loop [duplicate]

Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some of the other columns. Usually a formula and a model are used for that purpose. For example:
linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)
That does the job well if the formula is coded statically. If one wants to loop over several models with a constant number of independent variables (say, 2), it can be handled like this:
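for (i in seq_len(factor_number - 1)) {
  # start at i + 1 so that each pair is used once; stopping i at
  # factor_number - 1 avoids seq() counting backwards on the last pass
  for (j in (i + 1):factor_number) {
    linear_model <- lm(Y ~ F1 + F2, list(Y = foo_data_frame$Y,
                                         F1 = foo_data_frame[[i]],
                                         F2 = foo_data_frame[[j]]))
    # analyze linear_model further...
  }
}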
My question is how to achieve the same effect when the number of variables changes dynamically while the program is running?
for (number_of_factors in seq_len(5)) {
  # Then loop over subsets of cardinality number_of_factors.
  for (factors_subset in all_subsets_with_fixed_cardinality) {
    # Here I want to fit a model with the factors from factors_subset.
    linear_model <- lm(Does R provide something to write here?)
  }
}
See ?as.formula, e.g.:
factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2
where factors is a character vector containing the names of the factors you want to use in the model. You can pass the resulting formula to lm, e.g.:
set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))
# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))
# Coefficients:
# (Intercept) factor1 factor2
# 0.542471 -0.002525 -0.147433
An oft forgotten function is reformulate. From ?reformulate:
reformulate creates a formula from a character vector.
A simple example:
listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')
will yield this formula:
y ~ factor1 + factor2
Although not explicitly documented, you can also add interaction terms:
listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors),
response = 'y')
will yield:
y ~ factor1 + factor2 + (factor3 + factor4)^2
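To tie this back to the original question, here is one possible sketch of looping over all predictor subsets of every size by combining combn() with reformulate(). The simulated data frame and the column names A, B, and C are made up for illustration:

```r
set.seed(42)
foo_data_frame <- data.frame(
  Y = rnorm(30),
  A = rnorm(30),
  B = rnorm(30),
  C = rnorm(30)
)
predictors <- setdiff(names(foo_data_frame), "Y")

models <- list()
for (k in seq_along(predictors)) {
  # with simplify = FALSE, combn() returns a list with one
  # character vector per subset of size k
  for (s in combn(predictors, k, simplify = FALSE)) {
    f <- reformulate(s, response = "Y")
    models[[paste(s, collapse = "+")]] <- lm(f, data = foo_data_frame)
  }
}

names(models)
```

Storing the fits in a named list keeps each model retrievable by its predictor set, for example models[["A+C"]].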
Another option could be to use a matrix in the formula:
Y = rnorm(10)
foo = matrix(rnorm(100),10,10)
factors=c(1,5,8)
lm(Y ~ foo[,factors])
You don't actually need a formula. This works:
lm(data_frame[c("Y", "factor1", "factor2")])
as does this:
v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
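The formula-free call works because a data frame passed in place of a formula is converted to one, with the first column as the response and the remaining columns as predictors. A small sketch with made-up data (this data_frame and its columns are illustrative):

```r
set.seed(1)
data_frame <- data.frame(Y = rnorm(20), factor1 = rnorm(20),
                         factor2 = rnorm(20), unused = rnorm(20))
v <- c("Y", "factor1", "factor2")

# formula() on a data frame uses the first column as the response
f <- formula(data_frame[v])

# Both calls fit the same model
fit_df      <- lm(data_frame[v])
fit_formula <- lm(Y ~ factor1 + factor2, data = data_frame)
all.equal(coef(fit_df), coef(fit_formula))
```

Column order matters with this approach: the response has to come first in the selection vector.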
I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.
model_response <- "response_field_name"
setnames(model_data_train, c(model_response), "response") #if using data.table
model_gbm <- gbm(response ~ ., data=model_data_train, ...)

Passing new data to predict.lm: why is an object not found in data.frame() when the model uses its logarithm?

Using a dataset I built a model as below:
fit <- lm(y ~ as.numeric(X1) + as.factor(x2) + log(1 + x3) + as.numeric(X4) , dataset)
Then I build new data:
X1 <- 1
X2 <- 10
X3 <- 15
X4 <- 0.5
new <- data.frame(X1, X2, X3, X4)
predict(fit, new , se.fit=TRUE)
Then I get the Error below:
Error in data.frame(state_today, daily_creat, last1yr_min_hosp_icu_MDRD, :
object 'X2' is not found
What am I doing wrong? Is this because of logarithm in the model?
A good way to get a fresh perspective on a problem like this is to construct a self-contained, reproducible example from scratch, with no copying and pasting. Doing so often teases out the weirdest bugs imaginable.
As flodel and Ben have pointed out, your problem is probably due to inconsistent variable names: the model formula mixes upper- and lower-case names (X1, x2, x3, X4), while the new data uses only upper-case names. I'm guessing you're using RStudio, whose default font, in my opinion, makes x and X hard to tell apart.
Here is something similar to what you're trying to do, with all variable names consistently lower-case.
xy <- data.frame(y = runif(20), x1 = runif(20), x2 = sample(1:5, 20, replace = TRUE), x3 = runif(20))
fit <- lm(y ~ as.numeric(x1) + as.factor(x2) + log(1+x3), data = xy)
predict(fit, newdata = data.frame(x1 = 1, x2 = as.factor(3), x3 = 15))
1
0.05015187
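One way to catch this kind of mismatch before calling predict() is to compare the predictor names stored in the fitted model with the column names of the new data. This is a sketch on simulated data; new_bad and new_ok are hypothetical names:

```r
set.seed(1)
xy <- data.frame(y = runif(20), x1 = runif(20), x3 = runif(20))
fit <- lm(y ~ x1 + log(1 + x3), data = xy)

# Predictor variables the model expects, without the response
needed <- all.vars(delete.response(terms(fit)))

new_bad <- data.frame(X1 = 1, X3 = 15)  # upper-case names do not match
setdiff(needed, names(new_bad))         # predictors missing from new_bad

new_ok <- data.frame(x1 = 1, x3 = 15)
setdiff(needed, names(new_ok))          # empty: all predictors are present
pred <- predict(fit, newdata = new_ok)
```

Note that transformations such as log(1 + x3) are applied by predict() itself, so the new data should contain the raw variable x3, not its logarithm.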
