Adding extra variables to a formula - r

I want to add extra variables to a formula, with the use of a separate object part_B. As an example:
part_A <- as.formula("y ~ x1")
part_B <- c("x2", "x3")
I tried a couple of things, but one issue is that you cannot call as.formula on the object part_B (because in that case I could have created the formula by combining character vectors).
Desired Result
as.formula("y ~ x1 + x2 + x3")
Is there any way to do this? I guess one solution would be to create a function that writes the character vector as "y ~ x1 + x2 + x3" so it can be fed to as.formula.

Use reformulate and update like this:
update(part_A, reformulate(c(".", part_B)))
## y ~ x1 + x2 + x3
This also works:
v <- all.vars(part_A)
reformulate(c(v[-1], part_B), v[1])
## y ~ x1 + x2 + x3

If you write part_A as a vector instead:
part_A <- c("y", "x1")
part_B <- c("x2", "x3")
new_formula <- as.formula(paste(part_A[1], paste(c(part_A[2], part_B), collapse = " + "), sep = " ~ "))
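The question mentions wrapping this in a function. A minimal sketch of such a helper, built on the update/reformulate approach above (the name add_terms is hypothetical, not part of base R):

```r
# Hypothetical helper: append extra predictor names to an existing formula.
# `formula` is a formula object, `extra` a character vector of term labels.
add_terms <- function(formula, extra) {
  if (length(extra) == 0L) return(formula)   # nothing to add
  # reformulate(c(".", extra)) builds  ~ . + x2 + x3 ;
  # update() then substitutes "." with the existing right-hand side.
  update(formula, reformulate(c(".", extra)))
}

part_A <- y ~ x1
part_B <- c("x2", "x3")
f <- add_terms(part_A, part_B)
print(f)
```

Because update() keeps the left-hand side when the new formula has none, the response y is carried over unchanged.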

Related

R model.matrix column names for factors

I use model.matrix to create a matrix used by GLM.
formula_test <- as.formula("Y ~ x1 + x2")
data_test <- expand.grid(
Y = 1:100
, x1 = c("A","B")
, x2 = 1:20
)
result_test <- data.frame(model.matrix(
object = formula_test
, data = data_test
))
names(result_test)
Interestingly, the column names of the result_test data are "X.Intercept." "x1B" "x2"
How come the second column name is not "x1A"?
I then tried data_test$x1 <- factor(x = data_test$x1, levels = c("A","B")) but it's still the same.
That is because if you had c("X.Intercept.", "x1A", "x1B", "x2"), then you would have perfect multicollinearity: x1A + x1B would be a column of ones, just like the X.Intercept. column. If, for the sake of interpretation, you prefer having x1A instead of the intercept, we may use
formula_test <- as.formula("Y ~ -1 + x1 + x2")
giving
names(result_test)
# [1] "x1A" "x1B" "x2"
and
all(rowSums(result_test[, c("x1A", "x1B")]) == 1)
# [1] TRUE
As for why it is x1A that is dropped rather than x1B: under R's default treatment contrasts, the first factor level serves as the baseline and gets no column of its own. If instead we use
levels(data_test$x1) <- c("B", "A")
then this gives
names(result_test)
# [1] "X.Intercept." "x1A" "x2"
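Note that assigning to levels() renames the existing levels, so rows that were "A" become "B". If the goal is only to change which level acts as the baseline, relevel() is an alternative that leaves the data untouched. A minimal sketch on a small made-up grid (the object names d, before and after are hypothetical):

```r
# Small illustrative grid; expand.grid() turns character vectors into factors
# by default, so x1 is a factor with levels "A", "B".
d <- expand.grid(Y = 1:4, x1 = c("A", "B"), x2 = 1:2)

before <- colnames(model.matrix(Y ~ x1 + x2, data = d))
print(before)   # baseline is "A", so only x1B gets a column

# Make "B" the reference level without relabelling any observation
d$x1 <- relevel(d$x1, ref = "B")

after <- colnames(model.matrix(Y ~ x1 + x2, data = d))
print(after)    # baseline is now "B", so the dummy column is x1A
```

The matrix column is actually called "(Intercept)"; the "X.Intercept." spelling in the question appears only after wrapping the matrix in data.frame(), which makes the names syntactic.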

How to paste formula into model.matrix function in R?

By way of simplified example, say you have the following data:
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))
And you wish to create a model matrix of the following form:
model.matrix(~ df$x1 + df$x2)
or more preferably:
model.matrix(~ x1 + x2, data = df)
but instead by pasting the formula into model.matrix. I have experimented with the following but encounter errors with all of them:
form1 <- "df$x1 + df$x2"
model.matrix(~ as.formula(form1))
model.matrix(~ eval(parse(text = form1)))
model.matrix(~ paste(form1))
model.matrix(~ form1)
I've also tried the same with the more preferable structure:
form2 <- "x1 + x2, data = df"
Is there a direct solution to this problem? Or is the model.matrix function not conducive to this approach?
Do you mean something like this?
expr <- "~ x1 + x2"
model.matrix(as.formula(expr), df)
You need to give df as the data argument outside of as.formula, as the data argument defines the environment within which to evaluate the formula.
If you don't want to specify the data argument you can do
model.matrix(as.formula("~ df$x1 + df$x2"))
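If the variable names are held in a character vector rather than a single string, reformulate() can build the one-sided formula directly. A short sketch under that assumption (rhs and mm are hypothetical names):

```r
set.seed(1)
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))

# Term labels kept as plain strings, e.g. read from a config or built in a loop
rhs <- c("x1", "x2")

# reformulate(rhs) yields  ~ x1 + x2 ; the data are still supplied separately
mm <- model.matrix(reformulate(rhs), data = df)
print(colnames(mm))
print(dim(mm))
```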

How to preserve column names when dynamically passing data frame columns to `aggregate`

With a data frame like below
df1 <- data.frame(a=seq(1.1,9.9,1.1), b=seq(0.1,0.9,0.1),
c=rev(seq(10.1, 99.9, 11.1)))
I want to aggregate cols b and c by a
So I would do something like this
aggregate(cbind(b,c) ~ a, data = df1, mean)
This would get it done. However, I want to generalize this without hard-coded column names, for example in a function.
myAggFunction <- function (df, col_main, col_1, col_2){
return (aggregate(cbind(df[,col_1], df[,col_2]) ~ df[,col_main], df, mean))
}
myAggFunction(df1, 1, 2, 3)
The issue I have is that the column names of the returned data frame are as below
df2[, 1] V1 V2
How do I get the column names in the original data frame in the returned data frame?
I will be assuming a general case, where you have multiple LHS (left hand sides) as well as multiple RHS (right hand sides).
Using "data.frame" method
## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
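If you pass the objects as named lists (a data frame is one), the names are preserved. So do not extract single columns with df[, j], which drops them to unnamed vectors; subset with df[j] instead, which keeps a named data frame. You may construct your function as: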
## `LHS` and `RHS` are vectors of column names or numbers giving column positions
fun1 <- function (df, LHS, RHS){
## call `aggregate.data.frame`
aggregate.data.frame(df[LHS], df[RHS], mean)
}
Still using "formula" method?
## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
subset, na.action = na.omit)
It is slightly tedious, but we want to construct a nice formula via:
as.formula( paste(paste0("cbind(", toString(LHS), ")"),
paste(RHS, collapse = " + "), sep = " ~ ") )
For example:
LHS <- c("y1", "y2", "y3")
RHS <- c("x1", "x2")
as.formula( paste(paste0("cbind(", toString(LHS), ")"),
paste(RHS, collapse = " + "), sep = "~") )
# cbind(y1, y2, y3) ~ x1 + x2
If you feed this formula to aggregate, you will get decent column names preserved.
So construct your function as such:
fun2 <- function (df, LHS, RHS){
## ideally, `LHS` and `RHS` should already be vectors of column names
## but vectors of numeric positions are also allowed
if (is.numeric(LHS)) LHS <- names(df)[LHS]
if (is.numeric(RHS)) RHS <- names(df)[RHS]
## make a formula
form <- as.formula( paste(paste0("cbind(", toString(LHS), ")"),
paste(RHS, collapse = " + "), sep = "~") )
## dispatches to `aggregate.formula` (no need for `stats:::`)
aggregate(form, data = df, FUN = mean)
}
Remark
aggregate.data.frame is the more direct choice. aggregate.formula is a wrapper: it calls model.frame internally to construct a data frame first and then hands off to the "data.frame" method.
I give "formula" method as an option, because the way I construct a formula is useful for lm, etc.
Simple, reproducible example
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10),
x1 = gl(2,5, labels = letters[1:2]))
## "data.frame" method with `fun1`
fun1(dat, 1:2, 3)
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
## "formula" method with `fun2`
fun2(dat, 1:2, 3)
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
fun2(dat, c("y1", "y2"), "x1")
# x1 y1 y2
#1 a 0.79071819 -0.3543499
#2 b -0.07287026 -0.3706127
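Applied to the shape of the original question (one grouping column, several value columns, all given by position), the "data.frame" approach might look like the sketch below. The toy data and the name my_agg are made up for illustration.

```r
# Toy data with a repeated grouping column `a`
df1 <- data.frame(a = c(1, 1, 2, 2),
                  b = c(0.1, 0.3, 0.5, 0.7),
                  c = c(10, 20, 30, 40))

# Hypothetical wrapper: `col_main` and `cols` may be names or positions.
# df[cols] and df[col_main] stay named data frames, so names are preserved.
my_agg <- function(df, col_main, cols) {
  aggregate(df[cols], by = df[col_main], FUN = mean)
}

res <- my_agg(df1, 1, 2:3)
print(res)
```

The result keeps the original names a, b and c because both arguments passed to aggregate are named lists rather than bare vectors.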

Using * in lm for variables with common names

If I have 100 variables with a common name, such as year_1951, year_1952, year_1953 etc, is there a way to do a linear regression that includes all variables that start with year_ ? In Stata this is easy by using the *, but in R, I'm not sure how to go about this.
Thanks.
Stata Example :
regress y year_*
Is there an equivalent in R, such as
ols.lm <- lm(y ~ year_*, data = d)
I don't think R supports that kind of expansion inside a formula. It does support the y ~ . kind of expansion.
Here is how you can do it
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
myformulae <- as.formula(paste(depVar,paste(indepVars,collapse=' + '),sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Edit: Solving the problem mentioned in the comment (adding constants in the formula)
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
indepVarsCollapse <- paste(paste(indepVars,collapse=' + '), '-1')
myformulae <- as.formula(paste(depVar,indepVarsCollapse,sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Rather than selecting the columns in the formula, select them in the data argument:
nms <- c("y", grep("year_", names(d), value = TRUE))
lm(y ~., d[nms])
Alternately, select all the desired columns in the grep
ix <- grep("^(y$|year_)", names(d))
lm(y ~., d[ix])
or if we knew that the unwanted columns do not start with y:
ix <- grep("^y", names(d))
lm(y ~., d[ix])
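For comparison, the same formula can be assembled with reformulate() instead of paste(). A small sketch on simulated data (the data frame and the column named other are invented for the example):

```r
set.seed(42)
d <- data.frame(y = rnorm(20),
                year_1951 = rnorm(20),
                year_1952 = rnorm(20),
                other = rnorm(20))     # a column that should be left out

# Pick every column whose name starts with "year_"
year_vars <- names(d)[startsWith(names(d), "year_")]

# y ~ year_1951 + year_1952
fit <- lm(reformulate(year_vars, response = "y"), data = d)
print(names(coef(fit)))

# Same predictors without an intercept, via the `intercept` argument
fit0 <- lm(reformulate(year_vars, response = "y", intercept = FALSE), data = d)
print(names(coef(fit0)))
```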

How to construct a big regular formula for a model in R?

I am trying to create a model to predict "y" from data "D", which contains predictors x1 to x100 and 200 other variables. Since the x columns are not stored consecutively, I can't select them by column position.
I can't use ctree(y ~ ., data = D) because of the other variables. Is there a way to refer to x1 through x100 in the model,
instead of writing very long code like
ctree(y ~ x1 + x2 + x..... x100)
Some recommendation would be appreciated.
Two more. The simplest in my mind is to subset the data:
ctree(y ~ ., data = D[, c("y", paste0("x", 1:100))])
Or a more functional approach to building dynamic formulas:
ctree(reformulate(paste0("x", 1:100), "y"), data = D)
Construct your formula as a text string, and convert it with as.formula.
vars <- names(D)[1:100] # or wherever your desired predictors are
fm <- paste("y ~", paste(vars, collapse="+"))
fm <- as.formula(fm)
ctree(fm, data=D, ...)
You can use this:
fmla <- as.formula(paste("y", paste0("x", 1:100, collapse=" + "), sep=" ~ "))
ctree(fmla, data = D)
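When the x columns are scattered among other variables and their exact numbering is not known in advance, they can be picked up by pattern instead. A sketch on a tiny made-up data frame (the column names z1 and junk are invented; the resulting formula would then be passed to ctree as in the answers above):

```r
# Toy stand-in for D: x columns are neither consecutive nor in order
D <- data.frame(y = 1:3, z1 = 0, x2 = 0, junk = 0, x1 = 0, x10 = 0)

# Keep only names of the form "x<number>"
x_vars <- grep("^x[0-9]+$", names(D), value = TRUE)

# Optional: order them by their numeric suffix (x1, x2, x10 rather than x1, x10, x2)
x_vars <- x_vars[order(as.integer(sub("^x", "", x_vars)))]

fm <- reformulate(x_vars, response = "y")
print(fm)
# e.g. ctree(fm, data = D), assuming the package providing ctree is loaded
```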
