I'm a beginner and I'm trying to write a function that fits a univariate model one variable at a time. I want to be able to enter all the independent variables when calling the function and then have the function fit one variable at a time. Something like:
f <-function(y, x1,x2,x3) {(as.formula(print()))...}
I tried to make a list of x1, x2, x3 inside the function but it didn't work.
Maybe someone here could help me with this or point me in the right direction.
Here's something that might get you going. You can put your x variables inside a list, and then fit once for each element inside the list.
fitVars <- function(y, xList) {
lapply(xList, function(x) lm(y ~ x))
}
y <- 1:10
set.seed(10)
xVals <- list(rnorm(10), rnorm(10), rnorm(10))
fitVars(y, xVals)
This returns one fit result for each element of xVals:
[[1]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
4.984 -1.051
[[2]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
5.986 -1.315
[[3]]
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.584 2.282
Another option is to use the ... (dots) argument so the function accepts an arbitrary number of arguments:
fitVars <- function(y, ...) {
xList <- list(...)
lapply(xList, function(x) lm(y ~ x))
}
set.seed(10)
fitVars(y, rnorm(10), rnorm(10), rnorm(10))
This gives the same result as above.
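If the variables live in a data frame, a variation on the same idea is to pass column names and build each formula with reformulate(). This is only a sketch: fit_each is a made-up helper name, and mtcars stands in for your own data. Naming the list makes it easier to tell the fits apart.

```r
# Hypothetical helper: fit response ~ predictor once per predictor name.
fit_each <- function(data, response, predictors) {
  fits <- lapply(predictors, function(p) {
    lm(reformulate(p, response = response), data = data)
  })
  names(fits) <- predictors  # label each fit by its predictor
  fits
}

fits <- fit_each(mtcars, "mpg", c("cyl", "disp", "hp"))

# Slope of each single-predictor model, named by predictor
sapply(fits, function(m) coef(m)[[2]])
```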
The code below creates a linear model with R's lm, then a weighted model with a weights column. Finally, I try to pass in the weight column name with a variable weight_col and that fails. I'm pretty sure it's looking for "weight_col" in df, then the caller's environment, finds a variable of length 1, and the lengths don't match.
How do I get it to use weight_col as a name for the weights column in df?
I've tried several combinations of things without success.
> df <- data.frame(
x=c(1,2,3),
y=c(4,5,7),
w=c(1,3,5)
)
> lm(y ~ x, data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
2.333 1.500
> lm(y ~ x, data=df, weights=w)
Call:
lm(formula = y ~ x, data = df, weights = w)
Coefficients:
(Intercept) x
1.947 1.658
> weight_col <- 'w'
> lm(y ~ x, data=df, weights=weight_col)
Error in model.frame.default(formula = y ~ x, data = df, weights = weight_col, :
variable lengths differ (found for '(weights)')
> R.version.string
[1] "R version 3.6.3 (2020-02-29)"
You can use the data frame name with extractor operator:
lm(y ~ x, data = df, weights = df[[weight_col]])
Or you can use function get:
lm(y ~ x, data = df, weights = get(weight_col))
We can use [[ to extract the value of the column
lm(y ~ x, data=df, weights=df[[weight_col]])
Or with tidyverse
library(dplyr)
df %>%
summarise(model = list(lm(y ~ x, weights = .data[[weight_col]])))
Your first example is weights = w, which uses non-standard evaluation to find w in the context of df. So far, this is normal for interactive use.
Your second set is weights = weight_col which resolves to weights = "w", which is very different. There is nothing in R's non-standard (or standard) evaluation in which that makes sense.
As I said in my comment, use the standard-evaluation form with [[.
lm(y ~ x, data=df, weights=df[[weight_col]])
# Call:
# lm(formula = y ~ x, data = df, weights = df[[weight_col]])
# Coefficients:
# (Intercept) x
# 1.947 1.658
I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{ }}, but what I'm really after is something like the code below, if select(contains()) could be used inside a formula.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should be callable as function(dataframe, dependent) and return a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
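To see what reformulate() actually builds, it can help to create the formula first and inspect it before fitting. A small sketch using the same anscombe data; startsWith() is used here instead of grep() so that only names beginning with "x" are picked up, which is an assumption about what you want to match:

```r
d <- datasets::anscombe

# Pick predictor columns by name prefix
x_cols <- names(d)[startsWith(names(d), "x")]

# Build the formula from character vectors
fo <- reformulate(x_cols, response = "y1")
print(fo)

fit <- lm(fo, data = d)
```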
I want to estimate an equation such as:
where the bar denotes the mean of a variable. That is, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code sample. Note that in the actual problem I have large (and varying) numbers of X variables, so I don't want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(Z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, because the variables were demeaned beforehand, a term such as Z:x1:x2 would use (x1-mean(x1))*(x2-mean(x2)), while what I want is x1*x2-mean(x1*x2).
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
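For reference, scale(v, scale = FALSE) only subtracts the mean, so it is equivalent to writing the centering by hand. A quick check on mtcars, as a sketch (fit_scale and fit_manual are just illustrative names):

```r
v <- with(mtcars, disp * hp)

# Centering only: subtract the mean, do not divide by the standard deviation
centered <- scale(v, scale = FALSE)
all.equal(as.vector(centered), v - mean(v))

# The same term written two ways inside a formula
fit_scale  <- lm(mpg ~ cyl + scale(disp * hp, scale = FALSE), data = mtcars)
fit_manual <- lm(mpg ~ cyl + I(disp * hp - mean(disp * hp)), data = mtcars)
all.equal(unname(coef(fit_scale)), unname(coef(fit_manual)))
```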
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This gives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but it should work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier, if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model since I had already fitted it for the first approach.
As a side note, this may not be implemented in a more straightforward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable than interacting the demeaned variables.
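Building the model matrix directly, without fitting a model first, could look like the sketch below. The simulated data and the names d, mm and mm_centered are illustrative only, not taken from the question.

```r
set.seed(1)
d <- data.frame(x1 = runif(100), x2 = runif(100), Z = runif(100))
d$Y <- exp(d$x1) + exp(d$x2) + exp(d$Z)

# Main effects plus two-way interaction of the x's; drop the intercept column
mm <- model.matrix(~ (x1 + x2)^2, data = d)[, -1]

# Demean every column, including the raw product x1:x2
mm_centered <- scale(mm, center = TRUE, scale = FALSE)

# Interact the demeaned terms with Z
fit <- lm(d$Y ~ mm_centered * d$Z)
```

Here the x1:x2 column is x1*x2 minus its own mean, not the product of the demeaned variables.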
Comparing both approaches:
Here is the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292
I have a set of dependent variables y1, y2, ..., a set of independent variables x1, x2, ..., and a set of controls d1, d2, .... These are all inside a data.table, let's call it data.
I need to do something along the lines of
out1 <- lm(y1 ~ x1, data=data)
out2 <- lm(y1 ~ x1 + d1 + d2, data=data)
....
This is of course not very nice, so I was thinking about writing a list containing all these regressions, and then just iterating through that. Something along the lines of
myRegressions <- list('out1' = y1 ~ x1, 'out2' = y1 ~ x1 + d1 + d2)
output <- NULL
for (reg in myRegressions)
{
output[reg] <- lm(myRegressions[[reg]])
}
This of course won't work: I cannot construct the list as the syntax is invalid outside of lm(). What's a good approach here instead?
You can use paste0 and as.formula to generate formulas and then simply pass them to lm(), e.g.
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")
for (i in 1:length(regressors)) {
print(as.formula(paste0("y1", "~", regressors[i])))
}
This gives you the formulas (printed). Just store them in a list and iterate over that list with lapply like
lapply(stored_formulas, function(x) { lm(x, data=yourData) })
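Putting the two steps together might look like this. It is a sketch that uses the built-in anscombe data in place of yourData; stored_formulas is the name used above, and fits is just an illustrative name.

```r
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")

# Build one formula per right-hand side and keep them in a list
stored_formulas <- lapply(regressors, function(r) {
  as.formula(paste0("y1 ~ ", r))
})
names(stored_formulas) <- regressors

# Fit every formula against the same data
fits <- lapply(stored_formulas, function(f) lm(f, data = anscombe))
```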
Using the built in anscombe data frame try this:
formulas = list(y1 ~ x1, y2 ~ x2)
lapply(formulas, function(fo) do.call("lm", list(fo, data = quote(anscombe))))
giving:
[[1]]
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
(Intercept) x1
3.0001 0.5001
[[2]]
Call:
lm(formula = y2 ~ x2, data = anscombe)
Coefficients:
(Intercept) x2
3.001 0.500
Note that the Call: portion of the output is accurately produced which will be useful if there are many components to the output list.
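The difference in the recorded call can be checked directly. In this sketch (fit_plain and fit_docall are illustrative names), a plain lm() call stores the name of the variable holding the formula, while the do.call() version stores the formula itself:

```r
fo <- y1 ~ x1

fit_plain  <- lm(fo, data = anscombe)
fit_docall <- do.call("lm", list(fo, data = quote(anscombe)))

# What each fit remembers as its formula argument
deparse(fit_plain$call$formula)
deparse(fit_docall$call$formula)
```

The coefficients are the same either way; only the stored call differs.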
Formulas can also be given as character strings:
myReg <- list('out1' = "mpg ~ cyl")
lm(myReg[[1]],data=mtcars)
Call:
lm(formula = myReg[[1]], data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=rep(c("A", "B"), 2), stringsAsFactors=TRUE) # stringsAsFactors=TRUE so levels() works in R >= 4.0
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609
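To see what bquote() does here: it returns the call unevaluated, with .(x) replaced by the value of x. The sketch below also names the resulting list by category, which is an addition to the answer above; the simulated data and the name models are illustrative.

```r
set.seed(42)
test.data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                        category = rep(c("A", "B"), 50))

# .(x) is substituted with the value of x; the rest stays unevaluated
x <- "A"
cl <- bquote(lm(y ~ x1 + x2, data = test.data, subset = (category == .(x))))
deparse(cl$subset)

run.individual.models <- function(x) {
  lmc <- bquote(lm(y ~ x1 + x2, data = test.data, subset = (category == .(x))))
  eval(lmc)
}

# Name each model by its category so it is easy to look up later
categories <- c("A", "B")
models <- setNames(lapply(categories, run.individual.models), categories)
```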