Pass variable names dynamically in an lm formula inside a function - r

I have a function that takes two parameters:
dataRead (a data frame supplied by the user)
variableChosen (which dependent variable the user wants to use
in the model)
Note: the independent variable will always be the first column.
Suppose the user gives me, for example, a data frame called dataGiven whose column names are "Doses" and "Weight".
I want the model to carry these names in my results.
My current function fits the lm correctly, but the column names from the data frame are lost (the result shows how I extracted the data inside the function instead):
Results_REG<- function (dataRead, variableChosen){
fit1 <- lm(formula = dataRead[,1]~dataRead[,variableChosen])
return(fit1)
}
When I call:
test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
shows:
"dataRead[, 1]" "dataRead[, variableChosen]"
I want it to show my data frame's column names, like:
"Doses" "Weight"

First off, it's always difficult to help without a reproducible code example. For future posts I recommend familiarising yourself with how to provide such a minimal reproducible example.
I'm not entirely clear on what you're asking, so I assume this is about how to create a function that fits a simple linear model based on data with a single user-chosen predictor var.
Here is an example based on mtcars
results_LM <- function(data, var) {
lm(data[, 1] ~ data[, var])
}
results_LM(mtcars, "disp")
#Call:
#lm(formula = data[, 1] ~ data[, var])
#
#Coefficients:
#(Intercept) data[, var]
# 29.59985 -0.04122
You can confirm that this gives the same result as
lm(mpg ~ disp, data = mtcars)
Or perhaps you're asking how to carry through the column names for the predictor? In that case we can use as.formula to construct a formula that we use together with the data argument in lm.
results_LM <- function(data, var) {
fm <- as.formula(paste(colnames(data)[1], "~", var))
lm(fm, data = data)
}
fit <- results_LM(mtcars, "disp")
fit
#Call:
#lm(formula = fm, data = data)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
names(fit$model)
#[1] "mpg" "disp"
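As a minimal sketch of the same idea applied to the names in the question: the data frame dataGiven below is made-up illustration data, and overwriting fit$call$formula is only a cosmetic step so that the printed Call: shows the real formula. It assumes the column names are syntactically valid R names.

```r
# Sketch: build the formula from column names so the model keeps them.
Results_REG <- function(dataRead, variableChosen) {
  # The first column is the response, the chosen column is the predictor
  fo <- reformulate(variableChosen, response = names(dataRead)[1])
  fit <- lm(fo, data = dataRead)
  # Cosmetic only: store the actual formula in the recorded call
  fit$call$formula <- fo
  fit
}

# Hypothetical data, invented for this example
dataGiven <- data.frame(
  Doses  = c(1, 2, 3, 4, 5),
  Weight = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
```

With this version names(test1$model) should return the original column names rather than the indexing expressions.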

outcome <- 'mpg'
model <- lm(mtcars[,outcome] ~ . ,mtcars)
yields the same result as:
data(mtcars)
model <- lm( mpg ~ . ,mtcars)
but allows you to pass a variable (the column name). However, this is not quite equivalent: the dot expands to every column of mtcars, so mpg is also included on the right-hand side of the formula. Not sure if anyone knows how to fix that.
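One possible way around this, sketched under the assumption that outcome holds a syntactically valid column name, is to build the formula from the string so that lm sees the response as a named column and drops it from the dot expansion:

```r
outcome <- "mpg"

# Build "mpg ~ ." as a formula; the dot now excludes the response column
fo <- as.formula(paste(outcome, "~ ."))
model_dyn <- lm(fo, data = mtcars)

# Reference model written by hand
model_ref <- lm(mpg ~ ., data = mtcars)

all.equal(coef(model_dyn), coef(model_ref))
```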

Related

How to run a loop regression in r

I want to run a regression in a loop for fund1 through fund10 based on the LIQ factor. I want to do this regression:
lm(fund1S~ LIQ, data = Merge_Liq_Size)
but for all of the funds simultaneously.
I have attached some pictures of the dataset to show you the setup. The dataset has 479 observations/rows. Can anyone help me structure the code? Sorry if this question is phrased in the wrong way.
Perhaps this is what you are looking for:
my_models <- lapply(paste0("fund", 1:10, "S ~ LIQ"), function(x) lm(as.formula(x), data = Merge_Liq_Size))
You can access each model by my_models[[1]] to my_models[[10]].
If it is lm, we can also do this without lapply, i.e. create a matrix as the dependent variable and construct the formula:
lm(as.matrix(Merge_Liq_Size[paste0("fund", 1:10, "S")]) ~ LIQ, data = Merge_Liq_Size)
Using a small reproducible example
> data(mtcars)
> lm(as.matrix(mtcars[1:3]) ~ vs, data = mtcars)
Call:
lm(formula = as.matrix(mtcars[1:3]) ~ vs, data = mtcars)
Coefficients:
mpg cyl disp
(Intercept) 16.617 7.444 307.150
vs 7.940 -2.873 -174.693
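To compare the two approaches side by side, here is a small sketch on simulated data. The data frame toy is a hypothetical stand-in for Merge_Liq_Size (three funds instead of ten); only the column naming pattern is taken from the question.

```r
set.seed(1)

# Hypothetical stand-in for Merge_Liq_Size
toy <- data.frame(LIQ = rnorm(50))
for (i in 1:3) {
  toy[[paste0("fund", i, "S")]] <- 0.5 * i * toy$LIQ + rnorm(50)
}

responses <- paste0("fund", 1:3, "S")

# Approach 1: one lm per fund, stored in a named list
models <- lapply(responses, function(y) {
  lm(reformulate("LIQ", response = y), data = toy)
})
names(models) <- responses

# Pull out the LIQ slope of every model
slopes <- sapply(models, function(m) coef(m)[["LIQ"]])

# Approach 2: a single lm with a matrix response
multi <- lm(as.matrix(toy[responses]) ~ LIQ, data = toy)

slopes
coef(multi)["LIQ", ]
```

Both approaches should report the same slopes; the list version is easier to work with when you need summary() for each fund separately.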

linear regression function creating a list instead of a model

I'm trying to fit an lm model using R. However, for some reason this code creates a list of the data instead of the usual regression model.
The code I use is this one
lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden )
But instead of the usual coefficients, the title of the variable appears mixed with the data in this way:
(Intercept) Inclusivity0.631 Inclusivity0.681 Inclusivity0.716 Inclusivity0.9
35.00 -4.00 -6.74 -4.30 4.90
Does anybody know why this happened and how it can be fixed?
What you are seeing is called a named num (a numeric vector with names). You can do the following:
Model <- lm(Soc_vote ~ Inclusivity + Gini + Un_den, data = general_inclusivity_sweden) # Assign the model to an object called Model
summary(Model) # Summary table
Model$coefficients # See only the coefficients as a named numeric vector
Model$coefficients[[1]] # See the first coefficient without name
If you want all the coefficients without names (so just a numeric vector), try:
unname(coef(Model))
It would be good if you could provide a sample of your data but I'm guessing that the key problem is that the numeric data in Inclusivity is stored as a factor. e.g.,
library(tidyverse)
x <- tibble(incl = as.factor(c(0.631, 0.681, 0.716)),
soc_vote=1:3)
lm(soc_vote ~ incl, x)
Call:
lm(formula = soc_vote ~ incl, data = x)
Coefficients:
(Intercept) incl0.681 incl0.716
1 1 2
Whereas, if you first convert the Inclusivity column to double, you get
y <- x %>% mutate(incl = as.double(as.character(incl)))
lm(soc_vote ~ incl, y)
Call:
lm(formula = soc_vote ~ incl, data = y)
Coefficients:
(Intercept) incl
-13.74 23.29
Note that I needed to convert to character first, since otherwise I would just get the integer code of each factor level.
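The same check and conversion can be sketched in base R without the tidyverse. The data frame x here simply mirrors the toy example above:

```r
x <- data.frame(
  incl     = factor(c(0.631, 0.681, 0.716)),
  soc_vote = 1:3
)

# Diagnose: is the predictor stored as a factor?
is.factor(x$incl)

# Converting directly returns the internal level codes, not the values
codes <- as.numeric(x$incl)

# Going through character recovers the original numbers
values <- as.numeric(as.character(x$incl))

x$incl <- values
fit <- lm(soc_vote ~ incl, data = x)
names(coef(fit))
```

After the conversion the model has a single slope for incl instead of one dummy coefficient per level.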

Regarding multiple linear models simultaneously

I am trying to run many linear regression models simultaneously. Please help me write the code for this.
I am working with two data frames. The first data frame has 100 dependent variables and the second has 100 independent variables. I want simple linear models like
lm1 <- lm(data_frame_1[[1]] ~ data_frame_2[[1]])
lm2 <- lm(data_frame_1[[2]] ~ data_frame_2[[2]])
and so on. That means I have to run 100 regression models, and I want to do this simultaneously. Please help me write code that runs all of these models at once.
It is not that clear what you mean by simultaneously. But maybe doing a loop is fine in your case?
model.list <- vector("list", 100)
for (i in 1:100) {
model.list[[i]] <- lm(data_frame_1[[i]] ~ data_frame_2[[i]])
}
Using dataframe_1 and dataframe_2 defined in the Note at the end, we define a function LM that takes an x name and a y name and regresses y on x using the columns from those data frames. Map then returns a list of lm objects. Note that the Call: line in the output of each list component correctly identifies which columns were used.
LM <- function(xname, yname) {
fo <- formula(paste(yname, "~", xname))
do.call("lm", list(fo, quote(cbind(dataframe_1, dataframe_2))))
}
Map(LM, names(dataframe_1), names(dataframe_2))
giving:
$x1
Call:
lm(formula = y1 ~ x1, data = cbind(dataframe_1, dataframe_2))
Coefficients:
(Intercept) x1
3.0001 0.5001
... etc ...
Note
Using the built-in anscombe data frame, define dataframe_1 as the x columns and dataframe_2 as the y columns.
dataframe_1 <- anscombe[grep("x", names(anscombe))]
dataframe_2 <- anscombe[grep("y", names(anscombe))]
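Once the list of fits exists, the slopes can be collected in one step. The following is a self-contained sketch of that idea; fit_pair, xs, ys and combined are names chosen for this example, and it passes the data explicitly instead of using do.call, so the printed Call: will be less informative than in the answer above.

```r
xs <- anscombe[grep("x", names(anscombe))]
ys <- anscombe[grep("y", names(anscombe))]
combined <- cbind(xs, ys)

# Fit one regression of yname on xname (helper name is illustrative)
fit_pair <- function(xname, yname, data) {
  lm(reformulate(xname, response = yname), data = data)
}

# Pair the i-th x column with the i-th y column
fits <- Map(fit_pair, names(xs), names(ys), MoreArgs = list(data = combined))

# One slope per model, named after the x column
slopes <- sapply(fits, function(m) unname(coef(m)[2]))
slopes
```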

How to understand the arguments of "data" and "subset" in randomForest R package?

Arguments
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from
subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
My questions:
Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly is the meaning of "By default the variables are taken from the environment which randomForest is called from"?
Why do we need the subset parameter? Let's say, we have the iris data set. If I want to use the first 100 rows as the training data set, I just select training_data <- iris[1:100,]. Why bother? What's the benefit of using subset?
This is not an uncommon methodology, and certainly not unique to randomForest.
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept) disp
# 29.59985 -0.04122
So when lm (in this case) is attempting to resolve the variables referenced in the formula mpg~disp, it looks at data if provided, then in the calling environment. Further example:
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept) disp
# 29.59985 -0.04122
(Notice that mpg2 is not in mtcars, so this used both methods for finding the data.) I don't use this functionality, preferring the resilient step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers if this is not the case.
Similarly, many similar functions (including lm) allow this subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:
lm(mpg~disp, data=mtcars, subset= cyl==4)
lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])
mt <- mtcars[ mtcars$cyl == 4, ]
lm(mpg~disp, data=mt)
The use of subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility is compounded when the number of referenced variables increases (i.e., for "code golf" purposes). But this could also be done with other mechanisms such as with, so ... mostly personal preference.
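A quick sketch to check that equivalence with lm (the object names are chosen for this example):

```r
# Same model, rows selected in two different ways
fit_subset   <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)
fit_filtered <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])

# Identical coefficients and the same number of rows used
all.equal(coef(fit_subset), coef(fit_filtered))
c(nobs(fit_subset), nobs(fit_filtered))
```

The only visible difference is the recorded call, which keeps subset = cyl == 4 in the first case.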
Edit: as joran pointed out, randomForest (and others but notably not lm) can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y, as in the following examples taken from ?randomForest (ignore the other arguments being inconsistent):
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)

How do I create a "macro" for regressors in R?

For long and repeating models I want to create a "macro" (so called in Stata and there accomplished with global var1 var2 ...) which contains the regressors of the model formula.
For example from
library(car)
lm(income ~ education + prestige, data = Duncan)
I want something like:
regressors <- c("education", "prestige")
lm(income ~ #regressors, data = Duncan)
All I could find is this approach, but applying it to my regressors won't work:
reg = lm(income ~ bquote(y ~ .(regressors)), data = Duncan)
as it throws me:
Error in model.frame.default(formula = y ~ bquote(.y ~ (regressors)), data =
Duncan, : invalid type (language) for variable 'bquote(.y ~ (regressors))'
Even the accepted answer to the same question:
lm(formula(paste('var ~ ', regressors)), data = Duncan)
fails and shows me:
Error in model.frame.default(formula = formula(paste("var ~ ", regressors)),
: object is not a matrix
And of course I tried as.matrix(regressors) :)
So, what else can I do?
Here are some alternatives. No packages are used in the first 3.
1) reformulate
fo <- reformulate(regressors, response = "income")
lm(fo, Duncan)
or you may wish to write the last line as this so that the formula that is shown in the output looks nicer:
do.call("lm", list(fo, quote(Duncan)))
in which case the Call: line of the output appears as expected, namely:
Call:
lm(formula = income ~ education + prestige, data = Duncan)
2) lm(dataframe)
lm( Duncan[c("income", regressors)] )
The Call: line of the output looks like this:
Call:
lm(formula = Duncan[c("income", regressors)])
but we can make it look exactly as in the do.call solution in (1) with this code:
fo <- formula(model.frame(income ~., Duncan[c("income", regressors)]))
do.call("lm", list(fo, quote(Duncan)))
3) dot
An alternative similar to that suggested by @jenesaisquoi in the comments is:
lm(income ~., Duncan[c("income", regressors)])
The approach discussed in (2) to the Call: output also works here.
4) fn$
Prefacing a function with fn$ enables string interpolation in its arguments. This solution is nearly identical to the desired syntax shown in the question, using $ in place of # to perform substitution, and the flexible substitution could readily extend to more complex scenarios. The quote(Duncan) in the code could be written as just Duncan and it will still run, but the Call: shown in the lm output will look better if you use quote(Duncan).
library(gsubfn)
rhs <- paste(regressors, collapse = "+")
fn$lm("income ~ $rhs", quote(Duncan))
The Call: line looks almost identical to the do.call solutions above -- only spacing and quotes differ:
Call:
lm(formula = "income ~ education+prestige", data = Duncan)
If you wanted it absolutely the same then:
fo <- fn$formula("income ~ $rhs")
do.call("lm", list(fo, quote(Duncan)))
For the scenario you described, where regressors is in the global environment, you could use:
lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data = Duncan)
Alternatively, you could use a function:
modincome <- function(regressors){
lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data = Duncan)
}
modincome(c("education", "prestige"))
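One likely reason the paste attempt in the question failed is that paste without collapse is vectorised: with two regressors it returns two separate strings rather than one formula string. A small base-R sketch of the difference (no model is fitted here, so Duncan is not needed):

```r
regressors <- c("education", "prestige")

# Without collapse: one string per regressor
pieces <- paste("income ~", regressors)
length(pieces)

# With collapse: a single right-hand side, then a single formula
rhs <- paste(regressors, collapse = " + ")
fo_paste <- as.formula(paste("income ~", rhs))

# reformulate builds the same formula directly
fo_reform <- reformulate(regressors, response = "income")

deparse(fo_paste)
deparse(fo_reform)
```

Either formula object can then be passed to lm together with data = Duncan, as in the answers above.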

Resources