Passing Argument to lm in R within Function - r

I would like to be able to call lm within a function and specify the weights variable as an argument passed to the outer function, which then passes it on to lm. Below is a reproducible example where the call works if it is made to lm outside of a function, but produces the error message Error in eval(expr, envir, enclos) : object 'weightvar' not found when called from within a wrapper function.
olswrapper <- function(form, weightvar, df){
ols <- lm(formula(form), weights = weightvar, data = df)
}
df <- mtcars
ols <- lm(mpg ~ cyl + qsec, weights = gear, data = df)
summary(ols)
ols2 <- olswrapper(mpg ~ cyl + qsec, weightvar = gear, df = df)
#Produces error: "Error in eval(expr, envir, enclos) : object 'weightvar' not found"

Building on the comments, gear isn't defined globally. It works inside the stand-alone lm call as you specify the data you are using, so lm knows to take gear from df.
However, gear itself doesn't exist outside that stand-alone lm call. This is shown by the output of gear:
> gear
Error: object 'gear' not found
You can pass the gear into the function using df$gear
weightvar <- df$gear
ols <- olswrapper(mpg ~ cyl + qsec, weightvar , df = df)

I know I'm late on this, but I believe the previous explanation is incomplete. Declaring weightvar <- df$gear and then passing it in to the function only works because you use weightvar as the name for your weight argument. This is just using weightvar as a global variable. That's why df$gear doesn't work directly. It also doesn't work if you use any name except weightvar.
The reason why it doesn't work is that lm looks for data in two places: the dataframe argument (if specified), and the environment of your formula. In this case, your formula's environment is R_GlobalEnv. (You can test this by running print(str(form)) from inside olswrapper). Thus, lm will only look in the global environment and in df, not the function environment.
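To make the point about formula environments concrete, here is a minimal sketch. The helper show_env is a hypothetical name introduced only for illustration; it is not part of the original code.

```r
# show_env is a hypothetical helper: it returns the environment attached to a formula
show_env <- function(form) environment(form)

caller_env <- environment()             # the environment this script is running in
form_env   <- show_env(mpg ~ cyl + qsec)

# The formula remembers where it was written, not the function it was passed to
same_env <- identical(form_env, caller_env)
print(same_env)
```

Because the formula carries the caller's environment, anything that exists only inside the wrapper (such as its weightvar argument) is invisible to lm.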
edit: In the lm documentation the description of the data argument says:
"an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called."
A quick workaround is to say environment(form) <- environment() to change your formula's environment. This won't cause any problems because the data in the formula is in the data frame you specify.
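A sketch of that workaround, assuming the weights are passed in as a vector (for example mtcars$gear) rather than as a bare column name. The function name olswrapper_env is made up here to avoid clashing with the original olswrapper.

```r
# Sketch: re-point the formula's environment at the wrapper's own frame so that
# lm can find the weightvar argument there. olswrapper_env is a hypothetical name.
olswrapper_env <- function(form, weightvar, df) {
  environment(form) <- environment()
  lm(form, weights = weightvar, data = df)
}

# The weights are passed as a vector, not as an unquoted column name
fit_wrapped <- olswrapper_env(mpg ~ cyl + qsec, weightvar = mtcars$gear, df = mtcars)
fit_direct  <- lm(mpg ~ cyl + qsec, weights = gear, data = mtcars)

# Both fits should give the same coefficients
print(all.equal(coef(fit_wrapped), coef(fit_direct)))
```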

eval(substitute(...)) inside the body of a function allows us to employ non-standard evaluation:
df <- mtcars
olswrapper <- function(form, weightvar, df) {
  # substitute() swaps in the caller's expressions (e.g. gear) before lm is evaluated
  ols <- eval(substitute(lm(formula(form), weights = weightvar, data = df)))
  summary(ols)
}
olswrapper(mpg ~ cyl + qsec, weightvar = gear, df = df)
More here:
http://adv-r.had.co.nz/Computing-on-the-language.html

Related

glm fit with iris invalid first argument, must be vector (list or atomic)

I have the following working code
glm.fit <- glm(Income ~ .,data=train,family=binomial)
summary(glm.fit)
However, I have some questions about it, and to be able to ask them I tried to reproduce the code using the iris data set.
I tried
cf<-iris
glm.fit(Petal.Width ~ ., cf, family = binomial)
but I get an error
Error in dim(data) <- dim : invalid first argument, must be vector (list or atomic)
[Update]
I see the data I expect using the following
library(dplyr)
cf<-iris
cf %>% head(10)
There are some issues with your code.
First, there's no need to create the variable cf. You can just use iris.
Second, glm.fit takes as its first 2 arguments x and y. From the documentation, accessible at ?glm.fit:
For glm.fit: x is a design matrix of dimension n * p, and y is a vector of observations of length n.
Your first line of code uses glm to create a variable named glm.fit - this is not the same as the function of that name.
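For comparison, a minimal sketch of how glm.fit could be called directly, assuming a gaussian family so that the continuous Petal.Width is a valid response. The object names x, y, fit_low and fit_high are illustrative only.

```r
# Build the design matrix and response explicitly, as glm.fit expects
x <- model.matrix(Petal.Width ~ ., data = iris)  # includes the intercept and Species dummies
y <- iris$Petal.Width

fit_low  <- glm.fit(x, y, family = gaussian())                       # low-level interface
fit_high <- glm(Petal.Width ~ ., data = iris, family = gaussian())   # formula interface

# The two interfaces should agree on the coefficients
print(all.equal(fit_low$coefficients, coef(fit_high)))
```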
If you want to use glm, that function can take a formula and the name of a data frame as arguments. So this works:
glm(Petal.Width ~ ., data = iris)
But this gives an error:
glm(Petal.Width ~ ., data = iris, family = binomial)
Error in eval(family$initialize) : y values must be 0 <= y <= 1
That's because the response variable, Petal.Width is continuous. You use the binomial family when the response takes 2 values (yes/no, 0/1, true/false).
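As an illustration only, here is a sketch of a binomial fit on iris with a two-valued response. The is_virginica column is a made-up indicator, not something from the original question.

```r
# Hypothetical binary response: 1 if the flower is virginica, 0 otherwise
iris_bin <- transform(iris, is_virginica = as.integer(Species == "virginica"))

# With a 0/1 response the binomial family is accepted
fit_bin <- glm(is_virginica ~ Sepal.Length, data = iris_bin, family = binomial)
print(family(fit_bin)$family)
```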

Pass dynamically variable names in lm formula inside a function

I have a function that takes two parameters:
dataRead (a data frame supplied by the user)
variableChosen (which dependent variable the user wants to use in the model)
Note: the independent variable will always be the first column.
If the user gives me, for example, a data frame called dataGiven whose column names are "Doses" and "Weight", I want those names to appear in my model results.
My current function fits the lm correctly, but the column names from the data frame are lost; the output shows how I extracted the data inside the function instead.
Results_REG<- function (dataRead, variableChosen){
fit1 <- lm(formula = dataRead[,1]~dataRead[,variableChosen])
return(fit1)
}
When I call:
test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
shows:
"dataRead[, 1]" "dataRead[, variableChosen]"
I wanted to show my dataframe columns names, like:
"Doses" "Weight"
First off, it's always difficult to help without a reproducible code example. For future posts I recommend familiarising yourself with how to provide such a minimal reproducible example.
I'm not entirely clear on what you're asking, so I assume this is about how to create a function that fits a simple linear model based on data with a single user-chosen predictor var.
Here is an example based on mtcars
results_LM <- function(data, var) {
lm(data[, 1] ~ data[, var])
}
results_LM(mtcars, "disp")
#Call:
#lm(formula = data[, 1] ~ data[, var])
#
#Coefficients:
#(Intercept) data[, var]
# 29.59985 -0.04122
You can confirm that this gives the same result as
lm(mpg ~ disp, data = mtcars)
Or perhaps you're asking how to carry through the column names for the predictor? In that case we can use as.formula to construct a formula that we use together with the data argument in lm.
results_LM <- function(data, var) {
fm <- as.formula(paste(colnames(data)[1], "~", var))
lm(fm, data = data)
}
fit <- results_LM(mtcars, "disp")
fit
#Call:
#lm(formula = fm, data = data)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
names(fit$model)
#[1] "mpg" "disp"
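An alternative sketch uses reformulate() from the stats package, which builds the formula from character vectors without pasting strings together. The names results_LM2 and fit2 are hypothetical and only serve to keep this separate from the code above.

```r
# Sketch: reformulate() builds "first column ~ var" from character input
results_LM2 <- function(data, var) {
  fm <- reformulate(var, response = colnames(data)[1])
  lm(fm, data = data)
}

fit2 <- results_LM2(mtcars, "disp")
print(names(fit2$model))
```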
outcome <- 'mpg'
model <- lm(mtcars[,outcome] ~ . ,mtcars)
yields the same result as:
data(mtcars)
model <- lm( mpg ~ . ,mtcars)
but allows you to pass a variable (the column name). However, because . expands to every column of mtcars, mpg ends up on the right-hand side of the formula as well. Not sure if anyone knows how to fix that.
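One possible way around this, sketched under the assumption that the outcome is given as a column name: build the formula with reformulate(), so that . expands to every column except the response. The names fm_all and model_all are illustrative.

```r
outcome <- "mpg"

# reformulate(".", response = outcome) builds the formula mpg ~ .
fm_all <- reformulate(".", response = outcome)
model_all <- lm(fm_all, data = mtcars)

# mpg is now the response only and no longer appears among the predictors
print(names(coef(model_all)))
```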

How to understand the arguments of "data" and "subset" in randomForest R package?

Arguments
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from
subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
My questions:
Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly is the meaning of "By default the variables are taken from the environment which randomForest is called from"?
Why do we need the subset parameter? Let's say, we have the iris data set. If I want to use the first 100 rows as the training data set, I just select training_data <- iris[1:100,]. Why bother? What's the benefit of using subset?
This is not an uncommon methodology, and certainly not unique to randomForest.
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept) disp
# 29.59985 -0.04122
So when lm (in this case) is attempting to resolve the variables referenced in the formula mpg~disp, it looks at data if provided, then in the calling environment. Further example:
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept) disp
# 29.59985 -0.04122
(Notice that mpg2 is not in mtcars, so this used both methods for finding the data. I don't use this functionality, preferring the resilient step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers if this is not the case.)
Similarly, many comparable functions (including lm) accept this subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:
lm(mpg~disp, data=mtcars, subset= cyl==4)
lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])
mt <- mtcars[ mtcars$cyl == 4, ]
lm(mpg~disp, data=mt)
The use of subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility is compounded when the number of referenced variables increases (i.e., for "code golf" purposes). But this could also be done with other mechanisms such as with, so ... mostly personal preference.
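A small sketch of the with() alternative mentioned above, compared against the subset= form. The object names fit_subset and fit_with are illustrative.

```r
# subset= lets lm pick the rows itself
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)

# with() evaluates the lm call inside the already-filtered data frame
fit_with <- with(mtcars[mtcars$cyl == 4, ], lm(mpg ~ disp))

# Both approaches should fit the same model
print(all.equal(coef(fit_subset), coef(fit_with)))
```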
Edit: as joran pointed out, randomForest (and others but notably not lm) can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y, as in the following examples taken from ?randomForest (ignore the other arguments being inconsistent):
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)

Issue running glmnet() for mtcars dataset

Whenever I run glmnet(mpg ~ ., data = mtcars, alpha=1) (from the glmnet package) I get the following error:
"Error in glmnet(mpg ~ ., data = mtcars, alpha = 1) : unused argument (data = mtcars)"
Any ideas for how to deal with this?
I think it's because the glmnet() function is supposed to take in x and y as separate arguments. If I need separate x and y arguments, how would I write the call so that glmnet::glmnet() runs for all variables of mtcars?
As the commenter suggests, you need to call glmnet with a predictor matrix and a response vector, like so:
fit <- glmnet(as.matrix(mtcars[-1]), mtcars$mpg, alpha=1)
plot(fit)
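If you would rather start from a formula, one option is to let model.matrix() build the predictor matrix, which also expands factors into dummy columns. This is a sketch that assumes the glmnet package is installed; the fit is skipped otherwise, and the name fit_lasso is illustrative.

```r
# Build x from a formula; drop the intercept column because glmnet adds its own
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

# Only fit if glmnet is available (assumption: it may not be installed)
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit_lasso <- glmnet::glmnet(x, y, alpha = 1)
}

print(dim(x))
```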

Use glm with data.table and a parametric definition of the predictors and the response

I want to do VIF testing running consecutive regressions within a dataset, each time using one variable as the response and the remaining as predictors.
To that end I will put my code within a for loop which will give consecutive values to the index of the column that will be used as the response and leave the remaining as predictors.
I am going to use the data.table package and I will use the mtcars dataset found in base R to create a reproducible example:
data(mtcars)
setDT(mtcars)
# Let i-- the index of the response -- be 1 for demonstration purposes
i <- 1
variables <- names(mtcars)
response <- names(mtcars)[i]
predictors <- setdiff(variables, response)
model <- glm(mtcars[, get(response)] ~ mtcars[, predictors , with = FALSE], family = "gaussian")
However, this results in an error message:
Error in model.frame.default(formula = mtcars[, get(response)] ~
mtcars[, :
invalid type (list) for variable 'mtcars[, predictors, with = FALSE]'
Could you explain the error and help me correct the code?
Your advice will be appreciated.
=============================================================================
Edit:
In reproducing the code suggested I got an error message:
> library(car)
> library(data.table)
>
> data(mtcars)
> setDT(mtcars)
> model <- glm(formula = mpg ~ .,data=mtcars , family = "gaussian")
> vif(model)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘vif’ for signature ‘"glm"’
Update:
The code ran without problems when I specified the package explicitly, i.e.:
car::vif(model)
Edit 2
I had to amend Fredrik's code as follows to get the coefficients of all the variables:
rhs <- paste(predictors, collapse ="+")
full_formula <- paste(response, "~", rhs)
full_formula <- as.formula(full_formula)
If you want to calculate the VIF of your predictors I would suggest looking at the vif function in package car. It will do the calculations for you and generalizes to predictors with multiple degrees of freedom such as factors.
To get all the VIFs, you would just have to run:
library(car)
library(data.table)
data(mtcars)
setDT(mtcars)
model <- glm(formula = mpg ~ .,data=mtcars , family = "gaussian")
vif(model)
As for your error, I think you are mixing up glm, which takes a formula and a dataset, and glm.fit, which takes the design matrix and the response vector, in that order. You have concepts from both functions in your call.
To fit your model I suggest going with the glm since this will give you an object of class glm with extra features such as the ability to do plot(model) as opposed to glm.fit where you only get a list of values related to the model.
In that case you would just have to create the formula, looking something like:
library(data.table)
data(mtcars)
setDT(mtcars)
# Let i-- the index of the response -- be 1 for demonstration purposes
i <- 1
variables <- names(mtcars)
response <- names(mtcars)[i]
predictors <- setdiff(variables, response)
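# collapse (not sep) joins the predictor names into a single "a + b + ..." string
rhs <- paste(predictors, collapse = " + ")
full_formula <- as.formula(paste(response, "~", rhs))
model <- glm(formula = full_formula, data = mtcars, family = "gaussian")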
In contrast to:
model <- glm.fit(y=mtcars[, get(response)] ,
x=mtcars[, predictors , with = FALSE],
family=gaussian())
Another solution is based on the use of glm.fit:
model <- glm.fit(x=mtcars[, ..predictors], y=mtcars[[response]], family = gaussian())
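For the loop described in the question (each variable in turn as the response, the rest as predictors), here is a base R sketch. It assumes the usual definition VIF = 1 / (1 - R^2) and uses a plain data frame; the names df_all, vars and vifs are illustrative.

```r
# Plain data frame copy of mtcars, untouched by setDT()
df_all <- as.data.frame(datasets::mtcars)
vars <- names(df_all)

# Regress each variable on all the others and convert R^2 into a VIF
vifs <- sapply(vars, function(v) {
  fm <- reformulate(setdiff(vars, v), response = v)
  r2 <- summary(lm(fm, data = df_all))$r.squared
  1 / (1 - r2)
})

print(round(vifs, 2))
```

Note that this treats every column, including mpg, as a candidate, so the values differ from car::vif(model), which only considers the predictors of one fitted model.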

Resources