I need to run lm on nested data using map. I wrapped lm in a function so that I can vary the set of independent variables in the model. The code looks like this:
predictors1 <- c("TWO_YEAR_PRIOR_SCALE_0", "PV1.Dim1")
student_progress_model <- function(df, predictors){
  lm(
    as.formula(paste("SCALE_SCORE_0", paste(predictors, collapse="+"), sep="~")),
    data=df, na.action=na.exclude)
}
data_nested <- data_nested %>%
  mutate(
    model=map(data, student_progress_model(df=., predictors=predictors1)),
    stand_resids = map2(data, model, rstandard)
  )
It fails with the error: object 'SCALE_SCORE_0' not found
How can I specify that my variables are in the data passed to the function? Is it something to do with tidy eval?
I tried to put {{}} in my function like
lm(
  as.formula(paste("SCALE_SCORE_0", paste({{predictors}}, collapse="+"), sep="~")),
  data=df, na.action=na.exclude)
But it does not help...
Also, will rstandard give me standardised residuals this way?
Thank you
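One possible fix, sketched on mtcars because the original data_nested is not shown (the grouping column cyl, the response mpg and the predictors wt and hp are stand-ins, and progress_model is a hypothetical rename of the function above). In the question's code the function is called immediately with df=., and inside a magrittr pipe that dot is the whole nested tibble, which has no SCALE_SCORE_0 column. Passing the function itself to map(), with predictors as an extra argument, lets map() hand each nested data frame to df. Since rstandard() takes only the fitted model, map() is enough in place of map2().

```r
library(dplyr)
library(tidyr)
library(purrr)

# Stand-in for the original nested data: mtcars nested by cyl
data_nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup()

predictors1 <- c("wt", "hp")  # hypothetical predictors for this sketch

# reformulate() builds response ~ p1 + p2 from strings, replacing the paste() calls
progress_model <- function(df, predictors, response = "mpg") {
  lm(reformulate(predictors, response = response),
     data = df, na.action = na.exclude)
}

data_nested <- data_nested %>%
  mutate(
    # pass the function itself; map() supplies each nested data frame as df
    model = map(data, progress_model, predictors = predictors1),
    # rstandard() needs only the fitted model
    stand_resids = map(model, rstandard)
  )
```

No tidy eval is needed here, because the variable names are plain strings that end up inside a formula.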
dplyr's pipe does not pass the name of objects passed down the chain. This is well known. However, it leads to unexpected complications after you fit a glm model. Functions using glm objects expect the call to contain the correct name of the object containing data.
#sample data
library(pacman) # provides p_load()
p_load(ISLR, dplyr)
mydata = ISLR::Default
#fit glm
fitted=
  mydata %>%
  select(default, income) %>%
  glm(default~.,data=.,family=binomial)
#dot in call
fitted$call
#pscl's pR2 pseudo r2 function does not work
p_load(pscl)
pR2(fitted)
How to fix this behavior?
I want to keep using pipes, including the select function. I also want to obtain a glm object in fitted that can be used with pR2 or other functions that need a working call.
One can re-arrange the data-preprocessing into the glm call, but it takes away the elegance of the code.
fitted=
  glm(default~.,
      data=mydata %>%
        select(default, income),
      family=binomial)
1) Since you are explicitly writing out all the variables in the select anyways you can just as easily write them out in the formula instead and get rid of the select -- you can keep the select if you like but it does seem pointless if the variables are already explicitly given in the formula. Then this works:
library(dplyr)
library(magrittr)
library(pscl)
library(ISLR)
fitted <- Default %$% glm(default ~ income, family=binomial)
fitted %>% pR2
2) Another possibility is to invert it: instead of putting glm inside the pipe, put the pipe inside glm:
fitted <-
  glm(default ~ ., data = Default %>% select(income, default), family = binomial)
fitted %>% pR2
3) A third approach is to generate the formula argument of glm rather than the data argument.
fitted <- Default %>%
  select(starts_with("inc")) %>%
  names %>%
  reformulate("default") %>%
  glm(data = Default, family = binomial)
fitted %>% pR2
Replace the glm line with this if it is important that the Call: line in the output look nice.
{ do.call("glm", list(., data = quote(Default), family = quote(binomial))) }
or using purrr (note that invoke() has been deprecated since purrr 1.0.0 in favour of rlang::exec()):
{ invoke("glm", list(., data = expr(Default), family = expr(binomial))) }
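To make approach 3 more concrete, here is a small base-R sketch of what reformulate() hands to glm. The vector predictor_names is an assumed stand-in for whatever select(...) %>% names returns; income and balance are both columns of ISLR::Default.

```r
# Column names as they might come out of select(...) %>% names
predictor_names <- c("income", "balance")

# reformulate() assembles response ~ term1 + term2 + ... from character vectors
f <- reformulate(predictor_names, response = "default")
print(f)

# The result is an ordinary formula, so it can go straight into glm(f, data = ...)
class(f)
```

Because the data argument stays a plain name (Default), the stored call no longer contains a dot.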
Whenever I run glmnet(mpg ~ ., data = mtcars, alpha=1) (from the glmnet package) I get the following error:
"Error in glmnet(mpg ~ ., data = mtcars, alpha = 1) : unused argument (data = mtcars)"
Any ideas for how to deal with this?
I think it's because the glmnet() function is supposed to take in x and y as separate arguments. If I need separate x and y arguments, how would I write the call so that glmnet::glmnet() runs for all variables of mtcars?
As the commenter suggests, glmnet() has no formula interface, so you need to pass the predictor matrix and the response separately, like so:
fit <- glmnet(as.matrix(mtcars[-1]), mtcars$mpg, alpha=1)
plot(fit)
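If you would still like to describe the model with a formula, one common pattern (a sketch, not something glmnet itself provides) is to let model.matrix() expand the formula into a numeric matrix first. This also handles factor columns, which as.matrix() would not turn into dummy variables.

```r
library(glmnet)

# Expand the formula into a numeric design matrix; drop the intercept column
# because glmnet() fits its own intercept
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

fit <- glmnet(x, y, alpha = 1)  # alpha = 1 is the lasso penalty
```

For mtcars, where every column is already numeric, x holds the same predictors as as.matrix(mtcars[-1]).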
I would like to able to call lm within a function and specify the weights variable as an argument passed to the outside function that is then passed to lm. Below is a reproducible example where the call works if it is made to lm outside of a function, but produces the error message Error in eval(expr, envir, enclos) : object 'weightvar' not found when called from within a wrapper function.
olswrapper <- function(form, weightvar, df){
  ols <- lm(formula(form), weights = weightvar, data = df)
}
df <- mtcars
ols <- lm(mpg ~ cyl + qsec, weights = gear, data = df)
summary(ols)
ols2 <- olswrapper(mpg ~ cyl + qsec, weightvar = gear, df = df)
#Produces error: "Error in eval(expr, envir, enclos) : object 'weightvar' not found"
Building on the comments, gear isn't defined globally. It works inside the stand-alone lm call as you specify the data you are using, so lm knows to take gear from df.
However, gear itself doesn't exist outside that stand-alone lm call. This is shown by the output of gear
> gear
Error: object 'gear' not found
You can pass the gear into the function using df$gear
weightvar <- df$gear
ols <- olswrapper(mpg ~ cyl + qsec, weightvar , df = df)
I know I'm late on this, but I believe the previous explanation is incomplete. Declaring weightvar <- df$gear and then passing it in to the function only works because you use weightvar as the name for your weight argument. This is just using weightvar as a global variable. That's why df$gear doesn't work directly. It also doesn't work if you use any name except weightvar.
The reason it doesn't work is that lm looks for variables in two places: the data argument (if specified) and the environment of your formula. In this case, your formula's environment is R_GlobalEnv. (You can check this by running print(environment(form)) from inside olswrapper.) Thus, lm will only look in df and in the global environment, not in the function's environment.
edit: In the lm documentation the description of the data argument says:
"an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called."
A quick workaround is to say environment(form) <- environment() to change your formula's environment. This won't cause any problems because the data in the formula is in the data frame you specify.
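A minimal sketch of that workaround on mtcars. One assumption to flag: the weights are passed as an actual vector (mtcars$gear). A bare column name such as gear would still fail, because the argument is evaluated in the caller's environment, where no gear exists.

```r
olswrapper <- function(form, weightvar, df) {
  # Point the formula at this function's frame, where weightvar lives
  environment(form) <- environment()
  lm(form, weights = weightvar, data = df)
}

# Weights supplied as a vector, not as a bare column name
ols2 <- olswrapper(mpg ~ cyl + qsec, weightvar = mtcars$gear, df = mtcars)

# Reference fit made directly with lm() for comparison
ols <- lm(mpg ~ cyl + qsec, weights = gear, data = mtcars)
all.equal(coef(ols), coef(ols2))
```

The assignment only changes the function's local copy of the formula, so the caller's formula object keeps its original environment.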
eval(substitute(...)) inside the body of a function lets us employ non-standard evaluation:
df <- mtcars
olswrapper <- function(form, weightvar, df) {
  # substitute() swaps in the expressions the caller supplied (e.g. gear for weightvar),
  # so lm() can resolve the weights column inside df
  eval(substitute(lm(formula(form), weights = weightvar, data = df)))
}
ols <- olswrapper(mpg ~ cyl + qsec, weightvar = gear, df = df)
summary(ols)
More here:
http://adv-r.had.co.nz/Computing-on-the-language.html
I have a regression model created with by. I know I can use sapply to extract specific parts of the model for each factor, but what if I wanted something like the whole summary, anova, etc.?
model <- with(data, by(data, factor, function(data) lm(y ~ x, data=data)))
sapply will coerce the results of summary.lm and anova.lm to a matrix. I think you may want to use lapply, which applies a function (here summary) on each element in the list produced by by, and returns a list.
models <- by(warpbreaks, warpbreaks$tension, function(x){
  lm(breaks ~ wool, data = x)
})
lapply(models, summary)
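The same pattern extends to anova and to single components of the summary. The following sketch reuses the warpbreaks models from above; the names anovas and r_squared are only illustrative.

```r
models <- by(warpbreaks, warpbreaks$tension, function(x) {
  lm(breaks ~ wool, data = x)
})

# Whole objects: lapply() keeps one anova table per tension level
anovas <- lapply(models, anova)

# One number per model: here sapply() simplifying to a vector is what we want
r_squared <- sapply(models, function(m) summary(m)$r.squared)
```

In short, lapply() suits results that should stay intact as objects, while sapply() is convenient when each model contributes a single value.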
I am using ddply to execute glm on subsets of my data. I am having difficulty accessing the estimated Y values. I am able to get the model parameter estimates using the below code, but all the variations I've tried to get the fitted values have fallen short. The dependent and independent variables in the glm model are column vectors, as is the "Dmsa" variable used in the ddply operation.
Define the model:
Model <- function(df){coef(glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=df))}
Execute the model on subsets:
Modrpt <- ddply(msadata, "Dmsa", Model)
Print Modrpt gives the model coefficients, but no Y estimates.
I know that if I wasn't using ddply, I can access the glm estimated Y values by using the code:
Model <- glm(Y~D+O+B+A+log(M), family=poisson(link="log"), data=msadata)
fits <- Model$fitted.values
I have tried both of the following to get the fitted values for the subsets, but no luck:
fits <- fitted.values(ddply(msadata, "Dmsa", Model))
fits <- ddply(msadata, "Dmsa", fitted.values(Model))
I'm sure this is very easy to code... unfortunately, I'm just learning R. Does anyone know where I am going wrong?
You can use an anonymous function in your call to ddply e.g.
require(plyr)
data(iris)
model <- function(df){
  lm( Petal.Length ~ Sepal.Length + Sepal.Width , data = df )
}
ddply( iris , "Species" , function(x) fitted.values( model(x) ) )
This has the advantage that you can also, without rewriting your model function, get the coef values by doing
ddply( iris , "Species" , function(x) coef( model(x) ) )
As @James points out, this will fall down if you have splits of unequal size; it is better to use dlply, which puts the result of each subset in its own list element.
(I make no claims for statistical relevance or correctness of the example model - it is just an example)
I'd recommend doing this in two steps:
library(plyr)
# First fit the models
models <- dlply(iris, "Species", lm,
  formula = Petal.Length ~ Sepal.Length + Sepal.Width )
# Next, extract the fitted values
ldply(models, fitted.values)
# Or maybe, as one long data frame with a row per observation
ldply(models, function(m) data.frame(fitted = fitted.values(m)))
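Another option, sketched here on iris with the same example model, is to have the function return each subset with its fitted values attached as a column. This stays well-formed when the groups have unequal sizes, as long as no rows are dropped for missing values. The names with_fits and fitted are only illustrative.

```r
library(plyr)

# Return each subset with its fitted values attached as a new column
with_fits <- ddply(iris, "Species", function(d) {
  fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = d)
  d$fitted <- fitted(fit)
  d
})
head(with_fits)
```

The same shape should carry over to the Poisson glm in the question, where fitted() returns values on the response scale.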