calling the glm() function within a user-defined function - r

I have been trying to create a function that uses a glm() inside it. But I always get an error message. It looks like the function does not retrieve the value of the variable.
set.seed(234)
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)
rm(sex, age) #remove sex and age from the global environment for reproducibility
to_analyze <- function(dep, indep, data){
glm(dep~factor(indep), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#> Error in eval(predvars, data, env): object 'age' not found

You could use any of the following:
Using substitute:
to_analyze <- function(dep, indep, data){
glm(substitute(dep ~ factor(indep)), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
Advantage: Can write the independent as a formula.
eg
to_analyze(Petal.Width, Sepal.Length + Sepal.Width, data = iris)
Using reformulate as stated by #NelsonGon
to_analyze <- function(dep, indep, data){
glm(reformulate(sprintf("factor(%s)",indep), dep), data = data)
}
Note that to call this function, the variables aught to be of type character
to_analyze(dep= "age", indep="sex", data=dsn)
Recall glm can also take a string that can be parsed to a formula:
to_analyze <- function(dep, indep, data){
glm(sprintf("%s~factor(%s)", dep, indep), data = data)
}
to_analyze("age", "sex", data=dsn)
or even:
to_analyze <- function(dep, indep, data){
glm(paste(dep,"~ factor(",indep,")"), data = data)
}
to_analyze("age", "sex", data=dsn)
LASTLY: to combine both the substitute and paste:
to_analyze <- function(dep, indep, data){
glm(paste(substitute(dep),"~ factor(",substitute(indep),")"), data = data)
}
will work for both symbols and characters. eg:
to_analyze(age, sex, data=dsn)
to_analyze("age", "sex", data=dsn)

Create a "formula" object in the function and pass to glm.
To get the variables without giving an error the standard trick is deparse(substitute(.)).
Then compose the formula with paste.
to_analyze <- function(dep, indep, data){
dep <- deparse(substitute(dep))
indep <- deparse(substitute(indep))
indep <- paste0("factor(", indep, ")")
fmla <- paste(dep, indep, sep = " ~ ")
fmla <- as.formula(fmla)
glm(fmla, data = data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#
#Call: glm(formula = fmla, data = data)
#
#Coefficients:
# (Intercept) factor(sex)M
# 23.984 -3.984
#
#Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
#Null Deviance: 396.2
#Residual Deviance: 0.837 AIC: -188.5

#Onyambu and others. The substitute command seems to work well for just one call as it works for the to_analyze(). However when I call another function inside it, it is complaining again. Any help would be greatly appreciated
to_analyze <- function(dep, indep, data){
glm(substitute(dep ~ factor(indep)), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#>
#> Call: glm(formula = substitute(dep ~ factor(indep)), data = data)
#>
#> Coefficients:
#> (Intercept) factor(sex)M
#> 24.006 -4.034
#>
#> Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
#> Null Deviance: 397.3
#> Residual Deviance: 0.8152 AIC: -191.2
However, I am stuck again because I am trying to call the output from this model in lsmeans::lsmeans() to predict marginal means and return the output but it is giving me an error. Although it does not need an offset, I am including it here so that I can get a more general code that I can modify later. Any help would be greatly appreciated
to_predict_lsmeans <- function(dep, indep, data){
model <- glm(substitute(dep ~ factor(indep)), data=data)
pred <- lsmeans:: lsmeans(model, substitute(~ factor(indep)), offset=substitute(data)$log(age), type ="response" )
return(pred)
}
pred <- to_predict_lsmeans(dep=age, indep=sex, data=dsn)
#> Error in ref_grid(object, ...): We are unable to reconstruct the data.
#> The variables needed are:
#> sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#> data
#> Does the data still exist? Or you can specify a dataset via 'data = '
pred
#> Error in eval(expr, envir, enclos): object 'pred' not found

Related

How to convert `c('a','b')` to `cbind(a,b)` in R formula?

myfun<-function(c('a','b'),c('g'),df){
manova(cbind(a,b)~g,data=df)
}
myfun(c('Sepal.Length','Sepal.Width'),c('Species'),iris)
If I want to make myfun(c('Sepal.Length','Sepal.Width'),c('Species'),iris) to get manova result,I need to revise myfun.
I tried but failed:
myfun<-function(var,group,df){
manova(as.formula(cbind(print(var,quote = FALSE))~group),data=df)
}
I don't know how to convert c('a','b') to cbind(a,b),any thought?
Thanks for any answer in advance.
We can create a formula with paste
myfun <- function(colnms1, group, dat) {
fmla <- as.formula(paste0("cbind(",
paste(colnms1, collapse=","), ")", " ~ ", group))
mva <- manova(fmla, data = dat)
mva$call <- fmla
mva
}
myfun(c('Sepal.Length','Sepal.Width'), 'Species' ,iris)
#Call:
# cbind(Sepal.Length, Sepal.Width) ~ Species
#Terms:
# Species Residuals
#Sepal.Length 63.21213 38.95620
#Sepal.Width 11.34493 16.96200
#Deg. of Freedom 2 147
#Residual standard errors: 0.5147894 0.3396877
#Estimated effects may be unbalanced

Finding model (returned from for loops) with lowest AIC in R

I am trying to find model with lowest AIC. Models are returned from two for loops that make possible combinations of columns. I am unable to make the function return model with lowest AIC. The code below demonstrates where I got stuck:
rm(list = ls())
data <- iris
data <- data[data$Species %in% c("setosa", "virginica"),]
data$Species = ifelse(data$Species == 'virginica', 0, 1)
mod_headers <- names(data[1:ncol(data)-1])
f <- function(mod_headers){
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- c(tab[,j])
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[c(mod_tab_new)], family = binomial(link = "logit"))
}
}
best_model <- model[which(AIC(model)[order(AIC(model))][1])]
print(best_model)
}
f(mod_headers)
Any suggestions? Thanks!
I replaced your for loops with vectorised alternatives
library(tidyverse)
library(iterators)
# Column names you want to use in glm model, saved as list
whichcols <- Reduce("c", map(1:length(mod_headers), ~lapply(iter(combn(mod_headers,.x), by="col"),function(y) c(y))))
# glm model results using selected column names, saved as list
models <- map(1:length(whichcols), ~glm(Species ~., data=data[c(whichcols[[.x]], "Species")], family = binomial(link = "logit")))
# selects model with lowest AIC
best <- models[[which.min(sapply(1:length(models),function(x)AIC(models[[x]])))]]
Output
Call: glm(formula = Species ~ ., family = binomial(link = "logit"),
data = data[c(whichcols[[.x]], "Species")])
Coefficients:
(Intercept) Petal.Length
55.40 -17.17
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 1.208e-09 AIC: 4
Using your loop, just put all the models in one list.
Then compute the AIC of all these models.
Finally return the model with the minimum AIC.
f <- function(mod_headers) {
models <- list()
k <- 1
for (i in 1:length(mod_headers)) {
tab <- combn(mod_headers, i)
for(j in 1:ncol(tab)) {
mod_tab_new <- c(tab[, j], "Species")
models[[k]] <- glm(Species ~ ., data = data[mod_tab_new],
family = binomial(link = "logit"))
k <- k + 1
}
}
models[[which.min(sapply(models, AIC))]]
}
glm() uses an iterative re-weighted least squares algorithm. The algorithm reaches the maximum number of iterations before it converges - changing this parameter helps in your case:
glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
There was another issue using which, I replaced it with an if after each model fit to compare to the lowest AIC so far. However, I think there are better solutions than this for-loop approach.
f <- function(mod_headers){
lowest_aic <- Inf # added
best_model <- NULL # added
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- tab[, j]
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
if(AIC(model) < lowest_aic){ # added
lowest_aic <- AIC(model) # added
best_model <- model # added
}
}
}
return(best_model)
}

R - how to use column names as arguments in a function and insert into a model formula

I want a function where the arguments can take a variable name (that is part of a dataset but is not stored as an object in the environment) and inserts that variable name into a model formula.
For example:
# Some data with a couple of variables
my_df <- data.frame(y = rbinom(10, 1,0.5), var1 = runif(10), var2 = runif(10))
# A function that fits a model using predictor specified in the arguments
my_fun <- function(var_name, df){
glm(y ~ var_name, data = df, family = "binomial")
}
When I try to use the function I get the following error message
my_fun(var1, my_df)
Error in eval(expr, envir, enclos) : object 'var1' not found
# What I want the function to do
glm(y ~ var1, data = my_df, family = "binomial")
Is there a way to get this kind of function to work?
You can parse unquoted var_name with substitute:
my_fun <- function(var_name, df){
glm.formula <- substitute(y ~ x, list(x = substitute(var_name)))
glm(glm.formula, data = df, family = "binomial")
}
An example:
my_fun(var1, my_df)
# Call: glm(formula = glm.formula, family = "binomial", data = df)
#
# Coefficients:
# (Intercept) var1
# -1.226 3.108
#
# Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
# Null Deviance: 13.46
# Residual Deviance: 11.35 AIC: 15.35
glm(y ~ var1, data = my_df, family = "binomial")
# Call: glm(formula = y ~ var1, family = "binomial", data = my_df)
#
# Coefficients:
# (Intercept) var1
# -1.226 3.108
#
# Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
# Null Deviance: 13.46
# Residual Deviance: 11.35 AIC: 15.35

R formula in survfit

I am having trouble with formulas, environments, and survfit().
Things work fine for lm() but they fail for survfit().
General problem statement:
I am fitting a series of formulas to some data. So, I call the
modeling function with the formula passed as a variable. Later,
I want to work with the formula from the fitted object.
(From my naive point of view, the trouble comes from survfit not
recording the environment.)
Detailed Example
Expected behaviour as seen in lm():
library("plyr")
preds <- c("wt", "qsec")
f <- function() {
lm(mpg ~ wt, data = mtcars)
}
fits <- alply(preds, 1, function(pred)
{
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$call$formula
##modform
formula(fits[[1]])
## mpg ~ wt
## <environment: 0x1419d1a0>
Even though fits[[1]]$call$formula resolves to modform I can
still get the original formula with formula(fits[[1]]).
But things fail for survfit():
library("plyr")
library("survival")
preds <- c("resid.ds", "rx", "ecog.ps")
fits <-
alply(preds, 1, function(pred)
{
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
print(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$call$formula
## modform
formula(fits[[1]])
## Error in eval(expr, envir, enclos) : object 'modform' not found
Here (and in contrast to lm-fits), formula(fits[[1]]) does not
work!
So, my specific question is: How can I get back the formula used
to fit fits[[1]]?
The issue is that when x$formula is NULL, for an lm object there is a backup plan to get the formula; this doesn't exist for survfit objects
library("plyr")
library("survival")
preds <- c("wt", "qsec")
f <- function() lm(mpg ~ wt, data = mtcars)
fits <- alply(preds, 1, function(pred) {
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$formula
# NULL
The formula can be extracted with formula(fits[[1]]) which uses the formula generic. The lm S3 method for formula is
stats:::formula.lm
# function (x, ...)
# {
# form <- x$formula
# if (!is.null(form)) {
# form <- formula(x$terms)
# environment(form) <- environment(x$formula)
# form
# }
# else formula(x$terms)
# }
So when fits[[1]]$formula returns NULL, forumla.lm looks for a terms attribute in the object and finds the formula that way
fits[[1]]$terms
The survfit objects don't have x$formula or x$terms, so formula(x) givens an error
preds <- c("resid.ds", "rx", "ecog.ps")
fits <- alply(preds, 1, function(pred) {
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$formula
# NULL
formula(fits[[1]]) ## error
formula(fits[[1]]$terms)
# list()
You can fix this by inserting the formula into the call and evaluating it
modform <- as.formula(paste("Surv(futime, fustat)", 'rx', sep = " ~ "))
substitute(survfit(modform, data = ovarian), list(modform = modform))
# survfit(Surv(futime, fustat) ~ rx, data = ovarian)
eval(substitute(survfit(modform, data = ovarian), list(modform = modform)))
# Surv(futime, fustat) ~ rx
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA
Or by manually putting the formula into x$call$formula
fit <- survfit(modform, data = ovarian)
fit$call$formula
# modform
fit$call$formula <- modform
fit$call$formula
# Surv(futime, fustat) ~ rx
fit
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA

substitute in r together with anova

I tried to run anova on different sets of data and didn't quite know how to do it. I goolged and found this to be useful: https://stats.idre.ucla.edu/r/codefragments/looping_strings/
hsb2 <- read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv")
names(hsb2)
varlist <- names(hsb2)[8:11]
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})
My understanding of what the above codes does is it creates a function lm() and apply it to each variable in varlist and it does linear regression on each of them.
So I thought use aov instead of lm would work for me like this:
aov(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
However, I got this error:
Error in terms.default(formula, "Error", data = data) :
no terms component nor attribute
I have no idea of where the error comes from. Please help!
The problem is that substitute() returns an expression, not a formula. I think #thelatemail's suggestion of
lm(as.formula(paste("read ~",x)), data = hsb2)
is a good work around. Alternatively you could evaluate the expression to get the formula with
models <- lapply(varlist, function(x) {
aov(eval(substitute(read ~ i, list(i = as.name(x)))), data = hsb2)
})
I guess it depends on what you want to do with the list of models afterward. Doing
models <- lapply(varlist, function(x) {
eval(bquote(aov(read ~ .(as.name(x)), data = hsb2)))
})
gives a "cleaner" call property for each of the result.
This should do it. The varlist vector is going to be passed item by item to the function and the column will be delivered. The lm function will only see a two column dataframe and the "read" column will be the dependent variable each time. No need for fancy substitution:
models <- sapply(varlist, function(x) {
lm(read ~ ., data = hsb2[, c("read", x) ])
}, simplify=FALSE)
> summary(models[[1]]) # The first model. Note the use of "[["
Call:
lm(formula = read ~ ., data = hsb2[, c("read", x)])
Residuals:
Min 1Q Median 3Q Max
-19.8565 -5.8976 -0.8565 5.5801 24.2703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.16215 3.30716 5.492 1.21e-07 ***
write 0.64553 0.06168 10.465 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.248 on 198 degrees of freedom
Multiple R-squared: 0.3561, Adjusted R-squared: 0.3529
F-statistic: 109.5 on 1 and 198 DF, p-value: < 2.2e-16
For all the models::
lapply(models, summary)
akrun borrowed my answer the other night, now I'm (partially) borrowing his.
do.call puts the variables into the call output so it reads properly. Here's a general function for simple regression.
doModel <- function(col1, col2, data = hsb2, FUNC = "lm")
{
form <- as.formula(paste(col1, "~", col2))
do.call(FUNC, list(form, substitute(data)))
}
lapply(varlist, doModel, col1 = "read")
# [[1]]
#
# Call:
# lm(formula = read ~ write, data = hsb2)
#
# Coefficients:
# (Intercept) write
# 18.1622 0.6455
#
#
# [[2]]
#
# Call:
# lm(formula = read ~ math, data = hsb2)
#
# Coefficients:
# (Intercept) math
# 14.0725 0.7248
#
# ...
# ...
# ...
Note: As thelatemail mentions in his comment
sapply(varlist, doModel, col1 = "read", simplify = FALSE)
will keep the names in the list and also allow list$name subsetting.

Resources