Is there a way of capturing variables in glm function in R? - r

I would like to store indicator variables to be used in modelling into a vector to use them in a function. The goal is to use the created function to fit more than one model. I tried it as below but did not seem to get it right.
# Load data (from gtsummary package)
data(trial, package = "gtsummary")
# Specify the covariates
indicatorvars <- c("trt", "grade")
# Function
modelfunc <- function(outcome, covariates) {
  glm({{outcome}} ~ {{covariates}},
      data = trial, family = binomial)
}
modelfunc("response", "indicatorvars")
# This one works
#glm(response ~ trt + grade, data = trial, family = binomial)

You can first build up your formula as a character string before converting it using as.formula. So the function would look like this:
modelfunc <- function(outcome, covariates) {
  form <- paste(outcome, "~", paste(covariates, collapse = " + "))
  glm(as.formula(form),
      data = trial, family = binomial)
}
And here is your example:
modelfunc("response", indicatorvars)
#>
#> Call: glm(formula = as.formula(form), family = binomial, data = trial)
#>
#> Coefficients:
#> (Intercept)    trtDrug B      gradeII     gradeIII
#>    -0.87870      0.19435     -0.06473      0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
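A side note on the same idea: base R also ships reformulate(), which builds the formula object directly from character vectors and avoids the manual paste()/as.formula() step. A minimal sketch (modelfunc2 is a hypothetical name, and it takes data as an argument so the example runs standalone on mtcars, where am is already 0/1):

```r
# reformulate() assembles outcome ~ covariate1 + covariate2 for us
modelfunc2 <- function(outcome, covariates, data) {
  glm(reformulate(covariates, response = outcome),
      data = data, family = binomial)
}

fit <- modelfunc2("am", c("mpg", "cyl"), mtcars)
names(coef(fit))
#> [1] "(Intercept)" "mpg"         "cyl"
```

Since reformulate() returns a proper formula object, no as.formula() conversion is needed.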
What I don't yet like about this solution is that the call is not very informative. So I would slightly adapt the function:
modelfunc <- function(outcome, covariates) {
  form <- paste(outcome, "~", paste(covariates, collapse = " + "))
  out <- glm(as.formula(form),
             data = trial, family = binomial)
  out$call <- out$formula # replace the call with the formula
  out
}
modelfunc("response", indicatorvars)
#>
#> Call: response ~ trt + grade
#>
#> Coefficients:
#> (Intercept)    trtDrug B      gradeII     gradeIII
#>    -0.87870      0.19435     -0.06473      0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)

Related

Warning message clarification

I'm using the SNPassoc R package to find associations between SNP data and a continuous outcome variable. I ran the analysis and got results; however, I also got this warning message:
Warning in terms.formula(formula, data = data) :
'varlist' has changed (from nvar=3) to new 4 after EncodeVars() -- should no longer happen!
my model is:
model <- WGassociation(continuous_variable ~ covariate + covariate + covariate, data = data)
model
I don't know what it means. Should I worry about it, or can I ignore it?
Can you please help me?
This warning message comes from glm, which is used by SNPassoc::WGassociation; see this line on GitHub.
The warning is telling you that glm is dropping a variable because it is a linear combination of other variables already in the model.
To reproduce this warning, try the example below:
# data
x <- mtcars[, 1:4]
# run model, all good
glm(mpg ~ ., data = x)
# Call: glm(formula = mpg ~ ., data = x)
#
# Coefficients:
# (Intercept)          cyl         disp           hp
#    34.18492     -1.22742     -0.01884     -0.01468
#
# Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
# Null Deviance: 1126
# Residual Deviance: 261.4 AIC: 168
Now add a redundant combo variable that is constructed from existing variables.
# make a combo var
cyldisp <- x$cyl + x$disp
# run model with combo var, now we get the warning
glm(mpg ~ . + cyldisp, data = x)
# Call: glm(formula = mpg ~ . + cyldisp, data = x)
#
# Coefficients:
# (Intercept)          cyl         disp           hp      cyldisp
#    34.18492     -1.22742     -0.01884     -0.01468           NA
#
# Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
# Null Deviance: 1126
# Residual Deviance: 261.4 AIC: 168
# Warning message:
# In terms.formula(formula, data = data) :
# 'varlist' has changed (from nvar=4) to new 5 after EncodeVars() -- should no longer happen!
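As an extra illustration (not part of the original answer), base R's alias() can pin down exactly which linear dependency caused the drop; a sketch on the same mtcars subset:

```r
x <- mtcars[, 1:4]            # mpg, cyl, disp, hp
cyldisp <- x$cyl + x$disp     # redundant linear combination
fit <- lm(mpg ~ . + cyldisp, data = x)
# The "Complete" component lists aliased terms and the combination
# of other predictors they duplicate (here: cyldisp = cyl + disp)
alias(fit)$Complete
```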

designing custom model objects in R

I've coded up an estimator in R and tried to follow standard R conventions. It goes something like this:
model <- myEstimator(y ~ x1 + x2, data = df)
model has the usual stuff: coefficients, standard errors, p-values, etc.
Now I want model to play nicely with the R ecosystem for summarizing models, like summary() or sjPlot::plot_model() or stargazer::stargazer(), just the way you might call summary(lm_model) where lm_model is an lm object.
How do I achieve this? Is there a standard protocol? Define a custom S3 class? Or just coerce model to an existing class like lm?
Create an S3 class and implement summary() and other methods for it.
myEstimator <- function(formula, data) {
  result <- list(
    coefficients = 1:3,
    residuals = 1:3
  )
  class(result) <- "myEstimator"
  result
}
model <- myEstimator(y ~ x1 + x2, data = df)
Functions like summary will just call summary.default.
summary(model)
#> Length Class Mode
#> coefficients 3 -none- numeric
#> residuals 3 -none- numeric
If you wish to have your own summary function, implement summary.myEstimator.
summary.myEstimator <- function(object, ...) {
  value <- paste0(
    "Coefficients: ", paste0(object$coefficients, collapse = ", "),
    "; Residuals: ", paste0(object$residuals, collapse = ", ")
  )
  value
}
summary(model)
#> [1] "Coefficients: 1, 2, 3; Residuals: 1, 2, 3"
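The same S3 mechanism covers other generics too. For example, a sketch of a print.myEstimator method, following the same pattern (a hypothetical addition, not part of the original answer; the constructor is repeated so the snippet runs standalone):

```r
# Minimal constructor, as in the answer above (arguments are ignored)
myEstimator <- function(formula, data) {
  result <- list(coefficients = 1:3, residuals = 1:3)
  class(result) <- "myEstimator"
  result
}

# print() dispatches on class, so typing `model` at the console
# will auto-print through this method
print.myEstimator <- function(x, ...) {
  cat("myEstimator fit with", length(x$coefficients), "coefficients\n")
  invisible(x)
}

model <- myEstimator(y ~ x1 + x2, data = NULL)
print(model)
#> myEstimator fit with 3 coefficients
```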
If your estimator is very similar to lm (your model is-a lm), then just add your class to the lm class.
myEstimatorLm <- function(formula, data) {
  result <- lm(formula, data)
  # Some customisation
  result$coefficients <- pmax(result$coefficients, 1)
  class(result) <- c("myEstimatorLm", class(result))
  result
}
model_lm <- myEstimatorLm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
class(model_lm)
#> [1] "myEstimatorLm" "lm"
Now, summary.lm will be used.
summary(model_lm)
#> Call:
#> lm(formula = formula, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.25582 -0.46922 -0.05741 0.45530 1.75599
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00000 0.56344 1.775 0.078 .
#> Sepal.Length 1.77559 0.06441 27.569 < 2e-16 ***
#> Sepal.Width 1.00000 0.12236 8.173 1.28e-13 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.6465 on 147 degrees of freedom
#> Multiple R-squared: 0.8677, Adjusted R-squared: 0.8659
#> F-statistic: 482 on 2 and 147 DF, p-value: < 2.2e-16
You can still implement summary.myEstimatorLm:
summary.myEstimatorLm <- summary.myEstimator
summary(model_lm)
#> [1] "Coefficients: 1, 1.77559254648113, 1; Residuals: ...

Result of glm() for logistic regression

This might be a trivial question but I don't know where to find answers. I'm wondering when using glm() for logistic regression in R, if the response variable Y has factor values 1 or 2, does the result of glm() correspond to logit(P(Y=1)) or logit(P(Y=2))? What if Y has logical values TRUE or FALSE?
Why not just test it yourself?
output_bool <- c(rep(c(TRUE, FALSE), c(25, 75)), rep(c(TRUE, FALSE), c(75, 25)))
output_num <- c(rep(c(2, 1), c(25, 75)), rep(c(2, 1), c(75, 25)))
output_fact <- factor(output_num)
var <- rep(c("unlikely", "likely"), each = 100)
glm(output_bool ~ var, binomial)
#>
#> Call: glm(formula = output_bool ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
glm(output_num ~ var, binomial)
#> Error in eval(family$initialize): y values must be 0 <= y <= 1
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> 1.099 -2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
So, we get the correct answer if we use TRUE and FALSE, an error if we use 1 and 2 as numbers, and the correct result if we use 1 and 2 as a factor with two levels, provided the level representing TRUE comes after the level representing FALSE. In other words, we have to be careful about how our factor levels are ordered, or we will get the wrong result:
output_fact <- factor(output_fact, levels = c("2", "1"))
glm(output_fact ~ var, binomial)
#>
#> Call: glm(formula = output_fact ~ var, family = binomial)
#>
#> Coefficients:
#> (Intercept) varunlikely
#> -1.099 2.197
#>
#> Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
#> Null Deviance: 277.3
#> Residual Deviance: 224.9 AIC: 228.9
(Notice the intercept and coefficient have flipped signs)
Created on 2020-06-21 by the reprex package (v0.3.0)
Testing is good. If you want the documentation, it's in ?binomial (which is the same as ?family):
For the ‘binomial’ and ‘quasibinomial’ families the response can
be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not
having the first level (and hence usually of having the
second level).
As a numerical vector with values between ‘0’ and ‘1’,
interpreted as the proportion of successful cases (with the
total number of cases given by the ‘weights’).
As a two-column integer matrix: the first column gives the
number of successes and the second the number of failures.
It doesn't explicitly say what happens in the logical (TRUE/FALSE) case; for that you have to know that, when coercing logical to numeric values, FALSE → 0 and TRUE → 1.
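A small sketch (not from the original answers) showing how to control which level is treated as "success", since the first factor level is taken as failure: relevel() moves the chosen reference level to the front.

```r
y <- factor(c("no", "yes", "yes", "no"))
levels(y)    # "no" "yes": "yes" (second level) is modelled as success

y2 <- relevel(y, ref = "yes")
levels(y2)   # "yes" "no": now "no" would be modelled as success
```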

How Does R's Logit model handle categorical variables in the stats package?

I am running a logistic regression and I am noticing that each unique character string in my vector is receiving its own parameter. Is R optimizing the prediction of the outcome variable based on each collection of unique values within the vector?
library(stats)
df = as.data.frame(matrix(
  c("a", "a", "b", "c", "c", "b", "a", "a", "b", "b", "c",
    1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
    1, 0, 100, 10, 8, 3, 5, 6, 13, 10, 4,
    "SF", "CHI", "NY", "NY", "SF", "SF", "CHI", "CHI", "SF", "CHI", "NY"),
  ncol = 4))
colnames(df) = c("letter","number1","number2","city")
df$letter = as.factor(df$letter)
df$city = as.factor(df$city)
df$number1 = as.numeric(df$number1)
df$number2 = as.numeric(df$number2)
glm(number1 ~ .,data=df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept)      letterb      letterc      number2       cityNY       citySF
#     1.57191     -0.25227     -0.01424      0.04593     -0.69269     -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city in the example above?
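No answer is recorded here, but the behaviour in question can be inspected directly: model.matrix() shows the dummy (indicator) columns that glm builds for each factor, with the first level absorbed into the intercept as the reference. A self-contained sketch with a made-up city column:

```r
# glm expands a factor into one 0/1 indicator column per non-reference
# level; "CHI" (first level alphabetically) becomes the reference.
d <- data.frame(city = factor(c("CHI", "NY", "SF", "NY")))
model.matrix(~ city, d)
#>   (Intercept) cityNY citySF
#> 1           1      0      0
#> 2           1      1      0
#> 3           1      0      1
#> 4           1      1      0
```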

User-defined function with lapply function

I'm attempting to establish a user-defined function that inputs predetermined variables (independent and dependent) from the active data frame. Let's take the example data frame df below looking at a coin toss outcome as a result of other recorded variables:
> df
  outcome toss    person  hand age
1       H    1      Mary  Left  18
2       T    2     Allen  Left  12
3       T    3       Dom  Left  25
4       T    4 Francesca  Left  42
5       H    5      Mary Right  18
6       H    6     Allen Right  12
7       H    7       Dom Right  25
8       T    8 Francesca Right  42
The df data frame has a binomial response, outcome, which is either heads or tails, and I am going to look at how person, hand, and age might affect this categorical outcome. I plan to use a forward-selection approach, which will test one variable against outcome at a time and then progressively add more.
To keep things simple, I want to be able to identify the response/dependent (e.g., outcome) and predictor/independent (e.g., person, hand) variables before my user-defined function, as such:
> independent<-c('person','hand','age')
> dependent<-'outcome'
Then create my function using the lapply and glm functions:
> test.func <- function(some_data, the_response, the_predictors)
+ {
+   lapply(the_predictors, function(a)
+   {
+     glm(substitute(as.name(the_response) ~ i, list(i = as.name(a))),
+       data = some_data, family = binomial)
+   })
+ }
Yet, when I attempt to run the function with the predetermined vectors, this occurs:
> test.func(df,dependent,independent)
Error in as.name(the_response) : object 'the_response' not found
My expected response would be the following:
models<-lapply(independent,function(x)
+ {
+ glm(substitute(outcome~i,list(i=as.name(x))),data=df,family=binomial)
+ })
> models
[[1]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
Coefficients:
(Intercept)        personDom  personFrancesca       personMary
  1.489e-16       -1.799e-16        1.957e+01       -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
**End Snippet**
As you can tell, using lapply and glm, I have created 3 simple models without all of the extra work of doing each individually. You may be asking: why create a user-defined function when you have simple code right there? I plan to run a while or repeat loop, and a function will reduce clutter.
Thank you for your assistance.
I know code-only answers are deprecated, but I thought you were almost there and could just use the nudge to use the formula function (and to include the_response in the substitution):
test.func <- function(some_data, the_response, the_predictors)
{
  lapply(the_predictors, function(a)
  {
    print(form <- formula(substitute(resp ~ i,
      list(resp = as.name(the_response), i = as.name(a)))))
    glm(form, data = some_data, family = binomial)
  })
}
Test:
> test.func(df,dependent,independent)
outcome ~ person
<environment: 0x7f91a1ba5588>
outcome ~ hand
<environment: 0x7f91a2b38098>
outcome ~ age
<environment: 0x7f91a3fad468>
[[1]]
Call: glm(formula = form, family = binomial, data = some_data)
Coefficients:
(Intercept)        personDom  personFrancesca       personMary
  8.996e-17       -1.540e-16        1.957e+01       -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = form, family = binomial, data = some_data)
#snipped
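Since the stated goal is forward selection, one possible follow-up (a sketch, not from the original answer; mtcars stands in for df, which is not reproducible here, with am as the binary response) is to compare the one-predictor models by AIC and pick the first variable to enter:

```r
independent <- c("cyl", "hp", "wt")   # stand-in predictors

# Fit one single-predictor logistic model per candidate variable;
# reformulate() builds am ~ <predictor> from the character name
models <- lapply(independent, function(a) {
  glm(reformulate(a, response = "am"), data = mtcars, family = binomial)
})
names(models) <- independent

aics <- sapply(models, AIC)
best <- names(which.min(aics))        # candidate to add first
```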
