User-defined function to iterate through factor levels in a regression

User-defined function to iterate through factor levels in a regression - r

I am a beginner in R so I'm sorry if my question is basic and has been answered somewhere else but unfortunately I could not find the answer.
One of my predictor variables, nationality, has 8 levels.
I want to create a user defined function that loops through each level in my variable nationality, taking one level per regression. I created a list of the levels of the variable nationalityas such:
mylist <- list("bangladeshian", "british", "filipino", "indian",
"indonesian", "nigerian", "pakistani", "spanish")
then created a user defined function:
f1 <- function(x) {
l <- summary(glm(smoke ~ I(nationality == mylist[x]),
data=df.subpop, family=binomial(link="probit")))
print(l)
}
f1(2)
f1(2) gives this output:
Call:
glm(formula = smoke ~ I(nationality == mylist[x]),
family = binomial(link = "probit"), data = df.subpop)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.629 -0.629 -0.629 -0.629 1.853
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9173 0.1659 -5.530 3.21e-08 ***
I(nationality == mylist[x])TRUE -4.2935 376.7536 -0.011 0.991
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 73.809 on 78 degrees of freedom
Residual deviance: 73.416 on 77 degrees of freedom
AIC: 77.416
Number of Fisher Scoring iterations: 14
As you can see, the coefficient for nationality is "I(nationality == mylist[x])TRUE"
which is not very informative and requires the user to refer back to the line of code
f1(2) and also to mylist to understand the level that that coefficient represents. I believe there should be a cleaner and more straightforward way to do this and accurately run a regression for each level without having to call f1() 8 times.

Consider dynamically building formula with as.formula or reformulate:
nationality_levels <- levels(df.subpop$nationality)
f1 <- function(x) {
# BUILD FORMULA (EQUIVALENT CALLS)
f <- as.formula(paste0("smoke ~ I(nationality == '", x, "')"))
f <- reformulate(paste0("I(nationality == '", x, "')"), "smoke")
l <- summary(
glm(f, data=df.subpop, family=binomial(link="probit"))
)
}
reg_list <- lapply(nationality_levels, f1)
reg_list

Related

How I pass string inside R function

I have a df and I would like to do a function with the names of header and return linear models.
I'm trying this:
a <- function(j,k){
reg1 <- lm(data$j ~ data$k)
summary(reg1)
}
a(j="hour",k="score")
It's NULL for 'data$j'

You cannot use $ when passing column name as variable. Here are couple of ways in which you can do this.
Use reformulate to create a formula object
a <- function(data, j,k){
reg1 <- lm(reformulate(k, j), data = data)
summary(reg1)
}
lm also accepts formula as string so you don't necessarily need to convert it into formula object.
a <- function(data, j,k){
reg1 <- lm(sprintf('%s~%s', j, k), data = data)
summary(reg1)
}
You can call this as :
a(mtcars, 'mpg', 'cyl')
#Call:
#lm(formula = sprintf("%s~%s", j, k), data = data)
#Residuals:
# Min 1Q Median 3Q Max
#-4.9814 -2.1185 0.2217 1.0717 7.5186
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
#cyl -2.8758 0.3224 -8.92 6.11e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 3.206 on 30 degrees of freedom
#Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
#F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Note that I added data as an additional argument in the function. It is generally a better practice to pass data object in the function rather than relying it to be evaluated in global environment.

Try this. If you are going tu use strings which are as variables in a dataframe, it is better to invoke them in a function using [[]]. Here the code of your function with slight changes:
a <- function(j,k){
reg1 <- lm(data[[j]] ~ data[[k]])
summary(reg1)
}
a(j="hour",k="score")
And a small example using iris dataset:
#Example
data=iris
#Code
a(j="Sepal.Length",k="Petal.Length")
You can further tune your function a if needed.

Dynamic variable names in R regressions

Being aware of the danger of using dynamic variable names, I am trying to loop over varios regression models where different variables specifications are choosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y= runif(1000)
x1 = runif(1000)
x2 = runif(1000)
df2= data.frame(y,x1,x2)
summary(lm(y ~ x1+x2, data=df2)) ## works
var = "x1"
summary(lm(y ~ !!rlang::sym(var)) +x2, data=df2) # gives an error
My understanding was that !!rlang::sym(var)) takes the values of var (namely x1) and puts that in the code in a way that R thinks this is a variable (not a char). BUt I seem to be wrong. Can anyone enlighten me?

Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-0.49298 -0.26248 -0.00046 0.24111 0.51988
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.50244 0.02480 20.258 <2e-16 ***
#x1 -0.01468 0.03161 -0.464 0.643
#x2 -0.01635 0.03227 -0.507 0.612
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared: 0.0004708, Adjusted R-squared: -0.001534
#F-statistic: 0.2348 on 2 and 997 DF, p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).

The bang-bang operator !! only works with "tidy" functions. It's not a part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap those in functions that can do the expansion. rlang::expr is one such example
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.49178 -0.25482 0.00027 0.24566 0.50730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4953683 0.0242949 20.390 <2e-16 ***
# x1 -0.0006298 0.0314389 -0.020 0.984
# x2 -0.0052848 0.0318073 -0.166 0.868
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared: 2.796e-05, Adjusted R-squared: -0.001978
# F-statistic: 0.01394 on 2 and 997 DF, p-value: 0.9862
You can see this version preserves the expanded formula in the model object.

1) Just use lm(df2) or if lm has additional columns beyond what is shown in the question but we just want to regress on x1 and x2 then
df3 <- df2[c("y", var, "x2")]
lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the formula into it as in the second line:
fm <- lm(df3)
fm$call <- formula(model.frame(df3))
Either one gives this:
> fm
Call:
lm(formula = y ~ x1 + x2, data = df3)
Coefficients:
(Intercept) x1 x2
0.44752 0.04278 0.05011
2) character string lm accepts a character string for the formula so this also works. The fn$ causes substitution to occur in the character arguments.
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)

negative binomial model with interaction in R

i have fit negative binomial model to my data as follows:
> ngbinmodel <- glm.nb( seizure.rate ~ age + treatment, data = epilepsy_reduced)
> summary(ngbinmodel)
Call:
glm.nb(formula = seizure.rate ~ age + treatment, data = epilepsy_reduced,
init.theta = 1.498983674, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3510 -0.8790 -0.4563 0.4328 1.8916
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0985089 0.5845392 3.590 0.000331 ***
age -0.0007965 0.0193064 -0.041 0.967092
treatment -0.5011593 0.2405658 -2.083 0.037228 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.499) family taken to be 1)
Null deviance: 71.217 on 57 degrees of freedom
Residual deviance: 66.875 on 55 degrees of freedom
AIC: 341.12
Number of Fisher Scoring iterations: 1
Theta: 1.499
Std. Err.: 0.362
2 x log-likelihood: -333.119
Now I would like to check if i should include the interaction effect between age and treatment. I have found two methods to do it:
> intearaction_nbm<-addterm(ngbinmodel, . ~ . * age,test="Chisq")
> summary(intearaction_nbm)
Df AIC LRT Pr(Chi)
Min. :1 Min. :339.1 Min. :0.9383 Min. :0.3327
1st Qu.:1 1st Qu.:339.4 1st Qu.:0.9383 1st Qu.:0.3327
Median :1 Median :339.6 Median :0.9383 Median :0.3327
Mean :1 Mean :339.6 Mean :0.9383 Mean :0.3327
3rd Qu.:1 3rd Qu.:339.9 3rd Qu.:0.9383 3rd Qu.:0.3327
Max. :1 Max. :340.2 Max. :0.9383 Max. :0.3327
NA's :1 NA's :1 NA's :1
and
> ngbinmodel_int <- glm.nb( seizure.rate ~ age*treatment, data = epilepsy_reduced)
> summary(ngbinmodel_int)
glm.nb(formula = seizure.rate ~ age * treatment, data = epilepsy_reduced,
init.theta = 1.531539174, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3503 -0.8742 -0.3848 0.3403 1.8508
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.51361 0.83920 1.804 0.0713 .
age 0.01914 0.02826 0.677 0.4981
treatment 0.60748 1.12199 0.541 0.5882
age:treatment -0.03893 0.03850 -1.011 0.3119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.5315) family taken to be 1)
Null deviance: 72.238 on 57 degrees of freedom
Residual deviance: 66.874 on 54 degrees of freedom
AIC: 342.18
Number of Fisher Scoring iterations: 1
Theta: 1.532
Std. Err.: 0.373
2 x log-likelihood: -332.180
I was expecting to obtain the same result from both of the methods.
How can i access the regression estimates of intearaction_nbm?
why are the outcomes different? According to intearaction_nbm i should include the interaction term (the AIC is lower) but according to ngbinmodel_int i should not include the interaction term (AIC increases).
would discretizing my continuous variable age be advised?

Remark: You should move this post to cross validated.
How can i access the regression estimates of intearaction_nbm?
intearaction_nbm gives you the result of the addition of single term to your model, if you print it, you will have a row per possible additional term (age:treatment, age:another_variable, etc.) giving you the AIC and P-value among other things.
why are the outcomes different?
Not possible to answer without the data, but what I would do is define both models and compare their AIC using: AIC(model_1, model_2). This way I am sure that I am comparing the same quantity. As you know, the AIC is defined up to an additional term, and unless you check how it is computed, you cannot be sure that two different functions in two different packages use the same definition.
would discretizing my continuous variable age be advised?
Not possible to answer without the data...

Let us consider the dataset quine and the following model with only main effects for Eth and Lrn factors:
library(MASS)
negbin_no_int <- glm.nb(Days ~ Eth + Lrn, data = quine)
summary(negbin_no_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 3.0367 0.1334 22.764 < 2e-16 ***
# EthN -0.5520 0.1597 -3.457 0.000546 ***
# LrnSL 0.0388 0.1611 0.241 0.809661
extractAIC(negbin_no_int)
# [1] 3.000 1112.576
The model with the interaction term between the two factor is:
negbin_with_int <- glm.nb(Days ~ Eth * Lrn, data = quine)
summary(negbin_with_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.9218 0.1503 19.446 <2e-16 ***
# EthN -0.3374 0.2100 -1.607 0.108
# LrnSL 0.2929 0.2307 1.269 0.204
# EthN:LrnSL -0.4956 0.3201 -1.549 0.122
extractAIC(negbin_with_int)
# [1] 4.000 1112.196
The statistical significance of the interaction term is p=0.122.
Now we compare the two models using addterm:
interaction_nbm <- addterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
print(interaction_nbm)
# Model:
# Days ~ Eth + Lrn
# Df AIC LRT Pr(Chi)
# <none> 1112.6
# Eth:Lrn 1 1112.2 2.3804 0.1229
The AICs given by addterm are the same calculated using extractAIC.
If you want to see the regression estimates of addterm, you can add a summary(print(nfit)) inside the function, as follows:
myaddterm <- function (object, scope, scale = 0, test = c("none", "Chisq"),
k = 2, sorted = FALSE, trace = FALSE, ...)
{
if (missing(scope) || is.null(scope))
stop("no terms in scope")
if (!is.character(scope))
scope <- add.scope(object, update.formula(object, scope))
if (!length(scope))
stop("no terms in scope for adding to object")
ns <- length(scope)
ans <- matrix(nrow = ns + 1L, ncol = 2L, dimnames = list(c("<none>",
scope), c("df", "AIC")))
ans[1L, ] <- extractAIC(object, scale, k = k, ...)
n0 <- nobs(object, use.fallback = TRUE)
env <- environment(formula(object))
for (i in seq_len(ns)) {
tt <- scope[i]
if (trace) {
message(gettextf("trying + %s", tt), domain = NA)
utils::flush.console()
}
nfit <- update(object, as.formula(paste("~ . +", tt)),
evaluate = FALSE)
nfit <- try(eval(nfit, envir = env), silent = TRUE)
print(summary(nfit))
ans[i + 1L, ] <- if (!inherits(nfit, "try-error")) {
nnew <- nobs(nfit, use.fallback = TRUE)
if (all(is.finite(c(n0, nnew))) && nnew != n0)
stop("number of rows in use has changed: remove missing values?")
extractAIC(nfit, scale, k = k, ...)
}
else NA_real_
}
dfs <- ans[, 1L] - ans[1L, 1L]
dfs[1L] <- NA
aod <- data.frame(Df = dfs, AIC = ans[, 2L])
o <- if (sorted)
order(aod$AIC)
else seq_along(aod$AIC)
test <- match.arg(test)
if (test == "Chisq") {
dev <- ans[, 2L] - k * ans[, 1L]
dev <- dev[1L] - dev
dev[1L] <- NA
nas <- !is.na(dev)
P <- dev
P[nas] <- MASS:::safe_pchisq(dev[nas], dfs[nas], lower.tail = FALSE)
aod[, c("LRT", "Pr(Chi)")] <- list(dev, P)
}
aod <- aod[o, ]
head <- c("Single term additions", "\nModel:", deparse(formula(object)))
if (scale > 0)
head <- c(head, paste("\nscale: ", format(scale), "\n"))
class(aod) <- c("anova", "data.frame")
attr(aod, "heading") <- head
aod
}
interaction_nbm1 <- myaddterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
The output is:
Call:
glm.nb(formula = Days ~ Eth + Lrn + Eth:Lrn, data = quine, init.theta = 1.177546225,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5770 -1.0470 -0.3645 0.3521 2.7227
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.9218 0.1503 19.446 <2e-16 ***
EthN -0.3374 0.2100 -1.607 0.108
LrnSL 0.2929 0.2307 1.269 0.204
EthN:LrnSL -0.4956 0.3201 -1.549 0.122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.1775) family taken to be 1)
Null deviance: 182.93 on 145 degrees of freedom
Residual deviance: 168.18 on 142 degrees of freedom
AIC: 1114.2
Number of Fisher Scoring iterations: 1
Theta: 1.178
Std. Err.: 0.146
2 x log-likelihood: -1104.196

Successively removing predictor variable from formula

I have a model formula in the form of
model.all <- lme(Response ~ A + B + C)
I would like to update this model by successively removing a predictor variable from the model, so I would end up with 3 models, specifically:
mod.1 <- lme(Response ~ B + C) ; mod.2 <- lme(Response ~ A + C) ; mod.3 <- lme(Response ~ A + B)
I am thinking of a loop function, so I am aware of the update function, but I have too many predictor variables to manually change the code.
Any suggestions would be appreciated.

I would use combn in this occasion, see the example below:
Example Data
Response <- runif(100)
A <- runif(100)
B <- runif(100)
C <- runif(100)
Solution
a <- c('A','B','C') #the names of your variables
b <- as.data.frame(combn(a,2)) #two-way combinations of those using combn
#create the formula for each model
my_forms <- sapply(b, function(x) paste('Response ~ ', paste(x,collapse=' + ')))
> my_forms #the formulas that will be used in the model
V1 V2 V3
"Response ~ A + B" "Response ~ A + C" "Response ~ B + C"
#run each model
my_models <- lapply(my_forms, function(x) lm(as.formula(x)))
Output
> summary(my_models[[1]])
Call:
lm(formula = as.formula(x))
Residuals:
Min 1Q Median 3Q Max
-0.48146 -0.20745 -0.00247 0.24263 0.58341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32415 0.08232 3.938 0.000155 ***
A 0.25404 0.09890 2.569 0.011733 *
B 0.07955 0.10129 0.785 0.434141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2828 on 97 degrees of freedom
Multiple R-squared: 0.06507, Adjusted R-squared: 0.04579
F-statistic: 3.375 on 2 and 97 DF, p-value: 0.03827
As you can see each model is saved in as a list element in my_models. I find this quite easy to make and run.

Repeating univariable GLM with several different variables

I have an outcome variable, say Y and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (ie Y~X1,...,Y~X20) and then do a likelihood ratio test for each model. Finally I would like to create a table the has the resulting P value from the likelihood test for each model.
From what I have seen the lapply function and split function could be useful for this but I don't really understand how they work in the examples I've seen.
This is what I tried at first:
> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) :
variable lengths differ (found for 'VarNames')
I'm not sure if that was a good approach though.

It is easier to answer your questions if you provide a minimal example.
One way to go - but certainly not the most beautiful - is to use paste to create the formulas as a vector of strings and then use lapply on them. The Code for this could look like this:
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')
formulas <- paste('intBT ~', var.names)
fitted.models <- lapply(formulas, glm, data=example.data)
This gives a list of fitted model. You can then use the apply functions on fitted.models to execute further tests.

Like Paul said it really helps if you provide a minimal example, but I think this does what you want.
set.seed(123)
N <- 100
num_vars <- 5
df <- data.frame(lapply(1:num_vars, function(i) i = rnorm(N)))
names(df) <- c(paste0(rep("X",5), 1:num_vars ))
e <- rnorm(N)
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)
pvalues <- vector(mode = "list")
singlevar <- function(var, y, df){
model <- as.formula(paste0("y ~ ", var))
pvalues[var] <- coef(summary(glm(model, family = "binomial", data = df)))[var,4]
}
sapply(colnames(df), singlevar, y, df)
X1 X2 X3 X4 X5
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01
For comparison:
Call:
glm(formula = y ~ X2, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0674 -0.8211 -0.5296 0.9218 2.5463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5591 0.2375 -2.354 0.0186 *
X2 1.2871 0.3142 4.097 4.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 130.68 on 99 degrees of freedom
Residual deviance: 106.24 on 98 degrees of freedom
AIC: 110.24
Number of Fisher Scoring iterations: 4

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

User-defined function to iterate through factor levels in a regression - r

Related

How I pass string inside R function

Dynamic variable names in R regressions

negative binomial model with interaction in R

Successively removing predictor variable from formula

Repeating univariable GLM with several different variables

Categories

Resources