User-defined function with lapply function - r

I'm attempting to establish a user-defined function that inputs predetermined variables (independent and dependent) from the active data frame. Let's take the example data frame df below looking at a coin toss outcome as a result of other recorded variables:
> df
outcome toss person hand age
1 H 1 Mary Left 18
2 T 2 Allen Left 12
3 T 3 Dom Left 25
4 T 4 Francesca Left 42
5 H 5 Mary Right 18
6 H 6 Allen Right 12
7 H 7 Dom Right 25
8 T 8 Francesca Right 42
The dfdata frame has a binomial response outcome being either heads or tails and I am going to look at how person,hand, and age might affect this categorical outcome. I plan to use a forward-selection approach which will test one variable against toss and then progress to add more.
As to keep things simple, I want to be able to identify the response/dependent (e.g., outcome) and predictor/independent (e.g., person,hand) variables before my user-defined function as such:
> independent<-c('person','hand','age')
> dependent<-'outcome'
Then create my function using the lapply and glm functions:
> test.func<-function(some_data,the_response,the_predictors)
+ {
+ lapply(the_predictors,function(a)
+ {
+ glm(substitute(as.name(the_response)~i,list(i=as.name(a))),data=some_data,family=binomial)
+ })
+ }
Yet, when I attempt to run the function with the predetermined vectors, this occurs:
> test.func(df,dependent,independent)
Error in as.name(the_response) : object 'the_response' not found
My expected response would be the following:
models<-lapply(independent,function(x)
+ {
+ glm(substitute(outcome~i,list(i=as.name(x))),data=df,family=binomial)
+ })
> models
[[1]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
Coefficients:
(Intercept) personDom personFrancesca personMary
1.489e-16 -1.799e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
**End Snippet**
As you can tell, using lapply and glm, I have created 3 simple models without all of the extra work doing it individually. You may be asking why create a user-defined function when you have simple code right there? I plan to run a while or repeat loop and it will decrease clutter.
Thank you for your assistance

I know code only answers are deprecated but I thought you were almost there and could just use the nudge to use the formula function (and to include 'the_response in the substitution):
test.func<-function(some_data,the_response,the_predictors)
{
lapply(the_predictors,function(a)
{print( form<- formula(substitute(resp~i,
list(resp=as.name(the_response), i=as.name(a)))))
glm(form, data=some_data,family=binomial)
})
}
Test:
> test.func(df,dependent,independent)
outcome ~ person
<environment: 0x7f91a1ba5588>
outcome ~ hand
<environment: 0x7f91a2b38098>
outcome ~ age
<environment: 0x7f91a3fad468>
[[1]]
Call: glm(formula = form, family = binomial, data = some_data)
Coefficients:
(Intercept) personDom personFrancesca personMary
8.996e-17 -1.540e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = form, family = binomial, data = some_data)
#snipped

Related

Is there a way of capturing variables in glm function in R?

I would like to store indicator variables to be used in modelling into a vector to use them in a function. The goal is to use the created function to fit more than one model. I tried it as below but did not seem to get it right.
# Load data (from gtsummary package)
data(trial, package = "gtsummary")
# Specify the covariates
indicatorvars <- c("trt", "grade")
# Function
modelfunc <- function(outcome, covarates) {
glm({{outcome}} ~ {{covarates}},
data = trial, family = binomial)
}
modelfunc("response" , "indiatorvars")
# This one works
#glm(response ~ trt + grade, data = trial, family = binomial)
You can first build up your formula as a character string before converting it using as.formula. So the function would look like this:
modelfunc <- function(outcome, covarates) {
form <- paste(outcome, "~", paste(covarates, collapse = " + "))
glm(as.formula(form),
data = trial, family = binomial)
}
And here is your example:
modelfunc("response", indicatorvars)
#>
#> Call: glm(formula = as.formula(form), family = binomial, data = trial)
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)
What I don't yet like about this solution is that the call is not very informative. So I would slightly adapt the function:
modelfunc <- function(outcome, covarates) {
form <- paste(outcome, "~", paste(covarates, collapse = " + "))
out <- glm(as.formula(form),
data = trial, family = binomial)
out$call <- out$formula # replace call with formula
out
}
modelfunc("response", indicatorvars)
#>
#> Call: response ~ trt + grade
#>
#> Coefficients:
#> (Intercept) trtDrug B gradeII gradeIII
#> -0.87870 0.19435 -0.06473 0.08217
#>
#> Degrees of Freedom: 192 Total (i.e. Null); 189 Residual
#> (7 observations deleted due to missingness)
#> Null Deviance: 240.8
#> Residual Deviance: 240.3 AIC: 248.3
Created on 2021-04-27 by the reprex package (v2.0.0)

How to write a for loop which creates a model and has a function which references that same model

I am trying to run a post hoc analysis on an unbalanced two way anova using the anova_test funciton in the rstatix package. I need to run this post hoc test iteratively, as I have ~26 response (y) variables. My first step is to create models of all my y variables with relation to group and treatment. I have successfully managed to do this, creating a single list with 26 models:
models <- map(data[,y1:y26], ~(lm(.x ~data$group*data$treatment)))
Now comes the part I'm stuck on. Referring to these models iteratively. I would like to run the following code for every y variable I have:
group_by(group) %>%
anova_test(y ~ treatment, error = models(y), type = 3)
where my y changes every time and as it does, the "model" (referred to in the error = term) is updated accordingly. I'm struggling with this bit since first set of models I make is used to inform the second set of models.
However, if I run just one y variable through this whole bit of code at one time, I get the appropriate results.
model <- lm(y ~ group*treatment, data = data)
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
I have tried creating a for loop as well as using the map function in the purrr package but I have been unsuccessful. I am new to for loops and purrr so I am sure it's a simple fix I just can't see it.
Basically I want a way to run
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
iteratively for different y variables (y1, y2, ..., y26) while also referring to the approprite model (model$y1, model$y2, ..., model$26).
Thanks for your help!
Well you didn't give any data so let's use toothgrowth. You seem to like the model format, so let's build a list of models. You could do this in an automated fashion but to make it clear lets do it by hand. The call purrr::map with the anova_test function. You'll get a list back. Since you're in charge of naming the list elements go to town.
Updated answer May 18th. Now using map2 since you want two different models passed build a list for each...
library(rstatix)
library(purrr)
ToothGrowth$len2 <- ToothGrowth$len^2 # for variety
models <- list(model1 = lm(len ~ supp*dose, ToothGrowth),
model2 = lm(len ~ dose*supp, ToothGrowth),
model3 = lm(len2 ~ dose*supp, ToothGrowth),
model4 = lm(len2 ~ supp*dose, ToothGrowth))
models2 <- list(model1 = lm(len ~ supp, ToothGrowth),
model2 = lm(len ~ dose, ToothGrowth),
model3 = lm(len2 ~ dose, ToothGrowth),
model4 = lm(len2 ~ supp, ToothGrowth))
# one model
purrr::map(models, ~ anova_test(.x, type = 3))
# now with model for error term
purrr::map2(models, models2, ~ anova_test(.x, error = .y, type = 3))
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> $model1
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 4.058 0.049000 * 0.065
#> 2 dose 1 58 12.717 0.000734 * 0.180
#> 3 supp:dose 1 58 1.588 0.213000 0.027
#>
#> $model2
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 33.626 2.92e-07 * 0.367
#> 2 supp 1 58 10.729 2.00e-03 * 0.156
#> 3 dose:supp 1 58 4.200 4.50e-02 * 0.068
#>
#> $model3
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 36.028 1.35e-07 * 0.383
#> 2 supp 1 58 7.128 1.00e-02 * 0.109
#> 3 dose:supp 1 58 2.709 1.05e-01 0.045
#>
#> $model4
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 2.684 0.107000 0.044
#> 2 dose 1 58 13.566 0.000508 * 0.190
#> 3 supp:dose 1 58 1.020 0.317000 0.017
Thanks to Nirgrahamuk from the rstudio community forum for this answer:
map(names(models_1) ,
~ anova_test(data=group_by(df,edge),
formula = as.formula(paste0(.x,"~ trt")),
error = models_1[[.x]],
type = 3))
(see their full answer at: https://community.rstudio.com/t/trouble-using-group-by-and-map2-together/66730/8?u=mvula)
Created on 2020-05-20 by the reprex package (v0.3.0)

How Does R's Logit model handle categorical variables in the stats package?

I am running a logistic regression and I am noticing that each unique character string in my vector is receiving its own parameter. Is R optimizing the prediction on the outcome variable based each collection of unique values within the vector?
library(stats)
df = as.data.frame( matrix(c("a","a","b","c","c","b","a","a","b","b","c",1,0,0,0,1,0,1,1,0,1,0,1,0,100,10,8,3,5,6,13,10,4,"SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"), ncol = 4))
colnames(df) = c("letter","number1","number2","city")
df$letter = as.factor(df$letter)
df$city = as.factor(df$city)
df$number1 = as.numeric(df$number1)
df$number2 = as.numeric(df$number2)
glm(number1 ~ .,data=df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept) letterb letterc number2 cityNY citySF
#1.57191 -0.25227 -0.01424 0.04593 -0.69269 -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city in the example above?

Regression in R using poly() function

The function poly() in R is used in order to produce orthogonal vectors and can be helpful to interpret coefficient significance. However, I don't see the point of using it for prediction. To my view, the two following model (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But it doesn't. Why?
Because they are not the same model. Your second one has one unique covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should use the I() function to modify one parameter inside your formula in order the regression to consider it as a covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
will give the same prediction
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930

How do I use vector values as variables in R

I have a dataframe called repay and I have created a vector for the variables names of the variables I am interested in called variables.
variables<-names(repay)[22:36]
I want to write a for loop that does some univariate analysis on each of the variables in variables. For example:
for (i in 1:length(variables))
{
model<-glm(Successful~ variables[i]
,data=repay
,family=binomial(link='logit'))
}
However it doesn't recognize variables[i] as a variable, giving the following error message:
Error in model.frame.default(formula = Successful ~ variables[i], data
= repay, : variable lengths differ (found for 'variables[i]')
Try using the formula function in R. It will allow correct interpretation of models as below:
for (i in 1:length(variables){
myglm <- glm(formula(paste("Successful", "~", variables[i])),
data = repay, family = binomial(link = 'logit'))
See my post here for more things you can do in this context.
Alternatively you can use assign yielding in as many models as the variables.
Let us consider
repay<-data.table(Successful=runif(10),a=sample(10),b=sample(10),c=runif(10))
variables<-names(repay)[2:4]
yielding:
>repay
Successful a b c
1: 0.8457686 7 9 0.2930537
2: 0.4050198 6 6 0.5948573
3: 0.1994583 2 8 0.4198423
4: 0.1471735 1 5 0.5906494
5: 0.7765083 8 10 0.7933327
6: 0.6503692 9 4 0.4262896
7: 0.2449512 4 1 0.7311928
8: 0.6754966 3 3 0.4723299
9: 0.7792951 10 7 0.9101495
10: 0.6281890 5 2 0.9215107
Then you can perform the loop
for (i in 1:length(variables)){
assign(paste0("model",i),eval(parse(text=paste("glm(Successful~",variables[i],",data=repay,family=binomial(link='logit'))"))))
}
resulting in 3 objects: model1,model2 and model3.
>model1
Call: glm(formula = Successful ~ a, family = binomial(link = "logit"),
data = repay)
Coefficients:
(Intercept) a
-0.36770 0.05501
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 5.752
Residual Deviance: 5.69 AIC: 17.66
Idem for model2, model3 et.c.
You could create a language object from a string,
var = "cyl"
lm(as.formula(sprintf("mpg ~ %s", var)), data=mtcars)
# alternative (see also substitute)
lm(bquote(mpg~.(as.name(var))), data=mtcars)
Small workaround that might help
for (i in 22:36)
{
ivar <- repay[i] #choose variable for running the model
repay2 <- data.frame(Successful= repay$Successful, ivar) #create new data frame with 2 variables only for running the model
#run model for new data frame repay2
model<-glm(Successful~ ivar
,data=repay2
,family=binomial(link='logit'))
}

Resources