I have a dataframe called repay and I have created a vector for the variables names of the variables I am interested in called variables.
variables<-names(repay)[22:36]
I want to write a for loop that does some univariate analysis on each of the variables in variables. For example:
for (i in 1:length(variables))
{
model<-glm(Successful~ variables[i]
,data=repay
,family=binomial(link='logit'))
}
However it doesn't recognize variables[i] as a variable, giving the following error message:
Error in model.frame.default(formula = Successful ~ variables[i], data
= repay, : variable lengths differ (found for 'variables[i]')
Try using the formula function in R. It will allow correct interpretation of models as below:
for (i in 1:length(variables){
myglm <- glm(formula(paste("Successful", "~", variables[i])),
data = repay, family = binomial(link = 'logit'))
See my post here for more things you can do in this context.
Alternatively you can use assign yielding in as many models as the variables.
Let us consider
repay<-data.table(Successful=runif(10),a=sample(10),b=sample(10),c=runif(10))
variables<-names(repay)[2:4]
yielding:
>repay
Successful a b c
1: 0.8457686 7 9 0.2930537
2: 0.4050198 6 6 0.5948573
3: 0.1994583 2 8 0.4198423
4: 0.1471735 1 5 0.5906494
5: 0.7765083 8 10 0.7933327
6: 0.6503692 9 4 0.4262896
7: 0.2449512 4 1 0.7311928
8: 0.6754966 3 3 0.4723299
9: 0.7792951 10 7 0.9101495
10: 0.6281890 5 2 0.9215107
Then you can perform the loop
for (i in 1:length(variables)){
assign(paste0("model",i),eval(parse(text=paste("glm(Successful~",variables[i],",data=repay,family=binomial(link='logit'))"))))
}
resulting in 3 objects: model1,model2 and model3.
>model1
Call: glm(formula = Successful ~ a, family = binomial(link = "logit"),
data = repay)
Coefficients:
(Intercept) a
-0.36770 0.05501
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 5.752
Residual Deviance: 5.69 AIC: 17.66
Idem for model2, model3 et.c.
You could create a language object from a string,
var = "cyl"
lm(as.formula(sprintf("mpg ~ %s", var)), data=mtcars)
# alternative (see also substitute)
lm(bquote(mpg~.(as.name(var))), data=mtcars)
Small workaround that might help
for (i in 22:36)
{
ivar <- repay[i] #choose variable for running the model
repay2 <- data.frame(Successful= repay$Successful, ivar) #create new data frame with 2 variables only for running the model
#run model for new data frame repay2
model<-glm(Successful~ ivar
,data=repay2
,family=binomial(link='logit'))
}
Related
I've been working recently with CDISC data that is structured with pre-specified column names and certain expectations for the way survival data are coded.
I want to write a wrapper for survival::Surv() that uses the structured data of the CDISC format. I have a function that is working in most scenarios, but I can't get it to work with survival::coxph().
How can I get my Surv() wrapper function to use default values and work in coxph()? Below are examples using the visR::adtte data set (data set ships with the visR package...install with devtools::install_github("openpharma/visR")), which is in CDISC format. All examples run without issue except the last one.
Surv_CDISC <- function(AVAL, CNSR) {
# set default values if not passed by user -----------------------------------
if (missing(AVAL) && exists("AVAL", envir = rlang::caller_env()))
AVAL <- get("AVAL", envir = rlang::caller_env())
else if (missing(AVAL))
stop("Default 'AVAL' value not found. Specify argument in `Surv_CDISC(AVAL=)`.")
if (missing(CNSR) && exists("CNSR", envir = rlang::caller_env()))
CNSR <- get("CNSR", envir = rlang::caller_env())
else if (missing(CNSR))
stop("Default 'CNSR' value not found. Specify argument in `Surv_CDISC(CNSR=)`.")
# pass args to `survival::Surv()` --------------------------------------------
survival::Surv(time = AVAL, event = 1 - CNSR)
}
# passing the arguments, everything works
with(visR::adtte, Surv_CDISC(AVAL = AVAL, CNSR = CNSR)) |> head()
#> [1] 2 3 3 28+ 58 46+
# letting the arguments use default value, everything still works
with(visR::adtte, Surv_CDISC()) |> head()
#> [1] 2 3 3 28+ 58 46+
# using function in model.frame() and defining argument values, everything works
model.frame(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte) |> head(n = 2)
#> Surv_CDISC(AVAL, CNSR) SEX
#> 1 2 F
#> 2 3 M
# using function in model.frame() with default arguments, everything works
model.frame(Surv_CDISC() ~ SEX, data = visR::adtte) |> head(n = 2)
#> Surv_CDISC() SEX
#> 1 2 F
#> 2 3 M
# using function in survfit() and defining argument values, everything works
survival::survfit(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#> Call: survfit(formula = Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#>
#> n events median 0.95LCL 0.95UCL
#> SEX=F 143 80 64 47 96
#> SEX=M 111 72 41 30 57
# using function in survfit() with default arguments, everything works
survival::survfit(Surv_CDISC() ~ SEX, data = visR::adtte)
#> Call: survfit(formula = Surv_CDISC() ~ SEX, data = visR::adtte)
#>
#> n events median 0.95LCL 0.95UCL
#> SEX=F 143 80 64 47 96
#> SEX=M 111 72 41 30 57
# using function in coxph() and defining argument values, everything works
survival::coxph(Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#> Call:
#> survival::coxph(formula = Surv_CDISC(AVAL, CNSR) ~ SEX, data = visR::adtte)
#>
#> coef exp(coef) se(coef) z p
#> SEXM 0.3147 1.3699 0.1626 1.935 0.053
#>
#> Likelihood ratio test=3.71 on 1 df, p=0.05412
#> n= 254, number of events= 152
# DOES NOT WORK TRYING TO RELY ON DEFAULT VALUES
survival::coxph(Surv_CDISC() ~ SEX, data = visR::adtte)
#> Error in x[[2]]: subscript out of bounds
Created on 2022-06-05 by the reprex package (v2.0.1)
This looks like a bug in the survival package, or maybe a mis-use of it (I'm not so familiar with the internals).
EDITED TO ADD A COMMENT:
I think the analysis below is wrong. Reading the code more carefully, I think the current code in the survival package won't work reliably unless you use the explicit Surv(AVAL, CNSR) in the formula.
HERE'S THE ANALYSIS THAT APPEARED TO WORK, BUT I DON'T TRUST IT:
The problem is that survival:::terms.inner looks specifically for a function named Surv, here: https://github.com/therneau/survival/blob/b5238a42867a931954cf222b871a7b3a1c2fcd24/R/xtras.R#L65 . Since your function has a different name, it's not handled as if it is the same thing.
You could fix this by naming your function Surv as well. When I do that, things appear to work. Of course, this may cause problems elsewhere when you want the original Surv without the survival:: prefix, but I don't know a way to fix that.
I'd still worry about using caller_env(). Here's how I'd create your fake Surv:
make_surv_CDISC <- function(defaults) {
force(defaults)
function(AVAL = defaults$AVAL,
CNSR = defaults$CNSR) {
# pass args to `survival::Surv()` --------------------------------------------
survival::Surv(time = AVAL, event = 1 - CNSR)
}
}
Surv <- make_surv_CDISC(visR::adtte)
This is less general than yours, but I think it's safer.
I am trying to run a post hoc analysis on an unbalanced two way anova using the anova_test funciton in the rstatix package. I need to run this post hoc test iteratively, as I have ~26 response (y) variables. My first step is to create models of all my y variables with relation to group and treatment. I have successfully managed to do this, creating a single list with 26 models:
models <- map(data[,y1:y26], ~(lm(.x ~data$group*data$treatment)))
Now comes the part I'm stuck on. Referring to these models iteratively. I would like to run the following code for every y variable I have:
group_by(group) %>%
anova_test(y ~ treatment, error = models(y), type = 3)
where my y changes every time and as it does, the "model" (referred to in the error = term) is updated accordingly. I'm struggling with this bit since first set of models I make is used to inform the second set of models.
However, if I run just one y variable through this whole bit of code at one time, I get the appropriate results.
model <- lm(y ~ group*treatment, data = data)
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
I have tried creating a for loop as well as using the map function in the purrr package but I have been unsuccessful. I am new to for loops and purrr so I am sure it's a simple fix I just can't see it.
Basically I want a way to run
data %>%
group_by(group) %>%
anova_test(y ~ treatment, error = model, type = 3)
iteratively for different y variables (y1, y2, ..., y26) while also referring to the approprite model (model$y1, model$y2, ..., model$26).
Thanks for your help!
Well you didn't give any data so let's use toothgrowth. You seem to like the model format, so let's build a list of models. You could do this in an automated fashion but to make it clear lets do it by hand. The call purrr::map with the anova_test function. You'll get a list back. Since you're in charge of naming the list elements go to town.
Updated answer May 18th. Now using map2 since you want two different models passed build a list for each...
library(rstatix)
library(purrr)
ToothGrowth$len2 <- ToothGrowth$len^2 # for variety
models <- list(model1 = lm(len ~ supp*dose, ToothGrowth),
model2 = lm(len ~ dose*supp, ToothGrowth),
model3 = lm(len2 ~ dose*supp, ToothGrowth),
model4 = lm(len2 ~ supp*dose, ToothGrowth))
models2 <- list(model1 = lm(len ~ supp, ToothGrowth),
model2 = lm(len ~ dose, ToothGrowth),
model3 = lm(len2 ~ dose, ToothGrowth),
model4 = lm(len2 ~ supp, ToothGrowth))
# one model
purrr::map(models, ~ anova_test(.x, type = 3))
# now with model for error term
purrr::map2(models, models2, ~ anova_test(.x, error = .y, type = 3))
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> Coefficient covariances computed by hccm()
#> $model1
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 4.058 0.049000 * 0.065
#> 2 dose 1 58 12.717 0.000734 * 0.180
#> 3 supp:dose 1 58 1.588 0.213000 0.027
#>
#> $model2
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 33.626 2.92e-07 * 0.367
#> 2 supp 1 58 10.729 2.00e-03 * 0.156
#> 3 dose:supp 1 58 4.200 4.50e-02 * 0.068
#>
#> $model3
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 dose 1 58 36.028 1.35e-07 * 0.383
#> 2 supp 1 58 7.128 1.00e-02 * 0.109
#> 3 dose:supp 1 58 2.709 1.05e-01 0.045
#>
#> $model4
#> ANOVA Table (type III tests)
#>
#> Effect DFn DFd F p p<.05 ges
#> 1 supp 1 58 2.684 0.107000 0.044
#> 2 dose 1 58 13.566 0.000508 * 0.190
#> 3 supp:dose 1 58 1.020 0.317000 0.017
Thanks to Nirgrahamuk from the rstudio community forum for this answer:
map(names(models_1) ,
~ anova_test(data=group_by(df,edge),
formula = as.formula(paste0(.x,"~ trt")),
error = models_1[[.x]],
type = 3))
(see their full answer at: https://community.rstudio.com/t/trouble-using-group-by-and-map2-together/66730/8?u=mvula)
Created on 2020-05-20 by the reprex package (v0.3.0)
The function poly() in R is used in order to produce orthogonal vectors and can be helpful to interpret coefficient significance. However, I don't see the point of using it for prediction. To my view, the two following model (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But it doesn't. Why?
Because they are not the same model. Your second one has one unique covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should use the I() function to modify one parameter inside your formula in order the regression to consider it as a covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
will give the same prediction
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
I'm attempting to establish a user-defined function that inputs predetermined variables (independent and dependent) from the active data frame. Let's take the example data frame df below looking at a coin toss outcome as a result of other recorded variables:
> df
outcome toss person hand age
1 H 1 Mary Left 18
2 T 2 Allen Left 12
3 T 3 Dom Left 25
4 T 4 Francesca Left 42
5 H 5 Mary Right 18
6 H 6 Allen Right 12
7 H 7 Dom Right 25
8 T 8 Francesca Right 42
The dfdata frame has a binomial response outcome being either heads or tails and I am going to look at how person,hand, and age might affect this categorical outcome. I plan to use a forward-selection approach which will test one variable against toss and then progress to add more.
As to keep things simple, I want to be able to identify the response/dependent (e.g., outcome) and predictor/independent (e.g., person,hand) variables before my user-defined function as such:
> independent<-c('person','hand','age')
> dependent<-'outcome'
Then create my function using the lapply and glm functions:
> test.func<-function(some_data,the_response,the_predictors)
+ {
+ lapply(the_predictors,function(a)
+ {
+ glm(substitute(as.name(the_response)~i,list(i=as.name(a))),data=some_data,family=binomial)
+ })
+ }
Yet, when I attempt to run the function with the predetermined vectors, this occurs:
> test.func(df,dependent,independent)
Error in as.name(the_response) : object 'the_response' not found
My expected response would be the following:
models<-lapply(independent,function(x)
+ {
+ glm(substitute(outcome~i,list(i=as.name(x))),data=df,family=binomial)
+ })
> models
[[1]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
Coefficients:
(Intercept) personDom personFrancesca personMary
1.489e-16 -1.799e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = substitute(outcome ~ i, list(i = as.name(x))),
family = binomial, data = df)
**End Snippet**
As you can tell, using lapply and glm, I have created 3 simple models without all of the extra work doing it individually. You may be asking why create a user-defined function when you have simple code right there? I plan to run a while or repeat loop and it will decrease clutter.
Thank you for your assistance
I know code only answers are deprecated but I thought you were almost there and could just use the nudge to use the formula function (and to include 'the_response in the substitution):
test.func<-function(some_data,the_response,the_predictors)
{
lapply(the_predictors,function(a)
{print( form<- formula(substitute(resp~i,
list(resp=as.name(the_response), i=as.name(a)))))
glm(form, data=some_data,family=binomial)
})
}
Test:
> test.func(df,dependent,independent)
outcome ~ person
<environment: 0x7f91a1ba5588>
outcome ~ hand
<environment: 0x7f91a2b38098>
outcome ~ age
<environment: 0x7f91a3fad468>
[[1]]
Call: glm(formula = form, family = binomial, data = some_data)
Coefficients:
(Intercept) personDom personFrancesca personMary
8.996e-17 -1.540e-16 1.957e+01 -1.957e+01
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.545 AIC: 13.55
[[2]]
Call: glm(formula = form, family = binomial, data = some_data)
#snipped
This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Closed 5 years ago.
I am trying to run a binary logistic regression using For loops in R.
My code for the same is as follows:
mydata5<-read.table(file.choose(),header=T,sep=",")
colnames(mydata5)
Class <- 1:16
Countries <- 1:5
Months <- 1:7
DayDiff <- 1:28
mydata5$CT <- factor(mydata5$CT)
mydata5$CC <- factor(mydata5$CC)
mydata5$C <- factor(mydata5$C)
mydata5$DD <- factor(mydata5$DD)
mydata5$UM <- factor(mydata5$UM)
for(i in seq(along=Class))
{
mydata5$C=mydata5$C[i];
for(i2 in seq(along=Countries))
{
mydata5$CC=mydata5$CC[i2];
for(i3 in seq(along=Months))
{
mydata5$UM=mydata5$UM[i3];
for(i4 in seq(along=DayDiff))
{
mydata5$DD=mydata5$DD[i4];
lrfit5 <- glm(CT ~ C+CC+UM+DD, family = binomial(link = "logit"),data=mydata5)
summary(lrfit5)
library(lattice)
in_frame<-data.frame(C="mydata5$C[i]",CC="mydata5$CC[i2]",UM="mydata5$UM[i3]",DD="mydata5$DD[i4]")
predict(lrfit5,in_frame, type="response",se.fit=FALSE)
}
}
}
}
However, I'm getting the following error:
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Why is the error occurring? Also,the dataset "mydata5" has around 50000 rows.Please help.
Thanks in Advance.
You have tried to do a regression with a factor having only one level. Since you haven't given us your data we can't reproduce your analysis but I can simply reproduce your error message:
> d = data.frame(x=runif(10),y=factor("M",levels=c("M","F")))
> d
x y
1 0.07104688 M
2 0.11948466 M
3 0.20807068 M
4 0.24049508 M
5 0.44251492 M
6 0.69775646 M
7 0.44479983 M
8 0.64814971 M
9 0.75151207 M
10 0.38810621 M
> glm(x~y,data=d)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
By setting one of the factor values to "F" I don't get the error message:
> d$y[5]="F"
> glm(x~y,data=d)
Call: glm(formula = x ~ y, data = d)
Coefficients:
(Intercept) yF
0.39660 0.04591
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 0.5269
Residual Deviance: 0.525 AIC: 4.91
So somewhere in your loops (which we cannot run because we don't have your data) you are doing this.