loop for writing a multivariate binary logistic regression analysis - r

df <- data.frame(
disease = c(0,1,0,1),
var1 = c(0,1,2,0),
var2 =c(0,1,2,0),
var3 = c(0,1,2,0),
var40 = c(0,1,2,0),
Bi = c(0,1,0,1),
gender = c(1,0,1,0),
P1 = c(-0.040304832,0.006868288,0.002663759,0.020251087),
P2 = c(0.010566526,0.002663759,0.017480721,-0.008685749),
P3 = c(-0.008685749,0.020251087,-0.040304832,0.002663759),
P4 = c(0.017480721,0.024306667,0.002663759,0.010566526),
stringsAsFactors = FALSE)
The data frame above (df) contains categorical and numerical variables: disease, Bi and gender are coded 0/1, var1 to var40 are coded 0/1/2, and P1, P2, P3 and P4 are continuous. The glm call for a single variable is:
glm(disease ~ var1*Bi + gender + P1 + P2 + P3 + P4, family = binomial(link = 'logit'), data = df)
I need help writing a loop that automatically runs this multivariate regression of disease on each of variant 1 (var1) to variant 40 (var40), with the same covariates: Bi, gender, P1, P2, P3 and P4. I tried the loop below for all 40 variants, but it does not work:
for (i in df$var1:df$var40) {glm(DepVar1 ~ i*Bi+gender+P1+P2+P3+P4, data=df,
family=binomial("logit")) }

Building formulas dynamically can be a bit tricky, but functions like update() and reformulate() can help. For example:
results <- Map(function(i) {
newform <- update(disease ~ Bi+gender+P1+P2+P3+P4, reformulate(c(".", i)))
glm(newform, data=df, family=binomial("logit"))
}, names(subset(df, select=var1:var40)))
Here we use Map rather than a for loop so that it is easier to save the results (with this method they are collected into a list). We use update() to add the new variable of interest to the base formula. For example:
update(disease ~ Bi+gender+P1+P2+P3+P4, ~ . + var1)
# disease ~ Bi + gender + P1 + P2 + P3 + P4 + var1
This adds a variable to the right-hand side. We use reformulate() to turn the column name, given as a string, into a formula.
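As a small check of the two pieces, the sketch below prints what reformulate() returns for a column name and what update() then does with it. The formula y ~ a and the column name "b" are placeholders chosen for illustration, not names from the question.

```r
# Hypothetical names: y, a and b stand in for real columns
rhs <- reformulate(c(".", "b"))    # one-sided formula with "." and b on the right
new_formula <- update(y ~ a, rhs)  # "." is replaced by the old right-hand side

print(rhs)
print(new_formula)
```

No data is needed for this step, because formulas are only evaluated against a data frame when the model is fitted.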
You can get each model out of the list with
results$var1
results$var40
# etc
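Note that the formula above adds each variant as a main effect only, whereas the model in the question uses var1*Bi. If the interaction is wanted, one possible variation is sketched below. It runs on simulated data (the object sim, its size and its three variant columns are assumptions made so the models can actually be fitted) and collects the interaction estimate and p-value for each variant into one data frame.

```r
# Simulated stand-in for the question's data (assumption: 200 rows, 3 variants)
set.seed(1)
n <- 200
sim <- data.frame(
  disease = rbinom(n, 1, 0.5),
  var1 = sample(0:2, n, replace = TRUE),
  var2 = sample(0:2, n, replace = TRUE),
  var3 = sample(0:2, n, replace = TRUE),
  Bi = rbinom(n, 1, 0.5),
  gender = rbinom(n, 1, 0.5),
  P1 = rnorm(n), P2 = rnorm(n), P3 = rnorm(n), P4 = rnorm(n)
)

vars <- grep("^var[0-9]+$", names(sim), value = TRUE)
covars <- c("gender", "P1", "P2", "P3", "P4")

# Fit disease ~ varK * Bi + covariates for every variant column
fits <- lapply(setNames(vars, vars), function(v) {
  f <- reformulate(c(paste0(v, " * Bi"), covars), response = "disease")
  glm(f, data = sim, family = binomial("logit"))
})

# One row per variant: estimate and p-value of the varK:Bi interaction
summ <- do.call(rbind, lapply(vars, function(v) {
  co <- coef(summary(fits[[v]]))
  term <- paste0(v, ":Bi")
  data.frame(variant = v,
             estimate = co[term, "Estimate"],
             p_value = co[term, "Pr(>|z|)"])
}))
print(summ)
```

With the real data, replacing sim with df should be enough, provided the variant columns follow the var1 ... var40 naming pattern.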

Related

Graphing model results of longitudinal data in R

I am looking to create a graph of longitudinal data by age and sex, similar to the graph in this paper: https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(20)30258-9/fulltext.
To graph model results in the past, I have used both ggplot2 and ggpredict. I prefer ggpredict because it graphs the results accounting for covariates, but I am OK with graphing in ggplot2 if it can't be done in ggpredict.
I am providing a minimal reproducible example below, with id, wave (2 waves, separated by 6 years), age, sex, tst (total sleep time), and bmi for a covariate.
id<-rep(1:50, 2)
wave<-c(rep(1, 50),rep(2, 50))
tst<-c(sample(7:9,50, replace = T),sample(4:7,50, replace = T))
mydf<-data.frame(id,wave,tst)
mydf$age[mydf$wave==1]<-sample(40:90,50, replace = T)
mydf$age[mydf$wave==2]<-mydf$age[mydf$wave==1]+6
mydf$bmi<-sample(20:30,50, replace = T)
mydf$sex<-sample(1:2,50, replace = T)
mydf$age.cat<-cut(mydf$age[mydf$wave==1], breaks = 3,labels = c(1,2,3))
## Overall model (lmer() comes from the lme4 package) ##
library(lme4)
(model <- lmer(tst ~ wave + age + sex + bmi + (1 | id), data = mydf))
I tried to graph it with ggplot2 using the following syntax, however I'm not sure that the graph is exactly what I'm looking for. I would like to graph change in tst between waves 1 and 2, by age group and sex. TST would be on the y axis, age would be on the x axis, with separate lines for age group and sex, with standard errors. The lines will correspond to within-person change in TST between waves 1 and 2.
I think that the graph right now is showing the between subjects effects of age on tst, and not taking into account the fact that the data is nested within-person. Any help would be greatly appreciated.
ggplot(mydf,aes(x=age, y=tst, color=as.factor(sex), group=as.factor(age.cat), linetype=as.factor(age.cat)))+
geom_smooth(data=mydf[mydf$sex==1,], method = lm, formula = y~x)+
geom_smooth(data=mydf[mydf$sex==2,], method = lm, formula = y~x)+
geom_point() +
theme_bw()

R survey package: svyby + svymean: one vs many variables

Let's assume a data set mydata with the variables foo1..foo20, which are factors with the labels "Easy" and "Difficult". Now consider this code:
library(survey)
svd <- svydesign(ids = ~ 1, weights = ~ weight, data = mydata)
svyby(~ foo1, by = ~ group, svd, svymean)$foo1Difficult
svyby(~ foo1 + foo2 + foo3 + ... + foo20, by = ~ group, svd, svymean)$foo1Difficult
Are the results supposed to be identical? Is there a reason why the results could differ? Why does it make a difference whether I iterate over each variable or use all variables at once?
As @AnthonyDamico pointed out, the difference was caused by NAs.
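To see how NAs can make the two calls disagree, here is a base-R sketch of the general mechanism. It assumes that, when several variables are passed at once with NAs removed, a row with a missing value in any of them is dropped for all of them; the toy data and that assumption are illustrative only, so check the svymean documentation for the exact behaviour.

```r
# Toy data (hypothetical): foo2 has one missing value, foo1 has none
toy <- data.frame(foo1 = c(1, 0, 1, 0),
                  foo2 = c(1, NA, 0, 1))

# One variable at a time: every row with a non-missing foo1 is used
one_at_a_time <- mean(toy$foo1, na.rm = TRUE)

# All variables at once (listwise deletion): row 2 is dropped because foo2 is NA
complete_rows <- complete.cases(toy)
listwise <- mean(toy$foo1[complete_rows])

print(c(one_at_a_time = one_at_a_time, listwise = listwise))
```

The two means differ even though foo1 itself has no missing values, which is the kind of discrepancy described in the question.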

Loop multiple 'multiple linear regressions' in R

I have a database where I want to do several multiple regressions. They all look like this:
fit <- lm(Variable1 ~ Age + Speed + Gender + Mass, data=Data)
The only variable that changes is Variable1. I want to use a loop, or something from the apply family, to put several variables in the place of Variable1. These variables are columns in my data file. Can someone help me solve this problem? Many thanks!
what I tried so far:
When I extract one of the column names with the names() function, I do get the name of the column:
varname = as.name(names(Data[14]))
But when I fill this in (and I used the attach() function):
fit <- lm(Varname ~ Age + Speed + Gender + Mass, data=Data)
I get the following error:
Error in model.frame.default(formula = Varname ~ Age + Speed + Gender
+ : object is not a matrix
I suppose that the lm() function does not recognize Varname as Variable1.
You can use lapply to loop over your variables.
fit <- lapply(Data[,c(...)], function(x) lm(x ~ Age + Speed + Gender + Mass, data = Data))
This gives you a list of your results.
The c(...) should contain your variable names as strings. Alternatively, you can choose the variables by their position in Data, like Data[,1:5].
The problem in your case is that lm() reads the names in the formula literally: it looks for a column or object called Varname instead of using the column name stored in it. To use a column name held in a variable, you need to build the formula from that value, for example by pasting it together with the other terms:
# generate some data
set.seed(123)
Data <- data.frame(x = rnorm(30), y = rnorm(30),
Age = sample(0:90, 30), Speed = rnorm(30, 60, 10),
Gender = sample(c("W", "M"), 30, replace = TRUE), Mass = rnorm(30))
varnames <- names(Data)[1:2]
# fit regressions for multiple dependent variables
fit <- lapply(varnames,
FUN=function(x) lm(formula(paste(x, "~Age+Speed+Gender+Mass")), data=Data))
names(fit) <- varnames
fit
$x
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept)          Age        Speed      GenderW         Mass
   0.135423     0.010013    -0.010413     0.023480     0.006939
$y
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept)          Age        Speed      GenderW         Mass
   2.232269    -0.008035    -0.027147    -0.044456    -0.023895
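A variation on the same idea, sketched below, builds each formula with reformulate() instead of paste() and then gathers the coefficients of all fits into one matrix. The data are regenerated along the lines of the example above; the names responses, predictors and coef_table are introduced here only for illustration.

```r
# Regenerate example data (same structure as above)
set.seed(123)
Data <- data.frame(x = rnorm(30), y = rnorm(30),
                   Age = sample(0:90, 30), Speed = rnorm(30, 60, 10),
                   Gender = sample(c("W", "M"), 30, replace = TRUE),
                   Mass = rnorm(30))

responses <- c("x", "y")
predictors <- c("Age", "Speed", "Gender", "Mass")

# One lm() per response; setNames() makes the result a named list
fit <- lapply(setNames(responses, responses), function(v) {
  lm(reformulate(predictors, response = v), data = Data)
})

# Coefficients side by side: one column per response variable
coef_table <- sapply(fit, coef)
print(coef_table)
```

Because each model's call now contains the object built by reformulate() rather than a readable formula, use formula(fit$x) to see which formula a given fit used.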

Setting a formula for GLM as a sum of columns in R

I am trying to set the formula for a GLM so that it uses the whole set of columns in train, train$1:99, as predictors:
model <- glm(train$100 ~ train$1:99, data = train, family = "binomial")
I can't figure out the right way to do it in R...
If you need outcome ~ var1 + var2 + ... + varN, then try this:
# Name of the outcome column
f1 <- colnames(train)[100]
# Other columns separated by "+"
f2 <- paste(colnames(train)[1:99], collapse = "+")
# Fit the GLM with the formula assembled from the two strings
model <- glm(formula = as.formula(paste(f1, f2, sep = "~")),
data = train,
family = "binomial")
The simplest way, assuming that you want to use all but column 100 as predictor variables, is
model <- glm(v100 ~. , data = train, family = "binomial")
where v100 is the name of the 100th column (the name can't be 100 unless you have done something advanced/sneaky to subvert R's rules about data frame column names ...)
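The two approaches can be compared on a small simulated data set. In the sketch below, train is a made-up data frame with five predictors and a binary column called outcome (all names are assumptions for illustration), and reformulate() is used as an alternative to pasting the formula string by hand.

```r
# Simulated stand-in for train: predictors V1..V5 plus a binary outcome
set.seed(42)
train <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5))
train$outcome <- rbinom(200, 1, 0.5)

outcome <- names(train)[ncol(train)]
predictors <- setdiff(names(train), outcome)

# Explicit formula: outcome ~ V1 + V2 + V3 + V4 + V5
m1 <- glm(reformulate(predictors, response = outcome),
          data = train, family = "binomial")

# Dot shorthand: every column except the response is a predictor
m2 <- glm(outcome ~ ., data = train, family = "binomial")

print(all.equal(coef(m1), coef(m2)))

# data.frame() repairs non-syntactic names by default, so a column cannot
# simply be called 100
bad_name <- names(data.frame(`100` = 1:2))
print(bad_name)
```

The explicit form is the more flexible one when only a subset of columns should enter the model; the dot form is shorter when all remaining columns are wanted.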
