Multiple glm in for loop - r

I have an R dataframe, strongly simplified as:
id <- rep(1:2, c(6,8))
correct <- sample(0:1,14,TRUE)
phase <- c(rep("discr",3),rep("rev",3), rep("discr",4),rep("rev",4))
dat <- data.frame(id,correct,phase)
with id as my subjects (in reality I have a lot more than 2), correct = responses coded as incorrect (0) or correct (1), and the phases Discrimination and Reversal (within-subjects factor).
I want to perform a logistic regression in the form of
glm(correct~phase, dat, family="binomial")
later possibly adding additional predictors.
However, since I have a varying amount of data for each subject, I would like to run glm() separately for each subject and later compare the coefficients with an ANOVA for group effects.
I would like to do this in a for loop in the form of
for(i in seq_along(dat$id)){
  my_glm[i] <- glm(correct~list, dat[dat$id==i,], family="binomial")
}
but keep receiving the error message
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels.
I have checked my data and there is no factor which contains only one level. All subjects gave at least one incorrect and one correct response, and all took part in Discrimination and Reversal. The function works outside the loop when I specify a particular subject.

Here's a base R solution:
> lapply(split(dat, dat$id), function(x) coef(summary(glm(correct~phase,family="binomial",data=x))))
$`1`
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.931472e-01 1.224745 -5.659524e-01 0.5714261
phaserev -3.845925e-16 1.732050 -2.220446e-16 1.0000000
$`2`
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.356998e-16 1.000000 3.356998e-16 1.000000
phaserev 1.098612e+00 1.527524 7.192109e-01 0.472011

You are currently trying to fit a glm for each row of id: seq_along(dat$id) runs over the rows (1 to 14) rather than over the subject ids, so for i > 2 the subset dat[dat$id==i,] is empty, which is why the contrasts error about factors with fewer than 2 levels appears even though the full data are fine.
I think you want a glm for each id separately. Personally, I would go with something like:
library(plyr)
ddply(dat, .(id), function(x){
  # fit the per-subject model once, then pull both coefficients from it
  cf <- coef(summary(glm(correct~phase, family="binomial", data=x)))
  intercept <- cf[1]
  slope <- cf[2]
  c(intercept, slope)
})
# id V1 V2
#1 1 -0.6931472 1.386294e+00
#2 2 1.0986123 -6.345448e-16
# here V1 is the intercept and V2 is the phaserev slope estimate
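
If you prefer to keep the for-loop structure from the question, a minimal corrected sketch (my own suggestion, not from the original answers) loops over the unique ids rather than over every row and stores the fits in a list:
ids <- unique(dat$id)
my_glm <- vector("list", length(ids))   # pre-allocate one slot per subject
names(my_glm) <- ids
for (i in seq_along(ids)) {
  # fit the per-subject logistic regression on that subject's rows only
  my_glm[[i]] <- glm(correct ~ phase, data = dat[dat$id == ids[i], ], family = "binomial")
}
lapply(my_glm, function(fit) coef(summary(fit)))   # per-subject coefficient tables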

Related

Estimating mixed-level logistic regression coefficients without using the lme4 package in R

I have a 2-level dataset of 37000 instances, representing the choices of 199 subjects. I have to estimate logistic regression coefficients for each of the 199 individuals. I have done this manually 199 times by subsetting, but I want to know whether there is a more efficient way of getting the coefficients by looping, without using the lme4 package. I also need to store the coefficients as variables within each subject's data.
Here is my code.
### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)
### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]
### Logistic regression
glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, data = df1)
### Extracting the coefficients
reward_transition <- coef(glm1)[4]
reward <- coef(glm1)[2]
transition <- coef(glm1)[3]
reward<- as.numeric(reward)
reward_transition <- as.numeric(reward_transition)
transition <- as.numeric(transition)
omega <- reward_transition - reward
### Computing the constant coefficients as variables
df1$rewardmix <- 1
df1$rewardmix <- reward
df1$omega <- 1
df1$omega <- omega
df1$transmix <- 1
df1$transmix <- transition
df1$reward_transitionmix <- reward_transition
You can use the by() function from the base package, whose short description is "Apply a Function to a Data Frame Split by Factors" (ref: help(by))
Here is an example using your terminology for the data frame and the subject ID variable names:
# Make the simulated data reproducible
set.seed(1717)
# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')
# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind=ids,
                        y=rnorm(length(ids)),
                        x=rnorm(length(ids)))
head(df_merged2)
The top 6 rows of the data look like:
sjind y x
1 A -1.4548934 1.1004932
2 B -1.7084245 -0.7731208
3 B 2.1004557 -1.6229203
4 A -1.0283021 0.4233806
5 A 0.4133888 1.2398577
6 B -1.4104637 0.3746706
Now use the by() function to fit a GLM model for each group defined by the sjind unique values:
glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
                  function(df) glm(y ~ x, data=df))
The output object glm_by_sjind is a list with the following properties:
It has as many elements as the number of unique values in sjind (in this case 3)
It is indexed by the unique values of the sjind variable (in this case "A", "B", "C")
Each element contains the regression output from glm() run on each split of the input data frame (where splits are clearly defined by the sjind unique values)
So for example, you can request the summary of the regression output for subject "B" as follows:
> summary(glm_by_sjind[["B"]])
Call:
glm(formula = y ~ x, data = df)
Deviance Residuals:
2 3 6 7 8 12
-1.40226 1.59040 -0.00186 0.06400 -1.93118 1.68091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0487 0.7472 -1.404 0.233
x -0.9605 0.9170 -1.047 0.354
(Dispersion parameter for gaussian family taken to be 2.763681)
Null deviance: 14.087 on 5 degrees of freedom
Residual deviance: 11.055 on 4 degrees of freedom
AIC: 26.694
Number of Fisher Scoring iterations: 2
If we go a little further, we can also perform a sanity check that each GLM model is based on the expected number of cases (i.e. the number of cases in each model should be equal to the frequency distribution of the sjind variable in the input data frame).
freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_by_sjind, function(glm) NROW(glm$data) )
all.equal( freq_sjind_in_data,
           ncases_in_each_glm )
which returns TRUE.
Or also inspect that visually:
as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)
which return
A B C
1 3 6 4
in both cases.
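As a further step (my addition, a sketch rather than part of the original answer), if you also want the per-subject coefficients attached to each subject's rows, as asked in the question, you can collect them from the by() list and merge on sjind (the column names coef_intercept and coef_x are just names I made up here):
# one row of coefficients per subject, taken from the fitted per-subject models
coefs_by_sjind <- do.call(rbind, lapply(glm_by_sjind, coef))
coefs_df <- data.frame(sjind = rownames(coefs_by_sjind),
                       coef_intercept = coefs_by_sjind[, "(Intercept)"],
                       coef_x = coefs_by_sjind[, "x"])
# attach each subject's coefficients to every one of that subject's rows
df_with_coefs <- merge(df_merged2, coefs_df, by = "sjind")
head(df_with_coefs)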

How to ignore linearly correlated variables introduced by factor reference cell coding

Assume I have a dataset containing two categorical predictor variables (a,b) and a binary target (y) variable.
> df <- data.frame(
+   a = factor(c("cat1","cat2","cat3","cat1","cat2")),
+   b = factor(c("cat1","cat1","cat3","cat2","cat2")),
+   y = factor(c(T,F,T,F,T))
+ )
The following logical relations exist in the data:
if (a = cat3) then (b = cat3 and y = true)
else if (a = b) then (y = true) else y = false
I want to use glm to build a model for my dataset.
glm will automatically apply reference cell coding on my categorical variables a and b. It will also take care of finding the right number of codes for each factor variable, so that no alias variables are introduced (explained here).
However, it can happen, as for the dataset above, that a linear relationship exists between one reference code generated for variable a and one reference code of variable b.
See the output of my model:
> model <- glm(y ~ ., family=binomial(link='logit'), data=df)
> summary(model)
...
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.965e-16 1.732e+00 0.000 1.000
acat2 -2.396e-16 2.000e+00 0.000 1.000
acat3 1.857e+01 6.523e+03 0.003 0.998
bcat2 0.000e+00 2.000e+00 0.000 1.000
bcat3 NA NA NA NA # <- get rid of this?
How should I handle this case?
Is there a way to tell glm to omit some of the generated reference codes?
In the real problem my "cat3" value corresponds to NA. I have two meaningful factor variables which are NA in exactly the same instances of my dataset.
EDIT:
The accepted answer solves the question; however, in this specific case the singularities can simply be ignored, as pointed out in the comments.
The comments made under the question are pertinent, but it may still be useful to try eliminating the NA model matrix columns so that you can compare the result with not doing such elimination, in order to satisfy yourself regarding the equivalence.
In particular, you could run glm twice removing the redundant model matrix columns on the second run:
model <- glm(y ~ ., family=binomial(link='logit'), data=df) # as in question
mm <- model.matrix(model)[, !is.na(coef(model)) ]
df0 <- data.frame(y = df$y, mm[, -1])
update(model, data = df0)
giving:
Call: glm(formula = y ~ ., family = binomial(link = "logit"), data = df0)
Coefficients:
(Intercept) acat2 acat3 bcat2
1.965e-16 -2.396e-16 1.857e+01 0.000e+00
Degrees of Freedom: 4 Total (i.e. Null); 1 Residual
Null Deviance: 6.73
Residual Deviance: 5.545 AIC: 13.55
Note that if you don't want to use the fact that we know that the response is named y then we could extract the response and its name replacing the assignment to df0 above with:
df0 <- data.frame(model.response(model.frame(model)), mm[, -1])
names(df0)[1] <- as.character(attr(terms(model), "variables")[[2]])
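If you want to verify the equivalence mentioned above, a small sketch I am adding (assuming you assign the refit to an object, here called model2) compares the two fits directly:
model2 <- update(model, data = df0)                        # refit without the aliased column
all.equal(fitted(model), fitted(model2))                   # predictions should agree
# the non-NA coefficients should also match, up to numerical/convergence tolerance
# (this toy data set is separable, so acat3 is large and imprecise in both fits)
all.equal(coef(model)[!is.na(coef(model))], coef(model2))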

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],
                                                  theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
                   size=df$n, theta = df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond, data=df,
     family=betabinomialff(zero=2)
     ## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean (this doesn't give you the difference of level 3 from the mean, but by construction it should be (-cond1-cond2) ...)
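For example (a small sketch I am adding, with the contr.sum option from above still in effect and the fit stored in a hypothetical fit_vglm object), the level-3 deviation can be recovered directly from the sum-to-zero constraint:
fit_vglm <- vglm(cbind(k, n-k) ~ cond, data = df,
                 family = betabinomialff(zero = 2))
# sum-to-zero constraint: the level-3 ("C") deviation is minus the sum of the others
-(coef(fit_vglm)["cond1"] + coef(fit_vglm)["cond2"])
## approximately 0.011 given the coefficients printed above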
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
                  size=n, theta=exp(ltheta)),
     parameters=list(lprob~cond),
     data=df,
     start=list(lprob=0, ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
Effects must be estimated relative to some base level; an effect shared by all 3 conditions would be indistinguishable from the constant (intercept) in the regression.
Since the intercept is the expected value when the dummies for both estimated levels (i.e. "B" and "C") are 0, it is the mean value of the reference group (i.e. "A") only.
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus the comparison group's coefficient. The comparison groups' coefficients, as you know, therefore give you the effect of having the comparison group = 1 (bearing in mind that each level of your categorical variable is a dummy variable which = 1 when that level is present) relative to the reference group.
So your results give you the means and relative effects of each level. You can of course switch the reference level according to your preference.
That should hopefully give you all the information you need. If not, then you need to ask yourself precisely what information you're after.
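As a side note I am adding: if you prefer not to change the contrasts globally via options(), glm() and MASS::glm.nb() also accept a contrasts argument, so you can request sum-to-zero contrasts for a single fit (the negative-binomial model from the question is kept here purely to illustrate the mechanism):
library(MASS)
# sum contrasts for cond in this one model only: each cond coefficient is then
# the deviation of that level from the grand mean (on the link scale)
fit_sum <- glm.nb(k ~ cond + offset(log(n)), data = df,
                  contrasts = list(cond = contr.sum))
coef(fit_sum)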

Comparing nested mice models with interaction terms

R's mice contains a function, pool.compare, to compare nested models fit to imputed objects. If I try to include an interaction term:
library(mice)
imput = mice(nhanes2)
mi1 <- with(data=imput, expr=lm(bmi~age*hyp))
mi0 <- with(data=imput, expr=lm(bmi~age+hyp))
pc <- pool.compare(mi1, mi0, method="Wald")
then it returns the following error:
Error in pool(fit1) :
Different number of parameters: coef(fit): 6, vcov(fit): 5
It sounds like the variance-covariance matrix doesn't include the interaction term as its own variable. What's the best way around this?
The problem appears to be that some of your parameters are inestimable in some of your imputed data sets. When I run the code, I see
( fit1<-mi1$analyses[[1]] )
# lm(formula = bmi ~ age * hyp)
#
# Coefficients:
# (Intercept) age2 age3 hyp2 age2:hyp2
# 28.425 -5.425 -3.758 1.200 3.300
# age3:hyp2
# NA
In this set, it was not possible to estimate age3:hyp2 (presumably because there were no observations in this group).
This causes the discrepancy in coef(fit1) and vcov(fit1) since the covariance cannot be estimated for that term.
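To see how many of the imputed data sets are affected (a quick check I am adding, not part of the original answer), you can scan every analysis for NA coefficients:
# TRUE for each imputed analysis whose fit has at least one inestimable coefficient
sapply(mi1$analyses, function(fit) any(is.na(coef(fit))))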
What to do in this case is more of a statistical problem than a programming problem. If you are unsure of what would be appropriate for your data, I suggest you consult with the statisticians over at Cross Validated.

Determine or test significance for certain levels in a variable using R

In this dataset, I mainly have one dependent variable (Reaction Time) and two independent variables (Prime Type (5 levels) and Related (2 levels)). I want to test the significance of Related on Reaction Time (RT) for each Prime Type level; for example, how significantly the related (level 1) condition differs from the unrelated (level 2) condition for PrimeType 2. One of the basic ways I tried to do this is with t.test, but I was only able to run it as the following code (Related across all PrimeTypes):
t.test(RT~Related, alternative='two.sided', conf.level=.95, var.equal=FALSE,
       data=mydata)
How should I write the code for R to only look at PrimeType 2 with regard to the Related variable?
@BenBolker -- Thank you for your comment. How should the following output be interpreted? Are the Prime Types compared to an intercept, and what is it?
> lm3 <- lmList(RT~Related|PrimeType, mydata)
> summary(lm3)
Call:
Coefficients:
(Intercept)
Estimate Std. Error t value Pr(>|t|)
1 690.2596 6.255575 110.34311 0
2 677.0929 6.673277 101.46332 0
....
Related1
Estimate Std. Error t value Pr(>|t|)
1 -13.307825 9.146849 -1.4549081 1.457611e-01
2 1.408043 9.483289 0.1484763 8.819733e-01
3 -7.737740 9.518978 -0.8128750 4.163306e-01
....
You can create a subset of your data
subset(mydata, PrimeType == 2)
for the t.test function:
t.test(RT ~ Related, alternative='two.sided', conf.level=.95, var.equal=FALSE,
       data = subset(mydata, PrimeType == 2))
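If you want the same comparison for every PrimeType level in one go, a small sketch (my addition, assuming the data split cleanly by PrimeType) applies the t.test to each subset:
# one Welch t-test of RT ~ Related per PrimeType level
tests_by_primetype <- lapply(split(mydata, mydata$PrimeType),
                             function(d) t.test(RT ~ Related, data = d))
tests_by_primetype[["2"]]   # the result for PrimeType 2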
