Estimating mixed-level logistic regression coefficients without using the lme4 package in R

I have a two-level dataset of 37,000 instances, representing the choices of 199 subjects. I need to estimate logistic regression coefficients for each of the 199 individuals. I have done this manually 199 times by subsetting, but I would like to know whether there is a more efficient way to get the coefficients by looping, without using the lme4 package. I also need to store the coefficients as variables within each subject's data.
Here is my code.
### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)
### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]
### Logistic regression
glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, data = df1, family = binomial)
### Extracting the coefficients
reward_transition <- as.numeric(coef(glm1)[4])
reward <- as.numeric(coef(glm1)[2])
transition <- as.numeric(coef(glm1)[3])
omega <- reward_transition - reward
### Storing the coefficients as constant variables for this subject
df1$rewardmix <- reward
df1$omega <- omega
df1$transmix <- transition
df1$reward_transitionmix <- reward_transition

You can use the by() function from base R, whose short description is "Apply a Function to a Data Frame Split by Factors" (see help(by)).
Here is an example using your terminology for the data frame and the subject ID variable names:
# Make the simulated data reproducible
set.seed(1717)
# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')
# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind = ids,
                        y = rnorm(length(ids)),
                        x = rnorm(length(ids)))
head(df_merged2)
The top 6 rows of the data look like:
  sjind          y          x
1     A -1.4548934  1.1004932
2     B -1.7084245 -0.7731208
3     B  2.1004557 -1.6229203
4     A -1.0283021  0.4233806
5     A  0.4133888  1.2398577
6     B -1.4104637  0.3746706
Now use the by() function to fit a GLM model for each group defined by the sjind unique values:
glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
                  function(df) glm(y ~ x, data = df))
The output object glm_by_sjind is a list with the following properties:
It has as many elements as the number of unique values in sjind (in this case 3)
It is indexed by the unique values of the sjind variable (in this case "A", "B", "C")
Each element contains the regression output from glm() run on each split of the input data frame (where splits are clearly defined by the sjind unique values)
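A quick console check of these properties (a small sketch using only the glm_by_sjind object created above):
length(glm_by_sjind)          # 3, one element per unique sjind value
names(glm_by_sjind)           # "A" "B" "C"
class(glm_by_sjind[["A"]])    # "glm" "lm", i.e. a full glm() fit for subject A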
So for example, you can request the summary of the regression output for subject "B" as follows:
> summary(glm_by_sjind[["B"]])

Call:
glm(formula = y ~ x, data = df)

Deviance Residuals:
       2         3         6         7         8        12
-1.40226   1.59040  -0.00186   0.06400  -1.93118   1.68091

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0487     0.7472  -1.404    0.233
x            -0.9605     0.9170  -1.047    0.354

(Dispersion parameter for gaussian family taken to be 2.763681)

    Null deviance: 14.087  on 5  degrees of freedom
Residual deviance: 11.055  on 4  degrees of freedom
AIC: 26.694

Number of Fisher Scoring iterations: 2
If we go a little further, we can also perform a sanity check that each GLM model is based on the expected number of cases (i.e., the number of cases in each model should match the frequency distribution of the sjind variable in the input data frame).
freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_by_sjind, function(glm) NROW(glm$data) )
all.equal( freq_sjind_in_data, ncases_in_each_glm )
which returns TRUE.
Or also inspect that visually:
as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)
which return
A B C
1 3 6 4
in both cases.
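If, as in the original code, you also want the coefficients stored as variables within each subject's rows, one possible sketch is to collect the coefficients into a small data frame and merge it back by sjind (this assumes the glm_by_sjind and df_merged2 objects from above; the column names intercept_hat and slope_hat are just illustrative):
coefs <- t(sapply(glm_by_sjind, coef))               # one row of coefficients per subject
colnames(coefs) <- c("intercept_hat", "slope_hat")   # illustrative names, avoids clashing with column x
coefs_df <- data.frame(sjind = rownames(coefs), coefs, row.names = NULL)
df_with_coefs <- merge(df_merged2, coefs_df, by = "sjind")   # coefficients repeated within each subject
head(df_with_coefs)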

Related

In R can you manually set degrees of freedom for lm() or Anova()?

I am replicating SPSS code in R that runs several Type 3 ANOVAs. In SPSS you can specify specific contrasts in an ANOVA (e.g., compare level 2 v level 4 in this 5-level variable). The resulting ANOVA tables return a test where the degrees of freedom are equal to the full sample, rather than the sample that is just concentrated in those two levels.
In R, I use the command below to run an ANOVA comparing those two levels but the resulting Residuals DF is based on the subsample of only those two levels rather than the full sample. Is there a way I can manually set the DF in either the lm() or Anova() function to avoid this issue? Or is there a way to specify contrasts that uses the full sample DF?
Anova(lm(DV ~ FiveLevelFactor, data = data, type = 3, subset = FiveLevelFactor == "2" | FiveLevelFactor == "4"))
How about using the linearHypothesis() function from the car package:
library(car)
data(Ornstein)
mod <- lm(interlocks ~ log(assets) + sector + nation, data=Ornstein)
linearHypothesis(mod, "nationUK = nationUS")
# Linear hypothesis test
#
# Hypothesis:
# nationUK - nationUS = 0
#
# Model 1: restricted model
# Model 2: interlocks ~ log(assets) + sector + nation
#
#   Res.Df   RSS Df Sum of Sq      F Pr(>F)
# 1    235 29829
# 2    234 29690  1    138.36 1.0904 0.2975
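The same idea can be applied to the model from the question, fitted on the full sample so that the residual degrees of freedom come from all observations. This is only a sketch, assuming the data, DV and FiveLevelFactor objects from the question, default treatment contrasts, and factor levels literally named "2" and "4" (so the fitted coefficients are FiveLevelFactor2 and FiveLevelFactor4):
library(car)
mod_full <- lm(DV ~ FiveLevelFactor, data = data)   # fit on the full sample, all five levels
# test level "2" against level "4" without subsetting, so Res.Df reflects the full sample
linearHypothesis(mod_full, "FiveLevelFactor2 = FiveLevelFactor4")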

Get the number of observations in each factor level (as well as each interaction) from a fitted lm or glm [R]

I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous (in addition to the response variable, which is also obviously categorical/binary).
When calling summary(model_name), is there a way to include a column representing the number of observations within each factor level?
If all your covariates are factors (not including the intercept), this is fairly easy, as the model matrix only contains 0s and 1s and the number of 1s gives the number of occurrences of that factor level (or interaction level) in your data. So just do colSums(model.matrix(your_glm_model_object)).
Since a model matrix has column names, colSums will give you a vector with a "names" attribute that is consistent with the "names" field of coef(your_glm_model_object).
The same solution applies to a linear model (by lm) and a generalized linear model (by glm) for any distribution family.
Here is a quick example:
set.seed(0)
f1 <- sample(gl(2, 50)) ## a factor with 2 levels, each with 50 observations
f2 <- sample(gl(4, 25)) ## a factor with 4 levels, each with 25 observations
y <- rnorm(100)
fit <- glm(y ~ f1 * f2) ## or use `lm`, since the default `gaussian()` family is used here
colSums(model.matrix(fit))
#(Intercept)         f12         f22         f23         f24     f12:f22
#        100          50          25          25          25          12
#    f12:f23     f12:f24
#         12          14
Here, we have 100 observations / complete-cases (indicated under (Intercept)).
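Since the counts and the coefficients share the same names and ordering, you can also display them side by side; a small sketch using the fit object from above:
cbind(n = colSums(model.matrix(fit)), estimate = coef(fit))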
Is there a way to display the count for the baseline level of each factor?
Baseline levels are contrasted, so they don't appear in the model matrix used for fitting. However, we can generate the full model matrix (without contrasts) from your formula rather than from your fitted model (this also offers you a way to drop numeric variables if you have them in your model):
SET_CONTRAST <- list(f1 = contr.treatment(nlevels(f1), contrasts = FALSE),
                     f2 = contr.treatment(nlevels(f2), contrasts = FALSE))
X <- model.matrix(~ f1 * f2, contrasts.arg = SET_CONTRAST)
colSums(X)
#(Intercept)         f11         f12         f21         f22         f23
#        100          50          50          25          25          25
#        f24     f11:f21     f12:f21     f11:f22     f12:f22     f11:f23
#         25          13          12          13          12          13
#    f12:f23     f11:f24     f12:f24
#         12          11          14
Note that setting contrasts can quickly become tedious when you have many factor variables.
model.matrix is definitely not the only approach for this. The conventional way may be
table(f1)
table(f2)
table(f1, f2)
but this could get tedious too when your model becomes complicated.

incorrect logistic regression output

I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the medv column is more than 25 or not.
Below is my code for logistic regression.
# Apply the condition to medv and store the result in a new variable called high.medv
high.medv <- ifelse(Boston$medv > 25, "Y", "N")
ourBoston <- data.frame(Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of the data goes to the training set (assumes a logical split vector `sample` defined earlier)
train2 <- subset(ourBoston, sample == TRUE)
# the remaining 30% goes to the test set
test2 <- subset(ourBoston, sample == FALSE)
glm.fit <- glm(high.medv ~ lstat, data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1]  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -22.57   48196.14       0        1
lstat             NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.0000e+00  on 0  degrees of freedom
Residual deviance: 3.1675e-10  on 0  degrees of freedom
AIC: 2

Number of Fisher Scoring iterations: 21
I also need the following: I'm required to use the misclassification rate as the measure of error for two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you decide whether the result is positive or negative.
When you predict outcomes in the test data set, you estimate the probability of the response variable being 1 or 0. You therefore need to tell it where to cut, i.e. the threshold at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g. true positives and false positives, against each other, and then choose the trade-off that is best for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use misclassification error to evaluate your model, first you need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negative over overall negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # provides the miss classifications given 0.7 threshold
      0   1
  0  86  20
  1  64 164
The accuracy of your model can be computed as the ratio of the number of observations you classified correctly to the size of your sample.
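As a concrete sketch of that calculation with the confusion table above, the misclassification rate is the off-diagonal share:
conf_mat <- table(dat$y, dat$jj)                  # confusion table at the 0.7 threshold
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)   # proportion classified correctly
1 - accuracy                                      # misclassification rate: (20 + 64) / 334 here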

Meaning of coef in limma

tl;dr
why isn't fit <- eBayes(fit); topTable(fit, coef=4) the same as fit <- contrasts.fit(fit, c(-1,0,0,1)); fit <- eBayes(fit); topTable(fit) (column 1 of the design being the intercept)?
Example from the limma user's guide
Strain <- factor(targets$Strain, levels=c("WT","Mu"))
Treatment <- factor(targets$Treatment, levels=c("U","S"))
design <- model.matrix(~Strain+Strain:Treatment)
colnames(design)
[1] "(Intercept)" "StrainMu" "StrainWT:TreatmentS" "StrainMu:TreatmentS"
The first term in the model formula is an effect for Strain. This introduces an intercept column
to the design matrix, which estimates the average log-expression level for wild-type unstimulated
cells, and a column for Strain which estimates the mutant vs wildtype difference in the unstimulated
state. The second term in the model formula represents the interaction between stimulation and
strain. [...] It introduces a third and a fourth column to the design matrix which represent the
effect of stimulation for wild-type and for mutant mice respectively [...].
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit, coef=3)
# will find those genes responding to stimulation in wild-type mice, and
topTable(fit, coef=4)
# will find those genes responding to stimulation in mutant mice
What I don't understand
If using coef is the same as looking at the difference between the 4th column of the design matrix and the intercept (i.e. the contrast between the fourth and first column), wouldn't we need to look at the contrast between the fourth and second column to get the genes responding to stimulation in mutant mice?
Of course I compared the results when using coef and when using contrasts. They differ but I do not understand why... Obviously it means that coef=4 does not mean "look at the difference between column 4 and the intercept", but what does it mean then?
I hope that the question is understandable. Many thanks in advance!
The design matrix is based on
targets <- data.frame(
Strain = factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU")),
Treatment = factor(c("U", "S", "U", "S", "S"), levels = c("U", "S")))
design <- model.matrix(~ Strain + Strain:Treatment, data = targets)
> targets
##   Strain Treatment
## 1     WT         U
## 2     WT         S
## 3     MU         U
## 4     MU         S
## 5     MU         S
Each row of targets corresponds to an experimental sample. The design matrix looks like this:
##   (Intercept) StrainMU StrainWT:TreatmentS StrainMU:TreatmentS
## 1           1        0                   0                   0
## 2           1        0                   1                   0
## 3           1        1                   0                   0
## 4           1        1                   0                   1
## 5           1        1                   0                   1
Again, each row corresponds to an experimental sample. The columns of design correspond to coefficients that are fitted by limma and you can read off what combination of coefficients gives the model-fitted value for a given experimental group by comparing the rows of design with those of targets.
Looking at coef=4 effectively means that you're testing the null hypothesis that the fourth coefficient (that for StrainMu:TreatmentS) is zero - it isn't the same as comparing the value of the fourth coefficient against the value of the intercept coefficient.
Think in terms of the fitted value for each of the experimental classes.
For a given gene,
if I was a wildtype, unstimulated mouse (as in the first row of design or targets), my fitted value would be:
Intercept
if I was a wildtype, stimulated mouse (second row), my fitted value would be:
Intercept + StrainWT:TreatmentS
if I was a mutant, unstimulated mouse (third row), my fitted value would be:
Intercept + StrainMu
if I was a mutant, stimulated mouse (fourth and fifth rows), my fitted value would be:
Intercept + StrainMu + StrainMU:TreatmentS
So the difference between the stimulated and unstimulated group within the mutant strain is:
(Intercept + StrainMU + StrainMU:TreatmentS) - (Intercept + StrainMU)
= StrainMU:TreatmentS
... the coefficient corresponding to the 4th column in the design matrix
Hopefully that was helpful.
PS: using coef=4 should give you the same result as using the contrast c(0, 0, 0, 1).
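A short sketch of that equivalence, assuming fit is the output of lmFit(eset, design) from above (i.e. before eBayes is applied); the contrast c(0, 0, 0, 1) pulls out exactly the StrainMu:TreatmentS coefficient:
fit_coef <- eBayes(fit)
topTable(fit_coef, coef = 4)                   # tests StrainMu:TreatmentS == 0
fit_con <- contrasts.fit(fit, c(0, 0, 0, 1))   # the same single-coefficient contrast
fit_con <- eBayes(fit_con)
topTable(fit_con)                              # should match the coef = 4 table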

linear regression on equal-sized groups of rows in a data frame

I have a data-frame of 2 columns: y and x. The dimensions of the data-frame are 10000 rows and 2 columns. The 10000 rows refer to 500 samples, each with 20 y and 20 x.
How can I perform linear regression on each sample (each group of 20 rows) so that I can store the estimated coefficient in a separate 500-row data structure?
I know I can perform summary(lm(y ~ x))$coefficients[2, 1] to get the estimated coefficients for every row in the data-frame. However, my objective is the estimated coefficients for every sample, not every row.
You can use by() to perform regressions on different subsets if you supply a grouping index which identifies the subset to which each row belongs. First, some sample data:
N <- 10000
n <- 20
dd <- data.frame(x = runif(N))
dd <- transform(dd, y = 4 - 2*x + rnorm(N))
Now, to fit the model:
fits <- t(sapply(by(dd, rep(1:(N/n), each = n), function(x) lm(y ~ x, x)), coef))
head(fits)
#   (Intercept)          x
# 1    4.025626 -2.3476841
# 2    4.684731 -3.0566627
# 3    4.011690 -1.8731735
# 4    3.788382 -1.9182377
# 5    3.461123 -1.0965173
# 6    3.671282 -0.9247785
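If you only need the 500 slope estimates in a separate 500-row structure, as the question asks, a short follow-up using the fits matrix from above:
slopes <- data.frame(sample = seq_len(nrow(fits)), slope = fits[, "x"])
nrow(slopes)   # 500, one estimated slope per group of 20 rows
head(slopes)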
