tl;dr
Why isn't fit <- eBayes(fit); topTable(fit, coef=4) the same as fit <- contrasts.fit(fit, c(-1,0,0,1)); fit <- eBayes(fit); topTable(fit) (column 1 of the design being the intercept)?
Example from the limma user's guide
Strain <- factor(targets$Strain, levels=c("WT","Mu"))
Treatment <- factor(targets$Treatment, levels=c("U","S"))
design <- model.matrix(~Strain+Strain:Treatment)
colnames(design)
[1] "(Intercept)" "StrainMu" "StrainWT:TreatmentS" "StrainMu:TreatmentS"
The first term in the model formula is an effect for Strain. This introduces an intercept column
to the design matrix, which estimates the average log-expression level for wild-type unstimulated
cells, and a column for Strain which estimates the mutant vs wildtype dierence in the unstimulated
state. The second term in the model formula represents the interaction between stimulation and
strain. [...] It introduces a third and a fourth column to the design matrix which represent the
effect of stimulation for wild-type and for mutant mice respectively [...].
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit, coef=3)
# will find those genes responding to stimulation in wild-type mice, and
topTable(fit, coef=4)
# will find those genes responding to stimulation in mutant mice
What I don't understand
If using coef is the same as looking at the difference between the 4th column of the design matrix and the intercept (i.e. the contrast between the fourth and first column), wouldn't we need to look at the contrast between the fourth and second column to get the genes responding to stimulation in mutant mice?
Of course I compared the results when using coef and when using contrasts. They differ but I do not understand why... Obviously it means that coef=4 does not mean "look at the difference between column 4 and the intercept", but what does it mean then?
I hope that the question is understandable. Many thanks in advance!
The design matrix is based on
targets <- data.frame(
Strain = factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU")),
Treatment = factor(c("U", "S", "U", "S", "S"), levels = c("U", "S")))
design <- model.matrix(~ Strain + Strain:Treatment, data = targets)
> targets
## Strain Treatment
## 1 WT U
## 2 WT S
## 3 MU U
## 4 MU S
## 5 MU S
Each row of targets corresponds to an experimental sample. The design matrix looks like this:
## (Intercept) StrainMU StrainWT:TreatmentS StrainMU:TreatmentS
## 1 1 0 0 0
## 2 1 0 1 0
## 3 1 1 0 0
## 4 1 1 0 1
## 5 1 1 0 1
Again, each row corresponds to an experimental sample. The columns of design correspond to the coefficients fitted by limma, and you can read off which combination of coefficients gives the model-fitted value for a given experimental group by comparing the rows of design with those of targets.
Looking at coef=4 effectively means that you're testing the null hypothesis that the fourth coefficient (the one for StrainMU:TreatmentS) is zero. It isn't the same as comparing the value of the fourth coefficient against the value of the intercept coefficient.
Think in terms of the fitted value for each of the experimental classes.
For a given gene,
if I was a wildtype, unstimulated mouse (as in the first row of design or targets), my fitted value would be:
Intercept
if I was a wildtype, stimulated mouse (second row), my fitted value would be:
Intercept + StrainWT:TreatmentS
if I was a mutant, unstimulated mouse (third row), my fitted value would be:
Intercept + StrainMU
if I was a mutant, stimulated mouse (fourth and fifth rows), my fitted value would be:
Intercept + StrainMU + StrainMU:TreatmentS
So the difference between the stimulated and unstimulated group within the mutant strain is:
(Intercept + StrainMU + StrainMU:TreatmentS) - (Intercept + StrainMU)
= StrainMU:TreatmentS
... the coefficient corresponding to the 4th column in the design matrix
Hopefully that was helpful
PS: using coef=4 should give you the same result as using the contrast c(0, 0, 0, 1).
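For example, here is a minimal sketch (reusing eset and design from the example above) contrasting the two approaches:
fit0 <- lmFit(eset, design)

fitA <- eBayes(fit0)
topTable(fitA, coef = 4)                    # tests StrainMu:TreatmentS == 0

fitB <- contrasts.fit(fit0, c(0, 0, 0, 1))  # the same null hypothesis
fitB <- eBayes(fitB)
topTable(fitB)

fitC <- contrasts.fit(fit0, c(-1, 0, 0, 1)) # a different null hypothesis:
fitC <- eBayes(fitC)                        # StrainMu:TreatmentS - (Intercept) == 0
topTable(fitC)
The first two calls should rank genes identically; the third tests whether the stimulation effect in mutant mice equals the wild-type unstimulated baseline level, which is not a meaningful comparison, and that is why your two results differed.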
Related
I am replicating SPSS code in R that runs several Type 3 ANOVAs. In SPSS you can specify specific contrasts in an ANOVA (e.g., compare level 2 v level 4 in this 5-level variable). The resulting ANOVA tables return a test where the degrees of freedom are equal to the full sample, rather than the sample that is just concentrated in those two levels.
In R, I use the command below to run an ANOVA comparing those two levels but the resulting Residuals DF is based on the subsample of only those two levels rather than the full sample. Is there a way I can manually set the DF in either the lm() or Anova() function to avoid this issue? Or is there a way to specify contrasts that uses the full sample DF?
Anova(lm(DV ~ FiveLevelFactor, data = data, subset = FiveLevelFactor == "2" | FiveLevelFactor == "4"), type = 3)
How about using the linearHypothesis() function from the car package:
library(car)
data(Ornstein)
mod <- lm(interlocks ~ log(assets) + sector + nation, data=Ornstein)
linearHypothesis(mod, "nationUK = nationUS")
# Linear hypothesis test
#
# Hypothesis:
# nationUK - nationUS = 0
#
# Model 1: restricted model
# Model 2: interlocks ~ log(assets) + sector + nation
#
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 235 29829
# 2 234 29690 1 138.36 1.0904 0.2975
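Applied to your own model, the same idea might look like the sketch below; the coefficient names (e.g. FiveLevelFactor2) are assumptions based on default treatment coding, so check them against names(coef(mod_full)) first:
library(car)
mod_full <- lm(DV ~ FiveLevelFactor, data = data)   # fit on the full sample
names(coef(mod_full))                               # confirm the dummy names
linearHypothesis(mod_full, "FiveLevelFactor2 = FiveLevelFactor4")
Because the model is fitted to all five levels, the residual degrees of freedom in the test reflect the full sample, which is what SPSS reports.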
I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:
gam_G1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(Locality, bs="re"), # "re" is short for "random effect"
data = ss, method = 'REML',
family="quasipoisson"
)
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
type="response",
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that "factor levels 0 not in original fit", but it executes the task without issues:
Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.
Troublesome models
The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="tp", m=1)
# "tp" is short for "thin plate [regression spline]"
+ s(Locality, bs="re"),
family="quasipoisson",
data = ss, method = 'REML'
)
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
type="response",
# exclude="c(Locality)",
# # this should turn random effect off
# # (doesn't work for me)
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
# wl Taxon Locality clr fit se
# 1 298.34 SpeciesB 0 red NA NA
# 2 305.82 SpeciesB 0 red NA NA
# 3 313.27 SpeciesB 0 red NA NA
# 4 320.72 SpeciesB 0 red NA NA
# 5 328.15 SpeciesB 0 red NA NA
# 6 335.57 SpeciesB 0 red NA NA
Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="fs", m=1),
# "fs" is short for "factor-smoother [interaction]"
family="quasipoisson",
data = ss, method = 'REML'
)
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.
I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect isn't a random intercept), and setting it to 0 creates a new level (or possibly an NA, since 0 isn't among the original levels).
If what you want to do is turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. The reason it's not working is that you are creating a character vector with a single element "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows it as s(Locality), so you would use exclude = "s(Locality)".
In your case it would be tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want to include in the prediction. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC"), or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", TaxonSpeciesC",
"s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
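Putting that together, a sketch of the exclude route might look like the code below. The grep pattern assumes that every Locality-related smooth label contains the string "Locality" (check this against smooths(gam_GI1) or summary(gam_GI1)), and Locality is kept at its original levels in the prediction data so that predict() can still build the (excluded) terms:
library(gratia)
sm <- smooths(gam_GI1)                                    # all smooth labels as mgcv knows them
loc_smooths <- grep("Locality", sm, value = TRUE, fixed = TRUE)
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])   # Locality stays at its real levels
pred <- predict(gam_GI1, newdata = nn, type = "response",
                se.fit = TRUE, exclude = loc_smooths)
nn$fit <- pred$fit
nn$se <- pred$se.fit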
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to the numeric value 1 when fitting and to 0 when you are predicting. As this is a numeric by variable, the smooth is multiplied by dummy, which is why setting it to 0 excludes the term. The reason your code isn't working is that you really want separate smooths over wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable created just to exclude a term from the model.
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It mean thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.
I ran a multiple regression model in R with predictor variables that have multiple levels. In the output, the summary shows results for every level of the two variables apart from one. I'm confused about why it isn't showing all of them.
For example, the model looks at an outcome (a score on a quiz) with the predictors drunk (yes - no) and experience (low - medium - high). In the output, it returns a coefficient row for every level of these factor variables apart from 'high experience'.
I am using the standard lm() structure, lm(score ~ 1 + drink + experience + controls, data = data).
Can anyone help with this? Thanks.
You probably have a statistical question, not necessarily a coding one.
When placing categorical variables in a linear model, what you actually do is create a dummy variable for each level that indicates whether or not an observation has that specific level. For example, a factor variable for gender would be turned into one dummy for male and another for female. For a model with 5 observations, the dummies might look like this:
male female
1 0
1 0
0 1
0 1
1 0
The problem is that, if you have an intercept in your model (and for it to be calculated we create a "variable" that is a column of 1's), the matrix of your variables will be "saturated" (as #Limey pointed out). The male and female dummies add up to the intercept column, so the columns aren't linearly independent, your X matrix isn't of full rank (X'X isn't invertible), and the coefficient estimates can't be calculated uniquely.
So to correct this you must remove either the intercept, the male dummy, or the female dummy. By default R keeps the intercept and drops the dummy for the first level of the factor (levels are ordered alphabetically by default), which is why 'high experience' disappeared from your output. We can remove any one of the three because:
If we remove the intercept, then we interpret the coefficients associated with the male and female dummies as "the intercept for the males" and "the intercept for the females". If instead we remove one of the dummies, let's say the male one, then the coefficient associated with the intercept is still the model's intercept (it now represents the reference level, the males), and the coefficient of the female dummy will be "the difference between the female and male intercepts". This latter interpretation is easier to see if you write down the model.
The point is that even though you can't estimate one coefficient for each level, you can still get the effects that you're looking for, as long as you interpret the numbers as explained in the paragraph above.
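If you specifically want to see a coefficient row for 'high experience', you can change which level R drops with relevel(). Here is a hedged sketch with made-up data; the variable and level names are taken from your question:
set.seed(1)
dat <- data.frame(
  score      = rnorm(12),
  drink      = factor(rep(c("yes", "no"), 6)),
  experience = factor(rep(c("low", "medium", "high"), 4))
)
coef(lm(score ~ drink + experience, data = dat))        # "high" is the (alphabetically first) reference and is dropped
dat$experience <- relevel(dat$experience, ref = "low")  # make "low" the reference instead
coef(lm(score ~ drink + experience, data = dat))        # now "low" is dropped and "high" gets a row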
If you have more statistical questions, use Cross Validated instead :).
In order for the regression coefficients to be uniquely identifiable the columns of the model matrix must be linearly independent. Suppose we had
y <- 1:4
x <- factor(c("a", "a", "b", "b"))
fm <- lm(y ~ x)
x has two levels but if there were a column for both they would look like this in the model matrix:
1 0
1 0
0 1
0 1
but in that case if we sum the two columns we get a vector of four 1's
c(1, 1, 0, 0) + c(0, 0, 1, 1)
## [1] 1 1 1 1
which is the same as the intercept column in the model matrix, so we can't have the intercept and two columns for x at the same time. We can either get rid of the intercept or use only one of the levels of x; since we have specified an intercept, R drops the "a" level.
We can use dummy.coef to get pseudo coefficients for all levels.
fm <- lm(y ~ x)
coef(fm)
## (Intercept) xb
## 1.5 2.0
dummy.coef(fm)
## Full coefficients are
##
## (Intercept): 1.5
## x: a b
## 0 2
Looking at the model matrix will often help understanding. The model matrix that R actually uses is the following 4x2 matrix where we see that it dropped the a level of x:
model.matrix(lm(y ~ x))
## (Intercept) xb
## 1 1 0
## 2 1 0
## 3 1 1
## 4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
or if we removed the intercept then it would use this 4x2 matrix:
model.matrix(lm(y ~ x - 1))
## xa xb
## 1 1 0
## 2 1 0
## 3 0 1
## 4 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
I have a 2-level dataset of 37,000 instances, which represents the choices of 199 subjects. I have to estimate logistic regression coefficients for each of the 199 individuals. I have done this manually 199 times by subsetting, but I want to know whether there is a more efficient way of getting the coefficients by looping, without using the lme4 package. Also, I need to store the coefficients as variables within each subject's data.
Here is my code.
### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)
### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]
### Logistic regression
glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, data = df1, family = binomial) # family = binomial for logistic regression
### Extracting the coefficients
reward_transition <- coef(glm1)[4]
reward <- coef(glm1)[2]
transition <- coef(glm1)[3]
reward<- as.numeric(reward)
reward_transition <- as.numeric(reward_transition)
transition <- as.numeric(transition)
omega <- reward_transition - reward
### Computing the constant coefficients as variables
df1$rewardmix <- 1
df1$rewardmix <- reward
df1$omega <- 1
df1$omega <- omega
df1$transmix <- 1
df1$transmix <- transition
df1$reward_transitionmix <- reward_transition
You can use the by() function from the base package, whose short description is "Apply a Function to a Data Frame Split by Factors" (ref: help(by))
Here is an example using your terminology for the data frame and the subject ID variable names:
# Make the simulated data reproducible
set.seed(1717)
# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')
# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind=ids,
y=rnorm(length(ids)),
x=rnorm(length(ids)))
head(df_merged2)
The top 6 rows of the data look like:
sjind y x
1 A -1.4548934 1.1004932
2 B -1.7084245 -0.7731208
3 B 2.1004557 -1.6229203
4 A -1.0283021 0.4233806
5 A 0.4133888 1.2398577
6 B -1.4104637 0.3746706
Now use the by() function to fit a GLM model for each group defined by the sjind unique values:
glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
function(df) glm(y ~ x, data=df))
The output object glm_by_sjind is a list with the following properties:
It has as many elements as the number of unique values in sjind (in this case 3)
It is indexed by the unique values of the sjind variable (in this case "A", "B", "C")
Each element contains the regression output from glm() run on each split of the input data frame (where splits are clearly defined by the sjind unique values)
So for example, you can request the summary of the regression output for subject "B" as follows:
> summary(glm_by_sjind[["B"]])
Call:
glm(formula = y ~ x, data = df)
Deviance Residuals:
2 3 6 7 8 12
-1.40226 1.59040 -0.00186 0.06400 -1.93118 1.68091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0487 0.7472 -1.404 0.233
x -0.9605 0.9170 -1.047 0.354
(Dispersion parameter for gaussian family taken to be 2.763681)
Null deviance: 14.087 on 5 degrees of freedom
Residual deviance: 11.055 on 4 degrees of freedom
AIC: 26.694
Number of Fisher Scoring iterations: 2
If we go a little further, we can also perform a sanity check that each GLM model is based on the expected number of cases (i.e. the number of cases in each model should be equal to the frequency distribution of the sjind variable in the input data frame).
freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_by_sjind, function(glm) NROW(glm$data) )
all.equal( freq_sjind_in_data,
ncases_in_each_glm )
which returns TRUE.
Or also inspect that visually:
as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)
which return
A B C
1 3 6 4
in both cases.
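To also address storing the coefficients as variables within each subject (the last part of the question), here is a hedged follow-up sketch that collects the per-subject coefficients from glm_by_sjind into a data frame and merges them back onto the original data; the column names are illustrative:
coefs_by_sjind <- do.call(rbind, lapply(names(glm_by_sjind), function(id) {
  cf <- coef(glm_by_sjind[[id]])
  data.frame(sjind = id, intercept = unname(cf[1]), slope_x = unname(cf[2]))
}))
df_with_coefs <- merge(df_merged2, coefs_by_sjind, by = "sjind")
head(df_with_coefs)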
I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
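As a quick numeric check (a sketch using m1 from above), the four coefficient sums listed earlier reproduce the fitted probabilities returned by predict():
b <- coef(m1)
plogis(c(b["(Intercept)"],                          # N, Other
         b["(Intercept)"] + b["PriorconvP"],        # P, Other
         b["(Intercept)"] + b["CrimeShoplifting"],  # N, Shoplifting
         sum(b)))                                   # P, Shoplifting
# roughly 0.871, 0.634, 0.947 and 0.727 -- the same four values as in the output above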
I agree with the interpretation provided by #eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime which is more compact.
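If, as in the original question, you want the interaction dummy to switch on when prior conviction is P and the crime is not shoplifting, a sketch of that parameterisation (building on the objects above) is:
table3 <- table1
table3$Priorconv <- relevel(table3$Priorconv, ref = "N")        # dummy = 1 for P
table3$Crime     <- relevel(table3$Crime, ref = "Shoplifting")  # dummy = 1 for Other Theft Acts
m3 <- glm(cbind(Yes, No) ~ Priorconv * Crime, data = table3, family = binomial)
summary(m3)                      # the interaction row is now PriorconvP:CrimeOther Theft Acts
predict(m3, type = "response")   # fitted probabilities are unchanged, as noted above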