Limma to Compare Bulk RNA-Seq Using makeContrasts and eBayes

After a day of googling, I've decided that it'd be better to ask the question here.
So the experiment is I have bulk RNA seq data from 3 patients: A, B, C.
And their RNA seq data is obtained for pre-treatment, treatment cycle 1, treatment cycle 2, treatment cycle 3.
So in total I have 12 samples of bulk RNA seq:
A.PreTreat -> A.Cycle1 -> A.Cycle2 -> A.Cycle3
B.PreTreat -> B.Cycle1 -> B.Cycle2 -> B.Cycle3
C.PreTreat -> C.Cycle1 -> C.Cycle2 -> C.Cycle3
I want to get differential gene lists between different cycles (e.g. cycle 3 vs. pre-treatment, cycle 3 vs. cycle 2) using model.matrix(), lmFit(), makeContrasts(), contrasts.fit(), and eBayes(), all of which are in the limma package.
Here is my minimal working example.
library(limma)
# Already normalized expression set: rows are genes, columns are the 12 samples
normalized_expression <- matrix(data = sample(1:100, 120, replace = TRUE), nrow = 10, ncol = 12)
colnames(normalized_expression) <- c("A.PreTreat", "A.Cycle1", "A.Cycle2", "A.Cycle3", "B.PreTreat", "B.Cycle1", "B.Cycle2", "B.Cycle3", "C.PreTreat", "C.Cycle1", "C.Cycle2", "C.Cycle3")
patient_and_treatment <- factor(colnames(normalized_expression), levels = colnames(normalized_expression))
design.matrix <- model.matrix(~0 + patient_and_treatment)
colnames(design.matrix) <- patient_and_treatment
fit <- lmFit(normalized_expression, design.matrix)
# I want to get a contrast matrix to get differential genes between cycle 3 treatment and pre-treatment in all patients
contrast.matrix <- makeContrasts("A.Cycle3+B.Cycle3+C.Cycle3-A.PreTreat-B.PreTreat-C.PreTreat",
levels = levels(patient_and_treatment))
# Outputs Error of no residual degree of freedom
fit2 <- eBayes( contrasts.fit( fit, contrast.matrix ) )
# Want to run but cannot
summary(decideTests(fit2))
So far I am stuck on the "no residual degrees of freedom" error.
I am not even sure whether this is the statistically correct way in limma to address my question of getting a differential gene list between cycle 3 treatment and pre-treatment across all patients.
Any help will be greatly appreciated.
Thanks!

You cannot have only one observation per group; that makes the regression meaningless, because you are fitting each data point to itself and no residual degrees of freedom are left.
Briefly, what you are looking for are effects common across all patients, say Cycle3 compared to PreTreat and so on. Set up the model like this:
library(limma)
metadata <- data.frame(
  Patient   = gsub("[.][^ ]*", "", colnames(normalized_expression)),
  Treatment = gsub("^[A-Z][.]*", "", colnames(normalized_expression))
)
metadata
Patient Treatment
1 A PreTreat
2 A Cycle1
3 A Cycle2
4 A Cycle3
5 B PreTreat
6 B Cycle1
7 B Cycle2
8 B Cycle3
9 C PreTreat
10 C Cycle1
11 C Cycle2
12 C Cycle3
Now specify the model matrix; the Patient term accounts for differences in baseline levels between patients:
design.matrix <- model.matrix(~0 + Treatment+Patient,data=metadata)
fit <- lmFit(normalized_expression, design.matrix)
contrast.matrix <- makeContrasts(TreatmentCycle3 - TreatmentPreTreat,
                                 TreatmentCycle1 - TreatmentPreTreat,
                                 levels = design.matrix)
fit2 = contrasts.fit(fit, contrast.matrix)
fit2 = eBayes(fit2)
You can check that the coefficients give you what you wanted:
fit2$coefficients
       Contrasts
        TreatmentCycle3 - TreatmentPreTreat TreatmentCycle1 - TreatmentPreTreat
  [1,]                           -3.666667                           -11.33333
  [2,]                          -13.666667                           -19.33333
  [3,]                            1.666667                           -27.33333
  [4,]                          -40.666667                           -42.33333
  [5,]                           12.000000                            27.33333
  [6,]                          -46.000000                           -32.66667
  [7,]                          -32.000000                           -33.00000
  [8,]                            4.666667                           -30.66667
  [9,]                           11.333333                            46.00000
 [10,]                            5.666667                            17.33333
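To get from here to the ranked gene lists the question asks for, the usual next step is topTable() per contrast plus decideTests() across contrasts. A minimal sketch, assuming the fit2 object above (coef = 1 and coef = 2 refer to the two contrasts in the order they were defined):
# Ranked genes for Cycle3 vs PreTreat (first contrast) and Cycle1 vs PreTreat (second)
top_c3_vs_pre <- topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")
top_c1_vs_pre <- topTable(fit2, coef = 2, number = Inf, adjust.method = "BH")
# Count up/down/non-significant genes per contrast (what the question tried to run)
summary(decideTests(fit2))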


Data Partition in Caret Package and Over-fitting

I was reading about the caret package and I saw this code:
createDataPartition(y, times = 1, p = 0.5, list = TRUE,
                    groups = min(5, length(y)))
I am wondering about the "times" argument. So, if I use this code:
inTrain2  <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
training2 <- MyData[ inTrain2, ]       # ≈ 67% (train)
testing2  <- MyData[-inTrain2[2], ]    # ≈ 33% (test)
Would it cause an overfitting problem, or is it used for some kind of (unbiased) resampling method?
Many thanks in advance.
Edit:
I would like to mention that if I use this code:
inTrain2  <- createDataPartition(y = MyData$Class, times = 1, p = .70, list = FALSE)
training2 <- MyData[ inTrain2, ]       # 142 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2, ]       # 69 samples,  ≈ 33% (test)
I get 211 samples in total and an accuracy of ≈ 52%. On the other hand, if I use this code:
inTrain2  <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
training2 <- MyData[ inTrain2, ]       # 426 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2[2], ]    # 210 samples, ≈ 33% (test)
I get 536 samples in total and an accuracy of ≈ 98%.
Thank you.
It is not clear why you bring overfitting into this question; times simply refers to how many different partitions you want (see the docs). Let's see an example with the iris data:
library(caret)
data(iris)
ind1 <- createDataPartition(iris$Species, times=1, list=FALSE)
ind2 <- createDataPartition(iris$Species, times=2, list=FALSE)
nrow(ind1)
# 75
nrow(ind2)
# 75
head(ind1)
Resample1
[1,] 1
[2,] 5
[3,] 7
[4,] 11
[5,] 12
[6,] 18
head(ind2)
Resample1 Resample2
[1,] 2 1
[2,] 3 4
[3,] 6 6
[4,] 7 9
[5,] 8 10
[6,] 11 11
Both index objects have 75 rows (since we used the default argument p=0.5, i.e. half the rows of the initial dataset). The columns of ind2 (different resamples) are independent of each other, and the proportions of the different iris$Species are preserved in each, e.g.:
length(which(iris$Species[ind2[,1]]=='setosa'))
# 25
length(which(iris$Species[ind2[,2]]=='setosa'))
# 25
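The inflated accuracy in the edit comes from stacking all three resample columns into one training index: the same rows then appear several times in the training set and can also leak into the test set. If you want several partitions, a more typical pattern is to loop over the columns of the index matrix, evaluate each split separately, and average. A sketch under that assumption (rpart is just a placeholder model):
library(caret)
data(iris)
ind <- createDataPartition(iris$Species, times = 3, p = 0.7, list = FALSE)
# One train/test evaluation per resample column, then average the accuracies
acc <- apply(ind, 2, function(idx) {
  tr  <- iris[idx, ]
  te  <- iris[-idx, ]
  fit <- train(Species ~ ., data = tr, method = "rpart")  # any caret model
  mean(predict(fit, te) == te$Species)
})
mean(acc)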

covariance structure for multilevel modelling

I have a multilevel repeated measures dataset of around 300 patients each with up to 10 repeated measures predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random slope, random intercept model where effects vary between patients, and the effect of time is different in different patients. When I try to introduce a first-order autoregressive covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
# Baseline model includes only the intercept; random intercept varies across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~1|record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt="optim"))
#random intercept and time as fixed effect
timeri <- update(randomintercept,.~. + day)
#random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day|record_id)
#model covariance structure. corAR1() first order autoregressive covariance structure, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
correlation = corAR1(0, form = ~day|record_id),
control=list(opt="nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
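A quick way to see both issues from the refit (a sketch, assuming the armodel object above):
# The summary prints the AR(1) parameter Phi (at the -1 boundary here) together
# with the random-effect standard deviations and their correlation (~0.998)
summary(armodel)
VarCorr(armodel)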
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions by one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot (plot(randomintercept)) looks like this:
whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
random = ~day|record_id,
correlation = corAR1(0, form = ~day|record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))
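To check formally whether the AR(1) term earns its keep, one could fit the same random-slopes model without the correlation structure and compare by likelihood ratio; a sketch, assuming the rlog and ar.rlog fits above (both inherit method = "ML"):
# Random slopes on the log scale, but without the within-patient correlation structure
rs.rlog <- update(rlog, random = ~day|record_id)
# Likelihood-ratio comparison of the nested models
anova(rs.rlog, ar.rlog)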

Remove inflections at end of lines from geom_line()

I am trying to plot the predictions of a lmer model with the following code:
p1 <- ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
geom_point(size=3) +
geom_line(data=newdat, aes(y=predict(SubjEff.model,newdata=newdat)),lineend="round")
print(p1)
I get weird inflections at the end of each line, is there a way to remove them? I have changed the data in newdat, but the lines always have these inflections.
Lines with Inflections at ends:
Note that you have geom_line(data=newdat, aes(y=predict(SubjEff.model, newdata=newdat))). So you've fed newdat to geom_line as the data frame to use for plotting, but then for your y-value you provide a separate vector of predictions (based on newdat), when y should actually be just a column of newdat. I'm not sure exactly why that causes the inflections at the ends (probably two different y-values end up being supplied for each endpoint x-value), but that's the likely source of your problem.
Instead, you should create a column in newdat with the predictions (if you haven't already) and feed that column name to ggplot as the y in geom_line. To add a column of predictions, do the following:
newdat$pred = predict(SubjEff.model,newdata=newdat)
You should also give geom_line the x values that correspond to the y values in newdat. So your code would be:
geom_line(data=newdat, aes(y=pred, x=Rspan), lineend="round")
(Where Rspan will (automatically) be the Rspan column in newdat.)
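Putting the two pieces together, the corrected call would look roughly like this (a sketch; the data, model, and column names are taken from the question):
library(ggplot2)
# Predictions stored as a proper column of newdat
newdat$pred <- predict(SubjEff.model, newdata = newdat)
p1 <- ggplot(Mac_Data_Tracking, aes(x = Rspan, y = SubjEff, colour = NsCond)) +
  geom_point(size = 3) +
  geom_line(data = newdat, aes(x = Rspan, y = pred), lineend = "round")
print(p1)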
It was a problem with having two y-values per x, actually... it came from having two Subject values.
The linear mixed model is:
Mixed.model <- lmer(Outcome ~ NsCond + Rspan + (1|Subject), data=Data)
For newdat, I was initially using:
newdat <- expand.grid(Subject=c(min(Data$Subject),max(Data$Subject)),Rspan=c(min(Data$Rspan), max(Data$Rspan)),NsCond=unique(Data$NsCond))
Which gave me:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 18 0.2916667 Pink
3 1 1.0000000 Pink
4 18 1.0000000 Pink
5 1 0.2916667 Babble
6 18 0.2916667 Babble
7 1 1.0000000 Babble
8 18 1.0000000 Babble
9 1 0.2916667 Loss
10 18 0.2916667 Loss
11 1 1.0000000 Loss
12 18 1.0000000 Loss
For each Rspan (x) there are 2 "Subjects" (1 and 18).
I changed newdat to:
newdat <- expand.grid(Subject=1,Rspan=c(min(Data$Rspan), max(Data$Rspan)),NsCond=unique(Data$NsCond))
Which results in:
Subject Rspan NsCond
1 1 0.2916667 Pink
2 1 1.0000000 Pink
3 1 0.2916667 Babble
4 1 1.0000000 Babble
5 1 0.2916667 Loss
6 1 1.0000000 Loss
Now it looks good
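As an aside (assuming a reasonably recent lme4), the dummy Subject level can be avoided entirely by asking predict() for population-level predictions, in which case newdat does not need a Subject column at all:
# Population-level predictions: re.form = NA tells predict() to ignore the random effects
newdat <- expand.grid(Rspan = c(min(Data$Rspan), max(Data$Rspan)), NsCond = unique(Data$NsCond))
newdat$pred <- predict(Mixed.model, newdata = newdat, re.form = NA)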

Transforming rows in a PCA context using dudi.pca

I have a huge matrix of genetic data (1e7 rows representing individuals x 5,000 columns representing markers) on which I would like to perform a PCA in order to keep c. 20 columns. However, due to memory issues, I cannot perform the PCA with either dudi.pca or big.PCA on R 3.1.2 on an 8 GB 64-bit machine.
An alternative is to compute approximate principal-axis coordinates on a row subset of the matrix and then project the whole matrix onto those approximate axes by linear combination.
I am facing a simple PCA-related problem using dudi.pca: how can I get the row coordinates using the original matrix and the matrix of column coordinates (= principal axes) ?
Here is a simple example, let's take a random matrix M (3 rows and 4 columns) such as:
M=
1 9 10 13
20 13 20 7
18 19 17 10
Doing dudi.pca(M, center=T, scale=T) and keeping only one PC, dudi.pca outputs the following $c1 matrix (column normed scores ie principal axes):
c1 =
-0.547
-0.395
-0.539
0.504
To compute the row coordinates of the data on the first principal axis, I thought of taking the inner product of each row of M with c1:
r =
-0.547*1 + -0.395*9 + -0.539*10 + 0.504*13
-0.547*20 + -0.395*13 + -0.539*20 + 0.504*7
-0.547*18 + -0.395*19 + -0.539*17 + 0.504*10
i.e.
r =
-2.944
-23.331
-21.481
But if I look up at the $li (row coordinates ie principal components) natively computed by dudi.pca on the same dataset, I read:
r' =
2.565
-1.559
-1.005
Am I doing something wrong when formulating the row coordinates using the dudi.pca $c1 matrix?
Many thanks for your help,
Quaerens.
Code :
> M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
> M
[,1] [,2] [,3] [,4]
[1,] 1 9 10 13
[2,] 20 13 20 7
[3,] 18 19 17 10
> N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
> N$c1
CS1
V1 -0.5468634
V2 -0.3955638
V3 -0.5389504
V4 0.5039863
> r=c( M[1,] %*% N$c1[,1], M[2,] %*% N$c1[,1], M[3,] %*% N$c1[,1] )
> r
[1] -2.94462 -23.33070 -21.48155
> N$li
Axis1
1 2.565165
2 -1.559546
3 -1.005619
If this is still of interest...
ade4 works on the duality diagram, so when p is greater than n the decomposition is carried out on the n x n symmetric matrix (W = X Q t(X) below) rather than on the p x p one.
library(ade4)
M=matrix(c(1,9,10,13,20,13,20,7,18,19,17,10), ncol=4, byrow=T)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 9 10 13
## [2,] 20 13 20 7
## [3,] 18 19 17 10
N=dudi.pca(M, center=T, scale=T, scannf=F, nf=1)
#dimensions of M
n=3
p=4
X=scalewt(M,center=T,scale=T)
#this could be done in two ways. Singular Value Decomposition or Duality Diagrams.
#Consider a singular value decomposition of X: X = U D V', where U holds the left singular vectors, V the right singular vectors, and D is the diagonal matrix of singular values
svd=svd(X)
#These are equivalent
N$c1
svd$v[,1]
#Equivalent
N$eig
## [1] 3.341175 0.658825
svd$d[1:2]
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(svd$d)
#N$lw gives the row weights
N$lw
#0.3333333 0.3333333 0.3333333
#find the inverse of the diagonal matrix of row weights; this is the normalization part
K=solve(sqrt(diag(N$lw,n)))%*%svd$u
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components
pc=K%*%sqrt(lambda)
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
This could also be done using the duality diagram implemented in ade4.
For references on the duality diagram as implemented in ade4, see: http://projecteuclid.org/euclid.aoas/1324399594
Q<-diag(p)
D<-diag(1/n, n)
rk<-qr(X)
rank=rk$rank
#Statistical Triplets
V<-t(X)%*%D%*%X
W<-X%*%Q%*%t(X)
#Compute the eigen values and vectors of the statistical triplet
example.eigen=eigen(W%*%D)
#Equivalent
N$eig
## [1] 3.341175 0.658825
example.eigen$values[1:rank]
## [1] 3.341175 0.658825
#Diagonal matrix of eigen values
lambda=diag(example.eigen$values[1:rank])
#find the inverse of the diagonal matrix of row weights; this is the normalizing part
Binv<-solve(sqrt(D))
K=Binv%*%example.eigen$vectors[,1:rank]
#These are equivalent
head(K[,1])
## [1] 1.4033490 -0.8531958 -0.5501532
head(N$l1)
## RS1
## 1 1.4033490
## 2 -0.8531958
## 3 -0.5501532
#Find Principal Components
pc=K%*%sqrt(lambda)
#These are equivalent
head(pc)
## [,1] [,2]
## [1,] 2.565165 -0.1420130
## [2,] -1.559546 -0.9154578
## [3,] -1.005619 1.0574707
head(N$li)
## Axis1
## 1 2.565165
## 2 -1.559546
## 3 -1.005619
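Coming back to the original question (getting row coordinates from the raw matrix and $c1): the missing step is that the rows must be centred and scaled exactly as dudi.pca did before taking the inner product. A sketch using the objects above; and, if memory serves, ade4's suprow() applies the stored transformation for you and returns supplementary row coordinates in $lisup:
# Centre/scale M the same way dudi.pca did, then project onto the axes in $c1
X <- scalewt(M, center = TRUE, scale = TRUE)
X %*% as.matrix(N$c1)
## matches N$li: 2.565165 -1.559546 -1.005619
# For rows not used in the fit (e.g. the full 1e7 x 5000 matrix, processed in chunks),
# suprow() does the centring/scaling internally
suprow(N, M)$lisup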

How to bootstrap respecting within-subject information?

This is the first time I post to this forum, and I want to say from the start I am not a skilled programmer. So please let me know if the question or code were unclear!
I am trying to get the 95% confidence interval (CI) for an interaction (that is my test statistic) by bootstrapping, using the package "boot". My problem is that for every resample, I would like the randomization to be done within subjects, so that observations from different subjects are not mixed. Here is the code to generate a data frame similar to mine. As you can see, I have two within-subjects factors ("Num" and "Gram"), and I am interested in the interaction between them:
Subject = rep(c("S1","S2","S3","S4"),4)
Num = rep(c("singular","plural"),8)
Gram = rep(c("gram","gram","ungram","ungram"),4)
RT = c(657,775,678,895,887,235,645,916,930,768,890,1016,590,978,450,920)
data = data.frame(Subject,Num,Gram,RT)
This is the code I used to get the empirical interaction value:
summary(lm(RT ~ Num*Gram, data=data))
As you can see, the interaction between my two factors is -348. I want to get a bootstrap confidence interval for this statistic, which I can generate using the "boot" package:
# You need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
# Function to compute the statistic to be bootstrapped
boot.huber <- function(data, indices) {
  data <- data[indices, ]              # select obs. in bootstrap sample
  mod <- lm(RT ~ Num*Gram, data=data)
  coefficients(mod)                    # return coefficient vector
}
#Generate bootstrap estimate
data.boot <- boot(data, boot.huber, 1999)
#Get confidence interval
boot.ci(data.boot, index=4, type=c("norm", "perc", "bca"),conf=0.95) #4 gets the CI for the interaction
My problem is that I think the resamples should be generated without mixing the individual subjects' observations: that is, to generate each new resample, the observations from subject 1 (S1) should be shuffled within subject 1, not mixed with the observations from subject 2, and so on. I don't know how "boot" does the resampling (I read the documentation but don't understand how the function does it).
Does anyone know how I could make sure that the resampling procedure used by "boot" respects subject level information?
Thanks a lot for your help/advice!
Just modify your call to boot() like this:
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
?boot provides this description of the strata= argument, which does exactly what you are asking for:
strata: An integer vector or factor specifying the strata for
multi-sample problems. This may be specified for any
simulation, but is ignored when ‘sim = "parametric"’. When
‘strata’ is supplied for a nonparametric bootstrap, the
simulations are done within the specified strata.
Additional note:
To confirm that it's working as you'd like, you can call debugonce(boot), run the call above, and step through the debugger until the object i (whose rows contain the indices used to resample rows of data to create each bootstrap resample) has been assigned, and then have a look at it.
debugonce(boot)
data.boot <- boot(data, boot.huber, 1999, strata=data$Subject)
# Browse[2]>
## [Press return 34 times]
# Browse[2]> head(i)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,] 9 10 11 16 9 14 15 16 9 2 15 16 1 10
# [2,] 9 14 7 12 5 6 15 4 13 6 11 16 13 6
# [3,] 5 10 15 16 9 6 3 4 1 2 15 12 5 6
# [4,] 5 10 11 4 9 6 15 16 9 14 11 16 5 2
# [5,] 5 10 3 4 1 10 15 16 9 6 3 8 13 14
# [6,] 13 10 3 12 5 10 3 4 5 14 7 16 5 14
# [,15] [,16]
# [1,] 7 8
# [2,] 11 16
# [3,] 3 16
# [4,] 3 8
# [5,] 7 8
# [6,] 7 12
(You can enter Q to leave the debugger at any time.)
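A lighter-weight check than stepping through the debugger (a sketch, assuming the toy data above with four rows per subject): wrap the statistic so it stops if any bootstrap sample fails to keep exactly four rows from each subject.
boot.check <- function(data, indices) {
  # With strata = data$Subject, resampling is done within each subject,
  # so every bootstrap sample must contain exactly 4 rows per subject
  stopifnot(all(table(data$Subject[indices]) == 4))
  coefficients(lm(RT ~ Num*Gram, data = data[indices, ]))
}
data.boot <- boot(data, boot.check, 199, strata = data$Subject)  # runs without error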
