Limma RNA-Seq analysis: using voom in R

I need to do RNA-Seq analysis with limma, and I already have normalized count data for 61810 transcripts in two conditions (no replicates), i.e. a 61810 x 2 matrix. My "design" model matrix is:
  (Intercept) sampletypestest
1           1               0
2           1               1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$sampletypes
[1] "contr.treatment"
When I use voom on the data, diff.exp <- voom(data, design), it gives the following error:
Error in approxfun(l, rule = 2) :
need at least two non-NA values to interpolate
Can anyone tell me what's the issue here?

voom (and limma more generally) requires replicates. The whole purpose of voom is to estimate the mean-variance relationship, and it would work if you had replicates in any of the groups. But you have no replicates at all, so no variances can be estimated, and the error is inevitable.

Related

Difference between 'psych::fa.parallel()' and 'paran::paran()' in Horn's parallel analysis

I am trying to conduct an exploratory factor analysis for data with dichotomous variables (csv Google link, n = 1000 originally but only 500 here) with a WLSMV estimator using the lavaan package, and to check whether the data sample is suitable for the analysis or not.
I tried using Horn's parallel analysis via psych::fa.parallel(df, fa="fa", fm="wls") and paran::paran(df, cfa=TRUE) in order to decide on the appropriate number of factors, but they gave different results.
## psych
Using eigendecomposition of correlation matrix.
[1] 3
## paran
Parallel analysis suggests that the number of factors = 2 and the number of components = NA
[1] 2
Why did this happen? Also, why does psych::fa.parallel() need a factoring method as an argument, while paran::paran() does not?

Correlation of categorical data to binomial response in R

I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or if I'm planning the right analysis.
Here's my data table (variables explained below):
species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-data.frame(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss) # data.frame rather than cbind, so the 0/1 columns stay numeric
data #check data table
data table explanation
I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one feeding type. The response variable I am interested in is "loss," indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss."
thoughts
I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...) since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I was thinking of trying to use a log-linear analysis, but examples I find don't quite have comparable data... and I'm happy for suggestions.
Any help or pointing in the right direction is much appreciated!
There are too few samples: you have 18 species with loss == 0 and only 4 with loss == 1. You will run into problems fitting a full logistic regression (i.e. one including all variables). I suggest testing for association between each feeding habit and loss using Fisher's exact test:
library(dplyr)
library(purrr)
library(tibble) # for add_column()

# function that runs the fisher test and collects the results
FISHER <- function(x, y){
  FT <- fisher.test(table(x, y))
  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}

# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")

# loop through and test the association between each variable and "loss"
results <- data[,FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)
You get the results for each feeding habit:
> results
       var      pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465    0.002943469       2.817560
2     dung 1.000000000 1.1582683    0.017827686      20.132849
3     pred 0.263157895 0.0000000    0.000000000       3.189217
4   nectar 0.535201640 0.0000000    0.000000000       5.503659
5    plant 0.002597403       Inf    2.780171314            Inf
6    blood 1.000000000 0.0000000    0.000000000      26.102285
7 mushroom 0.337662338 5.0498688    0.054241930     467.892765
The pvalue column is the p-value from fisher.test; broadly, an odds ratio > 1 means the variable is positively associated with loss. Of all your variables, plant shows the strongest association, which you can check:
> table(loss, plant)
    plant
loss  0  1
   0 18  0
   1  1  3
All three species with plant = 1 also have loss = 1. So with your current dataset, I think this is the best you can do. You should get a larger sample to see whether this still holds.

Calculate AUC for test set (keras model in R)

Is there a way (a function) to calculate the AUC value for a keras model in R on a test set?
I have searched on Google but nothing showed up.
From a Keras model, we can extract the predicted values as either class or probability, as follows:
Probability:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho
Generally it does not really matter which classifier (keras or not) did the prediction. All you need to estimate the AUC are two things: the predicted probabilities from some classifier and the actual categories (for example, dead "yes" vs. "no"). With these you can calculate both the true positive rate and the false positive rate, and hence draw a ROC curve and estimate the AUC. You can use
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
See the pROC package documentation for some more explanation.
I'm not sure this will answer your needs, as it depends on your data structure and the keras output format, but have a look at the dismo package's function evaluate. You need to set up something like this:
library(dismo)
# predictors: a RasterStack of the explanatory variables
# pres_test: a held-out subset of presence data, not used to fit the model
# backg_test: true or random (background) absence data
# model: your fitted model object
e <- evaluate(pres_test, backg_test, model, predictors)
e@auc
# you may bootstrap this step x times by randomly re-selecting
# 'pres_test' and 'backg_test' each time

How to use the predict() function in the R package "pscl" with categorical predictor variables

I'm fitting count data (number of fledgling birds produced per territory) using zero-inflated Poisson models in R, and while model fitting is working fine, I'm having trouble using the predict function to get estimates for multiple values of one category (Year) averaged over the values of another category (StudyArea). Both variables are dummy coded (0, 1) and are set up as factors. The data frame sent to the predict function looks like this:
  Year_d StudyArea_d
1      0         0.5
2      1         0.5
However, I get the error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
If instead I use a data frame such as:
  Year_d StudyArea_d
1      0           0
2      0           1
3      1           0
4      1           1
I get sensible estimates of fledgling counts per year and study site combination. However, I'm not really interested in the effect of study site (the effect is small and isn't involved in an interaction), and the year effect is really what the study was designed to examine.
I have previously used similar code to successfully get estimated counts from a model that had one categorical and one continuous predictor variable (averaging over the levels of the dummy-coded factor), using a data frame similar to:
  VegHeight StudyArea_d
1       0.0         0.5
2       0.5         0.5
3       1.0         0.5
4       1.5         0.5
So I'm a little confused why the first attempt I describe above doesn't work.
I can work on constructing a reproducible example if it would help, but I have a hunch that I'm not understanding something basic about how the predict function works when dealing with factors. If anyone can help me understand how to get estimates at both levels of one factor, averaged over the levels of another factor, I would really appreciate it.

Applying fixed effects factor in R breaks the regression

I am trying to run a fixed effects regression in R. When I run the linear model without the fixed effects factor, the model works just fine. But when I apply the factor, which is a numeric code for user ID, I get the following error:
Error in rep.int(c(1, numeric(n)), n - 1L) : cannot allocate vector of length 1055470143
I am not sure what the error means, but I fear it may be an issue with how the variable is coded in R.
I think this is more a statistical than a programming problem, for two reasons:
First, I am not sure whether you are using cross-sectional data or panel data. If you are using cross-sectional data, it doesn't make sense to control for 30000 individuals (of course, they will add to the variation).
Second, if you are using panel data, there are good packages, such as plm, that do this kind of computation.
Here is an example of what happens if you instead fit the factor directly with lm:
set.seed(42)
DF <- data.frame(x=rnorm(1e5),id=factor(sample(seq_len(1e3),1e5,TRUE)))
DF$y <- 100*DF$x + 5 + rnorm(1e5,sd=0.01) + as.numeric(DF$id)^2
fit <- lm(y~x+id,data=DF)
This needs almost 2.5 GB of RAM for the R session (adding the RAM needed by the OS, that is more than many PCs have available) and takes some time to finish. The result is pretty useless.
If you don't run into RAM limitations you can suffer from limitations of vector length (e.g., if you have even more factor levels), in particular if you use an older version of R.
What happens?
One of the first steps in lm is creating the design matrix using the function model.matrix. Here is a smaller example of what happens with factors:
model.matrix(b~a,data=data.frame(a=factor(1:5),b=2))
#   (Intercept) a2 a3 a4 a5
# 1           1  0  0  0  0
# 2           1  1  0  0  0
# 3           1  0  1  0  0
# 4           1  0  0  1  0
# 5           1  0  0  0  1
# attr(,"assign")
# [1] 0 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$a
# [1] "contr.treatment"
See how n factor levels result in n-1 dummy variables? If you have many factor levels and many observations, this matrix gets huge.
What should you do?
I'm quite sure you should use a mixed-effects model. There are two important packages that implement linear mixed-effects models in R: package nlme and the newer package lme4.
library(lme4)
fit.mixed <- lmer(y~x+(1|id),data=DF)
summary(fit.mixed)
Linear mixed model fit by REML
Formula: y ~ x + (1 | id)
   Data: DF
     AIC     BIC  logLik deviance REMLdev
 1025277 1025315 -512634  1025282 1025269
Random effects:
 Groups   Name        Variance   Std.Dev.
 id       (Intercept) 8.9057e+08 29842.472
 Residual             1.3875e+03    37.249
Number of obs: 100000, groups: id, 1000

Fixed effects:
            Estimate Std. Error t value
(Intercept) 3.338e+05  9.437e+02   353.8
x           1.000e+02  1.180e-01   847.3

Correlation of Fixed Effects:
  (Intr)
x 0.000
This needs very little RAM, calculates fast, and is a more correct model.
See how the random intercept accounts for most of the variance?
So you need to study mixed-effects models. There are some nice publications, e.g. Baayen, Davidson & Bates (2008), explaining how to use lme4.
