All NaNs in RMA normalization of GSE31312 using Brainarray custom CDFs - r

I'm trying to RMA normalize a particular gene expression dataset concerning diffuse large B-cell lymphoma using custom gene-level annotation CDF (chip definition file) files from Brainarray.
Unfortunately, the RMA normalized expression matrix is all NaNs, and I don't understand why.
The dataset (GSE31312) is freely available at the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I'm using the affy package to perform the RMA normalization.
Since the problem is specific to the dataset, the following R code to reproduce the problem is unfortunately quite cumbersome (2 GB download, 8.8 GB unpacked).
Set the working directory.
setwd("~/Desktop/GEO")
Load the needed packages. Uncomment to install the packages.
#source("http://bioconductor.org/biocLite.R")
#biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
library("GEOquery") # To automatically download the data
library("affy")
library("R.utils") # For file handling
Download the array data to the work dir.
files <- getGEOSuppFiles("GSE31312")
Untar the data in a dir called CEL
#Sys.setenv(TAR = '/usr/bin/tar') # On (some) OS X systems, uncomment this line
untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")
Unzip all the .gz files
gz.files <- list.files("CEL", pattern = "\\.gz$",
ignore.case = TRUE, full.names = TRUE)
for (file in gz.files)
gunzip(file, skip = TRUE, remove = TRUE)
cel.files <- list.files("CEL", pattern = "\\.cel$",
ignore.case = TRUE, full.names = TRUE)
Download, install, and load the custom Brainarray Ensembl ENSG gene annotation package
download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/",
"Database/CustomCDF/18.0.0/ensg.download/",
"hgu133plus2hsensgcdf_18.0.0.tar.gz"),
destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz")
install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz",
repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)
Perform the RMA normalization with and without the custom CDF.
affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
As can be seen, the function returns, without warning, an expression matrix exprs(ensg.rma) in which all entries are NaN.
sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482
Interestingly, there are quite a few NaNs in exprs(affy.rma) when using the standard CDF.
# Show some NaNs
na.rows <- apply(is.na(exprs(affy.rma)), 1, any)
exprs(affy.rma)[which(na.rows)[1:10], 1:4]
# A particular bad probeset:
exprs(affy.rma)["1553575_at", ]
# There are relatively few NaNs in total (but there really should be none)
sum(is.na(exprs(affy.rma))) # == 12305
# Probesets with all NaNs
sum(apply(is.na(exprs(affy.rma)), 1, all))
There really should be no NaNs at all. I've tried using the expresso function to perform background correction only (with no normalization and summarization), which also yields NaNs. So the problem appears to stem from the background correction. However, one might worry that one or more bad arrays are the cause. Can anybody help me track down the origin of the NaNs and get some useful expression values?
Thanks, and best regards
Anders
Edit: A single file appears to be the issue (but not quite)
I decided to check what happens if each .CEL file is normalized independently. I'm not sure what actually happens under the RMA hood when justRMA is given a single array, but I imagine that the quantile normalization step becomes the identity function and the summarization (median polish) simply stops after the first iteration.
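The guess about quantile normalization can be checked on a toy example. The sketch below is a simplified base-R stand-in for quantile normalization, not the implementation used by affy; the helper name quantile_normalize is hypothetical. With a single column, the reference distribution is just that column's own sorted values, so every value maps back to itself.

```r
# Simplified quantile normalization (hypothetical helper, not affy's code):
# every column is forced onto the mean of the column-wise sorted values.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(m, 2, sort)                         # sorted intensities per array
  target <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks onto the reference
  dimnames(out) <- dimnames(m)
  out
}

set.seed(1)

# One array: normalization should leave the values unchanged
single <- matrix(rexp(6), ncol = 1, dimnames = list(NULL, "array1"))
single_qn <- quantile_normalize(single)
identical_single <- isTRUE(all.equal(single, single_qn))

# Two arrays: after normalization both share the same sorted values
multi <- matrix(rexp(12), ncol = 2,
                dimnames = list(NULL, c("array1", "array2")))
multi_qn <- quantile_normalize(multi)
same_distribution <- isTRUE(all.equal(sort(multi_qn[, 1]), sort(multi_qn[, 2])))

print(identical_single)
print(same_distribution)
```

This says nothing about how median polish behaves on a single array; it only illustrates the normalization step.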
Anyway, to perform the one-by-one RMA normalization we run:
ensg <- vector("list", length(cel.files)) # Initialize a list
names(ensg) <- basename(cel.files)
for (file in cel.files) {
ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
cat("File", which(file == cel.files), "done.\n\n")
}
# Extract the expression matrices for each file and combine them
X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))
By looking at head(X) it appears that GSM776462.CEL is all NaNs. Indeed, it almost is:
sum(!is.nan(X$GSM776462.CEL)) # == 18
Only 18 of 20009 are not NaN. Next, I count the number of NaNs appearing elsewhere
sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0
which is 0. Hence GSM776462.CEL appears to be the culprit.
But the regular CDF annotation does not give any problems:
affy <- justRMA(filenames = "CEL/GSM776462.CEL")
any(is.nan(exprs(affy))) # == FALSE
It is also weird that using the regular CDF gives seemingly randomly scattered NaNs in the expression matrix. I still don't quite know what to make of this.
Edit2: NaNs vanish when excluding sample but not with standard CDF
For what it is worth, when I exclude the GSM776462.CEL file and RMA normalize with and without the custom CDF files the NaNs only disappear in the custom CDF case.
# Trying with all other CEL than GSM776462.CEL
cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
affy.rma.no.776462 <- justRMA(filenames = cel.files2)
ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
sum(is.na(exprs(affy.rma.no.776462))) # == 12275
sum(is.na(exprs(ensg.rma.no.776462))) # == 0
Puzzling.
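One mechanism that would be consistent with the custom CDF result, offered as a guess rather than a diagnosis: quantile normalization averages the sorted intensities across arrays, so a single array that is NaN after background correction can contaminate the reference distribution used for every array. The toy example below uses made-up numbers and a simplified averaging step (the helper reference_distribution is hypothetical), not affy's code, and it does not explain the scattered NaNs seen with the standard CDF.

```r
set.seed(1)

# Toy matrix: 5 probes x 3 arrays, where the array "bad" is entirely NaN
m <- cbind(ok1 = rexp(5), ok2 = rexp(5), bad = rep(NaN, 5))

# Reference distribution: mean of the sorted values across arrays
# (na.last = TRUE keeps the NaNs instead of silently dropping them)
reference_distribution <- function(m) {
  rowMeans(apply(m, 2, sort, na.last = TRUE))
}

ref_with_bad    <- reference_distribution(m)
ref_without_bad <- reference_distribution(m[, c("ok1", "ok2")])

# Number of missing reference values with and without the bad array
print(sum(is.na(ref_with_bad)))
print(sum(is.na(ref_without_bad)))
```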
Edit3: No NAs or NaNs in "raw data"
For what it is worth, I've tried to read in the raw probe-level expression values to check if they contain NAs or NaNs. The following goes through the .CEL-files one-by-one and checks for any missing values.
for (file in cel.files) {
affybtch <- suppressWarnings(read.affybatch(filenames = file))
tmp <- exprs(affybtch)
cat(file, "done.\n")
if (any(is.na(tmp)))
stop(paste("NAs or NaNs are present in", file))
}
No NAs or NaNs are found.
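Note that is.na() only flags NA and NaN. A complementary check, sketched here on a toy matrix, also counts infinite, zero, and negative intensities per array, since such values can become NaN or -Inf once logarithms are taken. Whether any of these occur in GSE31312 is not established; summarize_suspect is a hypothetical helper name, and in practice its input would be the matrix returned by exprs(affybtch).

```r
# Count potentially problematic intensities per array (column).
# Hypothetical helper for illustration; works on any numeric matrix.
summarize_suspect <- function(m) {
  data.frame(
    array         = colnames(m),
    n_missing     = colSums(is.na(m)),            # NA or NaN
    n_infinite    = colSums(is.infinite(m)),      # Inf or -Inf
    n_nonpositive = colSums(!is.na(m) & m <= 0),  # zero or negative
    row.names     = NULL
  )
}

# Made-up intensities: one zero, one Inf, one negative value
toy <- cbind(a = c(120, 85, 0, 300), b = c(95, Inf, 210, -4))
suspect <- summarize_suspect(toy)
print(suspect)
```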

Related

Is there a way to store all factors selected from running the same Stepwise regression on each of N datasets using lapply(csvs, FUN(i) { step() })?

My folder containing the N datasets as csv files is called sample_obs. The goal is to end up with two lists. The first, which I have already figured out how to obtain, is a list of the names of the individual csv files as they appear in the folder (the file names, not the file paths).
The second list, which I still need to create, should contain only the factors/independent variables chosen by my Backward Elimination Stepwise Regression function for each dataset: no R-squared, Cp, AIC, BIC, or other standard regression diagnostics, and no coefficient estimates either, just the regressors "chosen" for each dataset out of the 30 candidate regressors. This is all the code I have written.
So far, in terms of my code that actually runs (except the last few lines):
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the names of each dataset
DS_name_list = stri_sub(file_list, 49, 55)
head(DS_name_list, n = 3)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Step 3: Run a Backward Elimination Stepwise Regression function on each of the N csvs.
# Assign the full models (meaning one with all 30 candidate regressors included in step 1)
# as the initial model that BE starts out with.
# This is crucial because if the initial model had fewer than the number of candidate factors
# in the datasets, e.g. 25 (so, X1:X25), then it could miss 1 or more of the factors
# X26:X30 which ought to be 'chosen' in dataset j by Stepwise j.
full_model <- lapply(csvs, function(i) {
lm(formula = Y ~ ., data = i) })
Finally, this is the part where I get really tripped up. I have tried at least 6 different sets of arguments, different syntax, using different objects, etc. when running my BE Stepwise Regression on my N datasets, but I'll just include 2 of them below which take entirely different approaches but are both wrong:
# attempt 1
set.seed(50) # for reproducibility
BE_fits1 <- map(.x = full_model[-1], .f = function(i) { step(object = all_IVs_models2, direction = 'backward', scope = formula(full_model), trace = 0) })
# attempt 3
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {
step(object = i[["coefficients"]], direction = 'backward',
scope = formula(full_model), trace = 0)
When I hit Ctrl+Enter on attempt 1, I get the following error message:
Error in x$terms %||% attr(x, "terms") %||% stop("no terms component nor attribute") :
no terms component nor attribute
And when I try to run my code for attempt #3, I get the following different error message:
Error in x$terms : $ operator is invalid for atomic vectors
I don't recognize either of these error messages.
P.S. If anyone looking over this question would like, I can re-ask it with far fewer minute details.
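Both error messages are consistent with step() receiving something that is not a single fitted model: attempt 3 passes the coefficient vector i[["coefficients"]], and both attempts apply formula() to the whole full_model list. A minimal sketch of one way around this, assuming each dataset has a response column named Y, is to fit the full model and run step() inside the same function, then pull the retained term labels out of each result. The data below are synthetic stand-ins for the csv files, and make_df is a hypothetical helper.

```r
set.seed(50)

# Synthetic stand-in for one csv file: 5 candidate regressors, of which
# only X1 and X2 actually drive Y (an assumption made for illustration).
make_df <- function(n = 100) {
  d <- as.data.frame(matrix(rnorm(n * 5), ncol = 5,
                            dimnames = list(NULL, paste0("X", 1:5))))
  d$Y <- 2 * d$X1 - 3 * d$X2 + rnorm(n)
  d
}

# In the real workflow this would be lapply(file_list, read.csv)
csvs <- list(make_df(), make_df())
names(csvs) <- c("0-5-1-1", "0-5-1-2")

# Fit the full model and run backward elimination within one function,
# so that step() can find the data frame `d` when it refits the model.
BE_fits <- lapply(csvs, function(d) {
  full <- lm(Y ~ ., data = d)
  step(full, direction = "backward", trace = 0)
})

# Keep only the names of the regressors retained for each dataset
IVs_selected <- lapply(BE_fits, function(fit) attr(terms(fit), "term.labels"))
print(IVs_selected)
```

Because the input list is named, IVs_selected carries the dataset names along, which gives the pairing of dataset name and chosen regressors in a single object.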

R code runs when required parameter is not specified

I am assisting a colleague with adding functionality to one of his R packages.
I have implemented nonparametric bootstrapping using a for loop construct in R.
# perform resampling
# resample `subsample_size` values with or without replacement replicate_size times
for (i in 1:replicate_size) {
if (replacement == TRUE) { # bootstrapping
z <- sample(x, size = subsample_size, replace = TRUE)
zz <- sample(x, size = subsample_size, replace = TRUE)
} else { # subsampling
z <- sample(x, size = subsample_size, replace = FALSE)
zz <- sample(x, size = subsample_size, replace = FALSE)
}
# calculate statistic
boot_samples[i] <- min(zz) - max(z)
}
The above loop is nested within another for loop, which itself is nested within a function (details not shown). The code I'm dealing with is messy, and there are most certainly more efficient ways of coding things up, but I've had to leave it be since my colleague is only familiar with very basic and rudimentary coding constructs.
Upon running said function, I specified all required arguments (replicate_size, replacement) except subsample_size. subsample_size is needed to carry out the resampling. This mistake on my part was revealing because, for some strange reason, the code still runs without throwing an error regarding missing a value for subsample_size.
Question: Does anyone have any idea on why this happens?
I'd include more code, but it is very verbose and unwieldy (his code, not mine). Running the for loop outside the function does indeed raise the error regarding the missing value as expected.
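A likely explanation, assuming the function passes subsample_size straight through to sample(): R evaluates arguments lazily, and missing() reports TRUE for an argument that is itself an unsupplied argument of the calling function. sample() checks missing(size) and falls back to length(x), so no error is raised. Outside a function, subsample_size is just an undefined object, which is why the error appears there. The function names below are hypothetical.

```r
# Missingness is passed along from the caller to the callee
inner   <- function(b) missing(b)
wrapper <- function(a) inner(a)
passes_missing <- wrapper()   # called without `a`

# The same thing happens with sample(): `size` is treated as missing,
# so sample() silently falls back to size = length(x)
resample <- function(x, subsample_size) {
  sample(x, size = subsample_size, replace = TRUE)
}
set.seed(1)
drawn <- resample(1:10)       # no error although subsample_size was omitted

# An explicit guard turns the silent fallback into an error
resample_strict <- function(x, subsample_size) {
  if (missing(subsample_size)) stop("'subsample_size' must be supplied")
  sample(x, size = subsample_size, replace = TRUE)
}
strict_error <- tryCatch(resample_strict(1:10),
                         error = function(e) conditionMessage(e))

print(passes_missing)
print(length(drawn))
print(strict_error)
```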

Error related to randomisation test within lapply() function in R

I have 30 datasets that are combined in a data list. I want to analyze the spatial point patterns using the L function along with a randomisation test. The code follows.
The first code works well for a single dataset (data1) but once it is applied to a list of dataset with lapply() function as shown in 2nd code, it gives me a very long error like so,
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with 2nd code?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
grp <- factor(x$species)
window <- ripras(x$utmX, x$utmY)
pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
result <- list(L.grp=L.grp, LE.grp=LE.grp)
return(result)
})
plot(L.LE.sp$LE.grp[1])
This question is about the R package spatstat.
It would help if you could add a minimal working example including data which demonstrate this problem.
If that is not available, please generate the error on your computer, then type traceback() and capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
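The Poisson argument above can be checked numerically with base R. The sketch below assumes, as described, that the simulated count of a given type is Poisson with mean equal to the observed count, and that simulations are independent; the observed counts are made-up examples.

```r
# Hypothetical observed counts of a rare species in one dataset
n_obs <- c(1, 2, 5, 10)

# Probability that a single simulated pattern contains none of that species
p_zero <- dpois(0, lambda = n_obs)   # equals exp(-n_obs)

# Probability that at least one of nsim simulations contains none
nsim <- 99
p_any_zero <- 1 - (1 - p_zero)^nsim

print(round(data.frame(n_obs, p_zero, p_any_zero), 4))
```

Under these assumptions, species with only a handful of points are the ones most likely to trigger the error somewhere in a run of simulations.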
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.

rdirichlet distribution isn't working in R

I am attempting to run some code I wrote over a year ago and for some reason it is not working this time. Previously I had the matrix variables alpha and prob typed into R. However, this time I am importing them as csv files so it's easier for me to modify them with each iteration. I'm currently running R x64 2.12.2. I loaded the gtools, splines and stats4 packages.
alpha <- read.csv("AlphaOriginal.csv", header = FALSE)
prior <- array(0,c(45,45))
for (j in 1:45) prior[j,] <- rdirichlet(1,alpha[j,])
prob <- read.csv("ProbOriginal.csv", header = FALSE)
data <- array(0,c(45,45))
for (i in 1:45) data[i,] <- rmultinom(1,1,prob[i,])
posterior <- data%*%prior
write.table(posterior, file = "PosteriorOriginal.csv", row.names = FALSE, na = "", col.names = FALSE, sep = ",")
After the rdirichlet line I get the following error.
Error in rgamma (1*n, alpha) : invalid arguments
Does anyone know what this error means and how to fix it? Thx
If you take a look at the 'rdirichlet' help page you will find that the 'alpha' argument must be a vector, not a data.frame. So you should instead do:
rdirichlet(1, unlist(alpha[j,]))
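The difference between the typed-in matrix and the imported csv can be seen without gtools. The sketch below uses a small made-up csv supplied as text: a row taken from a data frame is still a data frame, whereas unlist() or an up-front as.matrix() yields the plain numeric vector that the answer says rdirichlet expects. The prob rows passed to rmultinom may need the same treatment, though that is not confirmed in the original exchange.

```r
# Made-up 3 x 3 parameter table standing in for AlphaOriginal.csv
csv_text <- "1,2,3\n4,5,6\n7,8,9"
alpha <- read.csv(text = csv_text, header = FALSE)

# A single row of a data frame is still a data frame, not a vector
row_is_df <- is.data.frame(alpha[1, ])

# Option 1: flatten the row when it is used
alpha_vec <- unlist(alpha[1, ])

# Option 2: convert once after reading, so alpha[j, ] is a plain vector
alpha_mat <- as.matrix(alpha)
row_from_matrix <- alpha_mat[1, ]

print(row_is_df)
print(is.numeric(alpha_vec))
print(is.numeric(row_from_matrix))
```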

R: ape/phylobase: unable to convert ultrametric, binary tree to hclust object (warning message)

I've imported a ClustalW2 tree in R using the read.tree function of the ape package. I estimate molecular ages using the chronopl function, resulting in an ultrametric, binary tree, from which I want to create a built-in R dendrogram object.
The tree plots fine and is a real phylo object. However, I'm running into problems when trying to convert it:
Minimal Working Example:
require(ape)
test.tree <- read.tree(file = "testree.phylip", text = NULL, tree.names = NULL, skip = 0,
comment.char = "#", keep.multi = FALSE)
test.tree.nu <- chronopl(test.tree, 0, age.min = 1, age.max = NULL,
node = "root", S = 1, tol = 1e-8,
CV = FALSE, eval.max = 500, iter.max = 500)
is.ultrametric(test.tree.nu)
is.binary.tree(test.tree.nu)
treeclust <- as.hclust.phylo(test.tree.nu)
The resulting tree "looks" fine.
I test to make sure the tree is ultrametric and binary, and then want to convert it into an hclust object so that I can eventually make a dendrogram object from it.
> is.binary.tree(test.tree.nu)
[1] TRUE
> is.ultrametric(test.tree.nu)
[1] TRUE
After trying to make a hclust object out of the tree, I get an error:
> tree.phylo <- as.hclust.phylo(test.tree.nu)
Error in if (tmp <= n) -tmp else nm[tmp] :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nm[inode] <- 1:N :
number of items to replace is not a multiple of replacement length
I realize this is a very detailed question, and perhaps such questions which are specifically related to certain packages are better asked somewhere else, but I hope someone is able to help me.
All help is much appreciated,
Regards,
File download
The Phylip file can be downloaded here
http://www.box.net/shared/rnbdk973ja
I can reproduce this with version 2.6-2 of ape under R 2.12.1 beta (2010-12-07 r53808) on Linux, but your code works in version 2.5-3 of ape.
This would suggest a bug has crept into the package, and you should inform the developers of the problem and ask for expert advice. The email address of the maintainer, Emmanuel Paradis, is on the CRAN page for ape.
It looks like the problem is that chronopl returns a tree which is either unrooted or has a multifurcating root (depending on how it's interpreted). Also, as.hclust.phylo has/had unhelpful error messages.
This:
modded.tree <- drop.tip(test.tree.nu,c(
'An16g06590','An02g12505','An11g00390','An14g01130'))
removes all tips from one of the three clades descending from the root, thus
is.ultrametric(modded.tree)
is.binary.tree(modded.tree)
is.rooted(modded.tree)
all return TRUE, and you can do
treeclust <- as.hclust.phylo(modded.tree)
That said, I think you really want an hclust object representing the multifurcating tree. Although hclust objects can handle those, as.hclust.phylo (from package 'ape') doesn't work on multifurcations for some reason. If you know a way to import newick files into hclust objects, that might be a way forward; ape has write.tree() to generate newick files.
