Get the most expressed genes from one .CEL file in R

In R, the limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets with the highest signal intensity with respect to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything is fine: you have many .CEL files and they all work.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and execute the script from scratch, so that the celData object contains only 1 sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one expected by the package.
If you instead run pa.calls() with gcrma.ExpressionSet as the parameter, everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"

Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
Use justGCRMA() rather than ReadAffy() as a faster and more memory-efficient way to get to an ExpressionSet.
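For instance, a minimal sketch, assuming the gcrma package is installed and the unpacked CEL files sit in the gse_number directory as above:
library(gcrma)
## read and summarize the CEL files in one step, without building an AffyBatch first
gcrma.ExpressionSet <- justGCRMA(celfile.path = gse_number)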
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.

Related

Error in is.single.string(object) : argument "object" is missing, with no default

I want to parse the AAChange.refGene column and then use the biomaRt R package to extract information. My code raises Error in is.single.string(object) : argument "object" is missing, with no default, even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was a conflict between packages (classic R), specifically biomaRt and seqinr, which handles the FASTA format and is therefore probably often loaded alongside it.
My solution was to call the function with the namespace prefix:
biomaRt::getSequence()
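For example, a minimal sketch of the loop from the question with the call namespace-qualified, so the seqinr version of getSequence() cannot mask the biomaRt one (df and ensembl are the objects built above; this is an illustration, not a tested fix):
pep <- vector()
for (i in seq_along(df$`Refseq ID`)) {
  temp <- biomaRt::getSequence(id = df$`Refseq ID`[i], type = "refseq_mrna",
                               seqType = "peptide", mart = ensembl)
  temp <- sapply(temp$peptide, nchar)
  temp <- sort(temp, decreasing = TRUE)
  pep[i] <- names(temp)[1]
}
df$Sequence <- pep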

Parallel package for Windows 10 in R

I have a dataset that I'm trying to parse in R. The data comes from HMDB and the dataset name is Serum Metabolites (an XML file). The XML file contains about 25K metabolite nodes, each of which I want to parse into sub-nodes.
I have code that parses the XML file into a list object in R.
Since the XML file is quite big, and since for each metabolite there are about 12 sub-nodes I want, it takes a long time to parse the file: about 3 hours per 1,000 metabolites.
I'm trying to use the parallel package but receive an error.
The packages:
library("XML")
library("xml2")
library( "magrittr" ) #for pipe operator %>%
library("pbapply") # to track on progress
library("parallel")
The function:
# The function receives an XML file (its location) and returns a list of nodes
Short_Parser_HMDB <- function(xml.file_location){
start.time<- Sys.time()
# Read as xml file
doc <- read_xml( xml.file_location )
#get metabolite nodes (only first three used in this sample)
met.nodes <- xml_find_all( doc, ".//d1:metabolite" ) [1:1000] # [(i*1000+1):(1000*i+1000)] # [1:3]
#list of data.frame
xpath_child.v <- c( "./d1:accession",
"./d1:name" ,
"./d1:description",
"./d1:synonyms/d1:synonym" ,
"./d1:chemical_formula" ,
"./d1:smiles" ,
"./d1:inchikey" ,
"./d1:biological_properties/d1:pathways/d1:pathway/d1:name" ,
"./d1:diseases/d1:disease/d1:name" ,
"./d1:diseases/d1:disease/d1:references",
"./d1:kegg_id" ,
"./d1:meta_cyc_id"
)
child.names.v <- c( "accession",
"name" ,
"description" ,
"synonyms" ,
"chemical_formula" ,
"smiles" ,
"inchikey" ,
"pathways_names" ,
"diseases_name",
"references",
"kegg_id" ,
"meta_cyc_id"
)
#first, loop over the met.nodes
L.sec_acc <- parLapply(cl, met.nodes, function(x) { # pblapply to track progress, or lapply (which slows the function down dramatically), or parLapply for parallel
#second, loop over the xpath desired child-nodes
temp <- parLapply(cl, xpath_child.v, function(y) {
xml_find_all(x, y ) %>% xml_text(trim = T) %>% data.frame( value = .)
})
#set their names
names(temp) = child.names.v
return(temp)
})
end.time<- Sys.time()
total.time<- end.time-start.time
print(total.time)
return(L.sec_acc )
}
Now create the environment:
# select the location where the XML file is
location= "D:/path/to/file//HMDB/DataSets/serum_metabolites/serum_metabolites.xml"
cl <-makeCluster(detectCores(), type="PSOCK")
clusterExport(cl, c("Short_Parser_HMDB", "cl"))
clusterEvalQ(cl,{library("parallel")
library("magrittr")
library("XML")
library("xml2")
})
And execute:
Short_outp<-Short_Parser_HMDB(location)
stopCluster(cl)
The error received:
> Short_outp<-Short_Parser_HMDB(location)
Error in checkForRemoteErrors(val) :
one node produced an error: invalid connection
Based on those links, I tried to implement the parallel processing:
Parallel Processing in R
How to call global function from the parLapply function?
Error in R parallel:Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: cannot open the connection
but I couldn't find "invalid connection" mentioned as an error in any of them.
I'm using Windows 10 and the latest R version, 4.0.2 (not sure if that's enough information).
Any hint or idea will be appreciated.

How to apply rma() normalization to a unique CEL file?

I have implemented an R script that performs batch correction on a gene expression dataset. To do the batch correction, I first need to normalize the data in each CEL file through the Affy rma() function from Bioconductor.
If I run it on the GSE59867 dataset obtained from GEO, everything works.
I define a batch as the data collection date: I put all the CEL files having the same date into a specific folder, and then consider that date/folder as a specific batch.
On the GSE59867 dataset, a batch/folder contains only 1 CEL file. Nonetheless, the rma() function works on it perfectly.
But if I instead try to run my script on another dataset (GSE36809), I run into trouble: if I try to apply the rma() function to a batch/folder containing only 1 file, I get the following error:
Error in `colnames<-`(`*tmp*`, value = "GSM901376_c23583161.CEL.gz") :
attempt to set 'colnames' on an object with less than two dimensions
Here's my specific R code, so you can see what I mean.
You first have to download the file GSM901376_c23583161.CEL.gz:
setwd(".")
options(stringsAsFactors = FALSE)
fileURL <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM901nnn/GSM901376/suppl/GSM901376%5Fc23583161%2ECEL%2Egz"
fileDownloadCommand <- paste("wget ", fileURL, " ", sep="")
system(fileDownloadCommand)
Library installation:
source("https://bioconductor.org/biocLite.R")
list.of.packages <- c("easypackages")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
listOfBiocPackages <- c("oligo", "affyio","BiocParallel")
bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")
if( length(bioCpackagesNotInstalled) ) {
biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}
library("easypackages")
libraries(list.of.packages)
libraries(listOfBiocPackages)
Application of rma()
thisFileDate <- "GSM901376_c23583161.CEL.gz"
thisDateRawData <- read.celfiles(thisFileDate)
thisDateNormData <- rma(thisDateRawData)
After the call to rma(), I get the error.
How can I solve this problem?
I also tried to skip this normalization by saving the thisDateRawData object directly. But then I have the problem that I cannot combine this thisDateRawData (which is an ExpressionFeatureSet) with the outputs of rma() (which are ExpressionSet objects).
(EDIT: I extensively edited the question and added a piece of R code you should be able to run on your PC.)
Hmm. This is a puzzling problem. The oligo::rma() function might be buggy for class GeneFeatureSet with single samples. I got it to work with a single sample by using lower-level functions, but that means I also had to create the ExpressionSet from scratch by specifying the slots:
# source("https://bioconductor.org/biocLite.R")
# biocLite("GEOquery")
# biocLite("pd.hg.u133.plus.2")
# biocLite("pd.hugene.1.0.st.v1")
library(GEOquery)
library(oligo)
# # Instead of using .gz files, I extracted the actual CELs.
# # This is just to illustrate how I read in the files; your usage will differ.
# projectDir <- "" # Path to .tar files here
# setwd(projectDir)
# untar("GSE36809_RAW.tar", exdir = "GSE36809")
# untar("GSE59867_RAW.tar", exdir = "GSE59867")
# setwd("GSE36809"); gse3_cels <- dir()
# sapply(paste(gse3_cels, sep = "/"), gunzip); setwd(projectDir)
# setwd("GSE59867"); gse5_cels <- dir()
# sapply(paste(gse5_cels, sep = "/"), gunzip); setwd(projectDir)
#
# Read in CEL
#
# setwd("GSE36809"); gse3_cels <- dir()
# gse3_efs <- read.celfiles(gse3_cels[1])
# # Assuming you've read in the CEL files as a GeneFeatureSet or
# # ExpressionFeatureSet object (i.e. gse3_efs in this example),
# # you can now fit the RMA and create an ExpressionSet object with it:
exprsData <- basicRMA(exprs(gse3_efs), pnVec = featureNames(gse3_efs))
gse3_expset <- new("ExpressionSet")
slot(gse3_expset, "assayData") <- assayDataNew(exprs = exprsData)
slot(gse3_expset, "phenoData") <- phenoData(gse3_efs)
slot(gse3_expset, "featureData") <- annotatedDataFrameFrom(attr(gse3_expset,
'assayData'), byrow = TRUE)
slot(gse3_expset, "protocolData") <- protocolData(gse3_efs)
slot(gse3_expset, "annotation") <- slot(gse3_efs, "annotation")
Hopefully the above approach will work in your code.
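If the end goal is to merge this single-sample batch with the batches that rma() handled normally, Biobase::combine() should work on two ExpressionSets; a minimal sketch, where eset_batch2 is a hypothetical ExpressionSet produced by rma() on another batch of the same platform:
library(Biobase)
## combine() requires matching featureNames and annotation in both objects
combined_eset <- combine(gse3_expset, eset_batch2)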

Calling CustomCSVIter function using the mxnet package in R

I'm trying to train a list of text datasets at the character level (for example, "a cat" => "a", " ", "c", "a", "t") so that I can classify them with great accuracy. I'm using the mxnet package (a CNN network) in R with the crepe model. To prepare for training, I need to create iterators for both the training and test datasets. The code is as follows:
train.iter <- CustomCSVIter$new(iter=NULL, data.csv=train.file.output,
batch.size=args$batch_size, alphabet=alphabet,
feature.len=feature.len)
test.iter <- CustomCSVIter$new(iter=NULL, data.csv=test.file.output,
batch.size=args$batch_size, alphabet=alphabet,
feature.len=feature.len)
data.csv is where I have these datasets, batch.size is just an integer, feature.len is also just an integer, and alphabet is a vector of alphanumeric characters and punctuation (abcd...?!""). When I run the above code, I get a message saying there was a fatal error, and RStudio crashes and reloads. I don't know what I'm doing wrong. To run the above code, you need the following function:
CustomCSVIter <- setRefClass("CustomCSVIter",
fields=c("iter", "data.csv", "batch.size",
"alphabet","feature.len"),
contains = "Rcpp_MXArrayDataIter",
methods=list(
initialize=function(iter, data.csv, batch.size,
alphabet, feature.len){
csv_iter <- mx.io.CSVIter(data.csv=data.csv,
data.shape=feature.len+1, #=features + label
batch.size=batch.size)
.self$iter <- csv_iter
.self$data.csv <- data.csv
.self$batch.size <- batch.size
.self$alphabet <- alphabet
.self$feature.len <- feature.len
.self
},
value=function(){
val <- as.array(.self$iter$value()$data)
val.y <- val[1,]
val.x <- val[-1,]
val.x <- dict.decoder(data=val.x,
alphabet=.self$alphabet,
feature.len=.self$feature.len,
batch.size=.self$batch.size)
val.x <- mx.nd.array(val.x)
val.y <- mx.nd.array(val.y)
list(data=val.x, label=val.y)
},
iter.next=function(){
.self$iter$iter.next()
},
reset=function(){
.self$iter$reset()
},
num.pad=function(){
.self$iter$num.pad()
},
finalize=function(){
.self$iter$finalize()
}
)
)
Usually a problem like this arises when there is a mismatch between the shape of the input file and the data.shape parameter of the iterator.
You can easily check whether this is the problem by running your code outside of RStudio. Run R from the terminal/command line and paste your code there. When an exception happens, it will terminate the R session, and you will be able to read the exception message. In my case it was:
Check failed: row.length == shape.Size() (2 vs. 1) The data size in CSV do not match size of shape: specified shape=(1,), the csv row-length=2
In your case it is probably something similar.
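As a quick way to check for such a mismatch without leaving R, a minimal sketch (train.file.output and feature.len are the objects from the question's setup):
## each CSV row should contain feature.len + 1 values (the features plus the label)
first_row <- read.csv(train.file.output, header = FALSE, nrows = 1)
ncol(first_row)    # compare against feature.len + 1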
By the way, there is an implementation of a custom iterator for the MNIST dataset which you may find useful: https://github.com/apache/incubator-mxnet/issues/4105#issuecomment-266190690

Getting an error while installing the DMwR package

Hi, I am getting this error message while installing the DMwR package from RGui 3.3.1.
Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open the connection
In addition: Warning messages:
1: In unzip(zipname, exdir = dest) : error 1 in extracting from zip file
2: In read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open compressed file 'bitops/DESCRIPTION', probable reason 'No such file or directory'
Approach 1:
The error being reported is an inability to open a connection. On Windows that is often a firewall problem and is covered in the R for Windows FAQ. The usual first attempt should be to use internet2.dll. From a console session you can run:
setInternet2(TRUE)
However, from the NEWS for R version 3.3.1 Patched (2016-09-13 r71247):
(Windows only) Function setInternet2() has no effect and will be removed in due course. The choice between methods "internal" and "wininet" is now made by the method arguments of url() and download.file() and their defaults can be set via options. The out-of-the-box default remains "wininet" (as it has been since R 3.2.2).
You are using version 3.3.1, which is why this no longer works.
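If the connection really is the problem, a minimal sketch of the replacement mechanism mentioned in that NEWS entry, which sets the download method via an option instead of setInternet2() (the choice of "wininet" matches the Windows default of that era; "libcurl" is another option):
options(download.file.method = "wininet")
install.packages("DMwR")    # retry the installation with the chosen method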
Approach 2:
The error suggests that the package requires another package, bitops, that is not available. That package is not among the declared dependencies, but perhaps one of the dependencies requires it in turn (in this case, it is ROCR).
Try installing:
install.packages("bitops",repos="https://cran.r-project.org/bin/windows/contrib/3.3/bitops_1.0-6.zip",dependencies=TRUE,type="source")
The package DMwR lists the packages abind, zoo, xts, quantmod and ROCR as imports. So, in addition to installing those 5 packages, you must install the DMwR package itself. Install the packages manually, in the following sequence:
install.packages('abind')
install.packages('zoo')
install.packages('xts')
install.packages('quantmod')
install.packages('ROCR')
install.packages("DMwR")
library("DMwR")
Approach 3:
chooseCRANmirror()
Select a CRAN mirror from the popup list. Then install the packages:
install.packages("bitops")
install.packages("DMwR")
Package ‘DMwR’ was removed from the CRAN repository.
Formerly available versions can be obtained from the archive.
https://CRAN.R-project.org/package=DMwR
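For instance, a minimal sketch for installing directly from the archive (assuming 0.4.1 is the archived version you want; its imported packages still need to be installed first):
install.packages("https://cran.r-project.org/src/contrib/Archive/DMwR/DMwR_0.4.1.tar.gz",
                 repos = NULL, type = "source")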
Alternatively, you can use the SMOTE function as written in the CRAN package. Copy the following code into a new R script, run it, and save it for future use if you want. Once you have run this function, you should be able to use it the way you have been trying to.
# ===================================================
# Creating a SMOTE training sample for classification problems
#
# If called with learner=NULL (the default) it does not
# learn any model, simply returning the SMOTEd data set
#
# NOTE: It does not handle NAs!
#
# Examples:
# ms <- SMOTE(Species ~ .,iris,'setosa',perc.under=400,perc.over=300,
# learner='svm',gamma=0.001,cost=100)
# newds <- SMOTE(Species ~ .,iris,'setosa',perc.under=300,k=3,perc.over=400)
#
# L. Torgo, Feb 2010
# ---------------------------------------------------
SMOTE <- function(form,data,
perc.over=200,k=5,
perc.under=200,
learner=NULL,...
)
# INPUTS:
# form a model formula
# data the original training set (with the unbalanced distribution)
# minCl the minority class label
# per.over/100 is the number of new cases (smoted cases) generated
# for each rare case. If perc.over < 100 a single case
# is generated uniquely for a randomly selected perc.over
# of the rare cases
# k is the number of neighbours to consider as the pool from where
# the new examples are generated
# perc.under/100 is the number of "normal" cases that are randomly
# selected for each smoted case
# learner the learning system to use.
# ... any learning parameters to pass to learner
{
# the column where the target variable is
tgt <- which(names(data) == as.character(form[[2]]))
minCl <- levels(data[,tgt])[which.min(table(data[,tgt]))]
# get the cases of the minority class
minExs <- which(data[,tgt] == minCl)
# generate synthetic cases from these minExs
if (tgt < ncol(data)) {
cols <- 1:ncol(data)
cols[c(tgt,ncol(data))] <- cols[c(ncol(data),tgt)]
data <- data[,cols]
}
newExs <- smote.exs(data[minExs,],ncol(data),perc.over,k)
if (tgt < ncol(data)) {
newExs <- newExs[,cols]
data <- data[,cols]
}
# get the undersample of the "majority class" examples
selMaj <- sample((1:NROW(data))[-minExs],
as.integer((perc.under/100)*nrow(newExs)),
replace=T)
# the final data set (the undersample+the rare cases+the smoted exs)
newdataset <- rbind(data[selMaj,],data[minExs,],newExs)
# learn a model if required
if (is.null(learner)) return(newdataset)
else do.call(learner,list(form,newdataset,...))
}
# ===================================================
# Obtain a set of smoted examples for a set of rare cases.
# L. Torgo, Feb 2010
# ---------------------------------------------------
smote.exs <- function(data,tgt,N,k)
# INPUTS:
# data are the rare cases (the minority "class" cases)
# tgt is the name of the target variable
# N is the percentage of over-sampling to carry out;
# and k is the number of nearest neighbours to use for the generation
# OUTPUTS:
# The result of the function is a (N/100)*T set of generated
# examples with rare values on the target
{
nomatr <- c()
T <- matrix(nrow=dim(data)[1],ncol=dim(data)[2]-1)
for(col in seq.int(dim(T)[2]))
if (class(data[,col]) %in% c('factor','character')) {
T[,col] <- as.integer(data[,col])
nomatr <- c(nomatr,col)
} else T[,col] <- data[,col]
if (N < 100) { # only a percentage of the T cases will be SMOTEd
nT <- NROW(T)
idx <- sample(1:nT,as.integer((N/100)*nT))
T <- T[idx,]
N <- 100
}
p <- dim(T)[2]
nT <- dim(T)[1]
ranges <- apply(T,2,max)-apply(T,2,min)
nexs <- as.integer(N/100) # this is the number of artificial exs generated
# for each member of T
new <- matrix(nrow=nexs*nT,ncol=p) # the new cases
for(i in 1:nT) {
# the k NNs of case T[i,]
xd <- scale(T,T[i,],ranges)
for(a in nomatr) xd[,a] <- xd[,a]==0
dd <- drop(xd^2 %*% rep(1, ncol(xd)))
kNNs <- order(dd)[2:(k+1)]
for(n in 1:nexs) {
# select randomly one of the k NNs
neig <- sample(1:k,1)
ex <- vector(length=ncol(T))
# the attribute values of the generated case
difs <- T[kNNs[neig],]-T[i,]
new[(i-1)*nexs+n,] <- T[i,]+runif(1)*difs
for(a in nomatr)
new[(i-1)*nexs+n,a] <- c(T[kNNs[neig],a],T[i,a])[1+round(runif(1),0)]
}
}
newCases <- data.frame(new)
for(a in nomatr)
newCases[,a] <- factor(newCases[,a],levels=1:nlevels(data[,a]),labels=levels(data[,a]))
newCases[,tgt] <- factor(rep(data[1,tgt],nrow(newCases)),levels=levels(data[,tgt]))
colnames(newCases) <- colnames(data)
newCases
}
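A quick hedged usage sketch of the function above, using an artificially imbalanced two-class version of iris purely for illustration:
data(iris)
iris2 <- iris
## collapse to two classes so that one of them is clearly the minority
iris2$Species <- factor(ifelse(iris2$Species == "setosa", "rare", "common"))
smoted <- SMOTE(Species ~ ., iris2, perc.over = 200, perc.under = 200)
table(smoted$Species)   # class balance after over- and under-sampling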
It has been removed from the CRAN repository. There are instructions on how to retrieve it from the archive.
Either follow the link - https://packagemanager.rstudio.com/client/#/repos/2/packages/DMwR
Or copy and paste the three lines of code below:
install.packages("devtools")
devtools::install_version('DMwR', '0.4.1')
library("DMwR")
EDIT: this is the error I got while downloading the DMwR package in 2022, but it looks like when the question was posted the error happened for a different reason.
The reason is that the package 'DMwR' was built under R version 3.4.3. The solution is explained in detail in the accepted answer; in short, just run the script below to solve the problem:
install.packages('abind')
install.packages('zoo')
install.packages('xts')
install.packages('quantmod')
install.packages('ROCR')
install.packages("DMwR")
library("DMwR")
