SPIA package (R) applied to Illumina expression microarray data

I've been experimenting with alternatives to GSEA for annotating expression (mRNA) data.
SPIA (Signaling Pathway Impact Analysis) looks interesting, but it seems to have exactly one error message for everything:
Error in spia(de = sigGenes, all = allGenes, organism = "hsa", plots = TRUE, :
de must be a vector of log2 fold changes. The names of de should be
included in the reference array!
The input requires a single vector of log2 fold changes (my vector is named sigGenes), with Entrez IDs as the associated names, and an integer vector of the Entrez IDs included in the microarray (allGenes):
head(sigGenes)
6144 115286 23530 10776 83933 6232
0.368 0.301 0.106 0.234 -0.214 0.591
head(allGenes)
6144 115286 23530 10776 83933 6232
I've already removed entries whose EntrezID annotations are NA.
I've also subset my data from the Illumina microarray to only those genes found in the Affymetrix array, following the example on the site linked below. I still get the same error.
Here is the full bit of R code:
library(Biobase)
library(limma)
library(SPIA)
sigGenes <- subset(full_table, P.Value<0.01)$logFC
names(sigGenes) <- subset(full_table, P.Value<0.01)$EntrezID
sigGenes<-sigGenes[!is.na(names(sigGenes))] # remove NAs
allGenes <- unique(full_table$EntrezID[!is.na(full_table$EntrezID)])
spiaOut <- spia(de=sigGenes, all=allGenes, organism="hsa", plots=TRUE, data.dir="./")
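Before calling spia(), it may help to verify the two preconditions the error message hints at (a quick sketch, using sigGenes and allGenes as defined above):

```r
# Sanity checks before spia(): the names of de must be unique Entrez IDs,
# and every one of them must appear in the reference array.
any(duplicated(names(sigGenes)))   # should be FALSE
all(names(sigGenes) %in% allGenes) # should be TRUE
```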
Any ideas of what else I could try?
Apologies if off topic (still new here). Happy to move the question elsewhere if needed.
Example of SPIA applied to Affymetrix platform data here: http://www.gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html

Removing the duplicates did help.
As a workaround, I chose the median value among each set of duplicates (only because the values were close), as follows:
dups<-unique(names(sigGenes[which(duplicated(names(sigGenes)))])) # determine which are duplicates
dupID<-names(sigGenes) %in% dups # determine the location of all duplicates
sigGenes_dup<-vector(); j=0; # determine the median value for each duplicate
for (i in dups){j=j+1; sigGenes_dup[j]<- median(sigGenes[names(sigGenes)==i]) }
names(sigGenes_dup)<-dups
sigGenes<-sigGenes[!(names(sigGenes) %in% dups)] # remove duplicates from sigGenes
sigGenes<-c(sigGenes,sigGenes_dup) # append the median values of the duplicates
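The loop above can be collapsed into one step with tapply (a sketch; it assumes sigGenes is the named vector from the question):

```r
# tapply groups sigGenes by its names (the Entrez IDs) and takes the
# median log2 fold change of each group, removing duplicates in one pass.
med <- tapply(sigGenes, names(sigGenes), median)
sigGenes <- setNames(as.numeric(med), names(med))
```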
Alternatively just removing the duplicates works:
dups<-unique(names(sigGenes[which(duplicated(names(sigGenes)))]))
sigGenes<-sigGenes[!(names(sigGenes) %in% dups)] # remove duplicates from sigGenes

Based on our discussion, I would suggest removing the duplicated entries in sigGenes. Without additional information, it is hard to say where the duplicates originate from, or which ones to delete.

Related

Create a histogram of specific columns and rows from a `data.frame` in R

## my data frame
crime = read.csv("url")
## specific columns that need to be represented
property_crime = crime$Burglary + crime$Theft + crime$`Motor Vehical Theft`
## the rows that I am looking for have the name "harris" within the column named "county_name"
## my attempt
with(crime, hist(harris))
## Error in hist(harris) : object 'harris' not found
Not sure why I am getting object 'harris' not found as that is the name under the county_name column. I'm new to R, could someone walk me through the process of displaying a histogram only including the values of specific columns and specific rows?
the rows that I am looking for have the name "harris" within the column named "county_name"
You have to tell R the same logic that you are telling us.
There are several ways of doing this in R, but here is the base R way.
We can access the desired rows and columns of the object crime by indexing like data.frame[rows, columns]. To get the rows for harris, we can build a logical index: crime$county_name == "harris". Note that hist() needs a numeric column, so you apply that row filter to a numeric column (not to county_name itself, which holds text), e.g.:
hist(crime[crime$county_name == "harris", "Burglary"])
You don't provide a reproducible example, but you can check similar logic with the mtcars dataset. Here, I am making a histogram of the cars with mpg > 15:
hist(mtcars[mtcars$mpg > 15, "mpg"])
# this is another option that produces the same result
# hist(mtcars$mpg[mtcars$mpg > 15])
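Applying the same row logic to the question's own goal would look like this (a sketch only, since the crime data isn't available; the column names are taken verbatim from the question):

```r
# Keep only the harris rows, then sum the three property-crime columns
# and plot the histogram of the combined values.
harris <- crime[crime$county_name == "harris", ]
property_crime <- harris$Burglary + harris$Theft + harris$`Motor Vehical Theft`
hist(property_crime)
```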

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform an HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start that all of my columns are of type 'factor', just to loop over them afterwards and convert them back to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come it works perfectly fine when I first read everything in as factor and then convert it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on the columns in that case, I get the exact same error. Would anyone have any clue as to how to fix this, or what I am doing wrong exactly? For completeness I inserted a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!
I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)
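For a self-contained illustration, here is the same call on the children contingency table that ships with FactoMineR (not the question's data; nb.clust = -1 requests the automatic tree cut so no interactive click is needed):

```r
library(FactoMineR)
data(children)                     # small contingency table bundled with FactoMineR
res.ca <- CA(children[1:14, 1:5], graph = FALSE)
# Cluster the CA columns directly by passing the whole CA object.
res.hcpc <- HCPC(res.ca, cluster.CA = "columns", nb.clust = -1, graph = FALSE)
```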

undefined columns selected (Bayesian analysis)

I am replicating R code for a Bayesian analysis, but I get an error that I have tried to solve (also reading other questions here), and it still does not work.
I use the same dataset and the same variables (from the OECD). Can anyone tell me why it does not work?
My code is this:
rm(list=ls())
# Name of variables to be extracted
v.resp=c("pv1math") # Response Variable
v.treat=c("IC02Q01","IC02Q02","IC02Q03") # Treatment variable(s)
# Student Confoundings
v.student.conf=c("Age", "Gender", "isced_0", "IMMIG", "HEDRES", "WEALTH", "ESCS","FAMSTRUC","hisced","hisei","HOMEPOS", "TIMEINT")
# School Confoundings
v.school.conf=c("CLSIZE","SCMATEDU","STRATIO","SMRATIO","PublicPrivate")
## LOAD DATA
library(foreign) # read.dta() comes from the foreign package
dat <- read.dta("name.dta")
## Weighted sample with weights in the w vector
w=dat$W_FSTUWT
## Subset data in R
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)]
names(dat)[names(dat)==v.resp]="y"
w=w[complete.cases(dat)]
w=w/sum(w)
nw=function(w) w/sum(w)
dat=dat[complete.cases(dat),]
dim(dat)
When I run the line
dat=dat[c(v.resp,v.treat,v.student.conf,v.school.conf)] I get the error
Error in `[.data.frame`(dat, c(v.resp, v.treat, v.student.conf, v.school.conf)) : undefined columns selected
I have 25,000 observations and 900 variables, but I want to subset my data to 21 variables and the observations related to them (fewer than 25,000 for sure). I tried putting a comma before the closing )], but nothing changed, and when I run the other lines I lose all my data.
I also run this code from "Quick-R website" but again the same error message
# select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
I would like to understand why it does not work. I am copying and pasting these codes from a paper that used them for the same dataset.
Thank you.
The message states: undefined columns selected. With a data frame, dat[character_vector] (no comma) is a valid way to select columns, so the missing comma is not the problem here. The error means that at least one name in c(v.resp, v.treat, v.student.conf, v.school.conf) does not exactly match a column name in dat (variable names in OECD files often differ in case or spelling between releases). You can list the offending names with:
setdiff(c(v.resp,v.treat,v.student.conf,v.school.conf), names(dat))
Fix those names against names(dat) and the subsetting line will work as written. Note that adding a comma before the closing ] would instead make R look up those strings as row names, which is not what you want.
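A quick, reproducible way to see which requested names are missing (toy data frame; the column names here are made up for illustration):

```r
# Any name returned by setdiff() is a column R cannot find, which is
# exactly what triggers "undefined columns selected".
dat <- data.frame(pv1math = 1:3, Age = 4:6)
wanted <- c("pv1math", "Age", "WEALTH")
setdiff(wanted, names(dat)) # "WEALTH" is the undefined column
```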

R unique function vs eliminating duplicated values

I have a data frame and an info file (also in table format) that describes the data within the data frame. The row names of the data frame need to be relabelled according to information within the info file. The problem is that the information corresponding to the data frame row names in the info file contains many duplicated values. Hence it is necessary to convert the df to a matrix so that the row names can hold duplicate values.
matrix1<-as.matrix(df)
ptr<-match(rownames(matrix1), info_file$Array_Address_Id)
rownames(matrix1)<-info_file$ILMN_Gene[ptr]
matrix1<-matrix1[!duplicated(rownames(E.rna_conditions_cleaned)), ]
The above is my own code however a friend gave me some code with a similar goal but different results:
u.genes <- unique(info_file$ILMN_Gene)
ptr.u.genes <- match( u.genes, info_file$ILMN_Gene )
matrix2 <- as.matrix(df[ptr.u.genes,])
rownames(matrix2) <- u.genes
The problem is that these two strategies output different results:
> dim(matrix1)
[1] 30783 565
> dim(matrix2[,ptr.use])
[1] 34694 565
See above matrix2 has ~4000 more rows than the other.
As you can see, the row names of the output below are indeed unique, but that doesn't explain why the two methods selected different rows. Which method is better, and why do the outputs differ?
U.95 JIA.65 DV.93 KD.76 HC.54 KD.77
7A5 5.136470 5.657738 5.122299 5.195540 5.378040 4.997210
A1BG 6.166210 6.210373 6.382051 6.494048 5.888900 5.914070
A1CF 5.222130 4.940529 4.715292 5.182658 4.510937 5.060749
A26C3 5.410403 5.148601 5.122299 3.967419 4.780758 4.868472
A2BP1 5.725115 4.817920 5.483607 5.444427 5.503358 5.121951
A2LD1 6.505271 6.558276 5.494096 4.833267 6.988192 6.082662
I need to know this because I want the row values that will yield the most accurate downstream analysis.
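On the same vector of names, the two approaches keep exactly the same (first) occurrences, which suggests the size difference comes from the two pipelines not filtering on the same names (note that matrix1 is filtered on rownames(E.rna_conditions_cleaned), a different object). A small sketch:

```r
# duplicated() keeps the first of each repeated name;
# match(unique(x), x) also returns the first position of each name.
x <- c("A", "B", "A", "C", "B")
which(!duplicated(x))   # 1 2 4
match(unique(x), x)     # 1 2 4
```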

Creating eset object from preprocessed expression matrix?

I am analysing some gene expression data with R. I would like to do differential gene expression analysis with limma's eBayes (limma is part of Bioconductor), but to do that I need to have my expression data as an eset object. The thing is, I only have preprocessed data and do not have the CEL files that I could convert directly to an eset object. I tried searching the Internet but couldn't find a solution. The only thing I found was that it IS possible.
Why eBayes:
It should give robust results even with only two or three samples in some of the groups, and I do indeed have 3 groups that are 2 to 3 samples in size.
In detail what I have and want to do:
I have expression data, already as logarithmic, normalized intensity values. The data is in an expression matrix. There are about 20,000 rows; each row is a gene, and the row names are the official gene names. There are 22 columns, and each column corresponds to one cancer sample. I have different cancer subtypes and would like to compare, for example, the gene expression of subtype 1 samples to that of group 2. Below is a two-row, 5-column example of what my matrix looks like.
Example matrix:
SAMP1 SAMP2 SAMP3 SAMP4 SAMP5
GENE1 123.764 122.476 23.4764 2.24343 123.3124
GENE2 224.233 455.111 124.122 112.155 800.4516
The problem:
To evaluate the differential gene expression with eBayes I would need the eset object out of this expression data and I have honestly no idea how to go about that step. :(
I am very grateful for every bit of info that can help me out! If someone can suggest another reliable method for small sample size comparisons, that might solve my problem as well.
Thank you!
Using an ExpressionSet seems to be quite similar to using a SummarizedExperiment, which is also prevalent in Bioconductor packages. From what I understand, there is nothing special about using one or the other in a package; in my experience, it's a generalized container used to standardize the data set format across Bioconductor packages.
From the vignette on Bioconductor:
Affymetrix data will usually be normalized using the affy package. We will assume here that the data is available as an ExpressionSet object called eset. Such an object will have a slot containing the log-expression values for each gene on each array, which can be extracted using exprs(eset).
In other words, there's nothing special about the data for the ExpressionSet. An ExpressionSet is simply a bunch of related experimental data strung together into one object, and it appears that I can create a new one from the assay matrix alone:
library(Biobase) # ExpressionSet is defined in Biobase
library(limma)
# counts is the assay data (a matrix) I already have.
dim(counts)
# [1] 64102 8
# Creates a new ExpressionSet object (quite bare, only the assay data)
asdf <- ExpressionSet(assayData = counts)
# Returns the data you put in.
exprs(asdf)
This works on my setup.
The second part that you need to consider is the design of the differential expression comparison: the model matrix. You will need predefined factors to go along with your samples (probably within a phenoData argument to ExpressionSet), and then create a model.matrix using R's special formula syntax. Formulas look similar to: dependent ~ factor1 + factor2 + co:related. Note that factor1 is a factor (a category or dimension), not just one level.
Once you have that, you should be able to run lmFit. I've actually not used limma much before but it appears to be similar to edgeR's scheme.
Just decided to make this an answer to help some other poor sod who has the same accident. I figured the problem out myself after going through the links kindly given in the comments.
ExpressionSet() does take matrices and turns them into an eSet object fine. I just had to make sure the data was a matrix rather than a data frame.
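Putting both answers together, here is a minimal end-to-end sketch (simulated matrix; the group labels and sizes are assumptions, not from the question):

```r
library(Biobase) # provides ExpressionSet
library(limma)

# Simulated log-intensity matrix: 200 genes x 6 samples.
set.seed(1)
mat <- matrix(rnorm(200 * 6), nrow = 200,
              dimnames = list(paste0("GENE", 1:200), paste0("SAMP", 1:6)))
eset <- ExpressionSet(assayData = mat)

# Two assumed subtypes of three samples each.
group <- factor(rep(c("subtype1", "subtype2"), each = 3))
design <- model.matrix(~ group)

fit <- eBayes(lmFit(eset, design))
topTable(fit, coef = 2) # top-ranked genes for subtype2 vs subtype1
```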
