interpreting R code function

I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes, and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes etc.).
I found this example code on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1 - What does this mean? I am not familiar with this multiple-colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 arguments passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to the hyperg.test function?
pVals.by.pathway <- t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# creates a named list, length 449, e.g.:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
    demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# choose one of these for demonstration. the first (a whole-genome random
# set of 100 genes) has very little enrichment; the second, a random set
# from the pathways themselves, has very good enrichment in some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <-
    function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
    white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
    white.balls.in.urn <- length(pathway.genes)
    total.balls.in.urn <- length(all.genes)
    black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
    balls.pulled.from.urn <- length(significant.genes)
    hyperg(white.balls.in.urn, black.balls.in.urn,
           balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
    t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)

The reason you are getting your error is that you don't appear to have the Category package from Bioconductor installed. I suspect this because of the triple-colon operator :::. This operator is very similar to the double-colon operator ::. Whereas :: lets you access exported objects from a package without loading it, ::: also gives access to non-exported objects (in this case the .doHyperGInternal function from Category). If you install the Category package, the code runs without error.
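A small illustration of the difference, using the base stats package as a stand-in (not part of the original code):
stats::sd(c(1, 2, 3))            # :: reaches an exported object without library()
# Category:::.doHyperGInternal   # ::: additionally reaches non-exported objects,
                                 # which is why Category must be installed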
With regard to the sapply statement:
pVals.by.pathway <- t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into its separate parts to understand it. First, sapply iterates over the elements of genes.by.pathway and passes each one to the first argument of hyperg.test. The following arguments are the two additional parameters. This is a little unclear, and I personally recommend explicitly naming the parameters; it avoids unexpected surprises and removes the need to keep them in exactly the right order. It is a little repetitive in this case, but it is a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once this loop completes, sapply simplifies the output into a matrix. However, the output is much more user-friendly after taking the transpose with t().
Generally speaking, when trying to understand complex apply statements, I find it best to break them apart into smaller pieces and see what the objects themselves look like.
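For example, a minimal way to inspect a single iteration, assuming the objects from the code above (genes.by.pathway, significant.genes, all.geneIDs) are in your workspace:
one.pathway <- genes.by.pathway[[1]]       # the element sapply passes first
one.result <- hyperg.test(one.pathway,
                          significant.genes = significant.genes,
                          all.genes = all.geneIDs)
str(one.result)                            # see what a single call returns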

Related

How to create objects in R with "unknown name"?

I am trying to create (vector) objects in R, but I don't want to specify the names of the objects a priori. For example, if I have a list of length 3, I want to create the objects p1 to p3, and if I have a list of length 10, the objects p1 to p10 should be created. The length should be arbitrary and not determined a priori.
Thanks for your help!
I guess the proper way of doing that is to use a list p = list(); then you can use p[[i]] with i as big as you wish, without having specified any length.
Then, once your list is filled up, you can rename its elements: names(p) = paste0("p", c(1:length(p)))
Finally, if you want all the pi variables to be directly accessible, you can add attach(p).
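A minimal sketch of that list-based approach (the numbers are made up):
p <- list()
for (i in 1:3) {
  p[[i]] <- rnorm(5)                  # grow the list without fixing its length
}
names(p) <- paste0("p", seq_along(p))
attach(p)                             # p1, p2, p3 become directly accessible
p3
detach(p)                             # detach again to avoid masking surprises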
This is kind of a hack, but you can do the following:
short_list <- list(rnorm(10),rnorm(20),1:3)
long_list <- c(short_list,short_list )
paste0("p",seq_along(short_list))
mapply(assign, paste0("p",seq_along(short_list)), short_list, MoreArgs = list(envir = .GlobalEnv))
result:
> p3
[1] 1 2 3
you can do the same with long_list
I don't see a statistical model for which you would need this. Better to start working with lists like short_list or data.frames directly.
PS: If you just want to use it for glm, you probably want to learn about formulas in R.
glm(y ~ ., data = your_data) takes all columns in your data frame that are not named y as regressors. Maybe this helps.
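A tiny illustration of the formula interface, with made-up data:
your_data <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
fit <- glm(y ~ ., data = your_data)   # every column except y is a regressor
summary(fit)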
assign (and maybe also attach) are often a sign that you have not yet arrived at an "R-ish" version of the code.
Considering that you need this for modeling: if your $p_1, \ldots, p_n$ are of the same type, you can put them into a matrix (inside a column of a data.frame; for modeling they need to be of the same length anyway):
df$matrix <- p.matrix
If you directly create the data.frame, you need to make sure the matrix is not expanded to data.frame columns:
df <- data.frame(matrix = I(matrix), ...)
Then glm(y ~ matrix, ...) will work.
For examples of this technique see e.g. packages pls or hyperSpec or the pls paper in the Journal of Statistical Software.
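A minimal sketch of this technique with made-up data (p.matrix and df are hypothetical names):
p.matrix <- matrix(rnorm(30), nrow = 10, ncol = 3)      # p1 ... p3 as columns
df <- data.frame(y = rnorm(10), matrix = I(p.matrix))   # I() keeps it one column
str(df)                                                 # "matrix" stays a single matrix column
fit <- glm(y ~ matrix, data = df)                       # the whole matrix enters the model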

Loop and clear the basic function in R

I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are put into this data frame:
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
There is a vector with variables for later usage:
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (these do not include the dependent variables):
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those where both possible values of the same variable appear):
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepared the first combination for later subsetting:
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (which contains all the DVs):
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the data frame subset by the first combination of 4 variables.
Then I used the evaluation function on it:
evaluation <- function(x){
  s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
  s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
  s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
  s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
  s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
  s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
  s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
  s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
  s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
  s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
  n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
  n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
  n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
  n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
  n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
  n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
  n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
  n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
  n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
  n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
  pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
  pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
  pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
  pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
  pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
  pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
  pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
  pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
  pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
  pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
  mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
  df<-as.data.frame(mat)
  return(df)
}
result<-evaluation(kr)
result
And I need help with several things.
1. In the evaluation function, the way the matrix is built is wrong (s_1, n_1, pr_1 are filled down the first column, but I need the values filled in by rows).
2. I need to use some loop/lapply construct to go through all possible combinations (not only the first one, as in var<-paste(rc[,1],collapse=" & ")), and to get understandable output in which the evaluation function is applied to every combination and I can see which combination of variables each evaluation belongs to, so I can compare the evaluation results across combinations.
3. This is not the main point, but I generally want to evaluate all possible combinations (i.e. for 2:n variables, and all combinations of each size) and then find the best combination according to a specific DV (Profit_L_1, Profit_L_2, and so on). I am weak at looping right now, so, if possible, keep in mind what I am going to do with it later.
Thanks. Feel free to update, repair or improve the question (if there is something that could be done more easily or effectively, do it); I am open to any sensible advice.
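For reference, a minimal sketch (not from the original thread) of the two mechanics asked about: filling a matrix by rows, and applying evaluation() to every combination in rc. The loop part is left commented because it assumes frame, rc and evaluation() from above:
matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # byrow = TRUE fills the matrix row by row
# krs <- lapply(seq_len(ncol(rc)), function(i) {
#   var <- paste(rc[, i], collapse = " & ")
#   eval(parse(text = paste0("subset(frame, ", var, ")")))
# })
# names(krs) <- apply(rc, 2, paste, collapse = " & ")
# results <- lapply(krs, evaluation)   # note: the pr_* lines in evaluation()
#                                      # would need to use x rather than kr for
#                                      # per-combination results to be meaningful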

R: conditional expand.grid function

I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for the condition after using expand.grid, but in some situations the number of possible combinations is too large to generate them with expand.grid. Is there, therefore, a function that allows me to check for a condition while generating all possible combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out<-expand.grid(A,B,C,D) #out is a dataframe with 235144 x 4 as dimensions
idx<-which(rowSums(out)<=400 & rowSums(out)>=300) #Only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20], B[B<15], ...). In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
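For what it's worth, a minimal sketch of such a multilevel loop, using the A, B, C, D vectors from the question (slow, but it never builds the full grid in memory):
keep <- list()
k <- 0L
for (a in A) for (b in B) for (c in C) for (d in D) {
  s <- a + b + c + d
  if (s >= 300 && s <= 400) {
    k <- k + 1L
    keep[[k]] <- c(a, b, c, d)      # save only combinations meeting the condition
  }
}
results <- do.call(rbind, keep)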
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id=id, kern="bivnorm", unin=c("m"), unout=c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200) and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first suggestion lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which mathematical operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
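A quick way to see the precedence difference (j is just a hypothetical loop index here):
j <- 3
j:j + 1      # evaluates as (j:j) + 1, i.e. the single number 4
j:(j + 1)    # 3 4, the two column indices the loop needs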
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern="bivnorm")
  kernAr <- kernel.area(kud, unin=c("m"), unout=c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

How to perform basic Multiple Sequence Alignments in R?

(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match against. All that I know is that the "true" pattern should be of length 30, and that the sequences I have had missing values introduced into them at random points.
Here is an example of such sequences, where on the left we see the real location of the missing values, and on the right we see the sequence that we will be able to observe.
My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same)
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A","T","C","G")
set.seed(40)
original.seq <- sample(ATCG, 30, T)
seqS <- matrix(original.seq,200,30, T)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
  number.of.changes <- sample(seq_len(number.of.changes), 1)
  new.letters <- sample(letters.to.change.with, number.of.changes, T)
  where.to.change.the.letters <- sample(seq_along(x), number.of.changes, F)
  x[where.to.change.the.letters] <- new.letters
  return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
require(stringr)
# library(help=stringr)
all.seqS <- str_replace_all(seqS4, "-", "")   # drop all gaps to get the sequences we actually observe
# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware that you have to install the MUSCLE program itself first.
Alternatively, you can use any of the available programs (e.g. ClustalW, MAFFT, T-COFFEE) and import the resulting multiple sequence alignments into R using Bioconductor functionality. See e.g. here.
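A minimal sketch of that import step, assuming a hypothetical file alignment.fasta produced by one of those external aligners:
library(Biostrings)
aln <- readDNAMultipleAlignment("alignment.fasta", format = "fasta")
aln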
Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
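For completeness, a minimal sketch of that route, reusing all.seqS from the question (msa and Biostrings are Bioconductor packages; the method names below are the ones documented by msa):
library(msa)
dna <- Biostrings::DNAStringSet(all.seqS)
aln <- msa(dna, method = "ClustalOmega")   # also accepts "ClustalW" or "Muscle"
print(aln)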
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking?
First learn what global alignment is, then look for multiple sequence alignment.
Wikipedia doesn't give a lot of details about algorithms, but this paper is better.
