I am currently in a statistics class working on multivariate clustering and classification. For our homework we are trying to use 10-fold cross-validation to test how accurate different classification methods are on a six-variable data set with three classes. I was hoping to get some help creating a for loop (or something better that I don't know about) to build and run the 10 classifications and validations, so I don't have to repeat everything 10 times. Here is what I have. It runs, but the first two matrices only show the first variable. Because of this, I have not been able to troubleshoot the other parts.
index<-sample(1:10,90,rep=TRUE)
table(index)
training=NULL
leave=NULL
Trfootball=NULL
football.pred=NULL
for(i in 1:10){
training[i]<-football[index!=i,]
leave[i]<-football[index==i,]
Trfootball[i]<-rpart(V1~., data=training[i], method="class")
football.pred[i]<- predict(Trfootball[i], leave[i], type="class")
table(Actual=leave[i]$"V1", classfied=football.pred[i])}
Removing the "[i]" subscripts and substituting each value from 1 to 10 by hand works right now.
Your problem lies in assigning a data.frame or matrix to an element of an object that you initially set to NULL (training and leave). One way to think about it: you are trying to squeeze a whole matrix into an element that can only hold a single value. R keeps only the first column and warns about the rest, which matches the symptom you describe. You need to initialise training and leave as something that can handle your iterative accumulation of values (an R list, as @akrun points out).
The following example should give you a feel for what is happening and what you can do to fix your problem:
a<-NULL # your set up at the moment
print(a) # NULL as expected
# your football data is either data.frame or matrix
# try assigning those objects to the first element of a:
a[1]<-data.frame(1:10,11:20) # no good
a[1]<-matrix(1:10,nrow=2) # no good either
print(a)
## create "a" upfront, instead of an empty object
# what you need:
a<-vector(mode="list",length=10)
print(a) # empty list with 10 locations
## to assign and extract elements out of a list, use the "[[" double brackets
a[[1]]<-data.frame(1:10,11:20)
#access data.frame in "a"
a[1] ## no good
a[[1]] ## what you need to extract the first element of the list
## how does it look when you add an extra element?
a[[2]]<-matrix(1:10,nrow=2)
print(a)
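Applied to the cross-validation loop, the same idea might look like the sketch below. The football data is not shown in the question, so a randomly generated stand-in with the same shape (90 rows, a three-level V1, six predictors) is assumed here, and the fold labels are balanced rather than sampled with replacement. Only the per-fold confusion tables are stored in a list; the training and held-out subsets are ordinary variables that get overwritten on each pass.

```r
library(rpart)

set.seed(1)
# ASSUMPTION: hypothetical stand-in for the football data
# (90 rows, class label V1 with three levels, six numeric predictors)
football <- data.frame(V1 = factor(rep(1:3, each = 30)),
                       matrix(rnorm(90 * 6), nrow = 90))

# assign each row to one of 10 folds (9 rows per fold)
index <- sample(rep(1:10, length.out = nrow(football)))

# pre-allocate a list with one slot per fold
confusion <- vector(mode = "list", length = 10)

for (i in 1:10) {
  training <- football[index != i, ]   # nine folds for fitting
  leave    <- football[index == i, ]   # one fold held out
  fit  <- rpart(V1 ~ ., data = training, method = "class")
  pred <- predict(fit, leave, type = "class")
  # store the whole table with "[[", not "["
  confusion[[i]] <- table(Actual = leave$V1, Classified = pred)
}

confusion[[1]]   # confusion table for the first fold
```

Summing the tables with Reduce(`+`, confusion) would give one overall confusion table, since every fold's table has the same three-by-three layout.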
I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes, and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes, etc.).
I found this example of code, on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1- what does this mean? I don't know this multiple colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 variables passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to hyperg.test?
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# created named list, length 449, eg:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# chose one of these for demonstration. the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment in some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <-
function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
white.balls.in.urn <- length(pathway.genes)
total.balls.in.urn <- length(all.genes)
black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
balls.pulled.from.urn <- length(significant.genes)
hyperg(white.balls.in.urn, black.balls.in.urn,
balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)
You are most likely getting your error because the Category package from Bioconductor is not installed. I suspect this because of the triple-colon operator :::, which is closely related to the double-colon operator ::. Whereas :: accesses exported objects from a package without attaching it, ::: also reaches non-exported (internal) objects, in this case the .doHyperGInternal function from Category, which the code stores under the name hyperg. Once the Category package is installed, the code runs without error.
With regard to the sapply statement:
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into separate parts to understand it. First, sapply iterates over the elements of genes.by.pathway and passes each one as the first argument of hyperg.test. The arguments that follow (significant.genes and all.geneIDs) are handed on unchanged to hyperg.test on every call, matched by position. This is a little unclear, so I recommend naming the arguments explicitly; that removes the dependence on their exact order. It is slightly repetitive in this case, but it is a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once the iteration completes, sapply simplifies the results into a matrix with one column per pathway. Taking the transpose with t() puts one pathway per row, which is much easier to read.
Generally speaking, when trying to understand complex apply statements I find it best to break them apart into smaller parts and look at what the objects themselves look like.
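To see the argument passing in isolation, here is a small sketch that needs no Bioconductor packages. The gene IDs are toy values, and overlap.test is a hypothetical stand-in for hyperg.test that uses base R's phyper() for the over-representation p-value instead of Category's internal function, so its output format is not the same as the original's.

```r
# ASSUMPTION: overlap.test is an illustrative stand-in for hyperg.test
overlap.test <- function(pathway.genes, significant.genes, all.genes) {
  white.balls.drawn     <- length(intersect(significant.genes, pathway.genes))
  white.balls.in.urn    <- length(pathway.genes)
  black.balls.in.urn    <- length(all.genes) - white.balls.in.urn
  balls.pulled.from.urn <- length(significant.genes)
  # P(X >= white.balls.drawn) under the hypergeometric distribution
  p <- phyper(white.balls.drawn - 1, white.balls.in.urn, black.balls.in.urn,
              balls.pulled.from.urn, lower.tail = FALSE)
  c(overlap = white.balls.drawn, p.value = p)
}

# toy inputs (hypothetical gene IDs)
genes.by.pathway  <- list(p1 = c("a", "b", "c"), p2 = c("c", "d"))
significant.genes <- c("b", "c")
all.geneIDs       <- letters[1:10]

# one call, as sapply makes it for the first list element
single <- overlap.test(genes.by.pathway[[1]],
                       significant.genes = significant.genes,
                       all.genes = all.geneIDs)

# all pathways at once; the named arguments are passed on to every call
raw <- sapply(genes.by.pathway, overlap.test,
              significant.genes = significant.genes,
              all.genes = all.geneIDs)
res <- t(raw)   # one row per pathway after transposing

print(single)
print(res)
```

Comparing single with the first row of res shows that sapply does nothing more than repeat that one call for each list element.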
I have a dataset, which is broken into 20 groups. The matrices storing the data for each group (2 columns of data), are stored in a list, so that I can perform functions on each set within a loop. I would like to store the output of any function that I might run in another matrix.
For example, if I run a fitdistr() on all 20 groups, I would like the output of the function stored in a matrix, so that I can call distribution[1] to call the results from group 1. I have tried the following:
distribution<-ls()
for(i in (1:20))
{ distribution[[i]]<-fitdistr(as.numeric(data[[i]]$Column2,"normal") }
This successfully stores the outputs, and I can call:
distribution[1]
The issue is that the fitdistr() results in 2 columns of data - a mean and a standard deviation. I checked that I cannot call the mean for a given point:
names(distribution)
"NULL"
So I obviously cannot get the means, say by:
distribution[1]$mean
I will be looking for trends in the means and standard deviations (and other parameters for other distributions), so I would like to have the results of fitdistr() stored in a matrix somehow, if at all possible. Even if I could only extract, say, the mean when running the function, I could create an empty vector and populate it in a loop, then repeat for the standard deviation.
I have considered creating an empty matrix large enough to store the data (so it would be 20 rows, 1 for each group, and 2 columns, 1 for each calculated value). I'm still not sure how I would dictate that I want the calculated mean stored in column 1 and the calculated standard deviation stored in column 2. Again, it is an issue of asking the function for only one of its multiple outputs at a time.
I've also looked into one of the apply functions, but these do not seem to be appropriate for what I am doing.
ls() is a function that lists the objects in a given environment. It returns a character vector.
You (probably) mean to have list().
But then you would be growing your list within a loop, which is the second circle of R hell.
Instead use lapply with the appropriate function (it is hard to tell where you want the as.numeric to go, but it is not correct in your example: the "normal" argument ends up inside as.numeric() and a closing parenthesis is missing).
Something like:
distribution <- lapply(data, function(x) fitdistr(as.numeric(x[['Column2']]),"normal"))
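To get from that list to the matrix asked for in the question, one option is to pull the estimate component out of each fit. The sketch below assumes made-up data (20 groups, each a data frame with a Column2) in place of the real data set; fitdistr() comes from the MASS package, and for a normal fit its estimate component is a named vector with entries mean and sd.

```r
library(MASS)

set.seed(42)
# ASSUMPTION: hypothetical stand-in for the 20 groups in the question
data <- lapply(1:20, function(i) {
  data.frame(Column1 = seq_len(50),
             Column2 = rnorm(50, mean = i, sd = 2))
})

distribution <- lapply(data, function(x) {
  fitdistr(as.numeric(x[["Column2"]]), "normal")
})

# one fit: use "[[" to reach the fitted object, then its estimates
distribution[[1]]$estimate["mean"]

# all fits: one row per group, columns "mean" and "sd"
estimates <- t(sapply(distribution, function(fit) fit$estimate))
head(estimates)
```

The same pattern with fit$sd in place of fit$estimate would collect the standard errors of the estimates instead.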
I currently have a loop - well actually a loop in loop, in a simulation model which gets slow with larger numbers of individuals. I've vectorised most of it and made it a heck of a lot faster. But there's a part where I assign multiple elements of a list as the same thing, simplifying a big loop to just the task I want to achieve:
new.matrices[[length(new.matrices)+1]]<-old.matrix
With each iteration of the loop the line above is called, and the same matrix object is assigned to the next new element of a list.
I'm trying to vectorize this - if possible, or make it faster than a loop or apply statement.
So far I've tried stuff along the lines of:
indices <- seq(from = length(new.matrices) + 1, to = length(new.matrices) + reps)
new.matrices[indices] <- old.matrix
However this results in the message:
Warning message:
In new.effectors[effectorlength] <- matrix :
number of items to replace is not a multiple of replacement length
It also tries to assign one value of the old.matrix to one element of new.matrices like so:
[[1]]
[1] 8687
[[2]]
[1] 1
[[3]]
[1] 5486
[[4]]
[1] 0
When the desired result is one list element = one whole matrix, a copy of old.matrix
Is there a way I can vectorize sticking a matrix in list elements without looping? With loops how it is currently implemented we are talking many thousands of repetitions which slows things down considerably, hence my desire to vectorize this if possible.
You have probably solved your problem already, but for the record, the issue in your code
new.matrices[indices] <- old.matrix
was caused by assigning a matrix, which is not a list, to several list elements at once. R treats old.matrix as a plain vector and puts one value into each list element (that is why you got this result; and when the number of indices is not a multiple of the number of values in old.matrix, you also get the warning). Doing
new.matrices[indices] <- list(old.matrix)
will work: R recycles the single-element list list(old.matrix) across all reps positions automatically.
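A small self-contained check of this behaviour, with made-up values for old.matrix, reps and the starting list:

```r
# ASSUMPTION: toy values standing in for the simulation's objects
old.matrix   <- matrix(1:6, nrow = 2)
new.matrices <- list(matrix(0, nrow = 2, ncol = 2))  # list already holds one matrix
reps         <- 3

indices <- seq(from = length(new.matrices) + 1,
               to   = length(new.matrices) + reps)

# wrap the matrix in a list so each position receives the whole matrix
new.matrices[indices] <- list(old.matrix)

length(new.matrices)   # one original element plus reps copies
new.matrices[[4]]      # a full copy of old.matrix
```

An equivalent way to write the append is c(new.matrices, rep(list(old.matrix), reps)), which avoids computing the indices by hand.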
I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with the same length as rich.seq had when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is therefore always a logical matrix with 31 elements. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
reduces the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are subsetting rich.seq with a logical vector that is longer than rich.seq itself. R does not raise an error here: every TRUE beyond the end of rich.seq returns NA, and the FALSE positions now line up with different values than before. That is the unexpected output you see.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
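The effect of a logical index that is longer than the vector being subset can be seen in a minimal example (toy values, not the original data):

```r
x    <- c(10L, 20L, 30L)
keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # two positions more than x has

result <- x[keep]
print(result)   # the TRUE at position 4 has no matching element, so it yields NA
```

No error or warning is raised, which is why the problem is easy to miss inside a loop.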
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.
I'm trying to run certain statistics, such as t-tests, on a table of data containing hundreds to thousands of columns. The data is formatted so that the two groups of values I'm comparing are in the same column.
So, basically, my first attempt was to cut and paste like the following:
NN <-read.delim("E:/output.txt")
View(NN)
attach(NN)
#output p-values of 100 t-tests
sink(file="E:/ttest.txt", append=TRUE, split=FALSE)
t.test(Tree1[1:13],Tree1[14:34])$p.value
t.test(Tree2[1:13],Tree2[14:34])$p.value
t.test(Tree3[1:13],Tree3[14:34])$p.value
....
...
..
.
As my data grows, this is becoming more and more impractical. Is there a way to loop these t-tests through each column sequentially and save the output to a file?
Thanks in advance.
lapply will get you there I think with an anonymous function:
> test <- data.frame(a=1:100,b=101:200)
> lapply(test,function(x) t.test(x[1:50],x[51:100])$p.value)
$a
[1] 2.876776e-31
$b
[1] 2.876776e-31
I should do my part for good practice and also note that running 100 t-tests in a single go is fraught with the potential for type-1 errors and other badness.
Extracting the p-value in isolation is also probably a really bad move.
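As a sketch of how the lapply idea could be combined with writing the results to a file: the code below uses randomly generated data in place of output.txt, the row split 1:13 versus 14:34 from the question, and a temporary file instead of E:/ttest.txt. The Holm adjustment via p.adjust() is an addition that is not part of the question, included only as one common way to address the multiple-testing concern.

```r
set.seed(7)
# ASSUMPTION: hypothetical stand-in for the table read from output.txt
NN <- as.data.frame(matrix(rnorm(34 * 5), nrow = 34,
                           dimnames = list(NULL, paste0("Tree", 1:5))))

# one p-value per column; sapply returns a named numeric vector
p.values <- sapply(NN, function(x) t.test(x[1:13], x[14:34])$p.value)

# adjust for multiple comparisons (Holm's method)
adjusted <- p.adjust(p.values, method = "holm")

results <- data.frame(column  = names(p.values),
                      p.value = p.values,
                      p.holm  = adjusted)

# write a tab-separated table; replace tempfile() with a real path as needed
out.file <- tempfile(fileext = ".txt")
write.table(results, file = out.file, sep = "\t",
            row.names = FALSE, quote = FALSE)
```

Using write.table() instead of sink() keeps the column names next to their p-values, so the file can be read back with read.delim().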
Not sure if this is a wise approach or if it even works correctly but try mapply with the indexed parts as in:
test <- data.frame(a=1:100,b=101:200)
testa <- test[1:50, ]
testb <- test[51:100, ]
t.test2 <- function(x, y) t.test(x, y)[["p.value"]]
mapply(t.test2, testa, testb)
EDIT: I used thelatemail's data so it's comparable. His warning is right on.
Thanks for all the input. Just a few clarifications: while I AM running hundreds of t-tests at once, they compare independent sets of data each time. So, for example, the values in column 1 (Tree1), rows 1:50, would only be compared once to rows 51:100 in the same column, and never used again. The same goes for column 2 (Tree2), and so on. Would type-1 error still be a problem? The way I see it, I'm basically doing t-tests on separate data sets one at a time.
That being said, I've come up with a way to do this with a for-loop, and the results correspond to those when t-testing each column individually.
for (i in 1:100) {
print(t.test(mydata[1:50, i], mydata[51:100, i])$p.value)
}
The only problem being that my output always has a [1] in front of it.
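The [1] is the index prefix that print() puts in front of every vector it displays. One way around it, sketched here with randomly generated data standing in for mydata, is to collect the p-values in a pre-allocated vector and write them out with cat(), which prints bare values.

```r
set.seed(11)
# ASSUMPTION: hypothetical stand-in for mydata (100 rows, 4 columns)
mydata <- as.data.frame(matrix(rnorm(100 * 4), nrow = 100))

# pre-allocate one slot per column
p.values <- numeric(ncol(mydata))
for (i in seq_len(ncol(mydata))) {
  p.values[i] <- t.test(mydata[1:50, i], mydata[51:100, i])$p.value
}

# cat() writes the values without the "[1]" prefix, one per line
lines.out <- format(p.values, digits = 4)
cat(lines.out, sep = "\n")
```

Passing file = "some/path.txt" to cat() would send the same lines to a file instead of the console.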