All of the below is conducted in R.
I am trying to store the results of a for-loop in different containers, but I keep getting NA warnings and the results are not stored in my container. I even tried separate containers for the different for-loops within the function, and finally a matrix to hold those containers, but that does not work either.
I have been trying different solutions for two full days, and it seems there should be an easy fix. Maybe I just can't see it myself anymore...
data.ols<-data.frame(cbind(rep(1),holiday,weathersit,atemp,hum,windspeed))
y<-as.vector(cnt)
z=c(holiday, weathersit, atemp, hum, windspeed)
z.names=c("holiday","weathersit","atemp","hum","windspeed")
result.container<-data.frame(matrix(nrow=6,ncol=4))
colnames(result.container)<-c("beta","SE","t-statistic","p-value")
ols<-function(y,X2,x=0){
X<-matrix(z, ncol=5)
X2<-cbind(rep(1, nrow(X)), X)
XXinv <- solve(t(X2) %*% X2, diag(ncol(X2))) # Compute (X'X)^-1
beta<-XXinv%*%t(X2)%*%y
print(beta)
result.container[,1]<-beta
result.testdebug<-vector()
for (i in c("V1","holiday","weathersit","atemp","hum","windspeed")){
SE<-sd(i)
result.testdebug[i]<-sd(data.ols[,i])
return(result.testdebug)
result.container[,2]<-result.testdebug}
result.testtvalue<-vector()
for (i in c("V1","holiday","weathersit","atemp","hum","windspeed")){
nominator<-(mean(i)-x)
t.value <- nominator/sd(i)
return(t.value)
result.testtvalue<-t.value
result.container[,3]<-result.testtvalue}
df <- length(X)-1
p.value <- 2*pt(t.value, df, lower.tail=FALSE)
return(p.value)
result.container[,4]<-p.value
list(rbind(beta,result.testdebug,t.value,p.value))}
It seems you are having some trouble with functions in R. In R, each function call has its own environment (i.e. its own set of objects). A function can read from its parent environment, but an ordinary assignment inside the function (<- or =) only creates or changes a local copy and never touches the parent's object. Let me demonstrate that with simpler code.
teste2=matrix(,2,2)
teste=function(a,b) {teste2[,1]=c(a,b)}
teste(3,2)
teste2
[,1] [,2]
[1,] NA NA
[2,] NA NA
As you can see teste (the function) cannot change teste2 (the matrix).
In R, the best way to write a function is to pass it all the objects it needs as parameters and to end the function body with a single return() that gives back the final object.
You did something close to that, but used multiple return() calls. A function exits at the first return() it reaches, so everything after it is never run. See below:
teste=function(a,b) {c=a;return(c);d=b;return(d)}
teste(3,2)
[1] 3
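Applied to the teste2 example above, the advice could look like the following minimal sketch. The function name fill_first_column is made up for illustration: the matrix is passed in as a parameter, changed locally, returned once, and the caller assigns the result back.

```r
# Start with an empty 2x2 matrix, as in the example above.
teste2 <- matrix(NA_real_, 2, 2)

# Hypothetical helper: takes the matrix as an argument, modifies its
# local copy, and returns that copy with a single return().
fill_first_column <- function(m, a, b) {
  m[, 1] <- c(a, b)
  return(m)
}

# The caller decides where the result is stored.
teste2 <- fill_first_column(teste2, 3, 2)
teste2
```

The first column of teste2 now holds 3 and 2, while the second column is still NA.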
For your particular code, I recommend removing all the result.container<- assignments and putting a single return() at the end, around that last list().
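As a rough sketch of that shape, here is one way such a function might look. Note the assumptions: it uses the textbook OLS formulas for the standard errors, t-statistics and p-values rather than the sd() and mean() calls from the question, and the data (x1, x2, y) are simulated purely for illustration.

```r
# Sketch: OLS by hand. Everything the function needs comes in as an
# argument, and one data frame is returned at the end.
ols <- function(y, X) {
  X <- cbind(Intercept = 1, as.matrix(X))    # add the intercept column
  XXinv <- solve(crossprod(X))               # (X'X)^-1
  beta <- drop(XXinv %*% crossprod(X, y))    # coefficient estimates
  resid <- y - drop(X %*% beta)              # residuals
  df <- nrow(X) - ncol(X)                    # residual degrees of freedom
  sigma2 <- sum(resid^2) / df                # residual variance estimate
  se <- sqrt(diag(XXinv) * sigma2)           # standard errors
  t.stat <- beta / se
  p.value <- 2 * pt(abs(t.stat), df, lower.tail = FALSE)
  return(data.frame(beta = beta, SE = se,
                    t.statistic = t.stat, p.value = p.value))
}

# Simulated data (an assumption; the question's data set is not available here).
set.seed(1)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
y <- 1 + 2 * X$x1 - X$x2 + rnorm(50)
result <- ols(y, X)
result
```

The returned data frame can be checked against summary(lm(y ~ x1 + x2, data = X)), which should give the same estimates and standard errors.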
I am trying to set up a function that will run lm() on a model derived from a user defined matrix in R.
The modelMatrix would be set up by the user, but would be expected to be structured as follows:
source target
1 "PVC" "AA"
2 "Aro" "AA"
3 "PVC" "Aro"
This matrix serves to allow the user to define the dependent(target) and independent(source) variables in the lm and refers to the column names of another valuesMatrix:
PVC Aro AA
[1,] -2.677875504 0.76141471 0.006114699
[2,] 0.330537781 -0.18462039 -0.265710261
[3,] 0.609826160 -0.62470233 0.715474554
I need to take this modelMatrix and generate a relevant number of lms. Eg in this case:
lm(AA ~ PVC + Aro)
and
lm(Aro ~ PVC)
I tried this, but it would seem that as the user changes the model matrix, it becomes erroneous and I need to explicitly specify each independent variable according to the modelMatrix.
```
lm(as.formula(paste(unique(modelMatrix[,"target"])[1], "~ .",sep="")),
data=data.frame(valuesMatrix))
```
Do I need to set up two loops (one nested) to fetch the source and target strings and paste them into the formula, or am I overlooking some detail? I am very confused.
Ideally I would like the user to be able to change the modelMatrix to include 1 or many lms and one or many independent variables for each lm. I would really appreciate your assistance as I am really hitting a wall here.
Thanks.
For your specific example this code should work -
source <- c("PVC","Aro","PVC")
target <- c("AA","AA","Aro")
modelMatrix <- data.frame(source = source, target = target)
valuesMatrix <- as.matrix(rbind(c(-2.677875504,0.76141471,0.006114699), c(0.330537781,-0.18462039,-0.265710261),
c(0.609826160,-0.62470233,0.715474554)))
colnames(valuesMatrix) <- c("PVC","Aro","AA")
unique.target <- as.character(unique(modelMatrix$target))
lm.models <- lapply(unique.target, function(x) {lm(as.formula(paste(x,"~ .", sep = "")),
data = data.frame(valuesMatrix[,colnames(valuesMatrix) %in%
c(x,as.character(modelMatrix$source[which(modelMatrix$target==x)]))]))})
You could do the same with a for loop, but for loops can be too expensive, especially if your modelMatrix gets really large. It might not look as appealing, but the lapply function is well suited to this type of job. The only other trick was to keep just the columns required for each lm.
You can also pull out the results of each lm using:
lm.models[[1]] and lm.models[[2]]
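If you would rather name each independent variable in the formula instead of relying on "~ .", one possible sketch uses split() together with reformulate() from the stats package. The valuesDF data below are simulated for illustration only (with just three rows, as in the question, the first model would be saturated).

```r
# Model specification as in the question.
modelMatrix <- data.frame(source = c("PVC", "Aro", "PVC"),
                          target = c("AA", "AA", "Aro"),
                          stringsAsFactors = FALSE)

# Group the source variables by their target: one entry per lm.
sources.by.target <- split(modelMatrix$source, modelMatrix$target)

# Build one explicit formula per target, e.g. AA ~ PVC + Aro.
formulas <- Map(function(src, tg) reformulate(src, response = tg),
                sources.by.target, names(sources.by.target))

# Simulated values (an assumption, used only so the models can be fitted).
set.seed(1)
valuesDF <- data.frame(PVC = rnorm(20), Aro = rnorm(20), AA = rnorm(20))

# Fit every model; the list is named by target.
fits <- lapply(formulas, function(f) lm(f, data = valuesDF))
names(fits)
```

Because the list is named by target, an individual model can be pulled out with fits$AA or fits[["Aro"]].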
I am currently in a statistics class working on multivariate clustering and classification. For our homework we are trying to use 10-fold cross-validation to test how accurate different classification methods are on a 6-variable data set with three classes. I was hoping I could get some help creating a for loop (or something better that I don't know about) to create and run the 10 classifications and validations so I don't have to repeat myself 10 times on everything. Here is what I have. It runs, but the first two matrices only show the first variable. Because of this, I have not been able to troubleshoot the other parts.
index<-sample(1:10,90,rep=TRUE)
table(index)
training=NULL
leave=NULL
Trfootball=NULL
football.pred=NULL
for(i in 1:10){
training[i]<-football[index!=i,]
leave[i]<-football[index==i,]
Trfootball[i]<-rpart(V1~., data=training[i], method="class")
football.pred[i]<- predict(Trfootball[i], leave[i], type="class")
table(Actual=leave[i]$"V1", classfied=football.pred[i])}
Removing the "[i]" and replacing them with 1:10 individually works right now....
Your problem lies in assigning a data.frame or matrix to a single element of a vector that you initially set to NULL (training and leave). One way to think about it: you are trying to squeeze a whole matrix into an element that can only hold a single number. That's why R has a problem with your code. You need to initialise training and leave as something that can hold the objects you accumulate on each iteration (an R list, as @akrun points out).
The following example should give you a feel for what is happening and what you can do to fix your problem:
a<-NULL # your set up at the moment
print(a) # NULL as expected
# your football data is either data.frame or matrix
# try assigning those objects to the first element of a:
a[1]<-data.frame(1:10,11:20) # no good
a[1]<-matrix(1:10,nrow=2) # no good either
print(a)
## create "a" upfront, instead of an empty object
# what you need:
a<-vector(mode="list",length=10)
print(a) # empty list with 10 locations
## to assign and extract elements out of a list, use the "[[" double brackets
a[[1]]<-data.frame(1:10,11:20)
#access data.frame in "a"
a[1] ## no good
a[[1]] ## what you need to extract the first element of the list
## how does it look when you add an extra element?
a[[2]]<-matrix(1:10,nrow=2)
print(a)
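Putting this together for the cross-validation loop, a sketch could look like the following. It assumes the rpart package is available and uses the built-in iris data as a stand-in for the football data, with Species playing the role of V1; adapt the names to your own data.

```r
library(rpart)  # assumed to be installed (it ships with standard R distributions)

set.seed(42)
dat <- iris                                          # stand-in for the football data
index <- sample(rep(1:10, length.out = nrow(dat)))   # fold label (1..10) for each row

# Lists created up front, one slot per fold.
fits   <- vector(mode = "list", length = 10)
preds  <- vector(mode = "list", length = 10)
tables <- vector(mode = "list", length = 10)

for (i in 1:10) {
  training <- dat[index != i, ]   # nine folds for fitting
  leave    <- dat[index == i, ]   # the held-out fold
  fits[[i]]   <- rpart(Species ~ ., data = training, method = "class")
  preds[[i]]  <- predict(fits[[i]], leave, type = "class")
  tables[[i]] <- table(Actual = leave$Species, Classified = preds[[i]])
}

# Share of correctly classified rows in each held-out fold.
accuracy <- sapply(tables, function(tb) sum(diag(tb)) / sum(tb))
```

Only the objects that need to survive the loop (models, predictions, tables) go into lists with [[i]]; the training and held-out subsets are temporary and can simply be overwritten on each iteration.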
I would like to perform pathway enrichment analyses.
I have 21 lists of significant genes, and multiple types of pathways I would like to check (i.e. check for enrichment in KEGG pathways, GO terms, complexes, etc.).
I found this example of code, on an old BioC post. However, I am having trouble adapting it for myself.
Firstly,
1 - What does this mean? I don't know this multiple-colon syntax.
hyperg <- Category:::.doHyperGInternal
2 - I don't understand how this line works. hyperg.test is a function that needs 3 variables passed to it, correct? Is this line somehow passing genes.by.pathway, significant.genes, and all.geneIDs to hyperg.test?
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
Code that I would like to adapt
library(KEGGREST)
library(org.Hs.eg.db)
# created named list, length 449, eg:
# path:hsa00010: "Glycolysis / Gluconeogenesis"
pathways <- keggList("pathway", "hsa")
# make them into KEGG-style human pathway identifiers
human.pathways <- sub("path:", "", names(pathways))
# for demonstration, just use the first ten pathways
demo.pathway.ids <- head(human.pathways, 10)
demo.pathways <- setNames(keggGet(demo.pathway.ids), demo.pathway.ids)
genes.by.pathway <- lapply(demo.pathways, function(demo.pathway) {
demo.pathway$GENE[c(TRUE, FALSE)]
})
all.geneIDs <- keys(org.Hs.eg.db)
# chose one of these for demonstration. the first (a whole genome random
# set of 100 genes) has very little enrichment, the second, a random set
# from the pathways themselves, has very good enrichment in some pathways
set.seed(123)
significant.genes <- sample(all.geneIDs, size=100)
#significant.genes <- sample(unique(unlist(genes.by.pathway)), size=10)
# the hypergeometric distribution is traditionally explained in terms of
# drawing a sample of balls from an urn containing black and white balls.
# to keep the arguments straight (in my mind at least), I use these terms
# here also
hyperg <- Category:::.doHyperGInternal
hyperg.test <-
function(pathway.genes, significant.genes, all.genes, over=TRUE)
{
white.balls.drawn <- length(intersect(significant.genes, pathway.genes))
white.balls.in.urn <- length(pathway.genes)
total.balls.in.urn <- length(all.genes)
black.balls.in.urn <- total.balls.in.urn - white.balls.in.urn
balls.pulled.from.urn <- length(significant.genes)
hyperg(white.balls.in.urn, black.balls.in.urn,
balls.pulled.from.urn, white.balls.drawn, over)
}
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
print(pVals.by.pathway)
The reason you are getting your error is that you don't appear to have the Category package from Bioconductor installed. I suspect this because of the triple colon operator :::. This operator is very similar to the double colon operator ::. Whereas :: lets you access exported objects from a package without attaching it, ::: also gives access to non-exported objects (in this case .doHyperGInternal from Category, which the code stores as hyperg). If you install the Category package the code runs without error.
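To see the difference between :: and ::: without installing anything, here is a small illustration using the stats package that ships with R. The choice of t.test.default as the non-exported object is only an example.

```r
# t.test is exported by stats; the method t.test.default is not.
exported <- "t.test" %in% getNamespaceExports("stats")
internal <- "t.test.default" %in% getNamespaceExports("stats")

# ::: reaches the non-exported function...
f <- stats:::t.test.default

# ...whereas :: refuses with an error, caught here for demonstration.
res <- tryCatch(stats::t.test.default,
                error = function(e) "not exported")
```

Here exported is TRUE, internal is FALSE, f is a function, and res holds the string "not exported".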
With regard to the sapply statement:
pVals.by.pathway<-t(sapply(genes.by.pathway, hyperg.test, significant.genes, all.geneIDs))
You can break this down into separate parts to understand it. Firstly, sapply iterates over the elements of genes.by.pathway and passes each one as the first argument of hyperg.test. The arguments that follow (significant.genes and all.geneIDs) are passed on unchanged as additional parameters on every call. This is a little unclear, so I personally recommend naming the parameters explicitly; that avoids unexpected surprises and removes the need to keep them in exactly the right order. It is a little repetitive in this case, but a good way to avoid a silly bug (e.g. putting significant.genes after all.geneIDs).
Rewritten:
pVals.by.pathway <-
t(sapply(genes.by.pathway, hyperg.test, significant.genes=significant.genes, all.genes=all.geneIDs))
Once this loop completes, sapply simplifies the output into a matrix. The output is more readable after taking the transpose with t().
Generally speaking, when trying to understand complex apply statements I find it best to break them apart into smaller parts and see what the objects themselves look like.
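As one way of breaking it apart, the following self-contained sketch reproduces the same sapply pattern on toy data. Two assumptions to flag: it swaps Category's internal function for base R's phyper() and so returns only an over-representation p-value, and the gene IDs and pathway names are invented.

```r
# Stand-in test: P(overlap >= observed) from the hypergeometric distribution.
# Assumes every pathway gene is contained in all.genes.
hyperg.test <- function(pathway.genes, significant.genes, all.genes) {
  white.balls.drawn     <- length(intersect(significant.genes, pathway.genes))
  white.balls.in.urn    <- length(pathway.genes)
  black.balls.in.urn    <- length(all.genes) - white.balls.in.urn
  balls.pulled.from.urn <- length(significant.genes)
  phyper(white.balls.drawn - 1, white.balls.in.urn, black.balls.in.urn,
         balls.pulled.from.urn, lower.tail = FALSE)
}

# Toy data (invented): 1000 genes, two pathways, 50 significant genes.
all.genes         <- as.character(1:1000)
genes.by.pathway  <- list(pathA = as.character(1:50),
                          pathB = as.character(501:520))
significant.genes <- as.character(c(1:20, 900:929))

# One call per list element; the named arguments are the same on every call.
p.vals <- sapply(genes.by.pathway, hyperg.test,
                 significant.genes = significant.genes,
                 all.genes = all.genes)

# The same thing written out for a single pathway:
p.pathA <- hyperg.test(genes.by.pathway[["pathA"]],
                       significant.genes = significant.genes,
                       all.genes = all.genes)
```

Because this stand-in returns a single number per pathway, sapply gives a named vector here and no transpose is needed.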
An R novice is once again seeking help.
General situation: I am currently creating a script, and I have several data frames per experiment.
The experiments vary in the time steps of the measurements and in the number of reactors, so I need
two-dimensional flexibility in my script to "massage" the data into the right shape for the desired tests, and to draw the necessary data from multiple data frames.
Unfortunately I chose to use for loops to account for this, which I now see is bad practice in R,
but I have gotten too far to change direction now.
The problem: I want one-column matrices to be named after their own object names, inside a for loop. I need them to stay in matrix format because of further functions I want to apply.
# Simple but non- flexible examples of what I want to do:
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
#names header of the objects name
colnames(a1) <- "a1"
colnames(a2) <- "a2"
This works, but I need it to work within a for loop...
# here are the two flexible but non- working approaches of mine
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(colnames(paste("a",i,sep="",collapse="")),do.call("c",list(paste("a",i,sep=""))))
}
This isn’t the proper use of assign and creates an error.
The second attempt doesn't create an error but doesn't work either; it creates empty objects.
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(paste("a",i,sep="", colapse=""),do.call("colnames",list(paste("a",i,sep="", colapse=""))))
}
My conclusion: I do not understand the proper way of combining assign and colnames.
If anyone has a suggestion for how I could get this up and running, that would be awesome.
So far I have searched for: R combining assign and colnames inside for loop, R using assign and colnames, R naming data with for loops,...
but unfortunately I didn’t manage to adapt the solutions I found to my problem.
The following is a function that, by default, takes every object in the calling environment whose name starts with "a" followed by digits, checks that it is a one-column matrix, and if so names the column after the object.
a1 <- matrix(c(1:5))
a2 <- matrix(c(1:5))
name_cols()
a1
# a1
# [1,] 1
# [2,] 2
# ...
a2
# a2
# [1,] 1
# [2,] 2
# ...
And here is the code:
name_cols <- function(pattern="^a[0-9]+", env=parent.frame()) {
lapply(
ls(pattern=pattern, envir=env),
function(x) {
var <- get(x, envir=env)
if(is.matrix(var) && identical(ncol(var), 1L)) {
colnames(var) <- x
assign(x, var, env)
} } )
invisible(NULL)
}
Note that I chose to specify the objects by pattern, but you can easily change this to specify the variable names instead (rather than using ls, just pass the names and lapply over those), or potentially even the objects themselves (though you have to use substitute for that, and the wisdom of this becomes questionable).
More generally, if you have several related objects on which you will be performing related analysis (e.g. in this case modifying columns), you should really consider storing them in lists rather than at the top level. If you do this, then you can easily use the built in *pply functions to operate on all your objects at once. For example:
a.lst <- list(a1=matrix(1:5), a2=matrix(1:5))
a.lst <- sapply(names(a.lst), function(x) {colnames(a.lst[[x]]) <- x; a.lst[[x]]}, simplify=FALSE) # simplify=FALSE returns a list and keeps the names a1, a2
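If you do want to stay with the for loop from the question, one possible sketch combines get() and assign(): build the name as a string, fetch the object by that name, set its column name, and write it back under the same name.

```r
# Two one-column matrices, as in the question.
a1 <- matrix(c(1, 2, 3, 4, 5))
a2 <- matrix(c(1, 2, 3, 4, 5))

for (i in 1:2) {
  nm <- paste0("a", i)   # "a1", then "a2"
  m <- get(nm)           # fetch the object with that name
  colnames(m) <- nm      # name its single column after the object
  assign(nm, m)          # store the modified copy back under the same name
}

colnames(a1)
colnames(a2)
```

The key point is that colnames<- has to be applied to an actual object (m), not to a character string, which is why both attempts in the question failed.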
I often want to do essentially the following:
mat <- matrix(0,nrow=10,ncol=1)
lapply(1:10, function(i) { mat[i,] <- rnorm(1,mean=i)})
I would expect mat to have 10 random numbers in it, but instead it still contains only 0s. (I am not worried about the rnorm part. Clearly there is a right way to do that. I am worried about affecting mat from within an anonymous function of lapply.) Can I not affect matrix mat from inside lapply? Why not? Is there a scoping rule of R that is blocking this?
I discussed this issue in this related question: "Is R’s apply family more than syntactic sugar". You will notice that if you look at the function signature for for and apply, they have one critical difference: a for loop evaluates an expression, while an apply loop evaluates a function.
If you want to alter things outside the scope of an apply function, then you need to use <<- or assign. Or more to the point, use something like a for loop instead. But you really need to be careful when working with things outside of a function because it can result in unexpected behavior.
In my opinion, one of the primary reasons to use an apply function is explicitly because it doesn't alter things outside of it. This is a core concept in functional programming, wherein functions avoid having side effects. This is also a reason why the apply family of functions can be used in parallel processing (and similar functions exist in the various parallel packages such as snow).
Lastly, the right way to run your code example is to pass mat into your function as a parameter and assign the output back, like so:
mat <- matrix(0,nrow=10,ncol=1)
mat <- matrix(unlist(lapply(1:10, function(i, mat) { mat[i,] <- rnorm(1,mean=i)}, mat=mat))) # unlist() so the result is a numeric matrix, not a list
It is always best to be explicit about a parameter when possible (hence the mat=mat) rather than inferring it.
One of the main advantages of higher-order functions like lapply() or sapply() is that you don't have to initialize your "container" (matrix in this case).
As Fojtasek suggests:
as.matrix(lapply(1:10,function(i) rnorm(1,mean=i)))
Alternatively:
do.call(rbind,lapply(1:10,function(i) rnorm(1,mean=i)))
Or, simply as a numeric vector:
sapply(1:10,function(i) rnorm(1,mean=i))
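A further variant, offered only as a sketch: vapply() works like sapply() but makes you declare the type and length of each result, so the output is guaranteed to be a numeric vector that can be reshaped into the matrix afterwards.

```r
set.seed(1)

# Each call must return exactly one numeric value (numeric(1)).
vals <- vapply(1:10, function(i) rnorm(1, mean = i), numeric(1))

# Reshape into the 10 x 1 matrix from the question.
mat <- matrix(vals, ncol = 1)
dim(mat)
```

If the anonymous function ever returned something other than a single number, vapply() would stop with an error instead of silently changing the shape of the result.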
If you really want to modify a variable outside the scope of your anonymous function (the random number generator in this instance), use <<-
> mat <- matrix(0,nrow=10,ncol=1)
> invisible(lapply(1:10, function(i) { mat[i,] <<- rnorm(1,mean=i)}))
> mat
[,1]
[1,] 1.6780866
[2,] 0.8591515
[3,] 2.2693493
[4,] 2.6093988
[5,] 6.6216346
[6,] 5.3469690
[7,] 7.3558518
[8,] 8.3354715
[9,] 9.5993111
[10,] 7.7545249
See this post about <<-. But in this particular example, a for-loop would just make more sense:
mat <- matrix(0,nrow=10,ncol=1)
for( i in 1:10 ) mat[i,] <- rnorm(1,mean=i)
with the minor cost of creating an indexing variable, i, in the global workspace.
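Instead of actually altering mat, lapply leaves it untouched and returns the computed values as a list. You just need to assign that result back to mat and turn it into a matrix using as.matrix() (call unlist() on it first if you need a numeric matrix rather than a matrix of list elements).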