a small question. I'm trying to get the values for the network indicies (fl_wp_mod, fl_wp_den) stored in seperate variables after running the function. I tried this but I'm able to get it.
Any idea why?
Sorry, I'm new to lapply and R in general.
cluster_modularity = function(graph_object){
fl_wp_ig <- graph_from_incidence_matrix(graph_object)
fl_wp_cw <- cluster_walktrap(fl_wp_ig)
fl_wp_mod <- modularity(fl_wp_cw)
fl_den <- edge_density(fl_wp_ig, loops = FALSE)
return(c(fl_wp_mod, fl_den))
}
Mod = lapply(fl_wp_n, cluster_modularity[1]) #fl_wp_n is the raw data
Den = lapply(fl_wp_n, cluster_modularity[2]) #Both these lines are giving me errors
Using sapply is more convenient here.
ans <- sapply(fl_wp_n, cluster_modularity)
Mod <- ans[1, ]
Den <- ans[2, ]
I need to open files (matrices) from a directory and apply a function pca on each one. It uses another function count_pc which is thought to null diagonals in the matrix step by step and add recalculated PC1 to a the table pcs from the previous function. At the start, I didn't think of environments so count_pca was crashing with the error "unknown variable". Then I tried to do it this way:
files <- list.files()
count_pc <- function(x, env = parent.frame()) {
diag(file[x:nrow(file),]) <- 0
diag(file[,x:nrow(file)]) <- 0
pcn <- prcomp(file, scale = FALSE)
pcn <- data.frame(pcn$rotation)
pcs <- cbind(pcs, pcn$PC1)
}
pca <- function(filename) {
file <- as.matrix(read.table(filename))
pc <- prcomp(file, scale = FALSE)
pc <- data.frame(pc$rotation)
pc1 <- pc$PC1
pcs <- data.frame(pc1)
for (k in 1:40) {
count_pc(k)
}
new_filename <- strsplit(filename, "_")[[1]][3]
print(pcs)
colnames(pcs) <- paste0(0:40, rep("_bins_deleted", 40))
write.table(pcs, file=paste(new_filename, "eigenvectors", sep="_"))
return(apply(pcs, 2, cor, y = pc1))
}
ldply(files, pca)
And indeed, count_pc does not crash with above error but, unfortunately, it crashes with the new one:
"colnames<-`(`*tmp*`, value = c("0_bins_deleted", "1_bins_deleted", :
'names' [41] attribute must be the same length as the vector [1]"
which means that count_pc does not change needed variables. First, I thought the problem might be connected with using sapply(1:40, count_pc) so I replaced it with a cycle. But it didn't help. I've also tried to use environment(count_pc) <- environment() in the pca but it didn't help either (as well as changing variable names in count_pc to env$'name'). I don't know what to do and googling doesn't seem to help.
I am very new to R and I have some problem on performing loop using seq() and list. I have search on the QnA in SO, yet I have to find the same problem as this. I apologize if there is a duplicate QnA on this.
I know the basic on how to generate sequence of number and generate using list, however I am wondering whether we can generate a list of sequence for each loop.
this is an example of my code
J <- seq(50,200,50) # (I actually wanted to use 1: J to generate a sequence of each combinations . i.e: 1:50, 1:100 etc)
K <- seq(10,100,10) #(same as the above)
set.seed(1234)
for (i in J) {
for (j in K){
f <- rnorm(i + 1) # the f value I would like it to be generate in terms of list, since the j have 4 sequence value, if possible, could it adhere to that?
}
}
i try using both sequence and list function, but i keep getting either messages:
if print(i)
output
[1]1
.
.
.
[1]50
Warning message:
In 1:(seq(50, 200, 50)) :
numerical expression has 6 elements: only the first used
for (i in 1:list(seq(50,200,50)))
Error in 1:list(seq(50, 200, 50)) : NA/NaN argument
May I know, whether such loop combinations can be perform? Could you please guide me on this? Thank you very much.
not yet sure of what you are asking but is this what you are looking for? It was difficult to post this as a comment
J <- seq(50,200,50)
l1 <- vector(length = length(J), mode = "list")
for (i in seq_along(J)){ # you know of seq_along() right?
l1[[i]] = rnorm(J[i])
}
for the second question where you want lists(J) of lists(K) of matrices : Please do note hat <<- has never been a good practice, but for now this is what i could come up with!
Note : to understand what is actually happening, go into the debug mode : i.e. after defining func, also pass debug(func) which will then go into step-by-step execution.
l1 <- vector(length = length(J), mode = "list")
l2 <- vector(length = length(K), mode = "list")
func <- function(x){
l1[[x]] <- l2
func1 <- function(y) {
l1[[x]][[y]] <<- matrix(rnorm(J[x]*K[y]),
ncol = J[x],
nrow = K[y])
}
lapply(seq_along(l1[[x]]),func1)
}
lapply(seq_along(l1), func)
I want to make a tree (cluster) using Interactive Tree of Life web-based tool (iTOL). As an input file (or string) this tool uses Newick format which is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. Beside that, additional information might be supported such as bootstrapped values of cluster's nodes.
For example, here I created dataset for a cluster analysis using clusterGeneration package:
library(clusterGeneration)
set.seed(1)
tmp1 <- genRandomClust(numClust=3, sepVal=0.3, numNonNoisy=5,
numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
data <- tmp1$datList[[2]]
Afterwards I performed cluster analysis and assessed support for the cluster's nodes by bootstrap using pvclust package:
set.seed(2)
y <- pvclust(data=data,method.hclust="average",method.dist="correlation",nboot=100)
plot(y)
Here is the cluster and bootstrapped values:
In order to make a Newick file, I used ape package:
library(ape)
yy<-as.phylo(y$hclust)
write.tree(yy,digits=2)
write.tree function will print tree in a Newick format:
((x2:0.45,x6:0.45):0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14):0.0064):0.12):0.22,(x5:0.28,x8:0.28):0.2):0.011);
Those numbers represent branch lengths (cluster's edge lengths). Following instructions from iTOL help page ("Uploading and working with your own trees" section) I manually added bootstrapped values into my Newick file (bolded values below):
((x2:0.45,x6:0.45)74:0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14)55:0.0064)68:0.12)100:0.22,(x5:0.28,x8:0.28)100:0.2)63:0.011);
It works fine when I upload the string into iTOL. However, I have a huge cluster and doing it by hand seems tedious...
QUESTION: What would be a code that can perform it instead of manual typing?
Bootstrap values can be obtained by:
(round(y$edges,2)*100)[,1:2]
Branch lengths used to form Newick file can be obtained by:
yy$edge.length
I tried to figure out how write.tree function works after debugging it. However, I noticed that it internally calls function .write.tree2 and I couldn't understand how to efficiently change the original code and obtain bootstrapped values in appropriate position in a Newick file.
Any suggestion are welcome.
Here is one solution for you: objects of class phylo have an available slot called node.label that, appropriately, gives you the label of a node. You can use it to store your bootstrap values. There will be written in your Newick File at the appropriate place as you can see in the code of .write.tree2:
> .write.tree2
function (phy, digits = 10, tree.prefix = "")
{
brl <- !is.null(phy$edge.length)
nodelab <- !is.null(phy$node.label)
...
if (is.null(phy$root.edge)) {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(";")
}
else {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(":")
cp(sprintf(f.d, phy$root.edge))
cp(";")
}
...
The real difficulty is to find the proper order of the nodes. I searched and searched but couldn't find a way to find the right order a posteriori.... so that means we will have to get that information during the transformation from an object of class hclust to an object of class phylo.
And luckily, if you look into the function as.phylo.hclust, there is a vector containing the nodes index in their correct order vis-à-vis the previous hclust object:
> as.phylo.hclust
function (x, ...)
{
N <- dim(x$merge)[1]
edge <- matrix(0L, 2 * N, 2)
edge.length <- numeric(2 * N)
node <- integer(N) #<-This one
...
Which means we can make our own as.phylo.hclust with a nodenames parameter as long as it is in the same order as the nodes in the hclust object (which is the case in your example since pvclust keeps a coherent order internally, i. e. the order of the nodes in the hclust is the same as in the table in which you picked the bootstraps):
# NB: in the following function definition I only modified the commented lines
as.phylo.hclust.with.nodenames <- function (x, nodenames, ...) #We add a nodenames argument
{
N <- dim(x$merge)[1]
edge <- matrix(0L, 2 * N, 2)
edge.length <- numeric(2 * N)
node <- integer(N)
node[N] <- N + 2L
cur.nod <- N + 3L
j <- 1L
for (i in N:1) {
edge[j:(j + 1), 1] <- node[i]
for (l in 1:2) {
k <- j + l - 1L
y <- x$merge[i, l]
if (y > 0) {
edge[k, 2] <- node[y] <- cur.nod
cur.nod <- cur.nod + 1L
edge.length[k] <- x$height[i] - x$height[y]
}
else {
edge[k, 2] <- -y
edge.length[k] <- x$height[i]
}
}
j <- j + 2L
}
if (is.null(x$labels))
x$labels <- as.character(1:(N + 1))
node.lab <- nodenames[order(node)] #Here we define our node labels
obj <- list(edge = edge, edge.length = edge.length/2, tip.label = x$labels,
Nnode = N, node.label = node.lab) #And you put them in the final object
class(obj) <- "phylo"
reorder(obj)
}
In the end, here is how you would use this new function in your case study:
bootstraps <- (round(y$edges,2)*100)[,1:2]
yy<-as.phylo.hclust.with.nodenames(y$hclust, nodenames=bootstraps[,2])
write.tree(yy,tree.names=TRUE,digits=2)
[1] "((x5:0.27,x8:0.27)100:0.24,((x7:0.25,(x4:0.14,(x1:0.13,x3:0.13)61:0.014)99:0.11)100:0.23,(x2:0.46,x6:0.46)56:0.022)61:0.027)100;"
#See the bootstraps ^^^ here for instance
plot(yy,show.node.label=TRUE) #To show that the order is correct
plot(y) #To compare with (here I used the yellow value)
I am trying to apply a function I wrote that uses the 'pls' package to make a model and then use it
to predict several test set(in this case 9), returning the R2,RMSEP and prediction bias of each test set
for n number of subset selected from the data frame.
the function is
cpo<-function(data,newdata1,newdata2,newdata3,newdata4,newdata5,newdata6,newdata7,newdata8,newdata9){
data.pls<-plsr(protein~.,8,data=data,validation="LOO")#making a pls model
newdata1.pred<-predict(data.pls,8,newdata=newdata1) #using the model to predict test sets
newdata2.pred<-predict(data.pls,8,newdata=newdata2)
newdata3.pred<-predict(data.pls,8,newdata=newdata3)
newdata4.pred<-predict(data.pls,8,newdata=newdata4)
newdata5.pred<-predict(data.pls,8,newdata=newdata5)
newdata6.pred<-predict(data.pls,8,newdata=newdata6)
newdata7.pred<-predict(data.pls,8,newdata=newdata7)
newdata8.pred<-predict(data.pls,8,newdata=newdata8)
newdata9.pred<-predict(data.pls,8,newdata=newdata9)
pred.bias1<-mean(newdata1.pred-newdata1[742]) #calculating the prediction bias
pred.bias2<-mean(newdata2.pred-newdata2[742])
pred.bias3<-mean(newdata3.pred-newdata3[742]) #[742] reference values in column742
pred.bias4<-mean(newdata4.pred-newdata4[742])
pred.bias5<-mean(newdata5.pred-newdata5[742])
pred.bias6<-mean(newdata6.pred-newdata6[742])
pred.bias7<-mean(newdata7.pred-newdata7[742])
pred.bias8<-mean(newdata8.pred-newdata8[742])
pred.bias9<-mean(newdata9.pred-newdata9[742])
r<-c(R2(data.pls,"train"),RMSEP(data.pls,"train"),pred.bias1,
pred.bias2,pred.bias3,pred.bias4,pred.bias5,pred.bias6,
pred.bias7,pred.bias8,pred.bias9)
return(r)
}
selecting n number of subsets (based on an answer from my question[1]: Select several subsets by taking different row interval and appy function to all subsets
and applying cpo function to each subset I tried
Edited based on #Gavin advice
FO03 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets) # sub is the n number subsets created by selecting rows
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
for(i in sub) { #for every subset in sub i want to apply cpo
outList[[i]] <- cpo(data=sub,newdata1=gag11p,newdata2=gag12p,newdata3=gag13p,
newdata4=gag21p,newdata5=gag22p,newdata6=gag23p,
newdata7=gag31p,newdata8=gag32p,newdata9=gag33p) #new data are test sets loaded in the workspace
}
return(outlist)
}
FOO3(GAGp,10,10)
when I try this I keep getting 'Error in eval(expr, envir, enclos) : object 'protein' not found' not found.
Protein is used in the plsr formula of cpo, and is in the data set.
I then tried to use the plsr function directly as seen below
FOO4 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
cal<-vector("list", length=nSubsets) #for each subset in sub make a pls model for protein
names(cal)<-c(paste("cal",1:nSubsets, sep=""))
for(i in sub) {
cal[[i]] <- plsr(protein~.,8,data=sub,validation="LOO")
}
return(outlist) # return is just used to end script and check if error still occurs
}
FOO4(gagpm,10,10)
When I tried this I get the same error 'Error in eval(expr, envir, enclos) : object 'protein' not found'.
Any advice on how to deal with this and make the function work will be much appreciated.
I suspect the problem is immediately at the start of FOO3():
FOO3 <- function(data, nSubsets, nSkip) {
outList <- vector("list", r <- c(R2(data.pls,"train"), RMSEP(data.pls,"train"),
pred.bias1, pred.bias2, pred.bias3, pred.bias4, pred.bias5,
pred.bias6, pred.bias7, pred.bias8, pred.bias9))
Not sure what you are trying to do when creating outList, but vector() has two arguments and you seem to be assigning to r a vector of numerics that you want R to use as the length argument to vector().
Here you are using the object data.pls and this doesn't exist yet - and never will in the frame of FOO3() - it is only ever created in cpo().
Your second loop looks totally wrong - you are not assigning the output from cpo() to anything. I suspect you wanted:
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
....
for(i in subset) {
outList[[i]] <- cpo(....)
}
return(outList)
But that depends on what subset is etc. You also haven't got the syntax for this loop right. You have
for(i in(subset)) {
when it should be
for(i in subset) {
And subset and data aren't great names as these are common R functions and modelling arguments.
There are lots of problems with your code. Try to start simple and build up from there.
I have managed to achieved what i wanted using this, if there is a better way of doing it (i'm sure there must be) I'm eager to learn.This function preforms the following task
1. select "n" number of subsets from a dataframe
2. For each subset created, a plsr model is made
3. Each plsr model is used to predict 9 test sets
4. For each prediction, the prediction bias is calculated
far5<- function(data, nSubsets, nSkip){
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]} #sub is the subsets created
mop<- lapply(sub,cpr2) #assigning output from cpr to mop
names(mop)<-c(paste("mop", mop, sep=""))
return(names(mop))
}
call: far5(data,nSubsets, nSkip))
The first part -selecting the subsets is based on the answer to my question Select several subsets by taking different row interval and appy function to all subsets
I was then able to apply the function cpr2 to the subsets created using "lapply" instead of the "for' loop as was previously done.
cpr2 is a modification of cpo, for which only data is supplied, and the new data to be predicted is used directly in the function as shown below.
cpr2<-function(data){
data.pls<-plsr(protein~.,8,data=data,validation="LOO") #make plsr model
gag11p.pred<-predict(data.pls,8,newdata=gag11p) #predict each test set
gag12p.pred<-predict(data.pls,8,newdata=gag12p)
gag13p.pred<-predict(data.pls,8,newdata=gag13p)
gag21p.pred<-predict(data.pls,8,newdata=gag21p)
gag22p.pred<-predict(data.pls,8,newdata=gag22p)
gag23p.pred<-predict(data.pls,8,newdata=gag23p)
gag31p.pred<-predict(data.pls,8,newdata=gag31p)
gag32p.pred<-predict(data.pls,8,newdata=gag32p)
gag33p.pred<-predict(data.pls,8,newdata=gag33p)
pred.bias1<-mean(gag11p.pred-gag11p[742]) #calculate prediction bias
pred.bias2<-mean(gag12p.pred-gag12p[742])
pred.bias3<-mean(gag13p.pred-gag13p[742])
pred.bias4<-mean(gag21p.pred-gag21p[742])
pred.bias5<-mean(gag22p.pred-gag22p[742])
pred.bias6<-mean(gag23p.pred-gag23p[742])
pred.bias7<-mean(gag31p.pred-gag31p[742])
pred.bias8<-mean(gag32p.pred-gag32p[742])
pred.bias9<-mean(gag33p.pred-gag33p[742])
r<-signif(c(pred.bias1,pred.bias2,pred.bias3,pred.bias4,pred.bias5,
pred.bias6,pred.bias7,pred.bias8,pred.bias9),2)
out<-c(R2(data.pls,"train",ncomp=8),RMSEP(data.pls,"train",ncomp=8),r)
return(out)
} #signif use to return 2 decimal place for prediction bias
call:cpr2(data)
I was able to use this to solve my problem, however since the amount of new data to be predicted was only nine, it was possible to list them out as i did. If there is a more generalized way to do this I'm interested in learning.