I am trying to calculate network indices (clustering, modularity, edge density, degree, centrality, etc.) from 1000 simulated null matrices using the igraph package in R. The data are mixed-species bird flock data, which I used to generate the null matrices.
Here's the code:
## Construct null matrices ##
library(EcoSimR)
library(igraph)
# creating a 1000 empty matrices
fl_emp <- lapply(1:1000, function(i) data.frame())
# simulating 1000 matrices by randomization
fl_wp_n <- replicate(1000, sim5(fl_wp[,3:ncol(fl_wp)]),simplify = FALSE) #fl_wp is the raw data
#sim5 function is from the package 'EcoSimR'
for(i in 1:length(fl_emp))
{
fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]]) #Creating new igraph object to convert the null matrices to igraph objects to calculate network indexes
fl_wp_cw <- cluster_walktrap(fl_wp_ig[[i]])
fl_wp_mod <- modularity(fl_wp_cw[[i]]) ##Network index, this does not work
}
Here's what the simulated matrices look like(fl_wp_n) :
[1]: https://i.stack.imgur.com/1Q0Na.png
It is basically a list of 1000 elements, where each element is a simulated 133x74 matrix where the rows represent flock ID and the columns represent Species ID.
This is the error I'm getting when I run the loop:
> for(i in 1:length(fl_emp))
+ {
+ fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]])
+ fl_wp_cw <- cluster_walktrap(fl_wp_ig[[i]])
+ fl_wp_mod <- modularity(fl_wp_cw[[i]])
+ }
Error in cluster_walktrap(fl_wp_ig[[i]]) : Not a graph object!
It seems to be not recognizing fl_wp_ig as an igraph object. Any idea why?
Is there a better way to calculate indices for 1000 matrices in one loop?
Sorry if this is a dumb question, I'm new to igraph and R in general
Thanks a lot in advance!
If you look at the documentation for cluster_walktrap, you will see that the function expects a graph object. As @Szabolcs pointed out, when you index fl_wp_ig[[i]] in the for-loop, you get the vertices adjacent to vertex i, not the graph itself. You should only index fl_wp_n[[i]], because that is the list from which you want one matrix per iteration; the other variables should be used whole.
So you could try:
list_outputs = list()
for(i in 1:length(fl_emp))
{
# fl_wp_n[[i]] gets 1 matrix each iteration. Output -> graph object
fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]])
# Use the whole graph object fl_wp_ig
fl_wp_cw <- cluster_walktrap(fl_wp_ig)
# Use the whole fl_wp_cw output
fl_wp_mod <- modularity(fl_wp_cw)
# NOTE: the original loop did not store the result of each iteration,
# so fl_wp_mod was overwritten every time.
# Create an empty list before the for-loop (as above) and fill it here
list_outputs = append(list_outputs, fl_wp_mod)
}
Also, if you find it difficult to see the whole picture, you could try to create a custom function and use apply methods instead of a for-loop.
# Custom function
cluster_modularity = function(null_matrix){
# takes only one matrix at a time
fl_wp_ig <- graph_from_incidence_matrix(null_matrix)
fl_wp_cw <- cluster_walktrap(fl_wp_ig)
# return the modularity explicitly
modularity(fl_wp_cw)
}
# Iterate using lapply to store the outputs in a list - for example
list_outputs = lapply(fl_wp_n, cluster_modularity)
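Since the question asks for several indices at once, the same pattern can return one row of indices per null matrix. The following is only a minimal sketch: the toy 0/1 matrices stand in for the real fl_wp_n, the helper name network_indices is made up for illustration, and the choice of indices is an assumption. Recent igraph releases may also print a deprecation notice for graph_from_incidence_matrix that points to graph_from_biadjacency_matrix.

```r
library(igraph)

# Toy stand-ins for the simulated null matrices
# (assumption: 0/1 flock-by-species matrices, much smaller than 133 x 74)
set.seed(1)
null_mats <- replicate(5, matrix(rbinom(30, 1, 0.5), nrow = 6), simplify = FALSE)

# Hypothetical helper: one matrix in, one row of indices out
network_indices <- function(m) {
  g  <- graph_from_incidence_matrix(m)  # rows and columns both become vertices
  cw <- cluster_walktrap(g)             # community detection on the whole graph
  data.frame(
    modularity   = modularity(cw),
    edge_density = edge_density(g),     # density computed as for a general graph
    mean_degree  = mean(degree(g))
  )
}

# One row per null matrix, stacked into a single data frame
indices <- do.call(rbind, lapply(null_mats, network_indices))
```

With the real data, lapply(fl_wp_n, network_indices) should give 1000 rows, which can then be summarised or compared against the observed network.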
This question stems directly from a previous one I asked here:
Phylogenetics in R: collapsing descendant tips of an internal node
((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;
I have created this tree from a gene tree in R, using the midpoint function to root it. Now I apply the following function to it:
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
Now my issue is that when I apply this function I do not get the tree I expected. The output when plotted looks like this:
However, if I write the tree out to a text file using write.tree and then read it in again as a Newick string, when I then apply the drop_dupes function above, I get a different tree, as shown in the below plot:
And this is what is confusing me. Why am I getting two different outputs when applying the same function to the very same tree, with the only difference being if it's read in or it's already a variable in R?
Edit: Here are some further details. The initial gene tree is as follows:
(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);
I read it into R as tree1. I then use the following code:
#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
##some of these are tips, some are nodes, need to treat differently
tr$tip.label<-1:(length(tr$tip.label))
return(tr)
}
#Function to check if tree is rooted, and if it is not, to use midpoint rooting
rootCheck <- function(tree){
if(is.rooted(tree) == FALSE){
rootedTree <- midpoint(tree)
}
return(rootedTree)
}
#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)
Now at this point, if I plot tree d, I get the first plot above. However, if I write tree c to a text file, read it back in and then use the drop_dupes function on it, I get the latter tree.
I have checked the newick file of tree c against the Newick tree at the top of the page and it is definitely the same.
The problem is in the sort.names function. It effectively rearranges the way the tree is written, and indexing then refers to other nodes than it should.
If you need the tip.labels numbered, why not label them separately?
tax.num <- data.frame(taxa = tree1$tip.label,
numbers = 1:(length(tree1$tip.label)))
a <- midpoint(drop_dupes(tree1))
a$tip.label <- tax.num$numbers[match(a$tip.label, tax.num$taxa)]
plot(a)
Also, di2multi seems to be redundant. It creates polytomies where the goal, as I understand it, is to discard tips with short branches.
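A related point, offered as an assumption rather than something confirmed in this thread: drop_dupes builds toDrop in the row order of tree$edge and then uses it to index tree$tip.label, which is in tip-number order. The two orders agree only when tips are numbered in the order they appear in the edge matrix, which is typically the case for a freshly read Newick file but not necessarily for a re-rooted tree. The sketch below shows a variant (the name drop_short_tips is hypothetical) that maps each short terminal edge back to its tip number first, so it should not depend on that ordering.

```r
library(ape)

# Hypothetical variant of drop_dupes: look tips up by their number in tree$edge
drop_short_tips <- function(tree, thres = 1e-05) {
  is_tip_edge <- tree$edge[, 2] <= Ntip(tree)       # edges that end in a tip
  short       <- is_tip_edge & tree$edge.length < thres
  tip_index   <- tree$edge[short, 2]                # tip numbers of the short edges
  drop.tip(tree, tree$tip.label[tip_index])
}

# Toy tree (made up for illustration): B and C sit on near-zero branches
toy    <- read.tree(text = "((A:0.1,B:1e-08):0.2,(C:1e-08,D:0.3):0.1);")
pruned <- drop_short_tips(toy)
pruned$tip.label
```

If this variant gives the same tree before and after the write.tree/read.tree round trip, the ordering of the edge matrix is a likely cause of the discrepancy.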
I'm working with a set of results from the INLA package in R. These results are stored in objects with meaningful names, so I can have, for instance, model_a, model_b... in the current environment. For each of these models I'd like to do several processing tasks, including extracting the data to a separate data frame, which can then be merged with spatial data to create a map, etc.
Turning to a simpler, reproducible example, let's assume two results:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing the correct names depending on the model I'm using. I'm more of a Stata than an R person, and in Stata it would be trivial to use the stub of a name (model_a, or even just a) and construct a foreach loop that implements all the steps, adapting the names for each of the models.
In R, for loops have been bashed all over the internet so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be a better solution for such a scenario?
Update 1: Since comments suggested that for might indeed be an option, I'm trying to put all the tasks into a loop. So far I have managed to name the data frame correctly using assign, and to get the correct data plotted under the correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this a reasonable approach? Any better solutions?
One thing I cannot solve is having the second variable of the df named according to the model (i.e. model_a_value instead of the current value). Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep the models each in a separate object! Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
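As a rough illustration of the list-based approach, using the example data from the question, the summary step might look like the sketch below. The name model_sums and the long layout (one row per model rather than one data frame per model) are assumptions made for this example, not part of the original answer.

```r
# Example data from the question
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# Keep the models together in one named list
models <- list(
  model_a = lm(weight ~ group),
  model_b = lm(weight ~ group - 1)
)

# One row per model: name and value of its first coefficient
model_sums <- do.call(rbind, lapply(names(models), function(nm) {
  cf <- coef(models[[nm]])
  data.frame(model = nm, var = names(cf)[1], value = unname(cf[1]))
}))
```

The plotting step can loop over names(models) in the same way, building each file name with paste0(nm, "_plot.png"), so no call to assign or get is needed.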
See @joran's answer; this one shows how to use assign and get, but they should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
assign(paste0(model,"_sum"), setNames( get(paste0(model,"_sum")), c("var",paste0(model,"_value"))) ) # per comment: rename the value column (thanks to @Franck in chat for the guidance)
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which gives the two images and this:
> model_a_sum
                  var model_a_value
(Intercept) Intercept         5.032
> model_b_sum
               var model_b_value
groupCtl Intercept         5.032
I'm unsure why you want this data frame, but I hope this gives some clues on how to build variable names and how to access them.
So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data.frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do this multiple times (let's say 10 times). In the end, I want each object (mantel.something) produced by this function to be saved in my workspace with a name that I can relate to each cycle (mantel.something1, mantel.something2...mantel.something10).
I have the following code, so far:
sort.song<-function(DATA){
require(ade4)
for(i in 1:10){ # Am I using for correctly here?
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantel.numnotes[i]<<-mantel.rtest(coord.dist,num.notes.dist,nrepet=1000)
mantel.songdur[i]<<-mantel.rtest(coord.dist,songdur.dist,nrepet=1000)
mantel.hfreq[i]<<-mantel.rtest(coord.dist,hfreq.dist,nrepet=1000)
mantel.lfreq[i]<<-mantel.rtest(coord.dist,lfreq.dist,nrepet=1000)
mantel.bwidth[i]<<-mantel.rtest(coord.dist,bwidth.dist,nrepet=1000)
mantel.hfreqlnote[i]<<-mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
}
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly for each mantel.something object.
Many thanks in advance!
The best way to implement what you are trying to do is through a list. You can even make it take two indices, the first for the iterations, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
...
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song<-function(DATA){
require(ade4)
mantellist <- as.list(1:10)
for(i in 1:10){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq=mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth=mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote=mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
)
}
return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than have the function run the loop, write the sort.song function to run one iteration of the process, then use replicate to repeat that process however many times you want.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of each iteration of this process in a single object. replicate will return an array (if possible), otherwise a list (in the example below, a list of lists). So, the list will have 10 elements (one for each iteration) and each element will itself be a list containing named elements corresponding to each result of mantel.rtest.
sort.song<-function(DATA){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist <- dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist <- dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist <- dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist <- dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist <- dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist <- dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
return(list(
numnotes = mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur = mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq = mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq = mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth = mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote = mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
))
}
require(ade4)
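# simplify = FALSE keeps the result as a list of lists instead of collapsing it into a matrix
results <- replicate(10, sort.song(DATA), simplify = FALSE)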
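To see how the nested result can be used afterwards, here is a small base-R sketch of the same pattern. Everything in it is a stand-in: the toy data frame, the function one_run and the summary values replace the real DATA, sort.song and mantel.rtest calls, so only the structure carries over.

```r
set.seed(1)

# Toy stand-in for DATA: 4 points with 3 recordings each
toy <- data.frame(Point = rep(1:4, each = 3), Song_Dur = runif(12))

# One iteration: shuffle rows, keep one row per Point, return named results
one_run <- function(d) {
  s <- d[sample(nrow(d)), ]
  s <- s[!duplicated(s$Point), ]
  list(n_points = nrow(s), mean_dur = mean(s$Song_Dur))
}

# simplify = FALSE guarantees a plain list with one element per iteration
runs <- replicate(10, one_run(toy), simplify = FALSE)

# Pull the same named element out of every iteration
mean_durs <- vapply(runs, function(r) r$mean_dur, numeric(1))
```

With the real function, the same vapply call could extract, for example, one component of each hfreq result across all iterations, depending on what mantel.rtest returns.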
One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
forval n=1990/2000 {
local m = `n'-1
// create new columns from existing ones on-the-fly
generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
}
}
DON'T do it in R. The reason it's messy is because it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case probably a list.
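As one possible illustration, the Stata loop from the question could be written against a single matrix with one row per year and one column per population. The matrix layout, the starting values and the random trend are all assumptions made up for this sketch; a list or a data frame would serve equally well.

```r
years <- 1989:2000
pops  <- c("A", "B", "C", "D")

set.seed(42)
# Hypothetical yearly trend, named by year so it can be looked up directly
trend <- setNames(runif(11, -0.1, 0.1), 1990:2000)

# One structure holds every population for every year
pop <- matrix(NA_real_, nrow = length(years), ncol = length(pops),
              dimnames = list(years, pops))
pop["1989", ] <- c(100, 200, 300, 400)   # made-up starting populations

# Each year is the previous year scaled by (1 + trend), for all populations at once
for (y in 1990:2000) {
  pop[as.character(y), ] <- pop[as.character(y - 1), ] * (1 + trend[[as.character(y)]])
}
```

Questions such as "how many years do I have?" or "is the sequence complete?" then reduce to nrow(pop) and anyNA(pop), and the whole object can be saved or passed around as one unit.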
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest to add the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do so, is to keep your factors factors instead of variable names.
I make some data as I believe it is in your R version now (at least, I hope so...)
Data <- data.frame(
popA1989 = 1:10,
popB1989 = 10:1,
popC1989 = 11:20,
popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe where you have a factor pop and a numeric variable year
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy :
for(i in 1:11){
tmp <- newData[newData$year==(1988+i),]
newData <- rbind(newData,
data.frame( values = tmp$values*(1+Trend[,i]),
pop = tmp$pop,
year = tmp$year+1
)
)
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe :
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
trend <- Trend[,i-1989] # get trend variable
Data <- within(Data,assign(new, old*(1+trend)))
}
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep="")) # get old variable
trend <- get(paste("trend",i,sep="")) # get trend variable
assign(new, old*(1+trend))
}
}
Assuming you have the population data in the vector pop1989 and the trend data in trend:
require(stringr)# because str_c has better default for sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)