looping over multiPhylo: why does "1:length(trees)" work? - r

I have a question about for-loops and lists/multiPhylo-objects. I've actually solved the issue at hand, but I don't understand why the solution I applied worked. I'd love to know why it worked so that I can understand things better.
Goal: use a for-loop to apply a particular function to every tree in a multiPhylo object and write each tree to a file.
The particular set of trees I'm looping over happen to be accessible publicly via GitHub, so you can have a look yourself if you'd like.
library(ape)
#reading in data
fn <- "https://github.com/D-PLACE/dplace-data/blob/master/phylogenies/gray_et_al2009/original/a400-m1pcv-time.trees.gz"
trees <- read.nexus(fn)
The particular function I want to apply is just ape::drop.tip(). I'm doing a prune in two stages, but for the sake of the reprex let's just say I want to drop one tip from each tree - "Sisingga". I would have imagined that code chunk (a) below would work, but it doesn't. Instead (b) works. Why?
tip_to_drop <- "Sisingga"
index <- 0 #starting a count to have something to name files with uniquely
#code chunk a
for(tree in trees){
index <- index +1
tree <- ape::drop.tip(tree, tip_to_drop)
output_fn <- paste0("tree_", index, ".txt") #making unique file name
write.nexus(tree,file = output_fn )
}
index <- 0
#code chunk b
for(tree in 1:length(trees)){
index <- index +1
tree <- trees[[tree]]
tree <- ape::drop.tip(tree, tip_to_drop)
output_fn <- paste0("tree_", index, ".txt")
write.nexus(tree,file = output_fn )
}
I'm happy that I've solved my problem, but I'm left a bit confused. Any light would be welcome.
p.s. the reason I thought chunk (a) would work is because it has in the past in a similar situation, but with another multiPhylo object. In that case it was https://cdstar.shh.mpg.de/bitstreams/EAEA0-D501-DBB8-65C4-0/tree_glottolog_newick.txt.
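One possible explanation, which I have not checked against this particular file: a for loop iterates over the underlying list and does not call the class's `[[` method. As far as I know, ape can store the tip labels of a multiPhylo object once, in an attribute of the whole object (this typically happens when a NEXUS file has a TRANSLATE block), and `trees[[i]]` puts them back on the extracted tree. A Newick file read with read.tree would usually keep the labels in each tree, which would fit with chunk (a) having worked before. The sketch below mimics that behaviour in base R with a made-up class (`toybag`, `shared_label` and the element names are all hypothetical, not part of ape):

```r
# A toy S3 class that keeps shared labels in an attribute of the
# container instead of in each element (hypothetical, not ape code).
bag <- structure(
  list(list(value = 1), list(value = 2)),
  class = "toybag",
  shared_label = c("A", "B")
)

# The `[[` method restores the shared labels on extraction.
`[[.toybag` <- function(x, i) {
  el <- unclass(x)[[i]]
  el$label <- attr(x, "shared_label")
  el
}

# for() walks the underlying list and never calls `[[.toybag`,
# so the elements come out without their labels.
direct <- list()
for (el in bag) direct[[length(direct) + 1]] <- el

# Indexing by position goes through the method.
indexed <- lapply(seq_along(bag), function(i) bag[[i]])

is.null(direct[[1]]$label)  # TRUE: label missing
indexed[[1]]$label          # "A" "B"
```

If that is what is happening here, looping over `seq_along(trees)` and extracting with `trees[[i]]`, as in chunk (b), is the safer pattern for any multiPhylo object.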

Related

How do you combine objects created in rvest looping function after an iteration?

I hope you are having a good day.
I'm trying to scrape Trustpilot-reviews in the sports-section.
I want four columns with number of reviews, trustscore, subcategories and companynames.
There are 43 pages it should iterate over, with 20 companies in each page.
After an iteration the data should be placed underneath the previous data. This can be cleaned up afterwards using filtering though.
The important part, and what I suspect is my problem is getting everything put together at the end.
The code as-is produces the error
"Error in .subset2(x, i, exact = exact) : subscript out of bounds"
If you know anything about this, some pointers on how the code can be corrected would be appreciated.
Here is the code I'm having trouble with:
Trustpilot_company_data <- data.frame()
page_urls = sprintf('https://dk.trustpilot.com/categories/sports?page=%s&status=all', 2:43)
page_urls = c(page_urls, 'https://dk.trustpilot.com/categories/sports?status=all')
for (i in 1:length(page_urls)) {
session <- html_session(page_urls[i])
trustscore_data_html <- html_nodes(session,'.styles_textRating__19_fv')
trustscore_data <- html_text(trustscore_data_html)
trustscore_data <- gsub("anmeldelser","",trustscore_data)
trustscore_data <- gsub("TrustScore","",trustscore_data)
trustscore_data <- as.data.frame(trustscore_data)
trustscore_data <- separate(trustscore_data, col="trustscore_data", sep="·", into=c("antal anmeldelser", "trustscore"))
number_of_reviews<- trustscore_data$`antal anmeldelser`
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]") %>%
as.numeric(number_of_reviews)
trustscores <- trustscore_data$trustscore
Trustpilot_company_data[[i]]$trustscores <- trimws(trustscores, whitespace = "[\\h\\v]") %>%
as.numeric(trustscores)
subcategories_data_html <- html_nodes(session,'.styles_categories__c4nU-')
subcategories_data <- html_text(subcategories_data_html)
Trustpilot_company_data[[i]]$subcategories_data <- gsub("·",",",subcategories_data)
company_name_data_html <- html_nodes(session,'.styles_businessTitle__1IANo')
Trustpilot_company_data[[i]]$company_name_data <- html_text(company_name_data_html)
Trustpilot_company_data[[i]]$company_name_data <- rep(i,length(Trustpilot_company_data[[i]]$company_name_data))
}
Best regards
Anders
There seem to be several things going on here.
First, as a rule, growing a data frame this way is not good practice.
Second, in this case you seem to be trying to add the new element for each column one at a time, which makes things more awkward for you. And you are trying to access the data frame as if it were a list. So, for example, this isn't going to work:
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]")
Trustpilot_company_data is a data frame, so it has rows and columns. To access a particular row and column with [] you write e.g. dat[5,10] for the fifth row and tenth column of dat. Instead you are using [[i]], which is list-style indexing; on a data frame it selects the i-th column, and since the data frame starts out empty there is no such column, hence the "subscript out of bounds" error. In this case you'd need to write e.g.
Trustpilot_company_data[i, "number_of_reviews"]
to access the thing you're trying to get at.
Third, doing this one column at a time is a bad idea. If you're going to try to grow a data frame, assemble each new mini-data-frame completely first and then add it to the bottom with rbind(). E.g.,
df <- data.frame()
for(i in 1:5) {
new_piece <- data.frame(a = i,
b = i,
c = i)
df <- rbind(df, new_piece)
}
But fourth and most important, don't grow data frames in this way in the first place. Instead, see for example this answer.
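A minimal sketch of one common alternative, using the same toy columns as above: build each piece as a complete data frame, keep the pieces in a list, and bind them once at the end. The names `pieces` and `df` are only for illustration.

```r
# Build every piece first, storing each one as a list element
pieces <- lapply(1:5, function(i) {
  data.frame(a = i, b = i, c = i)
})

# Bind all pieces together in a single step
df <- do.call(rbind, pieces)
nrow(df)
```

For the scraping case, each list element would be the complete four-column data frame for one page.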

Object selection in loop

I am currently experiencing perpetual issues with object selection within loops in R. I am fairly convinced that this is a common problem but I cannot seem to find the answer so here I am...
Here's a practical example of a problem I have:
I have a dataframe as source with a series of variables named sequentially (X1, X2, X3, X4, and so on). I am looking to create a function which takes the data as source, matches it to another dataset, and creates a new, combined dataset.
The number of variables will vary. I want to pass my function a parameter which tells it how many variables I have, and the function needs to adjust the number of times it will run the code accordingly. This seems like a task for a for loop, but again there doesn't appear to be an easy way for that selection and recreation of variables within a loop.
Here's the code I need to repeat:
new1$X1 <- data$X1[match(new1$matf1, data$rowID)]
new1$X2 <- data$X2[match(new1$matf1, data$rowID)]
new1$X3 <- data$X3[match(new1$matf1, data$rowID)]
new1$X4 <- data$X4[match(new1$matf1, data$rowID)]
new1$X5 <- data$X5[match(new1$matf1, data$rowID)]
(...)
return(new1)
I've attempted something like this:
for(i in 1:5) {
new1$Xi <- assign(paste0("X", i)), as.vector(paste0("data$X",i)[match(new1$matf1, data$rowID)])
}
without success.
Thank you for your help!
You can try this simple way, however a join would be more efficient:
vals <- paste0('X',1:5)
for(i in vals){
new1[[i]] <- data[[i]][match(new1$matf1, data$rowID)]
}
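Since the question asks for a function that takes the number of variables as a parameter, here is a rough sketch of how the loop could be wrapped up. The function name `add_vars` and the toy data are made up for illustration; the column names `matf1`, `rowID` and `X1`, `X2`, ... follow the question. The match is computed once and all columns are copied in one assignment.

```r
# Toy stand-ins for the asker's `data` and `new1` (hypothetical values)
data <- data.frame(rowID = 1:4,
                   X1 = c(10, 20, 30, 40),
                   X2 = c(1.5, 2.5, 3.5, 4.5))
new1 <- data.frame(matf1 = c(3, 1, 3))

add_vars <- function(new1, data, n) {
  vars <- paste0("X", seq_len(n))        # "X1", "X2", ..., "Xn"
  idx <- match(new1$matf1, data$rowID)   # row positions, computed once
  new1[vars] <- data[idx, vars, drop = FALSE]
  new1
}

res <- add_vars(new1, data, 2)
res
```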

Phylogenetics in R: different results when working on a tree compared to reading it in

This question stems directly from a previous one I asked here:
Phylogenetics in R: collapsing descendant tips of an internal node
((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;
I have created this tree from a gene tree in R, using the midpoint function to root it. Now I apply the following function to it:
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
Now my issue is that when I apply this function I do not get the tree I expected. The output when plotted looks like this:
However, if I write the tree out to a text file using write.tree and then read it in again as a Newick string, when I then apply the drop_dupes function above, I get a different tree, as shown in the below plot:
And this is what is confusing me. Why am I getting two different outputs when applying the same function to the very same tree, with the only difference being if it's read in or it's already a variable in R?
Edit: Here are some further details. The initial gene tree is as follows:
(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);
I read it into R as tree1. I then use the following code:
#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
##some of these are tips, some are nodes, need to treat differently
tr$tip.label<-1:(length(tr$tip.label))
return(tr)
}
#Function to check if tree is rooted, and if it is not, to use midpoint rooting
rootCheck <- function(tree){
# midpoint() is provided by the phangorn package
if(!is.rooted(tree)){
tree <- midpoint(tree)
}
# return the tree unchanged if it was already rooted
return(tree)
}
#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)
Now at this point if I plot tree d, I will get the first plot above. However, if I write tree c to a text file, then read it back in and then use the drop_dupes function on it, I will get the latter tree.
I have checked the newick file of tree c against the Newick tree at the top of the page and it is definitely the same.
The problem is in the sort.names function. It effectively rearranges the way the tree is written, and indexing then refers to other nodes than it should.
If you need the tip.labels numbered, why not label them separately?
tax.num <- data.frame(taxa = tree1$tip.label,
numbers = 1:(length(tree1$tip.label)))
a <- midpoint(drop_dupes(tree1))
a$tip.label <- tax.num$numbers[match(a$tip.label, tax.num$taxa)]
plot(a)
Also, di2multi seems to be redundant. It creates polytomies where the goal, as I understand it, is to discard tips with short branches.
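One way to read "indexing then refers to other nodes than it should", offered as a guess rather than a confirmed diagnosis: in drop_dupes, toDrop is ordered by the rows of the edge matrix, but it is then used to index tip.label, which is ordered by tip number. The two orders only coincide when tips appear in the edge matrix as 1, 2, 3, ..., which is typically the case right after a tree has been read from a Newick file. The toy vectors below are hypothetical stand-ins for tree$edge, tree$edge.length and tree$tip.label, so the sketch runs in base R without ape.

```r
# Hypothetical stand-ins for the components of a phylo object
edge <- rbind(c(4, 5), c(5, 2), c(5, 3), c(4, 1))  # parent, child
edge_length <- c(0.5, 1e-08, 0.2, 0.3)
tip_label <- c("t1", "t2", "t3")
n_tip <- length(tip_label)

# Rows of the edge matrix that end in a tip (rows 2, 3 and 4 here)
terminal <- which(edge[, 2] <= n_tip)

# Which terminal edges are short, in edge-row order
short <- edge_length[terminal] < 1e-05

# Indexing labels directly by `short` assumes tips occur in the edge
# matrix in the order 1, 2, 3, which is not true for this toy tree.
wrong <- tip_label[short]

# Mapping edge rows back to tip numbers first picks the intended tip.
right <- tip_label[edge[terminal, 2][short]]

wrong  # "t1"
right  # "t2"
```

In the real function the corresponding change would be to index with tree$edge[tips, 2][toDrop] instead of toDrop alone.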

How to efficiently iterate through a complicated function that outputs a dataframe?

I essentially need to iterate through a set of values for parameters A,B,C to generate a table of results that will help me analyze the importance of such parameters. This is for a program in R.
Let's say that:
A goes from rangeA = 1:10
B goes from rangeB = 11:20
C goes from rangeC = 21:30
The simplest (not most efficient) solution that I currently use goes something like this:
### here I create this empty dataframe because I add on each tmp calc later
res <- data.frame()
### here i just create a random dataframe for replicative purposes
dataset <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
ParameterAdjustment <- function() {
for(a in rangeA){
for(b in rangeB){
for(c in rangeC){
### this is a complicated calculation that is much more
### difficult than the replicable example below
tmp <- CalculateSomething(dataset,a,b,c)
### an example calculation
### EDIT NEW EXAMPLE CALCULATION
tmp <- colMeans(dataset+a*b*c)
tmp <- data.frame(t(tmp), sd(tmp))
res <- rbind(res,tmp)
}
}
}
return(res)
}
My problem is that this works fine with my original dataset that runs calculations on a 7000x500 dataframe. However, my new datasets are much larger and performance has become a significant issue. Can anyone suggest or help with a more efficient solution? Thank you.
Not sure what language the above is, so not sure how relevant this is, but here goes: Are you outputting/sending the data as you go, or collecting all the display results in memory and then outputting them all in one go at the end? When I've encountered similar problems with large datasets, outputting as I go has helped me out a few times. For example, when sending tens of thousands of data points back to the client for a graph, rather than generating an array of all those points and sending that, I output to screen after each point and then free up the memory. It still takes a while but that's unavoidable. The important bit is that it doesn't crash.
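In R specifically, a likely cost in the loop above is calling rbind on a growing data frame at every step. A sketch of one alternative, assuming the example calculation from the question: enumerate the parameter combinations with expand.grid, compute one row per combination into a list, and bind once at the end. The ranges are shortened here only to keep the sketch quick, and the names `grid` and `rows` are made up for illustration.

```r
set.seed(1)
dataset <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))

# All parameter combinations as rows (shortened ranges for the sketch)
grid <- expand.grid(a = 1:3, b = 11:13, c = 21:23)

# One single-row data frame per combination, collected in a list
rows <- lapply(seq_len(nrow(grid)), function(k) {
  tmp <- colMeans(dataset + grid$a[k] * grid$b[k] * grid$c[k])
  data.frame(t(tmp), sd = sd(tmp))
})

# Bind all rows in one step instead of growing `res` inside the loop
res <- do.call(rbind, rows)
dim(res)
```

Whether this is fast enough depends on how expensive the real calculation is; it only removes the overhead of repeatedly copying the result.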

Running the same function multiple times and saving results with different names in workspace

So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data.frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do it multiple times (let's say 10 times). In the end, I want each object (mantel.something) produced by this function to be saved in my workspace with a name that I can relate to each cycle (mantel.something1, mantel.something2...mantel.something10).
I have the following code, so far:
sort.song<-function(DATA){
require(ade4)
for(i in 1:10){ # Am I using for correctly here?
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantel.numnotes[i]<<-mantel.rtest(coord.dist,num.notes.dist,nrepet=1000)
mantel.songdur[i]<<-mantel.rtest(coord.dist,songdur.dist,nrepet=1000)
mantel.hfreq[i]<<-mantel.rtest(coord.dist,hfreq.dist,nrepet=1000)
mantel.lfreq[i]<<-mantel.rtest(coord.dist,lfreq.dist,nrepet=1000)
mantel.bwidth[i]<<-mantel.rtest(coord.dist,bwidth.dist,nrepet=1000)
mantel.hfreqlnote[i]<<-mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
}
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly for each mantel.something object.
Many thanks in advance!
The best way to implement what you are trying to do is through a list. You can even make it take two indices, the first for the iterations, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
...
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song<-function(DATA){
require(ade4)
mantellist <- as.list(1:10)
for(i in 1:10){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq=mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth=mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote=mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
)
}
return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than have the function run the loop, write the sort.song function to run one iteration of the process, then you can use replicate to repeat that process however many times you desire.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of each iteration of this process in a single object. By default, replicate simplifies its result to an array where possible; because sort.song returns a list of six elements, that would be a 6-by-10 matrix of mode list. Passing simplify = FALSE instead returns a plain list with 10 elements (one for each iteration), each of which is itself a list containing named elements corresponding to each result of mantel.rtest.
sort.song<-function(DATA){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist <- dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist <- dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist <- dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist <- dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist <- dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist <- dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
return(list(
numnotes = mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur = mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq = mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq = mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth = mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote = mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
))
}
require(ade4)
# simplify = FALSE keeps the result as a list of 10 lists
results <- replicate(10, sort.song(DATA), simplify = FALSE)
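To see what replicate returns with and without simplification, here is a small base R sketch. The function `one_run` is a hypothetical stand-in for a single iteration that returns a named list, as sort.song does.

```r
# Hypothetical stand-in for one iteration of the analysis
one_run <- function() list(m1 = runif(1), m2 = runif(1))

# Default: results are simplified into a 2 x 3 matrix of mode list
simplified <- replicate(3, one_run())

# simplify = FALSE: a plain list with one named list per iteration
as_list <- replicate(3, one_run(), simplify = FALSE)

dim(simplified)
names(as_list[[2]])
as_list[[2]]$m1   # result m1 from the second iteration
```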

Resources