How to efficiently iterate through a complicated function that outputs a dataframe? - r

I essentially need to iterate through a set of values for parameters A,B,C to generate a table of results that will help me analyze the importance of such parameters. This is for a program in R.
Let's say that:
A goes from rangeA = 1:10
B goes from rangeB = 11:20
C goes from rangeC = 21:30
The simplest (not most efficient) solution that I currently use goes something like this:
### start with an empty data frame; each tmp result is appended to it later
res <- data.frame()
### a random data frame, so that the example is reproducible
dataset <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))
ParameterAdjustment <- function() {
  for (a in rangeA) {
    for (b in rangeB) {
      for (c in rangeC) {
        ### the real calculation is much more complicated than
        ### the reproducible example below:
        ### tmp <- CalculateSomething(dataset, a, b, c)
        ### an example calculation
        ### EDIT NEW EXAMPLE CALCULATION
        tmp <- colMeans(dataset + a * b * c)
        tmp <- data.frame(t(tmp), sd(tmp))
        res <- rbind(res, tmp)
      }
    }
  }
  return(res)
}
My problem is that this works fine with my original dataset that runs calculations on a 7000x500 dataframe. However, my new datasets are much larger and performance has become a significant issue. Can anyone suggest or help with a more efficient solution? Thank you.

I'm not sure what language the above is, so I'm not sure how relevant this is, but here goes: are you outputting/sending the data as you go, or collecting all the results in memory and then outputting them in one go at the end? When I've encountered similar problems with large datasets, outputting as I go has helped me out a few times. For example, when sending tens of thousands of data points back to the client for a graph, rather than generating an array of all those points and sending that, I output each point as it is produced and then free the memory. It still takes a while, but that's unavoidable. The important bit is that it doesn't crash.
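Since the question is about R, here is a minimal sketch of one common way to avoid growing a data frame with rbind inside nested loops: build the parameter grid with expand.grid, collect one result row per combination in a list, and bind everything once at the end. The names grid, rows and the sd column label are assumptions for illustration, and the toy colMeans calculation stands in for the real CalculateSomething.

```r
set.seed(1)
dataset <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))

# Every combination of a, b and c as one row (10 * 10 * 10 = 1000 rows)
grid <- expand.grid(a = 1:10, b = 11:20, c = 21:30)

# One named numeric vector per parameter combination; nothing is grown in place
rows <- lapply(seq_len(nrow(grid)), function(i) {
  tmp <- colMeans(dataset + grid$a[i] * grid$b[i] * grid$c[i])
  c(tmp, sd = sd(tmp))
})

# Bind all rows once, and keep the parameters next to their results
res <- cbind(grid, do.call(rbind, rows))
```

Keeping a, b and c as columns of res also makes it easier to analyze afterwards which parameter drives the result.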

Related

looping over multiPhylo: why does "1:length(trees)" work?

I have a question about for-loops and lists/multiPhylo-objects. I've actually solved the issue at hand, but I don't understand why the solution I applied worked. I'd love to know why it worked so that I can understand things better.
goal: use a for-loop to apply a particular function to every tree in a multiPhylo object, write each tree to file.
The particular set of trees I'm looping over happen to be accessible publicly via GitHub, so you can have a look yourself if you'd like.
library(ape)
#reading in data
fn <- "https://github.com/D-PLACE/dplace-data/blob/master/phylogenies/gray_et_al2009/original/a400-m1pcv-time.trees.gz"
trees <- read.nexus(fn)
The particular function I want to apply is just ape::drop.tip(). I'm doing a prune in two stages, but for the sake of the reprex let's just say I want to drop one tip from each tree - "Sisingga". I would have imagined that code chunk (a) below would work, but it doesn't. Instead (b) works. Why?
tip_to_drop <- "Sisingga"
index <- 0 #starting a count to have something to name files with uniquely
#code chunk a
for(tree in trees){
index <- index +1
tree <- ape::drop.tip(tree, tip_to_drop)
output_fn <- paste0("tree_", index, ".txt") #making unique file name
write.nexus(tree,file = output_fn )
}
index <- 0
#code chunk b
for(tree in 1:length(trees)){
index <- index +1
tree <- trees[[tree]]
tree <- ape::drop.tip(tree, tip_to_drop)
output_fn <- paste0("tree_", index, ".txt")
write.nexus(tree,file = output_fn )
}
I'm happy that I've solved my problem, but I'm left a bit confused. Any light would be welcome.
p.s. the reason I thought chunk (a) would work is because it has in the past in a similar situation, but with another multiPhylo object. In that case it was https://cdstar.shh.mpg.de/bitstreams/EAEA0-D501-DBB8-65C4-0/tree_glottolog_newick.txt.
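One plausible explanation, not confirmed anywhere in this thread: a for loop walks the underlying list directly and does not call a class's `[[` method, while trees[[i]] does. As far as I know, ape stores the tip labels of a "compressed" multiPhylo object once in a TipLabel attribute and re-attaches them in its `[[` method, so trees taken directly by for would arrive without tip labels. The sketch below reproduces only the mechanism with a toy class in base R; toyMultiTree and its method are hypothetical stand-ins, not ape code.

```r
# Toy stand-in for a compressed multiPhylo: shared labels live in an attribute
trees <- structure(
  list(list(edge = 1), list(edge = 2)),
  class = "toyMultiTree",
  TipLabel = c("A", "B")
)

# Hypothetical `[[` method that re-attaches the shared labels on extraction
`[[.toyMultiTree` <- function(x, i) {
  tree <- unclass(x)[[i]]
  tree$tip.label <- attr(x, "TipLabel")
  tree
}

# Chunk (a) style: for() takes the raw list elements and skips the method
direct <- list()
for (tree in trees) direct[[length(direct) + 1]] <- tree

# Chunk (b) style: indexing with [[ goes through the method
indexed <- lapply(seq_along(trees), function(i) trees[[i]])

is.null(direct[[1]]$tip.label)   # labels are missing here
indexed[[1]]$tip.label           # labels are restored here
```

If this is what happens, it would also be consistent with chunk (a) working on a multiPhylo object whose trees each carry their own tip labels.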

How to list results for several calculations in R

I have loaded two source files and performed some iterative calculations, and now I need to display/export the results. There are hundreds of iterative calculations, hence hundreds of results. However, only the result of the final calculation is displayed.
In this example, I have shortened the list of calculations to only 3. Please refer to line 7 (k in 1:3). How do I get R to display the results of all calculations?
Many thanks in advance to those who can offer help. If this question has been asked before, a link would be great. I could not find it, probably because I do not know the right terms to search for.
# Load files
d1<-read.csv('testhourly.csv',sep=",",header=F)
names(d1)<-c("elapsedtime","units")
d2<-read.csv('testevent.csv',sep=",",header=F)
names(d2)<-c("eventno","starttime","endtime","starttemp","endtemp")
# Perform for calculations 1 to 3
for(k in 1:3){
a<-d2[k,2]
b<-d2[k,3]
x<-d1[a:b,]$q
a2<-d2[k,2]-1
b2<-d2[k,3]-1
y<-d1[a2:b2,]$q
z <- (x-y)}
results <- sum(z)
# Export results
write.csv(results, file = "results.csv")
You are not saving your output inside the loop for every iteration: z is overwritten each time, so after the loop only the value from the last iteration is left.
temp=vector("list",3)
for(k in 1:3) {
a<-d2[k,2]
b<-d2[k,3]
x<-d1[a:b,]$q
a2<-d2[k,2]-1
b2<-d2[k,3]-1
y<-d1[a2:b2,]$q
temp[[k]] <- (x-y)
}
results <- sum(unlist(temp))
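If one result per calculation is wanted rather than a single grand total, a sketch along these lines may help. The small d1 and d2 data frames are made-up stand-ins for the CSV files, and the units column is assumed to be the one being differenced (the question's code refers to a column q that is not among the names it assigns).

```r
# Hypothetical stand-ins for testhourly.csv and testevent.csv
d1 <- data.frame(elapsedtime = 1:20, units = (1:20)^2)
d2 <- data.frame(eventno = 1:3,
                 starttime = c(2, 6, 11),
                 endtime = c(4, 9, 15))

# One sum per event, returned as a numeric vector with one entry per k
results <- vapply(seq_len(nrow(d2)), function(k) {
  rows <- d2$starttime[k]:d2$endtime[k]
  sum(d1$units[rows] - d1$units[rows - 1])
}, numeric(1))

# One row per event, ready for write.csv()
results_table <- data.frame(eventno = d2$eventno, result = results)
```

vapply allocates the output vector up front and checks that every iteration returns exactly one number.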

Object selection in loop

I am currently experiencing perpetual issues with object selection within loops in R. I am fairly convinced that this is a common problem but I cannot seem to find the answer so here I am...
Here's a practical example of a problem I have:
I have a dataframe as source with a series of variables named sequentially (X1, X2, X3, X4, and so on). I am looking to create a function which takes the data as source and matches it to another dataset to create a new, combined dataset.
The number of variables will vary. I want to pass my function a parameter which tells it how many variables I have, and the function needs to adjust the number of times it will run the code accordingly. This seems like a task for a for loop, but again there doesn't appear to be an easy way for that selection and recreation of variables within a loop.
Here's the code I need to repeat:
new1$X1 <- data$X1[match(new1$matf1, data$rowID)]
new1$X2 <- data$X2[match(new1$matf1, data$rowID)]
new1$X3 <- data$X3[match(new1$matf1, data$rowID)]
new1$X4 <- data$X4[match(new1$matf1, data$rowID)]
new1$X5 <- data$X5[match(new1$matf1, data$rowID)]
(...)
return(new1)
I've attempted something like this:
for(i in 1:5) {
new1$Xi <- assign(paste0("X", i)), as.vector(paste0("data$X",i)[match(new1$matf1, data$rowID)])
}
without success.
Thank you for your help!
You can try this simple way, although a join would be more efficient:
vals <- paste0('X',1:5)
for(i in vals){
new1[[i]] <- data[[i]][match(new1$matf1, data$rowID)]
}
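The same idea can be wrapped in a function that takes the number of variables as a parameter, as the question asks. This is only a sketch: add_matched_columns and n_vars are hypothetical names, and the tiny data frames are made up for illustration. The row lookup is computed once and all columns are copied in a single assignment.

```r
add_matched_columns <- function(new1, data, n_vars) {
  vals <- paste0("X", seq_len(n_vars))       # "X1", "X2", ...
  idx <- match(new1$matf1, data$rowID)       # row lookup, computed once
  new1[vals] <- data[idx, vals, drop = FALSE]
  new1
}

# Made-up example data
data <- data.frame(rowID = c(10, 20, 30),
                   X1 = c(1, 2, 3),
                   X2 = c("a", "b", "c"),
                   stringsAsFactors = FALSE)
new1 <- data.frame(matf1 = c(30, 10, 99))

combined <- add_matched_columns(new1, data, 2)
```

Rows of new1 whose matf1 has no match in data$rowID get NA in the new columns, which is the same behavior as the loop above.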

How Can I Avoid This For Loop? (R)

I currently have a for loop as below and it does not run as fast as I would like it to.
library(dplyr)
DF<-data.frame(Name=c('Bob','Joe','Sally')) #etc
PrimaryResult <- Function1(DF)
ResultsDF<-Function2(PrimaryResult)
for(i in 1:9)
{
Filtered<-filter(DF,Name!=PrimaryResult[i,2])
NextResult <- Function1(Filtered)
ResultsDF<-rbind(ResultsDF,Function2(NextResult))
}
The code takes an initial result of Function1 (which is a list of names) and tries it again with each name in the initial result being excluded individually to provide alternative results. These are returned as a one row data frame via Function2 and appended to the Results data frame.
How can I make this faster?
It seems like your main problem is appending the results of Function2 at each iteration with rbind. This is classically slow because R has to copy the whole accumulated data frame at every step, and it does not know in advance how large the final result will be.
Try collecting your results in a preallocated list instead and binding them once at the end. I don't really know what your functions do, so I can't really assist with that part.
results_list <- vector("list", 10)
results_list[[1]] <- Function2(PrimaryResult)
for(i in 1:9){
  Filtered <- filter(DF, Name != PrimaryResult[i,2])
  NextResult <- Function1(Filtered)
  # store only this iteration's one-row result
  results_list[[i+1]] <- Function2(NextResult)
}
# bind all rows in a single call after the loop
ResultsDF <- do.call(rbind, results_list)
This is not perfect, but it should speed things up a bit.
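For a version that can actually be run, here is a sketch with lapply and a single do.call(rbind, ...). Function1 and Function2 below are invented stand-ins, since the real ones are not shown in the question, and base subsetting replaces dplyr::filter to keep the example dependency-free.

```r
# Hypothetical stand-ins for the question's Function1 and Function2
Function1 <- function(df) {
  data.frame(rank = seq_len(nrow(df)), Name = sort(df$Name),
             stringsAsFactors = FALSE)
}
Function2 <- function(res) {
  data.frame(n = nrow(res), first = res$Name[1], stringsAsFactors = FALSE)
}

DF <- data.frame(Name = c("Bob", "Joe", "Sally"), stringsAsFactors = FALSE)
PrimaryResult <- Function1(DF)

# One alternative result per excluded name, each a one-row data frame
alternatives <- lapply(seq_len(nrow(PrimaryResult)), function(i) {
  Filtered <- DF[DF$Name != PrimaryResult[i, 2], , drop = FALSE]
  Function2(Function1(Filtered))
})

# Primary result first, then all alternatives, bound once
ResultsDF <- do.call(rbind, c(list(Function2(PrimaryResult)), alternatives))
```

Whether this helps much depends on where the time goes: if Function1 itself is the slow part, removing the repeated rbind will only give a small gain.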

Faster alternative methods to for-loop in R for pattern matching

I am working on a problem in which I have two data frames, data and Abbreviation, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I have been using for-loops in the following manner:
abb <- c()
for(i in 1:length(data$text)){
  for(j in 1:length(Abbreviation$abb)){
    abb <- paste("(\\b", Abbreviation$abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, Abbreviation$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame can be generated using the following code:
Abbreviation <- c(c("hru", "how are you"),
c("asap", "as soon as possible"),
c("bf", "boyfriend"),
c("ur", "your"),
c("u", "you"),
c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And the data is merely a data frame with one column holding a text string in each row, which can also be generated using the following code.
data <- data.frame(unlist(c("its good to see you, hru doing?",
"I am near bridge come ASAP",
"Can u tell me the method u used for",
"afk so couldn't respond to ur mails",
"asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has increased to almost 50000 observations, and I am having difficulty processing it because the two nested for-loops make the process very slow. Can you suggest some better alternatives to the for-loop and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, then please suggest how to do that as well.
Thanks for the help!
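This should be faster, and it avoids side effects.
mapply(function(x,y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
},Abbreviation$abb,Abbreviation$Fullform)
gsub is vectorized, so you can give it a whole character vector in which matches are sought. Here I give it data$text.
I use mapply to avoid the side effects of the for loop. Note that each call works on the original data$text, so the result has one column per abbreviation rather than a single vector with every replacement applied.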
First of all, clearly there is no need to compile the regular expressions with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, very often you can use a vector where a value could do -- and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)
for (j in seq_along(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case = TRUE)
}
The above code works with the example data.
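The same chained replacement can also be written without an explicit loop by folding over the patterns with Reduce. This is only a sketch of that alternative, not a claim that it is faster: it still calls gsub once per abbreviation on the whole vector. The names abbreviations, text, patterns and expanded are chosen here for illustration.

```r
abbreviations <- data.frame(
  abb = c("hru", "asap", "bf", "ur", "u", "afk"),
  Fullform = c("how are you", "as soon as possible", "boyfriend",
               "your", "you", "away from keyboard"),
  stringsAsFactors = FALSE
)
text <- c("its good to see you, hru doing?",
          "I am near bridge come ASAP",
          "Can u tell me the method u used for")

# Build every whole-word pattern once, outside the replacement step
patterns <- paste0("\\b", abbreviations$abb, "\\b")

# Apply the replacements one after another, each on the previous result
expanded <- Reduce(
  function(txt, j) gsub(patterns[j], abbreviations$Fullform[j], txt,
                        ignore.case = TRUE),
  seq_along(patterns),
  text
)
```

Unlike the mapply version, each step here receives the output of the previous one, so the final vector has all abbreviations expanded.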
