I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10,20,30,40,50,60,70,80)
b <- c("a","b","c","d","z","g","h","r")
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
columns <- 6
values <- vector(length=columns)
solution <- matrix(nrow=nrow(myframe), ncol=columns+3)
myfunction <- function(frame, columns){
  athing <- 0
  if(columns == 5){
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns + 1] <- athing
  return(values)
}
The problematic for-loop looks like this:
columns = 6
for(i in 1:nrow(myframe)){
  values <- myfunction(as.matrix(myframe[i,]), columns)
  values[columns+2] <- i
  values[columns+3] <- myframe[i,3]
  # more columns added with simple operations (e.g. sum)
  solution <- rbind(solution, values)
  # solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages regarding the size of solution, which seems to become too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really gotten any further than this, although I tried applying this very good introduction to parallel processing.
How can I reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight from the 2nd circle of hell from the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that in each iteration you enlarge your solution object and then copy it to a new object, which is very slow and can also lead to memory problems. One way around this is to create a list whose ith component will store the output of the ith loop iteration. The final step is to call rbind on that list (just once, at the end). This will look something like:
my.list <- vector("list", nrow(myframe))

for(i in 1:nrow(myframe)){
  # Call all necessary commands to create values
  my.list[[i]] <- values
}

solution <- rbind(solution, do.call(rbind, my.list))
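Applied to the loop from the question, the same idea can also be written with lapply, which is what the question asked for. This is only a sketch, assuming myfunction and columns as defined above:

columns <- 6
row_results <- lapply(seq_len(nrow(myframe)), function(i) {
  values <- myfunction(as.matrix(myframe[i, ]), columns)
  values[columns + 2] <- i
  values[columns + 3] <- myframe[i, 3]
  values
})
solution <- do.call(rbind, row_results)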
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
  athing <- 0
  if(columns == 5){
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns + 1] <- athing
  return(values)
}
apply(myframe, 1, myfunction)
If columns is not given via the environment, you can use:
apply(myframe, 1, myfunction, columns)
with your original myfunction definition.
I am trying to create a vector or list of values based on the output of a function performed on individual elements of a column.
library(hpoPlot)
xyz_hpo <- c("HP:0003698", "HP:0007082", "HP:0006956")
getallancs <- function(hpo_col) {
  for (i in 1:length(hpo_col)) {
    anc <- get.ancestors(hpo.terms, hpo_col[i])
    output <- list()
    output[[length(anc) + 1]] <- append(output, anc)
  }
  return(anc)
}
all_ancs <- getallancs(xyz_hpo)
get.ancestors outputs a character vector of variable length depending on each term. How can I loop through hpo_col adding the length of each ancs vector to the output vector?
Welcome to Stack Overflow :) Great job on providing a minimal reproducible example!
As mentioned in the comments, you need to move the output <- list() outside of your for loop, and return it after the loop. At present it is being reset for each iteration of the loop, which is not what you want. I also think you want to return a vector rather than a list, so I have changed the type of output.
Also, in your original question, you say that you want to return the length of each anc vector in the loop, so I have changed the function to output the length of each iteration, rather than the whole vector.
getallancs <- function(hpo_col) {
  output <- numeric()
  for (i in 1:length(hpo_col)) {
    anc <- get.ancestors(hpo.terms, hpo_col[i])
    output <- append(output, length(anc))
  }
  return(output)
}
If you are only doing this for a few cases, such as your example, this approach will be fine. However, this paradigm is typically quite slow in R, and it's better to try to vectorise this style of calculation if possible. This is especially important if you are running it for a large number of elements, where computation will take more than a few seconds.
For example, one way the function above could be vectorised is like so:
all_ancs <- sapply(xyz_hpo, function(x) length(get.ancestors(hpo.terms, x)))
If in fact you did mean to output the whole vector of anc, not just the lengths, the original function would look like this:
getallancs <- function(hpo_col) {
  output <- character()
  for (i in 1:length(hpo_col)) {
    anc <- get.ancestors(hpo.terms, hpo_col[i])
    output <- c(output, anc)
  }
  return(output)
}
Or a vectorised version could be
all_ancs <- unlist(lapply(xyz_hpo, function(x) get.ancestors(hpo.terms, x)))
Hope that helps. If it solves your problem, please mark this as the answer.
Incredibly basic question. I'm brand new to R. I feel bad for asking, but also like someone will crush it:
I'm trying to generate a number of vectors with a for loop, each with a unique name numbered by iteration. The code I'm attaching throws an error, but I think it explains what I'm trying to do in principle fairly well.
Thanks in advance.
vectorBuilder <- function(num){
  for (x in num){
    paste0("vec", x) <- rnorm(10000, mean = 0, sd = 1)
  }
}
numSeries <- 1:10
vectorBuilder(numSeries)
You can write the function to return a named list :
create_vector <- function(n) {
setNames(replicate(n, rnorm(10000), simplify = FALSE),
paste0('vec', seq_len(n)))
}
and call it as:
data <- create_vector(10)
data will be a list of length 10, with each element being a vector of length 10000. It is better to keep the data in this list instead of creating a lot of vectors in the global environment. However, if you still want separate vectors, you can use list2env:
list2env(data, .GlobalEnv)
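Either way, the vectors can then be accessed by name; a quick sketch (the names follow from the paste0("vec", ...) call above):

length(data$vec1)       # 10000
mean(data[["vec3"]])    # approximately 0, since rnorm defaults to mean = 0
mean(vec3)              # the same vector, available directly after list2env()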
I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
  matrix_of_sums[, i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold simulation results.
So, the apply function is looking up the dataframe columns values for each row in a 250,000 data set, computing the sum, and storing it in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R's parallel package, try:
library(parallel)
cl <- makeCluster(detectCores())
# Export the large objects so the PSOCK workers can see them
clusterExport(cl, c("simulation_results", "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
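As a small side note, a sketch of an alternative to the pairwise Reduce: do.call binds the whole list of row-sum vectors in a single call, which is usually faster for long lists.

ans <- do.call(cbind, matrix_of_sums)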
Without really having any applicable or sample data to go off of, the process would look like this:
1. Create a holding matrix (matrix_of_sums)
2. Loop by row through the variable table (dataframe_stuff_that_needs_lookup_from_simulation)
3. Find the matching indices within the simulation model (simulation_results)
4. Bind the rowSums into the holding matrix (matrix_of_sums)
I recreated a sample set which is meaningless but produces identical results, and the same approach should work for your data:
# Holding matrix which will be our end-goal
msums <- matrix(nrow = 2000, ncol = 0)

# Loop in parallel; each worker returns its vector of row sums.
# (An assignment with `<<-` inside mclapply would not propagate back from
# the forked workers, so the results are collected and bound afterwards.)
msums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i,]
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores())

# cbind the collected row sums into the holding matrix
msums <- cbind(msums, do.call(cbind, msums_list))
After using R for a while, I have distanced myself from using for loops for everything, but I still don't know how to cycle through names without using for loops. Whenever I am processing multiple things, I use for loops as a way to cover all my bases in one go. Here is a mock example of something I would do. Is there a simpler way to go about doing this?
names <- c("John_Doe","Jane_Doe")
employee <- vector(length = length(names))
for(i in 1:length(names)){
filename <- paste0(names[i],".csv")
employee[i] <- read.csv(filename,header = FALSE)
}
Not sure if it's simpler, but you could try this:
dfs <- lapply(seq_along(names), function(i) read.csv(paste0(names[i], ".csv"), header = FALSE))
Instead of looping, it applies the same function to each index from 1 through the length of your names vector.
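If you would rather keep the file names attached to the results, a small variation (just a sketch using the same names vector) is to iterate over the names themselves and label the resulting list with setNames:

dfs <- setNames(
  lapply(names, function(nm) read.csv(paste0(nm, ".csv"), header = FALSE)),
  names
)
dfs[["John_Doe"]]   # the data read from John_Doe.csv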
So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data.frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do this multiple times (let's say 10 times). By the end, I want each object (mantel.something) resulting from this function to be saved in my workspace with a name that relates it to each cycle (mantel.something1, mantel.something2, ..., mantel.something10).
I have the following code, so far:
sort.song<-function(DATA){
require(ade4)
for(i in 1:10){ # Am I using for correctly here?
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantel.numnotes[i]<<-mantel.rtest(coord.dist,num.notes.dist,nrepet=1000)
mantel.songdur[i]<<-mantel.rtest(coord.dist,songdur.dist,nrepet=1000)
mantel.hfreq[i]<<-mantel.rtest(coord.dist,hfreq.dist,nrepet=1000)
mantel.lfreq[i]<<-mantel.rtest(coord.dist,lfreq.dist,nrepet=1000)
mantel.bwidth[i]<<-mantel.rtest(coord.dist,bwidth.dist,nrepet=1000)
mantel.hfreqlnote[i]<<-mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
}
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly for each mantel.somenthing object.
Many thanks in advance!
The best way to implement what you are trying to do is through a list. You can even make it take two indices, the first for the iterations, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
...
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song<-function(DATA){
require(ade4)
mantellist <- as.list(1:10)
for(i in 1:10){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq=mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth=mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote=mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
)
}
return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than have the function run the loop, write the sort.song function to run one iteration of the process; then you can use replicate to repeat that process however many times you desire.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of each iteration of this process in a single object. replicate will return an array (if possible), otherwise a list (in the example below, with simplify = FALSE, a list of lists). So the list will have 10 elements (one for each iteration), and each element will itself be a list containing named elements corresponding to each result of mantel.rtest.
sort.song<-function(DATA){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist <- dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist <- dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist <- dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist <- dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist <- dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist <- dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
return(list(
numnotes = mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur = mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq = mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq = mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth = mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote = mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
))
}
require(ade4)
results <- replicate(10, sort.song(DATA), simplify = FALSE)
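Each replicate can then be pulled out by position and each test by name, e.g. (a short sketch assuming the assignment above):

results[[3]]$songdur      # mantel test for song duration from the 3rd replicate
results[[3]][["hfreq"]]   # the same kind of indexing, by name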