How to use lapply or purrr::map or any other fast way instead of "for loop" with lists? - r

I am calculating a index that needs a matrix of species x sites and a matrix of cophenetic distances between species (generated from a phylogenetic tree). This block of code gives the objects needed to calculate it (site and tree):
library(ape)#phylogenetic tree
library(picante)#ses.mpd calculation
library(purrr)#list of distance matrices
#Sample matrix
set.seed(0000)
site <- matrix(data = sample(c(0, 1), 15, prob = c(0.4, 0.6), replace = T), ncol = 5, nrow = 3)
colnames(site) <- c("t1", "t2", "t3", "t4", "t5")
rownames(site) <- c("samp1", "samp2", "samp3")
#Sample phylogenetic tree
tree <- rcoal(5)
#Reordering species names in the community to match the order in the tree
site <- site[, tree$tip.label]
From the output above, I need to calculate ses.mpd 100 times using the same community matrix, but changing the distance matrix (100 of them stored in a list of 4gb). I used for loops to calculate ses.mpd, but I realised that it would take more than a month to get the output! I have used lapply before, but I do not know how to use it this time, neither purrr::map. I have seen similar questions here: Apply a function to list of matrices and here:Calculate function for all row combinations of two matrices in R, but none of them actually resembles my problem. Here is the code I used with for loop (updated by #Parfait). I need any other way faster than a loop to get the same output. Any suggestion? Thank you very much!
#Empty list for the resolved trees
many.trees <- list()
#Creates 5 resolved trees with the function ape::multi2di
for(i in 1:5){
many.trees[[i]] <- multi2di(tree)
}
#For each resolved tree, creates a distance matrix
many.dists <- map(many.trees, cophenetic)
#ses.mpd using each of the distance matrices above
out <- list()
for(i in 1:5){
out.2[[i]] <- ses.mpd(site, many.dists[[length(many.dists)]])# Thanks, #Parfait.
}

Consider an apply family solution for more compact code and avoid bookkeeping of initializing empty lists and assigning to it.
# Creates 5 resolved trees with the function ape::multi2di
many.trees <- replicate(5, multi2di(tree), simplify = FALSE)
# For each resolved tree, creates a distance matrix
many.dists <- lapply(many.trees, cophenetic)
# ses.mpd using each of the distance matrices above
out_nested <- lapply(many.dists, function(d) ses.mpd(site, d))
To retain names (if included in above methods), change lapply to sapply (equivalent to lapply with simplify=FALSE but maintains USE.NAMES=TRUE). The result would then be named lists.
# For each resolved tree, creates a distance matrix
many.dists <- sapply(many.trees, cophenetic, simplify = FALSE)
# ses.mpd using each of the distance matrices above
out <- sapply(many.dists, function(d) ses.mpd(site, d), simplify = FALSE)
out$first_name
out$second_name
out$third_name
out$fourth_name
out$fifth_name

Related

How to iterate a given process 1'000 times and average the results

I am here to ask you about R language and how to construct a loop to iterate some functions several times.
Here is my problem: I have a numeric matrix obtained from previous analyses (matrix1) that I want to compare (using the overlap function that results in a single value) with another numeric matrix that I get by extracting values of a given raster with a set of randomly created points, as many as the values in the first numeric matrix.
I want to repeat the random sampling procedure 1'000 times, in order to get 1'000 different sets of random points, then repeat the comparison with matrix1 1'000 times (one for each set of random points), and, in the end, calculate the mean of the results to get a single value.
Hereafter I give you an example of the functions I want to use:
#matrix1 is the first matrix, obtained before starting the potential loop;
#LineVector is a polyline shapefile that has to be used within the loop and downloaded before it;
#Raster is a raster from which I should extract values at points location;
#The loop should start from here:
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
#This value, obtained 1'000 times, has then to be averaged into a single number.
I hope I have posed my question in a clear and understandable manner, and I hope you can help me with this problem.
Thanks to everyone in advance, I wish you a good day!
Easy way i can figure out is to use "replicate":
values <- replicate(1000, {
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
Overlapping_value
})
mean(values)

Optimize lapply for distance matrix function R

I am trying to find the cluster number from HDBSCAN analysis of cell coordinates grouped by an image ID in a dataframe.
My approach so far is to split the dataframe containing the ID, X, and Y columns by the ID and then use lapply to run a function on each element as such:
dlist <- split(d[, -c(1)], d$ID) #subgroup dataframe "d" as list and remove the ID column
cls <- function(x) {
dm <- dist(x, method = "euclidean", p = 2) %>% as.matrix() #run distance matrix for each imageID's X,Y coordinates
cl <- hdbscan(dm, minPts = 3) #run unsupervised cluster analysis on matrix
lv <- length(cl$cluster_scores)
return(lv) #return the cluster number for each image ID
}
ClusterNumbers <- lapply(dlist, FUN = cls) %>% bind_rows()
I know the cluster analysis methodology may not be the most robust but it is just a proof of concept at present. My issue currently is that this method is obviously painfully slow, so I am looking for a way (short of submitting this to the uni HPCC) to make this process more efficient and run quicker. I have tried generating the matrices prior to the cluster analysis etc but the number of data prohibits this as I cannot assign vectors that large.
Any help would be awesome.

What is the fastest way to perform an exhaustive search in R

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff split the set of features N into several random subsets Ns where Ns << N. Then it calculates the Relieff weights for the features in the subset Ns. For each feature, the final weight will be the highest weight assigned among the different subsets were that particular feature appear.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (that is scaled down to 100 features in order to be easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
#create fake data for example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
as_tibble()
outcome <- rnorm(100)
#create fake data for example
#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1,10000)) %dopar% {
#set seed for results consistency among different run
set.seed(x)
#subsample the features table by extracting a subset of columns
subset_index <- sample(seq(1,ncol(fake_table)),size = round(ncol(fake_table)*.01))
subset_matrix <- fake_table[,subset_index]
#assign the outcome as last column of the subset
subset_matrix[,ncol(subset_matrix)+1] <- outcome
#use the function attrEval from the CORElearn package to calculate the Relieff weights for the subset
rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
#create a data frame with as many columns as features in the subset and only one row
#with the Relieff weigths
rf_df <- rf_weights %>%
unname() %>%
matrix(., ncol = length(.), byrow = TRUE) %>%
as_tibble() %>%
set_colnames(., names(rf_weights))}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, to go into the results of the different repetitions and find the maximum value obtained. I have managed to write a working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked on testing with fewer features - again, here it is scaled down to 100 features and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
#get all features name
feat_names <- colnames(fake_table)
#initalize an empty vector of zeros, with the names of the features
feat_wegiths <- rep(0, length(feat_names))
names(feat_wegiths) <- feat_names
#loop in parallel on the features name, for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
#initalize the weight as 0
current_weigth <- 0
#for all element in result (i.e. repetitions of the subsampling process)
for (el in 1:length(result)){
#assign new weight accessing the table
new_weigth <- result[[el]][[1,feat]]
#skip is empty (i.e. the features is not present in the current subset)
if(is_empty(new_weigth)){next}
#if new weight is higher than current weight assign the value to current weight
if (current_weigth < new_weigth){
current_weigth <- new_weigth}}
current_weigth
}
end_time <- Sys.time()
end_time - start_time
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature?
if so, then why not just bind all results in one dataframe (or data.table), and then get the max per column like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
This follows #DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have every thing we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
nr = 50L
nc = 200L
## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)
## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)
## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))
st = Sys.time()
for (i in seq_len(n_reps)) {
set.seed(i)
## similar way to do data simulations as OP
sub_cols = sample(seq_len(nc), nc_sample_size)
sub_mat = cbind(mat[, sub_cols], outcome)
rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
## assign back to pre-allocated result
res[i, sub_cols] = rf_weights
}
## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downsides is that this loses the parallel workers. The upside is that we have less memory slowdowns because we're allocating much of what we need up front.
This is not a final answer, but I would suggest, since it is a numerical problem, to write a function in C++. This will increase the speed significantly, by some order of magnitude I would guess. In my oppinion, using R for this very specific numercial task is just hitting a brick wall.
The first chapter of Rcpp for everyone says:
Chapter 1 Suitable situations to use Rcpp
R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.
Loop operations in which later iterations depend on previous
iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.
Wickham's Advanced R has a good chapter on that topic too.

Find slope by linear regression of 2 matrices (R)

I have 2 matrices. One contains the quantities a client bought of products. The matrix looks like this quantitymatrix:
the other one contains unitprices at which a client bought the products. The matrix looks like this pricematrix:
How can I run a linear regression with the matrices so that I obtain the slope for each product?
Your data:
quantity <- matrix(c(4,2,6, 9,4,3, 1,1,2, 3,1,5), 3, 4)
price <- matrix(c(1,0.5,8, 4.2,1.2,2, 2,5,2, 1,2.5,1), 3, 4)
First, you have to transform your two matrices into a single data frame. (Although you can avoid that if you want, but I think it makes it much more straightforward if you do so):
df <- data.frame(quantity = as.numeric(quantity),
price = as.numeric(price),
product = rep(1:4, each = 3), ID = 1:3)
Then, run the linear models by groups:
lms <- by(df, df$product, FUN = function(x) lm(price~quantity, data = x))
And get the slopes:
slopes <- sapply(lms, coef)[2,]
If however, you want to keep the orignial matrices as they are, you can run a simple loop:
slopes <- numeric(dim(price)[2])
for (i in 1:dim(price)[2]) {
model <- lm(price[,i]~quantity[,i])
slopes[i] <- coef(model)[2]
}
NB: this solution assumes that the two matrices have identical dimensions.
And if you want to avoid loops, the following solution may be faster:
f <- function(x,y) coef(lm(y~x))[2]
l <- function(m) lapply(seq_len(ncol(m)), function(i) m[,i])
mapply(f, l(quantity), l(price))

Simulating multivariate normals with different covariance matrices

I'm trying to simulate draws from multivariate normals with different covariance matrices (after Gibbs sampling). I have used rmvnorm (in the library mvtnorm) inside a loop, but that is very slow. I have also tried to use an array and pass it to rmvnorm, but didn't suceed, either. I've used apply, and also tried sapply on a list, but it isn't significantly faster.
Is there a fast way to do this?
Thanks!
EDIT: My code looks something like this (with more dimensions and simulations). I'm using mvtnorm::rmvnorm.
library(mvtnorm)
covs = array(dim=c(5,5,2))
covs[,,1] = diag(5)
covs[,,2] = 5*diag(5)
sample = matrix(nrow=2, ncol=5)
for (i in 1:2) {
sample[i,] = rmvnorm(1, sigma=covs[,,i])
}
apply(covs, 3, function(x) rmvnorm(1, sigma=x)) # tried apply
l1 = list() # also tried with a list
l1[[1]] = diag(5)
l1[[2]] = 5*diag(5)
sapply(l1, function(x) rmvnorm(1, sigma=x))

Resources