Optimize lapply for a distance matrix function in R

I am trying to find the number of clusters from an HDBSCAN analysis of cell coordinates, grouped by an image ID, in a data frame.
My approach so far is to split the data frame containing the ID, X, and Y columns by the ID and then use lapply to run a function on each element, as follows:
library(dbscan) #for hdbscan() (assuming the dbscan package)
library(dplyr)  #for %>% and bind_rows()

dlist <- split(d[, -c(1)], d$ID) #subgroup dataframe "d" as a list and remove the ID column
cls <- function(x) {
  dm <- dist(x, method = "euclidean", p = 2) %>% as.matrix() #distance matrix for each image ID's X,Y coordinates
  cl <- hdbscan(dm, minPts = 3) #run unsupervised cluster analysis on the matrix
  lv <- length(cl$cluster_scores)
  return(lv) #return the cluster number for each image ID
}
ClusterNumbers <- lapply(dlist, FUN = cls) %>% bind_rows()
I know the cluster analysis methodology may not be the most robust, but it is just a proof of concept at present. My issue is that this method is obviously painfully slow, so I am looking for a way (short of submitting this to the uni HPCC) to make the process more efficient and quicker to run. I have tried generating the matrices prior to the cluster analysis, etc., but the amount of data prohibits this as I cannot assign vectors that large.
Any help would be awesome.
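One possible speed-up (a sketch of my own under stated assumptions, not a tested answer): hdbscan() from the dbscan package accepts a dist object directly, so the expensive as.matrix() conversion can be dropped, and the per-image calls can be spread across cores with parallel::mclapply() (fork-based, so Unix-alikes only; mc.cores = 4 is just a placeholder):

#a sketch, not a verified solution
library(dbscan)    #hdbscan()
library(parallel)  #mclapply()

cls_fast <- function(x) {
  cl <- hdbscan(dist(x), minPts = 3) #pass the dist object straight to hdbscan()
  length(cl$cluster_scores)          #number of clusters for this image ID
}
ClusterNumbers <- mclapply(dlist, cls_fast, mc.cores = 4) #named list of cluster counts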

Related

How to iterate a given process 1'000 times and average the results

I am here to ask about the R language and how to construct a loop to iterate some functions several times.
Here is my problem: I have a numeric matrix obtained from previous analyses (matrix1) that I want to compare, using the overlap function (which returns a single value), with another numeric matrix that I get by extracting values of a given raster at a set of randomly created points, as many points as there are values in the first matrix.
I want to repeat the random sampling procedure 1'000 times, in order to get 1'000 different sets of random points, then repeat the comparison with matrix1 1'000 times (one for each set of random points), and, in the end, calculate the mean of the results to get a single value.
Below is an example of the functions I want to use:
library(sf)    #st_sample(), st_is_empty()
library(terra) #vect(), extract()

#matrix1 is the first matrix, obtained before starting the potential loop;
#LineVector is a polyline shapefile that has to be used within the loop and loaded before it;
#Raster is a raster from which I should extract values at the point locations;
#The loop should start from here:
Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
Random_points <- Random_points[!st_is_empty(Random_points)]
Random_points_vect <- vect(Random_points)
Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
Random_values <- na.omit(Random_values[, c("Capriolo")])
Values_list <- list(matrix1, Random_values)
Overlapping_value <- overlap(Values_list, type = "2")
#This value, obtained 1'000 times, then has to be averaged into a single number.
I hope I have posed my question in a clear and understandable manner, and I hope you can help me with this problem.
Thanks to everyone in advance, I wish you a good day!
An easy way I can figure out is to use replicate():
values <- replicate(1000, {
  Random_points <- st_sample(LineVector, size = 2000, exact = TRUE, type = "random")
  Random_points <- Random_points[!st_is_empty(Random_points)]
  Random_points_vect <- vect(Random_points)
  Random_values <- terra::extract(Raster, Random_points_vect, ID = F, raw = T)
  Random_values <- na.omit(Random_values[, c("Capriolo")])
  Values_list <- list(matrix1, Random_values)
  Overlapping_value <- overlap(Values_list, type = "2")
  Overlapping_value
})
mean(values)

What is the fastest way to perform an exhaustive search in R

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff splits the set of features N into several random subsets Ns, where Ns << N. It then calculates the Relieff weights for the features in each subset Ns. For each feature, the final weight is the highest weight assigned among the different subsets in which that particular feature appears.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (scaled down here to 100 features to make it easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

#create fake data for the example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
  as_tibble()
outcome <- rnorm(100)

#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1, 10000)) %dopar% {
  #set seed for consistent results among different runs
  set.seed(x)
  #subsample the feature table by extracting a subset of columns
  subset_index <- sample(seq(1, ncol(fake_table)), size = round(ncol(fake_table)*.01))
  subset_matrix <- fake_table[, subset_index]
  #append the outcome as the last column of the subset
  subset_matrix[, ncol(subset_matrix)+1] <- outcome
  #use attrEval from the CORElearn package to calculate the Relieff weights for the subset
  rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
  #create a one-row data frame with as many columns as features in the subset,
  #holding the Relieff weights
  rf_df <- rf_weights %>%
    unname() %>%
    matrix(., ncol = length(.), byrow = TRUE) %>%
    as_tibble() %>%
    set_colnames(., names(rf_weights))
}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, to go through the results of the different repetitions and find the maximum value obtained. I have managed to write working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked when testing with fewer features; again, it is scaled down here to 100 features and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
#get all feature names
feat_names <- colnames(fake_table)
#initialize an empty vector of zeros, named with the features
feat_weights <- rep(0, length(feat_names))
names(feat_weights) <- feat_names
#loop in parallel over the feature names; for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
  #initialize the weight as 0
  current_weight <- 0
  #for every element of result (i.e. every repetition of the subsampling process)
  for (el in 1:length(result)){
    #look up the new weight in that repetition's table
    new_weight <- result[[el]][[1, feat]]
    #skip if empty (i.e. the feature is not present in the current subset)
    if(is_empty(new_weight)){next}
    #if the new weight is higher than the current weight, keep it
    if (current_weight < new_weight){
      current_weight <- new_weight}}
  current_weight
}
end_time <- Sys.time()
end_time - start_time
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature?
If so, then why not just bind all results into one data frame (or data.table) and then get the max per column, like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
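If allResults is very wide, matrixStats::colMaxs() may be noticeably faster than apply(); this is a suggestion of mine, not part of the original answer:

#possible speed-up (my suggestion); returns an unnamed vector, names can be
#reattached from colnames(allResults)
library(matrixStats)
colMaxs(as.matrix(allResults), na.rm = TRUE)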
This follows #DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have everything we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

nr = 50L
nc = 200L

## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)

## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)

## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))

st = Sys.time()
for (i in seq_len(n_reps)) {
  set.seed(i)
  ## similar way to do data simulations as OP
  sub_cols = sample(seq_len(nc), nc_sample_size)
  sub_mat = cbind(mat[, sub_cols], outcome)
  rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  ## assign back to pre-allocated result
  res[i, sub_cols] = rf_weights
}
## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downside is that this loses the parallel workers. The upside is that we have fewer memory slowdowns because we allocate much of what we need up front.
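If the parallel workers matter, the same pre-allocation idea can be approximated with foreach by having each worker return one full-width row and binding the rows afterwards. This is a sketch of mine (assuming the doParallel backend registered earlier), not part of the original answer:

## a sketch, not from the original answer: keep the matrix layout but run in parallel
res_list <- foreach(i = seq_len(n_reps)) %dopar% {
  set.seed(i)
  sub_cols <- sample(seq_len(nc), nc_sample_size)
  sub_mat <- cbind(mat[, sub_cols], outcome)
  w <- attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  full_row <- numeric(nc)     ## one full-width row, zeros for unsampled features
  full_row[sub_cols] <- w
  full_row
}
res_par <- do.call(rbind, res_list)
colnames(res_par) <- colnames(mat)
apply(res_par, 2L, max)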
This is not a final answer, but I would suggest, since it is a numerical problem, writing a function in C++. This will increase the speed significantly, by some order of magnitude I would guess. In my opinion, using R for this very specific numerical task is just hitting a brick wall.
The first chapter of Rcpp for everyone says:
Chapter 1 Suitable situations to use Rcpp
R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.
Loop operations in which later iterations depend on previous iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.
Wickham's Advanced R has a good chapter on that topic too.
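As a minimal illustration of that suggestion (my sketch, not code from either answer), the column-wise maximum over the pre-allocated res matrix could be written in C++ via Rcpp::cppFunction():

#a sketch of the Rcpp route, assuming the 'res' matrix from the answer above
library(Rcpp)
cppFunction('
NumericVector colMaxRcpp(NumericMatrix x) {
  int nc = x.ncol();
  NumericVector out(nc);
  for (int j = 0; j < nc; ++j) {
    NumericVector col = x(_, j);  // column j
    out[j] = max(col);            // Rcpp sugar max()
  }
  return out;
}')
colMaxRcpp(res) #unnamed vector; names can be reattached from colnames(res)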

How to use lapply or purrr::map or any other fast way instead of "for loop" with lists?

I am calculating an index that needs a matrix of species x sites and a matrix of cophenetic distances between species (generated from a phylogenetic tree). This block of code gives the objects needed to calculate it (site and tree):
library(ape)#phylogenetic tree
library(picante)#ses.mpd calculation
library(purrr)#list of distance matrices
#Sample matrix
set.seed(0000)
site <- matrix(data = sample(c(0, 1), 15, prob = c(0.4, 0.6), replace = T), ncol = 5, nrow = 3)
colnames(site) <- c("t1", "t2", "t3", "t4", "t5")
rownames(site) <- c("samp1", "samp2", "samp3")
#Sample phylogenetic tree
tree <- rcoal(5)
#Reordering species names in the community to match the order in the tree
site <- site[, tree$tip.label]
From the output above, I need to calculate ses.mpd 100 times using the same community matrix, but changing the distance matrix (100 of them, stored in a 4 GB list). I used for loops to calculate ses.mpd, but I realised it would take more than a month to get the output! I have used lapply before, but I do not know how to use it this time, nor purrr::map. I have seen similar questions here: Apply a function to list of matrices, and here: Calculate function for all row combinations of two matrices in R, but neither of them really resembles my problem. Here is the code I used with the for loop (updated by #Parfait). I need any other way faster than a loop to get the same output. Any suggestion? Thank you very much!
#Empty list for the resolved trees
many.trees <- list()
#Creates 5 resolved trees with the function ape::multi2di
for(i in 1:5){
  many.trees[[i]] <- multi2di(tree)
}
#For each resolved tree, creates a distance matrix
many.dists <- map(many.trees, cophenetic)
#ses.mpd using each of the distance matrices above
out <- list()
for(i in 1:5){
  out[[i]] <- ses.mpd(site, many.dists[[i]]) # Thanks, #Parfait.
}
Consider an apply-family solution for more compact code, avoiding the bookkeeping of initializing empty lists and assigning into them.
# Creates 5 resolved trees with the function ape::multi2di
many.trees <- replicate(5, multi2di(tree), simplify = FALSE)
# For each resolved tree, creates a distance matrix
many.dists <- lapply(many.trees, cophenetic)
# ses.mpd using each of the distance matrices above
out_nested <- lapply(many.dists, function(d) ses.mpd(site, d))
To retain names (if present in the objects above), change lapply to sapply with simplify = FALSE (equivalent to lapply, but with USE.NAMES = TRUE). The results would then be named lists.
# For each resolved tree, creates a distance matrix
many.dists <- sapply(many.trees, cophenetic, simplify = FALSE)
# ses.mpd using each of the distance matrices above
out <- sapply(many.dists, function(d) ses.mpd(site, d), simplify = FALSE)
out$first_name
out$second_name
out$third_name
out$fourth_name
out$fifth_name
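Since the question also mentions purrr::map, the same step can be written with it; this one-liner is my addition rather than part of the original answer:

# equivalent with purrr (my addition); names are kept if many.dists is named
out_map <- purrr::map(many.dists, ~ ses.mpd(site, .x))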

Function to calculate Euclidean distance in R

I am trying to implement a KNN classifier in R from scratch on the iris data set, and as part of this I have written a function to calculate the Euclidean distance. Here is my code.
known_data <- iris[1:15,c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16,c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k, unk) {
  distance <- 0
  for(i in 1:nrow(k))
    distance[i] <- sqrt((k[,1][i] - unk[,1][i])^2 + (k[,2][i] - unk[,2][i])^2)
  return(distance)
}
euclidean_dist(known_data, unknown_data)
However, when I call the function it returns the first value correctly and the rest as NA.
Could anyone show me where I have gone wrong with the code?
Thanks in advance.
The aim is to calculate the distance between the ith row of known_data, and the single unknown_data point.
How to fix your code
When you calculate distance[i], you're trying to access the ith row of the unknown data point, which doesn't exist and is hence NA. I believe your code should run fine if you make the following edits:
known_data <- iris[1:15, c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16, c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k, unk) {
  # Make distance a vector [although not technically required]
  distance <- rep(0, nrow(k))
  for(i in 1:nrow(k))
    # Change unk[,1][i] to unk[1,1] and similarly for unk[,2][i]
    distance[i] <- sqrt((k[,1][i] - unk[1,1])^2 + (k[,2][i] - unk[1,2])^2)
  return(distance)
}
euclidean_dist(known_data, unknown_data)
One final note: in the version of R I'm using, the iris dataset has a Species column as opposed to a Class column.
An alternative method
As suggested by #Roman Luštrik, the entire aim of getting the Euclidean distances can be achieved with a simple one-liner:
sqrt((known_data[, 1] - unknown_data[, 1])^2 + (known_data[, 2] - unknown_data[, 2])^2)
This is very similar to the function you wrote, but does it in vectorised form, rather than through a loop, which is often a preferable way of doing things in R.
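The same vectorised idea generalises to any number of numeric feature columns; the snippet below is a sketch of mine (not from the original answer), assuming unknown_data is a single row:

# a sketch (my addition): vectorised Euclidean distance over several columns
num_cols <- c("Sepal.Length", "Petal.Length")
sqrt(rowSums(sweep(as.matrix(known_data[, num_cols]), 2,
                   unlist(unknown_data[, num_cols]), "-")^2))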
The best and fastest way is to use the h2o package:
#load library
library(h2o)
#initialize the node
h2o.init()
#transform the data frames to h2o type
known_data <- as.h2o(known_data)
unknown_data <- as.h2o(unknown_data)
#create a frame in which the distances are going to be recorded
matrix1 <- h2o.createFrame(rows = nrow(known_data), cols = nrow(unknown_data))
#loop to calculate the distance between all the rows of both frames
for(i in 1:nrow(unknown_data)){
  matrix1[, i] <- as.data.frame(h2o.distance(known_data, unknown_data[i, ], "l2"))
}

Output multiple vectors from for loop in R

As someone relatively new to R I'm having an issue with creating a for loop.
I have a very large data set with 9000 observations and 25 categorical variables, which I've transformed into binary data and performed hierarchical clustering on. Now I want to try K-Modes clustering to produce an Elbow Plot using the "within-cluster simple-matching distance for each cluster", which is output by kmodes$withindiff. I can sum this for each of the k in 1:8 clusters to get the Elbow Plot.
library(klaR)
WCSM <- rep(0, 8) #pre-allocate the vector of summed distances
for(k in 1:8){
  WCSM[k] <- sum(kmodes(data, k, iter.max = 100)$withindiff)
}
plot(1:8, WCSM, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Simple-Matching Distance Summed",
     main = "K-modes Elbow Plot")
My issue is that I want further output from k-modes. For each k in 1:8 I would like to get the vector of integers indicating the cluster to which each object is allocated, given by kmodes$cluster. I need to create a for loop that loops through each k in 1:8 and saves each of the outputs into 8 separate vectors, but I don't know how to write such a loop. I could just run the 8 lines of code below separately, but each takes 15 minutes with iter.max=10, so increasing this to iter.max=100 would need to be left running overnight, and a loop would be useful.
cl.kmodes2=kmodes(data, 2,iter.max=100)
cl.kmodes3=kmodes(data, 3,iter.max=100)
cl.kmodes4=kmodes(data, 4,iter.max=100)
cl.kmodes5=kmodes(data, 5,iter.max=100)
cl.kmodes6=kmodes(data, 6,iter.max=100)
cl.kmodes7=kmodes(data, 7,iter.max=100)
cl.kmodes8=kmodes(data, 8,iter.max=100)
Ultimately I want to compare the results from the hierarchical binary clustering to the k-modes clustering by getting the Adjusted Rand Index. For example, cutting the tree at k=4 for the hierarchical cluster and comparing this to a 4 cluster solution from k-modes:
dist.binary = dist(data, method="binary")
cl.binary = hclust(dist.binary, method="complete")
hcl.4 = cutree(cl.binary, k = 4)
tab = table(hcl.4, cl.kmodes4$cluster)
library(e1071)
classAgreement(tab)
I agree with Imo, using a list is the best solution.
If you don't want to do that, you could also use assign() to create a new object in every iteration:
library(klaR)
for(k in 1:8) {
  assign(paste("cl.kmodes", k, sep = ""), kmodes(data, k, iter.max = 100))
}
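If you later want those assign()-created objects gathered back into a single list, base R's mget() can collect them by name (my addition, not part of the original answer):

# my addition: collect the assigned objects into a named list
cl.list <- mget(paste0("cl.kmodes", 1:8))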
The best method is to put the output from your clusters into a named list:
library(klaR)
myClusterList <- list()
for(k in 1:8) {
  myClusterList[[paste0("k.", k)]] <- kmodes(data, k, iter.max = 100)
}
You can then pull out any of the contents easily:
sum(myClusterList[["k.1"]]$withindiff)
or
sum(myClusterList[[1]]$withindiff)
You can also save the list to use in future R sessions, see ?save.
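Building on that list, both the elbow plot and the per-k cluster assignments asked about in the question can be pulled out afterwards; this is a sketch of mine, not part of the original answer:

# my addition: reuse the stored list for the elbow plot and the cluster vectors
WCSM <- sapply(myClusterList, function(cl) sum(cl$withindiff))
plot(1:8, WCSM, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Simple-Matching Distance Summed",
     main = "K-modes Elbow Plot")
cluster_vectors <- lapply(myClusterList, `[[`, "cluster") #one integer vector per k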
