Speed of Daisy Function in R

I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing which is quite similar mathematically to the Euclidean distance function. However, when I time my function against the one implemented in the daisy function from the cluster package, I find quite a significant difference in speed, with daisy performing much better. Given that (I'm assuming) a dissimilarity measure would require O(n² × p) time due to the need to compare each object to every other object over all variables (where n is the number of objects and p is the number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation.

I present the code I have used both to implement and to test below. I have tried looking through the R source code for the implementation of the daisy function, but I found it difficult to understand; I found no nested for loop. Any help with understanding why this function performs so fast, and how I could modify my code to reach similar speed, would be highly appreciated.
euclidean <- function (df){
  no_obj <- nrow(df)
  dist <- array(0, dim = c(no_obj, no_obj))
  for (i in 1:no_obj){
    for (j in 1:no_obj){
      dist_v <- 0
      if (i != j){
        for (v in 1:ncol(df)){
          dist_v <- dist_v + sqrt((df[i, v] - df[j, v])^2)
        }
      }
      dist[i, j] <- dist_v
    }
  }
  return(dist)
}
data("iris")
tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))
tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))

One option:
euclidean3 <- function(df) {
  require(data.table)
  n <- nrow(df)
  i <- CJ(1:n, 1:n)  # generate all row combinations
  dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2))  # loop over columns
  dv <- rowSums(dl)  # sum values of columns
  d <- matrix(dv, n, n)  # fill in matrix
  d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
The largest bottleneck in your code is selecting data.frame elements inside the loop (df[i, v] and df[j, v]). Converting the data frame to a matrix should also improve speed, since element access on a matrix is much cheaper. There may well be a more performant approach already on Stack Overflow; you just need to search with the right keywords.
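To illustrate the point about matrices, here is a minimal sketch I'm adding (not code from the answer; euclidean_mat is a hypothetical name): convert to a plain matrix once and vectorise the two inner loops with outer(), which should be far faster than the triple loop over data.frame elements.
euclidean_mat <- function(df) {
  # note: sqrt((a - b)^2) in the original code is just abs(a - b),
  # so summing absolute pairwise differences per variable matches euclidean() above
  m <- unname(as.matrix(df))
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (v in seq_len(ncol(m))) {
    # all pairwise absolute differences for one variable at once
    d <- d + abs(outer(m[, v], m[, v], "-"))
  }
  d
}
all.equal(euclidean(iris[, 1:4]), euclidean_mat(iris[, 1:4]))  # should be TRUE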

Related

What is the fastest way to perform an exhaustive search in R

I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff splits the set of features N into several random subsets Ns, where Ns << N. It then calculates the Relieff weights for the features in each subset Ns. For each feature, the final weight is the highest weight assigned to it across the different subsets in which that feature appears.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (scaled down here to 100 features to make it easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

# create fake data for the example
fake_table <- matrix(rnorm(100 * 100), ncol = 100) %>%
  as_tibble()
outcome <- rnorm(100)

# VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25,        # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1, 10000)) %dopar% {
  # set seed for consistency of results across different runs
  set.seed(x)
  # subsample the feature table by extracting a subset of columns
  subset_index <- sample(seq(1, ncol(fake_table)), size = round(ncol(fake_table) * .01))
  subset_matrix <- fake_table[, subset_index]
  # assign the outcome as the last column of the subset
  subset_matrix[, ncol(subset_matrix) + 1] <- outcome
  # use attrEval from the CORElearn package to calculate the Relieff weights for the subset
  rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
  # create a one-row data frame with as many columns as features in the subset,
  # holding the Relieff weights
  rf_df <- rf_weights %>%
    unname() %>%
    matrix(., ncol = length(.), byrow = TRUE) %>%
    as_tibble() %>%
    set_colnames(., names(rf_weights))
}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, going through the results of the different repetitions and finding the maximum value obtained. I have managed to write working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked when testing with fewer features; again, it is scaled down here to 100 features and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25,        # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
# get all feature names
feat_names <- colnames(fake_table)
# initialize an empty vector of zeros, named after the features
feat_wegiths <- rep(0, length(feat_names))
names(feat_wegiths) <- feat_names
# loop in parallel over the feature names; for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
  # initialize the weight as 0
  current_weigth <- 0
  # for every element in result (i.e. every repetition of the subsampling process)
  for (el in 1:length(result)) {
    # look up the new weight in the table
    new_weigth <- result[[el]][[1, feat]]
    # skip if empty (i.e. the feature is not present in the current subset)
    if (is_empty(new_weigth)) {next}
    # if the new weight is higher than the current weight, keep it
    if (current_weigth < new_weigth) {
      current_weigth <- new_weigth
    }
  }
  current_weigth
}
end_time <- Sys.time()
end_time - start_time
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature?
If so, why not just bind all results into one data frame (or data.table) and then take the max per column, like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
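As a small follow-up (my addition, assuming allResults is the data.table returned by rbindlist above), the same column-wise maximum can also be written in data.table syntax, which avoids the matrix coercion apply() performs:
# equivalent to apply(allResults, 2, max, na.rm = TRUE), but returns a one-row data.table
allResults[, lapply(.SD, max, na.rm = TRUE)]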
This follows #DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work; instead, we have everything we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
nr = 50L
nc = 200L
## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)
## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)
## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))
st = Sys.time()
for (i in seq_len(n_reps)) {
  set.seed(i)
  ## similar way to do data simulations as OP
  sub_cols = sample(seq_len(nc), nc_sample_size)
  sub_mat = cbind(mat[, sub_cols], outcome)
  rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  ## assign back to pre-allocated result
  res[i, sub_cols] = rf_weights
}
## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downside is that this loses the parallel workers. The upside is that we have fewer memory slowdowns because we allocate much of what we need up front.
This is not a final answer, but I would suggest, since it is a numerical problem, writing the function in C++. This will increase the speed significantly, by an order of magnitude I would guess. In my opinion, using R for this very specific numerical task is just hitting a brick wall.
The first chapter of Rcpp for everyone says:
Chapter 1 Suitable situations to use Rcpp
R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.
Loop operations in which later iterations depend on previous iterations.
Accessing each element of a vector/matrix.
Recurrent function calls within loops.
Changing the size of vectors dynamically.
Operations that need advanced data structures and algorithms.
Wickham's Advanced R has a good chapter on that topic too.
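Following that advice, here is a minimal Rcpp sketch (my illustration, not code from either answer; colMaxC is a hypothetical name) of the column-wise maximum over the pre-allocated res matrix built in the previous answer:
library(Rcpp)
cppFunction('
NumericVector colMaxC(NumericMatrix m) {
  int nr = m.nrow(), nc = m.ncol();
  NumericVector out(nc);
  for (int j = 0; j < nc; ++j) {
    double best = m(0, j);            // start from the first row of column j
    for (int i = 1; i < nr; ++i) {
      if (m(i, j) > best) best = m(i, j);
    }
    out[j] = best;
  }
  return out;
}')
# colMaxC(res)  # should agree with apply(res, 2L, max)
For a loop this simple the gain is modest; Rcpp pays off mainly for the kinds of iteration-dependent loops listed in the quoted chapter.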

Head/tail breaks classification algorithm in R

I am trying to implement the head/tail breaks classification algorithm in R (see here). This relatively new algorithm is a less computationally expensive alternative to other classification methods used in Cartography for highly skewed data.
So far, I have been using a Python implementation (see here) as a template, with relative success. Here is my implementation in R:
# fake data to classify
pareto_data <- c()
for (i in 1:100){
  pareto_data[i] <- (1.0 / i)^1.16
}
# head/tail breaks algorithm
ht <- function(data){
  ln <- length(data)
  mn <- mean(data)
  res <- append(c(), mn)  # this is where I was hoping to store my output
  head <- subset(data, data >= mn)
  while (length(head) >= 1 & length(head) / ln <= 0.40){
    print(res)
    return(ht(head))
  }
  # return(res)
}
ht(pareto_data)
As a result of running the above code, I have been able to print the following:
[1] 0.03849691
[1] 0.1779904
[1] 0.4818454
This output is very likely the same as that of the original Python code I have been using as a template. However, I have not been successful in storing it in either a vector or a list.
I would be really thankful if you could give hints on how to overcome this problem, and also on how to improve my code (which is not exactly the same as the original Python one, particularly in the conditions of the while statement).
A possible recursive version of the algorithm could be the following.
ht_breaks <- function(x){
  ht_inner <- function(x, mu){
    n <- length(x)
    mu <- c(mu, mean(x))
    h <- x[x > mean(x)]
    if (length(h) > 1 && length(h)/n <= 0.4){
      ht_inner(h, mu)
    } else mu
  }
  ht_inner(x, NULL)
}
pareto_data <- (1.0/(1:100))^1.16
ht_breaks(pareto_data)
#[1] 0.03849691 0.17799039 0.48184535
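For comparison with the OP's while-based attempt, the same stopping rule can also be written iteratively (this is my sketch, not part of the answer above; ht_breaks_iter is a hypothetical name), which makes it easier to see where the breaks accumulate:
ht_breaks_iter <- function(x){
  mu <- c()
  repeat {
    m <- mean(x)
    mu <- c(mu, m)          # store the current break
    h <- x[x > m]           # keep the head
    # stop when the head is too small or is no longer a minority (<= 40%) of the data
    if (length(h) <= 1 || length(h) / length(x) > 0.4) break
    x <- h
  }
  mu
}
ht_breaks_iter(pareto_data)
# should give the same breaks as ht_breaks(pareto_data)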

Pairwise Dijkstra's with early termination according to hop-count in R

I am looking for the most computation- and memory-friendly approach to computing particular entries of the distance matrix D obtained by pairwise Dijkstra's algorithm in R. More precisely, I only need D[i,j] if the hop-count (unweighted) distance between node i and node j is at most a particular integer k (D[i,j] itself may be computed as a weighted shortest-path length whose number of hops may be greater than k). D should be encoded as a sparse matrix for memory efficiency.
I was wondering if there has been some work done on this or if there is an efficient approach towards optimizing the current igraph functions to account for this restriction. E.g., early exit in pairwise Dijkstra's algorithm could really improve the efficiency of solving my problem.
I have tried to make this as efficient as possible myself, but with no luck so far. Some first attempt is illustrated below.
library(igraph)
library(Matrix)
library(spam)
# Hoping this is the more efficient one
bounded_hop_pairG_1 <- function(G, k = 2){
  to <- ego(G, order = k)
  D <- sparseMatrix(i = unlist(lapply(1:length(V(G)), function(v) rep(v, length(to[[v]])))),
                    j = unlist(to),
                    x = unlist(lapply(1:length(V(G)), function(v) distances(G, v = v, to = to[[v]]))))
  return(D)
}
# Hoping this is the less efficient one
bounded_hop_pairG_2 <- function(G, k = 2){
  D <- distances(G)
  D[distances(G, weight = NA) > k] <- 0
  return(as.spam(D))
}
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
# Check whether 'distances' actually implements early termination
start_time <- Sys.time()
d1 <- distances(G, v=1)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.00497961 secs
start_time <- Sys.time()
d2 <- distances(G, v=1, to=521)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.002238274 secs (consistently smaller than above)
start_time <- Sys.time()
D1 <- bounded_hop_pairG_1(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 2.671333 secs
start_time <- Sys.time()
D2 <- bounded_hop_pairG_2(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 1.101419 secs
Though I suspect my first function applies early termination and never stores the full pairwise distance matrix, it appears to be much less efficient in terms of computation time than my second function (which performs both a full weighted and a full unweighted pairwise distance computation). Hence, I was hoping somebody could point out the most efficient way to implement the first function in R.
You could try the cppRouting package, available via GitHub.
It provides functions like get_distance_matrix(), which can use all cores.
library(cppRouting)
library(igraph)
library(spam)
library(Matrix)
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
# Graph to data frame
G2 <- as_long_data_frame(G)
# Weighted graph
graph1 <- makegraph(G2[, 1:3], directed = F)
# Unweighted graph
graph2 <- makegraph(cbind(G2[, 1:2], rep(1, nrow(G2))), directed = F)

nodes <- unique(c(G2$from, G2$to)) %>% sort

myfunc <- function(Gr1, Gr2, nd, k = 2, cores = FALSE){
  # weighted distances, restricted afterwards by the unweighted (hop-count) distances
  test  <- get_distance_matrix(Gr1, nd, nd, allcores = cores)
  test2 <- get_distance_matrix(Gr2, nd, nd, allcores = cores)
  test[test2 > k] <- 0
  return(as.spam(test))
}
#Your first function
system.time(
D1 <- bounded_hop_pairG_1(G)
)
#2.18s
#Your second function
system.time(
D2 <- bounded_hop_pairG_2(G)
)
#1.01s
#One core
system.time(
D3 <- myfunc(graph1,graph2,nodes))
#0.69s
#Parallel
system.time(
D4 <- myfunc(graph1,graph2,nodes,cores=TRUE))
#0.32s
If you really want to stop the algorithm when the k-hop limit is reached, and you have a little knowledge of C++, it seems rather simple to slightly modify the original Dijkstra algorithm and then use it via Rcpp.

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Let's say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, let's say that I have 250 data points. I want to run some number of simulations over random samples of the data, computing the Sharpe ratio using PerformanceAnalytics::SharpeRatio, varying the random sample lengths from 30 to 240, and then find the average. Keep in mind I want to do this for every column, and I'd rather not have to use apply if possible. I'd also like to find something that processes the information rather quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough, performance-wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the Monte Carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after.
This solution does use apply functions, though, because they just make life easier. If you don't use lapply, the code will expand quickly and distract from achieving the goal (and you risk introducing bugs with longer, messier code: one reason to use apply-family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
  x <- get(s)
  dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
  # Assuming subsampling by row, preserving all returns and cross symbol dependence structure at a given timestamp
  ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
  # sort in order to preserve time ordering?
  ii <- sort(ii)
  xs <- x[ii, ]
  xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
  x <- subSample(x, n.sub)
  SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

Calculating a GP correlation matrix outside of a loop

So I am trying to calculate the correlation matrix associated with a Gaussian process using R, and was hoping for some suggestions for doing so without the triple for-loop I have written below. Mainly I want to condense the code for readability, and also to speed up the calculations.
#Example Data
n = 500
x1 = sample(1:100,n,replace=T)
x2 = sample(1:100,n,replace=T)
x3 = sample(1:100,n,replace=T)
X = cbind(x1,x2,x3)
R = matrix(NA, nrow = n, ncol = n)
for (i in 1:nrow(X)){
  for (j in 1:nrow(X)){
    temp = 0
    for (k in 1:ncol(X)){
      temp = -abs(X[i, k] - X[j, k])^1.99 + temp
    }
    R[i, j] = exp(temp)
  }
}
So as n gets large, the code gets much slower. Also worth noting: since this is a correlation matrix, the matrix is symmetric and the diagonal is equal to 1.
It's much faster using this:
y <- t(X)
R <- exp(-sapply(1:ncol(y), function(i) colSums((y-y[,i])^2)))
If you want to keep your original formula:
R <- exp(-sapply(1:ncol(y), function(i) colSums(abs(y-y[,i])^1.99)))
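A further option (my addition, not part of the answer above; R3 is a hypothetical name): since -temp is the Minkowski distance with p = 1.99 raised to the power 1.99, stats::dist can build the whole matrix in compiled code.
R3 <- exp(-as.matrix(dist(X, method = "minkowski", p = 1.99))^1.99)
# all.equal(R, R3, check.attributes = FALSE)  # should be TRUE; the diagonal is exp(0) = 1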
I'm wondering if you could cut your calculation and looping times in half by changing these two lines? (Actually, the timing improved by more than 50%: 14.304 secs down to 6.234 secs.)
1: for(j in 1:nrow(X)){
2: R[i,j] = exp(temp)
To:
1: for(j in i:nrow(X)){
2: R[i,j] = R[j,i]= exp(temp)
Tested:
> all.equal(R, R2)
[1] TRUE
That way you populate the lower triangle without doing any calculations. BTW, what's with the 1.99? This is perhaps a problem more suited to submitting as a C program; the Rcpp package supports this and there are a lot of worked examples on SO. Perhaps a search on: [r] rcpp nested loops
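To make that suggestion concrete, here is a minimal Rcpp sketch (my illustration, not code from the answer; gpCorC is a hypothetical name) of the same nested loops, exploiting the symmetry and the unit diagonal:
library(Rcpp)
cppFunction('
NumericMatrix gpCorC(NumericMatrix X) {
  int n = X.nrow(), p = X.ncol();
  NumericMatrix R(n, n);
  for (int i = 0; i < n; ++i) {
    R(i, i) = 1.0;                       // diagonal of a correlation matrix
    for (int j = i + 1; j < n; ++j) {
      double s = 0.0;
      for (int k = 0; k < p; ++k) {
        s += std::pow(std::fabs(X(i, k) - X(j, k)), 1.99);
      }
      R(i, j) = R(j, i) = std::exp(-s);  // fill both triangles at once
    }
  }
  return R;
}')
# gpCorC(X)  # should match the R loop version (up to floating-point error)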
