Kind of long inquiry, but I wanted to be clear. I have a function that generates a matrix of weights. In this case, it's a 60 by 1 matrix
smthwgtvector <- function(deltat, theta) {
  (1 - exp(-(1/theta))) / (1 - exp(-(1/theta) * deltat)) * exp((data.matrix(1:deltat) - deltat) / theta)
}
Output looks like
> class(sixty.week.vector)
[1] "matrix" "array"
> head(sixty.week.vector)
[,1]
[1,] 0.009780893
[2,] 0.009945274
[3,] 0.010112417
[4,] 0.010282370
[5,] 0.010455179
[6,] 0.010630892
I then change the theta argument to generate a list of 60 vectors:
window <- data.matrix(1:60)
vect.list.try <- lapply(window, function(x) smthwgtvector(deltat = 60, theta = window[[x]]))
So far, so good. Now I have a covariance matrix function that has a matrix, a row number, and a weight vector as arguments. I've applied the function to a rolling window of 60 observations using lapply.
wid <- 60
mtx.list.try <- lapply(seq(nrow(assetclassreturns1)-wid+1), function(i) wgtdcovariance(assetclassreturns1[i-1+seq(wid),],60,vect.list.try[[60]]))
My issue is this: I can get 598 matrices (the number of 60-observation rolling windows in my data) using a single weight vector, in this case vect.list.try[[60]].
What I need is for each rolling window to have 60 matrices, for a total of 35880 = 598*60 matrices. So the covariance function applied to rows 1:60 using vector[[60]], then the same rows using vector[[59]], and so on, before moving on to rows 2:61.
I've tried this
mtx.list.large <- lapply(seq(nrow(assetclassreturns1)-wid+1), function(i,z) wgtdcovariance(assetclassreturns1[i-1+seq(wid),],60,vect.list.try[[z]]))
I get an error message saying Error in t.default(weightvector) : argument is not a matrix
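In nested for-loop form, what I'm after is roughly this (just a sketch; I'd prefer an apply-family solution):

# sketch: outer loop over rolling windows, inner loop over the 60 weight vectors
mtx.list.nested <- vector("list", nrow(assetclassreturns1) - wid + 1)
for (i in seq(nrow(assetclassreturns1) - wid + 1)) {
  inner <- vector("list", 60)
  for (z in 60:1) {
    inner[[61 - z]] <- wgtdcovariance(assetclassreturns1[i - 1 + seq(wid), ], 60, vect.list.try[[z]])
  }
  mtx.list.nested[[i]] <- inner
}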
I know this should be possible, I'm just not seeing a way. Thank you guys!
I am looking for a way to speed up this algorithm.
My situation is as follows: I have a dataset of 25,000 users, each with 6 habits. My goal is to develop a hierarchical clustering of the 25,000 users. I run this on a server with 16 cores and 128 GB RAM.
It took me 3 weeks, using 6 cores non-stop on my server, just to calculate this distance matrix for 10,000 users. As you can imagine, this is too long for my research.
For each of the 6 habits I have created a probability mass function (PMF). The PMFs may differ in size (number of columns) per habit: some habits have 10 columns, some 256, depending on the user with the most unusual behavior.
The first step in my algorithm is to build a distance matrix. I use the Hellinger distance, unlike some packages that use e.g. Cartesian (Euclidean) or Manhattan distance. I do need the Hellinger distance; see https://en.wikipedia.org/wiki/Hellinger_distance
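For reference, for two discrete PMFs p and q (vectors of equal length), the distance I need boils down to this one-liner:

# discrete Hellinger distance between two probability vectors p and q of equal length
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)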
What I have tried so far is to speed up the algorithm with a multicore process, each of the 6 habits on a separate core (a sketch of that setup is included after the two points below). Two things that may be beneficial for a further speed-up:
(1) A C implementation, but I have no idea how to do this (I am not a C programmer). Could you help me with a C implementation, if that would be helpful?
(2) Building a Cartesian product by joining the table to itself to get all row pairs, and then doing a row-wise calculation. The problem is that R errors by default on such a large join, e.g. in data.table. Any suggestions for this?
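As mentioned, this is roughly the multicore setup I currently use (a sketch only; habit_list stands for a list holding the six habit data frames, and calculatedistances is the function shown further below):

library(parallel)
# sketch: one habit per core; habit_list is a list of the six habit data frames
dist_per_habit <- mclapply(habit_list, calculatedistances, mc.cores = 6)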
Any other thoughts?
Best Regards Jurjen
# example for 1 habit with 100 users and a PMF of 5 columns
Habit1 <- data.frame(col1=abs(rnorm(100)),
                     col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
                     col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
                     col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
                     col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))
# give all users a username same as rowname
rownames(Habit1)<- c(1:100)
# actual calculation
Result<-calculatedistances(Habit1)
HellingerDistance <- function(x){
  # takes two equal-sized vectors (the two rows of x) and calculates the Hellinger distance between them
  return(sqrt(sum((sqrt(x[1,]) - sqrt(x[2,]))^2))/sqrt(2))
}
calculatedistances <- function(x){
  # takes a dataframe with one row per user (user ID as row name) and a set of N values per user
  # first set all NA to 0
  x[is.na(x)] <- 0
  # create a 2 x n matrix of row-number pairs:
  # first the diagonal (each row paired with itself), then all combinations of two rows
  D <- cbind(matrix(rep(1:nrow(x), each=2), nrow=2), combn(1:nrow(x), 2))
  # create a dataframe with Hellinger distances
  B <<- data.frame(first=rownames(x)[D[1,]],
                   second=rownames(x)[D[2,]],
                   distance=apply(D, 2, function(y) HellingerDistance(x[y,])))
  # reshape dataframe into a matrix with users on x and y axis
  B <<- reshape(B, direction="wide", idvar="second", timevar="first")
  # convert wide table to distance object
  d <<- as.dist(B[,-1], diag = FALSE)
  attr(d, "Labels") <- B[, 1]
  return(d)
}
I understand this is not a complete answer, but this suggestion is too long for a comment.
Here is how I would go about using data.table to speed up the process. As it stands, this code still does not achieve exactly what you requested, perhaps because I'm not entirely sure what you want, but hopefully it will give a clear idea of how to proceed from here.
Also, you might want to take a look at the HellingerDist function in the distrEx package to calculate the Hellinger distance.
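For instance, something along these lines (a sketch I have not benchmarked; it assumes the PMFs are wrapped as DiscreteDistribution objects from the distr package):

library(distr)    # provides DiscreteDistribution()
library(distrEx)  # provides HellingerDist()
# two toy PMFs on the same support
p <- DiscreteDistribution(supp = 1:5, prob = c(0.1, 0.2, 0.3, 0.2, 0.2))
q <- DiscreteDistribution(supp = 1:5, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
HellingerDist(p, q)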
library(data.table)
# convert Habit1 into a data.table
setDT(Habit1)
# assign ids instead of working with rownames
Habit1[, id := 1:100]
# replace NAs with 0
for (j in seq_len(ncol(Habit1)))
  set(Habit1, which(is.na(Habit1[[j]])), j, 0)
# convert all values to numeric
for (k in seq_along(Habit1)) set(Habit1, j = k, value = as.numeric(Habit1[[k]]))
# get all possible combinations of id pairs in long format
D <- cbind(matrix(rep(1:nrow(Habit1),each=2),nrow=2),combn(1:nrow(Habit1), 2))
D <- as.data.table(D)
D <- transpose(D)
# add to this dataset the probability mass distribution (PMF) of each id V1 and V2
# this solution dynamically adapts to number of columns in each Habit dataset
colnumber <- ncol(Habit1) - 1
cols <- paste0('i.col',1:colnumber)
D[Habit1, c(paste0("id1_col",1:colnumber)) := mget(cols ), on=.(V1 = id)]
D[Habit1, c(paste0("id2_col",1:colnumber)) := mget(cols ), on=.(V2 = id)]
# [STATIC] calculate hellinger distance
D[, H := sqrt(sum(((sqrt(c(id1_col1, id1_col2, id1_col3, id1_col4, id1_col5)) - sqrt(c(id2_col1, id2_col2, id2_col3, id2_col4, id2_col5)))^2)))/sqrt(2) , by = .(V1, V2)]
Now, if you want to make this flexible to the number of columns in each habit data set:
# get names of columns
part1 <- names(D)[names(D) %like% "id1"]
part2 <- names(D)[names(D) %like% "id2"]
# calculate distance
D[, H2 := sqrt(sum(((sqrt( .SD[, ..part1] ) - sqrt( .SD[, ..part2] ))^2)))/sqrt(2) , by = .(V1,V2) ]
Now, for a much faster distance calculation
# change 1st colnames to avoid conflict
names(D)[1:2] <- c('x', 'y')
# [dynamic] calculate hellinger distance
D[melt(D, measure = patterns("^id1", "^id2"), value.name = c("v", "f"))[
, sqrt(sum(((sqrt( v ) - sqrt( f ))^2)))/sqrt(2), by=.(x,y)], H3 := V1, on = .(x,y)]
# same results
#> identical(D$H, D$H2) && identical(D$H2, D$H3)
#> [1] TRUE
The first step in optimizing code is profiling. Profiling the code you provided shows that the main bottleneck is the HellingerDistance function.
Improve the algorithm. In your HellingerDistance function, the square roots are recalculated for every pair, which is a waste of time. In the improved version below, calculatedistances1 takes the square root of x once and then uses the new HellingerDistanceSqrt to calculate the Hellinger distance; this alone speeds things up by about 40%.
Improve the data structure. I also noticed that x in your original calculatedistances function is a data.frame, which carries a lot of overhead, so I convert it to a matrix with as.matrix, which makes the code faster by more than an order of magnitude.
Finally, the new calculatedistances1 is more than 70 times faster than the original version on my machine.
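As a quick aside on the profiling step: a minimal sketch with base R's Rprof, run against the example below, would look like this.

# sketch: profile the original implementation to find the hot spots
Rprof("hellinger_prof.out")
Result <- calculatedistances(Habit1)
Rprof(NULL)
summaryRprof("hellinger_prof.out")$by.self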
# example for 1 habit with 100 users and a PMF of 5 columns
Habit1 <- data.frame(col1=abs(rnorm(100)),
                     col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
                     col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
                     col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
                     col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))
# give all users a username same as rowname
rownames(Habit1)<- c(1:100)
HellingerDistance <- function(x){
  # takes two equal-sized vectors (the two rows of x) and calculates the Hellinger distance between them
  return(sqrt(sum((sqrt(x[1,]) - sqrt(x[2,]))^2))/sqrt(2))
}
HellingerDistanceSqrt <- function(sqrtx){
  # takes two equal-sized vectors of pre-computed square roots (the two rows of sqrtx)
  # and calculates the Hellinger distance between them
  return(sqrt(sum((sqrtx[1,] - sqrtx[2,])^2))/sqrt(2))
}
calculatedistances <- function(x){
  # takes a dataframe with one row per user (user ID as row name) and a set of N values per user
  # first set all NA to 0
  x[is.na(x)] <- 0
  # create a 2 x n matrix of row-number pairs:
  # first the diagonal (each row paired with itself), then all combinations of two rows
  D <- cbind(matrix(rep(1:nrow(x), each=2), nrow=2), combn(1:nrow(x), 2))
  # create a dataframe with Hellinger distances
  B <<- data.frame(first=rownames(x)[D[1,]],
                   second=rownames(x)[D[2,]],
                   distance=apply(D, 2, function(y) HellingerDistance(x[y,])))
  # reshape dataframe into a matrix with users on x and y axis
  B <<- reshape(B, direction="wide", idvar="second", timevar="first")
  # convert wide table to distance object
  d <<- as.dist(B[,-1], diag = FALSE)
  attr(d, "Labels") <- B[, 1]
  return(d)
}
calculatedistances1 <- function(x){
  # takes a dataframe with one row per user (user ID as row name) and a set of N values per user
  # first set all NA to 0
  x[is.na(x)] <- 0
  # convert to a matrix and pre-compute the square roots once
  x <- sqrt(as.matrix(x))
  # create a 2 x n matrix of row-number pairs:
  # first the diagonal (each row paired with itself), then all combinations of two rows
  D <- cbind(matrix(rep(1:nrow(x), each=2), nrow=2), combn(1:nrow(x), 2))
  # create a dataframe with Hellinger distances
  B <<- data.frame(first=rownames(x)[D[1,]],
                   second=rownames(x)[D[2,]],
                   distance=apply(D, 2, function(y) HellingerDistanceSqrt(x[y,])))
  # reshape dataframe into a matrix with users on x and y axis
  B <<- reshape(B, direction="wide", idvar="second", timevar="first")
  # convert wide table to distance object
  d <<- as.dist(B[,-1], diag = FALSE)
  attr(d, "Labels") <- B[, 1]
  return(d)
}
# actual calculation
system.time(Result<-calculatedistances(Habit1))
system.time(Result1<-calculatedistances1(Habit1))
identical(Result, Result1)
Sorry if this has been posted before. I looked for the answer both on Google and Stackoverflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row in one matrix and find the row in the other matrix that is most similar by some distance metric (for now, least squares). I figured out one method, but it is O(n^2), which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
distance.fun <- function(mat1, mat2){
  match <- c()
  for (i in 1:nrow(mat1)){
    if (!all(is.na(mat1[i,]))){
      dist <- c()
      for (j in 1:nrow(mat2)){
        dist[j] <- sum((mat1[i,] - mat2[j,])^2)
      }
      match[i] <- which.min(dist)
    }
  }
  return(match)
}
A better strategy would be to compute the distance matrix all at once first, then extract the minimums. Here's an example using simulated data
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
and here's a helper function that calculates the distances between the rows of one matrix and those of another. It uses the built-in dist function; although it does some unnecessary within-group comparisons that we eventually have to filter out, it may still perform better overall.
distab<-function(m1, m2) {
stopifnot(ncol(m1)==ncol(m2))
m<-as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
rownames(m)<-rownames(m1)
colnames(m)<-rownames(m2)
m
}
mydist<-distab(mat1, mat2)
Now that we have the between-group distances, we just need to find the minimums.
# for each row of mat1, the index of the closest row of mat2
best <- apply(mydist, 1, which.min)
rr <- cbind(m1.row=seq.int(nrow(mat1)), best.m2.row = best)
head(rr) #just print a few
# m1.row best.m2.row
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
Note that with a strategy like this (as well as with your original implementation) it is possible for multiple rows from mat1 to match the same row in mat2, and for some rows in mat2 to be unmatched to mat1.
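If that matters for your use case, a quick check along these lines (a small sketch using the objects above) shows which mat2 rows are hit more than once and which are never matched:

# mat2 rows matched by more than one mat1 row, and mat2 rows never matched at all
multi_matched <- as.integer(names(which(table(best) > 1)))
never_matched <- setdiff(seq_len(nrow(mat2)), best)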
Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R. The first is a list of unique 2-variable values (u), and so is much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
for(i in 1:nrow(u)){
m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
}
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
match <- function(x,...){
which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
m <- apply(u,1,match)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big-data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside-the-box approaches to finding these matching rows are also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
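For example, extracting the matching row indices from the Hits object, with the arguments reversed as suggested, could look like this (a sketch):

# reversed arguments, then pull out the matching row indices
hits2 <- findOverlaps(dr + b*100L, ur)
matches <- data.frame(d.row = queryHits(hits2), u.row = subjectHits(hits2))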
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x, y) {
  which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-) . How about just doing calculations:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
[,1] [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
[,1] [,2]
[1,] -1 1
[2,] 1 1
[3,] -1 1
[4,] 1 -1
[5,] 1 1
Now all you have to do is select the rows of that last matrix which are c(-1, 1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.
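For instance, continuing the example above (a small sketch):

# rows of bar whose sign pattern in thresh is c(-1, 1)
sel <- which(sign(thresh)[, 1] == -1 & sign(thresh)[, 2] == 1)
bar[sel, , drop = FALSE]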
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries; how do I select 3 rows such that the sum of their salaries does not exceed some value? I'm currently just using a while loop, but that seems pretty inefficient.
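For reference, my current approach is roughly this (a sketch; the data frame df, its Salary column, and the cap are placeholders):

# keep drawing 3 rows at random until their salaries fit under the cap
pick <- df[sample(nrow(df), 3), ]
while (sum(pick$Salary) > cap) {
  pick <- df[sample(nrow(df), 3), ]
}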
You could face a combinatorial explosion. This simulates selecting combinations of 3 EEs from a set of 20 with salaries at a mean of 60 and an sd of 20. It shows that of the 1140 possible combinations, only 263 have a sum of salaries less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
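A sketch of that sequential procedure, reusing the salry/sals example from the other answer and a cap of 150 (the function name and arguments are just illustrative):

# repeatedly build a group of k cases whose salaries stay under cap, restarting on dead ends
sample_constrained <- function(df, k = 3, cap = 150) {
  repeat {
    chosen <- character(0)
    remaining <- df
    budget <- cap
    ok <- TRUE
    for (j in seq_len(k)) {
      pool <- which(remaining$sals < budget)          # cases still affordable
      if (length(pool) == 0) { ok <- FALSE; break }   # dead end: reject and restart
      pick <- pool[sample.int(length(pool), 1)]
      chosen <- c(chosen, rownames(remaining)[pick])
      budget <- budget - remaining$sals[pick]
      remaining <- remaining[-pick, , drop = FALSE]
    }
    if (ok) return(df[chosen, ])
  }
}
# e.g. sample_constrained(salry, k = 3, cap = 150)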
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group
sx <- colSums(x)
## keep only the groups whose sum is below the constraint (here 1)
sxc <- sx[sx < 1]
## sampling with replacement from the qualifying sums
sample(sxc, 10, replace=TRUE)