R: Vectorize loop to create pairwise matrix

I want to speed up a function for creating a pairwise matrix that describes the number of times an object is selected before and after all other objects, within a set of locations.
Here is an example df:
df <- data.frame(Shop = c("A","A","A","B","B","C","C","D","D","D","E","E","E"),
                 Fruit = c("apple", "orange", "pear",
                           "orange", "pear",
                           "pear", "apple",
                           "pear", "apple", "orange",
                           "pear", "apple", "orange"),
                 Order = c(1, 2, 3,
                           1, 2,
                           1, 2,
                           1, 2, 3,
                           1, 1, 1))
In each Shop, Fruit is picked by a customer in a given Order.
The following function creates an m x n pairwise matrix:
loop.function <- function(df){
  fruits <- unique(df$Fruit)
  nt <- length(fruits)
  mat <- array(dim = c(nt, nt))
  for(m in 1:nt){
    for(n in 1:nt){
      ## filter df for each pair of fruit
      xm <- df[df$Fruit == fruits[m],]
      xn <- df[df$Fruit == fruits[n],]
      ## index instances when a pair of fruit are picked in same shop
      mm <- match(xm$Shop, xn$Shop)
      ## filter xm and xn based on mm
      xm <- xm[! is.na(mm),]
      xn <- xn[mm[! is.na(mm)],]
      ## assign number of times fruit[m] is picked after fruit[n] to mat[m,n]
      mat[m,n] <- sum(xn$Order < xm$Order)
    }
  }
  row.names(mat) <- fruits
  colnames(mat) <- fruits
  return(mat)
}
Where mat[m,n] is the number of times fruits[m] is picked after fruits[n]. And mat[n,m] is the number of times fruits[m] is picked before fruits[n]. It is not recorded if pairs of fruit are picked at the same time (e.g. in Shop E).
See expected output:
> loop.function(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0
You can see here that pear is chosen twice before apple (in Shop C and D), and apple is chosen once before pear (in Shop A).
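As a quick sanity check on those counts (independent of loop.function()), the apple/pear pair can be verified directly in base R:

# merge apple picks and pear picks by Shop, then count the orderings
ap <- merge(df[df$Fruit == "apple", ], df[df$Fruit == "pear", ], by = "Shop")
c(apple_after_pear = sum(ap$Order.x > ap$Order.y),  # expected 2 (Shops C and D)
  pear_after_apple = sum(ap$Order.y > ap$Order.x))  # expected 1 (Shop A)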
I am trying to improve my knowledge of vectorization, especially in place of loops, so I want to know how this loop can be vectorized.
(I have a feeling there may be a solution using outer(), but my knowledge of vectorizing functions is still very limited.)
Update
See benchmarking with real data times = 10000 for loop.function(), tidyverse.function(), loop.function2(), datatable.function() and loop.function.TMS():
Unit: milliseconds
                    expr        min        lq       mean    median         uq      max neval cld
      loop.function(dat) 186.588600 202.78350 225.724249 215.56575 234.035750 999.8234 10000   e
 tidyverse.function(dat)  21.523400  22.93695  26.795815  23.67290  26.862700 295.7456 10000  c
     loop.function2(dat) 119.695400 126.48825 142.568758 135.23555 148.876100 929.0066 10000    d
 datatable.function(dat)   8.517600   9.28085  10.644163   9.97835  10.766749 215.3245 10000  b
  loop.function.TMS(dat)   4.482001   5.08030   5.916408   5.38215   5.833699  77.1935 10000 a
Probably the most interesting result for me is the performance of tidyverse.function() on the real data. I will have to try to add the Rcpp solutions at a later date - I'm having trouble making them work on the real data.
I appreciate all the interest and answers given to this post - my intention was to learn and improve performance, and there is certainly a lot to learn from all the comments and solutions given. Thanks!

A data.table solution:
library(data.table)
setDT(df)
setkey(df, Shop)
dcast(df[df, on = .(Shop = Shop), allow.cartesian = T][
  , .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
  Fruit ~ i.Fruit, value.var = 'cnt')
    Fruit apple orange pear
1:  apple     0      0    2
2: orange     2      0    1
3:   pear     1      2    0
The Shop index isn't necessary for this example, but will probably improve performance on a larger dataset.
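One hedged way to check that on your own data is to time the same query on a bigger synthetic dataset, with and without the key; the sizes below are arbitrary assumptions, not measurements from this post:

library(data.table)
library(microbenchmark)
set.seed(1)
big <- data.table(Shop  = sample(sprintf("S%04d", 1:5000), 5e4, replace = TRUE),
                  Fruit = sample(c("apple", "orange", "pear"), 5e4, replace = TRUE))
big[, Order := sample(.N), by = Shop]   # a random pick order within each shop

pairwise <- function(d) {
  dcast(d[d, on = .(Shop), allow.cartesian = TRUE][
    , .(cnt = sum(i.Order < Order & i.Fruit != Fruit)), by = .(Fruit, i.Fruit)],
    Fruit ~ i.Fruit, value.var = 'cnt')
}

microbenchmark(unkeyed = pairwise(copy(big)),
               keyed   = {d <- copy(big); setkey(d, Shop); pairwise(d)},
               times = 5)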
As the question raised many comments on performance, I decided to check what Rcpp could bring:
library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {

  std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
  Rcpp::NumericVector Order = df["Order"];
  Rcpp::StringVector Fruit = df["Fruit"];
  StringVector FruitLevels = sort_unique(Fruit);
  IntegerVector FruitInt = match(Fruit, FruitLevels);
  int n = FruitLevels.length();
  std::string currentShop = "";
  int order, fruit, i, f;
  NumericMatrix result(n, n);
  NumericVector fruitOrder(n);

  for (i = 0; i < Fruit.length(); i++){
    if (currentShop != Shop[i]) {
      // Init counter for each shop
      currentShop = Shop[i];
      std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
    }
    order = Order[i];
    fruit = FruitInt[i];
    fruitOrder[fruit - 1] = order;
    for (f = 0; f < n; f++) {
      if (order > fruitOrder[f] && fruitOrder[f] > 0) {
        result(fruit - 1, f) = result(fruit - 1, f) + 1;
      }
    }
  }
  rownames(result) = FruitLevels;
  colnames(result) = FruitLevels;
  return(result);
}
')
rcppPair(df)
       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0
On the example dataset, this runs >500 times faster than the data.table solution, probably because it doesn't have the cartesian product problem. This isn't supposed to be robust on wrong input, and expects that shops / order are in ascending order.
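If your real data isn't already grouped by shop and sorted by pick order, a cheap safeguard (a hedged suggestion, using the same column names as above) is to sort before calling rcppPair():

setorder(df, Shop, Order)   # df is a data.table here (setDT() above);
                            # use df[order(df$Shop, df$Order), ] for a plain data.frame
rcppPair(df)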
Considering the few minutes spent to find the 3 lines of code for the data.table solution, compared to the much longer Rcpp solution / debugging process, I wouldn't recommend going for Rcpp here unless there's a real performance bottleneck.
Interesting however to remember that if performance is a must, Rcpp might be worth the effort.

Here is an approach that makes simple modifications to make it 5x faster.
loop.function2 <- function(df){
  spl_df = split(df[, c(1L, 3L)], df[[2L]])
  mat <- array(0L,
               dim = c(length(spl_df), length(spl_df)),
               dimnames = list(names(spl_df), names(spl_df)))
  for (m in 1:(length(spl_df) - 1L)) {
    xm = spl_df[[m]]
    mShop = xm$Shop
    for (n in ((1 + m):length(spl_df))) {
      xn = spl_df[[n]]
      mm = match(mShop, xn$Shop)
      inds = which(!is.na(mm))
      mOrder = xm[inds, "Order"]
      nOrder = xn[mm[inds], "Order"]
      mat[m, n] <- sum(nOrder < mOrder)
      mat[n, m] <- sum(mOrder < nOrder)
    }
  }
  mat
}
There are 3 main concepts:
The original df[df$Fruit == fruits[m], ] lines were inefficient as you would be making the same comparison length(fruits)^2 times. Instead, we can use split() (illustrated just after this list), which means we only scan the Fruit column once.
There was a lot of use of df$var which will extract the vector during each loop. Here, we place the assignment of xm outside of the inner loop and we try to minimize what we need to subset / extract.
I changed the loop to be closer to combn(): we re-use the same match() result to fill both sides of the matrix, computing sum(nOrder < mOrder) for mat[m, n] and sum(mOrder < nOrder) for mat[n, m].
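For point 1, here is a small illustration of what split() produces on the example data (just to show the structure the loops re-use; not new functionality):

spl_df <- split(df[, c(1L, 3L)], df[[2L]])   # columns Shop and Order, split by Fruit
names(spl_df)                                # "apple"  "orange" "pear"
spl_df[["apple"]]                            # the Shop/Order rows where apple was picked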
Performance:
bench::mark(loop.function(df), loop.function2(df))
# A tibble: 2 x 13
##   expression              min  median
##   <bch:expr>         <bch:tm>  <bch:>
## 1 loop.function(df)    3.57ms  4.34ms
## 2 loop.function2(df)  677.2us 858.6us
My hunch is that for your larger dataset, @Waldi's data.table solution will be faster. But for smaller datasets, this should be pretty performant.
Finally, here's yet another Rcpp approach that seems to be slower than @Waldi's:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
  int x_size = x.size();
  IntegerMatrix ans(x_size, x_size);

  for (int m = 0; m < x_size - 1; m++) {
    DataFrame xm = x[m];
    CharacterVector mShop = xm[0];
    IntegerVector mOrder = xm[1];
    int nrows = mShop.size();

    for (int n = m + 1; n < x_size; n++) {
      DataFrame xn = x[n];
      CharacterVector nShop = xn[0];
      IntegerVector nOrder = xn[1];

      for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < nrows; j++) {
          if (mShop[i] == nShop[j]) {
            if (mOrder[i] > nOrder[j])
              ans(m, n)++;
            else
              ans(n, m)++;
            break;
          }
        }
      }
    }
  }
  return(ans);
}

loop_wrapper = function(df) {
  loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)

It does not seem possible to vectorize over the original data frame df. But if you transform it using reshape2::dcast() to have one line per shop:
require(reshape2)
df$Fruit <- as.character(df$Fruit)
by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
#   Shop apple orange pear
# 1    A     1      2    3
# 2    B    NA      1    2
# 3    C     2     NA    1
# 4    D     2      3    1
# 5    E     1      1    1
..., then you can easily vectorize at least for each combination of [m, n]:
fruits <- unique(df$Fruit)
outer(fruits, fruits,
      Vectorize(
        function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE),
        c("m", "n")
      ),
      by_shop)
#      [,1] [,2] [,3]
# [1,]    0    0    2
# [2,]    2    0    1
# [3,]    1    2    0
This is probably the solution you wanted to achieve with outer(). A much faster solution would be true vectorization over all combinations of fruits [m, n], but I've been thinking about it and I don't see any way to do it. So I had to use the Vectorize() function, which of course is much slower than true vectorization.
Benchmark comparison with your original function:
Unit: milliseconds
expr min lq mean median uq max neval
loop.function(df) 3.788794 3.926851 4.157606 4.002502 4.090898 9.529923 100
loop.function.TMS(df) 1.582858 1.625566 1.804140 1.670095 1.756671 8.569813 100
Function & benchmark code (also added the preservation of the dimnames):
require(reshape2)
loop.function.TMS <- function(df) {
  df$Fruit <- as.character(df$Fruit)
  by_shop <- dcast(df, Shop ~ Fruit, value.var = "Order")
  fruits <- unique(df$Fruit)
  o <- outer(fruits, fruits,
             Vectorize(function (m, n, by_shop) sum(by_shop[,m] > by_shop[,n], na.rm = TRUE),
                       c("m", "n")),
             by_shop)
  colnames(o) <- rownames(o) <- fruits
  o
}
require(microbenchmark)
microbenchmark(loop.function(df), loop.function.TMS(df))

OK, here is a solution:
library(tidyverse)

# a dataframe with all fruit combinations
df_compare <- expand.grid(row_fruit = unique(df$Fruit),
                          column_fruit = unique(df$Fruit),
                          stringsAsFactors = FALSE)

df_compare %>%
  left_join(df, by = c("row_fruit" = "Fruit")) %>%
  left_join(df, by = c("column_fruit" = "Fruit")) %>%
  filter(Shop.x == Shop.y &
           Order.x < Order.y) %>%
  group_by(row_fruit, column_fruit) %>%
  summarise(obs = n()) %>%
  pivot_wider(names_from = row_fruit, values_from = obs) %>%
  arrange(column_fruit) %>%
  mutate_if(is.numeric, function(x) replace_na(x, 0)) %>%
  column_to_rownames("column_fruit") %>%
  as.matrix()

       apple orange pear
apple      0      0    2
orange     2      0    1
pear       1      2    0
If you don't know what is going on in the second code part (df_compare %>% ...), read the "pipe" (%>%) as 'then'. Run the code from df_compare to just before any of the pipes to see the intermediate results.

Vectorize function / increase calculation speed in data.table

Currently I have the following data.table :
item   city dummyvar
   A Austin        1
   A Austin        1
   A Austin      100
   B Austin        2
   B Austin        2
   B Austin      200
   A     NY        1
   A     NY        1
   A     NY      100
   B     NY        2
   B     NY        2
   B     NY      200
and I have a user-defined function called ImbalancePoints, which is applied to dummyvar and it returns the rows where it detects an abrupt change in dummyvar. The way I am doing this is as follows:
my.data.table[,
  .(item, city, imb.points = list(unique(try(ImbalancePoints(dummyvar), silent = T)))),
  by = .(city, item)
]
And for the NY case lets say that I get a data.table object like the following:
item city imb.points
   A   NY      3,449
where the column imb.points is a column with nested lists as its elements, and for this example the numbers 3 and 449 denote the rows where there is an abrupt change for the case of city = NY and item = A. However the problem that I am facing is that I have approx. 3000 different items for 12 different cities, and it is taking a long time to calculate this. I was wondering if you could give me an idea of how to vectorize/speed up this calculation since the last time that I attempted this it took almost 2 hours and it didn't finish.
I don't know if it's of any help, but I am also attaching the ImbalancePoints function:
library(pracma)

ImbalancePr <- function(eval.column) {
  n <- length(eval.column)
  imbalance <- rep(0, n)
  b_t = rep(0, n)
  elem_diff <- diff(eval.column)
  for(i in 2:n)
  {
    imbalance[i] <- sign(elem_diff[i-1]) * (elem_diff[i-1] != 0)
    + imbalance[i-1]*(elem_diff[i-1] == 0)
  }
  return(imbalance)
}
ImbalancePoints <- function(eval.column, w0 = 100, bkw_T = 10, bkw_b = 10){
  bv_t <- ImbalancePr(eval.column)
  w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
  Tstar <- w0
  E0t <- Tstar
  repeat{
    Tlast <- sum(Tstar)
    nbt <- min(bkw_b, Tlast - 1)
    P <- pracma::movavg(bv_t[1:Tlast], n = nbt, type = "e")
    P <- tail(P, 1)
    bv_t_expected <- E0t * abs(P)
    bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
    if(max(bv_t_cumsum) < bv_t_expected){break}else{
      Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
    }
    Tlast <- Tlast + Tnew
    if(Tlast > length(eval.column)[1]){break}else{
      Tstar <- c(Tstar, Tnew)
      if(length(Tstar) <= 2){
        E0t <- mean(Tstar)
      }else{
        nt <- min(bkw_T, length(Tstar) - 1)
        E0t <- pracma::movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
        E0t <- tail(E0t, 1)
      }
    }
  }
  return(sort(unique(Tstar)))
}
EDIT: Thanks to Paul's insight, my problem reduces to vectorizing the repeat loop inside the ImbalancePoints function. However, I am not very proficient at coding and I don't see a straightforward solution. If someone could give me a suggestion, or if you know of an auxiliary function/library, I would appreciate it.
This posting consists of several sections addressing different issues:
Vectorizing ImbalancePr()
Profiling ImbalancePoints()
Speeding-up movavg() with Rcpp by a factor of 4
Vectorizing ImbalancePr()
I believe ImbalancePr() can be replaced by
fImbalancePr <- function(x) c(0, sign(diff(x)))
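As a quick spot check on a small vector (the bench::mark() call below also verifies that both versions return identical results):

x <- c(0, 0, 5, 5, 2, 7)
identical(ImbalancePr(x), fImbalancePr(x))   # expected TRUE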
At least, it returns the same result when benchmarked (with a check of results):
library(bench)
library(ggplot2)

bm <- press(
  n = c(10, 100, 1000, 10000),
  {
    x <- rep(0, n)
    set.seed(123)
    x[sample(n, n/5)] <- 100
    print(table(x))
    mark(
      ImbalancePr(x),
      fImbalancePr(x)
    )
  }
)
Running with:
      n
1    10
x
   0 100
   8   2
2   100
x
   0 100
  80  20
3  1000
x
   0 100
 800 200
4 10000
x
    0  100
 8000 2000
autoplot(bm)
fImbalancePr() is always faster than OP's original version. The speed advantage increases with vector length.
Profiling ImbalancePoints()
However, this improvement does not have much impact on the overall performance of ImbalancePoints():
library(bench)
library(ggplot2)

bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      }
    )
  }
)
fImbalancePoints() is a variant of ImbalancePoints() where ImbalancePr() has been replaced by fImbalancePr().
autoplot(bm)
There is a minor improvement but this does not help to cut down the reported execution time of 2 hours significantly.
We can use profvis to identify where the time is spent within ImbalancePoints():
library(profvis)
x <- c(rep(0, 480L), rep(c(0:9, 9:0), 2L), rep(0, 480L))
profvis({
  for (i in 1:100) fImbalancePoints(x)
})
Timings are collected by sampling, therefore a sufficient number of repetitions is required to get a good coverage.
The results from one run are shown in this screenshot from RStudio:
movavg() consumes 25% of the time spent in ImbalancePoints().
According to the profiling, another 20% is spent in the double colon operator in pracma::movavg(). It might be worthwhile to test whether there is a speedup from loading the pracma package beforehand using library(pracma).
10% are spent in calls to tail(). tail(x, 1) can be replaced by x[length(x)], which is more than an order of magnitude faster.
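A tiny illustration of that last point (timings will of course vary by machine):

x <- rnorm(1e3)
bench::mark(
  tail(x, 1),
  x[length(x)]
)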
If we look at the code of movavg() by typing pracma::movavg (without parentheses), we see that there is an iterative loop which cannot be vectorized:
...
else if (type == "e") {
  a <- 2/(n + 1)
  y[1] <- x[1]
  for (k in 2:nx) y[k] <- a * x[k] + (1 - a) * y[k - 1]
}
...
In addition, only the last value of the time series created by the call to movavg() is used. So, there might be two options for performance improvements here:
Choose a different weighted means function which uses only data points within a limited window.
Re-implement movavg() in C++ using Rcpp.
Speeding-up movavg() with Rcpp
Replacing the call to pracma::movavg() and the subsequent call to tail() by an Rcpp function, we can gain a speed-up of up to a factor of 4 for ImbalancePoints() overall.
EMA_last_cpp(x, n) replaces tail(pracma::movavg(x, n, type = "e"), 1)
library(Rcpp)
cppFunction("
double EMA_last_cpp(const NumericVector& x, const int n) {
  int nx = x.size();
  double a = 2.0 / (n + 1.0);
  double b = 1.0 - a;
  double y;
  y = x[0];
  for(int k = 1; k < nx; k++){
    y = a * x[k] + b * y;
  }
  return y;
}
")
Now, we can modify ImbalancePoints() accordingly. In addition, the call to ImbalancePr() is replaced and the code is modified in two other places (see comments):
fImbalancePoints <-
  function(eval.column,
           w0 = 100,
           bkw_T = 10,
           bkw_b = 10) {
    # bv_t <- ImbalancePr(eval.column)
    bv_t <- c(0, sign(diff(eval.column)))
    # w0 <- min(min(which(cumsum(bv_t) != 0)), w0)
    w0 <- min(which(bv_t != 0)[1L], w0) # pick first change point
    Tstar <- w0
    E0t <- Tstar
    repeat {
      Tlast <- sum(Tstar)
      # remove warning:
      # In max(bv_t_cumsum) : no non-missing arguments to max; returning -Inf
      if (Tlast >= length(bv_t)) break
      nbt <- min(bkw_b, Tlast - 1)
      # P <- movavg(bv_t[1:Tlast], n = nbt, type = "e")
      # P <- tail(P, 1)
      P <- EMA_last_cpp(bv_t[1:Tlast], n = nbt)
      bv_t_expected <- E0t * abs(P)
      bv_t_cumsum <- abs(cumsum(bv_t[-(1:Tlast)]))
      if (max(bv_t_cumsum) < bv_t_expected) {
        break
      } else {
        Tnew <- min(which(bv_t_cumsum >= bv_t_expected))
      }
      Tlast <- Tlast + Tnew
      if (Tlast > length(eval.column)[1]) {
        break
      } else {
        Tstar <- c(Tstar, Tnew)
        if (length(Tstar) <= 2) {
          E0t <- mean(Tstar)
        } else {
          nt <- min(bkw_T, length(Tstar) - 1)
          # E0t <- movavg(Tstar[1:length(Tstar)], n = nt, type = "e")
          # E0t <- tail(E0t, 1)
          E0t <- EMA_last_cpp(Tstar[1:length(Tstar)], n = nt)
        }
      }
    }
    return(sort(unique(Tstar)))
  }
The benchmark
library(bench)
library(ggplot2)

bm <- press(
  n = c(10L, 100L, 1000L),
  {
    x <- replace(rep(0, n), n, 100)
    y <- c(rep(2, n), rep(-3, n), rep(5, n))
    mark(
      original = {
        list(
          ImbalancePoints(x),
          ImbalancePoints(y)
        )
      },
      modified = {
        list(
          fImbalancePoints(x),
          fImbalancePoints(y)
        )
      },
      min_time = 1
    )
  }
)
bm
  expression     n      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <int> <bch:t>  <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 original      10 315.1us   369us     2318.    2.66KB     4.16  2231     4   962.49ms
2 modified      10   120us   136us     6092.  195.11KB     5.31  5733     5   940.99ms
3 original     100 583.4us   671us     1283.   55.09KB     4.16  1234     4   961.78ms
4 modified     100 145.4us   167us     5146.   47.68KB     4.19  4916     4   955.29ms
5 original    1000 438.1ms   469ms      2.17  157.37MB     4.33     3     6      1.38s
6 modified    1000  97.1ms   103ms      9.53  152.09MB    17.1     10    18      1.05s
shows that the modified version is about 3 to 5 times faster than the original version. This may help the OP to reduce the compute time for his production dataset from 2+ hours by a significant factor.

R Counting the frequency of combinations of three digits efficiently

I have a data.frame where each ID has exactly 3 attributes. For simplicity I use only 100 rows here, although my real dataset has around 1,000,000. There are around 50 different possible attributes, a mixture of numbers and characters.
data <- data.frame(id = 1:100,
                   a1 = sample(letters, 100, replace = T),
                   a2 = sample(letters, 100, replace = T),
                   a3 = sample(letters, 100, replace = T),
                   stringsAsFactors = FALSE) %>%
  as_tibble()
I want to know which combinations are the most frequent (the order does not matter).
So the outcome is supposed to be something like this
pattern | frequency
a,a,a | 10
A,b,c | 5
a,e,c | 4
... | ....
First I started to create a vector which contains all possible combinations:
possible_combinations <- combn(c(letters, LETTERS), 3) %>%
  t() %>%
  as_tibble() %>%
  unite("combination", sep = "") %>%
  pull()
Then I wrote this nested loop to count the frequencies:
counter = 0
inner_counter = 0
combination_counter = vector(mode = "numeric", length = length(possible_combinations))

for (j in 1:length(possible_combinations)){
  for (i in 1:nrow(data)){
    # inner counter counts when an attribute of one ID is in one combination
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,2]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,3]])
    inner_counter = inner_counter + str_count(possible_combinations[j], data[[i,4]])
    # if all three attributes are in a combination, then the counter increases by one
    if(inner_counter == 3) {
      counter = counter + 1
    }
    inner_counter = 0
  }
  # combination_counter is a vector which saves the frequency with
  # which a combination occurred across all different ids
  combination_counter[[j]] = counter
  counter = 0
}
I know this is really not very R like, but I don't know how to do it in a different way. The runtime is even bad for my little toy example and it's almost infeasible for my real data.
You could as well do this with base R:
table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ",")))
The problem that you are going to run into is dealing with the massive number of combinations. Even if you try to apply a simple solution of sorting each row, this will cost a lot of time for the number of rows you are dealing with.
Take the following example with the straightforward approach offered by @Lennyy:
set.seed(123)
n <- 1e7
data <- data.frame(id = 1:n,
                   a1 = sample(letters, n, replace = T),
                   a2 = sample(letters, n, replace = T),
                   a3 = sample(letters, n, replace = T),
                   stringsAsFactors = FALSE)

system.time(t2 <- table(apply(data[,2:4], 1, function(x) paste0(sort(x), collapse = ","))))
   user  system elapsed
373.281   1.695 375.445
That's a long time...
Here is the output for reference:
head(t2)
a,a,a a,a,b a,a,c a,a,d a,a,e a,a,f
603 1657 1620 1682 1759 1734
We need to somehow code each row quickly without worrying about which column a particular element came from. Additionally, we need to do this in a way that will guarantee uniqueness.
What about a hash table? We can easily do this with Rcpp.
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
IntegerVector countCombos(IntegerMatrix myMat, int numAttr, CharacterVector myAttr) {

    unsigned long int numRows = myMat.nrow();
    unsigned long int numCols = myMat.ncol();
    std::unordered_map<std::string, int> mapOfVecs;

    for (std::size_t i = 0; i < numRows; ++i) {
        std::vector<int> testVec(numAttr, 0);

        for (std::size_t j = 0; j < numCols; ++j) {
            ++testVec[myMat(i, j) - 1];
        }

        std::string myKey(testVec.begin(), testVec.end());

        auto it = mapOfVecs.find(myKey);

        if (it == mapOfVecs.end()) {
            mapOfVecs.insert({myKey, 1});
        } else {
            ++(it->second);
        }
    }

    std::size_t count = 0;
    IntegerVector out(mapOfVecs.size());
    CharacterVector myNames(mapOfVecs.size());

    for (const auto& elem: mapOfVecs) {
        std::size_t i = 0;
        for (auto myChar: elem.first) {
            while (myChar) {
                myNames[count] += myAttr[i];
                --myChar;
            }
            ++i;
        }
        out[count++] = elem.second;
    }

    out.attr("names") = myNames;
    return out;
}
This offers a great efficiency gain over any of the other solutions posted:
myRows <- 1:nrow(data)
attrCount <- 26

matOfInts <- vapply(2:ncol(data), function(x) {
  match(data[, x], letters)
}, myRows, USE.NAMES = FALSE)

system.time(t <- countCombos(matOfInts, attrCount, letters))
   user  system elapsed
  2.570   0.007   2.579
That's over 100 times faster!!!!
And here is the output:
head(t)
jkk ddd qvv ttu aaq ccd
1710 563 1672 1663 1731 1775
Testing equality (the output is in different order, so we must sort first):
identical(sort(unname(t)), as.integer(sort(unname(t2))))
[1] TRUE
Explanation
The countCombos function accepts a matrix of integers. This matrix represents the indices of elements of the unique attributes (in our example, this would be represented by letters).
As we are dealing with combinations with repetition, we can easily represent them as an indexing frequency vector.
The template vector is:
 a  b  c  d  e  ...  y  z
 |  |  |  |  |       |  |
 v  v  v  v  v       v  v
(0, 0, 0, 0, 0, ..., 0, 0)
And here is how certain combinations get mapped:
aaa -->> (3, rep(0, 25))
zdd -->> dzd -->> ddz -->> (0, 0, 0, 2, rep(0, 21), 1)
Once we have created our vector, we convert it to a string, so ddz becomes:
ddz -->> c(0, 0, 0, 2, rep(0, 21), 1) -->> `00020000000000000000000001`
And this is the key that is used in our hash.
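A small R sketch of the same key idea (purely illustrative; the variable names are made up, and the real key in the C++ code stores raw counts rather than the printable digits shown here):

row <- c("z", "d", "d")                        # one id's three attributes
key <- tabulate(match(row, letters), nbins = 26)
paste(key, collapse = "")                      # "00020000000000000000000001"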
If I've understood you correctly, the ordering of the attributes doesn't matter, so aba is the same as aab and baa. You also have 50 different attributes, and all other solutions seem to rely on typing these in manually.
The following code creates a column that is the concatenation of all attribute columns (sorted within each row to ignore the order of the attributes), and then calculates the count per group:
library(dplyr)
library(rlang)

cnames <- colnames(data)
cnames <- cnames[2:length(cnames)] # assuming the first column is the only non-attribute column;
                                   # remove any other non-attribute columns as necessary
# !!!syms(cnames) outputs them as the columns rather than text, taken from here
# https://stackoverflow.com/questions/44613279/dplyr-concat-columns-stored-in-variable-mutate-and-non-standard-evaluation?rq=1
data %>%
  rowwise() %>%                                        # sort within each row, not across rows
  mutate(comb = paste0(sort(c(!!!syms(cnames))), collapse = "")) %>%
  ungroup() %>%
  group_by(comb) %>%
  summarise(cnt = n())
You can use dplyr to do this efficiently. First use group_by to group variables a1, a2, and a3, then use summarize and n() to count frequencies:
set.seed(100)
N = 1e5
data <- data.frame(id = 1:N,
                   a1 = sample(letters[1:5], N, replace = T),
                   a2 = sample(letters[1:5], N, replace = T),
                   a3 = sample(letters[1:5], N, replace = T),
                   stringsAsFactors = FALSE)

data %>%
  group_by(a1, a2, a3) %>%
  summarize(count = n()) %>%
  arrange(count)
## A tibble: 125 x 4
## Groups: a1, a2 [25]
# a1 a2 a3 count
# <chr> <chr> <chr> <int>
# 1 b a d 735
# 2 c b d 741
# 3 a d e 747
# 4 d a e 754
# 5 d e e 754
# 6 d e c 756
# 7 e a d 756
# 8 d c d 757
# 9 c c c 758
#10 d a b 759
## ... with 115 more rows

R Improve performance of function(s)

This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
test.dt <- data.table(strt = c(1,1,2,3,5,2), end = c(2,1,5,5,5,4), a1.2 = c(1,2,3,4,5,6),
                      a2.3 = c(2,4,6,8,10,12), a3.4 = c(3,1,2,4,5,1), a4.5 = c(5,1,15,10,12,10),
                      a5.6 = c(4,8,2,1,3,9))
test.dt[, rown := as.numeric(row.names(test.dt))]

test.df <- data.frame(strt = c(1,1,2,3,5,2), end = c(2,1,5,5,5,4), a1.2 = c(1,2,3,4,5,6),
                      a2.3 = c(2,4,6,8,10,12), a3.4 = c(3,1,2,4,5,1), a4.5 = c(5,1,15,10,12,10),
                      a5.6 = c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
  strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1    1   2    1    2    3    5    4    1
2    1   1    2    4    1    1    8    2
3    2   5    3    6    2   15    2    3
4    3   5    4    8    4   10    1    4
5    5   5    5   10    5   12    3    5
6    2   4    6   12    1   10    9    6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
       1        2        3        4        5        6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
  tmp <- paste0("a", x, ".", x + 1)
  tmp1 <- paste0("a", y, ".", y + 1)
  rowMeans(test.dt[rown, get(tmp):get(tmp1), with = FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by = .(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
  rowMeans(test.df[rown, (x+2):(y+2), drop = FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1:length(test.df$strt)) {
  test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1] + 2):as.numeric(test.df[i,2] + 2), drop = FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
  test replications elapsed relative user.self sys.self user.child sys.child
1 sol1          100    0.67    4.786      0.67        0         NA        NA
2 sol2          100    0.14    1.000      0.14        0         NA        NA
3 sol3          100    0.15    1.071      0.16        0         NA        NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
  # remove start/end columns from 'm' matrix
  ma <- m[, -(1:2)]
  # column index for each row in 'ma' matrix
  cm <- col(ma)
  # logical index of whether we need the column for each row
  i <- cm >= m[, 1L] & cm <= m[, 2L]
  # multiply the input matrix by the index matrix and sum it
  # divide by the sum of the index matrix to get the mean
  rowSums(i*ma) / rowSums(i)
}})
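A quick usage sketch on the question's small example (dropping the helper rown column so only strt, end and the value columns remain; this should reproduce the k vector from the question):

m_small <- as.matrix(test.df[, c("strt", "end", "a1.2", "a2.3", "a3.4", "a4.5", "a5.6")])
f_jmu(m_small)
## 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667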
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int i = 0; i < nrow; i++) {
    double avg = 0;
    int start = x(i,0);
    int end = x(i,1);
    int N = end - start + 1;
    while(start <= end){
      avg += x(i, start + 1);
      start = start + 1;
    }
    out[i] = avg/N;
  }
  return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
          expr      min        lq      mean   median       uq       max neval
          f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951  4735.851   100
          f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882   100
          f4()  281.541  315.2680  364.2197  345.328  375.877  1089.994   100
 MYrcpp(testM)    3.422   10.0205   16.7708   19.552   21.507    56.700   100
Where f2(), f3() and f4() are defined as
f2 <- function(){
  func.df <- function(rown, x, y) {
    rowMeans(test.df[rown, (x+2):(y+2), drop = FALSE])
  }
  k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}

f3 <- function(){
  test.ave <- rep(NA, length(test.df$strt))
  for (i in 1:length(test.df$strt)) {
    test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1] + 2):as.numeric(test.df[i,2] + 2), drop = FALSE])
  }
}

f4 <- function(){
  lapply(
    apply(test.df, 1, function(x){
      x[(x[1]+2):(x[2]+2)]}),
    mean)
}
That's roughly a 20x increase over the fastest.
Note, to implement the above code you'll need a C compiler which R can access. For Windows, look into Rtools. For more on Rcpp read this.
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)

microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
          expr       min         lq        mean      median          uq       max neval
          f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49   100
 MYrcpp(testM)   196.003    216.533    242.6732    235.107    261.0125    499.54   100
With 5e3 rows, MYrcpp is now 550x faster. This is partially due to the fact that f4() is not going to scale well, as Richard discusses in the comments: f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code also contains a nested loop, using a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  out <- matrix(NA, nrow = 1, ncol = nrow)
  for(i in 1:nrow){
    avg <- 0
    start <- x[i,1]
    end <- x[i,2]
    N <- end - start + 1
    while(start <= end){
      avg <- avg + x[i, start + 2]
      start = start + 1
    }
    out[i] <- avg/N
  }
  out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the class of a variable when you introduce it, i.e. int nrow = x.nrow() is a variable named nrow, whose class is int (i.e. integer), and which is assigned x.nrow(), i.e. the number of rows of x. (Ignore this if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. Like in Python, you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add to together and assign, i.e. a += b is the same as a = a + b
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
  lapply(
    apply(test.df, 1, function(x){
      x[(x[1]+2):(x[2]+2)]}),
    mean),
  test.dt[, func.dt(rown, strt, end), by = .(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.

The which function in R is not giving the desired output

I have a matrix that contains 3 columns and in total 10,000 elements. First and second columns are indexes and third column is the score. I want to normalize the score column based on this formula:
Normalized_score_i_j = score_i_j / (sqrt(score_i_i) * sqrt(score_j_j))
score_i_j = the current score itself
score_i_i = look at current score's index in first column, and in the dataset look for a score that has that index in both its first and second columns
score_j_j = look at current score's index in second column, and in the dataset look for a score that has that index in both its first and second columns
An example is for instance, if df is as follow:
df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")
If we are normalizing this row:
First.Protein Second.Protein Score
4 3 99
The normalized score will be:
The score itself divided by the sqrt of a score that its First.Protein and Second.Protein index are both 4 multiplied by the sqrt of a score where its First.Protein and Second.Protein indexes are both 3.
Therefore:
Normalized = 99 / (sqrt(50) * sqrt(91)) = 1.467674
I have the code below, but it is behaving very weirdly and is giving me values that are not at all normalized and are in fact very odd:
for(i in 1:nrow(Smith_Waterman_Scores))
{
  Smith_Waterman_Scores$Score[i] <-
    Smith_Waterman_Scores$Score[i] /
    (sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$First.Protein[i] &
                                            Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$First.Protein[i])])) *
    (sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$Second.Protein[i] &
                                            Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$Second.Protein[i])]))
}
Here's a re-write of your original attempt (which() is not necessary; just use the logical vector for sub-setting; with() allows you to refer to variables in the data frame without having to re-type the name of the data.frame -- easier to read but also easier to make a mistake)
orig0 <- function(df) {
  for(i in 1:nrow(df)) {
    df$Score[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] &
        Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
        Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  df$Score
}
The problem is that Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated. Here's a revision where the original columns are interpreted as 'read-only'
orig1 <- function(df) {
  normalized <- numeric(nrow(df)) # pre-allocate
  for(i in 1:nrow(df)) {
    normalized[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] &
        Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
        Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  normalized
}
I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value
orig2 <- function(df) {
  sapply(seq_len(nrow(df)), function(i) {
    with(df, {
      ii <- First.Protein == First.Protein[i] &
        Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] &
        Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  })
}
Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.
Several of the answers attempt to avoid that polynomial scaling. My answer does this using match() on a vector of values; this probably scales as N (each look-up occurs in constant time, and there are N look-ups), which is much better than polynomial.
Create a subset of data with identical first and second proteins
ii = df[df$First.Protein == df$Second.Protein,]
Here's the ijth score from the original data frame
s_ij = df$Score
Look up First.Protein of df in ii and record the score; likewise for Second.Protein
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
The normalized scores are then
> s_ij / (sqrt(s_ii) * sqrt(s_jj))
[1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
[8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
This will be fast, using a single call to match() instead of many calls to which() inside a for loop or tests for identity inside an apply() -- both of the latter make N^2 comparisons and so scale very poorly.
I summarized some of the proposed solutions as
f0 <- function(df) {
  contingency = xtabs(Score ~ ., df)
  diagonals <- unname(diag(contingency))
  i <- df$First.Protein
  j <- df$Second.Protein
  idx <- matrix(c(i, j), ncol = 2)
  contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}

f1 <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein,]
  s_ij = df$Score
  s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
  s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
  s_ij / (sqrt(s_ii) * sqrt(s_jj))
}

f2 <- function(dt) {
  dt.lookup <- dt[First.Protein == Second.Protein]
  setkey(dt, "First.Protein")
  setkey(dt.lookup, "First.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score1")
  dt <- dt[dt.lookup]
  setkey(dt, "Second.Protein")
  setkey(dt.lookup, "Second.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score2")
  dt[dt.lookup][
    , Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
    , .(First.Protein, Second.Protein, Normalized)]
}

f3 <- function(dt) {
  eq = dt[First.Protein == Second.Protein]
  dt[eq, Score_ii := i.Score, on = "First.Protein"]
  dt[eq, Score_jj := i.Score, on = "Second.Protein"]
  dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
  dt[, c("Score_ii", "Score_jj") := NULL]
}
I know how to programmatically check that the first two generate consistent results; I don't know data.table well enough to get the normalized result out in the same order as the input columns for f2() so can't compare with the others (though they look correct 'by eye'). f3() produces numerically similar but not identical results
> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalized"]] }) # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalized"]] })
[1] TRUE
There are performance differences
library(data.table)
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)
library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
with
> microbenchmark(f0(df), f1(df), f2(df), f3(df))
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 f0(df)  967.117  992.8365 1059.7076 1030.9710 1094.247 2384.360   100
 f1(df)  176.238  192.8610  210.4059  207.8865  219.687  333.260   100
 f2(df) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975   100
 f3(df) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430   100
The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).
On reflection, a more straightforward implementation of f1() just looks up the 'diagonal' elements
f1a <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein, ]
  d = sqrt(ii$Score[order(ii$First.Protein)])
  df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
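A quick consistency check between the two variants (using the df from the question):

all.equal(f1(df), f1a(df))   # expected TRUE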
You may be doing this in a very round-about manner. Can you see if this works for you:
R> xx
First Second Score
1 1 1 25
2 1 2 90
3 1 3 82
4 1 4 19
5 2 1 90
6 2 2 99
7 2 3 76
8 2 4 79
9 3 1 82
10 3 2 76
11 3 3 91
12 3 4 33
13 4 1 28
14 4 2 11
15 4 3 99
16 4 4 50
R> contingency = xtabs(Score ~ ., data=xx)
R> contingency
Second
First 1 2 3 4
1 25 90 82 19
2 90 99 76 79
3 82 76 91 33
4 28 11 99 50
R> diagonals <- unname(diag(contingency))
R> diagonals
[1] 25 99 91 50
R> normalize <- function (i, j, contingencies, diagonals) {
+ contingencies[i, j] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
+ }
R> normalize(4, 3, contingency, diagonals)
[1] 1.467674
Here's how I'd approach this using data.table. Hopefully @MartinMorgan finds this easier to understand :-).
require(data.table) # v1.9.6+
dt = as.data.table(df) # or use setDT(df) to convert by reference
eq = dt[First.Protein == Second.Protein]
So far, I've just created a new data.table eq which contains all rows where both columns are equal.
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
Here we add columns Score_ii and Score_jj while joining on columns First.Protein and Second.Protein. That it is a join operation should be clear because of the on= argument. The i. refers to the Score column in the data.table provided in the i-argument (here, eq's Score).
Note that we could use match() here as well, but that wouldn't work if you have to look up directly (and as efficiently) based on more than one column. Using on=, we can extend this quite easily, and it is also much easier to read/understand.
Once we've all the required columns, the task is just to get the final Normalised column (and delete the intermediates if they're not necessary).
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL] # delete if you don't want them
I'll leave out the micro- and milli- second benchmarks as I'm not interested in them.
PS: The columns Score_ii and Score_jj are added above on purpose under the assumption that you might need them. If you don't want them at all, you can also do:
Score_ii = eq[dt, Score, on = "First.Protein"] ## -- (1)
Score_jj = eq[dt, Score, on = "Second.Protein"]
(1) reads: for each row in dt get matching row in eq while matching on column First.Protein and extract eq$Score corresponding to that matching row.
Then, we can directly add the Normalised column as:
dt[, Normalised := Score / sqrt(Score_ii * Score_jj)]
You can implement this with joins; here is an example using data.table:
library(data.table)
dt <- data.table(df)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
, .(First.Protein, Second.Protein, Normalized)]
Just make sure you don't use for loops.
Loop through rows using apply:
#compute
df$ScoreNorm <-
  apply(df, 1, function(i){
    i[3] /
      (
        sqrt(df[df$First.Protein == i[1] &
                df$Second.Protein == i[1], "Score"]) *
        sqrt(df[df$First.Protein == i[2] &
                df$Second.Protein == i[2], "Score"])
      )
  })

#test output
df[15, ]
#    First.Protein Second.Protein Score ScoreNorm
# 15             4              3    99  1.467674

faster way to compare rows in a data frame

Consider the data frame below. I want to compare each row with rows below and then take the rows that are equal in more than 3 values.
I wrote the code below, but it is very slow if you have a large data frame.
How could I do that faster?
data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9),
                             nrow = 5, byrow = T))
rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")
> data
         V1 V2 V3 V4 V5
sample_1 10 11 10 13  9
sample_2 10 11 10 14  9
sample_3 10 10  8 12  9
sample_4 10 11 10 13  9
sample_5 13 13 10 13  9
output <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
for(i in 1:nrow(data)) {
  sample <- data[i, ]
  for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
    matches <- 0
    for(V in 1:ncol(data)) {
      if(data[j,V] == sample[,V]) {
        matches <- matches + 1
      }
    }
    if(matches > 3) {
      duplicate <- data[j, ]
      pair <- cbind(rownames(sample), rownames(duplicate), matches)
      output[dfrow, ] <- pair
      dfrow <- dfrow + 1
    }
  }
}
> output
    sample duplicate matches
1 sample_1  sample_2       4
2 sample_1  sample_4       5
3 sample_2  sample_4       4
Here is an Rcpp solution. However, if the result matrix gets too big (i.e., there are too many hits), this will throw an error. I run the loops twice, first to get the necessary size of the result matrix and then to fill it. There is probably a better possibility. Also, obviously, this will only work with integers. If your matrix is numeric, you'll have to deal with floating point precision.
library(Rcpp)
library(inline)

#C++ code:
body <- '
const IntegerMatrix M(as<IntegerMatrix>(MM));
const int m = M.ncol(), n = M.nrow();
long count1;
int count2;
count1 = 0;
for (int i=0; i<(n-1); i++)
{
  for (int j=(i+1); j<n; j++)
  {
    count2 = 0;
    for (int k=0; k<m; k++) {
      if (M(i,k)==M(j,k)) count2++;
    }
    if (count2>3) count1++;
  }
}
IntegerMatrix R(count1,3);
count1 = 0;
for (int i=0; i<(n-1); i++)
{
  for (int j=(i+1); j<n; j++)
  {
    count2 = 0;
    for (int k=0; k<m; k++) {
      if (M(i,k)==M(j,k)) count2++;
    }
    if (count2>3) {
      count1++;
      R(count1-1,0) = i+1;
      R(count1-1,1) = j+1;
      R(count1-1,2) = count2;
    }
  }
}
return wrap(R);
'

fun <- cxxfunction(signature(MM = "matrix"),
                   body, plugin = "Rcpp")

#with your data
fun(as.matrix(data))
#      [,1] [,2] [,3]
# [1,]    1    2    4
# [2,]    1    4    5
# [3,]    2    4    4
#Benchmarks
set.seed(42)
mat1 <- matrix(sample(1:10, 250*26, TRUE), ncol = 26)
mat2 <- matrix(sample(1:10, 2500*26, TRUE), ncol = 26)
mat3 <- matrix(sample(1:10, 10000*26, TRUE), ncol = 26)
mat4 <- matrix(sample(1:10, 25000*26, TRUE), ncol = 26)

library(microbenchmark)
microbenchmark(
  fun(mat1),
  fun(mat2),
  fun(mat3),
  fun(mat4),
  times = 3
)
# Unit: milliseconds
#      expr          min           lq       median           uq          max neval
# fun(mat1)     2.675568     2.689586     2.703603     2.732487     2.761371     3
# fun(mat2)   272.600480   274.680815   276.761151   276.796217   276.831282     3
# fun(mat3)  4623.875203  4643.634249  4663.393296  4708.067638  4752.741979     3
# fun(mat4) 29041.878164 29047.151348 29052.424532 29235.839275 29419.254017     3
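Regarding the integer-only caveat mentioned at the top of this answer: one hedged workaround for numeric data is to compare values at a fixed tolerance by converting them to scaled integers before calling fun(); the scale factor below is an arbitrary assumption.

num_data <- matrix(runif(250 * 26), ncol = 26)          # example numeric input
int_data <- matrix(as.integer(round(num_data * 1e6)),   # ~6 decimal places of tolerance
                   ncol = ncol(num_data))
res <- fun(int_data)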
EDIT: Not sure what I was thinking last night when I subtracted rows, considering I could've directly tested for equality. I removed that unnecessary step from the code below.
Here is one approach that may either be slightly clever or poorly thought out... but hopefully the former. The idea is that instead of doing a series of comparisons row-by-row you can instead perform some vectorized operations by subtracting the row from the rest of the data frame and then looking at the number of elements that are equal to zero. Here is a simple implementation of the approach:
> library(data.table)
> data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9),
+                              nrow = 5, byrow = T))
> rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")
>
> findMatch <- function(i, n){
+   tmp <- colSums(t(data[-(1:i),]) == unlist(data[i,]))
+   tmp <- tmp[tmp > n]
+   if(length(tmp) > 0) return(data.table(sample = rownames(data)[i], duplicate = names(tmp), match = tmp))
+   return(NULL)
+ }
>
> system.time(tab <- rbindlist(lapply(1:(nrow(data)-1), findMatch, n = 3)))
   user  system elapsed
  0.003   0.000   0.003
> tab
     sample duplicate match
1: sample_1  sample_2     4
2: sample_1  sample_4     5
3: sample_2  sample_4     4
EDIT: Here is version 2, which uses matrices and pre-transposes the data so you only need to do that once. It should scale better to your example with a non-trivial amount of data.
library(data.table)
data <- matrix(round(runif(26*250000, 0, 25)), ncol = 26)
tdata <- t(data)

findMatch <- function(i, n){
  tmp <- colSums(tdata[, -(1:i)] == data[i,])
  j <- which(tmp > n)
  # j indexes the remaining rows i+1..N, so add i to recover the original row number
  if(length(j) > 0) return(data.table(sample = i, duplicate = j + i, match = tmp[j]))
  return(NULL)
}

tab <- rbindlist(lapply(1:(nrow(data)-1), findMatch, n = 3))
I ran that on my machine for a bit and got through the first 1500 iterations of a full 250,000 x 26 matrix in under 15 minutes, using 600 Mb of memory. Since previous iterations do not impact future iterations, you could certainly chunk this into parts and run it separately if needed.
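A hedged sketch of that chunking idea (the chunk size and file paths are arbitrary placeholders):

chunk_size <- 50000
idx <- 1:(nrow(data) - 1)
chunks <- split(idx, ceiling(seq_along(idx) / chunk_size))
for (k in seq_along(chunks)) {
  part <- rbindlist(lapply(chunks[[k]], findMatch, n = 3))
  saveRDS(part, sprintf("matches_chunk_%03d.rds", k))
}
# later: tab <- rbindlist(lapply(list.files(pattern = "^matches_chunk_.*rds$"), readRDS))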
This is not a complete answer, just a quick workaround that comes to mind: use matrices instead of a data.frame (data.frames are quite slow, tbh). Matrices are quite fast in R; doing at least some of the operations on a matrix and keeping the sample names in a separate vector results in a significant speed increase.
Just a quick demo:
data <- matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9),
               nrow = 5, byrow = T)
rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")
mu <- c("sample_1","sample_2","sample_3","sample_4","sample_5")

t = proc.time()
tab <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
for(i in 1:nrow(data)) {
  sample <- data[i, ]
  for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
    matches <- 0
    for(V in 1:ncol(data)) {
      if(data[j,V] == sample[V]) {
        matches <- matches + 1
      }
    }
    if(matches > 3) {
      duplicate <- data[j, ]
      pair <- cbind(mu[i], mu[j], matches)
      tab[dfrow, ] <- pair
      dfrow <- dfrow + 1
    }
  }
}
proc.time() - t
On the average, on my machine, yields
user system elapsed
0.00 0.06 0.06
While in your case I get
user system elapsed
0.02 0.06 0.08
I'm not sure whether there's anything quicker than matrices. You can also play around with parallelisation, but for loops inlined as C++ code are quite often used (package Rcpp).
library(data.table)

# creating the data
dt <- data.table(read.table(textConnection(
"Sample V1 V2 V3 V4 V5
sample_1 10 11 10 13 9
sample_2 10 11 10 14 9
sample_3 10 10 8 12 9
sample_4 10 11 10 13 9
sample_5 13 13 10 13 9"), header = TRUE))

# some constants which will be used frequently
nr = nrow(dt)
nc = ncol(dt) - 1

# list into which we will insert the no. of matches for each sample
# (for example's sake; I still suggest you possibly write the output to a file)
totalmatches <- vector(mode = "list", length = (nr - 1))

# looping over each sample
for (i in 1:(nr - 1))
{
  # all combinations of i with i+1 to nr
  samplematch <- cbind(dt[i], dt[(i+1):nr])

  # renaming the comparison sample columns
  setnames(samplematch, append(colnames(dt), paste0(colnames(dt), "2")))

  # calculating number of matches
  samplematch[, noofmatches := 0]
  for (j in 1:nc)
  {
    samplematch[, noofmatches := noofmatches + 1*(get(paste0("V", j)) == get(paste0("V", j, "2")))]
  }

  # removing individual value columns and matches < 3
  samplematch <- samplematch[noofmatches >= 3, list(Sample, Sample2, noofmatches)]

  # adding to the list
  totalmatches[[i]] <- samplematch
}
The output -
rbindlist(totalmatches)
     Sample  Sample2 noofmatches
1: sample_1 sample_2           4
2: sample_1 sample_4           5
3: sample_1 sample_5           3
4: sample_2 sample_4           4
5: sample_4 sample_5           3
The performance on matrices seems to be better though, this method clocked -
user system elapsed
0.17 0.01 0.19
Everything that has been said in the comments is very valid; in particular, I also don't necessarily think R is the best place to do this. That said, this works a lot quicker for me than what you've posted, on a much larger dataset (~9.7 sec vs. unfinished after two minutes):
data <- matrix(sample(1:30, 10000, replace = TRUE), ncol = 5)

#Pre-prepare
#(the result container wasn't shown in the original post; pre-allocating it here is
# an assumption so the loop runs -- enlarge if you expect more matches)
output <- matrix(NA_real_, nrow = 1e5, ncol = 3)
x <- 1

#Loop
for(i in seq(nrow(data) - 2)){
  #Find the number of matches on that row
  sums <- apply(data[seq(from = -1, to = -i),], 1, function(x) sum(x == data[i,]))
  #Find how many are greater than/equal to 3
  matches <- which(sums >= 3)
  #Prepare output
  output[seq(from = x, length.out = length(matches)), 1] <- rep(i, length(matches))
  output[seq(from = x, length.out = length(matches)), 2] <- matches
  output[seq(from = x, length.out = length(matches)), 3] <- sums[matches]
  #Alter the counter of how many we've made...
  x <- x + length(matches)
}

#Cleanup output
output <- output[!is.na(output[,1]),]
...I'm fairly certain my weird x variable and the assignment of output could be improved/turned into an apply-type problem, but it's late and I'm tired! Good luck!
Well, I took a stab at it, the following code runs about 3 times faster than the original.
f <- function(ind, mydf){
  res <- NULL
  matches <- colSums(t(mydf[-(1:ind),]) == mydf[ind,])
  Ndups <- sum(matches > 3)
  if(Ndups > 0){
    res <- data.frame(sample = rep(ind, Ndups), duplicate = which(matches > 3),
                      matches = matches[matches > 3], stringsAsFactors = F)
    rownames(res) <- NULL
    return(as.matrix(res))
  }
  return(res)
}

f(1, mydf = as.matrix(data))
f(2, mydf = as.matrix(data))

system.time(
  for(i in 1:1000){
    tab <- NULL
    for(j in 1:(dim(data)[1]-1))
      tab <- rbind(tab, f(j, mydf = as.matrix(data)))
  }
)/1000
tab
Assuming that all the entries in your dataset are of the same mode (numeric), turn it into a matrix. By transposing, you can take advantage of how == can be vectorized.
data <- as.matrix(data)
data <- t(data)

output <- lapply(seq_len(ncol(data) - 1), function(x) {
  tmp <- data[, x] == data[, (x+1):ncol(data)]
  n_matches <- {
    if (x == ncol(data) - 1) {
      setNames(sum(tmp), colnames(data)[ncol(data)])
    } else {
      colSums(tmp)
    }
  }
  good_matches <- n_matches[n_matches >= 3]
})
The big question is how to output the results. As it stands I have your data in a list. I would think that this is the least memory-intensive way of storing your data.
[[1]]
sample_2 sample_4 sample_5
4 5 3
[[2]]
sample_4
4
[[3]]
named numeric(0)
[[4]]
sample_5
3
If you want a data frame output, then you'll want to tweak the return value of the function within lapply. Perhaps add in the last line of the function:
return(data.frame(
  sample = colnames(data)[x],
  duplicate = names(good_matches),
  noofmatches = good_matches,
  stringsAsFactors = FALSE))
And then use:
newoutput <- do.call(rbind, output)
## or, using plyr
# require(plyr)
# newoutput <- rbind.fill(output)
