Maximum distance between the rows of a matrix after removing some values - r

Suppose we have a matrix like below,
A <- matrix(c(1,7,13,19,9,5,8,14,20,10,3,4,15,21,1,2,4,16,22,2,8,3,17,23,1,6,3,18,24,2), nrow=5)
A
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 3 2 8 6
[2,] 7 8 4 4 3 3
[3,] 13 14 15 16 17 18
[4,] 19 20 21 22 23 24
[5,] 9 10 1 2 1 2
The dist function can calculate the maximum absolute distance between the rows of matrix A and return the distance matrix D using dist(A, method = "maximum"), so that D[i,j] = \max_{k}(|A[i,k]-A[j,k]|). For example,
D[1,2] = max( abs( A[1,] - A[2,] ) ) = max(6, 3, 1, 2, 5, 3) = 6
However, in my case I first need to remove the i-th and j-th elements, i.e. D[i,j] = \max_{k \ne i, j}(|A[i,k]-A[j,k]|). In the above example, the answer becomes
D[1,2] = max( abs( A[1, -c(1,2)] - A[2, -c(1,2)] ) ) = max( 1, 2, 5, 3 ) = 5
I have no idea how to do this in an efficient way. I know I can use a for loop, but the data set is large and a for loop is extremely slow.
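For reference, the loop-based baseline described above might look like the following sketch (naive_dist is a hypothetical name, not code from the question); it assumes, as in the example, that the matrix has at least as many columns as rows so that columns i and j always exist:
naive_dist <- function(A) {
  n <- nrow(A)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # drop columns i and j before taking the maximum absolute difference
      D[i, j] <- D[j, i] <- max(abs(A[i, -c(i, j)] - A[j, -c(i, j)]))
    }
  }
  as.dist(D)
}
naive_dist(A)  # entry [2, 1] is 5, matching the worked example above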

Assuming that your real matrix, like the example, has more columns than rows, here is a base R implementation of the function you want:
max_dist <- function(mat, i, j) {
  mat <- mat[c(i, j), -c(i, j)]
  max(abs(mat[1L, ] - mat[2L, ]))
}
dist1 <- function(mat) {
  n <- nrow(mat)
  ids <- do.call(rbind, lapply(2:n, function(i, e) cbind(i:e, rep.int(i - 1L, e - i + 1L)), n))
  out <- apply(ids, 1L, function(i) max_dist(mat, i[[1L]], i[[2L]]))
  attributes(out) <- list(
    Size = n, Labels = dimnames(mat)[[1L]], Diag = FALSE,
    Upper = FALSE, method = "dist1", call = match.call(),
    class = "dist"
  )
  out
}
If you think R is not fast enough for your case, then you can use the package parallelDist, which allows user-defined C++ distance functions. Consider the following implementation:
library(parallelDist)
library(RcppXPtrUtils)
library(RcppArmadillo)
mydist_ptr <- cppXPtr("double mydist(const arma::mat &A, const arma::mat &B) {
  arma::uvec ids = {0, (unsigned int)A(0, 0), (unsigned int)B(0, 0)};
  arma::mat A_ = A, B_ = B;
  A_.shed_cols(ids); B_.shed_cols(ids);
  return abs((A_ - B_)).max();
}", depends = "RcppArmadillo")
dist2 <- function(mat) {
  # prepend row numbers to the matrix so that the C++ function `mydist`
  # knows which rows are being compared, and hence which columns to drop
  parDist(cbind(seq_len(nrow(mat)), mat), method = "custom", func = mydist_ptr)
}
Test with the following matrices (small_m is the example in your post):
small_m <- matrix(c(1,5,3,2,8,6,7,8,4,4,3,3,13,14,15,16,17,18,19,20,21,22,23,24,9,10,1,2,1,2), 5, 6, byrow = TRUE)
large_m <- matrix(rnorm(1000000), 10, 100000)
Benchmark
# no real difference between these two implementations when the input matrix is small
> microbenchmark::microbenchmark(dist1(small_m), dist2(small_m))
Unit: microseconds
expr min lq mean median uq max neval cld
dist1(small_m) 77.4 87.10 112.403 106.5 125.95 212.2 100 a
dist2(small_m) 145.5 160.25 177.786 170.2 183.80 286.7 100 b
# `dist2` is faster with large matrix input. However, the efficiency of `dist1` is also acceptable IMO.
> microbenchmark::microbenchmark(dist1(large_m), dist2(large_m))
Unit: milliseconds
expr min lq mean median uq max neval cld
dist1(large_m) 129.7531 139.3909 152.13154 143.0549 149.5870 322.0173 100 b
dist2(large_m) 48.8025 52.5081 55.84333 55.5175 58.6095 67.6470 100 a
Output as follows
> dist1(small_m)
1 2 3 4
2 5
3 14 15
4 18 21 6
5 5 3 16 22
> dist2(small_m)
1 2 3 4
2 5
3 14 15
4 18 21 6
5 5 3 16 22

Here is a base R option using combn + as.dist:
r <- diag(0,nrow(m))
r[lower.tri(r)] <- combn(1:nrow(m),2,function(k) max(abs(do.call(`-`,asplit(m[k,],1)))[-k]))
out <- as.dist(r)
which gives
1 2 3 4
2 5
3 14 15
4 18 21 6
5 5 3 16 22
Data
> dput(m)
structure(c(1, 7, 13, 19, 9, 5, 8, 14, 20, 10, 3, 4, 15, 21,
1, 2, 4, 16, 22, 2, 8, 3, 17, 23, 1, 6, 3, 18, 24, 2), .Dim = 5:6)

Related

Using R to solve the Lucky 26 game

I am trying to show my son how coding can be used to solve a problem posed by a game, as well as seeing how R handles big data. The game in question is called "Lucky 26". In this game the numbers 1-12 (with no duplicates) are positioned on the 12 points of a Star of David (6 vertices, 6 intersections) and the 6 lines of 4 numbers must all add to 26. Of the approximately 479 million possibilities (12P12) there are apparently 144 solutions. I tried to code this in R as follows, but memory seems to be an issue. I would greatly appreciate any advice to advance the answer if members have time. Thanking members in advance.
library(gtools)
x=c()
elements <- 12
for (i in 1:elements)
{
  x[i] <- i
}
soln=c()
y<-permutations(n=elements,r=elements,v=x)
j<-nrow(y)
for (i in 1:j)
{
  L1 <- y[i,1] + y[i,3] + y[i,6] + y[i,8]
  L2 <- y[i,1] + y[i,4] + y[i,7] + y[i,11]
  L3 <- y[i,8] + y[i,9] + y[i,10] + y[i,11]
  L4 <- y[i,2] + y[i,3] + y[i,4] + y[i,5]
  L5 <- y[i,2] + y[i,6] + y[i,9] + y[i,12]
  L6 <- y[i,5] + y[i,7] + y[i,10] + y[i,12]
  soln[i] <- (L1 == 26)&(L2 == 26)&(L3 == 26)&(L4 == 26)&(L5 == 26)&(L6 == 26)
}
z<-which(soln)
z
There are actually 960 solutions. Below we make use of Rcpp, RcppAlgos*, and the parallel package to obtain the solution in just over 6 seconds using 4 cores. Even if you choose to use a single threaded approach with base R's lapply, the solution is returned in around 25 seconds.
First, we write a simple algorithm in C++ that checks a particular permutation. You will note that we use one array to store all six lines. This is for performance as we utilize cache memory more effectively than using 6 individual arrays. You will also have to keep in mind that C++ uses zero based indexing.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
constexpr int index26[24] = {0, 2, 5, 7,
                             0, 3, 6, 10,
                             7, 8, 9, 10,
                             1, 2, 3, 4,
                             1, 5, 8, 11,
                             4, 6, 9, 11};
// [[Rcpp::export]]
IntegerVector DavidIndex(IntegerMatrix mat) {
  const int nRows = mat.nrow();
  std::vector<int> res;
  for (int i = 0; i < nRows; ++i) {
    int lucky = 0;
    for (int j = 0, s = 0, e = 4;
         j < 6 && j == lucky; ++j, s += 4, e += 4) {
      int sum = 0;
      for (int k = s; k < e; ++k)
        sum += mat(i, index26[k]);
      lucky += (sum == 26);
    }
    if (lucky == 6) res.push_back(i);
  }
  return wrap(res);
}
Now, using the lower and upper arguments in permuteGeneral, we can generate chunks of permutations and test these individually to keep memory in check. Below, I have chosen to test about 4.7 million permutations at a time. The output gives the lexicographical indices of the permutations of 12! such that the Lucky 26 condition is satisfied.
library(RcppAlgos)
## N.B. 4790016L evenly divides 12!, so there is no need to check
## the upper bound on the last iteration below
system.time(solution <- do.call(c, parallel::mclapply(seq(1L, factorial(12), 4790016L), function(x) {
  perms <- permuteGeneral(12, 12, lower = x, upper = x + 4790015)
  ind <- DavidIndex(perms)
  ind + x
}, mc.cores = 4)))
user system elapsed
13.005 6.258 6.644
## Foregoing the parallel package and simply using lapply,
## we obtain the solution in about 25 seconds:
## user system elapsed
## 18.495 6.221 24.729
Now, we verify using permuteSample and the argument sampleVec which allows you to generate specific permutations (e.g. if you pass 1, it will give you the first permutation (i.e. 1:12)).
system.time(Lucky26 <- permuteSample(12, 12, sampleVec=solution))
user system elapsed
0.001 0.000 0.001
head(Lucky26)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 2 4 12 8 10 6 11 5 3 7 9
[2,] 1 2 6 10 8 12 4 7 3 5 11 9
[3,] 1 2 7 11 6 8 5 10 4 3 9 12
[4,] 1 2 7 12 5 10 4 8 3 6 9 11
[5,] 1 2 8 9 7 11 4 6 3 5 12 10
[6,] 1 2 8 10 6 12 4 5 3 7 11 9
tail(Lucky26)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[955,] 12 11 5 3 7 1 9 8 10 6 2 4
[956,] 12 11 5 4 6 2 9 7 10 8 1 3
[957,] 12 11 6 1 8 3 9 5 10 7 4 2
[958,] 12 11 6 2 7 5 8 3 9 10 4 1
[959,] 12 11 7 3 5 1 9 6 10 8 2 4
[960,] 12 11 9 1 5 3 7 2 8 10 6 4
Finally, we verify our solution with base R rowSums:
all(rowSums(Lucky26[, c(1, 3, 6, 8)]) == 26)
[1] TRUE
all(rowSums(Lucky26[, c(1, 4, 7, 11)]) == 26)
[1] TRUE
all(rowSums(Lucky26[, c(8, 9, 10, 11)]) == 26)
[1] TRUE
all(rowSums(Lucky26[, c(2, 3, 4, 5)]) == 26)
[1] TRUE
all(rowSums(Lucky26[, c(2, 6, 9, 12)]) == 26)
[1] TRUE
all(rowSums(Lucky26[, c(5, 7, 10, 12)]) == 26)
[1] TRUE
* I am the author of RcppAlgos
For permutations, RcppAlgos is great. Unfortunately, there are 479 million possibilities with 12 fields, which means generating them all takes up too much memory for most people:
library(RcppAlgos)
elements <- 12
permuteGeneral(elements, elements)
#> Error: cannot allocate vector of size 21.4 Gb
There are some alternatives.
1. Take a sample of the permutations. That is, only evaluate 1 million instead of 479 million. To do this, you can use permuteSample(12, 12, n = 1e6); a brief sketch of this appears after this list. See #JosephWood's answer for a somewhat similar approach except he samples out to 479 million permutations ;)
2. Build a loop in Rcpp that evaluates each permutation on creation. This saves memory because the function would return only the matching results.
3. Approach the problem with a different algorithm. I will focus on this option.
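For concreteness, here is a rough sketch of option 1 (an editorial illustration, not part of the original answer). It reuses the six line-index sets from the question; since only 960 of the 12! arrangements are lucky, about 1e6 * 960 / factorial(12), i.e. roughly 2, of the sampled permutations should pass:
library(RcppAlgos)
samp <- permuteSample(12, 12, n = 1e6)
lucky <- rowSums(samp[, c(1, 3, 6, 8)])  == 26 &
  rowSums(samp[, c(1, 4, 7, 11)]) == 26 &
  rowSums(samp[, c(8, 9, 10, 11)]) == 26 &
  rowSums(samp[, c(2, 3, 4, 5)])  == 26 &
  rowSums(samp[, c(2, 6, 9, 12)]) == 26 &
  rowSums(samp[, c(5, 7, 10, 12)]) == 26
samp[lucky, , drop = FALSE]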
New algorithm w/ constraints
Segments should sum to 26
We know that each line segment in the star above needs to add up to 26. We can add that constraint when generating our permutations - keep only the combinations that add up to 26:
# only certain combinations will add to 26
lucky_combo <- comboGeneral(12, 4, comparisonFun = '==', constraintFun = 'sum', limitConstraints = 26L)
ABCD and EFGH groups
In the star above, I have colored three groups differently : ABCD, EFGH, and IJLK. The first two groups also have no points in common and are also on line segments of interest. Therefore, we can add another constraint: for combinations that add up to 26, we need to ensure ABCD and EFGH have no number overlap. IJLK will be assigned the remaining 4 numbers.
library(RcppAlgos)
lucky_combo <- comboGeneral(12, 4, comparisonFun = '==', constraintFun = 'sum', limitConstraints = 26L)
two_combo <- comboGeneral(nrow(lucky_combo), 2)
unique_combos <- !apply(cbind(lucky_combo[two_combo[, 1], ], lucky_combo[two_combo[, 2], ]), 1, anyDuplicated)
grp1 <- lucky_combo[two_combo[unique_combos, 1],]
grp2 <- lucky_combo[two_combo[unique_combos, 2],]
grp3 <- t(apply(cbind(grp1, grp2), 1, function(x) setdiff(1:12, x)))
Permute through the groups
We need to find all permutations of each group - so far we only have combinations that add up to 26. For example, from 1, 2, 11, 12 we also need 1, 2, 12, 11; 1, 12, 2, 11; and so on.
#create group perms (i.e., we need all permutations of grp1, grp2, and grp3)
n <- 4
grp_perms <- permuteGeneral(n, n)
n_perm <- nrow(grp_perms)
# We create all of the permutations of grp1. Then we have to repeat grp1 permutations
# for all grp2 permutations and then we need to repeat one more time for grp3 permutations.
stars <- cbind(
  do.call(rbind, lapply(asplit(grp1, 1), function(x)
    matrix(x[grp_perms], ncol = n)
  ))[rep(seq_len(sum(unique_combos) * n_perm), each = n_perm^2), ],
  do.call(rbind, lapply(asplit(grp2, 1), function(x)
    matrix(x[grp_perms], ncol = n)[rep(1:n_perm, n_perm), ]
  ))[rep(seq_len(sum(unique_combos) * n_perm^2), each = n_perm), ],
  do.call(rbind, lapply(asplit(grp3, 1), function(x)
    matrix(x[grp_perms], ncol = n)[rep(1:n_perm, n_perm^2), ]
  ))
)
colnames(stars) <- LETTERS[1:12]
Final Calculations
The last step is to do the math. I use lapply() and Reduce() here to do more functional programming - otherwise, a lot of code would be typed six times. See the original solution for a more thorough explanation of the math code.
# creating a list will simplify our math as we can use Reduce()
col_ind <- list(c('A', 'B', 'C', 'D'), #these two will always be 26
c('E', 'F', 'G', 'H'), #these two will always be 26
c('I', 'C', 'J', 'H'),
c('D', 'J', 'G', 'K'),
c('K', 'F', 'L', 'A'),
c('E', 'L', 'B', 'I'))
# Determine which permutations result in a lucky star
L <- lapply(col_ind, function(cols) rowSums(stars[, cols]) == 26)
soln <- Reduce(`&`, L)
# A couple of ways to analyze the result
rbind(stars[which(soln),], stars[which(soln), c(1,8, 9, 10, 11, 6, 7, 2, 3, 4, 5, 12)])
table(Reduce('+', L)) * 2
2 3 4 6
2090304 493824 69120 960
Swapping ABCD and EFGH
At the end of the code above, I took advantage of the fact that we can swap ABCD and EFGH to get the remaining permutations. Here is the code to confirm that yes, we can swap the two groups and still be correct:
# swap grp1 and grp2
stars2 <- stars[, c('E', 'F', 'G', 'H', 'A', 'B', 'C', 'D', 'I', 'J', 'K', 'L')]
# do the calculations again
L2 <- lapply(col_ind, function(cols) rowSums(stars2[, cols]) == 26)
soln2 <- Reduce(`&`, L2)
identical(soln, soln2)
#[1] TRUE
#show that col_ind[1:2] always equal 26:
sapply(L, all)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
Performance
In the end, we evaluated only 1.3 million of the 479 million permutations and only shuffled through 550 MB of RAM. It takes around 0.7s to run:
# A tibble: 1 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl>
1 new_algo 688ms 688ms 1.45 550MB 7.27 1 5
Here is another approach. It's based on a MathWorks blog post by Cleve Moler, the author of the first MATLAB.
In the blog post, to save memory the author permutes only 10 elements, keeping the first element as the apex element and the 7th as the base element. Therefore, only 10! == 3628800 permutations need to be tested.
In the code below,
Generate the permutations of elements 1 to 10. There are a total of 10! == 3628800 of them.
Choose 11 as the apex element and keep it fixed. It really doesn't matter where the assignments start, the other elements will be in the right relative positions.
Then assign the 12th element to the 2nd position, 3rd position, etc, in a for loop.
This should produce most of the solutions, give or take rotations and reflections. But it doesn't guarantee that the solutions are unique. It is also reasonably fast.
elements <- 12
x <- seq_len(elements)
p <- gtools::permutations(n = elements - 2, r = elements - 2, v = x[1:10])
i1 <- c(1, 3, 6, 8)
i2 <- c(1, 4, 7, 11)
i3 <- c(8, 9, 10, 11)
i4 <- c(2, 3, 4, 5)
i5 <- c(2, 6, 9, 12)
i6 <- c(5, 7, 10, 12)
result <- vector("list", elements - 1)
for(i in 0:10){
  if(i < 1){
    p2 <- cbind(11, 12, p)
  }else if(i == 10){
    p2 <- cbind(11, p, 12)
  }else{
    p2 <- cbind(11, p[, 1:i], 12, p[, (i + 1):10])
  }
  L1 <- rowSums(p2[, i1]) == 26
  L2 <- rowSums(p2[, i2]) == 26
  L3 <- rowSums(p2[, i3]) == 26
  L4 <- rowSums(p2[, i4]) == 26
  L5 <- rowSums(p2[, i5]) == 26
  L6 <- rowSums(p2[, i6]) == 26
  i_sol <- which(L1 & L2 & L3 & L4 & L5 & L6)
  result[[i + 1]] <- if(length(i_sol) > 0) p2[i_sol, ] else NA
}
result <- do.call(rbind, result)
dim(result)
#[1] 82 12
head(result)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#[1,] 11 12 1 3 10 5 8 9 7 6 4 2
#[2,] 11 12 1 3 10 8 5 6 4 9 7 2
#[3,] 11 12 1 7 6 4 3 10 2 9 5 8
#[4,] 11 12 3 2 9 8 6 4 5 10 7 1
#[5,] 11 12 3 5 6 2 9 10 8 7 1 4
#[6,] 11 12 3 6 5 4 2 8 1 10 7 9
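As a quick sanity check (an editorial addition reusing the index vectors i1 to i6 defined above), the recovered arrangements can be verified in one expression; NA filler rows (positions with no solution, if any) are dropped first:
ok <- complete.cases(result)
all(sapply(list(i1, i2, i3, i4, i5, i6),
           function(ix) all(rowSums(result[ok, ix, drop = FALSE]) == 26)))
# should be TRUE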
Here's the solution for the little fella:
numbersToDrawnFrom = 1:12
bling=0
while(T==T){
  bling=bling+1
  x=sample(numbersToDrawnFrom,12,replace = F)
  A<-x[1]+x[2]+x[3]+x[4] == 26
  B<-x[4]+x[5]+x[6]+x[7] == 26
  C<-x[7] + x[8] + x[9] + x[1] == 26
  D<-x[10] + x[2] + x[9] + x[11] == 26
  E<-x[10] + x[3] + x[5] + x[12] == 26
  F1<-x[12] + x[6] + x[8] + x[11] == 26
  vectorTrue <- c(A,B,C,D,E,F1)
  if(min(vectorTrue)==1){break}
  if(bling == 1000000){break}
}
x
vectorTrue
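A brief editorial aside on the expected running time of this rejection-sampling loop, using the 960 solutions reported above:
# each random draw hits a solution with probability 960 / factorial(12)
factorial(12) / 960                # average number of draws needed: 498960
exp(-1e6 * 960 / factorial(12))    # probability the 1e6 cap is reached without a hit: about 0.13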

Extract multiple ranges from a numeric vector

First, let me simplify my question. I want to extract certain ranges from a numeric vector. For example, extracting 3 ranges from 1:20 at the same time:
1 < x < 5
8 < x < 12
17 < x < 20
Therefore, the expected output is 2, 3, 4, 9, 10, 11, 18, 19.
I tried to use the function findInterval() and the control arguments rightmost.closed and left.open to do that, but no combination of these arguments achieves the goal.
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
x[findInterval(x, v) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19
x[findInterval(x, v, rightmost.closed = T) %% 2 == 1]
# [1] 1 2 3 4 8 9 10 11 17 18 19 20
x[findInterval(x, v, left.open = T) %% 2 == 1]
# [1] 2 3 4 5 9 10 11 12 18 19 20
By the way, the conditions could also be given as a matrix, like this:
[,1] [,2]
[1,] 1 5
[2,] 8 12
[3,] 17 20
I don't want to use a for loop if it's not necessary.
I am grateful for any help.
I'd probably do it using purrr::map2 or Map, passing your lower-bounds and upper-bounds as arguments and filtering your dataset with a custom function
library(purrr)
x <- 1:20
lower_bounds <- c(1, 8, 17)
upper_bounds <- c(5, 12, 20)
map2(
  lower_bounds, upper_bounds, function(lower, upper) {
    x[x > lower & x < upper]
  }
)
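map2() returns a list of three vectors; to collapse it into the single vector shown in the question (an editorial follow-up - the intervals here do not overlap, so no deduplication is needed):
unlist(map2(lower_bounds, upper_bounds, function(lower, upper) x[x > lower & x < upper]))
# [1]  2  3  4  9 10 11 18 19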
You may use data.table::inrange and its incbounds argument. Assuming ranges are in a matrix 'm', as shown in your question:
m <- matrix(v, ncol = 2, byrow = TRUE)
x[data.table::inrange(x, m[ , 1], m[ , 2], incbounds = FALSE)]
# [1] 2 3 4 9 10 11 18 19
You were on the right path, and left.open indeed helps, but rightmost.closed actually concerns only the last interval rather than the right "side" of each interval. Hence, we need to use left.open twice. As you yourself figured out, it looks like an optimal way to do that is
x[findInterval(x, v) %% 2 == 1 & findInterval(x, v, left.open = TRUE) %% 2 == 1]
# [1] 2 3 4 9 10 11 18 19
Clearly there are alternatives. E.g.,
fun <- function(x, v)
if(length(v) > 1) v[1] < x & x < v[2] | fun(x, v[-1:-2]) else FALSE
x[fun(x, v)]
# [1] 2 3 4 9 10 11 18 19
I found an easy way just with sapply() :
x <- 1:20
v <- c(1, 5, 8, 12, 17, 20)
(v.df <- as.data.frame(matrix(v, 3, 2, byrow = T)))
# V1 V2
# 1 1 5
# 2 8 12
# 3 17 20
y <- sapply(x, function(x){
  ind <- (x > v.df$V1 & x < v.df$V2)
  if(any(ind)) x else NA
})
y[!is.na(y)]
# [1] 2 3 4 9 10 11 18 19

Efficient implementation in computing pairwise differences

Suppose I have a data frame as follows:
> foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
> foo
x id
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
6 6 3
7 7 3
8 8 3
9 9 3
I want a very efficient implementation of h(a, b) that computes the sum of (a - xi)*(b - xj) over all pairs xi, xj belonging to the same id class. For example, my current implementation is
h <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff%*%t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) + diag(nrow(foo))
  return(sum(prod*id.indicator))
}
For example, with (a, b) = (0, 1), here is the output from each step in the function
> a.diff
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9
> b.diff
[1] 0 -1 -2 -3 -4 -5 -6 -7 -8
> prod
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 1 2 3 4 5 6 7 8
[2,] 0 2 4 6 8 10 12 14 16
[3,] 0 3 6 9 12 15 18 21 24
[4,] 0 4 8 12 16 20 24 28 32
[5,] 0 5 10 15 20 25 30 35 40
[6,] 0 6 12 18 24 30 36 42 48
[7,] 0 7 14 21 28 35 42 49 56
[8,] 0 8 16 24 32 40 48 56 64
[9,] 0 9 18 27 36 45 54 63 72
> id.indicator
1 2 3 4 5 6 7 8 9
1 1 1 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0
3 0 0 1 1 1 0 0 0 0
4 0 0 1 1 1 0 0 0 0
5 0 0 1 1 1 0 0 0 0
6 0 0 0 0 0 1 1 1 1
7 0 0 0 0 0 1 1 1 1
8 0 0 0 0 0 1 1 1 1
9 0 0 0 0 0 1 1 1 1
In reality, there can be up to 1000 id clusters, each containing at least 40 rows, making this method too inefficient because of the sparse entries in id.indicator and the wasted computations in prod on the off-block-diagonals, which are never used.
I played around a bit. First, your implementation:
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
h <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff%*%t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) +
    diag(nrow(foo))
  return(sum(prod*id.indicator))
}
h(a = 1, b = 0, foo = foo)
#[1] 891
Next, I tried a variant using a proper sparse matrix implementation (via the Matrix package) and functions for the index matrix. I also use tcrossprod which I often find to be a bit faster than a %*% t(b).
library("Matrix")
h2 <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  prod <- tcrossprod(a.diff, b.diff) # the same as a.diff%*%t(b.diff)
  id.indicator <- do.call(bdiag, lapply(table(foo$id), function(n) matrix(1,n,n)))
  return(sum(prod*id.indicator))
}
h2(a = 1, b = 0, foo = foo)
#[1] 891
Note that this function relies on foo$id being sorted.
Lastly, I tried to avoid creating the full n by n matrix.
h3 <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  ids <- unique(foo$id)
  res <- 0
  for (i in seq_along(ids)) {
    indx <- which(foo$id == ids[i])
    res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
  }
  return(res)
}
h3(a = 1, b = 0, foo = foo)
#[1] 891
Benchmarking on your example:
library("microbenchmark")
microbenchmark(h(a = 1, b = 0, foo = foo),
h2(a = 1, b = 0, foo = foo),
h3(a = 1, b = 0, foo = foo))
# Unit: microseconds
# expr min lq mean median uq max neval
# h(a = 1, b = 0, foo = foo) 248.569 261.9530 493.2326 279.3530 298.2825 21267.890 100
# h2(a = 1, b = 0, foo = foo) 4793.546 4893.3550 5244.7925 5051.2915 5386.2855 8375.607 100
# h3(a = 1, b = 0, foo = foo) 213.386 227.1535 243.1576 234.6105 248.3775 334.612 100
Now, in this example, h3 is the fastest and h2 is really slow. For larger problems I would expect both h2 and h3 to beat h, with h3 probably still the winner. While there is plenty of room for further optimization, h3 should already be faster and more memory efficient, so I think you should go for a variant of h3 which does not create unnecessarily large matrices.
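One concrete illustration of that remaining headroom (an editorial sketch, not part of the original answer): within each id group the double sum factorizes, since the sum over i, j of (a - xi)*(b - xj) equals (the sum over i of (a - xi)) times (the sum over j of (b - xj)), so no pairwise products are needed at all:
h4 <- function(a, b, foo) {
  # per-group sum of (a - x) times per-group sum of (b - x), summed over groups
  sum(tapply(a - foo$x, foo$id, sum) * tapply(b - foo$x, foo$id, sum))
}
h4(a = 1, b = 0, foo = foo)
#[1] 891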
tapply lets you apply a function across groups of a vector, and will simplify the results to a matrix or vector if it can. Using tcrossprod to multiply all the combinations within each group, it performs well even on suitably large data:
# setup
set.seed(47)
foo = data.frame(x = 1:9, id = c(1, 1, 2, 2, 2, 3, 3, 3, 3))
foo2 <- data.frame(id = sample(1000, 40000, TRUE), x = rnorm(40000))
h_OP <- function(a, b, foo){
  a.diff = a - foo$x
  b.diff = b - foo$x
  prod = a.diff %*% t(b.diff)
  id.indicator = as.matrix(ifelse(dist(foo$id, diag = T, upper = T),0,1)) + diag(nrow(foo))
  return(sum(prod * id.indicator))
}
h3_AEBilgrau <- function(a, b, foo) {
  a.diff <- a - foo$x
  b.diff <- b - foo$x
  ids <- unique(foo$id)
  res <- 0
  for (i in seq_along(ids)) {
    indx <- which(foo$id == ids[i])
    res <- res + sum(tcrossprod(a.diff[indx], b.diff[indx]))
  }
  return(res)
}
h_d.b <- function(a, b, foo){
  sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x))))
}
h_alistaire <- function(a, b, foo){
  sum(tapply(foo$x, foo$id, function(x){sum(tcrossprod(a - x, b - x))}))
}
All return the same thing, and are not that different on small data:
h_OP(0, 1, foo)
#> [1] 891
h3_AEBilgrau(0, 1, foo)
#> [1] 891
h_d.b(0, 1, foo)
#> [1] 891
h_alistaire(0, 1, foo)
#> [1] 891
# small data test
microbenchmark::microbenchmark(
h_OP(0, 1, foo),
h3_AEBilgrau(0, 1, foo),
h_d.b(0, 1, foo),
h_alistaire(0, 1, foo)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> h_OP(0, 1, foo) 143.749 157.8895 189.5092 189.7235 214.3115 262.258 100 b
#> h3_AEBilgrau(0, 1, foo) 80.970 93.8195 112.0045 106.9285 125.9835 225.855 100 a
#> h_d.b(0, 1, foo) 355.084 381.0385 467.3812 437.5135 516.8630 2056.972 100 c
#> h_alistaire(0, 1, foo) 148.735 165.1360 194.7361 189.9140 216.7810 287.990 100 b
On bigger data, the differences become more stark, though. The original threatened to crash my laptop, but here are benchmarks for the fastest two:
# on 1k groups, 40k rows
microbenchmark::microbenchmark(
h3_AEBilgrau(0, 1, foo2),
h_alistaire(0, 1, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h3_AEBilgrau(0, 1, foo2) 336.98199 403.04104 412.06778 410.52391 423.33008 443.8286 100 b
#> h_alistaire(0, 1, foo2) 14.00472 16.25852 18.07865 17.22296 18.09425 96.9157 100 a
Another possibility is to use a data.frame to summarize by group, then sum the appropriate column. In base R you'd do this with aggregate, but dplyr and data.table are popular for making such an approach simpler with more complicated aggregations.
aggregate is slower than tapply. dplyr is faster than aggregate, but still slower than tapply. data.table, which is designed for speed, is almost exactly as fast as tapply.
library(dplyr)
library(data.table)
h_aggregate <- function(a, b, foo){sum(aggregate(x ~ id, foo, function(x){sum(tcrossprod(a - x, b - x))})$x)}
tidy_h <- function(a, b, foo){foo %>% group_by(id) %>% summarise(x = sum(tcrossprod(a - x, b - x))) %>% select(x) %>% sum()}
h_dt <- function(a, b, foo){setDT(foo)[, .(x = sum(tcrossprod(a - x, b - x))), by = id][, sum(x)]}
microbenchmark::microbenchmark(
h_alistaire(1, 0, foo2),
h_aggregate(1, 0, foo2),
tidy_h(1, 0, foo2),
h_dt(1, 0, foo2)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> h_alistaire(1, 0, foo2) 13.30518 15.52003 18.64940 16.48818 18.13686 62.35675 100 a
#> h_aggregate(1, 0, foo2) 93.08401 96.61465 107.14391 99.16724 107.51852 143.16473 100 c
#> tidy_h(1, 0, foo2) 39.47244 42.22901 45.05550 43.94508 45.90303 90.91765 100 b
#> h_dt(1, 0, foo2) 13.31817 15.09805 17.27085 16.46967 17.51346 56.34200 100 a
a <- 0; b <- 1  # as in the example above
sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x))))
#[1] 891
#TESTING
foo = data.frame(x = sample(1:9,10000,replace = TRUE),
id = sample(1:3, 10000, replace = TRUE))
system.time(sum(sapply(split(foo, foo$id), function(d) sum(outer(a-d$x, b-d$x)))))
# user system elapsed
# 0.15 0.01 0.17

Quick manipulation of data frame in R

I have the following example data frame:
> a = data.frame(a=c(1, 2, 3), b=c(10, 11, 12), c=c(1, 1, 0))
> a
a b c
1 1 10 1
2 2 11 1
3 3 12 0
For every row, if a$c == 1 then a$a should be set to a$b; otherwise a$a keeps its value. The final data frame a should look like this:
> a
a b c
1 10 10 1
2 11 11 1
3 3 12 0
What is the fastest way to do this? Of course in my problem I have hundreds of thousands of rows, so looping over the data frame row by row is extremely slow.
Thanks!
Easy as 1-2-3:
df = data.frame(a=c(1, 2, 3), b=c(10, 11, 12), c=c(1, 1, 0))
df$a[df$c == 1] <- df$b[df$c == 1]
df
## a b c
## 1 10 10 1
## 2 11 11 1
## 3 3 12 0
It reads: substitute all the elements in a corresponding to c==1 with all the elements in b corresponding to c==1.
A benchmark:
df <- data.frame(a=runif(100000), b=runif(100000), c=sample(c(1,0), 100000, replace=TRUE))
library(microbenchmark)
microbenchmark(df$a[df$c == 1] <- df$b[df$c == 1], df$a <- with(df, ifelse(c == 1, b, a)))
## Unit: milliseconds
## expr min lq median uq max neval
## df$a[df$c == 1] <- df$b[df$c == 1] 13.85375 15.13073 16.61701 74.5387 88.47949 100
## df$a <- with(df, ifelse(c == 1, b, a)) 44.23750 78.85029 103.01894 105.1750 118.09492 100
a$a <- with(a, ifelse(c == 1, b, a))

Add new row to dataframe, at specific row-index, not appended?

The following code combines a vector with a dataframe:
newrow = c(1:4)
existingDF = rbind(existingDF,newrow)
However this code always inserts the new row at the end of the dataframe.
How can I insert the row at a specified point within the dataframe? For example, let's say the dataframe has 20 rows, how can I insert the new row between rows 10 and 11?
Here's a solution that avoids the (often slow) rbind call:
existingDF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
r <- 3
newrow <- seq(4)
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}
> insertRow(existingDF, newrow, r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
If speed is less important than clarity, then #Simon's solution works well:
existingDF <- rbind(existingDF[1:r,],newrow,existingDF[-(1:r),])
> existingDF
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 1 2 3 4
41 4 9 14 19
5 5 10 15 20
(Note we index r differently).
And finally, benchmarks:
library(microbenchmark)
microbenchmark(
rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
insertRow(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 660.131 678.3675 695.5515 725.2775 928.299
2 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 801.161 831.7730 854.6320 881.6560 10641.417
Benchmarks
As #MatthewDowle always points out to me, benchmarks need to be examined for the scaling as the size of the problem increases. Here we go then:
benchmarkInsertionSolutions <- function(nrow=5,ncol=4) {
  existingDF <- as.data.frame(matrix(seq(nrow*ncol),nrow=nrow,ncol=ncol))
  r <- 3 # Row to insert into
  newrow <- seq(ncol)
  m <- microbenchmark(
    rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
    insertRow(existingDF,newrow,r),
    insertRow2(existingDF,newrow,r)
  )
  # Now return the median times
  mediansBy <- by(m$time,m$expr, FUN=median)
  res <- as.numeric(mediansBy)
  names(res) <- names(mediansBy)
  res
}
nrows <- 5*10^(0:5)
benchmarks <- sapply(nrows,benchmarkInsertionSolutions)
colnames(benchmarks) <- as.character(nrows)
library(reshape2)
library(ggplot2)
ggplot( melt(benchmarks), aes(x=Var2,y=value,colour=Var1) ) + geom_line() + scale_x_log10() + scale_y_log10()
#Roland's solution scales quite well, even with the call to rbind:
5 50 500 5000 50000 5e+05
insertRow2(existingDF, newrow, r) 549861.5 579579.0 789452 2512926 46994560 414790214
insertRow(existingDF, newrow, r) 895401.0 905318.5 1168201 2603926 39765358 392904851
rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 787218.0 814979.0 1263886 5591880 63351247 829650894
(The timing curves, plotted on a linear scale and on a log-log scale, are not reproduced here.)
insertRow2 <- function(existingDF, newrow, r) {
  existingDF <- rbind(existingDF,newrow)
  existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
  row.names(existingDF) <- 1:nrow(existingDF)
  return(existingDF)
}
insertRow2(existingDF,newrow,r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
microbenchmark(
  rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
  insertRow(existingDF,newrow,r),
  insertRow2(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 513.157 525.6730 531.8715 544.4575 1409.553
2 insertRow2(existingDF, newrow, r) 430.664 443.9010 450.0570 461.3415 499.988
3 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 606.822 625.2485 633.3710 653.1500 1489.216
The .before argument in dplyr::add_row can be used to specify the row.
dplyr::add_row(
cars,
speed = 0,
dist = 0,
.before = 3
)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 0 0
#> 4 7 4
#> 5 7 22
#> 6 8 16
#> ...
You should try the dplyr package:
library(dplyr)
a <- data.frame(A = c(1, 2, 3, 4),
B = c(11, 12, 13, 14))
system.time({
for (i in 50:1000) {
b <- data.frame(A = i, B = i * i)
a <- bind_rows(a, b)
}
})
Output
user system elapsed
0.25 0.00 0.25
In contrast, here is the same loop using the rbind function:
a <- data.frame(A = c(1, 2, 3, 4),
B = c(11, 12, 13, 14))
system.time({
for (i in 50:1000) {
b <- data.frame(A = i, B = i * i)
a <- rbind(a, b)
}
})
Output
user system elapsed
0.49 0.00 0.49
There is some performance gain.
To insert a blank row at a given position of a data frame (here, position 5), you can use the insertRows function from the berryFunctions package:
library(berryFunctions)
df <- insertRows(df, 5 , new = "")
