Efficient subsetting in R using 2 dataframes

I have a big time series full in one dataframe and a list of timestamps in a different dataframe test. I need to subset full with data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the below, which was wrong
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
Looking at the result, I realized that R compares the two vectors element by element (recycling the shorter one), which gives the wrong result. My alternative is to write a loop like the one below:
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
I feel that there might be a better way to do loops and this article implores us to avoid R loops as much as possible. The other reason is I might be hitting up against performance issues as this would be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.
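To see why the first attempt goes wrong, here is a small sketch with made-up vectors (x and y are illustrative stand-ins, not the data from the question). It shows how R recycles the shorter vector in an element-wise comparison instead of testing every combination:

```r
x <- 1:6          # stands in for full$dt
y <- c(2, 5)      # stands in for test$dt

# y is recycled to 2 5 2 5 2 5, so each x[k] is compared with only one y value
recycled <- x > y

# testing every x against every y needs an explicit cross comparison
every_pair <- outer(x, y, ">")        # 6 x 2 logical matrix
any_match  <- rowSums(every_pair) > 0 # TRUE where x exceeds at least one y

print(recycled)
print(any_match)
```

The two results differ at x = 4, which is larger than 2 but is only compared with 5 in the recycled version.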
EDIT:
Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.
#create a times series
full <- data.frame(seq(1:200),rnorm(200,0,1))
colnames(full)<-c("dt","val")
#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")
# my range around the points of interest
i<-3
#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
#this works, but not sure this is the best way to go about it
subs<-data.frame()
for (j in test$dt)
subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))
EDIT:
I updated the values to better reflect my use case, and I see @mrdwab's solution pulling ahead unexpectedly and by a wide margin.
I am using the benchmark code from @mrdwab, and the initialization is as follows:
set.seed(1)
full <- data.frame(
dt = 1:15000000,
val = floor(rnorm(15000000,0,1))
)
test <- data.frame(dt = floor(runif(24,1,15000000)))
i <- 500
The benchmarks are:
       test replications elapsed relative
2    mrdwab            2    1.31  1.00000
3 spacedman            2   69.06 52.71756
1    andrie            2   93.68 71.51145
4  original            2  114.24 87.20611
Totally unexpected. Mind = blown. Can someone please shed some light on this dark corner and explain what is happening?
Important: As @mrdwab notes below, his solution works only if the vectors are integers. If not, @spacedman has the right solution.
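For non-integer timestamps, one vectorised alternative can be sketched with findInterval. This is not one of the answers in this thread; the helper name in_window and the sample data are assumptions made for illustration. It looks up the nearest point of interest on each side of every timestamp and applies the same strict comparisons as the loop:

```r
# Hypothetical helper: TRUE where x lies strictly within i of any center
in_window <- function(x, centers, i) {
  centers <- sort(centers)
  n <- length(centers)
  k <- findInterval(x, centers)                           # number of centers <= x
  left  <- ifelse(k >= 1, centers[pmax(k, 1)], -Inf)      # nearest center at or below x
  right <- ifelse(k < n,  centers[pmin(k + 1, n)], Inf)   # nearest center above x
  (x < left + i) | (x > right - i)
}

# made-up non-integer data
set.seed(1)
full <- data.frame(dt = sort(runif(200, 1, 200)), val = rnorm(200))
test <- data.frame(dt = seq(5, 200, by = 23) + 0.5)
i <- 3

subs <- full[in_window(full$dt, test$dt, i), ]
```

Each row is returned once, even if windows overlap, which differs from the rbind loop (that one repeats rows falling into several windows).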

Here's a real R way to do it. Functionally. No loops...
Starting with Andrie's example data.
First, an interval comparison function:
> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
An OR composition function:
> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
Now there's sort of a loop here, to construct a list of those comparison functions:
> funs = mapply(cf,test$dt-i,test$dt+i)
Now combine all those into one function:
> anyF = Reduce(OR,funs)
And now we apply the OR composition to our interval testing functions:
> head(full[anyF(full$dt),])
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
What you've got now is a function of a single variable that tests if the value is in the ranges you defined.
> anyF(1:10)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I don't know if this is faster, or better, or what. Someone do some benchmarks!
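As a variation on the same idea (a sketch, not part of the original answer), the intervals can also be combined by OR-ing plain logical vectors instead of composing closures. The data setup below repeats the example data used in this thread:

```r
set.seed(1)
full <- data.frame(dt = 1:200, val = rnorm(200, 0, 1))
test <- data.frame(dt = seq(5, 200, by = 23))
i <- 3

# one logical vector per interval, then OR them all together
hits <- Reduce(`|`, Map(function(l, u) full$dt > l & full$dt < u,
                        test$dt - i, test$dt + i))

head(full[hits, ])
```

This keeps one logical vector of length nrow(full) per interval in memory while reducing, so it trades the function-call overhead of the closures for temporary allocations.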

I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:
subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]
I had to adjust your "3" to "2" since x would be included both ways.
Benchmarking (just for fun)
@Spacedman leads the way!
First, the required data and functions.
## Data
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)
Second, the benchmarking.
## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind,
lapply(test$dt,
function(j) full[full$dt > (j-i) &
full$dt < (j+i), ])),
mrdwab = {subs <- apply(test, 1,
function(x) c((x-(i-1)):(x+(i-1))))
full[which(full$dt %in% subs), ]},
spacedman = full[anyF(full$dt),],
original = {subs <- data.frame()
for (j in test$dt)
subs <- rbind(subs,
subset(full, full$dt > (j-i) &
full$dt < (j+i)))},
columns = c("test", "replications", "elapsed", "relative"),
order = "relative")
# test replications elapsed relative
# 3 spacedman 100 0.064 1.000000
# 2 mrdwab 100 0.105 1.640625
# 1 andrie 100 0.520 8.125000
# 4 original 100 1.080 16.875000

There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.
But here is a more R-ish way to do it, which might well be faster:
do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
PS: You can significantly simplify your reproducible example:
set.seed(1)
full <- data.frame(
dt = 1:200,
val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
xx <- do.call(rbind,
lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)
head(xx)
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874

One more way, using data.table:
{
temp <- data.table(x=unique(c(full$dt,(test$dt-i),(test$dt+i))),key="x")
temp[,index:=1:nrow(temp)]
startpoints <- temp[J(test$dt-i),index]$index
endpoints <- temp[J(test$dt+i),index]$index
allpoints <- as.vector(mapply(FUN=function(x,y) x:y,x=startpoints,y=endpoints))
setkey(x=temp,index)
ans <- temp[J(allpoints)]$x
}
Benchmarks:
number of rows in test: 9
number of rows in full: 10000
       test replications elapsed relative
1 spacedman          100   0.406    1.000
2       new          100   1.179    2.904
number of rows in full: 100000
       test replications elapsed relative
2       new          100   2.374    1.000
1 spacedman          100   3.753    1.581

Related

fast way in R to do two nested for loops [duplicate]

I need to take the difference between every pair of elements from two vectors.
If A<-c(1,2) and B<-c(3,4) then my result R should be c(3-1,3-2,4-1,4-2).
With this snippet
myfunction <- function(N)
{
A = runif(N)
B = runif(N)
R = c()
for(a in A){
for(b in B){
R=c(b-a,R)
}
}
R
}
print(system.time(result <- myfunction(300)))
I get this time
user system elapsed
14.27 0.01 14.39
Is there any faster way to do it?
The fastest base solution is the use of outer:
as.vector(outer(B,A,"-"))
map2_dbl seems faster than outer, but that's only because it is not calculating every combination of values in A and B; it pairs the elements position by position:
test elapsed relative
3 CP(A, B) 7.54 47.125 # using expand.grid
2 JL(A, B) 0.16 1.000 # using map2_dbl
1 JM(A, B) 3.13 19.563 # using outer
But:
> A <- 1:3
> B <- 3:1
> JL(A,B)
[1] -2 0 2
> JM(A,B)
[1] 2 1 0 1 0 -1 0 -1 -2
This is for two vectors of length 1000, and with 100 replications. I didn't include your own solution because that one is ridiculously slow, for two reasons:
First, for loops in R are quite a bit faster than in the old days, but still not as fast as functions that have their loops coded in C or equivalent. That's the case for the functions used in the tested code here.
Second, you "grow" your result object. On every pass through the loop, the vector R becomes one value longer, so the R interpreter has to find a new place in memory to store it. That's actually the biggest bottleneck in your code. Try to avoid that kind of construct at all costs, because it's one of the most important causes of terribly slow code.
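To illustrate the second point, here is a minimal sketch of the same double loop with the result vector allocated once up front. The function name diffs_prealloc is made up for this example. It fills the vector in the same order that as.vector(outer(B, A, "-")) produces, not in the reversed order of the original snippet:

```r
# Hypothetical rewrite of the double loop with a preallocated result
diffs_prealloc <- function(A, B) {
  R <- numeric(length(A) * length(B))  # allocate the full result once
  k <- 0L
  for (a in A) {
    for (b in B) {
      k <- k + 1L
      R[k] <- b - a                    # fill in place instead of c(b - a, R)
    }
  }
  R
}

A <- c(1, 2)
B <- c(3, 4)
diffs_prealloc(A, B)
```

This is still a loop in R code, so outer should remain faster; the sketch only removes the repeated reallocation.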
The benchmark code:
library(tidyverse)
JM <- function(A,B){
as.vector(outer(B,A,"-"))
}
JL <- function(A,B){
map2_dbl(.x = A,
.y = B,
.f = ~ c(.x - .y))
}
CP <- function(A,B){
as.data.frame(expand.grid(A,B)) %>%
mutate(Var3 = Var2-Var1)
}
library(rbenchmark)
A <- runif(1000)
B <- runif(1000)
benchmark(JM(A,B),
JL(A,B),
CP(A,B),
replications = 100,
columns = c("test","elapsed","relative"))
You can use expand.grid to vectorize the approach:
A <- runif(300)
B <- runif(300)
library(dplyr)
R <- as.data.frame(expand.grid(A,B)) %>%
mutate(Var3 = Var2-Var1)
The first 5 lines of output:
Var1 Var2 Var3
1 0.8516676 0.325261 -0.5264066246
2 0.2126453 0.325261 0.1126156694
3 0.5394620 0.325261 -0.2142010126
4 0.1364876 0.325261 0.1887734290
5 0.3248651 0.325261 0.0003958747
This took:
user system elapsed
0.02 0.00 0.02
Your function took:
user system elapsed
42.39 0.43 42.90
Using purrr::map2:
library(tidyverse)
N = 300
A = runif(N)
B = runif(N)
R = c()
print(
system.time(
result <- map(
.x = A,
.f = ~ c(.x - B)) %>% unlist
)
)
Time taken:
user system elapsed
0.02 0 0.02
If I got your attention now, check out this repo for a nice walk through of purrr.

optimizing code in R for vector comparisons in data.table

As part of my program in R, I have to compare a huge number of pairs of sentences with some functions (the one I'm showing here checks, for two sentences with the same number of words, whether exactly one word differs between them).
To make things faster, I have already converted all words into integers, so I am dealing with integer vectors and the example function is a very simple one:
is_sub_num <- function(a,b){sum(!(a==b))==1}
where a and b are integer vectors such as
a = c(1,2,3); b=c(1,4,3)
is_sub_num(a,b)
# [1] TRUE
my data will be stored in a data.table
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ ID: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2:List of 100
..$ : int 4 4 3 4
..$ : int 1 2 3 1
the length of each entry may be different (in the example below, the entries are all of size 4)
I have a table with candidate pair IDs to test the corresponding entries in DT with the function above as follow
is_pair_ok <- function(pair){
is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}
here is a simplification of what I'm trying to do:
library(data.table)
set.seed(234)
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")
print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))
This takes around 22 seconds on my laptop, although it's only 10,000 entries and the functions are very basic.
Do you have any advice on how to speed up the code?
I have delved further into this issue myself, and here is my answer.
I think it's an important one that everyone should know, so please vote for this post; it doesn't deserve its bad score!
The code for the answer is below. I have added some new parameters to make the problem a bit more general.
The key point is to use the unlist function.
Whenever we use apply on a list object, we get very, very bad performance in R.
It's a bit of a pain to explode objects and do manual indexing in a vector, but the speedup is phenomenal.
set.seed(234)
N=100
nobs=10000
z = lapply(1:N, function(x) sample(1:4,size=sample(3:5,1),replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
is_sub_num(DT[ID==pair[1],V2][[1]],DT[ID==pair[2],V2][[1]])}
is_pair_ok1 <- function(pair){
is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]]+length_table[pair[1]] -1) ],
zzz[pos_table[pair[2]]:(pos_table[pair[2]]+length_table[pair[2]] -1) ]) }
pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),sample(1:N,nobs,replace=TRUE)))
DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)
length_table <- sapply(z,length)
myfun <- function(i){sum(length_table[1:i])}
pos_table <- c(0,sapply(1:N,myfun))+1
zzz=unlist(z)
print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)
here is the output
user system elapsed
20.96 0.00 20.96
user system elapsed
0.70 0.00 0.71
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE
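The warnings in the output above presumably come from a == b being evaluated on vectors of different lengths, where R recycles the shorter one. A minimal sketch of a variant that checks the lengths first is shown below; the name is_sub_num_safe is made up for this example, and it can return a different answer than is_sub_num for pairs of unequal length:

```r
# Hypothetical variant of is_sub_num that returns FALSE for unequal lengths
is_sub_num_safe <- function(a, b) {
  length(a) == length(b) && sum(a != b) == 1
}

is_sub_num_safe(c(1, 2, 3), c(1, 4, 3))      # same length, exactly one difference
is_sub_num_safe(c(1, 2, 3), c(1, 2, 3, 4))   # different lengths, no recycling warning
```

The && short-circuits, so the element-wise comparison is never run on mismatched lengths.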
EDIT
It would be a bit too long to post here. I tried to draw conclusions from the above and modified the source code of my program, trying to speed it up with unlist and manual indexing.
The new implementation is actually slower, which came as a surprise to me, and I fail to understand why...
With the following, I save about 60% of the time:
library(data.table)
set.seed(234)
is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2 <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]])
DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
setnames(DT, c("ID","V2"))
setkey(DT, ID)
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))
Setting the key for DT and using fast indexing in is_pair_ok2() had the most effect.
A little bit more can be gained without the function is_sub_num():
is_pair_ok3 <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))
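When every entry has the same length, as in the simplified example from the question, the per-pair function calls can be avoided entirely. The following is a sketch under that assumption and is not taken from the answers above; the names M, pairs and ok are illustrative. It stacks the vectors into a matrix and compares whole rows at once:

```r
set.seed(234)
z <- lapply(1:100, function(x) sample(1:4, size = 4, replace = TRUE))
pairs <- cbind(sample(1:100, 10000, replace = TRUE),
               sample(1:100, 10000, replace = TRUE))

M <- do.call(rbind, z)   # 100 x 4 integer matrix, one row per ID

# row k compares entry pairs[k, 1] with entry pairs[k, 2]
ok <- rowSums(M[pairs[, 1], , drop = FALSE] != M[pairs[, 2], , drop = FALSE]) == 1
```

This does not carry over directly to entries of varying length, where the matrix cannot be built without padding.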

Benchmark analysis and display results of analysis AND benchmark?

Probably I just missed a parameter, but maybe someone can point me to it: how can I run an analysis in R, benchmark it, and still store the result somewhere? I know R functions can only return one single object, but I could either make use of a list here or paste the benchmark results and store the analysis in the function's return value.
But, is there any way to evaluate benchmark (or system.time) and analysis without running it twice like this?:
require(rbenchmark)
bmark <- function(x){
res <- list()
res[[1]] <- benchmark(x^6)
res[[2]] <- x^6
res
}
EDIT: I am sorry I caused some confusion about what I really want to do. Maybe the use case makes it clearer: I don't have a typical benchmark situation where I want to check whether my custom function is faster than some other function. It's rather that I run the same thing with different data on different machines. I don't need this in a test environment, but in production – I just want to let users of a script know how long it took. If that's an hour or more people can plan their lunch break :) .
Here's an example using two functions. The first one uses plyr and the second uses data.table.
# dummy data
require(plyr)
require(data.table)
set.seed(45)
x1 <- data.frame(x=rnorm(1e6), grp = sample(letters[1:26], 1e6, replace=T))
x1.dt <- data.table(x1, key="grp")
# function that uses plyr
DF.FUN <- function(x) {
ddply(x1, .(grp), summarise, m.x = mean(x))
}
# function that uses data.table
DT.FUN <- function(x) {
x1.dt[, list(m.x=mean(x)),by=grp]
}
require(rbenchmark)
> benchmark( s1 <- DF.FUN(), s2 <- DT.FUN(), order="elapsed", replications=2)
# test replications elapsed relative user.self sys.self user.child sys.child
# 2 s2 <- DT.FUN() 2 0.036 1.000 0.031 0.006 0 0
# 1 s1 <- DF.FUN() 2 0.527 14.639 0.363 0.163 0 0
Now, s1 and s2 contain the results from each function, and the benchmarked results will be displayed on screen.
# > head(s1)
# grp m.x
# 1 a 0.0069312201
# 2 b -0.0002422315
# 3 c -0.0129449586
# 4 d -0.0036275338
# 5 e 0.0013438022
# 6 f -0.0015428427
# > head(s2)
# grp m.x
# 1: a 0.0069312201
# 2: b -0.0002422315
# 3: c -0.0129449586
# 4: d -0.0036275338
# 5: e 0.0013438022
# 6: f -0.0015428427
Is this what you were after?
I read the question a bit differently than Arun. This would be the answer to what I thought was being asked:
> bres <- bmark(2)
> bres
[[1]]
test replications elapsed relative user.self sys.self user.child sys.child
1 x^6 100 0.001 1 0.001 0.001 0 0
[[2]]
[1] 64
The bmark function is returning a result with the default 100 replications. If you wanted to annotate the results you could use paste(), and if you wanted to add a parameter for the number of reps:
bmark2 <- function(x, reps=100){
res <- list()
res[[1]] <- benchmark(x^6, replications=reps)
res[[2]] <- paste(reps, " replications of ", x, "to the 6th in", res[[1]]$elapsed)
res
}
I am unsure of what StackOverflow thinks about answering old questions, but it seems like nobody actually answered after your edit. So here goes:
To time a process in R you can use two methods.
The first one uses system.time(expression) and gives you how much time it took to evaluate the expression within the brackets.
If this is not practical in your case you can get system time with Sys.time() before the operation and after the operation and subtract the two.
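Both methods can be sketched so that the computation runs only once and its result is kept. The wrapper name timed is made up for this example, and x^6 stands in for the real analysis:

```r
x <- 2

# Method 1: assign inside system.time(); result ends up in the calling environment
timing <- system.time(result <- x^6)
timing[["elapsed"]]

# Method 2: hypothetical wrapper returning both the value and the elapsed time
timed <- function(expr) {
  start <- Sys.time()
  value <- expr                 # the lazily passed expression is evaluated here, once
  list(value = value, elapsed = Sys.time() - start)
}

res <- timed(x^6)
res$value
res$elapsed
```

Unlike benchmark(), neither variant repeats the expression, so the timing is a single measurement and not an average.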
If this finally answers your question please accept the solution :)

does the by( ) function make growing list

Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
col1 col2 f
1 1 10000001 0
2 2 10000002 0
3 3 10000003 0
4 4 10000004 0
5 5 10000005 5
6 6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is the better way to do this?
by is slower than tapply because it is a wrapper around tapply.
Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by
UPDATED to include @Roland's great recommendation:
library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")
using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
#------------------------#
# RESULTS #
#------------------------#
# COMPARING tapply VS by #
#-----------------------------------
# test elapsed relative
# 1 using.tapply 2.453 1.000
# 2 using.by 8.889 3.624
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988
If x$f is a factor, the loss in efficiency between tapply and by is even greater!
Although, notice that they both improve relative to non-factor inputs, while data.table remains approx the same or worse
x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
# test elapsed relative
# 2 using.dtable 0.175 1.000
# 1 using.tapply 1.803 10.303
# 3 using.by 7.854 44.880
As for the why, the short answer is in the documentation itself.
?by :
Description
Function by is an object-oriented wrapper for tapply applied to data frames.
Let's take a look at the source for by (or more specifically, by.data.frame):
by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
if (!is.list(INDICES)) {
IND <- vector("list", 1L)
IND[[1L]] <- INDICES
names(IND) <- deparse(substitute(INDICES))[1L]
}
else IND <- INDICES
FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
nd <- nrow(data)
ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
data)
attr(ans, "call") <- match.call()
class(ans) <- "by"
ans
}
We see immediately that there is still a call to tapply plus a lot of extras (including calls to deparse(substitute(.)) and an eval(substitute(.)) both of which are relatively slow). Therefore it makes sense that your tapply will be relatively faster than a similar call to by.
Regarding a better way to do this: With 4M rows you should use data.table.
library(data.table)
dt <- data.table(x,key="f")
dt[,mean(col1),by=key(dt)]
dt[,list(mean1=mean(col1),mean2=mean(col2)),by=key(dt)]
dt[,lapply(.SD,mean),by=key(dt)]
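If data.table is not an option, group means over several columns can also be sketched in base R with rowsum. This is an alternative not mentioned in the answer, shown here on a small made-up data frame with the same column names:

```r
x <- data.frame(col1 = 1:20, col2 = 101:120)
x$f <- x$col1 - x$col1 %% 5

# per-group column sums; rows are ordered by sorted group value
sums   <- rowsum(as.matrix(x[, c("col1", "col2")]), group = x$f)
# per-group row counts, in the same sorted order
counts <- as.vector(table(x$f))

means <- sums / counts   # each column is divided element-wise by the counts
means
```

This only covers functions that can be built from sums and counts, such as the mean; arbitrary per-group functions still need tapply, by, or data.table.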

Count the number of valid observations (no NA) pairwise in a data frame

Say I have a data frame like this:
Df <- data.frame(
V1 = c(1,2,3,NA,5),
V2 = c(1,2,NA,4,5),
V3 = c(NA,2,NA,4,NA)
)
Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:
sharedcount <- function(x,...){
nx <- names(x)
alln <- combn(nx,2)
out <- apply(alln,2,
function(y)sum(complete.cases(x[y]))
)
data.frame(t(alln),out)
}
This gives the output:
> sharedcount(Df)
X1 X2 out
1 V1 V2 3
2 V1 V3 1
3 V2 V3 2
All fine, but the function itself takes pretty long on big data frames (600 variables and about 10000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(...,use='pairwise') still runs a whole lot faster while it has to do something similar:
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+ columns=c('test','elapsed','relative'),
+ replications=1
+ )
test elapsed relative
2 cor(TestDf, use = "pairwise") 0.25 1.0
1 sharedcount(TestDf) 1.90 7.6
Any tips are appreciated.
Note : Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
The following is slightly faster:
x <- !is.na(Df)
t(x) %*% x
# test elapsed relative
# cor(Df) 12.345 1.000000
# t(x) %*% x 20.736 1.679708
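To make clear why this works, here is the trick applied to the small Df from the question. Each entry of the product sums, over all rows, the product of two TRUE/FALSE indicators, which is 1 only when both variables are non-missing in that row:

```r
Df <- data.frame(
  V1 = c(1, 2, 3, NA, 5),
  V2 = c(1, 2, NA, 4, 5),
  V3 = c(NA, 2, NA, 4, NA)
)

x <- !is.na(Df)        # logical matrix: TRUE where a value is present
counts <- t(x) %*% x   # counts[a, b] = number of rows where both a and b are present
counts
```

The off-diagonal entries reproduce the counts from sharedcount(Df) (3, 1 and 2), and the diagonal holds the number of valid observations per variable.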
I thought Vincent's looked really elegant, not to mention being faster than my sophomoric for-loop, except it seems to be needing an extraction step which I added below. This is just an example of the heavy overhead in the apply method when used with dataframes.
shrcnt <- function(Df) {
  Comb <- t(combn(1:ncol(Df), 2))
  shrd <- 1:nrow(Comb)
  # seq_along() visits every pair; seq_len(shrd) would only use the first element
  for (i in seq_along(shrd)) {
    shrd[i] <- sum(complete.cases(Df[, Comb[i, 1]], Df[, Comb[i, 2]]))
  }
  return(shrd)
}
benchmark(
shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
cor(Df,use='pairwise'),
columns=c('test','elapsed','relative'),
replications=100
)
#--------------
test elapsed relative
3 { 0.008 1.0
4 cor(Df, use = "pairwise") 0.020 2.5
2 sharedcount(Df) 0.092 11.5
1 shrcnt(Df) 0.036 4.5
Based on the lovely trick of Vincent and the additional lower.tri() suggestion of DWin, I came up with following function that gives me the same output (i.e. a data frame) as my original one, and runs a whole lot faster :
sharedcount2 <- function(x,stringsAsFactors=FALSE,...){
counts <- crossprod(!is.na(x))
id <- lower.tri(counts)
count <- counts[id]
X1 <- colnames(counts)[col(counts)[id]]
X2 <- rownames(counts)[row(counts)[id]]
data.frame(X1,X2,count)
}
Note the use of crossprod(), as that one gives a small improvement compared to %*%, but it does exactly the same.
The timings :
> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+ replications=5,
+ columns=c('test','replications','elapsed','relative'))
test replications elapsed relative
1 sharedcount(TestDf) 5 10.00 90.90909
2 sharedcount2(TestDf) 5 0.11 1.00000
Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frames. As shown here, the time increase is a lot more dramatic than when compared using a small data frame.
