R: scan vectors once instead of 4 times?

Suppose I have two equal-length logical vectors, actual and predicted.
Computing the confusion matrix the easy way:
c(sum(actual == 1 & predicted == 1),
sum(actual == 0 & predicted == 1),
sum(actual == 1 & predicted == 0),
sum(actual == 0 & predicted == 0))
requires scanning the vectors 4 times.
Is it possible to do that in a single pass?
PS. I tried table(2*actual+predicted) and table(actual,predicted) but both are obviously much slower.
PPS. Speed is not my main consideration here, I am more interested in understanding the language.

You could try using data.table
library(data.table)
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N
data
set.seed(24)
actual <- sample(0:1, 10 , replace=TRUE)
predicted <- sample(0:1, 10, replace=TRUE)
Benchmarks
Using data.table_1.9.5 and dplyr_0.4.0
library(microbenchmark)
library(dplyr)
set.seed(245)
actual <- sample(0:1, 1e6 , replace=TRUE)
predicted <- sample(0:1, 1e6, replace=TRUE)
f1 <- function(){
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N}
f2 <- function(){table(actual, predicted)}
f3 <- function() {data_frame(actual, predicted) %>%
group_by(actual, predicted) %>%
summarise(n())}
microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
#f1() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
#f2() 20.818410 22.378995 22.321816 22.56931 22.140855 22.984667 20 b
#f3() 1.262047 1.248396 1.436559 1.21237 1.220109 2.504662 20 a
Including dplyr's count() and tabulate() in the benchmarks as well, on a slightly bigger dataset:
set.seed(498)
actual <- sample(0:1, 1e7 , replace=TRUE)
predicted <- sample(0:1, 1e7, replace=TRUE)
f4 <- function() {data_frame(actual, predicted) %>%
count(actual, predicted)}
f5 <- function(){tabulate(4-actual-2*predicted, 4)}
Update
Including another data.table solution (provided by @Arun) in the benchmarks as well:
f6 <- function() {setDT(list(actual, predicted))[,.N, keyby=.(V1,V2)]$N}
microbenchmark(f1(), f3(), f4(), f5(), f6(), unit='relative', times=20L)
#Unit: relative
#expr min lq mean median uq max neval cld
#f1() 2.003088 1.974501 2.020091 2.015193 2.080961 1.924808 20 c
#f3() 2.488526 2.486019 2.450749 2.464082 2.481432 2.141309 20 d
#f4() 2.388386 2.423604 2.430581 2.459973 2.531792 2.191576 20 d
#f5() 1.034442 1.125585 1.192534 1.217337 1.239453 1.294920 20 b
#f6() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a

Like this:
tabulate(4 - actual - 2*predicted, 4)
(tabulate here is much faster than table because it takes integer bin numbers directly and is told, via the second argument, that the output will be a vector of length 4, so it skips the conversion work that table does).
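To see why that expression works, here is a minimal sketch (the combos object is illustrative, not from the original answer) showing how 4 - actual - 2*predicted maps the four (actual, predicted) combinations onto bins 1 to 4, in the same order as the c(sum(...), ...) version in the question:
combos <- expand.grid(actual = c(1, 0), predicted = c(1, 0))
combos$bin <- 4 - combos$actual - 2 * combos$predicted
combos
#  actual predicted bin
#1      1         1   1   <- true positives
#2      0         1   2   <- false positives
#3      1         0   3   <- false negatives
#4      0         0   4   <- true negatives
tabulate() then simply counts how many observations fall into each bin, in a single pass over the data.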

There is table, which computes a cross tabulation and gives the same counts if actual and predicted contain only zeros and ones:
table(actual, predicted)
Internally, this works by paste()ing the vectors together, which is horribly inefficient. The coercion to character seems to happen even when tabulating a single vector, and that is likely also the reason for the bad performance of table(actual*2 + predicted).

Related

(Speed Challenge) Any faster way to compute distance matrix in terms of generic Hamming distance?

I am looking for a more efficient way to get the distance matrix in terms of Hamming distance.
Background
I know there is a function hamming.distance() from package e1071 to compute the distance matrix, but I suspect it might be very slow for a large matrix with many rows, since it uses nested for loops for the computation.
So far I have a faster way (see methodB in the code below). However, it is only suitable for the binary domain {0,1}^n and does not work for domains with more than 2 elements, i.e., {0,1,2,...,K-1}^n. In that sense, methodB is not a generic Hamming distance.
Objective
My objective is to find an approach with the following features:
uses only functions from base R (no Rcpp rewrite for speed)
is faster than my approach methodB() for the special case k=2
can be generalized to any positive integer k
outperforms hamming.distance() from package e1071
My code
library(e1071)
# vector length, i.e., number of columns of the matrix
n <- 7
# number of elements in the domain {0,1,...,k-1}
k <- 2
# matrix for computing hamming distances by rows
m <- as.matrix(do.call(expand.grid, replicate(n, list(0:(k-1)))))
# applying `hamming.distance()` from package "e1071", which is generic so it is available for any positive integer `k`
methodA <- function(M) hamming.distance(M)
# my customized method based on base R's `dist()`; squared Euclidean distance equals Hamming distance only for 0/1 data, so this is not valid for `k > 2`
methodB <- function(M) as.matrix(round(dist(M,upper = T,diag = T)**2))
and the benchmark gives
microbenchmark::microbenchmark(
methodA(m),
methodB(m),
unit = "relative",
check = "equivalent",
times = 50
)
Unit: relative
expr min lq mean median uq max neval
methodA(m) 33.45844 33.81716 33.963 34.30313 34.92493 14.92111 50
methodB(m) 1.00000 1.00000 1.000 1.00000 1.00000 1.00000 50
Appreciated in advance!
I found this blog, which has four posts on calculating Hamming distance matrices in R. I don't want to claim any credit for it, but maybe have a look at it.
https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
hamming <- function(X) {
  # for a binary matrix X, (1 - X) %*% t(X) counts the positions where row i is 0
  # and row j is 1; adding the transpose counts the mismatches in both directions
  D <- (1 - X) %*% t(X)
  D + t(D)
}
> microbenchmark::microbenchmark(
+ methodB(m),
+ hamming(m),
+ unit = "relative",
+ times = 50
+ )
Unit: relative
expr min lq mean median uq max neval
methodB(m) 1.0000 1.000000 1.000000 1.000000 1.000000 1.000000 50
hamming(m) 1.2502 1.299844 1.436486 1.301461 1.302033 4.607748 50
PS: I don't have enough reputation to just leave this as a comment.
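The same matrix-multiplication idea can be extended to a general alphabet {0,...,k-1}. A minimal sketch (hamming_generic is a hypothetical name, not from the post; it assumes an integer-coded matrix like m above): count, level by level, the positions where two rows agree using indicator matrices, then subtract from the number of columns:
hamming_generic <- function(M) {
  n <- ncol(M)
  matches <- matrix(0, nrow(M), nrow(M))
  for (lev in unique(as.vector(M))) {
    Ind <- (M == lev) * 1                 # indicator matrix for this level
    matches <- matches + Ind %*% t(Ind)   # positions where both rows equal lev
  }
  n - matches                             # Hamming distance = number of non-matching positions
}
# sanity check against methodB in the binary case (ignoring dimnames)
all.equal(hamming_generic(m), methodB(m), check.attributes = FALSE)
# [1] TRUE
This uses only base R, at the cost of one matrix product per level of the alphabet.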
methodM <- function(x) {
  # compare each row against all others: xt != xt[, y] flags mismatching positions
  # and colSums() counts them, so this works for any alphabet size k
  xt <- t(x)
  sapply(1:nrow(x), function(y) colSums(xt != xt[, y]))
}
microbenchmark::microbenchmark(
methodB(m), methodM(m),
unit = "relative", check = "equivalent", times = 50
)
# Unit: relative
# expr min lq mean median uq max neval cld
# methodB(m) 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 50 a
# methodM(m) 1.25 1.224827 1.359573 1.219507 1.292463 4.550159 50 b
Did you try using Rcpp? I had a very similar problem! Please see the answer here: https://stackoverflow.com/a/60067825/3237589

How can I count the number of variables in an R quosure?

Let's say I have a function that takes in a data frame and a varying number of variables from that data frame using non-standard evaluation (NSE). Is there a faster/more straightforward way to count the number of provided variables than select()ing these variables and counting the columns?
# Works but seems non-ideal
nvar <- function(df, vars) {
vars_en <- rlang::enquo(vars)
df_sub <- dplyr::select(df, !!vars_en)
ncol(df_sub)
}
nvar(mtcars, mpg:hp)
#> 4
Highly doubtful (I realize this may receive downvotes). I think the most sensible alternative is to simply select from the column names of the data.frame, using tidyselect::vars_select:
nvar1 <- function(df, vars) {
vars_en <- rlang::enquo(vars)
ans <- vars_select(names(df), !! vars_en)
length(ans)
}
But even this is slower than the select() %>% ncol() approach:
library(microbenchmark)
library(nycflights13)
library(tidyselect)
nvar <- function(df, vars) {
vars_en <- rlang::enquo(vars)
df_sub <- dplyr::select(df, !!vars_en)
ncol(df_sub)
}
identical(nvar(nycflights13::flights, day:sched_arr_time), nvar1(nycflights13::flights, day:sched_arr_time))
# TRUE
microbenchmark(nvar(nycflights13::flights, day:sched_arr_time), nvar1(nycflights13::flights, day:sched_arr_time), unit='relative', times=100L)
# Unit: relative
# expr min lq mean median uq max neval
# nvar(nycflights13::flights, day:sched_arr_time) 1.000000 1.000000 1.00000 1.000000 1.000000 1.0000000 100
# nvar1(nycflights13::flights, day:sched_arr_time) 1.685793 1.680676 1.60114 1.688626 1.660196 0.9878235 100
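Another option (a sketch, not from the original answers; it assumes a tidyselect version that exports eval_select(), i.e. tidyselect >= 1.0.0) is to resolve the selection to column positions and count them, without materializing the subset:
nvar2 <- function(df, vars) {
  # eval_select() returns a named integer vector of matched column positions
  length(tidyselect::eval_select(rlang::enquo(vars), df))
}
nvar2(mtcars, mpg:hp)
#> 4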

Fast way to measure how many nodes are neighbors in a graph matrix

I have a symmetric matrix of distances between nodes:
set.seed(1)
dist.mat <- matrix(runif(10*10,0,1),10,10)
dist.mat[lower.tri(dist.mat)] <- t(dist.mat)[lower.tri(dist.mat)]
In reality this matrix is 40,000 by 40,000
For a given range of radii:
radii <- seq(0,1,0.01)
for each node I'd like to compute what fraction of the total number of nodes are located within that radius from it, and then average that over all nodes.
This is what I'm currently using but I'm looking for something faster.
sapply(radii,function(r)
mean(apply(dist.mat,1,function(x) length(which(x <= r))/ncol(dist.mat)))
)
And here's its performance:
microbenchmark::microbenchmark(sapply(radii,function(r) mean(apply(dist.mat,1,function(x) length(which(x <= r))/ncol(dist.mat)))))
Unit: milliseconds
expr: sapply(radii, function(r) mean(apply(dist.mat, 1, function(x) length(which(x <= r))/ncol(dist.mat))))
     min       lq     mean   median       uq      max neval
 2.24521 2.548021 2.938049 2.748385 3.140852 7.233612   100
Here is a solution without using any *apply.
N <- 10
c(0, cumsum( table(cut(dist.mat, radii)) / (N*N) ))
cut() bins the distances into the required intervals, and table() tabulates the frequencies per interval.
cumsum() then accumulates the counts, since anything within a smaller radius is also within the next larger radius; dividing by N*N averages over all nodes.
The leading 0 corresponds to radius 0: with these left-open intervals nothing is counted as <= 0, and the zero diagonal also falls outside the first bin, so this part might need to be improved on.
There is probably an even better solution using just the lower triangular matrix. Maybe someone will come along and provide an even faster solution.
EDIT: update with timings
library(microbenchmark)
set.seed(1L)
N <- 10e2
dist.mat <- matrix(runif(N*N,0,1),N,N)
dist.mat[lower.tri(dist.mat)] <- t(dist.mat)[lower.tri(dist.mat)]
radii <- seq(0,1,0.01)
f1 <- function() {
sapply(radii,function(r)
mean(apply(dist.mat,1,function(x) length(which(x <= r))/ncol(dist.mat)))
)
}
f2 <- function() {
c(0, cumsum( table(cut(dist.mat, radii)) / (N*N) ))
}
microbenchmark(f1(),
f2(),
times=3L,
unit="relative")
#Unit: relative
# expr min lq mean median uq max neval
# f1() 8.580099 8.502072 8.501601 8.427282 8.464298 8.500692 3
# f2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
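A further simplification (a sketch, not among the original answers): the mean over rows of the per-row fractions is just the overall fraction of matrix entries within the radius, so the inner apply() can be dropped while keeping exactly the semantics of f1 (diagonal included):
f1b <- function() {
  # for each radius, the fraction of all N*N entries that lie within it
  vapply(radii, function(r) sum(dist.mat <= r) / (N * N), numeric(1))
}
all.equal(f1(), f1b())
# [1] TRUE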

aggregate a matrix (or data.frame) by column name groups in R

I have a large matrix with about 3000 columns x 3000 rows. I'd like to aggregate it (calculate the mean) by column name, for every row. The columns are named like this (and appear in random order):
Tree Tree House House Tree Car Car House
I need the result (the mean of every row, aggregated per column name) to have the following columns:
Tree House Car
The tricky part (at least for me) is that I do not know the column names in advance and they are in random order!
You could try
res1 <- vapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
numeric(nrow(m1)) )
Or
res2 <- sapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )
identical(res1,res2)
#[1] TRUE
Another option might be to reshape into long form and then do the aggregation
library(data.table)
res3 <- dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[, Var1 := NULL]
identical(res1, as.matrix(res3))
[1] TRUE
Benchmarks
It seems like the first two methods are slightly faster for a 3000*3000 matrix
set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE),
ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
3000,replace=TRUE)))
library(microbenchmark)
f1 <-function() {vapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}
f3 <- function() {dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,
Var1:= NULL]}
microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
# Unit: relative
# expr min lq mean median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004 10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531 10
data
set.seed(24)
m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10,
dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))
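A base R alternative worth mentioning (a sketch, not part of the original benchmarks): rowsum() on the transposed matrix sums the columns by name in one grouped pass, and dividing by the per-name counts gives the means. Note that the result columns come out in sorted name order and that NA handling differs from rowMeans(..., na.rm=TRUE):
grp_sums   <- rowsum(t(m1), group = colnames(m1))   # one row per unique column name
grp_counts <- as.vector(table(colnames(m1)))        # group sizes, in the same sorted order
res4 <- t(grp_sums / grp_counts)                    # divide each group row by its size, transpose back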
I came up with my own solution. I first just transpose the matrix (called test_mean) so the columns become rows, then:
# removing numbers from rownames
rownames(test_mean)<-gsub("[0-9.]","",rownames(test_mean))
#aggregate by rownames
test_mean<-aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
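The result of aggregate() is a data.frame with the group labels in a Group.1 column, and it is still transposed; a small sketch (assuming you want the original orientation back, with the group names as column names):
rownames(test_mean) <- test_mean$Group.1   # move the group labels into the row names
test_mean <- t(test_mean[, -1])            # drop Group.1 and transpose back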
matrixStats::rowMeans2, with some coercive help from data.table, for the win!
Adding it to the benchmarking from @akrun we get:
library(matrixStats)
f4 <- function() {
  ucn <- unique(colnames(m1))
  as.matrix(setnames(setDF(lapply(ucn, function(n) rowMeans2(m1, cols = colnames(m1) == n))),
                     ucn))
}
> all.equal(f4(),f1())
[1] TRUE
> microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10L)
Unit: relative
expr min lq mean median uq max neval cld
f1() 1.837496 1.841282 1.823375 1.834471 1.818822 1.749826 10 b
f2() 1.760133 1.825352 1.817355 1.826257 1.838439 1.793824 10 b
f3() 15.451106 15.606912 15.847117 15.586192 16.626629 16.104648 10 c
f4() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a

identify and remove single valued columns from table in R

I have a reasonably large dataset (~250k rows and 400 cols, ~0.5 GB) where a number of columns are single-valued (i.e. they only contain one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this. On my PC this takes:
> system.time(apply(data, 2, function(x) length(unique(x))))
# user system elapsed
# 34.37 0.71 35.15
That isn't so bad for one dataset, but I'd like to repeat it multiple times on different datasets.
You can use lapply instead:
data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]
Note that I added unlist to convert the resulting list to a vector of TRUE / FALSE values which will be used for the subsetting.
Edit: here's a little benchmark:
library(microbenchmark)
a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)
microbenchmark(
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
#Unit: relative
# expr min lq median uq max neval
#apply 41.29383 40.06719 39.72256 39.16569 28.54078 100
#lapply 1.00000 1.00000 1.00000 1.00000 1.00000 100
Note that apply will first convert the data.frame to a matrix and then perform the operation, which is less efficient. So in most cases where you're working with data.frames you can (and should) avoid apply and use e.g. lapply instead.
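A tiny illustration of that coercion (df_mix is made up for the example): with a mixed-type data.frame, apply() only ever sees the character matrix produced by as.matrix(), while lapply()/vapply() see the original columns:
df_mix <- data.frame(num = 1:3, chr = c("a", "b", "c"), stringsAsFactors = FALSE)
apply(df_mix, 2, class)               # everything has been coerced to character
#        num         chr
#"character" "character"
vapply(df_mix, class, character(1))   # columns keep their original types
#        num         chr
#  "integer" "character"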
You may also try:
set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3*10,replace=TRUE), ncol=10))
Filter(function(x) (length(unique(x))>1), df)
Or, comparing each row with the next one (a column has a single value exactly when all consecutive pairs of values in it are equal):
df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)] # still better than `apply`
Including these also in the speed comparison (@beginneR's sample data):
microbenchmark(
new ={Filter(function(x) (length(unique(x))>1), df)},
new1={df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)]},
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
# Unit: relative
# expr min lq median uq max neval
# new 1.0000000 1.0000000 1.000000 1.0000000 1.000000 100
# new1 4.3741503 4.5144133 4.063634 3.9591345 1.713178 100
# apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233 100
# lapply 0.9991514 0.9979483 1.002005 0.9958308 1.002603 100
