I'd like to compute the variance for each row in a matrix. For the following matrix A
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 5 6 10
[3,] 50 7 11
[4,] 4 8 12
I would like to get
[1] 16.0000 7.0000 564.3333 16.0000
I know I can achieve this with apply(A, 1, var), but is there a faster or better way? From Octave, I can do this with var(A, 0, 2), but I don't see how the y argument of R's var() function is meant to be used.
Edit: The actual dataset of a typical chunk has around 100 rows and 500 columns. The total amount of data is around 50GB though.
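For reference, the example above can be reproduced with the baseline apply() call like this (the matrix values are copied from the table; nothing new here):
A <- matrix(c(1, 5, 50, 4,    # column 1
              5, 6, 7, 8,     # column 2
              9, 10, 11, 12), # column 3
            nrow = 4)
apply(A, 1, var)
# [1]  16.0000   7.0000 564.3333  16.0000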
You could potentially vectorize var over rows (or columns) using rowSums and rowMeans
RowVar <- function(x, ...) {
rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
}
RowVar(A)
#[1] 16.0000 7.0000 564.3333 16.0000
Using @Richard's data yields:
microbenchmark(apply(m, 1, var), RowVar(m))
## Unit: milliseconds
## expr min lq median uq max neval
## apply(m, 1, var) 343.369091 400.924652 424.991017 478.097573 746.483601 100
## RowVar(m) 1.766668 1.916543 2.010471 2.412872 4.834471 100
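If an extra dependency is acceptable, the matrixStats package provides a compiled rowVars() that should give the same numbers; a minimal sketch, assuming matrixStats is installed:
# Sketch: compiled row variances via matrixStats (not part of the original answer)
library(matrixStats)
rowVars(A)
# [1]  16.0000   7.0000 564.3333  16.0000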
You can also create a more general function that takes a syntax similar to apply() but remains vectorized (the column-wise variance will be slower, as the matrix needs to be transposed first):
MatVar <- function(x, dim = 1, ...) {
if(dim == 1){
rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
} else if (dim == 2) {
rowSums((t(x) - colMeans(x, ...))^2, ...)/(dim(x)[1] - 1)
} else stop("Please enter valid dimension")
}
MatVar(A, 1)
## [1] 16.0000 7.0000 564.3333 16.0000
MatVar(A, 2)
##         V1       V2       V3
## 547.333333 1.666667 1.666667
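One detail worth noting about MatVar(): although its argument is named dim, the call dim(x) inside the body still reaches base::dim(), because R skips non-function bindings when a name is used in call position. A small illustration of that lookup rule:
dim <- 1              # a non-function binding named "dim"
dim(matrix(1:6, 2))   # still calls base::dim()
# [1] 2 3
rm(dim)               # clean up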
This is one of the main reasons why apply() is useful. It is meant to operate on the margins of an array or matrix.
set.seed(100)
m <- matrix(sample(1e5L), 1e4L)
library(microbenchmark)
microbenchmark(apply(m, 1, var))
# Unit: milliseconds
# expr min lq median uq max neval
# apply(m, 1, var) 270.3746 283.9009 292.2933 298.1297 343.9531 100
Is 300 milliseconds too long to make 10,000 calculations?
I have a list of vectors and I'm trying to select (for example) the 2nd and 4th element in each vector. I can do this using lapply:
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
[[1]]
[1] 2 4
[[2]]
[1] 9 7
[[3]]
[1] 2 4
[[4]]
[1] 9 7
[[5]]
[1] 2 4
But is there a way to do this in a vectorized way -- avoiding one of the apply functions? My problem is that my actual list_of_vec is fairly long, so lapply takes a while.
Solutions:
Option 1 @Athe's clever solution using do.call:
do.call(rbind, list_of_vec)[ ,c(2,4)]
Option 2 Using lapply more efficiently:
lapply(list_of_vec, `[`, c(2, 4))
Option 3 A vectorized solution (note it assumes all the vectors have the same length; a more general sketch follows Option 4):
starts <- c(0, cumsum(lengths(list_of_vec)[-1]))
matrix(unlist(list_of_vec)[c(starts + 2, starts + 4)], ncol = 2)
Option 4 the lapply solution you wanted to improve:
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
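As flagged above, Option 3, as written, relies on every vector having the same length (the offsets only line up because each element holds 10 values). A slightly more general offset computation, offered as a sketch rather than as part of the original answers, drops the last length instead of the first:
# Sketch: offsets that also work for vectors of unequal length
starts <- cumsum(c(0, head(lengths(list_of_vec), -1)))
flat <- unlist(list_of_vec)
matrix(flat[c(starts + 2, starts + 4)], ncol = 2)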
Data:
And a few datasets I will test them on:
# The original data
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
# A long list with short elements
list_of_vec2 <- rep(list_of_vec, 1e5)
# A long list with long elements
list_of_vec3 <- lapply(list_of_vec, rep, 1e3)
list_of_vec3 <- rep(list_of_vec3, 1e4)
Benchmarking:
Original list:
Unit: microseconds
expr min lq mean median uq max neval cld
o1 2.276 2.8450 3.00417 2.845 3.129 10.809 100 a
o2 2.845 3.1300 3.59018 3.414 3.414 23.325 100 a
o3 3.698 4.1250 4.60558 4.267 4.552 20.480 100 a
o4 5.689 5.9735 17.52222 5.974 6.258 1144.606 100 a
Longer list, short elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 146.30778 146.88037 155.04077 149.89164 159.52194 184.92028 10 b
o2 185.40526 187.85717 192.83834 188.42749 190.32103 213.79226 10 c
o3 26.55091 27.27596 28.46781 27.48915 28.84041 32.19998 10 a
o4 407.66430 411.58054 426.87020 415.82161 437.19193 473.64265 10 d
Longer list, long elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 4855.59146 4978.31167 5012.0429 5025.97619 5072.9350 5095.7566 10 c
o2 17.88133 18.60524 103.2154 21.28613 195.0087 311.4122 10 a
o3 855.63128 872.15011 953.8423 892.96193 1069.7526 1106.1980 10 b
o4 37.92927 38.87704 135.6707 124.05127 214.6217 276.5814 10 a
Summary:
Looks like the vectorized solution wins out if the list is long and the elements are short, but lapply is the clear winner for a long list with longer elements. Some of the options output a list, others a matrix. So keep in mind what you want your output to be. Good luck!!!
If your list is composed of vectors of the same length, you could first transform it into a matrix and then get the columns you want.
matrix_of_vec <- do.call(rbind,list_of_vec)
matrix_of_vec[ ,c(2,4)]
Otherwise I'm afraid you'll have to stick to the apply family. Surprisingly, the most efficient way to do it is to use the parallel package and compute in parallel.
corenum <- parallel::detectCores() - 1
cl <- parallel::makeCluster(corenum)
parallel::clusterExport(cl, "list_of_vec")
parallel::parSapply(cl, list_of_vec, '[', c(2, 4))
parallel::stopCluster(cl)  # shut the workers down when done
In this piece of code '[' is the name of the subsetting function and c(2,4) the argument you pass to it.
sort has the argument index.return which is by default FALSE. If you set it to TRUE you get the ordering index... basically the same as when you use order.
My question
Are there cases where it makes sense to use sort with index.return = TRUE instead of order?
order simply gives the indices, whereas sort also gives the values (and, with index.return = TRUE, a list):
x <- runif(10, 0, 100)
order(x)
# [1] 2 7 1 9 6 5 8 10 4 3
sort(x, index.return=T)
# $`x`
# [1] 0.08140348 0.18272011 0.23575252 0.51493537 0.64281259 0.92121388 0.93759670 0.96221375 0.97646916 0.97863369
#
# $ix
# [1] 2 7 1 9 6 5 8 10 4 3
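As a quick sanity check (a small sketch, not from the original answer), the ix component returned by sort() is the same permutation that order() produces, and indexing with it recovers the sorted values:
s <- sort(x, index.return = TRUE)
identical(s$ix, order(x))    # expected TRUE: same permutation
identical(s$x, x[order(x)])  # expected TRUE: same sorted values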
It seems that order is a little faster with longer vectors:
x <- runif(10000000, 0, 100)
microbenchmark::microbenchmark(
sort = {sort(x, index.return=T)},
order = {x[order(x)]},
times = 100
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# sort 63.48221 67.79530 78.33724 70.74215 74.10109 173.1129 100
# order 56.46055 57.18649 60.88239 58.29462 62.13086 155.5481 100
So you should probably pick sort with index.return = TRUE only if you need a list object to be returned. I can't find a case where sort is clearly better than order.
My suggestions are based on RLave's answer.
You could use the method argument, sort(x, method = "quick", index.return = TRUE), which might be a little faster than the default. Also, if you want an alternative that is faster still for large vectors, you can use this function:
sort_order <- function(x){
indices <- order(x) # you could also specify a method here, but the default is fine
list("x"=x[indices],"ix"=indices)
}
Here are some benchmarks.
microbenchmark::microbenchmark(
sort = s <- sort(x, index.return = TRUE),
"quick sort" = sq <- sort(x, method = "quick", index.return = TRUE),
"order sort" = so <- sort_order(x),
times = 10
)
Unit: seconds
expr min lq mean median uq max neval
sort 1.493714 1.662791 1.737854 1.708502 1.887993 1.960912 10
quick sort 1.366938 1.374874 1.451778 1.444342 1.480122 1.668693 10
order sort 1.181974 1.344398 1.359209 1.369108 1.424569 1.461862 10
all.equal(so,sq)
[1] TRUE
all.equal(s,so)
[1] TRUE
I was wondering, suppose you want to transform some matrix, for example, and you want something like
Y[i,j] = i*j*X[i,j]
Using a for loop is a lot slower than other options, and apply() doesn't know which i and j it is using.
A solution I can think of is defining a data.frame-like object with columns i,j,X and then use mutate() to get the desired Y values.
I have two questions:
(a) Is it possible to construct the above matrix using apply() instead of a for loop? If so, is it more efficient to construct it this way, or by using the mutate() trick?
(b) Assume I have both matrix and data.frame representations of the object. Which is more efficient if I want to do operations that involve the row and column indices? For example, if I want a statistic like the sum of (Y[i,j] - mean(Y))²/(i*j). I know I can construct matrices of the row and column indices and then just use vectorized functions, but is that better than using mutate?
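For what it's worth, here is one reading of the "data.frame + mutate()" idea from the question, sketched with dplyr; the column names i, j, X and Y are my own choices, not taken from any answer below:
library(dplyr)
X <- matrix(1:9, 3, 3)
# Long format: one row per (i, j) cell. expand.grid varies i fastest,
# which matches R's column-major as.vector(X).
d <- expand.grid(i = seq_len(nrow(X)), j = seq_len(ncol(X)))
d$X <- as.vector(X)
d <- d %>% mutate(Y = i * j * X)
# Back to a matrix, if needed
Y <- matrix(d$Y, nrow(X), ncol(X))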
Depending on how large and sparse your matrix is, the first thing to try is the row and col functions, which construct matrices of row and column indices respectively; you can then use them to transform your matrix:
X <- matrix(1:9, 3, 3)
row(X) * col(X) * X
# [,1] [,2] [,3]
#[1,] 1 8 21
#[2,] 4 20 48
#[3,] 9 36 81
Then for (Y[i,j] - mean(Y))² / (i*j), it's similar:
Y <- row(X) * col(X) * X
(Y - mean(Y)) ^ 2 / (row(Y) * col(Y))
# [,1] [,2] [,3]
#[1,] 592.11111 150.222222 6.259259
#[2,] 227.55556 7.111111 85.629630
#[3,] 88.92593 18.962963 344.308642
This is a fully vectorized approach, requiring no apply or for loop, but it does need additional memory.
I think outer may be faster than row(X) * col(X).
# Define dimensions
n.rows <- n.cols <- 1000
# Define matrix
X <- matrix(runif(n.rows * n.cols), ncol = n.cols)
# Psidom's approach
rowcol.method <- function(X){row(X) * col(X) * X}
# Approach using outer
outer.method <- function(X){outer(1:nrow(X), 1:ncol(X)) * X}
# Benchmark library
library(microbenchmark)
# Test
microbenchmark(
rowcol.method(X),
outer.method(X)
)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
rowcol.method(X) 20.895870 21.154815 23.795695 21.612485 22.584323 62.50660 100 b
outer.method(X) 5.608577 5.729724 6.883643 5.836526 5.977156 50.12508 100 a
Compare output:
identical(rowcol.method(X), outer.method(X))
[1] TRUE
Similarly for the other calculation, although there was a crazy outlier for the outer approach (221.66718 ms).
# Define matrix
Y <- row(X) * col(X) * X
# Psidom's approach
rowcol.method.Y <- function(Y) {(Y - mean(Y)) ^ 2 / (row(Y) * col(Y))}
# Approach using outer
outer.method.Y <- function(Y) {(Y - mean(Y)) ^ 2 / outer(1:nrow(X), 1:ncol(X))}
# Test
microbenchmark(
rowcol.method.Y(Y),
outer.method.Y(Y)
)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
rowcol.method.Y(Y) 27.94405 30.18635 34.63551 33.32627 37.06467 46.58983 100 b
outer.method.Y(Y) 11.27064 12.66349 18.77192 15.66756 18.18864 221.66718 100 a
Compare output:
identical(rowcol.method.Y(Y), outer.method.Y(Y))
[1] TRUE
Given two separate vectors of equal length: f.start and f.end, I would like to construct a sequence (by 1), going from f.start[1]:f.end[1] to f.start[2]:f.end[2], ..., to f.start[n]:f.end[n].
Here is an example with just 6 rows.
f.start f.end
[1,] 45739 122538
[2,] 125469 202268
[3,] 203563 280362
[4,] 281657 358456
[5,] 359751 436550
[6,] 437845 514644
Crudely, a loop can do it, but is extremely slow for larger datasets (rows>2000).
f.start<-c(45739,125469,203563,281657,359751,437845)
f.end<-c(122538,202268,280362,358456,436550,514644)
f.ind<-f.start[1]:f.end[1]
for (i in 2:length(f.start))
{
f.ind.temp<-f.start[i]:f.end[i]
f.ind<-c(f.ind,f.ind.temp)
}
I suspect this can be done with apply(), but I have not worked out how to include two separate arguments in apply, and would appreciate some guidance.
You can try mapply or Map, which iterate simultaneously over your two vectors. You need to provide the function as the first argument:
vec1 = c(1,33,50)
vec2 = c(10,34,56)
unlist(Map(':',vec1, vec2))
# [1] 1 2 3 4 5 6 7 8 9 10 33 34 50 51 52 53 54 55 56
Just replace vec1 and vec2 with f.start and f.end, provided all(f.start <= f.end) holds.
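As a side note (not part of this answer), on reasonably recent R versions (4.0.0 or later, if I remember correctly) base sequence() accepts a from argument and builds the same result in one call:
# Sketch: base-R alternative on R >= 4.0.0
sequence(vec2 - vec1 + 1, from = vec1)
# [1]  1  2  3  4  5  6  7  8  9 10 33 34 50 51 52 53 54 55 56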
Your loop is going to be slow as you are growing the vector f.ind. You will also get an increase in speed if you pre-allocate the length of the output vector.
# Some data (of length 3000)
set.seed(1)
f.start <- sample(1:10000, 3000)
f.end <- f.start + sample(1:200, 3000, TRUE)
# Functions
op <- function(L=1) {
f.ind <- vector("list", L)
for (i in 1:length(f.start)) {
f.ind[[i]] <- f.start[i]:f.end[i]
}
unlist(f.ind)
}
op2 <- function() unlist(lapply(seq(f.start), function(x) f.start[x]:f.end[x]))
col <- function() unlist(mapply(':',f.start, f.end))
# check output
all.equal(op(), op2())
all.equal(op(), col())
A few benchmarks
library(microbenchmark)
# Look at the effect of pre-allocating
microbenchmark(op(L=1), op(L=1000), op(L=3000), times=500)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# op(L = 1) 46.760416 48.741080 52.29038 49.636864 50.661506 113.08303 500 c
# op(L = 1000) 41.644123 43.965891 46.20380 44.633016 45.739895 94.88560 500 b
# op(L = 3000) 7.629882 8.098691 10.10698 8.338387 9.963558 60.74152 500 a
# Compare methods - the loop actually performs okay
# I left the original loop out
microbenchmark(op(L=3000), op2(), col(), times=500)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# op(L = 3000) 7.778643 8.123136 10.119464 8.367720 11.402463 62.35632 500 b
# op2() 6.461926 6.762977 8.619154 6.995233 10.028825 57.55236 500 a
# col() 6.656154 6.910272 8.735241 7.137500 9.935935 58.37279 500 a
So a loop performs okay speed-wise, but of course the Colonel's code is a lot cleaner. The *apply functions here won't really speed up the calculation, but they do offer tidier code and remove the need for pre-allocation.
Suppose I have a data frame that comes from reading in the following file Foo.csv
A,B,C
1,2,3
2,2,4
1,7,3
I would like to count the number of matching elements between the first row and subsequent rows. For example, the first row matches with the second row in one position, and matches with the third row in two positions. Here is some code that will achieve the desired effect.
foo = read.csv("Foo.csv")
numDiffs = rep(0,dim(foo)[1])
for (i in 2:dim(foo)[1]) {
numDiffs[i] = sum(foo[i,] == foo[1,])
}
print(numDiffs)
My question is, can this be vectorized to kill the loop and possibly reduce the running time? My first attempt is below, but it throws an error because == is not defined for this type of comparison.
colSums(foo == foo[1,])
rowSums(sapply(foo, function(x) c(0,x[1] == x[2:nrow(foo)])))
[1] 0 1 2
Or using the automatic recycling of matrix comparisons:
bar <- as.matrix(foo)
c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
# [1] 0 1 2
t() is there twice because the recycling is column- rather than row-wise.
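If the double transpose reads awkwardly, an equivalent spelling of the same row-wise recycling uses sweep(); this is just a sketch and is not included in the benchmarks below:
bar <- as.matrix(foo)   # same conversion as above
# sweep() replicates bar[1, ] across MARGIN = 2 (the columns), so each
# remaining row is compared element-wise against the first row
c(0, unname(rowSums(sweep(bar[-1, , drop = FALSE], 2, bar[1, ], FUN = "=="))))
# [1] 0 1 2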
As your dataset grows larger, you might get a bit more speed with something like this:
as.vector(c(0, rowSums(foo[rep(1, nrow(foo) - 1), ] == foo[-1, ])))
# [1] 0 1 2
The basic idea is to create a data.frame of the first row the same dimensions of the overall dataset less one row, and use that to check for equivalence with the remaining rows.
Deleting my original update, here are some benchmarks instead. Change "N" to see the effect on different data.frame sizes. The solution from @nacnudus scales best.
set.seed(1)
N <- 10000000
mydf <- data.frame(matrix(sample(10, N, replace = TRUE), ncol = 10))
dim(mydf)
# [1] 1000000 10
fun1 <- function(data) rowSums(sapply(data, function(x) c(0,x[1] == x[2:nrow(data)])))
fun2 <- function(data) as.vector(c(0, rowSums(data[rep(1, nrow(data) - 1), ] == data[-1, ])))
fun3 <- function(data) {
bar <- as.matrix(data)
c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
}
library(microbenchmark)
## On your original sample data
microbenchmark(fun1(foo), fun2(foo), fun3(foo))
# Unit: microseconds
# expr min lq median uq max neval
# fun1(foo) 109.903 119.0975 122.5185 127.0085 228.785 100
# fun2(foo) 333.984 354.5110 367.1260 375.0370 486.650 100
# fun3(foo) 233.490 250.8090 264.7070 269.8390 518.295 100
## On the sample data created above--I don't want to run this 100 times!
system.time(fun1(mydf))
# user system elapsed
# 15.53 0.06 15.60
system.time(fun2(mydf))
# user system elapsed
# 2.05 0.01 2.06
system.time(fun3(mydf))
# user system elapsed
# 0.32 0.00 0.33
HOWEVER, if Codoremifa were to change their code to vapply instead of sapply, that answer wins! From 15 seconds down to 0.24 seconds on 1 million rows.
fun4 <- function(data) {
rowSums(vapply(data, function(x) c(0, x[1] == x[2:nrow(data)]),
vector("numeric", length=nrow(data))))
}
microbenchmark(fun3(mydf), fun4(mydf), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun3(mydf) 369.5957 422.9507 438.8742 462.6958 486.3757 20
# fun4(mydf) 238.1093 316.9685 323.0659 328.0969 341.5154 20
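A short aside (my own note, not from the answer): vapply gains its speed mainly because the result type and length are declared up front via FUN.VALUE, so R can skip the simplification guesswork that sapply does. A minimal illustration of that declaration on the small data set:
vapply(foo, function(x) x[1] == 1, logical(1))
#     A     B     C
#  TRUE FALSE FALSE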
eh, I don't see why you can't just do..
c(foo[1,]) == foo
# A B C
#[1,] TRUE TRUE TRUE
#[2,] FALSE TRUE FALSE
#[3,] TRUE FALSE TRUE
.. or even better foo[1,,drop=TRUE] == foo...
Thus the result becomes...
rowSums( c( foo[1,] ) == foo )
#[1] 3 1 2
Remember, foo[1,] is still a data.frame. Coerce it to a vector and == is defined for what you need. This seems to be a little quicker than the vapply answer suggested by @AnandaMahto on a big data.frame.
Benchmarking
Comparing this against fun3 and fun4 from @AnandaMahto's answer above, I see a small speed improvement when using the larger data.frame, mydf:
microbenchmark(fun3(mydf), fun4(mydf), fun6(mydf) , times = 20)
#Unit: milliseconds
# expr min lq median uq max neval
# fun3(mydf) 320.7485 344.9249 356.1657 365.7576 399.5334 20
# fun4(mydf) 299.6660 313.7105 319.1700 327.8196 555.4625 20
# fun6(mydf) 196.8244 241.4866 252.6311 258.8501 262.7968 20
fun6 is defined as...
fun6 <- function(data) rowSums( c( data[1,] ) == data )
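One small difference worth keeping in mind (my observation, not from the answer): fun6 compares every row, including the first, against the first row, so its first element is the number of columns (a full self-match) rather than the 0 produced by the question's loop. If the 0 is needed, it can be restored afterwards:
res <- fun6(foo)
res
# [1] 3 1 2
res[1] <- 0   # match the question's convention for the first row
res
# [1] 0 1 2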