Optimizing calculation of combinations into list - large data set - r

I wonder if someone can figure out a faster way to calculate combinations of elements in a vector. My approach works but is slow with about 6 million elements in the vector.
Test vector
test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")
My approach
lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
Expected output
[[1]]
[1] "335261344015" "335261537633" "344015537633"
[[2]]
[1] "22404132858"
[[3]]
[1] "254654355860" "254654488288" "355860488288"
[[4]]
[1] "219943373817"
[[5]]
[1] "331839404477"

Here is an answer that is over 25x faster than the OP's solution on large test cases. It doesn't rely on paste; instead, we take advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and combnPrim from the linked answer for generating combinations of a vector. First, we show the efficiency gains of comboGeneral over the other functions:
## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2),
               comboGeneral(300, 2), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
gRbase::combnPrim(300, 2) 5.145654 5.192439 4.83561 7.167839 4.320497 3.98992 100
combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325 100
comboGeneral(300, 2) 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 100
Now, we create a function to create some random reproducible data that will be passed to our test functions:
makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}
makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111"
[4] "255429 462293 940013" "117488 474997 560332"
That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):
bigVec <- makeTestSet(10, 100000)
microbenchmark(standard = strsplit(bigVec, " "),
               withFixed = strsplit(bigVec, " ", fixed = TRUE),
               times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639 15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 15
@MichaelChirico was spot on. Putting it all together, we get:
combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}
## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633
[[2]]
[1] 22404132858
[[3]]
[1] 254654355860 254654488288 355860488288
[[4]]
[1] 219943373817
[[5]]
[1] 331839404477
## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}
As stated in the comments by the OP, the maximum number is less than one million (600000 to be exact). This means that after we multiply one of the numbers by at most 10^6 and add it to another 6-digit number (equivalent to simply concatenating the two strings of digits), we are guaranteed to stay within base R's numerical precision for doubles (i.e. 2^53 - 1). This is good because arithmetic on numbers is much more efficient than operations on strings.
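To see the equivalence concretely, here is a small sketch of the digit-shift trick (numCat is a made-up helper name used only for illustration):
numCat <- function(a, b) a * 10^(as.integer(log10(b)) + 1L) + b
numCat(335261, 344015)
## [1] 335261344015
paste0("335261", "344015")
## [1] "335261344015"
Shifting a left by the number of digits in b and then adding b produces the same digits as concatenating the two strings, while staying in fast double arithmetic.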
All that is left is to benchmark:
test.vector <- makeTestSet(100, 50)
microbenchmark(combPairOP(test.vector),
               combPairFast(test.vector),
               times = 20, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342 20
combPairFast(test.vector) 1.00000 1.0000 1.00000 1.00000 1.000 1.00000 20
And on larger vectors:
bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE
system.time(t1 <- combPairFast(bigTest.vector))
user system elapsed
0.303 0.011 0.314
system.time(t2 <- combPairOP(bigTest.vector))
user system elapsed
8.820 0.081 8.902 ### 8.902 / 0.314 ~= 28x faster
## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE

Related

Given a list, how do I access the last value of each element when the elements have different sizes?

I have a list of character vectors, and I would like to access the last value of each element.
mylist <- list(A = c("a"),
               B = c("a", "b"),
               C = c("a", "b", "c"),
               D = c("a", "b", "c", "d"))
At first (by looking at some related threads in Python), I thought I could do something like:
for (i in 1:length(mylist)) {
    print(mylist[[i]][-1])
}
# character(0)
# [1] "b"
# [1] "b" "c"
# [1] "b" "c" "d"
This doesn't work, since negative indexing drops the first element rather than selecting the last. Basically, as a result, I would like:
myfunction <- function(mylist) {
    output <- as.character()
    for (i in 1:length(mylist)) {
        output <- c(output, mylist[[i]][length(mylist[[i]])])
    }
    return(output)
}
myfunction(mylist)
# [1] "a" "b" "c" "d"
Is there a more efficient way?
As Rich Scriven pointed out in the (deleted) comments, there are many ways to accomplish this task, one of which is to use sapply and tail with the argument n = 1:
sapply(mylist, tail, n = 1)
# A B C D
#"a" "b" "c" "d"
A safer and potentially faster variant of the same idea is to use vapply:
vapply(mylist, tail, FUN.VALUE = character(1), n = 1)
# or a little shorter
# vapply(mylist, tail, "", 1)
Another benchmark:
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
set.seed(1)
mylist <- replicate(1e5, list(sample(letters, size = runif(1, 1, length(letters)))))
benchmark <- microbenchmark(
f1 = {myfunction(mylist)},
f2 = {sapply(mylist, function(l) l[length(l)])},
f3 = {vapply(mylist, function(l) l[length(l)], "")},
f4 = {sapply(mylist, tail, 1)},
f5 = {vapply(mylist, tail, "", 1)},
f6 = {mapply("[", mylist, lengths(mylist))},
f7 = {mapply("[[", mylist, lengths(mylist))}, # added this out of curiosity
f8 = {unlist(mylist)[cumsum(lengths(mylist))]},
times = 100L
)
autoplot(benchmark)
Same result here: Rich's unlist(mylist)[cumsum(lengths(mylist))] is the fastest by far. There seems to be no real difference between sapply and vapply. (myfunction() is as defined in the OP's question.)
#benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval
# f1 28797.26121 30462.16785 31836.26875 31191.7762 32950.92537 36586.5477 100
# f2 106.34213 117.75074 127.97763 124.9191 134.82047 176.2058 100
# f3 99.72042 106.87308 119.59811 113.9663 123.63619 465.5335 100
# f4 1242.11950 1291.38411 1409.35750 1350.3460 1505.76089 1880.6537 100
# f5 1189.22615 1274.48390 1366.07234 1333.8885 1418.75394 1942.2803 100
# f6 112.27316 123.73429 132.39888 129.8220 138.33851 191.2509 100
# f7 107.27392 118.19201 128.06681 123.1317 133.29827 208.8425 100
# f8 28.03948 28.84125 31.19637 30.3115 32.94077 40.9624 100
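For intuition about why the unlist approach wins, here is a small sketch of its mechanics (small_list is a made-up toy list so the benchmark object mylist is left untouched): lengths() gives each element's length, cumsum() turns those into the end positions of each element inside the flattened vector, and one vectorized index pulls out every last value with no per-element function calls.
small_list <- list(A = "a", B = c("a", "b"), C = c("a", "b", "c"), D = c("a", "b", "c", "d"))
cumsum(lengths(small_list))
#  A  B  C  D
#  1  3  6 10
unlist(small_list)[cumsum(lengths(small_list))]
#   A  B2  C3  D4
# "a" "b" "c" "d"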
Benchmarking the solutions proposed in the comments, we find that Rich's proposal using unlist is the fastest.
By inspecting the code and tweaking the parameters, we can make it even faster.
The slowness of tail is discussed here: https://stackoverflow.com/a/37238415/2270475
On the OP's sample data:
library(microbenchmark)
microbenchmark(
r2evans = sapply(mylist, function(l) l[length(l)]),
markus = sapply(mylist, tail, 1),
Rich1 = mapply("[", mylist, lengths(mylist)),
Rich2 = unlist(mylist)[cumsum(lengths(mylist))],
markus2 = vapply(mylist, tail, character(1), 1),
mm = .Internal(unlist(mylist,FALSE,FALSE))[cumsum(lengths(mylist,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 16.083333 12.764706 25.545957 12.368421 13.133333 122.1428571 100
# markus 82.333333 59.294118 50.937673 60.342105 60.644444 10.2253968 100
# Rich1 19.583333 15.294118 13.368047 15.394737 15.622222 2.7492063 100
# Rich2 4.166667 3.705882 3.211045 3.789474 3.911111 0.7650794 100
# markus2 73.166667 53.176471 44.669822 50.263158 54.155556 10.4857143 100
# mm 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
On a 1000 times longer list:
mylist_long <- do.call(c,replicate(1000,mylist,simplify = F))
length(mylist_long) # [1] 4000
microbenchmark(
r2evans = sapply(mylist_long, function(l) l[length(l)]),
markus = sapply(mylist_long, tail, 1),
Rich1 = mapply("[", mylist_long, lengths(mylist_long)),
Rich2 = unlist(mylist_long)[cumsum(lengths(mylist_long))],
markus2 = vapply(mylist_long, tail, character(1), 1),
mm = .Internal(unlist(mylist_long,FALSE,FALSE))[cumsum(lengths(mylist_long,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 26.14882 27.20436 27.07436 28.13731 28.54701 27.23846 100
# markus 679.57251 698.84828 668.00160 715.30180 674.71067 443.42502 100
# Rich1 27.53607 28.80581 29.82736 29.00353 31.02343 38.79978 100
# Rich2 22.39863 21.79129 20.41467 21.53371 20.70750 13.03032 100
# markus2 667.97494 702.14882 676.91881 718.41899 696.11934 633.17181 100
# mm 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100

Count sequence of blanks in a string with R

I want to retrieve the sequences of consecutive blanks in a string, like:
mystring = "lalalal  lalalal lalala   lalalala "
retrieve_sequence_of_consecutive_blanks(mystring)
[1] 2 1 3 1
Actually, I have a solution with this:
sequence_of_blanks = function(vectors_of_strings) {
    tokens = strsplit(x = vectors_of_strings, split = "", fixed = TRUE)
    sequences = lapply(X = tokens, FUN = rle)
    resultats = lapply(sequences, function(item) {
        item$lengths[which(item$values == " ")]
    })
    resultats
}
My question is about performance: do you think there is a better way to do it? What about a regex solution? What about a Python solution?
You may match all space chunks and get their lengths, e.g.
library(stringr)
nchar(unlist(str_extract_all(mystring, " +")))
Or the base R equivalent:
nchar(unlist(regmatches(mystring, gregexpr(" +", mystring))))
Both yield
[1] 2 1 3 1
In Python, you may use (after import re):
[x.count(" ") for x in re.findall(" +", mystring)]
If you plan to count any whitespace, replace the literal space with \s. Tweak as per your further requirements.
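For example, a quick base R sketch of that whitespace variant (mystring2 is a made-up example string containing a tab, not part of the question):
mystring2 <- "lalalal \t lalalal  lalala"
nchar(unlist(regmatches(mystring2, gregexpr("\\s+", mystring2))))
## [1] 3 2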
You could use
myrle <- rle(charToRaw(mystring) == charToRaw(" "))
myrle$lengths[myrle$values]
which is a bit faster:
microbenchmark::microbenchmark(
OP = sequence_of_blanks(mystring),
akrun = tabulate(cumsum(c(TRUE, diff(str_locate_all(mystring, " ")[[1]][,2]) !=1))),
wiktor = nchar(unlist(str_extract_all(mystring, " +"))),
# charToRaw(mystring) == charToRaw(" "),
fprive = { myrle <- rle(charToRaw(mystring) == charToRaw(" ")); myrle$lengths[myrle$values] }
)
Unit: microseconds
expr min lq mean median uq max neval
OP 32.826 37.680 42.97734 42.3940 46.3405 115.239 100
akrun 40.718 44.874 48.40903 48.4360 50.7050 78.991 100
wiktor 24.166 29.753 34.73199 35.0955 36.7370 129.626 100
fprive 23.258 25.877 29.50010 28.6000 31.6730 43.721 100
If you really need performance, designing an Rcpp function for your particular use, taking charToRaw(mystring) and charToRaw(" ") as arguments, would improve things further.
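For instance, here is a minimal sketch of what such an Rcpp helper could look like (blank_run_lengths and its exact interface are my own illustration, not code from this answer):
library(Rcpp)
cppFunction('
IntegerVector blank_run_lengths(RawVector x, RawVector blank) {
  // Single pass over the raw bytes, recording the length of each run
  // of the target byte (here: a space).
  std::vector<int> out;
  Rbyte b = blank[0];
  int run = 0;
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    if (x[i] == b) {
      ++run;
    } else if (run > 0) {
      out.push_back(run);
      run = 0;
    }
  }
  if (run > 0) out.push_back(run);
  return wrap(out);
}')
blank_run_lengths(charToRaw(mystring), charToRaw(" "))
## [1] 2 1 3 1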
If you want a bit more performance using simple base R:
length_seq_blanks <- function(string) {
    x <- nchar(unlist(strsplit(string, "[a-z]+")))
    x[x > 0]
}
length_seq_blanks(mystring)
[1] 2 1 3 1
Benchmark
microbenchmark::microbenchmark(
    snoram = {
        length_seq_blanks <- function(string) {
            x <- nchar(unlist(strsplit(string, "[a-z]+")))
            x[x > 0]
        }
        length_seq_blanks(mystring)
    },
    fprive = {
        myrle <- rle(charToRaw(mystring) == charToRaw(" "))
        myrle$lengths[myrle$values]
    },
    unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval
snoram 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
fprive 1.866597 1.818247 1.734015 1.684211 1.634093 1.20812 100

R: scan vectors once instead of 4 times?

Suppose I have two equal length logical vectors.
Computing the confusion matrix the easy way:
c(sum(actual == 1 & predicted == 1),
sum(actual == 0 & predicted == 1),
sum(actual == 1 & predicted == 0),
sum(actual == 0 & predicted == 0))
requires scanning the vectors 4 times.
Is it possible to do that in a single pass?
PS. I tried table(2*actual+predicted) and table(actual,predicted) but both are obviously much slower.
PPS. Speed is not my main consideration here; I am more interested in understanding the language.
You could try using data.table
library(data.table)
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N
data
set.seed(24)
actual <- sample(0:1, 10 , replace=TRUE)
predicted <- sample(0:1, 10, replace=TRUE)
Benchmarks
Using data.table_1.9.5 and dplyr_0.4.0
library(microbenchmark)
library(dplyr)  # data_frame, %>% and count are used in f3 and f4 below
set.seed(245)
actual <- sample(0:1, 1e6 , replace=TRUE)
predicted <- sample(0:1, 1e6, replace=TRUE)
f1 <- function() {
    DT <- data.table(actual, predicted)
    setkey(DT, actual, predicted)[, .N, .(actual, predicted)]$N
}
f2 <- function() {
    table(actual, predicted)
}
f3 <- function() {
    data_frame(actual, predicted) %>%
        group_by(actual, predicted) %>%
        summarise(n())
}
microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
#f1() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
#f2() 20.818410 22.378995 22.321816 22.56931 22.140855 22.984667 20 b
#f3() 1.262047 1.248396 1.436559 1.21237 1.220109 2.504662 20 a
Also including count from dplyr and tabulate in the benchmarks, on a slightly bigger dataset:
set.seed(498)
actual <- sample(0:1, 1e7 , replace=TRUE)
predicted <- sample(0:1, 1e7, replace=TRUE)
f4 <- function() {
    data_frame(actual, predicted) %>%
        count(actual, predicted)
}
f5 <- function() {
    tabulate(4 - actual - 2*predicted, 4)
}
Update
Also including another data.table solution (provided by @Arun) in the benchmarks:
f6 <- function() {setDT(list(actual, predicted))[,.N, keyby=.(V1,V2)]$N}
microbenchmark(f1(), f3(), f4(), f5(), f6(), unit='relative', times=20L)
#Unit: relative
#expr min lq mean median uq max neval cld
#f1() 2.003088 1.974501 2.020091 2.015193 2.080961 1.924808 20 c
#f3() 2.488526 2.486019 2.450749 2.464082 2.481432 2.141309 20 d
#f4() 2.388386 2.423604 2.430581 2.459973 2.531792 2.191576 20 d
#f5() 1.034442 1.125585 1.192534 1.217337 1.239453 1.294920 20 b
#f6() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
Like this:
tabulate(4 - actual - 2*predicted, 4)
(tabulate here is much faster than table because it knows the output will be a vector of length 4).
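To see how that index expression maps the four (actual, predicted) outcomes onto the bins 1 to 4 that tabulate counts, here is a quick sketch on toy vectors (act and pred are made-up names, used only for illustration):
## (1, 1) -> 4 - 1 - 2 = 1,  (0, 1) -> 2,  (1, 0) -> 3,  (0, 0) -> 4
act  <- c(1, 0, 1, 0, 1)
pred <- c(1, 1, 0, 0, 1)
tabulate(4 - act - 2*pred, 4)
## [1] 2 1 1 1
The result comes out in the same order as the OP's original c(sum(act == 1 & pred == 1), ...) expression.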
There is table which computes a cross tabulation and should give similar results if actual and predicted contain only zeros and ones:
table(actual, predicted)
Internally, this works by pasting the vectors together -- horribly inefficient. It seems the coercion to character also happens when tabulating only one value, which might be the very reason for the bad performance of table(actual*2 + predicted) as well.

identify and remove single valued columns from table in R

I have a reasonably large dataset (~250k rows and 400 cols, ~0.5 GB) where a number of columns are single-valued (i.e. they only have one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this. On my PC this takes:
> system.time(apply(data, 2, function(x) length(unique(x))))
# user system elapsed
# 34.37 0.71 35.15
Which isn't so bad for one data set, but I'd like to repeat this multiple times on different datasets.
You can use lapply instead:
data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]
Note that I added unlist to convert the resulting list to a vector of TRUE / FALSE values which will be used for the subsetting.
Edit: here's a little benchmark:
library(microbenchmark)
a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)
microbenchmark(
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
#Unit: relative
# expr min lq median uq max neval
#apply 41.29383 40.06719 39.72256 39.16569 28.54078 100
#lapply 1.00000 1.00000 1.00000 1.00000 1.00000 100
Note that apply will first convert the data.frame to a matrix and then perform the operation, which is less efficient. So in most cases where you're working with data.frames you can (and should) avoid using apply and use, for example, lapply instead.
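As a quick illustration of that coercion (df_mixed is a hypothetical two-column frame, not from the question): apply() goes through as.matrix(), so a data.frame with mixed column types is turned into a single character matrix before the function is ever applied.
df_mixed <- data.frame(num = c(1, 2), chr = c("a", "b"))
storage.mode(as.matrix(df_mixed))
## [1] "character"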
You may also try:
set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3*10,replace=TRUE), ncol=10))
Filter(function(x) (length(unique(x))>1), df)
Or
df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)] #still better than `apply`
Including these also in the speed comparison (using @beginneR's sample data):
microbenchmark(
new ={Filter(function(x) (length(unique(x))>1), df)},
new1={df[,colSums(df[-1,]==df[-nrow(df),])!=(nrow(df)-1)]},
apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
unit = "relative",
times = 100)
# Unit: relative
# expr min lq median uq max neval
# new 1.0000000 1.0000000 1.000000 1.0000000 1.000000 100
# new1 4.3741503 4.5144133 4.063634 3.9591345 1.713178 100
# apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233 100
#lapply 0.9991514 0.9979483 1.002005 0.9958308 1.002603 100

How to get row wise standard deviation over specific columns [duplicate]

I'd like to compute the variance for each row in a matrix. For the following matrix A
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 5 6 10
[3,] 50 7 11
[4,] 4 8 12
I would like to get
[1] 16.0000 7.0000 564.3333 16.0000
I know I can achieve this with apply(A, 1, var), but is there a faster or better way? From Octave, I can do this with var(A, 0, 2), but I don't understand how the y argument of R's var() function is to be used.
Edit: The actual dataset of a typical chunk has around 100 rows and 500 columns. The total amount of data is around 50GB though.
You could potentially vectorize var over rows (or columns) using rowSums and rowMeans
RowVar <- function(x, ...) {
    rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
}
RowVar(A)
#[1] 16.0000 7.0000 564.3333 16.0000
Using @Richard's data yields:
microbenchmark(apply(m, 1, var), RowVar(m))
## Unit: milliseconds
## expr min lq median uq max neval
## apply(m, 1, var) 343.369091 400.924652 424.991017 478.097573 746.483601 100
## RowVar(m) 1.766668 1.916543 2.010471 2.412872 4.834471 100
You can also create a more general function that accepts a syntax similar to apply but remains vectorized (the column-wise variance will be slower, as the matrix needs to be transposed first):
MatVar <- function(x, dim = 1, ...) {
    if (dim == 1) {
        rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
    } else if (dim == 2) {
        rowSums((t(x) - colMeans(x, ...))^2, ...)/(dim(x)[1] - 1)
    } else stop("Please enter valid dimension")
}
MatVar(A, 1)
## [1] 16.0000 7.0000 564.3333 16.0000
MatVar(A, 2)
##         V1       V2       V3
## 547.333333 1.666667 1.666667
This is one of the main reasons why apply() is useful. It is meant to operate on the margins of an array or matrix.
set.seed(100)
m <- matrix(sample(1e5L), 1e4L)
library(microbenchmark)
microbenchmark(apply(m, 1, var))
# Unit: milliseconds
# expr min lq median uq max neval
# apply(m, 1, var) 270.3746 283.9009 292.2933 298.1297 343.9531 100
Is 300 milliseconds too long to make 10,000 calculations?
