I have a list of vectors, lis, which I need to match against another vector, vec:
lis <- list(c(2,0,0),c(1,1,0), c(1,0,1), c(0,2,0), c(0,1,1), c(0,0,2))
vec <- c(1,1,0)
So I would get either a logical output:
[1] FALSE TRUE FALSE FALSE FALSE FALSE
Or just the position of the match within lis.
I've been trying things along these lines:
unlist(lis) %in% vec
but the problem is that the position of each number matters, i.e. I need to distinguish c(1,1,0) from c(1,0,1), which I haven't been able to do. I would also like to avoid for loops, as this needs to be quite efficient (fast).
The answers by #agstudy and #Julius involve a loop over the (long) lis object; here's an alternative that assumes all elements of lis have the same length as vec, which allows a vectorised comparison against the unlisted reference:
shortloop <- function(x, lst)
colSums(matrix(unlist(lst) == x, length(x))) == length(x)
which is fast when lis is long compared to vec.
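For example, on the OP's data (my addition; which() converts the logical result into match positions):
shortloop(vec, lis)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
which(shortloop(vec, lis))
## [1] 2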
longloop <- function(x, lst)
sapply(lst, identical, x)
l1 = rep(lis, 1000)
library(microbenchmark)
microbenchmark(shortloop(vec, l1), longloop(vec, l1))
## Unit: microseconds
## expr min lq median uq max neval
## shortloop(vec, l1) 793.009 808.2175 896.299 905.8795 1058.79 100
## longloop(vec, l1) 18732.338 21649.0770 21797.646 22107.7805 24788.86 100
Interestingly, using for is not that bad from a performance perspective compared to the implicit loop in sapply (though it is more complicated and error-prone):
longfor <- function(x, lst) {
res <- logical(length(lst))
for (i in seq_along(lst))
res[[i]] = identical(x, lst[[i]])
res
}
library(compiler)
longforc = cmpfun(longfor)
microbenchmark(longloop(vec, l1), longfor(vec, l1), longforc(vec, l1))
## Unit: milliseconds
## expr min lq median uq max neval
## longloop(vec, l1) 18.92824 21.20457 21.71295 21.80009 23.23286 100
## longfor(vec, l1) 23.64756 26.73481 27.43815 27.61699 28.33454 100
## longforc(vec, l1) 17.40998 18.28686 20.47844 20.75303 21.49532 100
sapply(lis, identical, vec)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE
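If the position of the match is wanted instead, wrap the result in which() (a small addition to the above):
which(sapply(lis, identical, vec))
# [1] 2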
Benchmark:
library(microbenchmark)
l1 <- list(1:1000)[rep(1, 10000)]
v1 <- sample(1000)
AG <- function() sapply(l1,function(x)all(x==v1))
J <- function() sapply(l1, identical, v1)
microbenchmark(AG(), J())
# Unit: milliseconds
# expr min lq median uq max neval
# AG() 76.42732 84.26958 103.99233 111.62671 148.2824 100
# J() 32.14965 37.54198 47.34538 50.93195 104.4036 100
sapply(lis,function(x)all(x==vec))
[1] FALSE TRUE FALSE FALSE FALSE FALSE
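One caveat (my addition, not in the original answer): == recycles, so vectors of unequal length can spuriously "match" when one length is a multiple of the other; identical() avoids this:
all(c(1, 1) == c(1, 1, 1, 1))     # TRUE, despite the different lengths
identical(c(1, 1), c(1, 1, 1, 1)) # FALSE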
Related
I have a list of character vectors, and I would like to access the last value of each element.
mylist<-list(A=c("a"),
B=c("a","b"),
C=c("a","b","c"),
D=c("a","b","c","d"))
At first (looking at some related threads about Python), I thought I could do something like:
for(i in 1:length(mylist)){
print(mylist[[i]][-1])
}
# character(0)
# [1] "b"
# [1] "b" "c"
# [1] "b" "c" "d"
This doesn't work, since [-1] drops the first element rather than selecting the last. Basically, I would like the result produced by:
myfunction<-function(mylist){
output<-as.character()
for(i in 1:length(mylist)){
output<-c(output, mylist[[i]][length(mylist[[i]])])}
return(output)
}
myfunction(mylist)
# [1] "a" "b" "c" "d"
Is there a more efficient way?
As Rich Scriven pointed out in the (deleted) comments, there are many ways to accomplish this task, one of which is to use sapply and tail with argument n = 1:
sapply(mylist, tail, n = 1)
# A B C D
#"a" "b" "c" "d"
Another, safer, and potentially faster variant of the same idea is to use vapply:
vapply(mylist, tail, FUN.VALUE = character(1), n = 1)
# or a little shorter
# vapply(mylist, tail, "", 1)
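A sketch of why vapply is the safer choice (my hypothetical failure case): if some element were empty, sapply() would silently fall back to returning a list, while vapply() stops with an error:
broken <- c(mylist, list(E = character(0)))
sapply(broken, tail, n = 1)       # returns a list, since E yields character(0)
# vapply(broken, tail, "", n = 1) # error: result for E has length 0, not 1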
Another benchmark:
set.seed(1)
mylist <- replicate(1e5, list(sample(letters, size = runif(1, 1, length(letters)))))
library(microbenchmark)
library(ggplot2)  # needed for autoplot() below
benchmark <- microbenchmark(
f1 = {myfunction(mylist)},
f2 = {sapply(mylist, function(l) l[length(l)])},
f3 = {vapply(mylist, function(l) l[length(l)], "")},
f4 = {sapply(mylist, tail, 1)},
f5 = {vapply(mylist, tail, "", 1)},
f6 = {mapply("[", mylist, lengths(mylist))},
f7 = {mapply("[[", mylist, lengths(mylist))}, # added this out of curiosity
f8 = {unlist(mylist)[cumsum(lengths(mylist))]},
times = 100L
)
autoplot(benchmark)
Same result here: Rich's unlist(mylist)[cumsum(lengths(mylist))] is the fastest by far. There seems to be no real difference between sapply and vapply. (f1 is myfunction() as defined in the OP's question.)
#benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval
# f1 28797.26121 30462.16785 31836.26875 31191.7762 32950.92537 36586.5477 100
# f2 106.34213 117.75074 127.97763 124.9191 134.82047 176.2058 100
# f3 99.72042 106.87308 119.59811 113.9663 123.63619 465.5335 100
# f4 1242.11950 1291.38411 1409.35750 1350.3460 1505.76089 1880.6537 100
# f5 1189.22615 1274.48390 1366.07234 1333.8885 1418.75394 1942.2803 100
# f6 112.27316 123.73429 132.39888 129.8220 138.33851 191.2509 100
# f7 107.27392 118.19201 128.06681 123.1317 133.29827 208.8425 100
# f8 28.03948 28.84125 31.19637 30.3115 32.94077 40.9624 100
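To see why f8 is so fast, here is a short walkthrough (my addition) using the OP's original four-element mylist: cumsum(lengths(mylist)) gives the position of each element's last value inside the unlisted vector:
lengths(mylist)
# A B C D
# 1 2 3 4
cumsum(lengths(mylist))
# A B C  D
# 1 3 6 10
unlist(mylist)[cumsum(lengths(mylist))]
#   A  B2  C3  D4
# "a" "b" "c" "d"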
Benchmarking the solutions proposed in the comments, we find that Rich's proposal using unlist is the fastest. By inspecting the code and tweaking the parameters, we can make it even faster.
The slowness of tail is discussed here: https://stackoverflow.com/a/37238415/2270475
On the OP's sample data:
library(microbenchmark)
microbenchmark(
r2evans = sapply(mylist, function(l) l[length(l)]),
markus = sapply(mylist, tail, 1),
Rich1 = mapply("[", mylist, lengths(mylist)),
Rich2 = unlist(mylist)[cumsum(lengths(mylist))],
markus2 = vapply(mylist, tail, character(1), 1),
mm = .Internal(unlist(mylist,FALSE,FALSE))[cumsum(lengths(mylist,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 16.083333 12.764706 25.545957 12.368421 13.133333 122.1428571 100
# markus 82.333333 59.294118 50.937673 60.342105 60.644444 10.2253968 100
# Rich1 19.583333 15.294118 13.368047 15.394737 15.622222 2.7492063 100
# Rich2 4.166667 3.705882 3.211045 3.789474 3.911111 0.7650794 100
# markus2 73.166667 53.176471 44.669822 50.263158 54.155556 10.4857143 100
# mm 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
On a 1000 times longer list:
mylist_long <- do.call(c,replicate(1000,mylist,simplify = F))
length(mylist_long) # [1] 4000
microbenchmark(
r2evans = sapply(mylist_long, function(l) l[length(l)]),
markus = sapply(mylist_long, tail, 1),
Rich1 = mapply("[", mylist_long, lengths(mylist_long)),
Rich2 = unlist(mylist_long)[cumsum(lengths(mylist_long))],
markus2 = vapply(mylist_long, tail, character(1), 1),
mm = .Internal(unlist(mylist_long,FALSE,FALSE))[cumsum(lengths(mylist_long,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 26.14882 27.20436 27.07436 28.13731 28.54701 27.23846 100
# markus 679.57251 698.84828 668.00160 715.30180 674.71067 443.42502 100
# Rich1 27.53607 28.80581 29.82736 29.00353 31.02343 38.79978 100
# Rich2 22.39863 21.79129 20.41467 21.53371 20.70750 13.03032 100
# markus2 667.97494 702.14882 676.91881 718.41899 696.11934 633.17181 100
# mm 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
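Note that .Internal() cannot be used in package code; an exported-API version of mm with (I believe) the same effect uses unlist()'s use.names argument, which is exactly the name handling the .Internal call skips:
unlist(mylist_long, use.names = FALSE)[cumsum(lengths(mylist_long, use.names = FALSE))]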
In a character vector I would like to remove both members of each duplicated pair, so that only the unique strings remain. That is, I have
MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")
and I would like to remove both members of each duplicate pair, thus selecting:
[1] "bbb" "ccc"
With no luck I tried:
unique(MyString)
x <- table(MyString)
names(x[x==1])
[1] "bbb" "ccc"
also:
MyString[ !duplicated(MyString) & !duplicated(MyString,fromLast = T) ]
[1] "bbb" "ccc"
Find the set of duplicates
dups = MyString[ duplicated(MyString) ]
and drop all occurrences in the set
MyString[ !MyString %in% dups ]
Alternative:
setdiff(MyString, dups)
The table-based solution from #Moody_Mudskipper provides more flexibility, e.g., to choose strings that occur exactly twice. An alternative, analogous to the table() solutions but probably faster when MyString is long: create an index into the unique strings, find the number of times each unique string is matched (tabulate(...) == 1), and use that to subset the unique strings:
UString = unique(MyString)
UString[ tabulate(match(MyString, UString)) == 1 ]
or save the need to create UString
MyString[ which(tabulate(match(MyString, MyString)) == 1) ]
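Step by step on the example (my walkthrough): match() maps each string to the index of its first occurrence, and tabulate() counts how many times each index occurs:
match(MyString, MyString)
# [1] 1 2 3 4 1 4
tabulate(match(MyString, MyString))
# [1] 2 1 1 2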
Alternative: sort and then find runs of length 1.
r = rle(sort(MyString))
r$values[ r$lengths == 1 ]
For performance, here are some functions implementing the various solutions
f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f1 = function(x) setdiff( x, x[duplicated(x)] )
f2 = function(x) { ux = unique(x); ux[ tabulate(match(x, ux)) == 1 ] }
f3 = function(x) x[ which( tabulate( match(x, x) ) == 1 ) ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }
f6 = function(x) x[ !duplicated(x) & !duplicated(x, fromLast = TRUE) ]
Evidence that they produce identical results (x here is a test vector; e.g., x <- MyString):
> identical(f0(x), f1(x))
[1] TRUE
> identical(f0(x), f2(x))
[1] TRUE
> identical(f0(x), f3(x))
[1] TRUE
> identical(f0(x), f4(x))
[1] TRUE
> identical(f0(x), f5(x))
[1] TRUE
> identical(f0(x), f6(x))
[1] TRUE
f5() (also the original implementation) fails for x = character(0):
> f1(character(0))
character(0)
> f5(character(0))
NULL
f4() and f5() return values in alphabetical order, whereas the others preserve the order in the input, like unique(). All methods but f5() work with vectors of other type, e.g., integer() (f5() always returns a character vector, the others return a vector with the same type as the input). f4() and f5() do not recognize unique occurrences of NA.
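A quick demonstration of the NA point (my addition, using f0 and f4 from above):
f0(c("a", NA))
# [1] "a" NA    <- the unique NA occurrence is kept
f4(c("a", NA))
# [1] "a"       <- sort() drops the NA before rle() sees it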
And timings:
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 9.195 10.9730 12.35724 11.8120 13.0580 29.100 100
f1(x) 20.471 22.6625 50.15586 24.6750 25.9915 2600.307 100
f2(x) 13.708 15.2265 58.58714 16.8180 18.4685 4180.829 100
f3(x) 7.533 8.8775 52.43730 9.9855 11.0060 4252.063 100
f4(x) 74.333 79.4305 124.26233 83.1505 87.4455 4091.371 100
f5(x) 147.744 154.3080 196.05684 158.4880 163.6625 3721.522 100
f6(x) 12.458 14.2335 58.11869 15.4805 17.0440 4250.500 100
Here's performance with 10,000 unique words
> x = readLines("/usr/share/dict/words", 10000)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 848.086 871.359 880.8841 873.637 899.669 916.528 10
f1(x) 1440.904 1460.704 1556.7154 1589.405 1607.048 1640.347 10
f2(x) 2143.997 2257.041 2288.1878 2288.329 2334.494 2372.639 10
f3(x) 1420.144 1548.055 1547.8093 1562.927 1596.574 1601.176 10
f4(x) 11829.680 12141.870 12369.5407 12311.334 12716.806 12952.950 10
f5(x) 15796.546 15833.650 16176.2654 15858.629 15913.465 18604.658 10
f6(x) 1219.036 1356.807 1354.3578 1363.276 1372.831 1407.077 10
And with substantial duplication
> x = sample(head(x, 1000), 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 1.914699 1.922925 1.992511 1.945807 2.030469 2.246022 100
f1(x) 1.888959 1.909469 2.097532 1.948002 2.031083 5.310342 100
f2(x) 1.396825 1.404801 1.447235 1.420777 1.479277 1.820402 100
f3(x) 1.248126 1.257283 1.295493 1.285652 1.329139 1.427220 100
f4(x) 24.075280 24.298454 24.562576 24.459281 24.700579 25.752481 100
f5(x) 4.044137 4.120369 4.307893 4.174639 4.283030 7.740830 100
f6(x) 1.221024 1.227792 1.264572 1.243201 1.295888 1.462007 100
f0() seems to be the speed winner when duplicates are rare
> x = readLines("/usr/share/dict/words", 100000)
> microbenchmark(f0(x), f1(x), f3(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 11.03298 11.17124 12.17688 11.36114 11.62769 19.83124 100
f1(x) 21.16154 21.33792 22.76237 21.67234 22.26473 31.99544 100
f3(x) 21.15801 21.49355 22.60749 21.77821 22.54203 31.17288 100
f6(x) 18.72260 18.97623 20.29060 19.46875 19.94892 28.17551 100
f3() and f6() look correct and fast; f6() is probably easier to understand (but only handles the special case of keeping words that occur exactly once).
Is there a generic way of making the following R code faster? In Python, for example, a dict comprehension (see the equivalent below) would be a nice, fast alternative.
R:
l1 <- 1:3
l2 <- c("a", "b", "c")
foo <- function(x) {return(5*x)}
bar <- list()
for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])
Python
l1 = range(1, 4)
l2 = ["a", "b", "c"]
def foo(x):
    return 5*x
{b: foo(a) for a, b in zip(l1, l2)}
We're talking about speed, so let's do some benchmarking:
library(microbenchmark)
microbenchmark(op = {for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])},
lapply = setNames(lapply(l1,foo),l2),
vectorised = setNames(as.list(foo(l1)), l2))
Unit: microseconds
expr min lq mean median uq max neval
op 7.982 9.122 10.81052 9.693 10.548 36.206 100
lapply 5.987 6.557 7.73159 6.842 7.270 55.877 100
vectorised 4.561 5.132 6.72526 5.417 5.987 80.964 100
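As a quick correctness check (my addition, using the small l1, l2, bar, and foo from the question), the vectorised form reproduces the loop's result exactly:
identical(bar, setNames(as.list(foo(l1)), l2))
# [1] TRUE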
But these small values don't mean much, so I pumped up the vector length to 10,000 where you'll really see a difference:
l <- 10000
l1 <- seq_len(l)
l2 <- sample(letters, l, replace = TRUE)
microbenchmark(op = {bar <- list(); for (i in 1:length(l1)) bar[l2[i]] <- foo(l1[i])},
lapply = setNames(lapply(l1,foo),l2),
vectorised = setNames(as.list(foo(l1)), l2),
times = 100)
Unit: microseconds
expr min lq mean median uq max neval
op 30122.865 33325.788 34914.8339 34769.8825 36721.428 41515.405 100
lapply 13526.397 14446.078 15217.5309 14829.2320 15351.933 19241.767 100
vectorised 199.559 259.997 349.0544 296.9155 368.614 3189.523 100
But tacking onto what everyone else said, it doesn't have to be a list. If you remove the list requirement:
microbenchmark(setNames(foo(l1), l2))
Unit: microseconds
expr min lq mean median uq max neval
setNames(foo(l1), l2) 22.522 23.8045 58.06888 25.0875 48.322 1427.417 100
I have a list of large matrices. All these matrices have the same number of rows and I want to "unlist" them and bind all their columns together. Below is a piece of code that I wrote, but I am not sure if this is the best I can achieve in terms of computational efficiency.
# simulate
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
> system.time( res <- matrix( as.numeric( unlist(test) ) ,nr,nc*n) )
user system elapsed
0.114 0.006 0.120
To work on a list and call a function on all objects, do.call is my usual first idea, along with cbind here to bind all objects by column.
For n = 10 (with the other answers included for the sake of completeness):
n <- 10
nr <- 24
nc <- 8000
test <- list()
set.seed(1234)
for (i in 1:n){
test[[i]] <- matrix(rnorm(nr*nc),nr,nc)
}
require(data.table)
ori <- function() { matrix( as.numeric( unlist(test) ) ,nr,nc*n) }
Tensibai <- function() { do.call(cbind,test) }
BrodieG <- function() { `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) }
nicola <- function() { setattr(unlist(test),"dim",c(nr,nc*n)) }
library(microbenchmark)
microbenchmark(r1 <- ori(),
r2 <- Tensibai(),
r3 <- BrodieG(),
r4 <- nicola(), times=10)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
r1 <- ori() 23.834673 24.287391 39.49451 27.066844 29.737964 93.74249 10 a
r2 <- Tensibai() 17.416232 17.706165 18.18665 17.873083 18.192238 21.29512 10 a
r3 <- BrodieG() 6.009344 6.145045 21.63073 8.690869 10.323845 77.95325 10 a
r4 <- nicola() 5.912984 6.106273 13.52697 6.273904 6.678156 75.40914 10 a
As for the why (asked in the comments), #nicola gave the answer: there is less copying than in the original method.
All methods give the same result:
> all(sapply(list(r2, r3, r4), identical, r1))
[1] TRUE
It seems that do.call(cbind, test) beats the original method because of a copy made during the matrix() call. What is interesting is that you can avoid that copy using the data.table::setattr function, which sets attributes by reference, avoiding any copy. I also omitted the as.numeric part, since it is not necessary (unlist(test) is already numeric). So:
require(microbenchmark)
require(data.table)
f1<-function() setattr(unlist(test),"dim",c(nr,nc*n))
f2<-function() do.call(cbind,test)
microbenchmark(res <-f1(),res2 <- f2(),times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# res <- f1() 4.088455 4.183504 7.540913 4.44109 4.988605 35.05378 10
#res2 <- f2() 18.325302 18.379328 18.776834 18.66857 19.100681 19.47415 10
identical(res,res2)
#[1] TRUE
I think I have a better one. We can avoid some of the overhead from cbind since we know these all have the same number of rows and columns. Instead, we use c knowing that the underlying vector nature of the matrices will allow us to re-wrap them into the correct dimensions:
microbenchmark(
x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)),
y <- do.call(cbind, test)
)
# Unit: milliseconds
# expr min lq
# x <- `attr<-`(do.call(c, test), "dim", c(nr, nc * n)) 4.435943 4.699006
# y <- do.call(cbind, test) 19.339477 19.567063
# mean median uq max neval cld
# 12.76214 5.209938 9.095001 379.77856 100 a
# 21.64878 20.000279 24.210848 26.02499 100 b
identical(x, y)
# [1] TRUE
If you have a varying number of columns you can probably still do this with some care in computing the total number of columns, as sketched below.
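A minimal sketch of that idea (my addition, still assuming every matrix has nr rows):
total_nc <- sum(vapply(test, ncol, integer(1)))  # total number of columns
z <- `attr<-`(do.call(c, test), "dim", c(nr, total_nc))
identical(z, do.call(cbind, test))
# [1] TRUE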
Let's say I have a vector x that:
is very large ( > 200 000 )
is integer
is sorted
all of its values are unique
I would like to check whether an integer value y is in this vector, and if it is, I would like to get its index. I would like to take advantage of the fact that the vector is sorted, so this can be done fast.
How would I accomplish such thing?
Here's some data
set.seed(123)
x = sort(unique(floor(runif(1e6, 1, 1e7))))
y = sample(1e7, 10000)
And a couple of approaches
f0 = function(y, vec) y %in% vec
f1 = function(y, vec) vec[findInterval(y, vec)] == y
%in% does a full scan; findInterval() does a binary search (I think). They generate the same result on this data
> identical(f0(y, x), f1(y, x))
[1] TRUE
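One caveat (my addition): findInterval() returns 0 for query values below min(vec), and vec[0] silently drops entries, so f1's result can be shorter than f0's on other data. A padded variant avoids this:
f1s = function(y, vec) {
    i = findInterval(y, vec)
    i != 0L & vec[pmax(i, 1L)] == y   # FALSE (rather than dropped) when i == 0
}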
And they have approximately similar amortized performance:
> library(microbenchmark)
> microbenchmark(f0(y, x), f1(y, x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y, x) 99.35425 100.87319 102.32160 102.20107 103.67718 105.70854 10
f1(y, x) 94.83219 95.05068 95.93625 95.77822 96.72601 97.50961 10
But findInterval() is, I think, faster for small queries:
> microbenchmark(f0(y[1:10], x), f1(y[1:10], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(y[1:10], x) 83.441578 85.116818 86.264751 86.07515 87.13516 89.430801 10
f1(y[1:10], x) 7.731606 7.734207 7.757201 7.75199 7.77210 7.810957 10
David suggests (I think)
f2 = function(x, vec) vec[which.max(x == vec)] == x
which.max() is only good for scalar y, which is seldom (saying this for the benefit of the OP) a good use of R. It appears less performant than findInterval():
> microbenchmark(f1(x[1000], x), f2(x[1000], x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f1(x[1000], x) 7.707420 7.709047 7.714576 7.711979 7.718953 7.729688 10
f2(x[1000], x) 9.353225 9.358874 9.381781 9.378680 9.400808 9.426102 10
Contrary to #Laterow, I don't see any particular performance difference between which() and which.max() (in current R-devel or R-3-2-branch). Also, the results aren't the same, so it's an apples-to-oranges comparison; see the example after the timings. I have a vague recollection of an R-devel conversation about this in the last 6 months...
> set.seed(123) ; x <- sample(2e5, replace = TRUE)
> microbenchmark(which.max(x == 1e7), which(x == 1e7)[1])
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 which.max(x == 1e+07) 4.240606 4.266470 5.975966 5.015947 5.217903 43.78467   100
  which(x == 1e+07)[1] 4.060040 4.132667 5.550078 4.986287 5.059128 43.88074   100
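To illustrate the "results aren't the same" point (my example): when there is no match, which.max() still returns an index, while which(...)[1] returns NA:
which.max(c(FALSE, FALSE))
# [1] 1
which(c(FALSE, FALSE))[1]
# [1] NA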
Performance of which versus which.max might have changed with this commit, where previously which.max() would coerce logical to numeric vectors before the scan, triggering a copy.