R offers max and min, but I do not see a really fast way to find other values in the order, apart from sorting the whole vector and then picking a value from it.
Is there a faster way to get, say, the second highest value?
Use the partial argument of sort(). For the second highest value:
n <- length(x)
sort(x,partial=n-1)[n-1]
A slightly slower alternative, just for the record:
x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )
Rfast has a function called nth that does exactly what you ask.
Further, the methods discussed above that are based on partial sort don't support finding the k smallest values.
Update (28/FEB/21): the package kit offers a faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755, https://stackoverflow.com/a/53146559/4729755
Disclaimer: An issue appears to occur when dealing with integers, which can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = T)
will return the 5th largest element of x, while
Rfast::nth(x, 5, descending = F)
will return the 5th smallest element of x.
Benchmarks below against the most popular answers.
For 10 thousand numbers:
N = 10000
x = rnorm(N)
maxN <- function(x, N=2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x,5,descending = T),
  maxN = maxN(x,5),
  order = x[order(x, decreasing = T)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N = 1e6
x = rnorm(N)
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:
maxN <- function(x, N=2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial=len-N+1)[len-N+1]
}
maxN(1:10)
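# [1] 9   (the second-largest value of 1:10)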
Here is an easy way to find the indices of the N smallest/largest values in a vector (example for N = 3):
N <- 3
N Smallest:
ndx <- order(x)[1:N]
N Largest:
ndx <- order(x, decreasing = T)[1:N]
So you can extract the values as:
x[ndx]
For the nth highest value:
sort(x, TRUE)[n]
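For example, on a small vector (just an illustration of the idea):
x <- c(12.45, 34, 4, 0, -234, 45.6, 4)
sort(x, TRUE)[2]  # second highest
# [1] 34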
Here you go... kit is the obvious winner!
N = 1e6
x = rnorm(N)
maxN <- function(x, N=2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873 100
# maxN 12.922118 13.124358 17.49628 18.977537 20.053139 28.928694 100
# order 50.443100 50.926975 52.54067 51.270163 52.323116 66.561606 100
# kit 1.177202 1.216371 1.29542 1.240228 1.297286 2.771715 100
Edit: I forgot that kit::topn has a hasna option... let's do another run.
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]],
kit2 = x[kit::topn(x, 5L,decreasing = T,hasna = F)[5L]],
unit = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105 100
# maxN 7.378960 7.527661 10.0747803 7.7119715 12.217756 67.409526 100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662 100
# kit 1.180698 1.217237 1.2975441 1.2429790 1.278243 3.263202 100
# kit2 0.842354 0.876329 0.9398055 0.9109095 0.944407 2.135903 100
Here is the simplest way I found:
num <- c(5665,1615,5154,65564,69895646)
num <- sort(num, decreasing = F)
tail(num, 1) # Highest number
head(tail(num, 2),1) # Second Highest number
head(tail(num, 3),1) # Third Highest number
head(tail(num, n),1) # General equation for finding the nth Highest number
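For instance, with n set to 2 the general form reproduces the second-highest value:
n <- 2
head(tail(num, n), 1)
# [1] 65564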
I found that removing the max element first and then doing another max runs at a comparable speed:
system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
user system elapsed
0.092 0.000 0.659
system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
user system elapsed
0.096 0.000 0.653
dplyr has the function nth, where the first argument is the vector and the second is the position you want. This works with repeated elements as well.
For example:
x = c(1,2, 8, 16, 17, 20, 1, 20)
Finding the second largest value:
nth(unique(x),length(unique(x))-1)
[1] 17
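nth also accepts negative positions that count from the end (per dplyr's documentation), which gives another way to ask for the second-largest unique value (illustration only):
nth(sort(unique(x)), -2)
# [1] 17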
When I was recently looking for an R function returning the indexes of the top N max/min numbers in a given vector, I was surprised that there is no such function.
And this is something very similar.
The brute-force solution using the base::sort function seems to be the easiest one.
topMaxUsingFullSort <- function(x, N) {
  sort(x, decreasing = TRUE)[1:min(N, length(x))]
}
But it is not the fastest one in case your N value is relatively small compared to the length of the vector x.
On the other hand, if N is really small, you can use base::which.max iteratively, and in each iteration you can replace the found value by -Inf.
# the input vector 'x' must not contain -Inf value
topMaxUsingWhichMax <- function(x, N) {
  vals <- c()
  for(i in 1:min(N, length(x))) {
    idx <- which.max(x)
    vals <- c(vals, x[idx]) # copy-on-modify (not an issue because vals is a relatively small vector)
    x[idx] <- -Inf          # copy-on-modify (this is the issue because the data vector could be huge)
  }
  vals
}
I believe you see the problem: the copy-on-modify nature of R. So this will perform better for very small N (1, 2, 3), but it will rapidly slow down for larger N values, and you are iterating over all elements of the vector x N times.
I think the best solution in clean R is to use partial base::sort.
topMaxUsingPartialSort <- function(x, N) {
  N <- min(N, length(x))
  x[x >= -sort(-x, partial=N)[N]][1:N]
}
Then you can select the last (Nth) item from the result of the functions defined above.
Note: the functions defined above are just examples; if you want to use them, you have to check/sanitize the inputs (e.g. N > length(x)).
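A quick illustration of the three helpers on a small vector (outputs shown as comments; note the partial-sort version returns the values in their original order):
x <- c(12.45, 34, 4, 0, -234, 45.6, 4)
topMaxUsingFullSort(x, 3)     # [1] 45.60 34.00 12.45
topMaxUsingWhichMax(x, 3)     # [1] 45.60 34.00 12.45
topMaxUsingPartialSort(x, 3)  # [1] 12.45 34.00 45.60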
I wrote a small article about something very similar (get indexes of top N max/min values of a vector) at http://palusga.cz/?p=18 - you can find here some benchmarks of similar functions I defined above.
head(sort(x),..) or tail(sort(x),...) should work
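For instance (illustration only):
x <- c(12.45, 34, 4, 0, -234, 45.6, 4)
tail(sort(x), 2)[1]  # second highest
# [1] 34
head(sort(x), 2)[2]  # second lowest
# [1] 0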
This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.
FindIndicesBottomTopN <- function(x=c(4,-2,5,-77,99), N=1, bottom=FALSE)
{
  k1 <- rank(x)
  if(bottom==TRUE){
    Nindex <- which(k1==N)
    Nindex <- Nindex[1]
  }
  if(bottom==FALSE){
    Nindex <- which(k1==(length(x)+1-N))
    Nindex <- Nindex[1]
  }
  return(Nindex)
}
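A couple of quick calls against the default vector, as an illustration:
FindIndicesBottomTopN(x = c(4, -2, 5, -77, 99), N = 2, bottom = FALSE)
# [1] 3   (position of 5, the 2nd-largest value)
FindIndicesBottomTopN(x = c(4, -2, 5, -77, 99), N = 2, bottom = TRUE)
# [1] 2   (position of -2, the 2nd-smallest value)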
topn = function(vector, n){
  maxs = c()
  ind = c()
  for (i in 1:n){
    biggest = match(max(vector), vector)
    ind[i] = biggest
    maxs[i] = max(vector)
    vector = vector[-biggest]
  }
  mat = cbind(maxs, ind)
  return(mat)
}
This function will return a matrix with the top n values and their indices.
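For example (illustration only; note that the returned indices refer to the vector as it shrinks after each removal, not to the original positions):
topn(c(10, 50, 30, 40), 2)
#      maxs ind
# [1,]   50   2
# [2,]   40   3   (40 sits at position 4 in the original vector)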
Hope it helps!
You can identify the running maximum with cummax(). If you want the location of each new higher value, for example, you can pass your vector of cummax() values to the diff() function to identify the locations at which the cummax() value changed. Say we have the vector
v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4 6 6 6 6 6 8 12 16
Now, if you want to find the location of a change in cummax() you have many options; I tend to use sign(diff(cummax(v))). You have to adjust for the lost first element because of diff(). The complete code for vector v would be:
which(sign(diff(cummax(v)))==1)+1
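For the vector v above this returns
[1] 2 7 8 9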
You can use the sort() function like this:
sort(unique(c), decreasing = TRUE)[1:N]
Example:
c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]
will give the 5 largest (unique) values.
I have a very large data set with categorical labels a and a vector b that contains all possible labels in the data set:
a <- c(1,1,3,2) # artificial data
b <- c(1,2,3,4) # fixed categories
Now I want to find for each observation in a the set of all remaining categories (that is, the elements of b excluding the given observation in a). From these remaining categories, I want to sample one at random.
My approach using a loop is
goal <- numeric() # container for results
for(i in 1:4){
  d <- setdiff(b, a[i]) # find the categories except the one observed in the data
  goal[i] <- sample(d,1) # sample one of the remaining categories randomly
}
goal
[1] 4 4 1 1
However, this has to be done a large number of times and applied to very large data sets. Does anyone have a more efficient version that leads to the desired result?
EDIT:
The function by akrun is unfortunately slower than the original loop. If anyone has a creative idea with a competitive result, I'm happy to hear it!
We can use vapply
vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1))
set.seed(24)
a <- sample(c(1:4), 10000, replace=TRUE)
b <- 1:4
system.time(vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)))
# user system elapsed
# 0.208 0.007 0.215
It turns out that resampling the labels that are equal to the labels in the data is an even faster approach, using
test = sample(b, length(a), replace=T)
resample = (a == test)
while(sum(resample) > 0){
  test[resample] = sample(b, sum(resample), replace=T)
  resample = (a == test)
}
Updated Benchmarks for N=10,000:
Unit: microseconds
expr min lq mean median uq max neval
loop 14337.492 14954.595 16172.2165 15227.010 15585.5960 24071.727 100
akrun 14899.000 15507.978 16271.2095 15736.985 16050.6690 24085.839 100
resample 87.242 102.423 113.4057 112.473 122.0955 174.056 100
shree(data = a, labels = b) 5195.128 5369.610 5472.4480 5454.499 5574.0285 5796.836 100
shree_mapply(data = a, labels = b) 1500.207 1622.516 1913.1614 1682.814 1754.0190 10449.271 100
Update: Here's a fast version with mapply. This method avoids calling sample() for every iteration, so it is a bit faster:
mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
Here's a version without setdiff (setdiff can be a bit slow), although I think even more optimization is possible:
vapply(a, function(x) sample(b[!b == x], 1), numeric(1))
Benchmarks:
set.seed(24)
a <- sample(c(1:4), 1000, replace=TRUE)
b <- 1:4
microbenchmark::microbenchmark(
akrun = vapply(a, function(x) sample(setdiff(b, x), 1), numeric(1)),
shree = vapply(a, function(x) sample(b[!b == x], 1), numeric(1)),
shree_mapply = mapply(function(x, y) b[!b == x][y], a, sample(length(b) - 1, length(a), replace = T))
)
Unit: milliseconds
expr min lq mean median uq max neval
akrun 28.7347 30.66955 38.319655 32.57875 37.45455 237.1690 100
shree 5.6271 6.05740 7.531964 6.47270 6.87375 45.9081 100
shree_mapply 1.8286 2.01215 2.628989 2.14900 2.54525 7.7700 100
Actually a really nice problem, for which I came up with a solution (see below) that is, however, not beautiful:
Assume you have a vector x and a matrix A which contains the start of an interval in the first column and the end of the interval in the second.
How can I get the elements of x which fall into the intervals given by A?
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
Expected output:
[1] 4 15
You could use the following information, if this would be helpful for increasing the performance:
Both the vector and the rows of the matrix are ordered, and the intervals don't overlap. All intervals have the same length. All numbers are integers, but they can be huge.
Now I did not want to be lazy and came up with the following solution, which is too slow for long vectors and matrices:
x <- c(4, 7, 15) # Define input vector
A <- cbind(c(3, 9, 14), c(5, 11, 16)) # Define matrix with intervals
b <- vector()
for (i in 1:nrow(A)) {
  b <- c(b, A[i, 1]:A[i, 2])
}
x[x %in% b]
I know that loops in R can be slow, but I did not know how to write the operation without one (maybe there is a way with apply).
We can use sapply to loop over each element of x and find if it lies in the range of any of those matrix values.
x[sapply(x, function(i) any(i > A[, 1] & i < A[,2]))]
#[1] 4 15
If length(x) and nrow(A) are the same, we don't even need the sapply loop and can use this comparison directly.
x[x > A[, 1] & x < A[,2]]
#[1] 4 15
Here is a method that does not use an explicit loop or an apply function. outer is sometimes much faster.
x[rowSums(outer(x, A[,1], `>=`) & outer(x, A[,2], `<=`)) > 0]
[1] 4 15
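Since the question mentions that the intervals are sorted and non-overlapping, here is an additional sketch (not from the original answers) that exploits this structure with findInterval; endpoints are treated as inclusive here, matching the questioner's a:b loop, and the helper name f4 is just made up for illustration:
f4 <- function(x, A) {
  breaks <- as.vector(t(A))        # interleaved starts and ends: 3 5 9 11 14 16
  pos <- findInterval(x, breaks)   # index of the last boundary <= x
  x[pos %% 2 == 1 | x %in% A[, 2]] # odd position => inside an interval; also keep right endpoints
}
f4(c(4, 7, 15), cbind(c(3, 9, 14), c(5, 11, 16)))
# [1]  4 15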
This answer is late, but today I had the same problem to solve and my answer may be helpful for future readers. My solution was the following:
f3 <- function(x, A) {
  Reduce(f = "|",
         x = lapply(1:NROW(A), function(k) x > A[k,1] & x < A[k,2]),
         init = logical(length(x)))
}
This function returns a logical vector of length(x) indicating whether the corresponding value in x can be found in the intervals or not. If I want to get the elements I simply have to write
x[f3(x,A)]
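# [1]  4 15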
I did some benchmarks and my function seems to work very well, also when testing with larger data.
Let's define the other solutions suggested here in this post:
f1 <- function(x, A) {
  sapply(x, function(i) any(i > A[, 1] & i < A[,2]))
}
f2 <- function(x, A) {
  rowSums(outer(x, A[,1], `>`) & outer(x, A[,2], `<`)) > 0
}
Now they are also returning a logical vector.
The benchmarks on my machine are following:
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A))
#Unit: microseconds
# expr min lq mean median uq max neval
#f1(x, A) 21.5 23.20 25.023 24.30 25.40 61.8 100
#f2(x, A) 18.8 21.20 23.606 22.75 23.70 75.4 100
#f3(x, A) 13.9 15.85 18.682 18.30 19.15 52.2 100
It seems like there is no big difference, but the following example will make it more obvious:
x <- seq(1,100,length.out = 1e6)
A <- cbind(20:70,(20:70)+0.5)
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A), times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#f1(x, A) 4176.172 4227.6709 4419.6010 4484.2946 4539.9668 4569.7412 10
#f2(x, A) 1418.498 1511.5647 1633.4659 1571.0249 1703.6651 1987.8895 10
#f3(x, A) 614.556 643.4138 704.3383 672.5385 770.7751 873.1291 10
That the functions all return the same result can be checked e.g. via:
all(f1(x,A)==f3(x,A))
I have an array a with some matrices in it. Now I need to efficiently check how many different matrices I have and what indices (in ascending order) they have in the array. My approach is the following: paste the columns of the matrices as character vectors and have a look at the frequency table, like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n),1),
c(2,2,n))
paste_a <- apply(a, c(3), paste, collapse=" ") #paste by column
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0,1,2,3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below are of the data found in the real data set from the link.)
Base4Approach <- function() {
  toBase4 <- sapply(1:dim(a)[3], function(x) {
    v <- as.vector(a[,,x])
    pows <- which(v > 0)
    coefs <- v[pows]
    sum(coefs*(4^pows))
  })
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
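As a quick sanity check of the encoding idea on a single flattened 2x2 matrix of digits 0-3 (illustration only):
v <- c(1, 0, 2, 3)
sum(v[v > 0] * 4^which(v > 0))  # 1*4^1 + 2*4^3 + 3*4^4
# [1] 900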
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
  ### commented this out for comparison reasons
  # dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
  a <- a[,,!duplicated(a, MARGIN = 3)]
  nr <- dim(a)[3]
  a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by @d.b doesn't really do the same thing as the previous two approaches (it simply identifies and doesn't remove duplicates).
DBApproach <- function() {
  a[, , 9] = a[, , 1]
  # Convert to list
  mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
  temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
  temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
  # The indices in 'a' where the matrices are the same
  temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of @alexis_laz
As mentioned in the comments by @alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
  toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
  myDupes <- which(duplicated(toBase4))
  a[,,-(myDupes)]
}
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
My first attempt was actually really slow. So here is a slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want, but here is a way you can extract the indices where the matrices are the same. More processing may be necessary to get what you want.
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
MARGIN = 1,
FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9
I have 3 vectors and I want to apply the which() function to each of them separately.
I'm trying to find the max index of values less than some given number.
How can I do this using vectorization?
My 3 vectors (they may have various lengths):
vec1 <- c(1,2,3,4,5)
vec2 <- c(11,12,13)
vec3 <- c(1,2,3,4,5,6,7,8)
How can I vectorize it?
max(which(vec1<3))
max(which(vec2<12.3))
max(which(vec3<5.7))
The expected result is:
2
2
5
One way to get a speedup would be to use Rcpp to search for elements smaller than your cutoff, starting from the right side of the vector and moving left. You can return as soon as you find the element that meets your criteria, which means that if your target is near the right side of the vector you might avoid looking at most of the vector's elements (meanwhile, which looks at all vector elements and max looks at all values returned by which). The speedup would be largest for long vectors where the target element is close to the end.
library(Rcpp)
rightmost.small <- cppFunction(
  'double rightmostSmall(NumericVector x, const double cutoff) {
    for (int i=x.size()-1; i >= 0; --i) {
      if (x[i] < cutoff) return i+1; // 1-index
    }
    return 0; // None found
  }')
rightmost.small(vec1, 3)
# [1] 2
rightmost.small(vec2, 12.3)
# [1] 2
rightmost.small(vec3, 5.7)
# [1] 5
Let's look at the performance for a vector where we expect this to give us a big speedup:
set.seed(144)
vec.large <- rnorm(1000000)
all.equal(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# [1] TRUE
library(microbenchmark)
microbenchmark(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# Unit: microseconds
# expr min lq mean median uq max neval
# max(which(vec.large < -1)) 4912.016 8097.290 12816.36406 9189.0685 9883.9775 60405.585 100
# rightmost.small(vec.large, -1) 1.643 2.476 8.54274 8.8915 12.8375 58.152 100
For this vector of length 1 million, we see a speedup of about 1000x using the Rcpp code.
This speedup should carry directly over to the case where you have many vectors stored in a list; you can use @JoshO'Brien's mapply code and observe a speedup when you switch from max(which(...)) to the Rcpp code:
f <- function(v,m) max(which(v < m))
l <- list(vec.large)[rep(1, 100)]
m <- rep(-1, 100)
microbenchmark(mapply(f, l, m), mapply(rightmost.small, l, m))
Unit: microseconds
expr min lq mean median uq max neval
mapply(f, l, m) 865287.828 907893.8505 931448.1555 918637.343 935632.0505 1133909.950 100
mapply(rightmost.small, l, m) 253.573 281.6855 344.5437 303.094 335.1675 712.897 100
We see a 3000x speedup by using the Rcpp code here.
l <- list(vec1,vec2,vec3)
m <- c(3, 12.3, 5.7)
f <- function(v,m) max(which(v < m))
mapply(f,l,m)
# [1] 2 2 5
Here's an example of what I mean; this code outputs the right thing:
list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
compare <- function(list.t, matrix.t) {
  pairs <- 0
  for (i in 1:nrow(matrix.t)) {
    for (j in 1:length(list.t)) {
      if (length(intersect(matrix.t[i,], list.t[[j]])) == 2) {
        pairs <- pairs + 1
      }
    }
  }
  return(pairs / nrow(matrix.t))
}
compare(list1, matrix1)
# = 0.33333
I hope that makes sense. I'm trying to take an nx2 matrix, and see if the two elements of each row of the matrix are also found in each section of the list. So, in the example above, the first row of the matrix is (1,2), and this pair is found in the first section of the list. The (1,5) or the (8,10) pairs are not found in any section of the list. So that's why I'm outputting 0.3333 (1/3).
I'm wondering if anyone knows a way that doesn't use two for-loops to compare each row to each section? I have larger matrices and lists, and so this is too slow.
Thank you for any help!
Wouldn't this work just the same? You could call sapply over the list and compare with all rows of the matrix simultaneously.
> list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
> matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
> s <- sapply(seq_along(list1), function(i){
length(intersect(list1[[i]], matrix1)) == 2
})
> sum(s)/nrow(matrix1)
# [1] 0.3333333
If we call your function f1(), and this sapply version of the same function f2(), we get the following difference in speed.
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
# expr min lq median uq max neval
# f1() 245.017 261.2240 268.843 281.7350 1265.706 100
# f2() 113.727 117.7045 125.478 135.6945 268.310 100
Hopefully that's the increase in efficiency you're looking for.
This is offered in the spirit of your R golf challenge for your problem, a compact but potentially inscrutable solution:
mean( apply(matrix1, 1,
function(x) any( {lapply(list1, function(z) {all(x %in% z) } )}) )
)
[1] 0.3333333
The inner lapply tests whether a particular element of list1 has both of the items in the two-element vector passed as a row from matrix1. Then the any function tests whether any of the 4 elements met the challenge for a particular row. The intermediate logical vector c(TRUE,FALSE,FALSE) is converted into a fraction by the mean. (It's still really two nested loops.)