If, for argument's sake, I want the last five elements of a length-10 vector in Python, I can use negative indexing in the slice like so:
>>> x = range(10)
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x[-5:]
[5, 6, 7, 8, 9]
>>>
What is the best way to do this in R? Is there a cleaner way than my current technique, which is to use the length() function?
> x <- 0:9
> x
[1] 0 1 2 3 4 5 6 7 8 9
> x[(length(x) - 4):length(x)]
[1] 5 6 7 8 9
>
The question is related to time series analysis btw where it is often useful to work only on recent data.
see ?tail and ?head for some convenient functions:
> x <- 1:10
> tail(x,5)
[1] 6 7 8 9 10
For argument's sake: everything but the last five elements would be:
> head(x,n=-5)
[1] 1 2 3 4 5
As #Martin Morgan says in the comments, there are two other possibilities which are faster than the tail solution, in case you have to carry this out a million times on a vector of 100 million values. For readability, I'd go with tail.
test elapsed relative
tail(x, 5) 38.70 5.724852
x[length(x) - (4:0)] 6.76 1.000000
x[seq.int(to = length(x), length.out = 5)] 7.53 1.113905
Benchmarking code:
require(rbenchmark)
x <- 1:1e8
do.call(
  benchmark,
  c(list(
      expression(tail(x, 5)),
      expression(x[seq.int(to = length(x), length.out = 5)]),
      expression(x[length(x) - (4:0)])
    ),
    replications = 1e6)
)
The disapproval of tail here based on speed alone doesn't really acknowledge that part of the slower speed comes from the fact that tail is safer to work with, when you don't know for sure that the length of x will exceed n, the number of elements you want to subset out:
x <- 1:10
tail(x, 20)
# [1] 1 2 3 4 5 6 7 8 9 10
x[length(x) - (0:19)]
#Error in x[length(x) - (0:19)] :
# only 0's may be mixed with negative subscripts
tail will simply return as many elements as are available instead of generating an error, so you don't need to do any error checking yourself. That is a great reason to use it: safer, cleaner code, if the extra microseconds/milliseconds don't matter much to you.
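If you want the same guard without tail, one way (a minimal sketch, not from the answers above; last_n is a made-up name) is to clamp n to the vector length before indexing:
last_n <- function(x, n) {
  n <- min(n, length(x))                      # clamp n so the indices stay in range
  x[seq.int(to = length(x), length.out = n)]  # last n positions of x
}
last_n(1:10, 5)   # [1]  6  7  8  9 10
last_n(1:10, 20)  # [1]  1  2  3  4  5  6  7  8  9 10, like tail(1:10, 20)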
You can do exactly the same thing in R with two more characters:
x <- 0:9
x[-5:-1]
[1] 5 6 7 8 9
or
x[-(1:5)]
How about rev(x)[1:5]?
x<-1:10
system.time(replicate(10e6,tail(x,5)))
user system elapsed
138.85 0.26 139.28
system.time(replicate(10e6,rev(x)[1:5]))
user system elapsed
61.97 0.25 62.23
Here is a function to do it, and it seems reasonably fast.
endv <- function(vec, val)
{
  if(val > length(vec))
  {
    stop("Length of value greater than length of vector")
  } else
  {
    vec[((length(vec) - val) + 1):length(vec)]
  }
}
USAGE:
test<-c(0,1,1,0,0,1,1,NA,1,1)
endv(test,5)
endv(LETTERS,5)
BENCHMARK:
test replications elapsed relative
1 expression(tail(x, 5)) 100000 5.24 6.469
2 expression(x[seq.int(to = length(x), length.out = 5)]) 100000 0.98 1.210
3 expression(x[length(x) - (4:0)]) 100000 0.81 1.000
4 expression(endv(x, 5)) 100000 1.37 1.691
I will just add something related here. I wanted to access a vector by indices counted from the end, i.e. to write something like tail(x, i) but returning x[length(x) - i + 1] rather than the whole tail.
Following the comments, I benchmarked two solutions:
accessRevTail <- function(x, n) {
  tail(x, n)[1]
}
accessRevLen <- function(x, n) {
  x[length(x) - n + 1]
}
microbenchmark::microbenchmark(accessRevLen(1:100, 87), accessRevTail(1:100, 87))
Unit: microseconds
expr min lq mean median uq max neval
accessRevLen(1:100, 87) 1.860 2.3775 2.84976 2.803 3.2740 6.755 100
accessRevTail(1:100, 87) 22.214 23.5295 28.54027 25.112 28.4705 110.833 100
So it appears that in this case, even for small vectors, tail is very slow compared to direct access.
Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?
When the number of resulting pairs (number of columns) using combn is too large, I get the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
This post states that the size limit of a matrix is roughly one billion rows and two columns.
Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array Greedy Algorithm problem in HackerRank and R outputs are only counted as correct if they're given using cat:
minimumAbsoluteDifference <- function(arr) {
  combos <- combn(arr, 2)
  cat(min(abs(combos[1, ] - combos[2, ])))
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
# This fails
inputFail <- rpois(10e4, 1)
minimumAbsoluteDifference(inputFail)
#Error in matrix(r, nrow = len.r, ncol = count) :
# invalid 'ncol' value (too large or NA)
TL;DR
No need for combn or the like, simply:
min(abs(diff(sort(v))))
The Nitty Gritty
Finding the difference between every possible combination is O(n^2). So when we get to vectors of length 1e5, the task becomes burdensome both computationally and memory-wise.
We need a different approach.
How about sorting and taking the difference only with its neighbor?
After sorting, for any element v_j the smallest difference involving v_j is min(|v_j - v_(j-1)|, |v_j - v_(j+1)|), i.e. it comes from one of its immediate neighbors. For example, given the sorted vector v:
v = -9 -8 -6 -4 -2 3 8
The candidate distances involving -2 are those to its two neighbors:
|-2 - 3| = 5
|-4 - (-2)| = 2
so the smallest difference involving -2 is 2, and there is no need to check any other elements.
This is easily implemented in base R as follows:
getAbsMin <- function(v) min(abs(diff(sort(v))))
I'm not going to use rpois, as with any reasonably sized vector duplicates will be produced, which will trivially give 0 as an answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by #RuiBarradas):
set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
  runif(1e3, -100, 100)
})
randInts100 <- lapply(1:100, function(x) {
  sample(-(1e9):(1e9), 1e3)
})
head(sapply(randInts100, getAbsMin))
[1] 586 3860 2243 2511 5186 3047
identical(sapply(randInts100, minimumAbsoluteDifference2),
sapply(randInts100, getAbsMin))
[1] TRUE
options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226
identical(sapply(randUnif100, minimumAbsoluteDifference2),
sapply(randUnif100, getAbsMin))
[1] TRUE
It's very fast as well:
library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
b = minimumAbsoluteDifference2(randInts100[[50]]),
times = 25, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
a 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 25
b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468 25
Even for very large vectors, the result is instantaneous:
set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)
system.time(print(getAbsMin(largeTest)))
[1] 3
user system elapsed
0.083 0.003 0.087
Something like this?
minimumAbsoluteDifference2 <- function(x){
  stopifnot(length(x) >= 2)
  n <- length(x)
  inx <- rep(TRUE, n)
  m <- NULL
  for(i in seq_along(x)[-n]){
    inx[i] <- FALSE
    curr <- abs(x[i] - x[which(inx)])
    m <- min(c(m, curr))
  }
  m
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
minimumAbsoluteDifference2(input0) #returns 3
set.seed(2020)
input1 <- rpois(1e3, 1)
minimumAbsoluteDifference(input1) #returns 0
minimumAbsoluteDifference2(input1) #returns 0
inputFail <- rpois(1e5, 1)
minimumAbsoluteDifference(inputFail) # This fails
minimumAbsoluteDifference2(inputFail) # This does not fail
I've looked this over and I can't quite understand why this is giving me NA values appended onto the vector I want. Prompt below:
"The function should return a vector where the first element is the sum of the first n elements of the input vector, and the rest of the vector is a copy of the other elements of the input vector. For example, if the input vector is (2, 3, 6, 7, 8) and n = 2, then the output should be the vector (5, 6, 7, 8)"
testA<- c(1,2,3,4,5)
myFunction <- function(vector1, n)
{
  sum1 = 0
  for(i in 1:n)
  {
    sum1 <- sum1 + vector1[i]
    newVector <- c(sum1, vector1[n+1:length(vector1)])
  }
  return(newVector)
}
print(newVector)
myFunction(testA, 3)
Output is: [1] 6 4 5 NA NA NA when it should just be 6 4 5
There is no need for a for loop here; you can do something like this
test <- c(2, 3, 6, 7, 8)
myfunction <- function(x, n) c(sum(x[1:n]), x[-(1:n)])
myfunction(test, 2)
#[1] 5 6 7 8
testA <- c(1,2,3,4,5)
myfunction(testA, 3)
#[1] 6 4 5
Explanation: sum(x[1:n]) calculates the sum of the first n elements of x and x[-(1:n)] returns x with the first n elements removed.
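As a quick illustration of those two pieces (a small sketch of the indexing behaviour):
x <- c(2, 3, 6, 7, 8)
x[1:2]       # first two elements: 2 3
sum(x[1:2])  # 5
x[-(1:2)]    # drop the first two elements: 6 7 8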
It can be done with head and tail
n <- 2
c(sum(head(test, 2)), tail(test, -2))
#[1] 5 6 7 8
data
test <- c(2, 3, 6, 7, 8)
Here I compare the efficiency of the two functions above: the one from the answer posted at https://stackoverflow.com/a/52472214/3806250 and the one from the question post.
> testA <- 1:5
> myFunction <- function(vector1, n) {
+ sum1 <- 0
+ for(i in 1:n) {
+ sum1 <- sum1 + vector1[i]
+ newVector <- c(sum1, vector1[n+1:length(vector1)])
+ }
+ newVector <- newVector[!is.na(newVector)]
+ return(newVector)
+ }
>
> microbenchmark::microbenchmark(myFunction(testA, 3))
Unit: microseconds
expr min lq mean median uq max neval
myFunction(testA, 3) 3.592 4.1055 77.37798 4.106 4.619 7292.85 100
>
> myfunction <- function(x, n) c(sum(x[1:n]), x[-(1:n)])
>
> microbenchmark::microbenchmark(myfunction(testA, 2))
Unit: microseconds
expr min lq mean median uq max neval
myfunction(testA, 2) 1.539 1.54 47.04373 2.053 2.053 4462.644 100
Thank you for everyone's answers! I was really tired last night and couldn't come up with this simple solution:
function(vector1, n)
{
  sum1 = 0
  for(i in 1:n) # scan the input vector from the first element to element n
  {
    sum1 <- sum1 + vector1[i] # find the sum of the numbers scanned
    newVector <- c(sum1, vector1[n+1:length(vector1)]) # new output vector starting with the sum found, then the rest of the original vector after element n
    length(newVector) <- (length(newVector) - (n)) # NA values were returned, so the length needs to be adjusted with respect to n
  }
  return(newVector)
  print(newVector)
}
There are already great solutions, but here is another option which makes minimal modifications to your original code:
testA <- c(1,2,3,4,5)
myFunction <- function(vector1, n)
{
  sum1 = 0
  for(i in 1:n)
  {
    sum1 <- sum1 + vector1[i]
  }
  newVector <- c(sum1, vector1[(n+1):length(vector1)]) # we take this line out of the for loop
                                                       # and put the n+1 between parentheses
  return(newVector)
}
newVector <- myFunction(testA, 3)
print(newVector)
The problem with the original code/example was that n+1:length(vector1) was supposed to return [1] 4 5, in order to do the appropriate subsetting (obtaining the last elements of the vector which weren't included in the sum of the first n elements), but it actually returns [1] 4 5 6 7 8. Since there are no elements at positions 6:8 in testA, this is why missing values/NAs appear.
What n+1:length(vector1) actually does is first obtain the sequence 1:length(vector1) and then add n to each element. Here is an example of this behaviour using values:
3+1:5
#> [1] 4 5 6 7 8
We can solve this by putting n+1 between parenthesis on the original code. In our example using values:
(3+1):5
#> [1] 4 5
Also, taking the assignment of newVector out of the loop improves performance, because the concatenation of sum1 with the subsetted vector only needs to happen once, after the sum of the first n elements is completed.
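To check that claim, one could time the two variants directly (a sketch, assuming the microbenchmark package is installed; the names sumFirstInLoop and sumFirstOutLoop are made up for the comparison):
library(microbenchmark)
# concatenation repeated on every iteration (as in the original question)
sumFirstInLoop <- function(vector1, n) {
  sum1 <- 0
  for(i in 1:n) {
    sum1 <- sum1 + vector1[i]
    newVector <- c(sum1, vector1[(n+1):length(vector1)])
  }
  newVector
}
# concatenation done once, after the loop (as in this answer)
sumFirstOutLoop <- function(vector1, n) {
  sum1 <- 0
  for(i in 1:n) sum1 <- sum1 + vector1[i]
  c(sum1, vector1[(n+1):length(vector1)])
}
microbenchmark(sumFirstInLoop(1:1000, 500), sumFirstOutLoop(1:1000, 500))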
My aim is to randomly generate a vector of integers using R, populated by numbers between 1 and 8. However, I want to keep growing the vector until all the numbers from 1 to 8 are represented at least once, e.g. 1,4,6,2,2,3,5,1,4,7,6,8.
I am able to generate single numbers or a sequence of numbers using sample
x=sample(1:8,1, replace=T)
>x
[1] 6
I have played around with the repeat function to see how it might work with sample and I can at least get the generation to stop when one specific number occurs, e.g.
repeat {
  print(x)
  x = sample(1:8, 1, replace=T)
  if (x == 3){
    break
  }
}
Which gives:
[1] 3
[1] 6
[1] 6
[1] 6
[1] 6
[1] 6
[1] 2
I am struggling now to work out how to stop number generation once all numbers between 1:8 are present. Additionally, I know that the above code is only printing the sequence as it is generated and not storing it as a vector. Any advice pointing me in the right direction would be really appreciated!
This is fine for 1:8 but might not always be a good idea.
foo = integer(0)
set.seed(42)
while(TRUE){
  foo = c(foo, sample(1:8, 1))
  if(all(1:8 %in% foo)) break
}
foo
# [1] 8 8 3 7 6 5 6 2 6 6 4 6 8 3 4 8 8 1
If you have more than 1:8, it may be better to obtain the expected number of draws (N) required to get all the numbers at least once (the coupon-collector expectation, length(vec) * (1/1 + 1/2 + ... + 1/length(vec))) and then sample N numbers such that all numbers are sampled at least once.
set.seed(42)
vec = 1:8
N = ceiling(sum(length(vec)/(1:length(vec))))
foo = sample(c(vec, sample(vec, N - length(vec), TRUE)))
foo
# [1] 3 6 8 3 8 8 6 4 5 6 1 6 4 6 6 3 5 7 2 2 7 8
Taking a cue from d.b, here's a slightly more verbose method that is more memory-efficient (and a little faster too, though I doubt speed is your issue):
Differences:
- pre-allocate memory in chunks (size 100 here), which mitigates the problem with extend-by-one vector work; allocating and extending 100 (or even 1000) at a time is much lower cost
- compare only the newest number instead of all numbers each time (the first n-1 numbers have already been tabulated, no need to do that again)
Code:
microbenchmark(
  r2evans = {
    emptyvec100 <- integer(100)
    counter <- 0
    out <- integer(0)
    unseen <- seq_len(n)
    set.seed(42)
    repeat {
      if (counter %% 100 == 0) out <- c(out, emptyvec100)
      counter <- counter + 1
      num <- sample(n, size = 1)
      unseen <- unseen[unseen != num]
      out[counter] <- num
      if (!length(unseen)) break
    }
    out <- out[1:counter]
  },
  d.b = {
    foo = integer(0)
    set.seed(42)
    while(TRUE){
      foo = c(foo, sample(1:n, 1))
      if(all(1:n %in% foo)) break
    }
  }, times = 100, unit = 'us')
# Unit: microseconds
# expr min lq mean median uq max neval
# r2evans 1090.007 1184.639 1411.531 1228.947 1320.845 11344.24 1000
# d.b 1242.440 1372.264 1835.974 1441.916 1597.267 14592.74 1000
(This is intended neither as code-golf nor speed-optimization. My primary goal is to argue against extend-by-one vector work, and suggest a more efficient comparison technique.)
As d.b further suggested, this works fine for 1:8 but may run into trouble as n grows larger.
(Edit: with d.b's code changes, the execution times are much closer, and not nearly as exponential looking. Apparently the removal of unique had significant benefits to his code.)
I want to check if a list (or a vector, equivalently) is contained in another one, not whether it is a subset thereof. Let us assume we have
r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)
The function should behave as follows:
is.contained(r,t)
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former
# contains two 1 whereas the latter only one.
is.contained(s,t)
[1] TRUE
The operator %in% checks element membership while ignoring multiplicity, hence it would return TRUE in both cases, as would all or any. I am sure there is a one-liner but I just do not see it.
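A quick demonstration of that behaviour with the vectors above (a small sketch to illustrate the point):
all(r %in% t)  # TRUE, even though t contains only one 1
all(s %in% t)  # TRUE, and this one really is contained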
How about using a loop? I iterate over the first vector and check whether each element is present in the second vector. If it is there, I remove it from the second vector, and the process continues.
is.contained = function(vec1, vec2){
  x = vector(length = length(vec1))
  for (i in 1:length(vec1)) {
    x[i] = vec1[i] %in% vec2
    if(length(which(vec1[i] %in% vec2)) == 0) vec2 else
      vec2 = vec2[-match(vec1[i], vec2)]
  }
  y = all(x == TRUE)
  return(y)
}
The set functions (e.g. intersect, union, etc.) from base R give results consistent with set theory. Sets technically don't have repeating elements, so the vectors c(1,1,2) and c(1,2) are considered the same when it comes to sets (see Set (Mathematics)). This is the main problem this question faces and is why some of the solutions posted here fail (including my previous attempts). The solution to the OP's question lies somewhere between understanding sets and sequences. Although sequences allow repetition, order matters in a sequence, and here we don't care about order (order doesn't matter in sets either).
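To illustrate how the set operations discard multiplicity (a short sketch, not part of the original answer):
intersect(c(1, 1, 2), c(1, 2))  # [1] 1 2  -- the repeated 1 is collapsed
c(1, 1) %in% c(1, 2)            # [1] TRUE TRUE -- multiplicity is ignored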
Below, I have provided a vector intersect function (VectorIntersect) that returns all of the common elements between two vectors regardless of order or presence of duplicates. Also provided is a containment function called is.contained, which calls VectorIntersect, that will determine if all of the elements in one vector are present in another vector.
VectorIntersect <- function(v, z) {
  unlist(lapply(unique(v[v %in% z]), function(x) rep(x, min(sum(v==x), sum(z==x)))))
}
is.contained <- function(v, z) {length(VectorIntersect(v, z)) == length(v)}
Let's look at a simple example:
r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"
VectorIntersect(r, s)
[1] 1 1
is.contained(r, s) ## r is contained in s
[1] TRUE
is.contained(s, r) ## s is not contained in r
[1] FALSE
is.contained(s, s) ## s is contained in itself.. more on this later
[1] TRUE
Now, let's look at #Gennaro's clever recursive approach which gives correct results (Many apologies and also many Kudos... on earlier tests, I was under the impression that it was checking to see if b was contained in s and not the other way around):
fun.contains(s, r) ## s contains r
[1] TRUE
fun.contains(r, s) ## r does not contain s
[1] FALSE
fun.contains(s, s) ## s contains s
[1] TRUE
We will now step through the other set-based algorithms and see how they handle r and s above. I have added print statements to each function for clarity. First, #Jilber's function.
is.containedJilber <- function(x, y){
  temp <- intersect(x, y)
  print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
  out <- ifelse(length(x) == length(temp), all.equal(x, temp), FALSE)
  return(out)
}
is.containedJilber(r, s) ## should return TRUE but does not
[1] "1" ## result of intersect
[1] 2 ## length of r
[1] 1 ## length of temp
## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character" "Lengths: 2, 1" "target is numeric, current is character"
[1] FALSE ## results from the fact that 2 does not equal 1
is.containedJilber(s, s) ## should return TRUE but does not
[1] "1" "a" ## result of intersect
[1] 10 ## length of s
[1] 2 ## length of temp
## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch"
[1] FALSE ## results from the fact that 10 does not equal 2
Here is #Simon's:
is.containedSimon <- function(x, y) {
  print(setdiff(x, y))
  z <- x[x %in% setdiff(x, y)]
  print(z); print(length(x)); print(length(y)); print(length(z))
  length(z) == length(x) - length(y)
}
is.containedSimon(s, r) ## should return TRUE but does not
[1] "a" ## result of setdiff
[1] "a" "a" "a" "a" "a" ## the elements in s that match the result of setdiff
[1] 10 ## length of s
[1] 2 ## length of r
[1] 5 ## length of z
[1] FALSE ## result of 5 not being equal to 10 - 2
Hopefully this illustrates the pitfalls of applying strict set operations in this setting.
Let's test for efficiency and equality. Below, we build many test vectors and check to see if they are contained in either the vector testContainsNum (if it's a number vector) or testContainsAlpha (if it is a character vector):
set.seed(123)
testContainsNum <- sample(20:40, 145, replace=TRUE) ## generate large vector with random numbers
testContainsAlpha <- sample(letters, 175, replace=TRUE) ## generate large vector with random letters
tVec <- lapply(1:1000, function(x) {  ## generating test data..
  if (x %% 2 == 0) {
    sample(20:40, sample(50:100, 1), replace=TRUE)   ## even indices will contain numbers
  } else {
    sample(letters, sample(50:90, 1), replace=TRUE)  ## odd indices will contain characters
  }
})
tContains <- lapply(1:1000, function(x) if (x %% 2 == 0) {testContainsNum} else {testContainsAlpha})
## First check equality
tJoe <- mapply(is.contained, tVec, tContains)
tGennaro <- mapply(fun.contains, tContains, tVec)
tSimon <- mapply(is.containedSimon, tContains, tVec)
tJilber <- mapply(is.containedJilber, tVec, tContains)
all(tJoe==tGennaro) ## Give same results
[1] TRUE
## Both Jilber's and Simon's solution don't return any TRUE values
any(tJilber)
[1] FALSE
any(tSimon)
[1] FALSE
## There should be 170 TRUEs
sum(tJoe)
[1] 170
Let's take a closer look to determine if is.contained and fun.contains are behaving correctly.
table(tVec[[3]])
a b c e f g h i j k l m n o p q r t u v w x y z
3 4 5 2 2 1 5 3 5 3 2 1 7 3 1 2 4 3 5 5 2 4 3 3
table(tContains[[3]])
a b c d e f g h i j k l m n o p q r s t u v w x y z
4 11 4 3 7 8 13 4 4 9 13 3 10 7 7 4 8 7 8 6 7 5 9 4 4 6
## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])
[1] FALSE FALSE ## This is correct!!!!
table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 4 7 6 3 4 6 3 5 4 4 6 4 4 2 2 5 3 1 4
table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 10 7 6 4 10 6 8 10 5 5 6 9 8 5 7 5 11 4 9
## Note above that every element in tVec[[14]] is present in
## tContains[[14]] and the number of occurences is less than or
## equal to the occurences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE ## This is correct!!!!
Here are the benchmarks:
library(microbenchmark)
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806 100 a
Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464 100 b
Side Note about VectorIntersect()
After spending a good bit of time with this problem, it became increasingly clear that separating VectorIntersect from is.contained is tremendously valuable. I know that in my own work, the need to obtain the intersection without duplicates being removed has surfaced many times. Often the method implemented was messy and probably not reliable (easy to see why after this!). This is why VectorIntersect is a great utility function in addition to is.contained.
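For example (a small sketch using VectorIntersect as defined above):
VectorIntersect(c(1, 1, 2, 3, 3), c(1, 1, 1, 3))
# [1] 1 1 3   -- multiplicities are kept (the smaller count from either vector)
intersect(c(1, 1, 2, 3, 3), c(1, 1, 1, 3))
# [1] 1 3     -- base R intersect drops duplicates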
Update
Actually #Gennaro's solution can be improved quite a bit by calculating s[duplicated(s)] only one time as opposed to 3 times (similarly for b and length(s), we only calculate them once vs 2 times).
fun.containsFAST <- function(b, s){
  dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)]
  lenS <- length(dupS)
  all(s %in% b) && lenS <= length(dupB) &&
    (if(lenS > 0) fun.containsFAST(dupB, dupS) else 1)
}
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
GenFAST = mapply(fun.containsFAST, tContains, tVec),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185 100 b
GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561 100 a
Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249 100 c
Update 2
What about testing containment for really big vectors? The function I provided is not likely to perform well, as building the "intersection" (with duplicates etc.) by essentially looping over the true set intersection isn't very efficient. The modified version of #Gennaro's function won't be fast either, because for very large vectors with many duplicates, the function calls could get nested pretty deep. With this in mind, I built yet another containment function that is specifically built for dealing with large vectors. It utilizes vectorized base R functions, especially pmin.int, which returns the parallel minima of multiple vectors. The inner function myL is taken directly from the guts of the rle function in base R (although slightly modified for this specific use).
is.containedBIG <- function(v, z) {  ## v and z must be sorted
  myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
  sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v]))) == length(v)
}
Note that on smaller examples is.contained and fun.containsFAST are faster (this is mostly due to the time it takes to repeatedly sort; as you will see, if the data is sorted is.containedBIG is much faster). Observe (for thoroughness we will also verify #Chirayu's function and test its efficiency):
## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe) ## #Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe) ## specialized algorithm returns correct results
[1] TRUE
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.6158 165.5861 176.3019 175.4786 180.1299 313.7974 100 a
JoeBIG 269.1460 282.9347 294.1568 289.0174 299.4687 445.5222 100 b ## about 2x as slow as GenFAST
GenFAST 140.8219 150.5530 156.2019 155.8306 162.0420 178.7837 100 a
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370 100 c ## about 8x as slow as GenFAST
Now, with sorted data, the results are quite astonishing. is.containedBIG shows a 3-fold improvement in speed, whereas the other functions return almost identical timings.
## with pre-sorted data
tVecSort <- lapply(tVec, sort)
tContainsSort <- lapply(tContains, sort)
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.74771 166.46926 173.45399 172.92374 177.09029 297.7758 100 c
JoeBIG 83.69259 87.35881 94.48476 92.07183 98.37235 221.6014 100 a ## now it's the fastest
GenFAST 139.19631 151.23654 159.18670 157.05911 162.85636 275.7158 100 b
Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805 100 d
For very large vectors, we have the following (only showing GenFAST and JoeBIG as the other functions will take too long):
set.seed(97)
randS <- sample(10^9, 8.5*10^5)
testBigNum <- sample(randS, 2*10^7, replace = TRUE)
tVecBigNum <- lapply(1:20, function(x) {
sample(randS, sample(1500000:2500000, 1), replace=TRUE)
})
system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum))))
user system elapsed
74.016 11.351 85.409
system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]])))
user system elapsed
662.875 54.238 720.433
sum(tJoeBigNum)
[1] 13
all(tJoeBigNum==tGennaroBigNum)
[1] TRUE
## pre-sorted data
testBigSort <- sort(testBigNum)
tVecBigSort <- lapply(tVecBigNum, sort)
system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort)))
user system elapsed
33.910 10.302 44.289
system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]])))
user system elapsed
196.546 54.923 258.079
identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum)
[1] TRUE
Regardless of whether your data is sorted, the point of this last test is to show that is.containedBIG is much faster on larger data. An interesting takeaway from this last test was the fact that fun.containsFAST showed a very large improvement in time when the data was sorted. I was under the impression that duplicated (which is the workhorse of fun.containsFAST) did not depend on whether a vector was sorted or not. Earlier tests confirmed this sentiment (the unsorted test timings were practically identical to the sorted test timings (see above)). More research is needed.
How about a recursive method checking for length of the duplicates for each list?
fun.contains <- function(b, s){
  all(s %in% b) && length(s[duplicated(s)]) <= length(b[duplicated(b)]) &&
    (if(length(s[duplicated(s)]) > 0) fun.contains(b[duplicated(b)], s[duplicated(s)]) else 1)
}
The idea is that a list is contained in another one if and only if the same holds for their respective lists of duplicates, unless there are no duplicates (in which case the recursion defaults to TRUE).
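For example, tracing the recursion on a small case (a sketch using fun.contains as defined above):
b <- c(1, 1, 2, 2, 2)
s <- c(1, 2, 2)
# level 1: all(s %in% b) is TRUE; the duplicates are s' = c(2) and b' = c(1, 2, 2)
# level 2: all(c(2) %in% c(1, 2, 2)) is TRUE and c(2) has no duplicates, so TRUE
fun.contains(b, s)
# [1] TRUE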
Another custom-function version, checking whether the number (length()) of elements of x that are not in y (via setdiff) equals the difference in the vectors' lengths:
# Does vector x contain vector y?
is.contained <- function(x, y) {
  z <- x[x %in% setdiff(x, y)]
  length(z) == length(x) - length(y)
}
r <- c(1,1)
s <- c(1,1,5)
t <- c(1,2,5)
is.contained(r, t)
#> [1] FALSE
is.contained(s, r)
#> [1] TRUE
is.contained(r, s)
#> [1] FALSE
I have the following sorted vector:
> v
[1] -1 0 1 2 4 5 2 3 4 5 7 8 5 6 7 8 10 11
How can I remove the -1, 0, and 11 entries without looping over the whole vector, either with a user loop or implicitly with a language keyword? That is, I want to trim the vector at each edge and only at each edge, such that the sorted sequence is within my min,max parameters 1 and 10. The solution should assume that the vector is sorted to avoid checking every element.
This kind of solution can come in handy in vectorized operations on very large vectors, when we want to use the items in the vector as indexes into another object. For one application, see this thread.
To include elements in a vector by index:
v[2:10]
to exclude certain elements:
v[-c(1, 11)]
to only include a certain range:
v <- v[v >= 1 & v <= 10]
If I'm allowed to assume that, like in your example, the number of elements to be trimmed << the number of elements in the vector, then I think I can beat the binary search:
> n<-1e8
> v<--3:(n+3)
>
> min <- 1
> max <- length(v)
>
> calcMin <- function(v, minVal){
+ while(v[min] < minVal){
+ min <- min + 1
+ }
+ min
+ }
>
> calcMax <- function(v, maxVal){
+ while(v[max] > maxVal){
+ max <- max - 1
+ }
+ max
+ }
>
> #Compute the min and max indices and create a sequence
> system.time(a <- v[calcMin(v, 1):calcMax(v,n)])
user system elapsed
1.030 0.269 1.298
>
> #do a binary search to find the elements (as suggested by #nograpes)
> system.time(b <- v[do.call(seq,as.list(findInterval(c(1,n),v)))])
user system elapsed
2.208 0.631 2.842
>
> #use negative indexing to remove elements
> system.time(c <- v[-c(1:(calcMin(v, 1)-1), (calcMax(v,n)+1):length(v))])
user system elapsed
1.449 0.256 1.704
>
> #use head and tail to trim the vector
> system.time(d <- tail(head(v, n=(calcMax(v,n)-length(v))), n=-calcMin(v, 1)+1))
user system elapsed
2.994 0.877 3.871
>
> identical(a, b)
[1] TRUE
> identical(a, c)
[1] TRUE
> identical(a, d)
[1] TRUE
There are many ways to do it; here are some:
> v <- -1:11 # creating your vector
> v[v %in% 1:10]
[1] 1 2 3 4 5 6 7 8 9 10
> setdiff(v, c(-1,0,11))
[1] 1 2 3 4 5 6 7 8 9 10
> intersect(v, 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
Two more options, not so elegant.
> na.omit(match(v, 1:10))
> na.exclude(match(v, 1:10))
All of the previous solutions implicitly check every element of the vector. As #Robert Kubrick points out, this does not take advantage of the fact that the vector is already sorted.
To take advantage of the sorted nature of the vector, you can use binary search (through findInterval) to find the start and end indexes without looking at every element:
n<-1e9
v<--3:(n+3)
system.time(a <- v[v >= 1 & v <= n]) # 68 s
system.time(b <- v[do.call(seq,as.list(findInterval(c(1,n),v)))]) # 15s
identical(a,b) # TRUE
It is a little clumsy, and there is some discussion that the binary search in findInterval may not be entirely efficient, but the general idea is there.
As was pointed out in the comments, the above only works when the index is in the vector. Here is a function that I think will work:
in.range <- function(x, lo = -Inf, hi = +Inf) {
  lo.idx <- findInterval(lo, x, all.inside = TRUE)
  hi.idx <- findInterval(hi, x)
  lo.idx <- lo.idx + (x[lo.idx] < lo)  # advance one position when x[lo.idx] is still below lo
  x[seq(lo.idx, hi.idx)]
}
system.time(b <- in.range(v, 1, n)) # 15s
You can also use %in%:
vv <- c(-1, 0 ,1 ,2 ,4 ,5, 2 ,3 ,4, 5, 7 ,8, 5, 6, 7, 8, 10, 11)
vv[vv %in% 1:10]
[1] 1 2 4 5 2 3 4 5 7 8 5 6 7 8 10