Check if list contains another list in R

I want to check if a list (or a vector, equivalently) is contained in another one, not whether it is a subset thereof. Let us assume we have
r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)
The function should behave as follows:
is.contained(r,t)
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former
# contains two 1s whereas the latter contains only one.
is.contained(s,t)
[1] TRUE
The operator %in% checks for subsets, hence it would return TRUE in both cases, as would all or any. I am sure there is a one-liner but I just do not see it.
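For instance, a quick check on the vectors above shows why %in% alone is not enough here, since it ignores multiplicity:
all(r %in% t)  # TRUE, even though t contains only one 1
all(s %in% t)  # TRUE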

How about using a loop? I iterate over the first vector and check if each element is present in the second vector. If it is there, I remove it from the second vector, and the process continues.
is.contained = function(vec1, vec2){
  x = vector(length = length(vec1))
  for (i in 1:length(vec1)) {
    x[i] = vec1[i] %in% vec2
    if(length(which(vec1[i] %in% vec2)) == 0) vec2 else
      vec2 = vec2[-match(vec1[i], vec2)]
  }
  y = all(x == T)
  return(y)
}

The set functions (e.g. intersect, union, etc.) from base R give results consistent with set theory. Sets technically don't have repeating elements, thus the vectors c(1,1,2) and c(1,2) are considered the same when it comes to sets (see Set (Mathematics)). This is the main problem this question faces and is why some of the solutions posted here fail (including my previous attempts). The solution to the OP's question is found somewhere between understanding sets and sequences. Although sequences allow repetition, order matters, and here we don't care (order doesn't matter in sets).
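As a quick illustration of this point, the base set functions silently drop duplicates:
intersect(c(1, 1, 2), c(1, 2))  # [1] 1 2  -- the second 1 is gone
union(c(1, 1, 2), c(1, 2))      # [1] 1 2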
Below, I have provided a vector intersect function (VectorIntersect) that returns all of the common elements between two vectors regardless of order or presence of duplicates. Also provided is a containment function called is.contained, which calls VectorIntersect, that will determine if all of the elements in one vector are present in another vector.
VectorIntersect <- function(v, z) {
    unlist(lapply(unique(v[v %in% z]), function(x) rep(x, min(sum(v == x), sum(z == x)))))
}
is.contained <- function(v, z) {length(VectorIntersect(v, z)) == length(v)}
Let's look at a simple example:
r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"
VectorIntersect(r, s)
[1] 1 1
is.contained(r, s) ## r is contained in s
[1] TRUE
is.contained(s, r) ## s is not contained in r
[1] FALSE
is.contained(s, s) ## s is contained in itself.. more on this later
[1] TRUE
Now, let's look at @Gennaro's clever recursive approach, which gives correct results (many apologies and also many kudos... in earlier tests, I was under the impression that it was checking to see if b was contained in s and not the other way around):
fun.contains(s, r) ## s contains r
[1] TRUE
fun.contains(r, s) ## r does not contain s
[1] FALSE
fun.contains(s, s) ## s contains s
[1] TRUE
We will now step through the other set-based algorithms and see how they handle r and s above. I have added print statements to each function for clarity. First, @Jilber's function.
is.containedJilber <- function(x, y){
  temp <- intersect(x, y)
  print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
  out <- ifelse(length(x) == length(temp), all.equal(x, temp), FALSE)
  return(out)
}
is.containedJilber(r, s) ## should return TRUE but does not
[1] "1" ## result of intersect
[1] 2 ## length of r
[1] 1 ## length of temp
## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character" "Lengths: 2, 1" "target is numeric, current is character"
[1] FALSE ## results from the fact that 2 does not equal 1
is.containedJilber(s, s) ## should return TRUE but does not
[1] "1" "a" ## result of intersect
[1] 10 ## length of s
[1] 2 ## length of temp
## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch"
[1] FALSE ## results from the fact that 10 does not equal 2
Here is @Simon's:
is.containedSimon <- function(x, y) {
  print(setdiff(x, y))
  z <- x[x %in% setdiff(x, y)]
  print(z); print(length(x)); print(length(y)); print(length(z))
  length(z) == length(x) - length(y)
}
is.containedSimon(s, r) ## should return TRUE but does not
[1] "a" ## result of setdiff
[1] "a" "a" "a" "a" "a" ## the elements in s that match the result of setdiff
[1] 10 ## length of s
[1] 2 ## length of r
[1] 5 ## length of z
[1] FALSE ## result of 5 not being equal to 10 - 2
Hopefully this illustrates the pitfalls of applying strict set operations in this setting.
Let's test for efficiency and equality. Below, we build many test vectors and check to see if they are contained in either the vector testContainsNum (if it's a number vector) or testContainsAlpha (if it is a character vector):
set.seed(123)
testContainsNum <- sample(20:40, 145, replace=TRUE) ## generate large vector with random numbers
testContainsAlpha <- sample(letters, 175, replace=TRUE) ## generate large vector with random letters
tVec <- lapply(1:1000, function(x) {                     ## generating test data..
  if (x %% 2 == 0) {
    sample(20:40, sample(50:100, 1), replace=TRUE)       ## even indices will contain numbers
  } else {
    sample(letters, sample(50:90, 1), replace=TRUE)      ## odd indices will contain characters
  }
})
tContains <- lapply(1:1000, function(x) if (x%%2==0) {testContainsNum} else {testContainsAlpha})
## First check equality
tJoe <- mapply(is.contained, tVec, tContains)
tGennaro <- mapply(fun.contains, tContains, tVec)
tSimon <- mapply(is.containedSimon, tContains, tVec)
tJilber <- mapply(is.containedJilber, tVec, tContains)
all(tJoe==tGennaro) ## Give same results
[1] TRUE
## Both Jilber's and Simon's solutions don't return any TRUE values
any(tJilber)
[1] FALSE
any(tSimon)
[1] FALSE
## There should be 170 TRUEs
sum(tJoe)
[1] 170
Let's take a closer look to determine if is.contained and fun.contains are behaving correctly.
table(tVec[[3]])
a b c e f g h i j k l m n o p q r t u v w x y z
3 4 5 2 2 1 5 3 5 3 2 1 7 3 1 2 4 3 5 5 2 4 3 3
table(tContains[[3]])
a b c d e f g h i j k l m n o p q r s t u v w x y z
4 11 4 3 7 8 13 4 4 9 13 3 10 7 7 4 8 7 8 6 7 5 9 4 4 6
## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])
[1] FALSE FALSE ## This is correct!!!!
table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 4 7 6 3 4 6 3 5 4 4 6 4 4 2 2 5 3 1 4
table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 10 7 6 4 10 6 8 10 5 5 6 9 8 5 7 5 11 4 9
## Note above that every element in tVec[[14]] is present in
## tContains[[14]] and the number of occurences is less than or
## equal to the occurences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE ## This is correct!!!!
Here are the benchmarks:
library(microbenchmark)
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806 100 a
Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464 100 b
Side Note about VectorIntersect()
After spending a good bit of time with this problem, it became increasingly clear that separating VectorIntersect from is.contained is tremendously valuable. Many times in my own work, the need to obtain the intersection without removing duplicates has surfaced. Oftentimes, the method implemented was messy and probably not reliable (easy to see why after this!). This is why VectorIntersect is a great utility function in addition to is.contained.
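For example, a small demonstration of the multiset behaviour (values chosen arbitrarily here): each common value is kept min(count in v, count in z) times.
VectorIntersect(c(1, 1, 2, 2, 2, 3), c(1, 2, 2, 4))
# [1] 1 2 2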
Update
Actually @Gennaro's solution can be improved quite a bit by calculating s[duplicated(s)] only one time as opposed to 3 times (similarly for b[duplicated(b)] and length(s[duplicated(s)]), which we now compute once instead of twice).
fun.containsFAST <- function(b, s){
  dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)]
  lenS <- length(dupS)
  all(s %in% b) && lenS <= length(dupB) &&
    (if(lenS > 0) fun.containsFAST(dupB, dupS) else 1)
}
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
GenFAST = mapply(fun.containsFAST, tContains, tVec),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185 100 b
GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561 100 a
Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249 100 c
Update 2
What about testing containment for really big vectors? The function I provided is not likely to perform well, as building the "intersection" (with duplicates etc.) by essentially looping over the true set intersection isn't very efficient. The modified @Gennaro function won't be fast either, because for very large vectors with many duplicates the function calls could get nested pretty deep. With this in mind, I built yet another containment function that is specifically built for dealing with large vectors. It utilizes vectorized base R functions, most notably pmin.int, which returns the parallel minima of multiple vectors. The inner function myL is taken directly from the guts of the rle function in base R (although slightly modified for this specific use).
is.containedBIG <- function(v, z) { ## v and z must be sorted
  myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
  sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v]))) == length(v)
}
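To see what myL is doing, here is a tiny sketch of my own on made-up sorted vectors: myL returns the run lengths of a sorted vector (much like rle()$lengths), and summing the parallel minima of the two run-length vectors counts how many elements of v can be matched one-for-one in z.
v <- c(1, 1, 2, 5)                 ## sorted
z <- c(1, 1, 1, 2, 2, 5, 7)        ## sorted
myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
myL(v[v %in% z])   ## 2 1 1 -- run lengths of the shared values within v
myL(z[z %in% v])   ## 3 2 1 -- run lengths of the same values within z
sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v]))) == length(v)  ## TRUE
is.containedBIG(v, z)  ## TRUE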
Note that on smaller examples is.contained and fun.containsFAST are faster (this is mostly due to the time it takes to repeatedly sort; as you will see, if the data is already sorted, is.containedBIG is much faster). Observe (for thoroughness we will also show the verification of @Chirayu's function and test its efficiency):
## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe) ## @Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe) ## specialized algorithm returns correct results
[1] TRUE
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.6158 165.5861 176.3019 175.4786 180.1299 313.7974 100 a
JoeBIG 269.1460 282.9347 294.1568 289.0174 299.4687 445.5222 100 b ## about 2x as slow as GenFAST
GenFAST 140.8219 150.5530 156.2019 155.8306 162.0420 178.7837 100 a
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370 100 c ## about 8x as slow as GenFAST
Now, with sorted data, the results are quite astonishing. is.containedBIG shows a 3-fold improvement in speed, whereas the other functions return almost identical timings.
## with pre-sorted data
tVecSort <- lapply(tVec, sort)
tContainsSort <- lapply(tContains, sort)
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.74771 166.46926 173.45399 172.92374 177.09029 297.7758 100 c
JoeBIG 83.69259 87.35881 94.48476 92.07183 98.37235 221.6014 100 a ## now it's the fastest
GenFAST 139.19631 151.23654 159.18670 157.05911 162.85636 275.7158 100 b
Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805 100 d
For very large vectors, we have the following (only showing GenFAST and JoeBIG as the other functions will take too long):
set.seed(97)
randS <- sample(10^9, 8.5*10^5)
testBigNum <- sample(randS, 2*10^7, replace = TRUE)
tVecBigNum <- lapply(1:20, function(x) {
sample(randS, sample(1500000:2500000, 1), replace=TRUE)
})
system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum))))
user system elapsed
74.016 11.351 85.409
system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]])))
user system elapsed
662.875 54.238 720.433
sum(tJoeBigNum)
[1] 13
all(tJoeBigNum==tGennaroBigNum)
[1] TRUE
## pre-sorted data
testBigSort <- sort(testBigNum)
tVecBigSort <- lapply(tVecBigNum, sort)
system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort)))
user system elapsed
33.910 10.302 44.289
system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]])))
user system elapsed
196.546 54.923 258.079
identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum)
[1] TRUE
Regardless of whether your data is sorted, the point of this last test is to show that is.containedBIG is much faster on larger data. An interesting takeaway from this last test was the fact that fun.containsFAST showed a very large improvement in time when the data was sorted. I was under the impression that duplicated (which is the workhorse of fun.containsFAST) did not depend on whether a vector was sorted or not. Earlier tests confirmed this sentiment (the unsorted test timings were practically identical to the sorted test timings (see above)). More research is needed.

How about a recursive method checking the length of the duplicates for each list?
fun.contains <- function(b, s){
  all(s %in% b) && length(s[duplicated(s)]) <= length(b[duplicated(b)]) &&
    (if(length(s[duplicated(s)]) > 0) fun.contains(b[duplicated(b)], s[duplicated(s)]) else 1)
}
The idea is that a list is contained in another one if and only if the same holds for the lists of their respective duplicates, unless there are no duplicates (in which case the recursion defaults to TRUE).
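Tracing it on the question's r and t, plus a small made-up pair with duplicates, makes the idea concrete:
## r = c(1,1) is not contained in t = c(1,2,5): all(r %in% t) is TRUE, but the
## duplicates of r are c(1) while t has none, so the length check fails.
fun.contains(c(1, 2, 5), c(1, 1))
# [1] FALSE
## c(1,1,2) is contained in c(1,1,2,5): the duplicate lists are c(1) and c(1),
## and the recursive call fun.contains(c(1), c(1)) bottoms out at TRUE.
fun.contains(c(1, 1, 2, 5), c(1, 1, 2))
# [1] TRUE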

Another custom-function version, checking whether the number of elements (length()) of the non-matching elements (setdiff) equals the difference in the vectors' lengths:
# Does vector x contain vector y?
is.contained <- function(x, y) {
  z <- x[x %in% setdiff(x, y)]
  length(z) == length(x) - length(y)
}
r <- c(1,1)
s <- c(1,1,5)
t <- c(1,2,5)
is.contained(r, t)
#> [1] FALSE
is.contained(s, r)
#> [1] TRUE
is.contained(r, s)
#> [1] FALSE

Related

R keep randomly generating numbers until all numbers within specified range are present

My aim is to randomly generate a vector of integers using R, populated by numbers between 1 and 8. However, I want to keep growing the vector until all the numbers from 1 to 8 are represented at least once, e.g. 1,4,6,2,2,3,5,1,4,7,6,8.
I am able to generate single numbers or a sequence of numbers using sample
x=sample(1:8,1, replace=T)
>x
[1] 6
I have played around with the repeat function to see how it might work with sample and I can at least get the generation to stop when one specific number occurs, e.g.
repeat {
  print(x)
  x = sample(1:8, 1, replace=T)
  if (x == 3){
    break
  }
}
Which gives:
[1] 3
[1] 6
[1] 6
[1] 6
[1] 6
[1] 6
[1] 2
I am struggling now to work out how to stop number generation once all numbers between 1:8 are present. Additionally, I know that the above code is only printing the sequence as it is generated and not storing it as a vector. Any advice pointing me in the right direction would be really appreciated!
This is fine for 1:8 but might not always be a good idea.
foo = integer(0)
set.seed(42)
while(TRUE){
  foo = c(foo, sample(1:8, 1))
  if(all(1:8 %in% foo)) break
}
foo
# [1] 8 8 3 7 6 5 6 2 6 6 4 6 8 3 4 8 8 1
If you have more than 1:8, it may be better to obtain the average number of tries (N) required to get all the numbers at least once and then sample N numbers such that all numbers are sampled at least once.
set.seed(42)
vec = 1:8
N = ceiling(sum(length(vec)/(1:length(vec))))
foo = sample(c(vec, sample(vec, N - length(vec), TRUE)))
foo
# [1] 3 6 8 3 8 8 6 4 5 6 1 6 4 6 6 3 5 7 2 2 7 8
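For reference, the N computed above is just the coupon-collector expectation n * (1 + 1/2 + ... + 1/n), rounded up, which is why 22 numbers are drawn for 1:8; a quick check:
n <- 8
ceiling(n * sum(1 / (1:n)))
# [1] 22  ## expected number of draws to see all of 1:8 at least once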
Taking a cue from d.b, here's a slightly more verbose method that is more memory-efficient (and a little faster too, though I doubt speed is your issue):
Differences:
pre-allocate memory in chunks (size 100 here), which mitigates the problem of extend-by-one vector growth; allocating and extending 100 (or even 1000) elements at a time has a much lower cost
compare only the newest number instead of all numbers each time (the first n-1 numbers have already been tabulated, no need to do that again)
Code:
n <- 8   ## n is not defined in the snippet as posted; set it to the pool size (8 for the question's 1:8)
microbenchmark(
  r2evans = {
    emptyvec100 <- integer(100)
    counter <- 0
    out <- integer(0)
    unseen <- seq_len(n)
    set.seed(42)
    repeat {
      if (counter %% 100 == 0) out <- c(out, emptyvec100)
      counter <- counter + 1
      num <- sample(n, size=1)
      unseen <- unseen[unseen != num]
      out[counter] <- num
      if (!length(unseen)) break
    }
    out <- out[1:counter]
  },
  d.b = {
    foo = integer(0)
    set.seed(42)
    while(TRUE){
      foo = c(foo, sample(1:n, 1))
      if(all(1:n %in% foo)) break
    }
  }, times = 100, unit = 'us')
# Unit: microseconds
# expr min lq mean median uq max neval
# r2evans 1090.007 1184.639 1411.531 1228.947 1320.845 11344.24 1000
# d.b 1242.440 1372.264 1835.974 1441.916 1597.267 14592.74 1000
(This is intended neither as code-golf nor speed-optimization. My primary goal is to argue against extend-by-one vector work, and suggest a more efficient comparison technique.)
As d.b further suggested, this works fine for 1:8 but may run into trouble with larger numbers as n is extended upward.
(Edit: with d.b's code changes, the execution times are much closer, and not nearly as exponential looking. Apparently the removal of unique had significant benefits to his code.)

Create groups from vector of 0,1 and NA

Background:
I'm trying to strip out a corpus where the speaker is identified. I've reduced the problem of removing a particular speaker from the corpus to the following stream of 1, 0, and NA (x). 0 means that person is speaking, 1 means someone else is speaking, and NA means that whoever was the last speaker is still speaking.
Here's a visual example:
0 1 S0: Hello, how are you today?
1 2 S1: I'm great thanks for asking!
NA 3 I'm a little tired though!
0 4 S0: I'm sorry to hear that. Are you ready for our discussion?
1 5 S1: Yes, I have everything I need.
NA 7 Let's begin.
So from this frame, I'd like to take 2, 3, 5, and 7. Or, I would like the result to be 0,1,1,0,1,1.
How do I pull the positions of each run of 1 and NA up to the position before the next 0 in a vector?
Here is an example, and my desired output:
Example input:
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
Example output:
These are the positions that I want because they identify that "speaker 1" is talking (1, or 1 followed by NA up to the next 0)
pos <- c(6,8,9,10,11,15,16,17)
An alternative output would be a filling:
fill <- c(0,0,0,0,0,1,0,1,1,1,1,0,0,0,1,1,1,0)
Where the NA values of the previous 1 or 0 are filled up to the next new value.
s <- which(x==1);
e <- c(which(x!=1),length(x)+1L);
unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L));
## [1] 6 8 9 10 11 15 16 17
Every occurrence of a 1 in the input vector is the start of a sequence of position indexes applicable to speaker 1. We capture this in s with which(x==1).
For each start index, we must find the length of its containing sequence. The length is determined by the closest forward occurrence of a 0 (or, more generally, any non-NA value other than 1, if such were possible). Hence, we must first compute which(x!=1) to get these indexes. Because the final occurrence of a 1 may not have a forward occurrence of a 0, we must append an extra virtual index one unit past the end of the input vector, which is why we call c() to append length(x)+1L. We store this as e, reflecting that these are (potential) end indexes. Note that these are exclusive end indexes; they are not actually part of the (potential) preceding speaker 1 sequence.
Finally, we must generate the actual sequences. To do this, we must make one call to seq() for each element of s, also passing its corresponding end index from e. To find the end index we can use findInterval() to find the index into e whose element value (that is, the end index into x) falls just before each respective element of s. (The reason why it is just before is that the algorithm used by findInterval() is v[i[j]] ≤ x[j] < v[i[j]+1] as explained on the doc page.) We must then add one to it to get the index into e whose element value falls just after each respective element of s. We then index e with it, giving us the end indexes into x that follow each respective element of s. We must subtract one from that because the sequence we generate must exclude the (exclusive) end element. The easiest way to make the calls to seq() is to Map() the two endpoint vectors to it, returning a list of each sequence, which we can unlist() to get the required output.
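Walking the intermediate values through the example input makes this concrete:
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
s <- which(x==1)                      ## 6 8 15  -- starts of speaker-1 runs
e <- c(which(x!=1), length(x)+1L)     ## 5 7 12 18 19  -- candidate (exclusive) ends
e[findInterval(s, e) + 1L] - 1L       ## 6 11 17  -- inclusive end of each run
unlist(Map(seq, s, e[findInterval(s, e) + 1L] - 1L))
## [1] 6 8 9 10 11 15 16 17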
s <- which(!is.na(x));
rep(c(0,x[s]),diff(c(1L,s,length(x)+1L)));
## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0
Every occurrence of a non-NA value in the input vector is the start of a segment which, in the output, must become a repetition of the element value at that start index. We capture these indexes in s with which(!is.na(x));.
We must then repeat each start element a sufficient number of times to reach the following segment. Hence we can call rep() on x[s] with a vectorized times argument whose values consist of diff() called on s. To handle the final segment, we must append an index one unit past the end of the input vector, length(x)+1L. Also, to deal with the possible case of NAs leading the input vector, we must prepend a 0 to x[s] and a 1 to the diff() argument, which will repeat 0 a sufficient number of times to cover the leading NAs, if such exist.
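Again, the intermediate values on the example input:
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
s <- which(!is.na(x))              ## 5 6 7 8 12 15 18  -- segment starts
c(0, x[s])                         ## 0 0 1 0 1 0 1 0   -- values to repeat (leading 0 covers initial NAs)
diff(c(1L, s, length(x)+1L))       ## 4 1 1 1 4 3 3 1   -- repetition counts
rep(c(0, x[s]), diff(c(1L, s, length(x)+1L)))
## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0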
Benchmarking (Position)
library(zoo);
library(microbenchmark);
library(stringi);
marat <- function(x) { v <- na.locf(zoo(x)); index(v)[v==1]; };
rawr <- function(x) which(zoo::na.locf(c(0L, x))[-1L] == 1L);
jota1 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); unlist(gregexpr("1", stringx)); };
jota2 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); newx <-unlist(strsplit(stringx, "")); which(newx == 1); };
jota3 <- function(x) {x[is.na(x)] <- "N"; stringx <- stri_flatten(x); ones <- stri_locate_all_regex(stringx, "1N*")[[1]]; unlist(lapply(seq_along(ones[, 1]), function(ii) seq.int(ones[ii, "start"], ones[ii, "end"]))); };
bgoldst <- function(x) { s <- which(x==1); e <- c(which(x!=1),length(x)+1L); unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L)); };
## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: microseconds
## expr min lq mean median uq max neval
## marat(x) 411.830 438.5580 503.24486 453.7400 489.2345 2299.915 100
## rawr(x) 115.466 143.0510 154.64822 153.5280 163.7920 276.692 100
## jota1(x) 448.180 469.7770 484.47090 479.6125 491.1595 835.633 100
## jota2(x) 440.911 464.4315 478.03050 472.1290 484.3170 661.579 100
## jota3(x) 53.885 65.4315 74.34808 71.2050 76.9785 158.232 100
## bgoldst(x) 34.212 44.2625 51.54556 48.5395 55.8095 139.843 100
## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## marat(x) 189.34479 196.70233 226.72926 233.39234 237.45738 293.95154 100
## rawr(x) 24.46984 27.46084 43.91167 29.92112 68.86464 79.53008 100
## jota1(x) 154.91450 157.09231 161.73505 158.18326 160.42694 206.04889 100
## jota2(x) 149.47561 151.68187 155.92497 152.93682 154.79668 201.13302 100
## jota3(x) 82.30768 83.89149 87.35308 84.99141 86.95028 129.94730 100
## bgoldst(x) 80.94261 82.94125 87.80780 84.02107 86.10844 130.56440 100
## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## marat(x) 178.93359 189.56032 216.68963 226.01940 234.06610 294.6927 100
## rawr(x) 17.75869 20.39367 36.16953 24.44931 60.23612 79.5861 100
## jota1(x) 100.10614 101.49238 104.11655 102.27712 103.84383 150.9420 100
## jota2(x) 94.59927 96.04494 98.65276 97.20965 99.26645 137.0036 100
## jota3(x) 193.15175 202.02810 216.68833 209.56654 227.94255 295.5672 100
## bgoldst(x) 253.33013 266.34765 292.52171 292.18406 311.20518 387.3093 100
Benchmarking (Fill)
library(microbenchmark);
bgoldst <- function(x) { s <- which(!is.na(x)); rep(c(0,x[s]),diff(c(1L,s,length(x)+1L))); };
user31264 <- function(x) { x[is.na(x)]=2; x.rle=rle(x); val=x.rle$v; if (val[1]==2) val[1]=0; ind = (val==2); val[ind]=val[which(ind)-1]; rep(val,x.rle$l); };
## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(x) 10.264 11.548 14.39548 12.403 13.258 73.557 100
## user31264(x) 31.646 32.930 35.74805 33.785 35.068 84.676 100
## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(x) 10.94491 11.21860 12.50473 11.53015 12.28945 50.25899 100
## user31264(x) 17.18649 18.35634 22.50400 18.91848 19.53708 65.02668 100
## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(x) 5.24815 6.351279 7.723068 6.635454 6.923264 45.04077 100
## user31264(x) 11.79423 13.063710 22.367334 13.986584 14.908603 55.45453 100
You can make use of na.locf from the zoo package:
library(zoo)
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
v <- na.locf(zoo(x))
index(v)[v==1]
#[1] 6 8 9 10 11 15 16 17
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
x[is.na(x)]=2
x.rle=rle(x)
val=x.rle$v
if (val[1]==2) val[1]=0
ind = (val==2)
val[ind]=val[which(ind)-1]
rep(val,x.rle$l)
Output:
[1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0
Pasting the sequence into a string and using a while loop that checks (with grepl) whether there are any NAs preceded by 1s and substitutes (with gsub) such cases with a 1 will do it:
# substitute NA for "N" for later ease of processing and locating 1s by position
x[is.na(x)] <- "N"
# Collapse vector into a string
stringx <- paste(x, collapse = "")
while(grepl("(?<=1)N", stringx, perl = TRUE)) {
stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE)
}
Then you can use gregexpr to get the indices of 1s.
unlist(gregexpr("1", stringx))
#[1] 6 8 9 10 11 15 16 17
Or you can split the string and look through to find the indices of 1s in the resulting vector:
newx <-unlist(strsplit(stringx, ""))
#[1] "N" "N" "N" "N" "0" "1" "0" "1" "1" "1" "1" "0" "N" "N" "1" "1" "1" "0"
which(newx == "1")
#[1] 6 8 9 10 11 15 16 17
Using stri_flatten from the stringi package instead of paste and stri_locate_all_fixed rather than gregexpr or a string splitting route can provide a little bit more performance if it's a larger vector you're processing. If the vector isn't large, no performance gains will result.
library(stringi)
x[is.na(x)] <- "N"
stringx <- stri_flatten(x)
while(grepl("(?<=1)N", stringx, perl = TRUE)) {
stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE)
}
stri_locate_all_fixed(stringx, "1")[[1]][,"start"]
The following approach is fairly straightforward and performs relatively well (based on bgoldst's excellent benchmarking example) on small and large samples (very well on bgoldst's high probability of NA example)
x[is.na(x)] <- "N"
stringx <- stri_flatten(x)
ones <- stri_locate_all_regex(stringx, "1N*")[[1]]
#[[1]]
#
# start end
#[1,] 6 6
#[2,] 8 11
#[3,] 15 17
unlist(lapply(seq_along(ones[, 1]),
              function(ii) seq.int(ones[ii, "start"], ones[ii, "end"])))
#[1] 6 8 9 10 11 15 16 17

Periodicity of variables in array in R

Let's say that I have alpha
alpha = c("a","a","a","b","c","c","c","a","c","c")
How can I find the periodicity such that I can construct another array beta
beta = c(3,1,3,1,2)
without using the contents of alpha in the code? Is there a way to use lead or lag to answer this?
Here is an option using tabulate and rleid
library(data.table)
tabulate(rleid(alpha))
#[1] 3 1 3 1 2
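Base R's rle() gives the same run lengths without any extra packages (this is the approach benchmarked as flick further down):
rle(alpha)$lengths
#[1] 3 1 3 1 2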
Just for fun, here's a convoluted solution:
alpha <- c('a','a','a','b','c','c','c','a','c','c');
diff(c(0L,which(c(alpha[-1L]!=alpha[-length(alpha)],T))));
## [1] 3 1 3 1 2
Explanation
alpha[-1L]!=alpha[-length(alpha)];
## [1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
First compute a logical vector representing which adjacent pairs of input elements constitute breaks in value equality, and which don't. The index of each element in the logical vector corresponds to the index of the first element of the pair in the input vector.
c(alpha[-1L]!=alpha[-length(alpha)],T);
## [1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
Append a TRUE value to create a pseudo-break at the end of the vector. See next step for clarification.
which(c(alpha[-1L]!=alpha[-length(alpha)],T));
## [1] 3 4 7 8 10
Convert the logical vector to an index vector representing the endpoints of the run lengths in the input vector. Now it should be clear why we had to append TRUE in the previous step; the endpoint of the final run length would otherwise be omitted.
c(0L,which(c(alpha[-1L]!=alpha[-length(alpha)],T)));
## [1] 0 3 4 7 8 10
Prepend a zero. This can conceptually be thought of as transforming the index vector into a "boundaries vector", with each element representing either the internal or external boundaries of the input vector run lengths. See next step for clarification.
diff(c(0L,which(c(alpha[-1L]!=alpha[-length(alpha)],T))));
## [1] 3 1 3 1 2
Take the difference between consecutive boundaries. This provides the desired run lengths.
Benchmarking
library(data.table);
library(microbenchmark);
bgoldst <- function(alpha) diff(c(0L,which(c(alpha[-1L]!=alpha[-length(alpha)],T))));
akrun <- function(alpha) tabulate(rleid(alpha));
bethany <- function(alpha) { if (length(alpha)==0L) return(integer()); res <- integer(); last <- alpha[1L]; cnt <- 1L; i <- 2L; while (i<=length(alpha)) { if (alpha[i]==last) cnt <- cnt+1L else { res[length(res)+1L] <- cnt; last <- alpha[i]; cnt <- 1L; }; i <- i+1L; }; res[length(res)+1L] <- cnt; res; };
flick <- function(alpha) rle(alpha)$lengths;
alpha <- c('a','a','a','b','c','c','c','a','c','c');
expected <- c(3L,1L,3L,1L,2L);
identical(expected,bgoldst(alpha));
## [1] TRUE
identical(expected,akrun(alpha));
## [1] TRUE
identical(expected,bethany(alpha));
## [1] TRUE
identical(expected,flick(alpha));
## [1] TRUE
microbenchmark(bgoldst(alpha),akrun(alpha),bethany(alpha),flick(alpha));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(alpha) 8.553 11.1200 14.85308 12.8300 15.3970 70.136 100
## akrun(alpha) 129.151 144.9745 163.64182 156.7350 171.4895 313.898 100
## bethany(alpha) 20.101 23.9500 30.43242 26.5155 37.8475 70.136 100
## flick(alpha) 20.100 23.9495 30.44956 28.0120 32.2890 62.866 100
set.seed(1L); N <- 1e5L; alpha <- sample(letters[1:3],N,T);
expected <- bgoldst(alpha);
identical(expected,akrun(alpha));
## [1] TRUE
identical(expected,bethany(alpha));
## [1] TRUE
identical(expected,flick(alpha));
## [1] TRUE
microbenchmark(bgoldst(alpha),akrun(alpha),bethany(alpha),flick(alpha),times=10L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(alpha) 5.497899 6.469098 11.007558 6.521699 7.297460 49.891634 10
## akrun(alpha) 1.300492 1.370199 1.547461 1.401631 1.464282 2.816091 10
## bethany(alpha) 2865.335271 2891.594408 2941.352229 2924.165053 2997.881411 3024.234204 10
## flick(alpha) 8.060392 9.355323 13.646002 10.055176 10.841843 48.312741 10
If you want to keep the values in alpha blind, you can write a for loop that runs up a counter while the current element of alpha equals the previous one. You will need a "previous value" variable that accepts the current value and is compared against the next one.
Once the two values do not agree, the counter is concatenated to a vector you created outside the loop, and the counter goes back to one.
It is simple enough that you should do this yourself to learn how.
You can read a file into a variable without manually typing in numbers or ever needing to make them visible, if the data is 'protected' in some way...
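For reference, a minimal sketch of that counter idea (kept deliberately simple, since the point is to work it out yourself):
periodicity <- function(alpha) {
  beta <- integer(0)
  counter <- 1L
  for (i in seq_along(alpha)[-1]) {
    if (alpha[i] == alpha[i - 1]) {
      counter <- counter + 1L        # same value as before: the run continues
    } else {
      beta <- c(beta, counter)       # value changed: store the run length, reset
      counter <- 1L
    }
  }
  c(beta, counter)                   # close off the final run
}
periodicity(c("a","a","a","b","c","c","c","a","c","c"))
# [1] 3 1 3 1 2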

Return indices of rows whose elements (columns) all match a reference vector

Using the following code:
c <- NULL
for (a in 1:4){
  b <- seq(from = a, to = a + 5)
  c <- rbind(c,b)
}
c <- rbind(c,c); rm(a,b)
Results in this matrix,
> c
[,1] [,2] [,3] [,4] [,5] [,6]
b 1 2 3 4 5 6
b 2 3 4 5 6 7
b 3 4 5 6 7 8
b 4 5 6 7 8 9
b 1 2 3 4 5 6
b 2 3 4 5 6 7
b 3 4 5 6 7 8
b 4 5 6 7 8 9
How can I return row indices for rows matching a specific input?
For example, with a search term of,
z <- c(3,4,5,6,7,8)
I need the following returned,
[1] 3 7
This will be used in a fairly large data frame of test data, related to a time step column, to reduce the data by accumulating time steps for matching rows.
Question answered well by others. Due to my dataset size (9.5M rows), I came up with an efficient approach that took a couple of steps.
1) Sort the big data frame 'dc' containing time steps to accumulate in column 1.
dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]
2) Create a new data frame with unique entries (excluding column 1).
dcU <- unique(dc[,2:8])
3) Write an Rcpp (C++) function that loops through the unique data frame; for each unique row it iterates through the original data frame, accumulating time while the rows are equal, and moves on to the next step of the for loop when an unequal row is identified.
require(Rcpp)
getTsrc <-
'
NumericVector getT(NumericMatrix dc, NumericMatrix dcU)
{
int k = 0;
int n = dcU.nrow();
NumericVector tU(n);
for (int i = 0; i<n; i++)
{
while ((dcU(i,0)==dc(k,1))&&(dcU(i,1)==dc(k,2))&&(dcU(i,2)==dc(k,3))&&
(dcU(i,3)==dc(k,4))&&(dcU(i,4)==dc(k,5))&&(dcU(i,5)==dc(k,6))&&
(dcU(i,6)==dc(k,7)))
{
tU[i] = tU[i] + dc(k,0);
k++;
}
}
return(tU);
}
'
cppFunction(getTsrc)
4) Convert function inputs to matrices.
dc1 <- as.matrix(dc)
dcU1 <- as.matrix(dcU)
5) Run the function and time it (returns time vector matching unique data frame)
pt <- proc.time()
t <- getT(dc1, dcU1)
print(proc.time() - pt)
user system elapsed
0.18 0.03 0.20
6) Self high-five and more coffee.
You can use apply.
Here we use apply on c across rows (the 1), applying the function function(x) all(x == z) to each row.
The which then pulls out the integer positions of the rows.
which(apply(c, 1, function(x) all(x == z)))
b b
3 7
EDIT: If your real data is having problems with this, and is only 9 columns (not too much typing), you could try a fully vectorized solution:
which((c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4]& c[,5]==z[5]& c[,6]==z[6]))
The answer by @jeremycg will definitely work, and is fast if you have many columns and few rows. However, you might be able to go a bit faster if you have lots of rows by avoiding using apply() over the row dimension.
Here's an alternative:
l <- unlist(apply(c, 2, list), recursive=F)
logic <- mapply(function(x,y)x==y, l, z)
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
[1] 3 7
It works by first turning each column into a list. Then, it takes each column-list and searches it for the corresponding element in z. In the last step, you find out which rows had all columns with the corresponding match in z. Even though the last step is a row-wise operation, by using .rowSums (mind the . at the front there) we can specify the dimensions of the matrix, and get a speed-up.
Let's test the timings of the two approaches.
The functions
f1 <- function(){
  which(apply(c, 1, function(x) all(x == z)))
}
f2 <- function(){
  l <- unlist(apply(c, 2, list), recursive=F)
  logic <- mapply(function(x,y) x==y, l, z)
  which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
}
With 8 rows (dim in example):
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 21.147 21.8375 22.86096 22.6845 23.326 30.443 100 a
f2() 42.310 43.1510 45.13735 43.7500 44.438 137.413 100 b
With 80 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 101.046 103.859 108.7896 105.1695 108.3320 166.745 100 a
f2() 93.631 96.204 104.6711 98.1245 104.7205 236.980 100 a
With 800 rows:
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 920.146 1011.394 1372.3512 1042.1230 1066.7610 31290.593 100 b
f2() 572.222 579.626 593.9211 584.5815 593.6455 1104.316 100 a
Note that my timing assessment only had 100 replicates each, and although these results are representative, there's a bit of variability in the number of rows required before the two methods are equal.
Regardless, I think my approach would probably be faster once you have 100+ rows.
Also, note that you can't simply transpose c to make f1() faster. First, the t() takes up time; second, because you're comparing to z, you would then just have to make a column-wise (after the transpose) comparison, so it's no different at that point.
Finally, I'm sure there's an even faster way to do this. My answer was just the first thing that came to mind, and didn't require any packages to install. This could be a lot faster if you wanted to use data.table. Also, if you had a lot of columns, you might even be able to parallelize this procedure (although, to be worthwhile the dataset would have to be immense).
If these timings aren't tolerable for your data, you might consider reporting back with the dimensions of your data set.
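Since data.table was mentioned, here is one possible sketch of that route (a sketch only, not benchmarked; the V1..V6 column names simply come from converting the unnamed matrix c):
library(data.table)
dt  <- as.data.table(c)                  # the matrix from the question; columns become V1..V6
dt[, row_id := .I]                       # remember the original row positions
zdt <- as.data.table(as.list(z))         # one-row table holding the reference vector
setnames(zdt, names(dt)[seq_along(z)])   # align its column names with dt
dt[zdt, on = names(zdt), row_id]         # join: indices of rows equal to z
# [1] 3 7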
In your code c is not a data frame. Try transforming it into one:
c <- data.frame(c)

Memorize the last "correct" value of a sequence (for removing outliers)

I have a little problem in a function.
The aim of it is to remove outliers I've detected in my data.frame. They are detected when there is too big a difference from the previous correct value (e.g. in c(1,2,3,20,30,4,5,6), "20" and "30" are the outliers). But my data is much more complex than this.
My idea is to consider the first two numeric values of my column as "correct". Then, I want to test each next value:
if the difference between the tested value and the previous one is <20, then it's a new correct one, and the test must start again from this new correct value (and not from the previous correct one)
if the same difference is >20, then it's a wrong one. An index must be put next to the wrong value, and the test must still continue from this same correct value, until a new correct value is detected
Here's an example with my function and a fake DF:
myts <- data.frame(x=c(12,12,35,39,46,45,33,5,26,28,29,34,15,15), z=NA)
test <- function(x){
  st1 = NULL
  temp <- st1[1] <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if((!is.na(x[i])) & (!is.na(x[i-1])) & (abs((x[i])-(temp)) > 20)){
      st1[i] <- 1
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
It nearly does the job, but the problem is that the last correct value is not memorized; it still does the test from the first correct value.
Because of this, rows 9 to 11 in my example are not detected. I'll let you imagine the problem on a 500,000-row data.frame.
How can I solve this little problem? The rest of the function may be OK.
You just need to update temp for any indices that aren't outliers:
test <- function(x) {
  temp <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if(!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i]-temp) > 20) {
      st1[i] <- 1
    } else {
      temp <- x[i]
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
myts
# x z
# 1 12 0
# 2 12 0
# 3 35 1
# 4 39 1
# 5 46 1
# 6 45 1
# 7 33 1
# 8 5 0
# 9 26 1
# 10 28 1
# 11 29 1
# 12 34 1
# 13 15 0
# 14 15 0
One thing to note is that for loops in R will be quite slow compared to vectorized functions. However, because each element in your vector depends in a complicated way on the previous ones, it's tough to use R's built-in vectorized functions to efficiently compute your vector. You can convert this code nearly verbatim to C++ and use the Rcpp package to regain the efficiency:
library(Rcpp)
test2 <- cppFunction(
  "IntegerVector test2(NumericVector x) {
    const int n = x.length();
    IntegerVector st1(n, 0);
    double temp = x[0];
    for (int i=1; i < n; ++i) {
      if (!R_IsNA(x[i]) && !R_IsNA(x[i-1]) && fabs(x[i] - temp) > 20.0) {
        st1[i] = 1;
      } else {
        temp = x[i];
      }
    }
    return st1;
  }")
all.equal(test(myts[,1]), test2(myts[,1]))
# [1] TRUE
# Benchmark on large vector with some NA values:
set.seed(144)
large.vec <- c(0, sample(c(1:50, NA), 1000000, replace=T))
all.equal(test(large.vec), test2(large.vec))
# [1] TRUE
library(microbenchmark)
microbenchmark(test(large.vec), test2(large.vec))
# Unit: milliseconds
# expr min lq mean median uq max neval
# test(large.vec) 2343.684164 2468.873079 2667.67970 2604.22954 2747.23919 3753.54901 100
# test2(large.vec) 9.596752 9.864069 10.97127 10.23011 11.68708 16.67855 100
The Rcpp code is about 250x faster on a vector of length 1 million. Depending on your use case this speedup may or may not be important.
