Returning absent values without inducing integer(0) - r

I want to identify which values in one vector are present in another vector. Sometimes, in my application, none of the values of the first vector are present; in such cases I would like NA. My current approach returns integer(0) when this occurs:
l <- 1:3
m <- 2:5
n <- 4:6
l[l %in% m]
[1] 2 3
l[l %in% n]
integer(0)
This post discusses how to capture integer(0) using length, but is there a way to avoid integer(0) in the first place, and do this operation in just one step? Answers to the previous question suggest that any could be used but I fail to see how that would work in this example.

You could catch the integer(0) with a custom function:
l <- 1:3
m <- 2:5
n <- 4:6
returnsafe <- function(a, b) {
result <- a[a %in% b]
if(is.integer(result) && length(result) == 0L) {
return(NA)
} else {
return(result)
}
}
> returnsafe(l, n)
[1] NA

You can do:
l[match(l, n)]
[1] NA NA NA
Or:
any(l[match(l, n)])
[1] NA
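Note that l[match(l, m)] does not return only the matching values when some values are present (it gives NA 1 2 for the vectors above). A minimal sketch of the one-step any() idea mentioned in the question follows; presentOrNA is a hypothetical helper name, not something from the answers above:

```r
# Hypothetical helper: return the values of a that occur in b, or NA if none do
presentOrNA <- function(a, b) {
  hit <- a %in% b                  # logical vector, one entry per element of a
  if (any(hit)) a[hit] else NA     # any() decides between the subset and NA
}

l <- 1:3
m <- 2:5
n <- 4:6
presentOrNA(l, m)
presentOrNA(l, n)
```

With the vectors from the question, the first call should give 2 3 and the second a single NA.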

Related

R programming language LOOPS

y <- vector()
i <- 5
while((2<3)<i){
y[i] <- "Hello World!"
i <- i-1 }
y
So I don't understand how the while loop works when the condition is while((2<3)<i). 2<3 is always TRUE, so I end up with TRUE<i; what does this mean? Or am I thinking about it wrong?
I just don't get how the condition of the while loop works; if I get that, I believe I can work out the rest.
Also another question:
xxx <- function(vec){
  n <- length(vec)
  for(i in 1:n){
    x <- vec[i]
    if (vec[i]<x){
      x <- vec[i]
    }
  }
  return(x)
}
This xxx function is supposed to output the minimum value of the vector? OK, I see, but how?
When we enter the loop we first do x <- vec[i]; without doing this we can't move on to the next command, the if statement, right? So since we do x <- vec[i] first, the if condition will probably never be true, since x == vec[i] all the time.
Please help, since I have the exam tomorrow :(
1) ?Comparison says, referring to the two arguments of any comparison operator such as < :
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.
So in this case we have one logical argument and one numeric argument, and the logical argument is coerced to numeric (where FALSE is converted to 0 and TRUE is converted to 1). Thus (2<3)<5 is the same as TRUE < 5, which is the same as 1 < 5, which is TRUE:
(2<3)<5
## [1] TRUE
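As a small illustration of the coercion rule, here is the loop from the question with the condition spelled out in comments. Nothing beyond the original code is assumed:

```r
y <- vector()
i <- 5
while ((2 < 3) < i) {        # (2 < 3) is TRUE, so this tests 1 < i
  y[i] <- "Hello World!"     # fills positions 5, 4, 3, 2 in turn
  i <- i - 1
}
# The loop stops when i reaches 1, because 1 < 1 is FALSE,
# so y[1] is never assigned and stays NA.
y
```

Under this reading the loop body runs four times, leaving y with an NA in the first position followed by four copies of the string.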
2) For xxx you probably want this:
xxx <- function(vec) {
x <- Inf
for(i in seq_along(vec)) if (vec[i] < x) x <- vec[i]
x
}
The first statement in the body assigns Inf to x. In the second statement, seq_along(vec) is 1, 2, ..., length(vec), so the for loop iterates i over 1, 2, ..., length(vec), with each iteration replacing x with vec[i] if vec[i] is less than x. Note that if vec has zero length then the loop is not run at all, since seq_along(vec) has zero length.
Testing it out:
> xxx(1:3)
[1] 1
> xxx(3:1)
[1] 1
> xxx(numeric(0)) # zero length input
[1] Inf
Of course R already has the min function which does the same thing.
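To see why the loop in the question cannot find the minimum, here is a minimal sketch that keeps its logic (x is reassigned at the top of every iteration). The name lastNotMin is hypothetical and only used for illustration:

```r
# Hypothetical name; the body mirrors the loop logic from the question
lastNotMin <- function(vec) {
  for (i in seq_along(vec)) {
    x <- vec[i]                     # x is overwritten on every iteration
    if (vec[i] < x) x <- vec[i]     # never TRUE: x was just set to vec[i]
  }
  x                                 # so this is simply the last element
}

lastNotMin(c(3, 1, 2))
min(c(3, 1, 2))
```

For c(3, 1, 2) the first call should return 2 (the last element), while min returns 1.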

Find vector overlap from the start

I am looking for an efficient way to get the first k elements that are the same between two vectors in R.
For example:
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
# [1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
# [1] 1 2 3
This is the same as the intersect behavior, but any values after the first mismatch should be dropped.
I also want this to work for strings.
So far, the solution that I have is this:
orderedIntersect <- function(a,b) {
a <- as.vector(a)
NAs <- is.na(match(a, as.vector(b)))
last <- ifelse(any(NAs), min(which(NAs)) - 1, length(a))
a[1:last]
}
I am troubled by the fact that I have to iterate over n input elements 6 times: match, is.na, any, which, min, and the subset [].
Clearly, it would be faster to write an external C function (with a for loop and a break), but I am wondering if there is any clever R trick I can use here.
You can compare the values of your vectors and drop elements when the first FALSE is reached:
orderedIntersect <- function(a, b) {
  # trim both vectors to the shorter length (to avoid recycling warnings);
  # seq_len() is safe when that length is 0, unlike 1:m_l
  m_l <- min(length(a), length(b))
  a <- a[seq_len(m_l)] ; b <- b[seq_len(m_l)]
  # compare the elements: they are equal if both are not NA and have the same value, or if both are NA
  comp <- (!is.na(a) & !is.na(b) & a == b) | (is.na(a) & is.na(b))
  # return the right vector: nothing if the first elements do not match, everything if all elements match, or just the part that matches
  if (m_l == 0 || !comp[1]) return(c()) else if (all(comp)) return(a) else return(a[1:(which(!comp)[1] - 1)])
}
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
#[1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
#[1] 1 2 3
orderedIntersect(c(1,2,3), c(2,3,4))
#NULL
The simple C solution (for integers) isn't really any longer than the R version, but it would be a little more work to extend to all the other classes.
library(inline)
orderedIntersect <- cfunction(
signature(x='integer', y='integer'),
body='
int i, l = length(x) > length(y) ? length(y) : length(x),
*xx = INTEGER(x), *yy = INTEGER(y);
SEXP res;
for (i = 0; i < l; i++) if (xx[i] != yy[i]) break;
PROTECT(res = allocVector(INTSXP, i));
for (l = 0; l < i; l++) INTEGER(res)[l] = xx[l];
UNPROTECT(1);
return res;'
)
## Tests
a <- c(1L,2L,3L,4L)
b <- c(1L,2L,5L,4L)
c <- c(1L,2L,8L,9L,9L,9L,9L,3L)
d <- c(9L,0L,0L,8L)
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# integer(0)
orderedIntersect(a, integer())
# integer(0)
This might work:
#test data
a <- c(1,2,3,4)
b <- c(1,2,5,4)
c <- c(1,2,8,9,9,9,9,3)
d <- c(9,0,0,8)
empty <- c()
string1 <- c("abc", "def", "ad","k")
string2 <- c("abc", "def", "c", "lds")
#function
orderedIntersect <- function(a, b) {
l <- min(length(a), length(b))
if (l == 0) return(numeric(0))
a1 <- a[1:l]
comp <- a1 != b[1:l]
if (all(!comp)) return(a1)
a1[ 0:(min(which(comp)) - 1) ]
}
#testing
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# numeric(0)
orderedIntersect(a, empty)
# numeric(0)
orderedIntersect(string1,string2)
# [1] "abc" "def"
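Another way to drop everything after the first mismatch is a running count of mismatches. This is only a sketch: orderedIntersectCum is a hypothetical name, and treating NA comparisons as mismatches is an assumption, not something the answers above specify.

```r
# Hypothetical variant: keep the common prefix of a and b
orderedIntersectCum <- function(a, b) {
  l <- min(length(a), length(b))
  a1 <- a[seq_len(l)]               # seq_len() also handles l == 0
  same <- a1 == b[seq_len(l)]
  same[is.na(same)] <- FALSE        # assumption: an NA never counts as a match
  a1[cumsum(!same) == 0]            # TRUE only before the first mismatch
}

orderedIntersectCum(c(1, 2, 3, 4), c(1, 2, 5, 4))
orderedIntersectCum(c("abc", "def", "ad", "k"), c("abc", "def", "c", "lds"))
```

The cumsum(!same) == 0 mask stays TRUE until the first FALSE in same, so no explicit which/min step is needed.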

Matching numbers by their order when in two different vectors

The title does not really do this question justice, but I could not think of any other way to phrase the question. I can best explain the problem with an example.
Let's say we have two vectors of numbers (each of which are always going to be ascending and unique):
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
What I am trying to do is create a function that will take these two vectors and match them in the following way:
1) Start with the first element of vector1 (in this case, 1)
2) Go to vector2 and match the element from #1 with the first element in vector 2 that is bigger than it (in this case, 5)
3) Go back to vector1 and skip all elements less than the value in #2 we found (in this case, we skip 3, and grab 10)
4) Go back to vector2 and skip all elements less than the value in #3 we found (in this case, we skip 9 and grab 15)
5) repeat until we are done with all elements.
The resulting two vectors we should have are:
result1 = c(1,10,24,30)
result2 = c(5,15,28,35)
My current solution goes something like this, but I believe it might be highly inefficient:
# establishes where we start from the vector2 numbers
# just in case we have vector1 <- c(5,8,10)
# and vector2 <- c(1,2,3,4,6,7). We would want to skip the 1,2,3,4 values
i <- 1
while(vector2[i]<vector1[1]){
i <- i+1
}
# starts the result1 vector with the first value from the vector1
result1 <- vector1[1]
# starts the result2 vector empty and will add as we loop through
result2 <- c()
# super complicated and probably hugely inefficient loop within a loop within a loop
# i really want to avoid doing this, but I cannot think of any other way to accomplish this
for(j in 1:length(vector1)){
while(vector1[j] > vector2[i] && (i+1) <= length(vector2)){
result1 <- c(result1,vector1[j])
result2 <- c(result2,vector2[i])
while(vector1[j] > vector2[i+1] && (i+2) <= length(vector2)){
i <- i+1
}
i <- i+1
}
}
## have to add on the last vector2 value cause while loop skips it
## if it doesn't exist (there are no more vector2 values bigger) we put in an NA
if(result1[length(result1)] < vector2[i]){
result2 <- c(result2,vector2[i])
}
else{
### we ran out of vector2 values that are bigger
result2 <- c(result2,NA)
}
This is really difficult to explain. Just call it magic :)
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
## another case
# vector2 <- c(0,9,15,19,21,23,28,35)
## handling the case where vector2 min value(s) are < vector1 min value
if (any(idx <- which(min(vector1) >= vector2)))
vector2 <- vector2[-idx]
## interleave the two vectors
tmp <- c(vector1,vector2)[order(c(order(vector1), order(vector2)))]
## if we sort the vectors, which pairwise elements are from the same vector
r <- rle(sort(tmp) %in% vector1)$lengths
## I want to "remove" all the pairwise elements which are from the same vector
## so I again interleave two vectors:
## the first will be all TRUEs because I want the first instance of each *new* vector
## the second will be all FALSEs identifying the elements I want to throw out because
## there is a sequence of elements from the same vector
l <- rep(1, length(r))
ord <- c(l, r - 1)[order(c(order(r), order(l)))]
## create some dummy TRUE/FALSE to identify the ones I want
res <- sort(tmp)[unlist(Map(rep, c(TRUE, FALSE), ord))]
setNames(split(res, res %in% vector2), c('result1', 'result2'))
# $result1
# [1] 1 10 24 30
#
# $result2
# [1] 5 15 28 35
Obviously this will only work if both vectors are ascending and unique, which you said they are.
EDIT:
works with duplicates:
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
vector2 <- c(0,9,15,19,21,23,28,35)
vector2 <- c(1,3,3,5,7,9,28,35)
f <- function(v1, v2) {
  # work on the arguments v1 and v2, not on the global vector1 and vector2
  if (any(idx <- which(min(v1) >= v2)))
    v2 <- v2[-idx]
  # tag the origin of each value: '.0' for v1, '.00' for v2
  v1 <- paste0(v1, '.0')
  v2 <- paste0(v2, '.00')
  n <- function(x) as.numeric(x)
  tmp <- c(v1, v2)[order(n(c(v1, v2)))]
  m <- tmp[1]
  idx <- c(TRUE, sapply(1:(length(tmp) - 1), function(x) {
    if (n(tmp[x + 1]) > n(m)) {
      if (gsub('^.*\\.','', tmp[x + 1]) == gsub('^.*\\.','', m))
        FALSE
      else {
        m <<- tmp[x + 1]
        TRUE
      }
    } else FALSE
  }))
  setNames(split(n(tmp[idx]), grepl('\\.00$', tmp[idx])), c('result1','result2'))
}
f(vector1, vector2)
# $result1
# [1] 1 10 30
#
# $result2
# [1] 3 28 35
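The steps listed in the question can also be sketched as a plain alternating loop. This is a minimal sketch rather than a tuned solution: matchAlternating is a hypothetical name, both vectors are assumed to be ascending, and "skip all elements less than" is read as "take the next element strictly greater than the last matched value".

```r
# Hypothetical helper following steps 1) to 5) from the question
matchAlternating <- function(v1, v2) {
  r1 <- numeric(0)
  r2 <- numeric(0)
  last <- -Inf                       # last matched value from v2
  repeat {
    i <- which(v1 > last)[1]         # next v1 element above the last v2 match
    if (is.na(i)) break              # v1 is exhausted
    r1 <- c(r1, v1[i])
    j <- which(v2 > v1[i])[1]        # first v2 element bigger than it
    if (is.na(j)) {                  # no bigger v2 value left: record NA
      r2 <- c(r2, NA)
      break
    }
    r2 <- c(r2, v2[j])
    last <- v2[j]
  }
  list(result1 = r1, result2 = r2)
}

vector1 <- c(1, 3, 10, 11, 24, 26, 30, 31)
vector2 <- c(5, 9, 15, 19, 21, 23, 28, 35)
matchAlternating(vector1, vector2)
```

For the vectors in the question this should reproduce result1 = 1 10 24 30 and result2 = 5 15 28 35. Each which() call rescans the vector, so it favours readability over speed.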

Vectorized (non-loop) solution returns wrong result (solution with for-loop returns correct result)

I have two theoretically identical solutions: one is vectorized and the other uses a for-loop. But the vectorized solution returns the wrong result and I want to understand why. The logic is simple: replace each NA with the previous non-NA value in the vector.
# vectorized
f1 <- function(x) {
idx <- which(is.na(x))
x[idx] <- x[ifelse(idx > 1, idx - 1, 1)]
x
}
# non-vectorized
f2 <- function(x) {
for (i in 2:length(x)) {
if (is.na(x[i]) && !is.na(x[i - 1])) {
x[i] <- x[i - 1]
}
}
x
}
v <- c(NA,NA,1,2,3,NA,NA,6,7)
f1(v)
# [1] NA NA 1 2 3 3 NA 6 7
f2(v)
# [1] NA NA 1 2 3 3 3 6 7
The two pieces of code are different.
The first one replaces NA with the previous element of the original vector if that element is not NA.
The second one replaces NA with the previous element if that element is not NA, but the previous element can itself be the result of an earlier NA substitution.
Which one is correct really depends on you. The second behaviour is more difficult to vectorize, but there are some already implemented functions like zoo::na.locf.
Or, if you only want to use base packages, you could have a look at this answer.
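For reference, one possible base R sketch of the second behaviour (carrying the last non-NA value forward) is shown below. fillForward is a hypothetical name and this is not necessarily the approach in the linked answer:

```r
# Hypothetical helper: replace each NA with the most recent non-NA value
fillForward <- function(x) {
  idx <- cumsum(!is.na(x))           # how many non-NA values seen so far (0 at a leading NA)
  c(NA, x[!is.na(x)])[idx + 1]       # position 1 is the NA used for leading NAs
}

v <- c(NA, NA, 1, 2, 3, NA, NA, 6, 7)
fillForward(v)
```

For v this should give NA NA 1 2 3 3 3 6 7, the same result as f2; leading NAs stay NA because there is nothing to carry forward.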
These two solutions are not equivalent. The first function is rather like:
f2_as_f1 <- function(x) {
y <- x # a copy of x
for (i in 2:length(x)) {
if (is.na(y[i])) {
x[i] <- y[i - 1]
}
}
x
}
Note the usage of the y vector.

R split numeric vector at position

I am wondering about the simple task of splitting a vector into two at a certain index:
splitAt <- function(x, pos){
list(x[1:pos-1], x[pos:length(x)])
}
a <- c(1, 2, 2, 3)
> splitAt(a, 4)
[[1]]
[1] 1 2 2
[[2]]
[1] 3
My question: there must be some existing function for this, but I can't find it. Is split maybe a possibility? My naive implementation also does not work if pos=0 or pos>length(a).
An improvement would be:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
which can now take a vector of positions:
splitAt(a, c(2, 4))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 2
#
# [[3]]
# [1] 3
And it does behave properly (subjective) if pos <= 1 or pos > length(x), in the sense that it returns the whole original vector in a single list item. If you'd like it to error out instead, use stopifnot at the top of the function.
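A minimal sketch of the stopifnot variant might look like this. splitAtStrict is a hypothetical name, and the exact bounds (every position greater than 1 and at most length(x)) are an assumption about which inputs should count as valid:

```r
# Hypothetical strict variant: error out instead of returning the whole vector
splitAtStrict <- function(x, pos) {
  stopifnot(all(pos > 1), all(pos <= length(x)))   # assumed validity bounds
  unname(split(x, cumsum(seq_along(x) %in% pos)))
}

a <- c(1, 2, 2, 3)
splitAtStrict(a, c(2, 4))
```

With these checks, a call such as splitAtStrict(a, 0) stops with an error rather than silently returning a one-element list.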
I tried to use flodel's answer, but it was too slow in my case with a very large x (and the function has to be called repeatedly). So I created the following function that is much faster, but also very ugly and doesn't behave properly. In particular, it doesn't check anything and will return buggy results at least for pos >= length(x) or pos <= 0 (you can add those checks yourself if you're unsure about your inputs and not too concerned about speed), and perhaps some other cases as well, so be careful.
splitAt2 <- function(x, pos) {
out <- list()
pos2 <- c(1, pos, length(x)+1)
for (i in seq_along(pos2[-1])) {
out[[i]] <- x[pos2[i]:(pos2[i+1]-1)]
}
return(out)
}
However, splitAt2 runs about 20 times faster with an x of length 10^6:
library(microbenchmark)
W <- rnorm(1e6)
splits <- cumsum(rep(1e5, 9))
tm <- microbenchmark(
splitAt(W, splits),
splitAt2(W, splits),
times=10)
tm
Another alternative that might be faster and/or more readable/elegant than flodel's solution:
splitAt <- function(x, pos) {
  # apply findInterval() to the positions seq_along(x), not to the values of x
  unname(split(x, findInterval(seq_along(x), pos)))
}
