R: Count appearances of vector y in x

Suppose you have two vectors y and x with length(y) < length(x).
What I want is to count how often all elements of y appear in x in the same order.
I can assume that the elements of y appear at least one time in x in the right order.
Example:
y = c(10,20)
x = c(10,20,20,10,20)
The index pairs of the appearances of y in x are:
(1,2),(1,3),(1,5),(4,5)
so my result should be 4.
I've already written a function:
countAllPositionsOfLCS <- function(y, x) {
potIndexList <- lapply(y, function(k) {
which(k == x)
})
previousIndices <- potIndexList[[1]]
counter <- length(potIndexList[[1]])
if (length(potIndexList) >= 2) {
for (k in 2:length(potIndexList)) {
newIndices <- potIndexList[[k]]
for (i in 1:length(previousIndices)) {
currentFittingInds <- newIndices[which(previousIndices[i] < newIndices)]
counter <- counter + length(currentFittingInds) - 1
}
previousIndices <- newIndices
}
}
counter
}
I tested this function with
c(10,20,30) and c(10,20,20,10,20,20,30,30)
The proper result should be 12, but the function says 10.
The index triples are 1,2,7 | 1,2,8 | 1,3,7 | 1,3,8 | 1,5,7 | 1,5,8 | 1,6,7 | 1,6,8 | 4,5,7 | 4,5,8 | 4,6,7 | 4,6,8.
My idea was as follows:
First, I look up where the values of y appear in x, and for each value of y I store the indices in the list potIndexList.
Then I go through all elements of this list. Say we are at element potIndexList[[k]], which holds all the indices where the k-th element of y appears in x. Since I want to preserve the order of y in x, I have to get rid of some indices. Therefore I go through the indices and check whether previousIndices[i] < potIndexList[[k]]. If this is TRUE, I know that the order is right.

I suggest using expand.grid unless your actual problem involves much larger vectors:
y = c(10,20,30)
x = c(10,20,20,10,20,20,30,30)
#find matches:
inds <- lapply(y, function(z) which(z == x))
#all combinations of matches:
res <- do.call(expand.grid, inds)
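#remove combinations with wrong order (indices must be strictly increasing,
#which also rules out reusing one position when y contains repeated values):
res <- res[apply(res, 1, function(z) all(diff(z) > 0)), , drop = FALSE]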
# Var1 Var2 Var3
# 1 1 2 7
# 3 1 3 7
# 5 1 5 7
# 6 4 5 7
# 7 1 6 7
# 8 4 6 7
# 9 1 2 8
#11 1 3 8
#13 1 5 8
#14 4 5 8
#15 1 6 8
#16 4 6 8
nrow(res)
#[1] 12
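For much larger vectors, where enumerating every combination becomes too expensive, only the count may be needed. The following is a minimal sketch of a dynamic-programming count, not part of the answer above; the function name count_subsequences is hypothetical. It keeps, for each prefix of y, the number of ways that prefix has been matched so far, and updates those counts once per element of x.

```r
# Sketch: count how often y occurs in x as an ordered (not necessarily
# contiguous) subsequence, without listing the index combinations.
count_subsequences <- function(y, x) {
  # ways[j + 1] = number of ways to match the first j elements of y so far
  ways <- c(1, numeric(length(y)))
  for (xi in x) {
    # walk y backwards so that one element of x extends each match at most once
    for (j in rev(seq_along(y))) {
      if (y[j] == xi) ways[j + 1] <- ways[j + 1] + ways[j]
    }
  }
  ways[length(y) + 1]
}

count_subsequences(c(10, 20), c(10, 20, 20, 10, 20))
# [1] 4
count_subsequences(c(10, 20, 30), c(10, 20, 20, 10, 20, 20, 30, 30))
# [1] 12
```

The running time is proportional to length(x) * length(y), whereas the expand.grid approach grows with the product of the match counts.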

Related

R how to find a series of common values in a vector (identifying growing season)

I'm looking for a way to identify a growing season which consists of a number of days greater than say 60 between the last frost day of spring and the first frost day in the fall. A general version of this problem is this. If I have a vector of numbers like testVec, I want the item numbers of the beginning and end range of values where the number of items is 5 or greater and all of them are greater than 0.
testVec <- c(1,3,4,0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
In this example, the relevant range is 1,3,4,6,7,5,9 which is testVec[9] to testVec[15]
One option could be:
testVec[with(rle(testVec > 0), rep(lengths * values >= 5, lengths))]
[1] 1 3 4 6 7 5 9
Here, the idea is to, first, create runs of values that are smaller or equal to zero and bigger than zero. Second, it checks whether the runs of values bigger than zero are of length 5 or more. Finally, it subsets the original vector for the runs of values bigger than zero with length 5 or more.
1) rleid This also handles any number of sequences including zero. rleid(ok) is a vector the same length as ok such that the first run of identical elements is replaced with 1, the second run with 2 and so on. The result is a list of vectors where each vector has its positions in the original input as its names.
library(data.table)
getSeq <- function(x) {
names(x) <- seq_along(x)
ok <- x > 0
s <- split(x[ok], rleid(ok)[ok])
unname(s)[lengths(s) >= 5]
}
getSeq(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
getSeq(numeric(16))
## list()
getSeq(c(testVec, 10 * testVec))
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
##
## [[2]]
## 25 26 27 28 29 30 31
## 10 30 40 60 70 50 90
If a data frame is desired, then the following gives the values and which sequence each row came from. The row names indicate the positions in the original input.
gs <- getSeq(c(testVec, 10 * testVec))
names(gs) <- seq_along(gs)
if (length(gs)) stack(gs) else gs
## values ind
## 9 1 1
## 10 3 1
## 11 4 1
## 12 6 1
## 13 7 1
## 14 5 1
## 15 9 1
## 25 10 2
## 26 30 2
## 27 40 2
## 28 60 2
## 29 70 2
## 30 50 2
## 31 90 2
2) gregexpr Replace each element that is > 0 with 1 and each other element with 0 pasting the 0's and 1's into a single character string. Then use gregexpr to look for sequences of 1's at least 5 long and for the ith such nonoverlapping sequence return the first positions, g, and lengths, attr(g, "match.length"). Define a function vals which extracts the values at the required positions from testVec of the ith such nonoverlapping sequence returning a list such that the ith component of the list is the ith such sequence. The names in the output vector are its positions in the input.
getSeq2 <- function(x) {
g <- gregexpr("1{5,}", paste(+(x > 0), collapse = ""))[[1]]
vals <- function(i) {
ix <- seq(g[i], length = attr(g, "match.length")[i])
setNames(x[ix], ix)
}
if (length(g) == 1 && g == -1) list() else lapply(seq_along(g), vals)
}
getSeq2(testVec)
## [[1]]
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
The above handles any number of sequences including 0 but if we knew there were exactly one sequence (which is the case for the example in the question) then it could be simplified to the following where the return value is just that vector:
g <- gregexpr("1{5,}", paste(+(testVec > 0), collapse = ""))[[1]]
ix <- seq(g, length = attr(g, "match.length"))
setNames(testVec[ix], ix)
## 9 10 11 12 13 14 15
## 1 3 4 6 7 5 9
You could "fix" @tmfmnk's solution like this:
f1 <- function(x, threshold, n) {
range(which(with(rle(x > threshold), rep(lengths * values >= n, lengths))))
}
x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1,3,4,6,7,5,9, 0)
f1(x, 0, 5)
#[1] 9 15
But that does not work well when there are multiple runs, because range() then spans from the start of the first run to the end of the last:
xx <- c(x, x)
f1(xx, 0, 5)
#[1] 9 31
Here is another, not so concise approach that returns the start and end of the longest run (the first one if there are ties).
f2 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
a <- ave(y, cumsum(!y), FUN=cumsum)
m <- max(a)
if (m < n) return (c(NA, NA))
i <- which(a == m)[1]
c(i-m+1, i)
}
f2(x, 0, 5)
#[1] 9 15
f2(xx, 0, 5)
#[1] 9 15
or with rle
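f3 <- function(x, threshold, n) {
y <- x > threshold
y[is.na(y)] <- FALSE
r <- rle(y)
# only runs of TRUE count; runs of FALSE get length 0
len <- r$lengths * r$values
m <- max(len)
if (m < n) return (c(NA, NA))
# end position of the first longest TRUE run
i <- sum(r$lengths[1:which.max(len)])
c(i-m+1, i)
}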
f3(x, 0, 5)
#[1] 9 15
f3(xx, 0, 5)
#[1] 9 15
If you want the first run of length at least n, ignoring any later run even if it is longer, you could do
f4 <- function(x, threshold, n) {
y <- with(rle(x > threshold), rep(lengths * values >= n, lengths))
i <- which(y)[1]
j <- i + which(!y[-c(1:i)])[1] - 1
c(i, j)
}
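When there can be several qualifying runs, it may be more useful to return the start and end of every one of them. The following is a minimal sketch built on rle(), in the same spirit as the functions above; the name all_runs and its default arguments are hypothetical, and NA values are assumed to break a run.

```r
# Sketch: start and end positions of every run of values > threshold
# that is at least n long.
all_runs <- function(x, threshold = 0, n = 5) {
  r <- rle(!is.na(x) & x > threshold)   # NA is treated as "not above threshold"
  end <- cumsum(r$lengths)              # last position of each run
  start <- end - r$lengths + 1          # first position of each run
  keep <- r$values & r$lengths >= n     # long enough runs of TRUE only
  data.frame(start = start[keep], end = end[keep])
}

x <- c(1, 3, 4, 0, 1, -5, 6, 0, 1, 3, 4, 6, 7, 5, 9, 0)
all_runs(x)
#   start end
# 1     9  15
all_runs(c(x, x))
#   start end
# 1     9  15
# 2    25  31
```

If no run qualifies, the result is a data frame with zero rows, which avoids special-casing NA return values.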

Different results when assigning a variable within a function versus calling a function from a function in R

I'm struggling to understand the source of difference in these outputs for a function I wrote that lengthens a vector to a desired length. In the first instance of the function I used variable assignment for current_length <- length(x):
lengthen_vector <- function(x, target_length){
repeat{
current_length <- length(x)
x <- append(x, current_length + 1, after = current_length)
current_length <- current_length + 1
if(current_length == target_length) {
return(x)
break
}
}
}
This gives the expected result for a target length of 20 from a starting length of 10:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
However, when I change from variable assignment to calling the length() function throughout the lengthen_vector() function as shown below:
lengthen_vector <- function(x, target_length){
repeat{
x <- append(x, length(x) + 1, after = length(x))
length(x) <- length(x) + 1
if(length(x) == target_length) {
return(x)
break
}
}
}
...results in the following:
[1] 1 2 3 4 5 6 7 8 9 10 11 NA 13 NA 15 NA 17 NA 19 NA
What is the difference between these two that is causing this? I can't seem to locate it.
The statement length(x) <- n sets the length of x to n, either by truncating it or by padding it with NAs. For example,
x <- 1:3
length(x) <- 4
x
## [1] 1 2 3 NA
so if in your second version x has 10 elements then after the first append is performed x will have 11 elements and then the length(x) <- length(x) + 1 will extend it to 12 elements by appending an NA.
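To make this concrete, the first two iterations of the second version can be traced by hand. This is only an illustrative sketch that assumes the starting vector is 1:10, as in the question.

```r
x <- 1:10

# iteration 1
x <- append(x, length(x) + 1, after = length(x))  # appends 11, length is now 11
length(x) <- length(x) + 1                        # pads with NA, length is now 12

# iteration 2
x <- append(x, length(x) + 1, after = length(x))  # appends 13, length is now 13
length(x) <- length(x) + 1                        # pads with NA, length is now 14

x
# positions 12 and 14 are NA, positions 11 and 13 hold 11 and 13
```

Each pass therefore adds one real value and one NA, which produces the alternating pattern in the output shown in the question.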
Just omit the length(x) <- length(x) + 1 statement giving:
lengthen_vector1 <- function(x, target_length){
repeat{
x <- append(x, length(x) + 1, after = length(x))
if(length(x) == target_length) {
return(x)
break
}
}
}
There are still some additional improvements that can be made:
remove the break statement since it can never be reached given that it comes after a return statement. Alternately move the return statement to after the loop.
if the target_length is less than or equal to the length of x it will loop forever. This leaves open what it should do in that case. Let us assume that if the target_length is less than the length of x that we should return x unchanged. To do these items place the if statement before the append statement and fix the if so that it returns unless the target_length exceeds the length of x. Also, if that is done then the if and repeat can be consolidated into a while statement.
since the extra numbers are added to the end of x we can use c instead of append avoiding the third argument.
Thus we can write:
lengthen_vector2 <- function(x, target_length) {
while(length(x) < target_length) {
x <- c(x, length(x) + 1)
}
x
}
lengthen_vector2(1:10, 15)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
lengthen_vector2(1:10, 3)
## [1] 1 2 3 4 5 6 7 8 9 10
Also it could be done without loops by concatenating the required sequence to the end of x. We specify that the sequence ends in target_length and the length of the sequence is target_length - length(x) or 0 if negative.
lengthen_vector3 <- function(x, target_length) {
c(x, seq(to = target_length, length = max(target_length - length(x), 0)))
}
If we wanted to be able to shrink the length as well as expand it, then call lengthen_vector3 using head(x, target_length) instead of x.
lengthen_vector4 <- function(x, target_length) {
lengthen_vector3(head(x, target_length), target_length)
}
lengthen_vector4(1:10, 15)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
lengthen_vector4(1:10, 3)
## [1] 1 2 3
or combine the last two into a single function:
lengthen_vector5 <- function(x, target_length) {
c(head(x, target_length),
seq(to = target_length, length = max(target_length - length(x), 0)))
}

Match values between two matrices based on max number of matches needed

I want to get only the rows of matrix A that have any 5, or any 4, of their numbers present in a row of matrix B. Is there an R function I can use to solve this?
Matrix A:
1 2 3 4 5
2 3 5 6 7
3 5 7 8 1
2 7 5 4 3
matrix B:
1 2 4 5 6
2 4 1 3 7
3 5 7 9 8
5 8 9 2 6
If I ask for 5 numbers from B to match rows in A, I will get no matching row.
If I ask for 4 numbers from B to match rows in A, I will get:
B1 - A1
B2 - A1, A4
B3 - A3
I don't know of anything built into R that will work, but this custom function may help you get what you're after.
find.matches <- function(A, B, num.matches){
# Set rownames for the matrix
rownames(A) = paste0(deparse(substitute(A)), 1:nrow(A))
rownames(B) = paste0(deparse(substitute(B)), 1:nrow(B))
# Create matrix indicating matching items
out <- t(apply(cbind(seq_len(nrow(B)),B), 1,
function(y) {
cur.b = y[-1]
res <- apply(cbind(seq_len(nrow(A)),A), 1,
function(z) {
cur.a = z[-1]
ifelse(sum(table(cur.a[cur.a %in% cur.b])) == num.matches, rownames(A)[z[1]], NA)})}))
# Create list of matching items
out <- apply(out, 1, function(x) paste(x[!is.na(x)]))
# Remove non matches from list
out <- out[lapply(out,length) > 0]
if(length(out) > 0){
# Convert list to a vector
out <- paste0(names(out), " - ", lapply(c(out), paste, collapse = ", "))
# Print the vector
cat(out, sep = "\n")
} else{
print("No Matches Found")
}
}
# Create matrices to compare
A <- matrix(c(1,2,3,2,2,3,5,7,3,5,7,5,4,6,8,4,5,7,1,3), nrow = 4)
B <- matrix(c(1,2,3,5,2,4,5,8,4,1,7,9,5,3,9,2,6,7,8,6), nrow = 4)
# Compare matrices
find.matches(A, B, 4)
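As a more compact alternative, the number of shared values for every pair of rows can be computed once and then filtered. The sketch below is not the function above; the name count_shared is hypothetical, and it assumes that values are not repeated within a row (as in the example), since it counts how many entries of each row of B occur in each row of A.

```r
# Sketch: counts[i, j] = number of values in row i of B that also occur in row j of A
count_shared <- function(A, B) {
  counts <- outer(seq_len(nrow(B)), seq_len(nrow(A)),
                  Vectorize(function(i, j) sum(B[i, ] %in% A[j, ])))
  dimnames(counts) <- list(paste0("B", seq_len(nrow(B))),
                           paste0("A", seq_len(nrow(A))))
  counts
}

A <- matrix(c(1,2,3,2,2,3,5,7,3,5,7,5,4,6,8,4,5,7,1,3), nrow = 4)
B <- matrix(c(1,2,3,5,2,4,5,8,4,1,7,9,5,3,9,2,6,7,8,6), nrow = 4)

counts <- count_shared(A, B)
# row/column index pairs with exactly 4 shared values
which(counts == 4, arr.ind = TRUE)
```

For the example matrices the pairs with exactly 4 shared values are B1-A1, B2-A1, B2-A4 and B3-A3, and no pair shares 5.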

Subsetting vector: how to programmatically pass negative index safely?

Given a vector, say v = 1:10, one can remove elements from v using negative indexing, e.g. v[-1], v[-length(v)], v[-c(2,3)], to remove the first, last and 2nd/3rd element respectively.
I would like to split v by passing in a split index n, taking values 0 to length(v). The code below:
v1 <- v[1:n]
v2 <- v[-c(1:n)]
works perfectly fine except for n = 0. Now I know that 1:n is generally unsafe and should be replaced with seq_len(n), however, the assignment v2 <- v[-seq_len(0)] produces an empty vector.
Is there a way of doing this 'safely' using the bracket subsetting notation? Otherwise I know how to do it using head and tail:
v1 <- head(v, n)
v2 <- tail(v, length(v) - n)
Relevant other q/as:
Complement of empty index vector is empty index vector
You could use an if() statement inside the brackets. For example, this will just return the whole vector if n is zero and remove the sequence 1:n otherwise.
x <- 1:10
n <- 0
x[ if(n == 0) TRUE else -seq_len(n) ] ## n == 0 is !n for the golfers
# [1] 1 2 3 4 5 6 7 8 9 10
n <- 5
x[ if(n == 0) TRUE else -seq_len(n) ]
# [1] 6 7 8 9 10
v = 1:10
n = 0; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`FALSE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = 1; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`FALSE`
#[1] 2 3 4 5 6 7 8 9 10
#$`TRUE`
#[1] 1
n = 10; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = -10; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = 100; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
Further simplified by thelatemail in comment
split(v, seq_along(v) > n)
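The same comparison also works directly inside brackets as a logical index, which sidesteps the empty negative index altogether. A minimal sketch, with the hypothetical helper name split_at:

```r
# Sketch: split v after position n using a logical index instead of -seq_len(n)
split_at <- function(v, n) {
  keep <- seq_along(v) <= n          # all FALSE when n is 0
  list(head = v[keep], tail = v[!keep])
}

v <- 1:10
split_at(v, 0)$tail
# [1]  1  2  3  4  5  6  7  8  9 10
split_at(v, 3)$head
# [1] 1 2 3
```

Because a logical index of all FALSE selects nothing and its negation selects everything, n = 0 and n = length(v) need no special handling.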

Combining vectors of unequal length and non-unique values

I would like to do the following:
combine into a data frame, two vectors that
have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1  vector 2        vector 1  vector 2
1         1               a         a
2         2               g         g
3         3               b         b
4         1         or    h         a
1         2               a         g
2         3               g         b
5         4               c         h
          5                         c
should be combined into data frame
1         1               a         a
2         2               g         g
3         3               b         b
4         NA              h         NA
1         1         or    a         a
2         2               g         g
NA        3               NA        b
NA        4               NA        h
5         5               c         c
I searched for merge, combine, cbind, and plyr examples but was not able to find a solution. I am afraid I will need to start writing a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) { #aligns two sequences
s12 <- sort(unique(c(s1, s2)))
cbind(ifelse(s12 %in% s1, s12, NA),
ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
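The splitting step relies on the second assumption above: a new sequence starts wherever the value does not increase. As a short illustration of how the grouping vector is built for v1 (the variable names starts and group are only for this sketch):

```r
v1 <- c(1, 2, 3, 4, 1, 2, 5)

# TRUE marks the first element of each sequence
starts <- c(TRUE, diff(v1) <= 0)
# running count of sequence starts = sequence id of each element
group <- cumsum(starts)
group
# [1] 1 1 1 1 2 2 2

split(v1, group)
# $`1`
# [1] 1 2 3 4
#
# $`2`
# [1] 1 2 5
```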
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
m <- length(X)
n <- length(Y)
C <- matrix(0, 1 + m, 1 + n)
for (i in seq_len(m)) {
for (j in seq_len(n)) {
if (X[i] == Y[j]) {
C[i + 1, j + 1] = C[i, j] + 1
} else {
C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
}
}
}
backtrack <- function(C, X, Y, i, j) {
if (i == 1 | j == 1) {
return(data.frame(I = c(), J = c(), LCS = c()))
} else if (X[i - 1] == Y[j - 1]) {
return(rbind(backtrack(C, X, Y, i - 1, j - 1),
data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
} else if (C[i, j - 1] > C[i - 1, j]) {
return(backtrack(C, X, Y, i, j - 1))
} else {
return(backtrack(C, X, Y, i - 1, j))
}
}
return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
ALL <- merge(LCS, X.df, by = "I", all = TRUE)
ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
ifelse(is.na(ALL$J), 0, ALL$J))), ]
ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)
