Match values between two matrices based on max number of matches needed - r

I want to get only the rows from matrix A that have only any 5 or any 4 numbers existing in matrix B. Is there an R function I can use to solve it?
Μatrix A
1 2 3 4 5
2 3 5 6 7
3 5 7 8 1
2 7 5 4 3
matrix B:
1 2 4 5 6
2 4 1 3 7
3 5 7 9 8
5 8 9 2 6
if I ask for 5 numbers from B to match rows in A I will get no matching row.
if I ask for 4 numbers from B to match rows in A I will get :
B1 - A1
B2 - A1, A4
B3 - A3

I don't know of anything built into R that will work, but this custom function may help you get what you're after.
find.matches <- function(A, B, num.matches){
# Set rownames for the matrix
rownames(A) = paste0(deparse(substitute(A)), 1:nrow(A))
rownames(B) = paste0(deparse(substitute(B)), 1:nrow(B))
# Create matrix indicating matching items
out <- t(apply(cbind(seq_len(nrow(B)),B), 1,
function(y) {
cur.b = y[-1]
res <- apply(cbind(seq_len(nrow(A)),A), 1,
function(z) {
cur.a = z[-1]
ifelse(sum(table(cur.a[cur.a %in% cur.b])) == num.matches, rownames(A)[z[1]], NA)})}))
# Create list of matching items
out <- apply(out, 1, function(x) paste(x[!is.na(x)]))
# Remove non matches from list
out <- out[lapply(out,length) > 0]
if(length(out) > 0){
# Convert list to a vector
out <- paste0(names(out), " - ", lapply(c(out), paste, collapse = ", "))
# Print the vector
cat(out, sep = "\n")
} else{
print("No Matches Found")
}
}
# Create matrices to compare
A <- matrix(c(1,2,3,2,2,3,5,7,3,5,7,5,4,6,8,4,5,7,1,3), nrow = 4)
B <- matrix(c(1,2,3,5,2,4,5,8,4,1,7,9,5,3,9,2,6,7,8,6), nrow = 4)
# Compare matrices
find.matches(A, B, 4)

Related

Producing equal results for column entries which are the same

I have the piece of code below. What i want is alter the code
such that when the column entries for mat are the same, i get the same result in their respective positions in summation without it performing the fit operation?
So instead of getting
1 3 5 1 7 1
2 4 6 2 8 2
3 9 12 5 16 4
I want
1 3 5 1 7 1
2 4 6 2 8 2
3 9 12 3 16 3
set.seed(123)
fit = function(A){
x = A[1]
y = A[2]
z = sum(sample((x+y),2))
return(z)
}
mat= matrix(c(1,2,3,4,5,6,1,2,7,8,1,2),nrow=2,ncol=6)
summation=apply(mat, 2, FUN = 'fit')
newmat=rbind(mat,summation)
newmat
You can find out columns that are duplicates and replace the corresponding summation value with the first value of summation so that you get the same value.
fit = function(A){
x = A[1]
y = A[2]
z = sum(sample((x+y),2))
return(z)
}
mat= matrix(c(1,2,3,4,5,6,1,2,7,8,1,2),nrow=2,ncol=6)
summation=apply(mat, 2, FUN = 'fit')
vals <- apply(mat, 2, paste0, collapse = '-')
summation <- ave(summation, match(vals, unique(vals)), FUN = function(x) x[1])
newmat=rbind(mat,summation)
newmat
To pass only unique columns to fit function we can do :
fit = function(A){
x = A[1]
y = A[2]
z = sum(sample((x+y),2))
return(z)
}
mat= matrix(c(1,2,3,4,5,6,1,2,7,8,1,2),nrow=2,ncol=6)
vals <- apply(mat, 2, paste0, collapse = '-')
summation <- apply(mat[, !duplicated(vals)], 2, fit)
summation <- summation[match(vals, unique(vals))]
newmat=rbind(mat,summation)
newmat

R: Count appereances of vector y in x

Suppose you have two vectors y and x with length(y) < length(x).
What I want is to count how often all elements of y appear in x in the same order.
I can assume that the elements of y appear at least one time in x in the right order.
Example:
y = c(10,20)
x = c(10,20,20,10,20)
The indexpairs of appereances of y in x are:
(1,2),(1,3),(1,5),(4,5)
so my result should be 4.
I've already written a function:
countAllPositionsOfLCS <- function(y, x) {
potIndexList <- lapply(y, function(k) {
which(k == x)
})
previousIndices <- potIndexList[[1]]
counter <- length(potIndexList[[1]])
if (length(potIndexList) >= 2) {
for (k in 2:length(potIndexList)) {
newIndices <- potIndexList[[k]]
for (i in 1:length(previousIndices)) {
currentFittingInds <- newIndices[which(previousIndices[i] < newIndices)]
counter <- counter + length(currentFittingInds) - 1
}
previousIndices <- newIndices
}
}
counter
}
I tested this function with
c(10,20,30) and c(10,20,20,10,20,20,30,30)
The proper result should be 12, but the function says 10.
Indexpairs are 1,2,7 | 1,2,8 | 1,3,7 | 1,3,8 | 1,5,7 | 1,5,8 | 1,6,7 | 1,6,8 | 4,5,7 | 4,5,8 | 4,6,7 | 4,6,8.
My idea was following:
First of all I look where the values of y appear in x and for each value of y I store the indices in the list potIndexList.
then I go through all elements of this list, say we are at element potIndexList[[k]], so we have all the indices there, where the k-th element of y apperas in x. Since I want to preserve the order of y in x, I have to get rid of some indices. Therefore I go through the indices and check if previousIndices[i] < potIndexList[[k]]. If this is TRUE I know that the order is right.
I suggest using expand.grid unless your actual problem involves much larger vectors:
y = c(10,20,30)
x = c(10,20,20,10,20,20,30,30)
#find matches:
inds <- lapply(y, function(z) which(z == x))
#all combinations of matches:
res <- do.call(expand.grid, inds)
#remove combinations with wrong order:
res <- res[apply(res, 1, function(z) all(order(z) == seq_along(y))),]
# Var1 Var2 Var3
# 1 1 2 7
# 3 1 3 7
# 5 1 5 7
# 6 4 5 7
# 7 1 6 7
# 8 4 6 7
# 9 1 2 8
#11 1 3 8
#13 1 5 8
#14 4 5 8
#15 1 6 8
#16 4 6 8
nrow(res)
#[1] 12

How to squeeze in missing values into a vector

Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how I would use these two variables to generate a third variable c which like a has 20 observations but for which observations 7, 11 and 15 are missing, and for which the other observations are identical to b but in the order of a?
Or to put it somewhat differently: how could I squeeze in these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is) but I have been not getting this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
2) Reduce Here is a loop-free version:
f <- function(j, a) j + (a == B[j])
r <- Reduce(f, A, acc = TRUE)
ifelse(duplicated(r), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in (seq_along(a[,1]))) {
if (a[i,1]==b[j,1]) {
j <- j+1
} else
{
c[i,1] <- NA
}
}

insert elements in a vector in R

I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b, and c,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without recursively using cbind or so.
I found a package R.basic but its not part of CRAN packages so I thought about using a supported version.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
EDIT: generalization to arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...){
dots <- list(...)
stopifnot(length(dots)==length(pos))
result <- vector("list",2*length(pos)+1)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos+1)))
result[c(FALSE,TRUE)] <- dots
unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions do not match.
You can use the following function,
ins(a, list(b, d), pos=c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 4 19
where:
ins <- function(a, to.insert=list(), pos=c()) {
c(a[seq(pos[1])],
to.insert[[1]],
a[seq(pos[1]+1, pos[2])],
to.insert[[2]],
a[seq(pos[2], length(a))]
)
}
Here's another function, using Ricardo's syntax, Ferdinand's split and #Arun's interleaving trick from another question:
ins2 <- function(a,bs,pos){
as <- split(a,cumsum(seq(a)%in%(pos+1)))
idx <- order(c(seq_along(as),seq_along(bs)))
unlist(c(as,bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of parenthesis for the boundaries of the : operator.
After using Ferdinand's function, I tried to write my own and surprisingly it is far more efficient.
Here's mine :
insertElems = function(vect, pos, elems) {
l = length(vect)
j = 0
for (i in 1:length(pos)){
if (pos[i]==1)
vect = c(elems[j+1], vect)
else if (pos[i] == length(vect)+1)
vect = c(vect, elems[j+1])
else
vect = c(vect[1:(pos[i]-1+j)], elems[j+1], vect[(pos[i]+j):(l+j)])
j = j+1
}
return(vect)
}
tmp = c(seq(1:5))
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And there's the benchmark result :
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
insertElems(tmp, c(2, 4, 5), c(NA, NA, NA)) 9.660 11.472 13.44247 12.68 13.585 1630.421 10000
insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281 64.30 67.923 2475.366 10000
my code works even better for some cases :
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large vectors since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to reverse the vector of things that need to be inserted to get append to insert them in the correct place relative to the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(names(z), decreasing=TRUE)]
for (i in seq_along(z)) {
a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19

Combining vectors of unequal length and non-unique values

I would like to do the following:
combine into a data frame, two vectors that
have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1 vector 2 vector 1 vector 2
1 1 a a
2 2 g g
3 3 b b
4 1 or h a
1 2 a g
2 3 g b
5 4 c h
5 c
should be combined into data frame
1 1 a a
2 2 g g
3 3 b b
4 NA h NA
1 1 or a a
2 2 g g
NA 3 NA b
NA 4 NA h
5 5 c c
What I did, is to search for merge, combine, cbind, plyr examples but was not able to find solutions. I am afraid I will need to start write a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) { #aligns two sequences
s12 <- sort(unique(c(s1, s2)))
cbind(ifelse(s12 %in% s1, s12, NA),
ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
m <- length(X)
n <- length(Y)
C <- matrix(0, 1 + m, 1 + n)
for (i in seq_len(m)) {
for (j in seq_len(n)) {
if (X[i] == Y[j]) {
C[i + 1, j + 1] = C[i, j] + 1
} else {
C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
}
}
}
backtrack <- function(C, X, Y, i, j) {
if (i == 1 | j == 1) {
return(data.frame(I = c(), J = c(), LCS = c()))
} else if (X[i - 1] == Y[j - 1]) {
return(rbind(backtrack(C, X, Y, i - 1, j - 1),
data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
} else if (C[i, j - 1] > C[i - 1, j]) {
return(backtrack(C, X, Y, i, j - 1))
} else {
return(backtrack(C, X, Y, i - 1, j))
}
}
return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
ALL <- merge(LCS, X.df, by = "I", all = TRUE)
ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
ifelse(is.na(ALL$J), 0, ALL$J))), ]
ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)

Resources