How to remove items in a vector from another vector - r

I have two vectors:
a = c(1,1,2,2,3,3,4,4)
b = c(1)
I want to remove the first match of b from a. Thus, here only the first 1 is removed from a:
c = c(1,2,2,3,3,4,4)
The order of items in a is not important.
I tried this code:
a[a != b]
a[! a %in% b]
Both results are:
[1] 2 2 3 3 4 4
All occurrences of 1 are removed. However, I only want to remove the specific items in b from a.
If b = c(1, 1, 2), then I want the result to be
[1] 2 3 3 4 4
a[-(1:3)]
The above code produces [1] 2 3 3 4 4. However, I would like something more flexible, for example when the order of the items is unknown or random:
a = c(3,4,3,1,2,2,1,4)
How can I do it using R?

The vecsets package can perform standard set operations while retaining duplicates:
vecsets::vsetdiff( c(1,1,2,2,3,3,4,4), c(1) )
## [1] 1 2 2 3 3 4 4
vecsets::vsetdiff( c(1,1,2,2,3,3,4,4), c(1,1,2) )
## [1] 2 3 3 4 4
Note that it will preserve the order of the first argument. Using your last example:
vecsets::vsetdiff( c(3,4,3,1,2,2,1,4), c(1,1,2) )
## [1] 3 4 3 2 4
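If installing a package is not an option, the same duplicate-aware difference can be sketched in base R. The function name multiset_diff is hypothetical (it is not part of vecsets), and the sketch assumes plain numeric vectors: each element is tagged with its occurrence number, so the second 1 in a is only removed if b also contains a second 1.

```r
# Hypothetical helper, not part of any package: duplicate-aware set difference.
multiset_diff <- function(a, b) {
  if (length(b) == 0) return(a)
  # occurrence number of each element among equal values (1st, 2nd, ...)
  occ_a <- ave(seq_along(a), a, FUN = seq_along)
  occ_b <- ave(seq_along(b), b, FUN = seq_along)
  # drop the k-th occurrence of a value only if b also has a k-th occurrence
  keep <- !(paste(a, occ_a) %in% paste(b, occ_b))
  a[keep]
}

multiset_diff(c(1,1,2,2,3,3,4,4), c(1,1,2))
## [1] 2 3 3 4 4
multiset_diff(c(3,4,3,1,2,2,1,4), c(1,1,2))
## [1] 3 4 3 2 4
```

Like vsetdiff, this keeps the order of the first argument.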

Taking inspiration from this answer to one of the questions I linked in a comment, you can use fsetdiff from the data.table package.
It takes an all argument, which avoids returning only the unique values, as happens with setdiff:
library(data.table)
# with your first example (b = c(1)):
unlist(fsetdiff(data.table(v1=a), data.table(v1=b), all = TRUE))
# v11 v12 v13 v14 v15 v16 v17
# 1 2 2 3 3 4 4
# with second example (b = c(1, 1, 2)):
unlist(fsetdiff(data.table(v1=a), data.table(v1=b), all = TRUE))
# v11 v12 v13 v14 v15
# 2 3 3 4 4

You can use which()
a = c(3, 4, 3, 1, 2, 2, 1, 4)
a
## [1] 3 4 3 1 2 2 1 4
b = 1
a[- which(a %in% b)[1]]
## [1] 3 4 3 2 2 1 4
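Case b has two elements:
b2 = c(1, 2)
# iterate over the values in b2 (not their positions); `<<-` updates a in place
sapply(b2, function(x) a <<- a[- which(a == x)[1]])[[2]]
## [1] 3 4 3 2 1 4
Or three...
a = c(3, 4, 3, 1, 2, 2, 1, 4) # reset a, since the call above modified it
b3 <- c(1, 2, 3)
sapply(b3, function(x) a <<- a[- which(a == x)[1]])[[3]]
## [1] 4 3 2 1 4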

I don't think that the following is the best solution (the vecsets approach strikes me as the best), but @Aaron's comment about possibly using Rcpp struck me as interesting. This is the first time I used that package. If nothing else, the fact that I was able to get working code in less than 20 minutes underscores his point that Rcpp makes it relatively easy:
library(Rcpp)
cppFunction('
NumericVector difference(NumericVector xs, NumericVector ys){
int m = xs.size();
int n = ys.size();
float flag = 1 + abs(max(xs)) + abs(max(ys)); //occurs in neither xs nor ys
NumericVector zs = clone(xs);
for(int i = 0; i < n; i++){
double y = ys[i];
int j = 0;
while(j < m && zs[j]!= y) j++;
if(j < m) zs[j] = flag;
}
int count = 0;
for(int k = 0; k < m; k++){
if(zs[k] < flag) count++;
}
NumericVector ws(count);
int k = 0;
for(int j = 0; j < m; j++){
if(zs[j] < flag){
ws[k] = zs[j];
k++;
}
}
return ws;
}
')
After you source this:
> a = c(1,1,2,2,3,3,4,4)
> b = c(1,2,1)
> difference(a,b)
[1] 2 3 3 4 4
Since this was my first attempt at such code, I'm sure that it could be improved in multiple ways.

The argument order is a little frustrating, but Reduce and which do it with just base R.
# fold over the elements of b, starting from a and dropping the first match each time
Reduce(function(a, b) a[-which(a == b)[1]], b, a)
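One caveat with indexing by -which(...)[1] is that an element of b that does not occur in a yields NA as the index. A minimal sketch of the same fold, assuming such elements should simply be skipped, uses match instead (the name remove_first_matches is made up for illustration):

```r
# Hypothetical helper: remove the first match of each element of b from a.
remove_first_matches <- function(a, b) {
  Reduce(function(acc, x) {
    i <- match(x, acc)              # position of the first match, or NA
    if (is.na(i)) acc else acc[-i]  # skip values that are not present
  }, b, a)                          # x = b is folded over, a is the initial value
}

remove_first_matches(c(3,4,3,1,2,2,1,4), c(1,1,2))
## [1] 3 4 3 2 4
remove_first_matches(c(1,2,3), 9)
## [1] 1 2 3
```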

Related

R, function that turns numeric input into a recycled numeric output bounded by an upper limit

How can I write a function f(v, n) (or use a base R function) that turns a numeric vector v into another based on n, a recycling factor (for lack of a better word)?
For instance:
f(v = c(1,2,3,4,5), n = 1) #would yield c(1,2,3,4,5)
f(v = c(1,2,3,4,5), n = 2) #would yield c(1,2,1,2,1)
f(v = c(1,2,3,4,5), n = 3) #would yield c(1,2,3,1,2)
f(v = c(5,4,3,2,1), n = 3) #would yield c(2,1,3,2,1)
f(v = c(3,6), n = 3) #would yield c(3,3)
The closest I got was to use %%
1:5%%3 #giving me: [1] 1 2 0 1 2 - not quite what I want, at least some recycling.
We can create the function as
f <- function(x, n) if(n ==1) x else (x - 1) %% n + 1
f(1:5, 1)
#[1] 1 2 3 4 5
f(1:5, 2)
#[1] 1 2 1 2 1
f(1:5, 3)
#[1] 1 2 3 1 2
f(5:1, 3)
#[1] 2 1 3 2 1
f(c(3, 6), 3)
#[1] 3 3
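To see why the shift by one is needed, it may help to look at the intermediate values. This is only an illustration of the arithmetic for n = 3; the variable names are arbitrary.

```r
x <- 1:5
n <- 3
plain   <- x %% n            # 1 2 0 1 2: multiples of n wrap to 0
zeroed  <- (x - 1) %% n      # 0 1 2 0 1: shift to a 0-based cycle first
shifted <- (x - 1) %% n + 1  # 1 2 3 1 2: shift back to a 1-based cycle
shifted
## [1] 1 2 3 1 2
```

The n == 1 branch in f is a special case: without it, every value would map to 1, whereas the question asks for v to be returned unchanged.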

Minimum Cost Distance in Matrix

I have got this matrix below
k
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    9   16   25   36
[2,]    1    3    7   13   21   31
[3,]    2    2    5   10   17   26
[4,]    4    2    4    8   14   22
[5,]    7    3    3    6   11   18
[6,]   11    5    3    5    9   15
and I want to loop through starting from k[1,1] and ending at k[6,6]. My looping criteria is based on min(k[i,j+1], k[i+1,j], k[i+1, j+1]) and the answer I hope to get is something like 1+1+2+2+3+3+5+9+15 = 41 (travelling through the minimum path)
So pretty much it calculates the minimum starting from k[1,1] and then continues downwards till k[6,6]
warpingDist = function(x, y, z){
mincal = numeric(length(k))
m = nrow(k)
n = ncol(k)
i=1
j=1
mincal = which(k == min(k[i, j+1], k[i+1, j], k[i+1, j+1]), arr.ind = TRUE)
indx = data.frame(mincal)
i= indx$row
j= indx$col
if(i != m || j!=n)
{
warpingDist(k[i, j+1], k[i+1, j], k[i+1, j+1])
}
warpSum = sum(mincal)
return(warpSum)
}
value = apply(k, c(1,2), warpingDist)
value
When I run this code it displays the below:
Error: object 'value' not found
Not sure why this is happening...
As you don't provide a minimal reproducible example, I can only guess:
warpingDist = function(x, y, z, k){
# browser() # This is a good option to activate, if you run your script in RStudio
...
return(warpSum)
}
# your code
k <- whatever it is
result <- warpingDist(x, y, z, k)
I hope that helps.
I'm glad I was finally able to solve the problem. The code runs fast as well.
Problem: To find the minimum cost for a matrix. For clarity, let's assume I have the matrix given below:
[1,] 1 4 6 7 8 9 0
[2,] 10 12 1 3 11 2 0
[3,] 11 12 2 8 17 1 0
[4,] 20 1 18 4 28 1 0
[5,] 5 20 80 6 9 3 0
My goal is to add up the minimum path distance starting from Data[1,1] in the first row and ending in the last row at Data[5,4]. So effectively, I want to have something like 1 + 4 + 1 + 2 + 4 + 6 + 9 + 3.
Below is the R code which I have used to implement this. It implements two functions:
# Function that calculates minimum of three values. Returns the Value.
minFUN <- function(Data, a, b){
d = (min(Data[a, b+1], Data[a+1, b], Data[a+1, b+1]))
return(d)
}
# Function that finds the index of the minimum value, from which
# the next iteration begins
NextRC <- function(Data, a, b){
d = min(Data[a, b+1], Data[a+1, b], Data[a+1, b+1])
if(d == Data[a, b+1]){
c = cbind(a, b+1)
}else
if(d == Data[a+1, b]){
c = cbind(a+1, b)
} else
if(d == Data[a+1, b+1]){
c = cbind(a+1, b+1)
}
return(c)
}
Je <- c()
NewRow = 1
NewCol = 1
# Warping Function that uses both functions above to loop through the dataset
WarpDist <- function(Data, a = NewRow, b = NewCol){
for(i in 1:5) { # five moves: from Data[1,1] down to the last row at Data[5,4]
Je[i] = minFUN(Data, a, b)
# Next Start Point
NSP = NextRC(Data, a,b)
NewRow = as.numeric(NSP[1,1])
NewCol = as.numeric(NSP[1,2])
a = NewRow
b = NewCol
}
return(Je)
}
Value=WarpDist(Data = Data, a = NewRow, b = NewCol)
warpo = Data[1,1] + sum(Value)
w = sqrt(warpo)
The result is the minimum path from the first row to the last row
Value
[1] 4 1 2 4 6
The result omits 9 and 3 because they are already on the last row.
Time:
Time difference of 0.08833408 secs
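The same greedy walk can be sketched more compactly with a while loop that stops by itself once the last row or column is reached, so the number of steps does not have to be hard-coded. The function name greedy_path is hypothetical, and Data is rebuilt here from the matrix shown above. Note that this, like the code above, follows the locally smallest neighbour at each step; it is not guaranteed to find the globally cheapest path.

```r
# The example matrix from the answer, entered row by row.
Data <- matrix(c( 1,  4,  6, 7,  8, 9, 0,
                 10, 12,  1, 3, 11, 2, 0,
                 11, 12,  2, 8, 17, 1, 0,
                 20,  1, 18, 4, 28, 1, 0,
                  5, 20, 80, 6,  9, 3, 0),
               nrow = 5, byrow = TRUE)

# Hypothetical helper: greedy walk from m[1, 1] until the last row or column.
greedy_path <- function(m) {
  i <- 1
  j <- 1
  visited <- m[1, 1]
  while (i < nrow(m) && j < ncol(m)) {
    # candidate moves: right, down, diagonal (ties resolved in this order)
    moves <- rbind(c(i, j + 1), c(i + 1, j), c(i + 1, j + 1))
    best <- moves[which.min(m[moves]), ]
    i <- best[1]
    j <- best[2]
    visited <- c(visited, m[i, j])
  }
  visited
}

greedy_path(Data)
## [1] 1 4 1 2 4 6
sum(greedy_path(Data))
## [1] 18
```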

Subsetting vector: how to programmatically pass negative index safely?

Given a vector, say v = 1:10, one can remove elements from v using negative indexing, e.g. v[-1], v[-length(v)], v[-c(2,3)], to remove the first, last and 2nd/3rd element respectively.
I would like to split v by passing in a split index n, taking values 0 to length(v). The code below:
v1 <- v[1:n]
v2 <- v[-c(1:n)]
works perfectly fine except for n = 0. Now I know that 1:n is generally unsafe and should be replaced with seq_len(n), however, the assignment v2 <- v[-seq_len(0)] produces an empty vector.
Is there a way of doing this 'safely' using the bracket subsetting notation? Otherwise I know how to do it using head and tail:
v1 <- head(v, n)
v2 <- tail(v, length(v) - n)
Relevant other q/as:
Complement of empty index vector is empty index vector
You could use an if() statement inside the brackets. For example, this will just return the whole vector if n is zero and remove the sequence 1:n otherwise.
x <- 1:10
n <- 0
x[ if(n == 0) TRUE else -seq_len(n) ] ## n == 0 is !n for the golfers
# [1] 1 2 3 4 5 6 7 8 9 10
n <- 5
x[ if(n == 0) TRUE else -seq_len(n) ]
# [1] 6 7 8 9 10
v = 1:10
n = 0; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`FALSE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = 1; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`FALSE`
#[1] 2 3 4 5 6 7 8 9 10
#$`TRUE`
#[1] 1
n = 10; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = -10; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
n = 100; split(v, seq_along(v)/min(n,length(v)) <= 1)
#$`TRUE`
# [1] 1 2 3 4 5 6 7 8 9 10
Further simplified by thelatemail in comment
split(v, seq_along(v) > n)
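Another way to stay within bracket notation is to avoid negative indices altogether and use a logical mask, which behaves sensibly for n = 0. This is a small sketch; the function name split_at is made up for illustration.

```r
# Hypothetical helper: split v after its first n elements using a logical mask.
split_at <- function(v, n) {
  first <- seq_along(v) <= n   # TRUE for the first n elements, all FALSE if n = 0
  list(v1 = v[first], v2 = v[!first])
}

split_at(1:10, 0)$v2
## [1]  1  2  3  4  5  6  7  8  9 10
split_at(1:10, 3)
## $v1
## [1] 1 2 3
##
## $v2
## [1]  4  5  6  7  8  9 10
```

Unlike v[-seq_len(n)], negating a logical vector never produces an empty index, so the n = 0 case needs no special handling.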

insert elements in a vector in R

I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b and d,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without recursively using cbind or so.
I found a package, R.basic, but it's not on CRAN, so I would prefer a supported alternative.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
EDIT: generalization to arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...){
dots <- list(...)
stopifnot(length(dots)==length(pos))
result <- vector("list",2*length(pos)+1)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos+1)))
result[c(FALSE,TRUE)] <- dots
unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions do not match.
You can use the following function,
ins(a, list(b, d), pos=c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
where:
ins <- function(a, to.insert=list(), pos=c()) {
c(a[seq(pos[1])],
to.insert[[1]],
a[seq(pos[1]+1, pos[2])],
to.insert[[2]],
a[seq(pos[2]+1, length(a))] # start after pos[2] so a[pos[2]] is not repeated
)
}
Here's another function, using Ricardo's syntax, Ferdinand's split and @Arun's interleaving trick from another question:
ins2 <- function(a,bs,pos){
as <- split(a,cumsum(seq(a)%in%(pos+1)))
idx <- order(c(seq_along(as),seq_along(bs)))
unlist(c(as,bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of parentheses for the boundaries of the : operator.
After using Ferdinand's function, I tried to write my own, and surprisingly it is far more efficient.
Here's mine:
insertElems = function(vect, pos, elems) {
l = length(vect)
j = 0
for (i in 1:length(pos)){
if (pos[i]==1)
vect = c(elems[j+1], vect)
else if (pos[i] == length(vect)+1)
vect = c(vect, elems[j+1])
else
vect = c(vect[1:(pos[i]-1+j)], elems[j+1], vect[(pos[i]+j):(l+j)])
j = j+1
}
return(vect)
}
tmp = c(seq(1:5))
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And here is the benchmark result:
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
                                        expr    min     lq     mean median     uq      max neval
 insertElems(tmp, c(2, 4, 5), c(NA, NA, NA))  9.660 11.472 13.44247  12.68 13.585 1630.421 10000
   insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281  64.30 67.923 2475.366 10000
My code also works better in some cases:
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large vectors since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to reverse the vector of things that need to be inserted to get append to insert them in the correct place relative to the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(as.numeric(names(z)), decreasing=TRUE)] # sort numerically; names are strings, so "10" would otherwise sort before "3"
for (i in seq_along(z)) {
a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
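A loop-free alternative is to give every value a sort key and let order do the interleaving. The sketch below is an assumption-laden illustration, not taken from any of the answers: insert_by_key is a made-up name, bs is expected to be a list with one vector per position, and the positions are assumed to refer to the original vector.

```r
# Hypothetical helper: insert each element of the list bs after position pos[i] of a.
insert_by_key <- function(a, pos, bs) {
  stopifnot(length(pos) == length(bs))
  vals <- c(a, unlist(bs))
  # original elements keep their own index as key;
  # inserted elements get pos + 0.5, placing them right after a[pos]
  key <- c(seq_along(a), rep(pos + 0.5, lengths(bs)))
  vals[order(key)]   # ties keep their original order, so each insert stays in sequence
}

a <- c(2,3,4,9,10,2,4,19)
insert_by_key(a, c(3, 7), list(c(2,1), c(0,1)))
##  [1]  2  3  4  2  1  9 10  2  4  0  1 19
```

Because the whole result is built in one step, no intermediate copies of a are created per insertion.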

Combining vectors of unequal length and non-unique values

I would like to do the following:
combine into a data frame two vectors that
- have different lengths
- contain sequences also found in the other vector
- contain sequences not found in the other vector
- have sequences not found in the other vector that are never longer than 3 elements
- always have the same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1  vector 2        vector 1  vector 2
1         1               a         a
2         2               g         g
3         3               b         b
4         1         or    h         a
1         2               a         g
2         3               g         b
5         4               c         h
          5                         c
should be combined into the data frame
1         1               a         a
2         2               g         g
3         3               b         b
4         NA              h         NA
1         1         or    a         a
2         2               g         g
NA        3               NA        b
NA        4               NA        h
5         5               c         c
I searched for merge, combine, cbind and plyr examples but was not able to find a solution. I am afraid I will need to start writing a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) { #aligns two sequences
s12 <- sort(unique(c(s1, s2)))
cbind(ifelse(s12 %in% s1, s12, NA),
ifelse(s12 %in% s2, s12, NA))
}
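# SIMPLIFY = FALSE keeps the result a list even when all pieces have the same shape
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences, SIMPLIFY = FALSE))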
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
m <- length(X)
n <- length(Y)
C <- matrix(0, 1 + m, 1 + n)
for (i in seq_len(m)) {
for (j in seq_len(n)) {
if (X[i] == Y[j]) {
C[i + 1, j + 1] = C[i, j] + 1
} else {
C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
}
}
}
backtrack <- function(C, X, Y, i, j) {
if (i == 1 | j == 1) {
return(data.frame(I = c(), J = c(), LCS = c()))
} else if (X[i - 1] == Y[j - 1]) {
return(rbind(backtrack(C, X, Y, i - 1, j - 1),
data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
} else if (C[i, j - 1] > C[i - 1, j]) {
return(backtrack(C, X, Y, i, j - 1))
} else {
return(backtrack(C, X, Y, i - 1, j))
}
}
return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
ALL <- merge(LCS, X.df, by = "I", all = TRUE)
ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
ifelse(is.na(ALL$J), 0, ALL$J))), ]
ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)

Resources