Calculate cumsum() while ignoring NA values - r

Consider the following named vector x.
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a b c d e f g h
# 1 2 0 NA 4 NA NA 6
I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.
I can do it this way.
y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.
What better methods are there to return the cumulative sum while effectively ignoring NA values?

You can do this in one line with:
cumsum(ifelse(is.na(x), 0, x)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
Or, similarly:
library(dplyr)
cumsum(coalesce(x, 0)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
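To see why the `+ x*0` term is needed, it may help to split the one-liner into its parts. This is only an illustrative sketch: the intermediate names `filled` and `mask` are mine, not from the answer, and it assumes the non-missing values are finite (for `Inf`, `x*0` gives `NaN`).

```r
x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8])

# Step 1: replace NA with 0 so cumsum() is not poisoned by missing values
filled <- ifelse(is.na(x), 0, x)

# Step 2: x * 0 is 0 where x is observed and NA where x is missing
mask <- x * 0

# Step 3: adding the mask puts NA back at the originally missing positions
res <- cumsum(filled) + mask
res
```

Without the mask, the NA positions would simply carry the running total forward.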

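It's an old question, but tidyr offers another solution, based on the same idea of replacing NA with zero. Note that, unlike the answers above, the NA positions end up holding the running total rather than NA.
library(tidyr)
cumsum(replace_na(x, 0))
# a b c d e f g h
# 1 3 3 3 7 7 7 13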

Do you want something like this:
x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])
x2
[edit] Alternatively, as suggested in a comment above, you can change the NAs to 0s:
miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum

Here's a function I came up with from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum(), prod(), min(), or max(), and x is a numeric vector.
cumSkipNA <- function(x, FUNC)
{
d <- deparse(substitute(FUNC))
funs <- c("max", "min", "prod", "sum")
stopifnot(is.vector(x), is.numeric(x), d %in% funs)
FUNC <- match.fun(paste0("cum", d))
x[!is.na(x)] <- FUNC(x[!is.na(x)])
x
}
set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1] 4 NA 9 14 NA 14 NA 10 10 1
cumSkipNA(x, sum)
# [1] 4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1] 4 NA 36 504 NA 7056 NA
# [8] 70560 705600 705600
cumSkipNA(x, min)
# [1] 4 NA 4 4 NA 4 NA 4 4 1
cumSkipNA(x, max)
# [1] 4 NA 9 14 NA 14 NA 14 14 14
Definitely nothing new, but maybe useful to someone.

Another option is using the collapse package with fcumsum function like this:
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#> a b c d e f g h
#> 1 2 0 NA 4 NA NA 6
library(collapse)
fcumsum(x)
#> a b c d e f g h
#> 1 3 3 NA 7 NA NA 13
Created on 2022-08-24 with reprex v2.0.2

Related

The usage of "If", "else", "is.na()" in one function of R

I have Length and Weight values in a data frame. However some of them are missing. The data frame is like:
df <- data.frame(
L = c(13,15,19,NA,NA,32,35,NA,NA,18,15),
W = c(NA,NA,50, NA,NA,NA,80,NA,NA,30,NA)
)
I need a function that acts only where length is not NA and weight is NA: there it should calculate the weight from the length, and otherwise it should leave the value unchanged.
lwr <- function(data, length, weight, a, b) {
if(!is.na(data$length) && is.na(data$weight)) {
data$weight = 10^(log(a) + b*log(data$length))
} else {
data$weight
}
}
Here we go
lwr(data=df, length = L, weight = W, a=0.003, b=3.2)
but it does not work.
I would appreciate any help. Thank you very much for your time.
You could probably do that more easily.
f <- \(x, a, b) 10^(log(a) + b*log(x))
naw <- is.na(df$W)
df$W[naw] <- f(df$L[naw], .003, 3.2)
# L W
# 1 13 250.4350
# 2 15 718.8159
# 3 19 50.0000
# 4 NA NA
# 5 NA NA
# 6 32 191078.5331
# 7 35 80.0000
# 8 NA NA
# 9 NA NA
# 10 18 30.0000
# 11 15 718.8159
You should use the vectorized ifelse(), instead of if()...else....
lwr <- function(length, weight, a, b) {
ifelse(is.na(weight), 10^(log(a) + b*log(length)), weight)
}
df |>
transform(W2 = lwr(L, W, a=0.003, b=3.2))
# equivalent:
# df$W2 <- lwr(df$L, df$W, a=0.003, b=3.2)
# L W W2
# 1 13 NA 250.4350
# 2 15 NA 718.8159
# 3 19 50 50.0000
# 4 NA NA NA
# 5 NA NA NA
# 6 32 NA 191078.5331
# 7 35 80 80.0000
# 8 NA NA NA
# 9 NA NA NA
# 10 18 30 30.0000
# 11 15 NA 718.8159
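If you want to keep the data-frame-in, data-frame-out shape of the function in the question, something like the following might work. It is only a sketch: `fill_weight` and its `length_col`/`weight_col` arguments are hypothetical names, and the columns are passed as strings rather than bare names. The original `if()` fails because its condition is a whole vector, while `if()` expects a single TRUE or FALSE; here a logical index does the per-row selection instead.

```r
# Hypothetical helper: fill missing weights from length, row by row
fill_weight <- function(data, length_col, weight_col, a, b) {
  len <- data[[length_col]]
  wt  <- data[[weight_col]]
  # rows where the weight is missing but the length is known
  missing_w <- is.na(wt) & !is.na(len)
  # same formula as in the question, applied only to those rows
  wt[missing_w] <- 10^(log(a) + b * log(len[missing_w]))
  data[[weight_col]] <- wt
  data
}

df <- data.frame(
  L = c(13, 15, 19, NA, NA, 32, 35, NA, NA, 18, 15),
  W = c(NA, NA, 50, NA, NA, NA, 80, NA, NA, 30, NA)
)
out <- fill_weight(df, "L", "W", a = 0.003, b = 3.2)
out
```

Rows where both L and W are NA are left untouched, and existing weights are never overwritten.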

How to merge elements of atomic vector in R?

I want to merge the elements of several named atomic vectors, stored in a list, by element name. For example:
ls = list(a = c(a = 1, b = 2, d = 2), b = c(b = 2, c = 3), c = c(a = 1, b = 2))
I would like to get output like this:
a b c
a 1 NA 1
b 2 2 2
c NA 3 NA
d 2 NA NA
I tried Reduce, but it is not working. I do not want to use any external package for this problem.
Thanks
You can use [ in sapply after you have extracted all element names.
i <- sort(unique(unlist(lapply(ls, names))))
x <- sapply(ls, "[", i)
rownames(x) <- i
x
# a b c
#a 1 NA 1
#b 2 2 2
#c NA 3 NA
#d 2 NA NA
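The reason this works is that indexing a named vector by a name it does not contain returns NA. A tiny illustration (the vector `v` and the name set `wanted` are made up for this example):

```r
v <- c(b = 2, c = 3)
wanted <- c("a", "b", "c", "d")

# names present in v return their value, absent names return NA
picked <- v[wanted]
unname(picked)
```

The names of the result are NA for the absent entries, which is why the answer resets the row names with `rownames(x) <- i` afterwards.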
We could also use bind_rows here
library(dplyr)
library(tibble)
bind_rows(ls, .id = 'x') %>%
column_to_rownames('x') %>%
t
a b c
a 1 NA 1
b 2 2 2
d 2 NA NA
c NA 3 NA
Or using base R
xtabs(values ~ ind + x, do.call(rbind, Map(cbind, x = names(ls), lapply(ls, stack))))
x
ind a b c
a 1 0 1
b 2 2 2
d 2 0 0
c 0 3 0
A data.table option using rbindlist
library(data.table)
t(rbindlist(Map(function(x) data.table(t(x)), ls), fill = TRUE))
[,1] [,2] [,3]
a 1 NA 1
b 2 2 2
d 2 NA NA
c NA 3 NA

Populate matrix by colname identity

I have many samples, each one of which has a corresponding abundance matrix. From these abundance matrices, I would like to create a large matrix that contains abundance information for each sample in rows.
For example, a single abundance matrix would look like:
A B C D
sample1 1 3 4 2
where A, B, C, and D represent colnames, and the abundances are the row values.
I would like to populate my larger matrix, which has as colnames all possible letters (A:Z) and all possible samples (sample1:sampleN) as rows, by matching the colname values.
For ex. :
A B C D E F G .... Z
sample1 1 3 4 2 NA NA NA ....
sample2 NA NA 2 5 7 NA NA ....
sample3 4 NA 6 9 2 NA 2 .....
....
sampleN
Different samples have a varying mix of abundances, in no guaranteed order.
When iteratively adding to this larger matrix, how could I ensure that the correct columns are populated by the right abundance values (ex. column "A" is only filled by values corresponding to abundances of "A" in different samples)? Thanks!
Starting data, changing just a little to highlight differences:
m1 <- as.matrix(read.table(header=TRUE, text="
A B C Z
sample1 1 3 4 2"))
m2 <- as.matrix(read.table(header=TRUE, text="
A B C D E F G
sample2 NA NA 2 5 7 NA NA
sample3 4 NA 6 9 2 NA 2"))
First, we need to make sure both matrices have the same column names:
newcols <- setdiff(colnames(m2), colnames(m1))
m1 <- cbind(m1, matrix(NA, nr=nrow(m1), nc=length(newcols), dimnames=list(NULL, newcols)))
newcols <- setdiff(colnames(m1), colnames(m2))
m2 <- cbind(m2, matrix(NA, nr=nrow(m2), nc=length(newcols), dimnames=list(NULL, newcols)))
m1
# A B C Z D E F G
# sample1 1 3 4 2 NA NA NA NA
m2
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
And now we combine them; regular rbind needs the column names to be aligned as well:
rbind(m2, m1[,colnames(m2),drop=FALSE])
# A B C D E F G Z
# sample2 NA NA 2 5 7 NA NA NA
# sample3 4 NA 6 9 2 NA 2 NA
# sample1 1 3 4 NA NA NA NA 2
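With many samples, the same pad-then-bind idea could be wrapped in a small helper. This is a sketch under a few assumptions: `rbind_by_colname` is a hypothetical name, every matrix has row and column names, and the values are numeric.

```r
# Hypothetical helper: pad each matrix to the union of column names, then rbind
rbind_by_colname <- function(mats) {
  all_cols <- sort(unique(unlist(lapply(mats, colnames))))
  padded <- lapply(mats, function(m) {
    out <- matrix(NA_real_, nrow(m), length(all_cols),
                  dimnames = list(rownames(m), all_cols))
    out[, colnames(m)] <- m   # fill only the columns this sample has
    out
  })
  do.call(rbind, padded)
}

m1 <- matrix(c(1, 3, 4, 2), nrow = 1,
             dimnames = list("sample1", c("A", "B", "C", "Z")))
m2 <- matrix(c(NA, 4, NA, NA, 2, 6, 5, 9), nrow = 2,
             dimnames = list(c("sample2", "sample3"), c("A", "B", "C", "D")))
res <- rbind_by_colname(list(m1, m2))
res
```

Because every padded matrix shares the same sorted column set, no manual reordering is needed before the final `rbind`.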
You should be able to take advantage of matrix indexing. Using this example abundance matrix, and a big matrix to fill:
abun <- matrix(c(1,3,4,2),nrow=1,dimnames=list("sample1",LETTERS[1:4]))
big <- matrix(NA,nrow=5,ncol=26,dimnames=list(paste0("sample",1:5),LETTERS))
the assignment looks like so:
big[cbind(rownames(abun),colnames(abun))] <- abun
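For the iterative case, a loop over the per-sample matrices might look like the sketch below. The second matrix `abun2` is invented for illustration, and this variant indexes by row names and column names separately, so it should also cope with abundance matrices that have more than one row.

```r
big <- matrix(NA_real_, nrow = 3, ncol = 26,
              dimnames = list(paste0("sample", 1:3), LETTERS))

abun1 <- matrix(c(1, 3, 4, 2), nrow = 1,
                dimnames = list("sample1", LETTERS[1:4]))
abun2 <- matrix(c(2, 5, 7), nrow = 1,
                dimnames = list("sample2", c("C", "D", "E")))

# each sample lands in its own row, and values go to the columns of the same name
for (abun in list(abun1, abun2)) {
  big[rownames(abun), colnames(abun)] <- abun
}
big[, 1:6]
```

Since the columns are addressed by name, the order of columns within each sample does not matter.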
Another solution using reduce from purrr package and union_all from dplyr package:
library(purrr)
library(dplyr)
sample_names <- c("sample1","sample2","sample3")
Generating 3 random abundance dataframes:
num1 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df1 <- data.frame(t(num1))
colnames(df1) <- sample(LETTERS,length(num1))
num2 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df2 <- data.frame(t(num2))
colnames(df2) <- sample(LETTERS,length(num2))
num3 <- round(runif(runif(1,min = 1, max = 10),min = 1, max = 10))
df3 <- data.frame(t(num3))
colnames(df3) <- sample(LETTERS,length(num3))
This is actually the code that does all the magic:
A <- reduce(list(df1,df2,df3),union_all)
col_order <- sort(colnames(A),decreasing = FALSE)
A <- A[,col_order]
rownames(A) <- sample_names
Output:
> A
A C E F O P Q U W Y
sample1 9 NA NA NA 9 NA 5 6 NA NA
sample2 NA NA NA NA 5 4 NA NA 5 NA
sample3 NA 6 5 9 NA NA 3 NA 5 7

How to get the position of elements in a list?

Given a list variable, I'd like to have a data frame of the positions of each element. For a simple non-nested list, it seems quite straightforward.
For example, here's a list of character vectors.
l <- replicate(
10,
sample(letters, rpois(1, 2), replace = TRUE),
simplify = FALSE
)
l looks like this:
[[1]]
[1] "m"
[[2]]
[1] "o" "r"
[[3]]
[1] "g" "m"
# etc.
To get the data frame of positions, I can use:
d <- data.frame(
value = unlist(l),
i = rep(seq_len(length(l)), lengths(l)),
j = rapply(l, seq_along, how = "unlist"),
stringsAsFactors = FALSE
)
head(d)
## value i j
## 1 m 1 1
## 2 o 2 1
## 3 r 2 2
## 4 g 3 1
## 5 m 3 2
## 6 w 4 1
Given a trickier nested list, for example:
l2 <- list(
"a",
list("b", list("c", c("d", "a", "e"))),
character(),
c("e", "b"),
list("e"),
list(list(list("f")))
)
this doesn't easily generalize.
The output I expect for this example is:
data.frame(
value = c("a", "b", "c", "d", "a", "e", "e", "b", "e", "f"),
i1 = c(1, 2, 2, 2, 2, 2, 4, 4, 5, 6),
i2 = c(1, 1, 2, 2, 2, 2, 1, 2, 1, 1),
i3 = c(NA, 1, 1, 2, 2, 2, NA, NA, 1, 1),
i4 = c(NA, NA, 1, 1, 2, 3, NA, NA, NA, 1),
i5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1)
)
How do I get a data frame of positions for a nested list?
Here's an approach that yields a slightly different output than you showed, but it'll be useful further down the road.
f <- function(l) {
names(l) <- seq_along(l)
lapply(l, function(x) {
x <- setNames(x, seq_along(x))
if(is.list(x)) f(x) else x
})
}
Function f simply iterates (recursively) through all levels of the given list and names its elements 1,2,...,n where n is the length of the (sub)list. Then, we can make use of the fact that unlist has a use.names argument that is TRUE by default and has effect when used on a named list (that's why we have to use f to name the list first).
For the nested list l2 it returns:
unlist(f(l2))
# 1.1 2.1.1 2.2.1.1 2.2.2.1 2.2.2.2 2.2.2.3 4.1 4.2 5.1.1 6.1.1.1.1
# "a" "b" "c" "d" "a" "e" "e" "b" "e" "f"
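A small illustration of the naming behaviour that f relies on (the two toy lists here are made up; `named` is simply `nested` with the names f would assign written out by hand):

```r
nested <- list("a", list("b", c("c", "d")))
# without names, unlist() has nothing to build position labels from
names(unlist(nested))

named <- list(
  `1` = c(`1` = "a"),
  `2` = list(
    `1` = c(`1` = "b"),
    `2` = c(`1` = "c", `2` = "d")
  )
)
# with names at every level, unlist() joins them with "."
position_names <- names(unlist(named))
position_names
```

Each dot-separated label is then the full path of indices down to that value.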
Now, in order to return a data.frame as asked for in the question, I'd do this:
g <- function(l) {
vec <- unlist(f(l))
n <- max(lengths(strsplit(names(vec), ".", fixed=TRUE)))
require(tidyr)
data.frame(
value = unname(vec),
i = names(vec)
) %>%
separate(i, paste0("i", 1:n), sep = "\\.", fill = "right", convert = TRUE)
}
And apply it like this:
g(l2)
# value i1 i2 i3 i4 i5
#1 a 1 1 NA NA NA
#2 b 2 1 1 NA NA
#3 c 2 2 1 1 NA
#4 d 2 2 2 1 NA
#5 a 2 2 2 2 NA
#6 e 2 2 2 3 NA
#7 e 4 1 NA NA NA
#8 b 4 2 NA NA NA
#9 e 5 1 1 NA NA
#10 f 6 1 1 1 1
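If you would rather avoid the tidyr dependency, the splitting step could also be done in base R. A rough sketch, assuming a non-empty list; `g_base` is a hypothetical name, and f is repeated from above so the block runs on its own:

```r
f <- function(l) {
  names(l) <- seq_along(l)
  lapply(l, function(x) {
    x <- setNames(x, seq_along(x))
    if (is.list(x)) f(x) else x
  })
}

g_base <- function(l) {
  vec <- unlist(f(l))
  parts <- strsplit(names(vec), ".", fixed = TRUE)
  n <- max(lengths(parts))
  # indexing past the end of a vector pads each index path with NA
  idx <- t(vapply(parts, function(p) as.integer(p)[seq_len(n)], integer(n)))
  colnames(idx) <- paste0("i", seq_len(n))
  data.frame(value = unname(vec), idx, stringsAsFactors = FALSE)
}

l2 <- list(
  "a",
  list("b", list("c", c("d", "a", "e"))),
  character(),
  c("e", "b"),
  list("e"),
  list(list(list("f")))
)
res <- g_base(l2)
res
```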
An improved version of g, contributed by @AnandaMahto (thanks!), would use data.table:
g <- function(inlist) {
require(data.table)
temp <- unlist(f(inlist))
setDT(tstrsplit(names(temp), ".", fixed = TRUE))[, value := unname(temp)][]
}
Edit (credits go to @TylerRinkler - thanks!)
This has the benefit of easily being converted to a data.tree object which can then be converted to many other data types. With a slight mod to g:
g <- function(l) {
vec <- unlist(f(l))
n <- max(lengths(strsplit(names(vec), ".", fixed=TRUE)))
require(tidyr)
data.frame(
i = names(vec),
value = unname(vec)
) %>%
separate(i, paste0("i", 1:n), sep = "\\.", fill = "right", convert = TRUE)
}
library(data.tree)
x <- data.frame(top=".", g(l2))
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
mytree <- data.tree::as.Node(x)
mytree
# levelName
#1 .
#2 ¦--1
#3 ¦ °--1
#4 ¦ °--a
#5 ¦--2
#6 ¦ ¦--1
#7 ¦ ¦ °--1
#8 ¦ ¦ °--b
#9 ¦ °--2
#10 ¦ ¦--1
#11 ¦ ¦ °--1
#12 ¦ ¦ °--c
#13 ¦ °--2
#14 ¦ ¦--1
#15 ¦ ¦ °--d
#16 ¦ ¦--2
#17 ¦ ¦ °--a
#18 ¦ °--3
#19 ¦ °--e
#20 ¦--4
#21 ¦ ¦--1
#22 ¦ ¦ °--e
#23 ¦ °--2
#24 ¦ °--b
#25 ¦--5
#26 ¦ °--1
#27 ¦ °--1
#28 ¦ °--e
#29 °--6
#30 °--1
#31 °--1
#32 °--1
#33 °--1
#34 °--f
And to produce a nice plot:
plot(mytree)
Other forms of presenting the data:
as.list(mytree)
ToDataFrameTypeCol(mytree)
More on converting data.tree types:
https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-conversion
http://www.r-bloggers.com/how-to-convert-an-r-data-tree-to-json/
Here's an alternative. It's not going to be as fast as the approach by @docendodiscimus, but it is still pretty straightforward.
The basic idea is to use melt from "reshape2"/"data.table". melt has a method for lists that creates output like the following:
melt(l2)
# value L3 L2 L4 L1
# 1 a NA NA NA 1
# 2 b NA 1 NA 2
# 3 c 1 2 NA 2
# 4 d 2 2 NA 2
# 5 a 2 2 NA 2
# 6 e 2 2 NA 2
# 7 e NA NA NA 4
# 8 b NA NA NA 4
# 9 e NA 1 NA 5
# 10 f 1 1 1 6
Except for the column ordering and the last value that you're interested in, that seems to have all the info you're after. To get the last value you're interested in, you can use rapply(l2, seq_along).
Putting those two requirements together, you would have something like this:
myFun <- function(inlist) {
require(reshape2) ## Load required package
x1 <- melt(inlist) ## Melt the data
x1[[paste0("L", ncol(x1))]] <- NA_integer_ ## Add a column to hold the position info
x1 <- x1[c(1, order(names(x1)[-1]) + 1)] ## Reorder the columns
vals <- rapply(inlist, seq_along) ## These are the positional values
positions <- max.col(is.na(x1), "first") ## This is where the positions should go
x1[cbind(1:nrow(x1), positions)] <- vals ## Matrix indexing for replacement
x1 ## Return the output
}
myFun(l2)
# value L1 L2 L3 L4 L5
# 1 a 1 1 NA NA NA
# 2 b 2 1 1 NA NA
# 3 c 2 2 1 1 NA
# 4 d 2 2 2 1 NA
# 5 a 2 2 2 2 NA
# 6 e 2 2 2 3 NA
# 7 e 4 1 NA NA NA
# 8 b 4 2 NA NA NA
# 9 e 5 1 1 NA NA
# 10 f 6 1 1 1 1
The "data.table" version of g from the answer by @docendodiscimus is a little bit more direct:
g <- function(inlist) {
require(data.table)
temp <- unlist(f(inlist))
setDT(tstrsplit(names(temp), ".", fixed = TRUE))[, value := unname(temp)][]
}
Similar to docendo's, but attempting to operate as much as possible inside the recursion rather than fixing the result afterwards:
ff = function(x)
{
if(!is.list(x)) if(length(x)) return(seq_along(x)) else return(NA)
lapply(seq_along(x),
function(i) cbind(i, do.call(rBind, as.list(ff(x[[i]])))))
}
ans = do.call(rBind, ff(l2))
data.frame(value = unlist(l2),
ans[rowSums(is.na(ans[, -1L])) != (ncol(ans) - 1L), ])
# value X1 X2 X3 X4 X5
#1 a 1 1 NA NA NA
#2 b 2 1 1 NA NA
#3 c 2 2 1 1 NA
#4 d 2 2 2 1 NA
#5 a 2 2 2 2 NA
#6 e 2 2 2 3 NA
#7 e 4 1 NA NA NA
#8 b 4 2 NA NA NA
#9 e 5 1 1 NA NA
#10 f 6 1 1 1 1
rBind is a wrapper around rbind to avoid the "non-matching columns" errors:
rBind = function(...)
{
args = lapply(list(...), function(x) if(is.matrix(x)) x else matrix(x))
nc = max(sapply(args, ncol))
do.call(rbind,
lapply(args, function(x)
do.call(cbind, c(list(x), rep_len(list(NA), nc - ncol(x))))))
}
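A quick check of what rBind does with inputs of different widths may make the padding clearer. The inputs below are made up, and rBind is repeated from above so the block runs on its own:

```r
rBind = function(...)
{
    args = lapply(list(...), function(x) if(is.matrix(x)) x else matrix(x))
    nc = max(sapply(args, ncol))
    do.call(rbind,
            lapply(args, function(x)
                do.call(cbind, c(list(x), rep_len(list(NA), nc - ncol(x))))))
}

# a plain vector becomes a one-column matrix, then gets NA columns appended
out <- rBind(1:2, matrix(1:6, nrow = 2))
out
```

The narrower input is padded on the right with NA columns until it matches the widest one, after which an ordinary `rbind` is safe.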
This can also be done with rrapply in the rrapply-package (extended version of base rapply) using how = "melt" to return a melted data.frame similar to reshape2::melt:
library(rrapply)
## use rapply or rrapply to convert terminal nodes to lists
l2_list <- rapply(l2, f = as.list, how = "replace")
## use rrapply with how = "melt" to return melted data.frame
l2_melt <- rrapply(l2_list, how = "melt")
#> L1 L2 L3 L4 L5 value
#> 1 ..1 ..1 <NA> <NA> <NA> a
#> 2 ..2 ..1 ..1 <NA> <NA> b
#> 3 ..2 ..2 ..1 ..1 <NA> c
#> 4 ..2 ..2 ..2 ..1 <NA> d
#> 5 ..2 ..2 ..2 ..2 <NA> a
#> 6 ..2 ..2 ..2 ..3 <NA> e
#> 7 ..4 ..1 <NA> <NA> <NA> e
#> 8 ..4 ..2 <NA> <NA> <NA> b
#> 9 ..5 ..1 ..1 <NA> <NA> e
#> 10 ..6 ..1 ..1 ..1 ..1 f
NB: we can convert the level columns to numeric columns afterwards if necessary.
rrapply(l2_melt, condition = function(x, .xname) grepl("^L", .xname), f = function(x) as.numeric(sub("\\.+", "", x)))
#> L1 L2 L3 L4 L5 value
#> 1 1 1 NA NA NA a
#> 2 2 1 1 NA NA b
#> 3 2 2 1 1 NA c
#> 4 2 2 2 1 NA d
#> 5 2 2 2 2 NA a
#> 6 2 2 2 3 NA e
#> 7 4 1 NA NA NA e
#> 8 4 2 NA NA NA b
#> 9 5 1 1 NA NA e
#> 10 6 1 1 1 1 f
Computation times
Using rrapply instead of reshape2::melt can give significant speed-ups for (very) large nested lists as shown in the benchmark timings below:
## create deeply nested list
deep_list <- rrapply(list(1, 1), classes = c("list", "numeric"), condition = function(x, .xpos) length(.xpos) < 18, f = function(x) list(1, 1), how = "recurse")
system.time(reshape2::melt(deep_list))
#> user system elapsed
#> 119.747 0.024 119.784
system.time(rrapply(deep_list, how = "melt"))
#> user system elapsed
#> 0.240 0.008 0.249
## create large shallow nested list
large_list <- lapply(replicate(500, 1, simplify = F), function(x) replicate(500, 1, simplify = F))
system.time(reshape2::melt(large_list))
#> user system elapsed
#> 40.558 0.008 40.569
system.time(rrapply(large_list, how = "melt"))
#> user system elapsed
#> 0.073 0.000 0.073

How to squeeze in missing values into a vector

Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but for which observations 7, 11 and 15 are missing, and for which the other observations are identical to b but in the order of a?
Or to put it somewhat differently: how could I squeeze in these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been failing to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# let's work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
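The same greedy walk could be wrapped in a function with a bounds check, so that it does not fail once B is exhausted (for example when the last elements of A were the ones removed). This is a sketch only: `squeeze_na` is a hypothetical name, the small A and B below are made up, and it assumes neither vector contains NA.

```r
# Hypothetical helper: greedy left-to-right alignment of B against A
squeeze_na <- function(A, B) {
  C <- A
  j <- 1
  for (i in seq_along(A)) {
    if (j <= length(B) && A[i] == B[j]) {
      j <- j + 1          # A[i] is accounted for by the next element of B
    } else {
      C[i] <- NA          # A[i] has no counterpart in B at this point
    }
  }
  C
}

A <- c(2, 7, 4, 8, 9, 0, 5, 8)
B <- A[-c(3, 8)]          # drop the 3rd and the last observation
squeeze_na(A, B)
# [1]  2  7 NA  8  9  0  5 NA
```

Like the loop above, this is greedy, so when a removed value equals its neighbour the NA may land on the neighbouring position instead.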
2) Reduce Here is a loop-free version:
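f <- function(j, a) j + (a == B[j])
# start the index into B at 1; without an explicit init, Reduce() would use A[1] as the first index
r <- Reduce(f, A, 1, accumulate = TRUE)
ifelse(duplicated(r)[-1], NA, A)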
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in (seq_along(a[,1]))) {
if (a[i,1]==b[j,1]) {
j <- j+1
} else
{
c[i,1] <- NA
}
}
