I have 3 vectors
x <- c(1,3,5,7,3,8)
y <- c(3,5,7)
z <- c(3,3,8)
I want to find the elements of x that are not in y and not in z. Is there a function f that would give me the following output:
> f(x,y)
1 3 8
> f(x,z)
1 5 7
In other words, I want to find the "set difference" between 2 vectors, either of which may have repeated values. The functions %in%, match and setdiff do not work here, because they treat the inputs as sets and ignore repeated values.
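For example, base setdiff collapses the duplicated 3, so one copy that should survive is lost:
setdiff(x, y)
# [1] 1 8   (the desired result is 1 3 8: one of the two 3s should remain)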
There may well be better ways to do this, but here is one option:
get_diff_vectors <- function(x, y) {
  count_x <- table(x)
  count_y <- table(y)
  same_counts <- match(names(count_y), names(count_x))
  count_x[same_counts] <- count_x[same_counts] - count_y
  as.numeric(rep(names(count_x), count_x))
}
get_diff_vectors(x, y)
#[1] 1 3 8
get_diff_vectors(x, z)
#[1] 1 5 7
get_diff_vectors(x, c(5, 7))
#[1] 1 3 3 8
We count the frequencies of x and y using table, match the values which occur in both, and subtract the counts of y from those of x. Finally, we recreate the remaining vector using rep.
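To make the intermediate steps concrete, here are the two count tables for the f(x, y) case; the subtraction leaves counts 1, 1, 0, 0, 1, which rep expands back to 1 3 8:
table(x)
# x
# 1 3 5 7 8
# 1 2 1 1 1
table(y)
# y
# 3 5 7
# 1 1 1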
I still have not found a better way, but here is a dplyr version using broadly similar logic.
library(dplyr)
get_diff_vectors_dplyr <- function(x, y) {
  df1 <- data.frame(x) %>% count(x)
  df2 <- data.frame(y) %>% count(y)
  final <- left_join(df1, df2, by = c("x" = "y")) %>%
    mutate(across(c("n.x", "n.y"), ~ replace(.x, is.na(.x), 0))) %>%
    mutate(n = n.x - n.y)
  rep(final$x, final$n)
}
get_diff_vectors_dplyr(x, y)
#[1] 1 3 8
get_diff_vectors_dplyr(x, z)
#[1] 1 5 7
get_diff_vectors_dplyr(x, c(5, 7))
#[1] 1 3 3 8
The vecsets package mentioned by the OP has the function vsetdiff, which does this very easily:
vecsets::vsetdiff(x, y)
#[1] 1 3 8
vecsets::vsetdiff(x, z)
#[1] 1 5 7
vecsets::vsetdiff(x, c(5, 7))
#[1] 1 3 3 8
Here's an attempt using make.unique to account for duplicates:
dupdiff <- function(x, y) x[-match(
  make.unique(as.character(y)),
  make.unique(as.character(x)),
  nomatch = 0
)]
Testing:
dupdiff(x,y)
#[1] 1 3 8
dupdiff(x,z)
#[1] 1 5 7
dupdiff(x, c(5, 7))
#[1] 1 3 3 8
dupdiff(x, c(5, 7, 9))
#[1] 1 3 3 8
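To see why this works, note what make.unique does to the duplicated values (using the dupdiff(x, z) call as an example):
make.unique(as.character(x))
# [1] "1"   "3"   "5"   "7"   "3.1" "8"
make.unique(as.character(z))
# [1] "3"   "3.1" "8"
# "3", "3.1" and "8" match positions 2, 5 and 6 of x, so those are dropped, leaving 1 5 7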
match with a little for-loop does work:
> f(x, y)
[1] 1 3 8
> f(x, z)
[1] 1 5 7
Code
f <- function(s, r) {
  for (i in seq_along(s)) {
    j <- match(s[i], r)
    if (!is.na(j)) {
      s[i] <- NA  # mark the matched element of s for removal
      r[j] <- NA  # consume this element of r so it cannot match again
    }
  }
  print(s[complete.cases(s)])
}
There is the new Hadley-verse package waldo. It makes really nice and concise overviews of the differences between objects, not only vectors:
library(waldo)
compare(x, y)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 5 7
compare(x, z)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 3 8
In R, I systematically try to avoid "for" loops and use the lapply() family instead.
But how can I do so when an iteration contains an increment step?
For example: is it possible to obtain the same result as below with a lapply approach?
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10){
  a <- c(a, sample(c(0,1), 1))
  b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
   a                   b
1  0                   0
2  1                 0-1
3  0               0-1-0
4  0             0-1-0-0
5  1           0-1-0-0-1
6  0         0-1-0-0-1-0
7  0       0-1-0-0-1-0-0
8  0     0-1-0-0-1-0-0-0
9  1   0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
EDIT
My question was very badly written. The new example below is much more illustrative: is there any way to use the lapply family if each iteration is calculated from the previous one?
a <- c()
b <- c()
for (i in 1:10){
  a <- c(a, sample(c(0,1), 1))
  b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
> data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 1 0-1-0-1
5 1 0-1-0-1-1
6 1 0-1-0-1-1-1
7 1 0-1-0-1-1-1-1
8 0 0-1-0-1-1-1-1-0
9 1 0-1-0-1-1-1-1-0-1
10 1 0-1-0-1-1-1-1-0-1-1
For the sake of completeness, there is also the accumulate() function from the purrr package.
So, building on the answers of Sotos and ThomasIsCoding:
df <- data.frame(a = 1:10)
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
The differences to Reduce() are that accumulate() is a function verb on its own (no additional accumulate = TRUE parameter required), and that additional arguments like sep = "-" can be passed on to the mapped function, which may help to avoid creating an anonymous function.
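For comparison, a base R sketch of the same result (assuming the same df with column a as above) needs an anonymous function to get the separator in:
df$b <- Reduce(function(a, b) paste(a, b, sep = "-"), df$a, accumulate = TRUE)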
EDIT
If I understand the OP's edit of the question correctly, they are asking whether a for loop which computes a result iteratively can be replaced by lapply().
This is difficult for me to answer. Here are some thoughts and observations:
First, accumulate() still will work:
set.seed(1L) # required for reproducible data
df <- data.frame(a = sample(0:1, 10L, TRUE))
df$b <- purrr::accumulate(df$a, paste, sep = "-")
df
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
This is possible because the computation of a can be pulled out of the loop, as it does not depend on b.
IMHO, accumulate() and Reduce() do what the OP is looking for; they are just not called lapply(). They take the result of the previous iteration and combine it with the current value. For instance,
Reduce(`+`, 1:3)
returns the sum of 1, 2, and 3 by iteratively computing ((1 + 2) + 3). The intermediate results can be visualised using the accumulate parameter:
Reduce(`+`, 1:3, accumulate = TRUE)
[1] 1 3 6
Second, there is a major difference between a for loop and functions of the lapply() family: lapply(X, FUN, ...) requires a function FUN to be called on each element of X. So, scoping rules for functions apply.
When we transplant the body of the loop into an anonymous function within lapply()
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
  a <- c(a, sample(c(0,1), 1))
  b <- c(b, (paste(a, collapse = "-")))
})
we get
[[1]]
[1] "0"
[[2]]
[1] "1"
[[3]]
[1] "0"
[[4]]
[1] "0"
[[5]]
[1] "1"
[[6]]
[1] "0"
[[7]]
[1] "0"
[[8]]
[1] "0"
[[9]]
[1] "1"
[[10]]
[1] "1"
data.frame(a, b)
# data frame with 0 columns and 0 rows
Due to R's scoping rules, the assignments inside the function create new local variables a and b; the a and b defined outside the function are never modified.
This can be worked around with the global assignment operator <<-:
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
lapply(1:10, function(i) {
  a <<- c(a, sample(c(0,1), 1))
  b <<- c(b, (paste(a, collapse = "-")))
})
data.frame(a, b)
a b
1 0 0
2 1 0-1
3 0 0-1-0
4 0 0-1-0-0
5 1 0-1-0-0-1
6 0 0-1-0-0-1-0
7 0 0-1-0-0-1-0-0
8 0 0-1-0-0-1-0-0-0
9 1 0-1-0-0-1-0-0-0-1
10 1 0-1-0-0-1-0-0-0-1-1
However, global assignment is considered bad programming practice and should be avoided, see, e.g., the 6th Circle of Patrick Burns' The R Inferno and many questions on SO.
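As a sketch of a way around this without <<-: sample a up front, then build b with vapply over the indices. Note this is an assumption-laden variant, and a single sample() call of size 10 is not guaranteed to reproduce the exact draws of ten separate calls of size 1, even with the same seed:
set.seed(1L)
a <- sample(c(0, 1), 10, replace = TRUE)  # may differ from ten sample(c(0,1), 1) calls
b <- vapply(seq_along(a),
            function(i) paste(a[seq_len(i)], collapse = "-"),
            character(1))
data.frame(a, b)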
Third, the way the loop is written grows vectors in the loop. This is also considered bad practice, as it requires copying the data over and over again, which may become tremendously slow with increasing size. See, e.g., the 2nd Circle of Patrick Burns' The R Inferno.
However, the original code
a <- c()
b <- c()
set.seed(1L) # required for reproducible data
for (i in 1:10) {
  a <- c(a, sample(c(0,1), 1))
  b <- c(b, (paste(a, collapse = "-")))
}
data.frame(a, b)
can be re-written as
a <- integer(10)
b <- character(10)
set.seed(1L) # required for reproducible data
for (i in seq_along(a)) {
  a[i] <- sample(c(0,1), 1)
  b[i] <- if (i == 1L) a[1] else paste(b[i-1], a[i], sep = "-")
}
data.frame(a, b)
Here, the vectors are pre-allocated with the required size to hold the result, and the elements to update are identified by subscripting.
The calculation of b[i] still depends only on the value of the previous iteration, b[i-1], and the current value a[i], as requested by the OP.
Another way is to use Reduce with accumulate = TRUE, i.e.
df$new <- do.call(rbind, Reduce(paste, split(df, seq(nrow(df))), accumulate = TRUE))
which gives,
a new
1 1 1
2 2 1 2
3 3 1 2 3
4 4 1 2 3 4
5 5 1 2 3 4 5
6 6 1 2 3 4 5 6
7 7 1 2 3 4 5 6 7
8 8 1 2 3 4 5 6 7 8
9 9 1 2 3 4 5 6 7 8 9
10 10 1 2 3 4 5 6 7 8 9 10
You can use sapply (lapply would work too, but it returns a list) to iterate over every value of a in df, create a sequence from 1 up to that value, and paste it together.
df <- data.frame(a = 1:10)
df$b <- sapply(df$a, function(x) paste(seq(x), collapse = "-"))
df
# a b
#1 1 1
#2 2 1-2
#3 3 1-2-3
#4 4 1-2-3-4
#5 5 1-2-3-4-5
#6 6 1-2-3-4-5-6
#7 7 1-2-3-4-5-6-7
#8 8 1-2-3-4-5-6-7-8
#9 9 1-2-3-4-5-6-7-8-9
#10 10 1-2-3-4-5-6-7-8-9-10
If there could be non-numeric values in the data, on which we cannot use seq, like
df <- data.frame(a =letters[1:10])
then we can use
df$b <- sapply(seq_along(df$a), function(x) paste(df$a[seq_len(x)], collapse = "-"))
df
# a b
#1 a a
#2 b a-b
#3 c a-b-c
#4 d a-b-c-d
#5 e a-b-c-d-e
#6 f a-b-c-d-e-f
#7 g a-b-c-d-e-f-g
#8 h a-b-c-d-e-f-g-h
#9 i a-b-c-d-e-f-g-h-i
#10 j a-b-c-d-e-f-g-h-i-j
Another way of using Reduce, different from the approach by @Sotos:
df$b <- Reduce(function(...) paste(..., sep = "-"), df$a, accumulate = TRUE)
such that
> df
a b
1 1 1
2 2 1-2
3 3 1-2-3
4 4 1-2-3-4
5 5 1-2-3-4-5
6 6 1-2-3-4-5-6
7 7 1-2-3-4-5-6-7
8 8 1-2-3-4-5-6-7-8
9 9 1-2-3-4-5-6-7-8-9
10 10 1-2-3-4-5-6-7-8-9-10
I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or multiple ts are directly in between a d and a c; in all other cases I want to retain the rows. So for this example, I would like to remove the ts on rows 8, 18 and 19, but keep the others. I have many thousands of cases, so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string, then use sapply to check backward/forward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(2:length(rle_vals$values),  # loop over the runs
  function(i, vals, lengths) {
    # check if this run is "t", the previous run is "d" and the next run is "c"
    if (vals[i] == "t" & vals[i-1] == "d" & vals[i+1] == "c") {
      (sum(lengths[seq_len(i-1)]) + 1):sum(lengths[seq_len(i)])  # row numbers of this run
    }
  }, vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
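For reference, the run-length encoding that drives the check above looks like this (output abbreviated the way print.rle displays it); runs 5 and 14 are the "t" runs sandwiched between a "d" run and a "c" run, giving rows 8 and 18-19:
rle(as.character(df$x))
# Run Length Encoding
#   lengths: int [1:15] 2 3 1 1 1 1 1 1 1 1 ...
#   values : chr [1:15] "a" "b" "c" "d" "t" "c" "b" "t" "c" "t" ...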
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind, lapply(list(gregexpr("dt+c", dfs), gregexpr("ct+d", dfs)),
                              function(L) data.frame(x = L[[1]], y = attr(L[[1]], "match.length"))))
dfs2 <- dfs2[dfs2$x > 0, ]  # remove any -1 values (if the pattern is not found)
drop <- unlist(mapply(function(a, b) (a + 1):(a + b - 2), dfs2$x, dfs2$y))
df2 <- df[-drop, ]
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1,]
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
I have a data frame like this:
n = c(2, 2, 3, 3, 4, 4)
n <- as.factor(n)
s = c("a", "b", "c", "d", "e", "f")
df = data.frame(n, s)
df
n s
1 2 a
2 2 b
3 3 c
4 3 d
5 4 e
6 4 f
and I want to access the first element of each level of my factor (and have in this example a vector containing a, c, e).
It is possible to reach the first element of one level, with
df$s[df$n == 2][1]
but it does not work for all levels:
df$s[df$n == levels(n)]
[1] a f
How would you do that?
And to go further, I’d like to modify my data frame to see which is the first element for each level at every occurrence. In my example, a new column should be:
n s rep firstelement
1 2 a a a
2 2 b c a
3 3 c e c
4 3 d a c
5 4 e c e
6 4 f e e
Edit. The first part of my answer addresses the original question, i.e. before "And to go further" (which was added by OP in an edit).
Another possibility, using duplicated. From ?duplicated: "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts."
Here we use !, the logical negation (NOT), to select not duplicated elements of 'n', i.e. first elements of each level of 'n'.
df[!duplicated(df$n), ]
# n s
# 1 2 a
# 3 3 c
# 5 4 e
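If you only want the vector of first elements rather than the subsetted data frame, the same idea works directly on the column:
df$s[!duplicated(df$n)]
# [1] a c e
# Levels: a b c d e f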
Update: I didn't see your "And to go further" edit until now. My first suggestion would definitely be to use ave, as already proposed by @thelatemail and @sparrow. But just to dig around in the R toolbox and show you an alternative, here's a dplyr way:
Group the data by n, then use the mutate function to create a new variable 'first' with the value 'first element of s' (s[1]):
library(dplyr)
df %>%
  group_by(n) %>%
  mutate(first = s[1])
# n s first
# 1 2 a a
# 2 2 b a
# 3 3 c c
# 4 3 d c
# 5 4 e e
# 6 4 f e
Or go all in with dplyr convenience functions and use first instead of [1]:
df %>%
  group_by(n) %>%
  mutate(first = first(s))
A dplyr solution for your original question would be to use summarise:
df %>%
  group_by(n) %>%
  summarise(first = first(s))
# n first
# 1 2 a
# 2 3 c
# 3 4 e
Here is an approach using match:
df$s[match(levels(n), df$n)]
EDIT: Maybe this looks a bit confusing ...
To get a column which lists the first elements you could use match twice (but with x and table arguments swapped):
df$firstelement <- df$s[match(levels(n), df$n)[match(df$n, levels(n))]]
df$firstelement
# [1] a a c c e e
# Levels: a b c d e f
Let's look at this in detail:
## this returns the first matching elements
match(levels(n), df$n)
# [1] 1 3 5
## when we swap the x and table argument in match we get the level index
## for each df$n (the duplicated indices are important)
match(df$n, levels(n))
# [1] 1 1 2 2 3 3
## results in
c(1, 3, 5)[c(1, 1, 2, 2, 3, 3)]
# [1] 1 1 3 3 5 5
df$s[c(1, 1, 3, 3, 5, 5)]
# [1] a a c c e e
# Levels: a b c d e f
The function ave is useful in these cases:
df$firstelement = ave(df$s, df$n, FUN = function(x) x[1])
df
n s firstelement
1 2 a a
2 2 b a
3 3 c c
4 3 d c
5 4 e e
6 4 f e
In this case I prefer the plyr package; it gives further freedom to manipulate the data.
library(plyr)
ddply(df, .(n), function(subdf) subdf[1, ])
n s
1 2 a
2 3 c
3 4 e
You could also use data.table
library(data.table)
dt = as.data.table(df)
dt[, list(firstelement = s[1]), by=n]
which would get you:
n firstelement
1: 2 a
2: 3 c
3: 4 e
The by=n bit groups everything by each value of n so s[1] is getting the first element of each of those groups.
To get this as an extra column you could do:
dt[, newcol := s[1], by=n]
dt
# n s newcol
#1: 2 a a
#2: 2 b a
#3: 3 c c
#4: 3 d c
#5: 4 e e
#6: 4 f e
So this just takes the value of s from the first row of each group and assigns it to a new column.
df$s[sapply(levels(n), function(particular.level) { which(df$n == particular.level)[1]})]
I believe your problem is that you are comparing two vectors: df$n is a vector and levels(n) is a vector. vector == vector only happens to work for you because the length of df$n is a multiple of the length of levels(n).
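A quick illustration of the recycling that makes df$n == levels(n) misleading: levels(n) is recycled to length 6, so each element of df$n is compared against a rotating level rather than all levels.
df$n == levels(n)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
# df$n[1] is compared with "2", df$n[2] with "3", df$n[3] with "4", df$n[4] with "2", ...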
Surprised not to see this classic in the answer stream yet.
> do.call(rbind, lapply(split(df, df$n), function(x) x[1,]))
## n s
## 2 2 a
## 3 3 c
## 4 4 e
I would like to do the following: combine into a data frame two vectors that
- have different lengths
- contain sequences found also in the other vector
- contain sequences not found in the other vector
- sequences that are not found in the other vector are never longer than 3 elements
- always have the same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1   vector 2          vector 1   vector 2
   1          1                 a          a
   2          2                 g          g
   3          3                 b          b
   4          1       or        h          a
   1          2                 a          g
   2          3                 g          b
   5          4                 c          h
              5                            c
should be combined into the data frame
   1          1                 a          a
   2          2                 g          g
   3          3                 b          b
   4         NA                 h         NA
   1          1       or        a          a
   2          2                 g          g
  NA          3                NA          b
  NA          4                NA          h
   5          5                 c          c
What I did was to search for merge, combine, cbind and plyr examples, but I was not able to find a solution. I am afraid I will need to write a function with nested for loops to solve this problem.
Note: this was proposed as an answer to the first version of the question. The question has been modified since then, but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
- both vectors contain the same number of sequences
- a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1, 2, 3, 4, 1, 2, 5)
v2 <- c(1, 2, 3, 1, 2, 3, 4, 5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) {  # aligns two sequences
  s12 <- sort(unique(c(s1, s2)))
  cbind(ifelse(s12 %in% s1, s12, NA),
        ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
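To see how the sequence detection works: diff(v1) <= 0 flags each position where the values stop increasing, and cumsum turns those flags into group ids for split, so v1 is cut into the runs 1 2 3 4 and 1 2 5:
cumsum(c(TRUE, diff(v1) <= 0))
# [1] 1 1 1 1 2 2 2
split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
# $`1`
# [1] 1 2 3 4
# $`2`
# [1] 1 2 5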
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
  m <- length(X)
  n <- length(Y)
  C <- matrix(0, 1 + m, 1 + n)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      if (X[i] == Y[j]) {
        C[i + 1, j + 1] = C[i, j] + 1
      } else {
        C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  backtrack <- function(C, X, Y, i, j) {
    if (i == 1 | j == 1) {
      return(data.frame(I = c(), J = c(), LCS = c()))
    } else if (X[i - 1] == Y[j - 1]) {
      return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                   data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
    } else if (C[i, j - 1] > C[i - 1, j]) {
      return(backtrack(C, X, Y, i, j - 1))
    } else {
      return(backtrack(C, X, Y, i - 1, j))
    }
  }
  return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
  LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
  X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
  Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
  ALL <- merge(LCS, X.df, by = "I", all = TRUE)
  ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
  ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                        ifelse(is.na(ALL$J), 0, ALL$J))), ]
  ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
  ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)
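If you only want the two aligned columns (with NA where a vector lacks a value), you can subset the returned data frame:
res <- ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                                   Y = c("a","g","b","a","g","b","h","c"))
res[c("X", "Y")]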