Say I have 5 vectors:
a <- c(1,2,3)
b <- c(2,3,4)
c <- c(1,2,5,8)
d <- c(2,3,4,6)
e <- c(2,7,8,9)
I know I can calculate the intersection between all of them by using Reduce() together with intersect(), like this:
Reduce(intersect, list(a, b, c, d, e))
[1] 2
But how can I find elements that are common in, say, at least 2 vectors? i.e.:
[1] 1 2 3 4 8
It is much simpler than a lot of people are making it look. This should be very efficient.
Put everything into a vector:
x <- unlist(list(a, b, c, d, e))
Then look for duplicates:
unique(x[duplicated(x)])
# [1] 2 3 1 4 8
and sort if needed.
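For the example data, sorting then reproduces the desired output exactly:
sort(unique(x[duplicated(x)]))
# [1] 1 2 3 4 8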
Note: in case there can be duplicates within a single vector (which your example does not seem to suggest), replace x with x <- unlist(lapply(list(a, b, c, d, e), unique))
Edit: as the OP has expressed interest in a more general solution where n >= 2, I would do:
which(tabulate(x) >= n)
if the data is only made of natural integers (1, 2, etc.) as in the example. If not:
f <- table(x)
names(f)[f >= n]
This is now not too far from James' solution, but it avoids the relatively costly sort. And it is miles faster than computing all possible combinations.
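As a worked sketch with the example data and n = 2 (note that table names are character, hence the as.numeric):
x <- unlist(lapply(list(a, b, c, d, e), unique))
n <- 2
f <- table(x)
as.numeric(names(f)[f >= n])
# [1] 1 2 3 4 8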
You could try all possible combinations, for example:
## create a list
l <- list(a, b, c, d, e)
## get combinations
cbn <- combn(seq_along(l), 2)
## Intersect them
unique(unlist(apply(cbn, 2, function(x) intersect(l[[x[1]]], l[[x[2]]]))))
## 2 3 1 4 8
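The same idea generalises to any n by taking n-way combinations and intersecting each with Reduce, e.g. for elements shared by at least 3 of the vectors (a sketch; this is far slower than the tabulation approach above):
n <- 3
cbn3 <- combn(seq_along(l), n)
unique(unlist(apply(cbn3, 2, function(i) Reduce(intersect, l[i]))))
## 2 3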
Here's another option:
# For each vector, get a vector of values without duplicates
deduplicated_vectors <- lapply(list(a,b,c,d,e), unique)
# Flatten the lists, then sort and use rle to determine how many
# lists each value appears in
rl <- rle(sort(unlist(deduplicated_vectors)))
# Get the values that appear in two or more lists
rl$values[rl$lengths >= 2]
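The threshold generalises directly; for values present in at least three of the vectors:
rl$values[rl$lengths >= 3]
# [1] 2 3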
This is an approach that counts the number of vectors each unique value occurs in.
unique_vals <- unique(c(a, b, c, d, e))
setNames(rowSums(!!(sapply(list(a, b, c, d, e), match, x = unique_vals)),
na.rm = TRUE), unique_vals)
# 1 2 3 4 5 8 6 7 9
# 2 5 3 2 1 2 1 1 1
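The named counts can then be filtered for any threshold, e.g. values occurring in at least two vectors:
counts <- setNames(rowSums(!!(sapply(list(a, b, c, d, e), match, x = unique_vals)),
                           na.rm = TRUE), unique_vals)
as.numeric(names(counts)[counts >= 2])
# [1] 1 2 3 4 8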
A variation of @rengis' method would be:
unique(unlist(Map(`intersect`, cbn[1,], cbn[2,])))
#[1] 2 3 1 4 8
where,
l <- mget(letters[1:5])
cbn <- combn(l,2)
Yet another approach, applying a vectorised function with outer:
L <- list(a, b, c, d, e)
f <- function(x, y) intersect(x, y)
fv <- Vectorize(f, list("x","y"))
o <- outer(L, L, fv)
table(unlist(o[upper.tri(o)]))
# 1 2 3 4 8
# 1 10 3 1 1
The output above gives the number of pairs of vectors that share each of the duplicated elements 1, 2, 3, 4, and 8.
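Since a value present in k of the vectors shows up in choose(k, 2) pairs, the pair counts can be thresholded to recover "common to at least n vectors", e.g. for n = 3:
tab <- table(unlist(o[upper.tri(o)]))
names(tab)[tab >= choose(3, 2)]
# [1] "2" "3"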
When the combined vector is huge, solutions like duplicated or tabulate can become memory-hungry. In that case, dplyr comes in handy with the following code:
library(dplyr)
combination_of_vectors <- c(a, b, c, d, e)
# For values appearing more than once
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 1)
# For values appearing more than twice
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 2)
# For values appearing more than three times
combination_of_vectors %>% as_tibble() %>% group_by(value) %>% filter(n() > 3)
Note that as_tibble() names the single column value, and that the filter keeps every qualifying row, duplicates included.
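To return each qualifying value just once, a count-based variant along the same lines would be:
combination_of_vectors %>%
  as_tibble() %>%
  count(value) %>%
  filter(n >= 2) %>%
  pull(value)
# [1] 1 2 3 4 8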
Hope it helps somebody
I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.
Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
                      function(x) any(sapply(seq_along(lstuo)[-x],
                                             function(y) all(lstuo[[x]] %in% lstuo[[y]])))))]
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
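Since the question mentions a very long list, it may help to wrap the steps into a single function (a sketch; seq_len(x - 1) suffices because, after ordering by decreasing length, a set can only be a subset of an earlier element, though the pairwise check remains quadratic in the number of unique sets):
remove_subsets <- function(lst) {
  lstu <- unique(lst)                     # drop exact duplicates
  if (length(lstu) < 2) return(lstu)
  lstuo <- lstu[order(-lengths(lstu))]    # longest sets first
  keep <- c(TRUE, !sapply(2:length(lstuo), function(x)
    any(sapply(seq_len(x - 1), function(y)
      all(lstuo[[x]] %in% lstuo[[y]])))))
  lstuo[keep]
}
remove_subsets(list(a, b, c, d))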
Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrices of values and indices
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combinations of indices
Determine subset relations
issubset <- t(sapply(list(c(1, 2), c(2, 1)),
                     function(z) mapply(function(x, y) all(x %in% y),
                                        m[z[1], ], m[z[2], ])))
Discard subset vectors from the list
discard <- c(n * issubset)[c(n * issubset) > 0]
ans <- lstu[-discard]
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
I wish to do exactly this: Take dates from one dataframe and filter data in another dataframe - R
except without joining, as I am afraid that after the join (and before the filter) the result will be too big to fit in memory.
Here is sample data:
tmp_df <- data.frame(a = 1:10)
I wish to do an operation that looks like this:
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)
tmp_df %>%
  filter(a >= lower_bound & a <= upper_bound) # does not work: >= and <= recycle the bound vectors element-wise
and my desired result is:
> tmp_df[(tmp_df$a <= 2 & tmp_df$a >= 2) | (tmp_df$a <= 5 & tmp_df$a >= 4), , drop = F]
# one way to get indices to subset data frame, impractical for a long range vector
a
2 2
4 4
5 5
My problem with memory requirements (with respect to the join solution linked) is when tmp_df has many more rows and the lower_bound and upper_bound vectors have many more entries. A dplyr solution, or a solution that can be part of pipe is preferred.
Maybe you could borrow the inrange function from data.table, which checks whether each value in x is in between any of the intervals provided in lower, upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
If you'd like to stick with dplyr, it provides similar functionality through its between function.
# ranges I want to check between
my_ranges <- list(c(2,2), c(4,5), c(6,7))
tmp_df <- data.frame(a=1:10)
tmp_df %>%
  filter(apply(bind_rows(lapply(my_ranges,
                                FUN = function(x, a) {
                                  data.frame(t(between(a, x[1], x[2])))
                                }, a)),
               2, any))
a
1 2
2 4
3 5
4 6
5 7
Just be aware that between includes the interval boundaries, and, unlike with inrange's incbounds argument, this cannot be changed.
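If the list of ranges grows, a somewhat tidier sketch is to fold the per-range logical vectors together with Reduce (this relies on the column a being visible inside filter):
tmp_df %>%
  filter(Reduce(`|`, lapply(my_ranges, function(r) between(a, r[1], r[2]))))
#   a
# 1 2
# 2 4
# 3 5
# 4 6
# 5 7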
Here is a simple test case. I was planning to split each string and extract only its first part.
library(dplyr)
library(stringr)
test <- data.frame(x = c('a b', 'c d'), stringsAsFactors = FALSE)
test
x
1 a b
2 c d
test %>% mutate(y = str_split(x,'\\s+')[[1]][1])
x y
1 a b a
2 c d a
Was expecting something like:
x y
1 a b a
2 c d c
Nowadays there are various packaged functions for splitting columns into pieces. Here you could use the separate() function from the tidyr package. Since you want the first value of a split on the spaces, you can just remove everything after the first space.
tidyr::separate(test, x, "y", "\\s.*", FALSE, extra = "drop")
# x y
# 1 a b a
# 2 c d c
str_split returns a list in which each element corresponds to an element of the original atomic vector, so you will need lapply or similar to index into it appropriately.
test %>% mutate(y = unlist(lapply(str_split(x,'\\s+'),'[[',1)))
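Equivalently, stringr::word can pull out the first whitespace-separated token without touching the list at all:
test %>% mutate(y = word(x, 1))
#     x y
# 1 a b a
# 2 c d c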
We can also use sub
library(data.table)
setDT(test)[, y:= sub('\\s+.*', '', x)]
test
# x y
#1: a b a
#2: c d c
I'm trying to use one column to determine which column to use as the value for another column. It looks something like this:
X Y Z Target
1 a b c X
2 d e f Y
3 g h i Z
And I want something that looks like this:
X Y Z Target TargetValue
1 a b c X a
2 d e f Y e
3 g h i Z i
Where each TargetValue is the value taken from the column named by Target. I've been using dplyr a bit to get this to work. If I knew how to make the output of paste the input for mutate, that would be great:
mutate(TargetWordFixed = (paste("WordMove",TargetWord,".rt", sep="")))
but maybe there is another way to do the same thing.
Be gentle, I'm new to both stackoverflow and R...
A vectorized approach would be to use matrix subsetting:
df %>% mutate(TargetValue = .[cbind(1:n(), match(Target, names(.)))])
# X Y Z Target TargetValue
#1 a b c X a
#2 d e f Y e
#3 g h i Z i
Or just using base R (same approach):
transform(df, TargetValue = df[cbind(1:nrow(df), match(Target, names(df)))])
Explanation:
match(Target, names(.)) computes the column indices of the entries in Target (which column is called X etc)
The . in the dplyr version refers to the data you "pipe" into the mutate statement with %>% (i.e. it refers to df)
df[cbind(1:nrow(df), match(Target, names(df)))] creates a matrix to subset df to the correct values: the first column of the matrix is just the row numbers from 1 to the number of rows of df (therefore 1:nrow(df)), and the second column is the index of the column holding the Target value of interest (computed by match(Target, names(df))).
The matrix that is produced for subsetting the example data is:
cbind(1:nrow(df), match(df$Target, names(df)))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
You could try apply rowwise like this:
transform(df, TargetValue = apply(df, 1, function(x) x[x["Target"]]))
# X Y Z Target TargetValue
# 1 a b c X a
# 2 d e f Y e
# 3 g h i Z i
library(tidyverse)
df <- setNames(data.frame(cbind(matrix(letters[1:9], 3, 3, byrow = TRUE), c("X", "Y", "Z"))), c("X", "Y", "Z", "Target"))
df
df %>%
gather(key="ID", value="TargetValue", X:Z) %>%
filter(ID==Target) %>%
select(Target, TargetValue) %>%
left_join(df, by="Target")
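For reference, the join should leave the Target/TargetValue pairs alongside the original columns:
#   Target TargetValue X Y Z
# 1      X           a a b c
# 2      Y           e d e f
# 3      Z           i g h i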
Suppose you have the following two data.frames:
set.seed(1)
x <- letters[1:10]
df1 <- data.frame(x)
z <- rnorm(20,100,10)
df2 <- data.frame(x,z)
(note that both dfs have a column named "x")
and you want to summarize the sums of df2$z for the groups of "x" in df1 like this:
df1 %.%
group_by(x) %.%
summarize(
z = sum(df2$z[df2$x == x])
)
this returns an error "invalid indextype integer" (translated).
But when I change the name of column "x" in either of the two dfs, it works:
df2 <- data.frame(x1 = x,z) #column is now named "x1", it would also work if the name was changed in df1
df1 %.%
group_by(x) %.%
summarize(
z = sum(df2$z[df2$x1 == x])
)
# x z
#1 a 208.8533
#2 b 205.7349
#3 c 185.4313
#4 d 193.8058
#5 e 214.5444
#6 f 191.3460
#7 g 204.7124
#8 h 216.8216
#9 i 213.9700
#10 j 202.8851
I can imagine many situations where you have two dfs with the same column name (like an "ID" column) for which this might be a problem, unless there is a simple way around it.
Did I miss something? There may be other ways to get to the same result for this example but I'm interested in understanding if this is possible in dplyr (or perhaps why not).
(The two dfs don't necessarily need to have the same unique "x" values as in this example.)
Following the comment from #beginneR, I'm guessing it'd be something like:
inner_join(df1, df2) %.% group_by(x) %.% summarise(z=sum(z))
Joining by: "x"
Source: local data frame [10 x 2]
x z
1 a 208.8533
2 b 205.7349
3 c 185.4313
4 d 193.8058
5 e 214.5444
6 f 191.3460
7 g 204.7124
8 h 216.8216
9 i 213.9700
10 j 202.8851
you can try:
df2 %.% filter(x %in% df1$x) %.% group_by(x) %.% summarise(sum(z))
hth