In R, discover rows which partially match rows in another data frame

I have the following two data frames:
> df1
# A tibble: 4 x 4
x y z w
<dbl> <dbl> <dbl> <dbl>
4 5 8 9
4 6 7 4
3 6 7 10
8 2 8 9
> df2
# A tibble: 4 x 4
x y z w
<dbl> <dbl> <dbl> <dbl>
6 2 7 9
2 6 7 10
4 5 8 12
4 5 8 3
I would like to discover which rows in df2 have a match in df1, where a match means being identical in at least n/2 columns.
So in this example, row 1 in df2 matches row 4 in df1 (columns 2 and 4), row 2 in df2 matches row 2 in df1 on columns 2 and 3 and row 3 in df1 on columns 2, 3 and 4, and so on.
I also have to save the location of the repeating rows and the columns on which they match.
For small data sets, I could replicate both data sets and subtract them and count the zeros. However what I need is a solution which would work on very large data sets (~20K rows).
Any ideas? A dplyr solution (rather than a data.table) would be highly appreciated.

This final output might not be the ideal format, but it should at least have the information you're looking for and work with many more fields/columns.
df1 <- read.table(text =
"x y z w
4 5 8 9
4 6 7 4
3 6 7 10
8 2 8 9",
header = T)
df2 <- read.table(text =
"x y z w
6 2 7 9
2 6 7 10
4 5 8 12
4 5 8 3",
header = T)
library(dplyr)
library(tidyr)
Add a row ID number to each data frame and reshape the data from wide to long with gather. (I'm assuming each row can be treated as a unique id):
df1 <- df1 %>%
mutate(df1_id = row_number()) %>%
gather(field, value, x:w) %>%
arrange(df1_id)
df2 <- df2 %>%
mutate(df2_id = row_number()) %>%
gather(field, value, x:w) %>%
arrange(df2_id)
Join the two data frames with an inner_join on field/column and value. Then use group_by and filter to keep only the pairs of rows (df2_id, df1_id) that match on two or more fields:
df2 %>%
inner_join(df1, by = c('value', 'field')) %>%
group_by(df2_id, df1_id) %>%
filter(n()>=2) %>% # where 2 is the minimum number of matches
arrange(df2_id, df1_id, value) %>%
select(df2_id, df1_id, field, value)
# A tibble: 13 x 4
# Groups: df2_id, df1_id [5]
df2_id df1_id field value
<int> <int> <chr> <int>
1 1 4 y 2
2 1 4 w 9
3 2 2 y 6
4 2 2 z 7
5 2 3 y 6
6 2 3 z 7
7 2 3 w 10
8 3 1 x 4
9 3 1 y 5
10 3 1 z 8
11 4 1 x 4
12 4 1 y 5
13 4 1 z 8
You can see that df2 row 1 matches df1 row 4 on the fields y and w,
df2 row 2 matches df1 row 2 on fields y and z,
df2 row 2 also matches df1 row 3 on fields y, z, and w,
and df2 rows 3 and 4 match df1 row 1 on x, y, and z.
arrange and select are really only necessary for easier viewing of the data.

How about this? Using dplyr and purrr, we add id.1/id.2 fields to the two data frames and append .1 or .2 to their existing field names as appropriate. Then we create a list of vectors for the by parameter, one per pair of columns. We iterate through each vector when inner_join-ing df2 to df1, concatenate all the results of the joins, and select the distinct ids from both data frames.
require(dplyr)
require(purrr)
df1 <- tibble(
x = c(4, 4, 3, 8),
y = c(5, 6, 6, 2),
z = c(8, 7, 7, 8),
w = c(9, 4, 10, 9)
)
df2 <- tibble(
x = c(6, 2, 4, 4),
y = c(2, 6, 5, 5),
z = c(7, 7, 8, 8),
w = c(9, 10, 12, 3)
)
df1 <- df1 %>%
mutate(id.1 = row_number()) %>% # length(.) would count columns, not rows
rename(
x.1 = x,
y.1 = y,
z.1 = z,
w.1 = w
)
df2 <- df2 %>%
mutate(id.2 = row_number()) %>% # length(.) would count columns, not rows
rename(
x.2 = x,
y.2 = y,
z.2 = z,
w.2 = w
)
inner_join_by <-
list(
c("x.1" = "x.2", "y.1" = "y.2"),
c("x.1" = "x.2", "z.1" = "z.2"),
c("x.1" = "x.2", "w.1" = "w.2"),
c("y.1" = "y.2", "z.1" = "z.2"),
c("y.1" = "y.2", "w.1" = "w.2"),
c("z.1" = "z.2", "w.1" = "w.2")
)
filtered <- inner_join_by %>%
map_df(.f = ~inner_join(x = df1, y = df2, by = .x)) %>%
select(id.1, id.2) %>%
distinct()
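The by list above is typed out by hand for four columns taken two at a time. As a rough sketch, assuming the same .1/.2 suffix convention, it could be generated with combn() instead; cols and k are illustrative names that are not part of the answer above.

```r
# Column names shared by both data frames (before the .1/.2 suffixes are added)
cols <- c("x", "y", "z", "w")
# Minimum number of columns that must agree (assumed here to be n/2)
k <- length(cols) / 2

# One named vector per combination of k columns, e.g. c(x.1 = "x.2", y.1 = "y.2")
inner_join_by <- combn(cols, k, simplify = FALSE, FUN = function(combo) {
  setNames(paste0(combo, ".2"), paste0(combo, ".1"))
})

length(inner_join_by)
```

The number of joins grows as choose(n, k), so this is probably only practical for a modest number of columns.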

One option could be using apply row-wise:
apply(df1, 1, function(x)apply(df2,1,function(y)x==y))
# [,1] [,2] [,3] [,4]
# [1,] FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE TRUE
# [3,] FALSE TRUE TRUE FALSE
# [4,] TRUE FALSE FALSE TRUE
# [5,] FALSE FALSE FALSE FALSE
# [6,] FALSE TRUE TRUE FALSE
# [7,] FALSE TRUE TRUE FALSE
# [8,] FALSE FALSE TRUE FALSE
# [9,] TRUE TRUE FALSE FALSE
# [10,] TRUE FALSE FALSE FALSE
# [11,] TRUE FALSE FALSE TRUE
# [12,] FALSE FALSE FALSE FALSE
# [13,] TRUE TRUE FALSE FALSE
# [14,] TRUE FALSE FALSE FALSE
# [15,] TRUE FALSE FALSE TRUE
# [16,] FALSE FALSE FALSE FALSE
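The logical matrix above still has to be reduced to a match count per pair of rows. A minimal base R sketch of that step, assuming numeric columns with the same names in both data frames; counts, min_matches and pairs are illustrative names, not part of the answer above:

```r
df1 <- data.frame(x = c(4, 4, 3, 8), y = c(5, 6, 6, 2),
                  z = c(8, 7, 7, 8), w = c(9, 4, 10, 9))
df2 <- data.frame(x = c(6, 2, 4, 4), y = c(2, 6, 5, 5),
                  z = c(7, 7, 8, 8), w = c(9, 10, 12, 3))

# counts[i, j] = number of columns on which df2 row i equals df1 row j
counts <- matrix(0L, nrow(df2), nrow(df1))
for (col in names(df1)) {
  counts <- counts + outer(df2[[col]], df1[[col]], "==")
}

# Keep the pairs that agree on at least half of the columns
min_matches <- ncol(df1) / 2
pairs <- as.data.frame(which(counts >= min_matches, arr.ind = TRUE))
names(pairs) <- c("df2_id", "df1_id")
pairs <- pairs[order(pairs$df2_id, pairs$df1_id), ]

# Record which columns agree for each retained pair
pairs$fields <- mapply(function(i, j) {
  paste(names(df1)[unlist(df2[i, ]) == unlist(df1[j, ])], collapse = ",")
}, pairs$df2_id, pairs$df1_id)
pairs
```

With 20K rows on each side the counts matrix has 4e8 cells, so in practice df2 would likely have to be processed in chunks of rows rather than in one call.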

What about the following solution (still involving a loop):
Here is the function which, for a given row, checks for and returns the matching rows:
fct <- function(x, dat){
M1logical <- t(unlist(x) == t(dat))
n <- which(rowSums(M1logical) > 1)
if(length(n) > 0){
return(n)
}
if(length(n) == 0){
return(0)
}
}
Now apply it by iterating over the rows of df2:
mylist <- rep(list(NA), nrow(df2))
for(k in 1:nrow(df2)){
mylist[[k]] <- fct(df2[k,], df1)
}
It takes my computer 23.14 seconds (measured with microbenchmark) to compute this for two data frames of size 20000x4 each (roughly 45 seconds on an older device). The dummy data:
df1 <- data.frame(x=sample(1:20,20000, replace = T), y=sample(1:20,20000, replace = T),
z=sample(1:20,20000, replace = T), w=sample(1:20,20000, replace = T),
stringsAsFactors = F)
df2 <- data.frame(x=sample(1:20,20000, replace = T), y=sample(1:20,20000, replace = T),
z=sample(1:20,20000, replace = T), w=sample(1:20,20000, replace = T),
stringsAsFactors = F)

Related

Find point in dataframe where (col_1[ i ], col_2[ i ]) = (col_1[ j ], -col_2[ j ])

There might be an obvious solution to this that I have missed but here goes:
Consider the data frame below. I wish to create a column with TRUE/FALSE values, where the value is TRUE whenever the condition (col_1[i], col_2[i]) = (col_1[j], -col_2[j]) is fulfilled. Note that sum() does not work here, since there might be a third value.
To elaborate; what I have is:
col_1 <- c("x", "x", "y", "y", "y", "z", "z")
col_2 <- c(-1, 1, 3, -3, 4, 7, 3)
df <- data.frame(col_1, col_2)
What I want is:
I think the answer must be something with df %>% group_by(x), but I can't think of the complete solution.
Here is my attempt. As you said, grouping the data is necessary. I defined the groups with col_1 and foo, where foo contains the absolute values of col_2. If a group has more than one observation and exactly two distinct values of col_2, you have the pairs you are looking for.
group_by(df, col_1, foo = abs(col_2)) %>%
mutate(check = n() > 1 & n_distinct(col_2) == 2) %>%
ungroup %>%
select(-foo)
col_1 col_2 check
<fct> <dbl> <lgl>
1 x -1 TRUE
2 x 1 TRUE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
As Ronak previously mentioned, there may be cases like this.
col_1 <- c("x", "x", "y", "y", "y", "z", "z")
col_2 <- c(1, 1, 3, -3, 4, 7, 3)
df2 <- data.frame(col_1, col_2)
col_1 col_2
1 x 1
2 x 1
3 y 3
4 y -3
5 y 4
6 z 7
7 z 3
group_by(df2, col_1, foo = abs(col_2)) %>%
mutate(check = n() > 1 & n_distinct(col_2) == 2) %>%
ungroup %>%
select(-foo)
col_1 col_2 check
<fct> <dbl> <lgl>
1 x 1 FALSE
2 x 1 FALSE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
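For comparison, the same condition can be checked without grouping. This is only a sketch in base R, assuming col_2 holds values that print exactly (such as integers); key and mirror_key are illustrative names.

```r
col_1 <- c("x", "x", "y", "y", "y", "z", "z")
col_2 <- c(-1, 1, 3, -3, 4, 7, 3)
df <- data.frame(col_1, col_2)

# Key for each row, and the key its sign-flipped counterpart would have
key        <- paste(df$col_1, df$col_2)
mirror_key <- paste(df$col_1, -df$col_2)

# TRUE when some row in the data carries the sign-flipped key
df$check <- mirror_key %in% key
df
```

One caveat: a row with col_2 equal to 0 is its own counterpart and would be flagged TRUE, which may or may not be what is wanted.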
You can try the following base R code, where a custom function f is defined to check the sum:
f <- function(v) {
unique(c(combn(seq(v),2)[,combn(v,2,sum)==0]))
}
dfout <- Reduce(rbind,
lapply(split(df,df$col_1),
function(v) {
v$col_3 <- F
v$col_3[f(v$col_2)] <- T
v
})
)
dfout <- dfout[order(as.numeric(rownames(dfout))),]
such that
> dfout
col_1 col_2 col_3
1 x -1 TRUE
2 x 1 TRUE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE

Recursively sum data frames for matching rows

I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(m1) %>% # operate on the merged result, not the global res
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
As the OP mentioned memory constraints, it would be more efficient to do the join first and then sum with rowSums (or + with reduce):
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs @akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read tables from disk, change tabs to file names and get to fread or whatever function.
I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.
Similarly (thanks to @akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}
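The same pairwise accumulation can be sketched in base R with Reduce() and aggregate(), for readers who do not use data.table. add_tables is a hypothetical helper name, and the sketch assumes every table has exactly the columns A, B and x.

```r
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# Fold one table into the running result, summing x for matching (A, B)
add_tables <- function(acc, tab) {
  aggregate(x ~ A + B, data = rbind(acc, tab), FUN = sum)
}

res <- Reduce(add_tables, list(df1, df2, df3))
res <- res[order(res$A, res$B), ]
res
```

Only the running result and one incoming table are combined at each step, at the cost of re-aggregating on every iteration.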

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T providing that rank is different for each unique id and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say: for a given id, which has anywhere between 1 and 13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After clarification that OP provided this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data-set that has also an id, which has duplicates but also non-duplicate rank (presented below) this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
# dg is TRUE when the (id, rank) combination occurs only once
dt[, dg := .N == 1, by = .(id, rank)]
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want an identifier other than TRUE/FALSE, using ifelse() is redundant and computationally costly. (@DavidArenburg)
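Following the note about ifelse() being redundant, here is a sketch of the per-id check from the update written in base R with ave() and a plain logical expression. It assumes the same rule as above: dg is TRUE when an id occurs only once or when its ranks are not all equal.

```r
df <- data.frame(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3, 3, 4))

# ave() recycles the single TRUE/FALSE per id across that id's rows;
# the result comes back as 0/1 (rank is numeric), hence the == 1
df$dg <- ave(df$rank, df$id,
             FUN = function(r) length(r) == 1 || length(unique(r)) > 1) == 1
df
```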

Add (not merge!) two data frames with unequal rows and columns

I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:
If a row/column pairing belongs to both input data frames I want the output to include their sum
If a row/column pairing belongs to just one input data frame I want to include that value in the output
If a row/column pairing does not belong to any input matrix I want to have 0 in that position in the output.
As an example, consider the following input data frames:
df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
x y
a 1 4
b 2 5
c 3 6
> df2
x z w
a 7 9 2
d 8 10 3
I want the final result to be
> df2
x y z w
a 8 4 9 2
b 2 5 0 0
c 3 6 0 0
d 8 0 10 3
What I've done so far -
bind_rows / bind_cols in dplyr can throw the following:
"Error: incompatible number of rows (3, expecting 2)"
I have duplicated column names, so 'merge' isn't working for my purposes either - returns an empty df for some reason.
Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:
library(dplyr)
df.new = df1 %>% add_rownames %>%
full_join(df2 %>% add_rownames, by="rowname") %>%
mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
mutate(x = x.x + x.y) %>%
select(rowname,x,y,z,w)
Or, with @DavidArenburg's much more elegant and extensible solution:
df.new = df1 %>% add_rownames %>%
full_join(df2 %>% add_rownames) %>%
group_by(rowname) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
df.new
rowname x y z w
1 a 8 4 9 2
2 b 2 5 0 0
3 c 3 6 0 0
4 d 8 0 10 3
This looks like a simple merge on the common column names (plus row names) followed by a simple aggregation. This is how I would tackle it:
library(data.table)
merge(setDT(df1, keep.rownames = TRUE), # Convert to data.table + keep rows
setDT(df2, keep.rownames = TRUE), # Convert to data.table + keep rows
by = intersect(names(df1), names(df2)), # merge on common column names
all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn] # Sum all columns by group
# rn x y z w
# 1: a 8 4 9 2
# 2: b 2 5 0 0
# 3: c 3 6 0 0
# 4: d 8 0 10 3
Or a pretty straightforward base R solution:
df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
First, I would grab the names of all the rows and columns of the new entity:
(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"
Then I would construct an output matrix with those rows and column names (with matrix data initialized to all 0s), adding df1 and df2 to the relevant parts of that matrix.
out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
# x y z w
# a 8 4 9 2
# b 2 5 0 0
# c 3 6 0 0
# d 8 0 10 3
Using xtabs on melted / stacked data frames:
out <- rbind(cbind(rn=rownames(df1),stack(df1)), cbind(rn=rownames(df2),stack(df2)))
as.data.frame.matrix(xtabs(values ~ rn + ind, data=out))
# x y w z
#a 8 4 2 9
#b 2 5 0 0
#c 3 6 0 0
#d 8 0 3 10
I’m not convinced the accepted (or alternative merge) method is the best. It will give incorrect results if the data frames hold identical values in their common columns for the same row: those rows get joined rather than summed.
This can be shown trivially by changing df2 to:
df2 = data.frame(x = c(1,2), y = c(4,5), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
expected results:
rn x y z w
1: a 2 8 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
actual results
merge(setDT(df1, keep.rownames = TRUE),
setDT(df2, keep.rownames = TRUE),
by = intersect(names(df1), names(df2)),
all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn]
rn x y z w
1: a 1 4 9 2
2: b 2 5 0 0
3: c 3 6 0 0
4: d 2 5 10 3
You need to combine the outer join with an inner join (or left/right joins, merge with all=T/all=F). Alternatively, use plyr’s rbind.fill:
Base R solution (plus plyr for rbind.fill; assumes the rn column holding the row names has already been added to df1 and df2):
library(plyr)
res <- rbind.fill(df1,df2)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
data.table solution:
as.data.table(rbind.fill(
setDT(df1, keep.rownames = TRUE),
setDT(df2, keep.rownames = TRUE)
))[, lapply(.SD, sum, na.rm = TRUE), by = rn]
I prefer the rbind.fill method as you can "merge" > 2 data frames using the same syntax.
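As a point of comparison, the zero-matrix idea from earlier in this thread also extends to more than two data frames. The sketch below uses only base R; add_frames is a hypothetical helper name, and it assumes all columns are numeric and row names identify the rows.

```r
add_frames <- function(frames) {
  # Union of all row and column names across the inputs
  all_rows <- unique(unlist(lapply(frames, rownames)))
  all_cols <- unique(unlist(lapply(frames, names)))
  out <- matrix(0, length(all_rows), length(all_cols),
                dimnames = list(all_rows, all_cols))
  # Add each frame into its own block of the output
  for (f in frames) {
    r <- rownames(f)
    cl <- names(f)
    out[r, cl] <- out[r, cl, drop = FALSE] + as.matrix(f)
  }
  out
}

df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), row.names = c("a", "b", "c"))
df2 <- data.frame(x = c(1, 2), y = c(4, 5), z = c(9, 10), w = c(2, 3),
                  row.names = c("a", "d"))
res <- add_frames(list(df1, df2))
res
```

On the modified df2 from this answer it should give the expected results shown above, with row a summed to x = 2 and y = 8.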

Bind data frames on longer identifiers R

I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing the data type from a factor to a logical it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data:
with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
then count the observations:
make another new variable that contains how many observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
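The count-and-filter idea can also be sketched in base R by comparing per-identifier counts directly. The names ids, n1, n2 and use_f1 are illustrative, and ties are resolved in favour of f1 here, which is an assumption (the dplyr version above would keep the rows from both frames on a tie).

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

# Count observations per identifier in each frame, over the union of ids
ids <- union(as.character(f1$x), as.character(f2$x))
n1 <- as.vector(table(factor(f1$x, levels = ids)))
n2 <- as.vector(table(factor(f2$x, levels = ids)))

# Identifiers for which f1 has at least as many observations as f2
use_f1 <- ids[n1 >= n2]

# Take those identifiers from f1 and all the others from f2
result <- rbind(f1[f1$x %in% use_f1, ], f2[!(f2$x %in% use_f1), ])
result <- result[order(result$x), ]
result
```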
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
