R: Count duplicates between two dataframes

I have two dataframes, df1 and df2. Both have a column 'ID'. For each row in df1, I would like to find out how many times its ID occurs in df2 and add that count to the row. If the ID does not occur in df2, the count should be 0.
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1_234 1 1
# 2 1_235 1 2
# 3 2_222 1 1
# 4 2_654 1 2
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1_234 1 1
# 2 1_235 1 2
# 3 1_234 1 1
# 4 3_234 1 2

Using dplyr:
Your data:
df1 <- data.frame(ID = c("1_234","1_235","2_222","2_654"),
a = c(1,1,1,1),
b = c(1,2,1,2))
df2 <- data.frame(ID = c("1_234","1_235","1_234","3_235"),
a = c(1,1,1,1),
b = c(1,2,1,2))
Edit: considering only the IDs:
library(dplyr)
# count each ID in df2, then attach the counts to df1; IDs absent from df2 get 0
output <- left_join(df1,
as.data.frame(table(df2$ID)),
by = c("ID" = "Var1")) %>%
mutate(Freq = ifelse(is.na(Freq), 0, Freq))
Output:
ID a b Freq
1 1_234 1 1 2
2 1_235 1 2 1
3 2_222 1 1 0
4 2_654 1 2 0
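
For readers without dplyr, the same idea can be sketched in base R: tabulate the IDs in df2, then merge the counts onto df1. This is only an illustration built on the sample data above. The column name Freq is kept to match the output shown, and merge() sorts the result by ID, which happens to preserve the row order here.

```r
df1 <- data.frame(ID = c("1_234", "1_235", "2_222", "2_654"),
                  a = c(1, 1, 1, 1),
                  b = c(1, 2, 1, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("1_234", "1_235", "1_234", "3_235"),
                  a = c(1, 1, 1, 1),
                  b = c(1, 2, 1, 2),
                  stringsAsFactors = FALSE)

# One row per distinct ID in df2, with its frequency in column Freq
freq <- as.data.frame(table(ID = df2$ID), stringsAsFactors = FALSE)

# Left join: keep every row of df1; IDs missing from df2 get NA
output <- merge(df1, freq, by = "ID", all.x = TRUE)

# Turn the NAs into zero counts
output$Freq[is.na(output$Freq)] <- 0L
output
```

If the original row order of df1 matters and the IDs are not already sorted, it may be safer to add a row index before merging and reorder afterwards.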

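A base R option using subset + aggregate (note that n ~ . groups on every column, so rows are matched on ID, a and b together, not on ID alone)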
subset(
aggregate(
n ~ .,
rbind(
cbind(df1, n = 1),
cbind(df2, n = 1)
), function(x) length(x) - 1
), ID %in% df1$ID
)
gives
ID a b n
1 1_234 1 1 2
2 2_222 1 1 0
3 1_235 1 2 1
4 2_654 1 2 0

I think you can do it with a simple sapply() in base R (no extra packages).
df1$count <- sapply(df1$ID, function(x) sum(df2$ID == x))

We may also use outer
df1$count <- rowSums(outer(df1$ID, df2$ID, FUN = `==`))
df1$count
[1] 2 1 0 0
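
To check that the two base R one-liners above agree, here is a small self-contained sketch on the sample IDs. It swaps sapply() for vapply(), which is my own substitution, not part of the answers, so that the result type is fixed as integer. The names count_vapply and count_outer are made up for the illustration.

```r
df1 <- data.frame(ID = c("1_234", "1_235", "2_222", "2_654"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("1_234", "1_235", "1_234", "3_235"),
                  stringsAsFactors = FALSE)

# For each ID in df1, count the matching entries in df2
count_vapply <- vapply(df1$ID, function(x) sum(df2$ID == x), integer(1),
                       USE.NAMES = FALSE)

# Same counts from a logical comparison matrix (rows: df1 IDs, cols: df2 IDs)
count_outer <- rowSums(outer(df1$ID, df2$ID, FUN = `==`))

# Both approaches should give the same per-row counts
stopifnot(all(count_vapply == count_outer))

df1$count <- count_vapply
df1
```

The outer() version builds a matrix with nrow(df1) * nrow(df2) cells, so for large inputs the element-wise version is probably the gentler choice on memory.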

We could use semi_join and n() to get the number of rows in df1 whose ID also appears in df2. Note that this returns a single total, not a per-row count:
library(dplyr)
df1 %>%
semi_join(df2, by="ID") %>%
summarise(duplicates_df1_df2 = n())
Output:
duplicates_df1_df2
1 2

Related

Multiplying column value by another value matching column name R

I have a data frame which looks like this:
Value1 = c("1","2","1","3")
Letter = c("A","B","B","A")
A = c("2","2","0","1")
B = c("1","1","1","0")
data <- data.frame(Value1,Letter,A,B)
data
Value1 Letter A B
1 1 A 2 1
2 2 B 2 1
3 1 B 0 1
4 3 A 1 0
I'm trying to add a new column which is the multiplication of column Value1, by column A or B depending on what is in the Letter column. The expected result would be:
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
I'm trying to use the match() function, but without success.
Thanks!
With base R:
data <- type.convert(data, as.is = TRUE)
data$Results <- ifelse(data$Letter == 'A', data$A * data$Value1, data$B * data$Value1)
Output
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Another option would be to pivot to long form, do the calculation, then pivot back to wide format.
library(tidyverse)
data %>%
type.convert(as.is = TRUE) %>%
pivot_longer(c(A, B)) %>%
mutate(Results = ifelse(Letter == name, value * Value1, NA_integer_)) %>%
pivot_wider(names_from = "name", values_from = "value") %>%
group_by(Value1, Letter) %>%
summarise_all(discard, is.na)
Output
Value1 Letter Results A B
<int> <chr> <int> <int> <int>
1 1 A 2 2 1
2 1 B 1 0 1
3 2 B 2 2 1
4 3 A 3 1 0
Use case_when or ifelse
library(dplyr)
data <- data %>%
type.convert(as.is = TRUE) %>%
mutate(Results = case_when(Letter == 'A' ~ A * Value1,
TRUE ~ B * Value1))
-output
data
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Or use get with rowwise
data <- data %>%
type.convert(as.is = TRUE) %>%
rowwise %>%
mutate(Result = get(Letter) * Value1) %>%
# or may also use
# mutate(Result = cur_data()[[Letter]] * Value1) %>%
ungroup
-output
data
# A tibble: 4 × 5
Value1 Letter A B Result
<int> <chr> <int> <int> <int>
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
In base R, we may use row/column indexing as a vectorized option
data <- type.convert(data, as.is = TRUE)
nm1 <- unique(data$Letter)
data$Results <- data[nm1][cbind(seq_len(nrow(data)),
match(data$Letter, nm1))] * data$Value1
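
The row/column indexing trick may be easier to follow step by step. The sketch below rebuilds the question's data with numeric columns (so no type.convert is needed) and indexes a matrix with a two-column matrix of (row, column) pairs. The names m, idx and picked are introduced only for this illustration.

```r
data <- data.frame(Value1 = c(1, 2, 1, 3),
                   Letter = c("A", "B", "B", "A"),
                   A = c(2, 2, 0, 1),
                   B = c(1, 1, 1, 0),
                   stringsAsFactors = FALSE)

# Matrix holding only the candidate columns
m <- as.matrix(data[c("A", "B")])

# One (row, column) pair per observation: row i, column named by Letter[i]
idx <- cbind(seq_len(nrow(data)), match(data$Letter, colnames(m)))

# Indexing a matrix with a two-column matrix picks one cell per pair
picked <- m[idx]

data$Results <- picked * data$Value1
data
```

If Letter can contain a value that is not a column name, match() returns NA for that row and the result is NA there, which seems a reasonable default.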

How to find duplicated columns in row in R?

I have the data frame below and I want to find the duplicated value in each row. Please see the input and output example below. 0 is repeated 2 times in the first row, so column rep should be 0 (data_input[1,"rep"]=0); 2 is repeated 2 times in the second row, so rep should be 2; there are no repeated values in the 3rd row, so rep can be 4 (or any value other than 0, 1, 2); and 1 is repeated 3 times in the 4th row, so rep should be 1.
data_input=data.frame(X1=c(0,1,2,1), X2=c(0,2,1,1),
X3=c(1,2,0,1))
data_output=data.frame(X1=c(0,1,2,1),
X2=c(0,2,1,1), X3=c(1,2,0,1), rep=c(0,2,4,1))
Here is an option with rowwise: group by row, then find the duplicated element in each row; if there is none, replace the NA with 4
library(dplyr)
library(tidyr)
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
replace_na(tmp[duplicated(tmp)][1], 4)
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <dbl>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
The solution above doesn't handle rows with more than one duplicated value. If that can happen, either create a list column or paste the unique duplicated elements together into a single string
data_input %>%
rowwise %>%
mutate(rep = {tmp <- c_across(everything())
tmp <- toString(sort(unique(tmp[duplicated(tmp)])))
replace(tmp, tmp == "", "4")
}) %>%
ungroup
-output
# A tibble: 4 × 4
X1 X2 X3 rep
<dbl> <dbl> <dbl> <chr>
1 0 0 1 0
2 1 2 2 2
3 2 1 0 4
4 1 1 1 1
Or using base R
data_input$rep <- apply(data_input, 1, FUN = \(x) x[anyDuplicated(x)][1])
data_input$rep[is.na(data_input$rep)] <- 4
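Another solution, based on base R (the repeated value is read from the names of table(x), because which.max returns a position in the table, not in x):
nCols <- ncol(data_input)
data_output <- cbind(
data_input, rep = apply(data_input, 1,
function(x) {
tab <- table(x)
if (length(tab) != nCols) as.numeric(names(tab)[which.max(tab)]) else nCols + 1
}))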
data_output
#> X1 X2 X3 rep
#> 1 0 0 1 0
#> 2 1 2 2 2
#> 3 2 1 0 4
#> 4 1 1 1 1
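
The base R answers rely on how anyDuplicated() behaves: it returns the position of the first repeated element, or 0 when nothing repeats. A minimal sketch that wraps this in a helper is shown below. The function name first_dup and its fill argument are hypothetical, chosen for this example only.

```r
# Return the first repeated value in x, or `fill` if all values are distinct
first_dup <- function(x, fill = 4) {
  pos <- anyDuplicated(x)  # index of the first repeat, 0 if there is none
  if (pos == 0) fill else x[[pos]]
}

data_input <- data.frame(X1 = c(0, 1, 2, 1),
                         X2 = c(0, 2, 1, 1),
                         X3 = c(1, 2, 0, 1))

# Apply the helper to every row of the three value columns
data_input$rep <- apply(data_input[c("X1", "X2", "X3")], 1, first_dup)
data_input
```

Like the first dplyr answer, this reports only the first repeated value, so a wider row with two different repeated values would need the toString() variant instead.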

How to cross tabulate the summary values across same field

This may have solutions/answers available here, but I am unable to find.
Let us assume a simple data like this
x <- data.frame(id = rep(1:3, each = 2),
v1 = c('A', 'B', 'A', 'B', 'A', 'C'))
> x
id v1
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 3 C
Now I want an output of relation of V1 column with itself, but across group on id something like this
v1 A B C
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
So, I proceeded like this..
library(tidyverse)
#merged the V1 column by itself with all = TRUE
x <- merge(x, x, by = "id", all = TRUE)
# removed same group rows
x <- x[x$v1.x != x$v1.y, ]
# final code
x %>% select(-id) %>%
group_by(v1.x, v1.y) %>%
summarise(val = n()) %>%
pivot_wider(names_from = v1.y, values_from = val, values_fill = 0L, names_sort = TRUE)
# A tibble: 3 x 4
# Groups: v1.x [3]
v1.x A B C
<chr> <int> <int> <int>
1 A 0 2 1
2 B 2 0 0
3 C 1 0 0
My question: is there a better or more direct method to obtain the cross-table?
How about creating a contingency table with xtabs (which can work with large data sets as well). Then, you can use crossprod on the table and set the diagonal to zero for the final result.
ct <- xtabs(~ id + v1, data = x)
cp <- crossprod(ct, ct)
diag(cp) <- 0
cp
Instead of xtabs, you can also create the cross-table with table. As noted by @A5C1D2H2I1M1N2O1R2T1, this simplifies to a nice one-liner:
"diag<-"(crossprod(table(x)), 0)
Output
v1
v1 A B C
A 0 2 1
B 2 0 0
C 1 0 0
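
Why crossprod works here: the contingency table has one row per id and one column per v1 value, so entry [i, j] of t(ct) %*% ct adds up, over all ids, the product of the counts for value i and value j. The sketch below spells that out on the question's data. It assumes each value appears at most once per id, as in the example; with repeats inside an id the products would be larger than a simple co-occurrence count.

```r
x <- data.frame(id = rep(1:3, each = 2),
                v1 = c("A", "B", "A", "B", "A", "C"),
                stringsAsFactors = FALSE)

# Incidence matrix: rows are ids, columns are v1 values
ct <- unclass(table(x$id, x$v1))

# crossprod(ct) is t(ct) %*% ct: co-occurrence counts across ids
cp <- crossprod(ct)
stopifnot(all(cp == t(ct) %*% ct))

# The diagonal counts how many ids contain each value; the question wants 0
ids_per_value <- diag(cp)
diag(cp) <- 0
cp
```

Here ids_per_value is an extra name kept only to show what the diagonal held before it was zeroed.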

R: Generating indicators that values differ within groups

I have a data frame where each row is an observation and I have two columns:
the group membership of the observation
the outcome for the observation.
I'm trying to create a new variable outcome_change that takes a value of 1 if outcome is NOT identical for all observations in a given group and 0 otherwise.
Shown in the below code (dat) is an example of the data I have. Meanwhile, dat_out1 shows what I'm looking for the code to produce in the presence of no NA values. The dat_out2 is identical except it shows that the same results arise when there are missing values in a group's values.
Surely there is some way to do this with dplyr::group_by()? I don't know how to make these comparisons within groups.
# Input (2 groups: 1 with identical values of outcome
# in the group (group a) and 1 with differing values of
# outcome in the group (group b)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
# Output 1: add a variable for all observations belonging to
# a group where the outcome changed within each group
dat_out1 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2),
outcome_change = c(0,0,0,1,1,1))
# Output 2: same as Output 1, but able to ignore NA values
dat_out2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA),
outcome_change = c(0,0,0,1,1,1))
Here is an approach:
library(tidyverse)
dat %>%
group_by(group) %>%
mutate(outcome_change = ifelse(length(unique(outcome[!is.na(outcome)])) > 1, 1, 0))
#output
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a 1 0
4 b 3 1
5 b 2 1
6 b 2 1
With dat2 (the version with NA values), the same code gives:
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a NA 0
4 b 3 1
5 b 2 1
6 b NA 1
library(dplyr)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
dat2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA))
dat_out1 <- dat %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome) == max(outcome), 0, 1))
dat_out2 <- dat2 %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome, na.rm = TRUE) == max(outcome, na.rm = TRUE), 0, 1))
Here is an option using data.table (dat1 here is the question's dat)
library(data.table)
setDT(dat1)[, outcome_change := as.integer(uniqueN(outcome[!is.na(outcome)])>1), group]
dat1
# group outcome outcome_change
#1: a 1 0
#2: a 1 0
#3: a 1 0
#4: b 3 1
#5: b 2 1
#6: b 2 1
If we apply the same with 'dat2'
dat2
# group outcome outcome_change2
#1: a 1 0
#2: a 1 0
#3: a NA 0
#4: b 3 1
#5: b 2 1
#6: b NA 1
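
For completeness, the same group-wise test can be sketched in base R with ave(), which applies a function within each group and writes the result back in the original row order. This is an alternative I am adding, not one of the posted answers. It uses the NA-containing data from the question.

```r
dat2 <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                   outcome = c(1, 1, NA, 3, 2, NA),
                   stringsAsFactors = FALSE)

# 1 if the group has more than one distinct non-missing outcome, else 0
changed <- function(v) as.numeric(length(unique(v[!is.na(v)])) > 1)

# ave() recycles the single value returned per group across that group's rows
dat2$outcome_change <- ave(dat2$outcome, dat2$group, FUN = changed)
dat2
```

A group whose outcomes are all NA gets 0 under this definition, since it has no distinct non-missing values to compare.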

Fill sequence by factor

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea is to split on Country, use setdiff to find the values missing from seq(max(Year)), and rbind them to the original data frame. Then use do.call to rbind the list back into a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
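Another data.table option is a keyed join against the full year sequence per Country. Note that the added rows get NA in Count, so they still need to be set to 0:
> setkey(DT,Country,Year)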
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(tibble(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the returned named list, merges it onto the original (which adds the desired rows), and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
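
None of the answers show how the input was created, so here is a self-contained base R sketch that builds the question's table and fills the gaps. The object names dat, full and res are my own. The approach (build every Country/Year pair, merge, replace NA with 0) follows the same idea as the answers above.

```r
dat <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  Year = c(1, 2, 4, 1, 3),
                  Count = c(1, 1, 2, 1, 1),
                  stringsAsFactors = FALSE)

# For each Country, list every Year from its minimum to its maximum
full <- do.call(rbind, lapply(split(dat, dat$Country), function(g) {
  data.frame(Country = g$Country[1],
             Year = seq(min(g$Year), max(g$Year)),
             stringsAsFactors = FALSE)
}))

# Keep all rows of the full grid; years absent from dat get NA in Count
res <- merge(full, dat, by = c("Country", "Year"), all.x = TRUE)

# Pad the missing counts with 0
res$Count[is.na(res$Count)] <- 0
res
```

Only Count is padded here; with more value columns, each would need its own default, much like the def list in the data.table answer.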
