Count number of duplicates in other dataframe - r

I have two data.frames, A and B. Both of them have a column called key.
Now I'd like to know how many duplicates for A$key there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. What's the easiest way to do this?

Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor is needed here so that we also 'count' entries that do not appear in B$key, that is, C.
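To see why the factor matters, here is a minimal sketch using the question's data (the names plain and full are only illustrative). Without explicit levels, table only reports values that actually occur in B$key, so C is silently dropped.

```r
A <- data.frame(key = c("A", "B", "C", "D"))
B <- data.frame(key = c("A", "A", "B", "B", "B", "D"))

# Plain table: only values present in B$key get a cell
plain <- table(B$key)
names(plain)

# Table over a factor whose levels come from A$key: absent values get a 0
full <- table(factor(B$key, levels = sort(unique(A$key))))
full
```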

A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
filter(key %in% A$key) %>% # keep values that appear in A
count(key) %>% # count values
complete(key = A$key, fill = list(n = 0)) # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1

Using tidyverse you can do:
A %>%
left_join(B %>% #Merging df A with df B for which the count in "key" was calculated
group_by(key) %>%
tally(), by = c("key" = "key")) %>%
mutate(n = ifelse(is.na(n), 0, n)) #Replacing NA with 0
key n
1 A 2
2 B 3
3 C 0
4 D 1

Do you actually mean how many occurrences of each value of A$key there are in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0
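As a possible alternative to the ifelse call, the same "occurrences minus one, floored at zero" rule can be sketched with pmax. This assumes the question's A and B; converting the table to a plain vector first is just a cautious choice, not a requirement stated in the answer.

```r
A <- data.frame(key = c("A", "B", "C", "D"))
B <- data.frame(key = c("A", "A", "B", "B", "B", "D"))

# Occurrences of each A$key value in B$key
o <- table(factor(B$key, levels = unique(A$key)))

# Duplicates = occurrences beyond the first, never below zero
dupes <- setNames(pmax(as.vector(o) - 1L, 0L), names(o))
dupes
```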

Related

Create cross-tabulation of most frequent value of string variable and sort by frequency

I have a sample dataset:
df <- data.frame(category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A", "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"), year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1))
and would like to create a cross-tabulation of year and category such that only the 3 most frequent categories are in the table, sorted by total number of occurrences:
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
Using something like
df %>%
add_count(category) %>%
filter(n %in% tail(sort(unique(n)),3)) %>%
arrange(desc(n)) %>% {table(.$category, .$year)}
will filter for the three most occurring categories but leave the table unsorted
1 2 3
B 4 2 0
C 2 1 1
D 1 2 2
This should give you what you want.
# Make a table
df.t <- table(df)
# Order by top occurrences (sum over margin 1)
df.t <- df.t[order(apply(df.t, 1, sum), decreasing=TRUE),]
# Keep top 3 results
df.t <- df.t[1:3,]
Output:
year
category 1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
You'd want to arrange by the rowsums after creating table. If you want to stay (more) within tidyverse, e.g.:
df |>
janitor::tabyl(category, year) |>
arrange(desc(rowSums(across(where(is.numeric))))) |>
head(3)
Here with janitor::tabyl(), but you could also use dplyr::tally() and tidyr::pivot_wider() directly or do df |> table() |> as.data.frame.matrix() like @Adamm.
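For comparison, the same idea (build the table first, then order by row sums) can be sketched in base R without extra packages. The names tab, top and top3 are made up for this example.

```r
df <- data.frame(
  category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A",
               "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"),
  year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1)
)

# Cross-tabulate category (rows) by year (columns)
tab <- table(df$category, df$year)

# Names of the three categories with the largest row totals, largest first
top <- names(head(sort(rowSums(tab), decreasing = TRUE), 3))

# Subset the table by name, which also puts the rows in that order
top3 <- tab[top, ]
top3
```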
It's not an elegant solution, but it works using base R:
result <- as.data.frame.matrix(table(df))
result$sum <- rowSums(result)
result <- result[order(-result$sum),]
result <- result[1:3,]
result$sum <- NULL
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1

How to count the frequency of unique factor across each row in r dataframe

I have a dataset like the following:
Age Monday Tuesday Wednesday
6-9 a b a
6-9 b b c
6-9 c a
9-10 c c b
9-10 c a b
Using R, I want to get the following data set/ results (where each column represents the total frequency of each of the unique factor):
Age a b c
6-9 2 1 0
6-9 0 2 1
6-9 1 0 1
9-10 0 1 2
9-10 1 1 1
Note: My data also contains missing values
A couple of quick and dirty tidyverse solutions; there should be a way to reduce the steps, though.
library(tidyverse) # install.packages("tidyverse")
input <- tribble(
~Age, ~Monday, ~Tuesday, ~Wednesday,
"6-9", "a", "b", "a",
"6-9", "b", "b", "c",
"6-9", "", "c", "a",
"9-10", "c", "c", "b",
"9-10", "c", "a", "b"
)
# pivot solution
input %>%
rowid_to_column() %>%
mutate(across(where(is.character), ~ na_if(.x, ""))) %>% # only character columns; rowid is integer
pivot_longer(cols = -c(rowid, Age), values_drop_na = TRUE) %>%
count(rowid, Age, value) %>%
pivot_wider(id_cols = c(rowid, Age), names_from = value, values_from = n, values_fill = list(n = 0)) %>%
select(-rowid)
# manual solution (if only a, b, c are expected as options)
input %>%
unite(col = "combine", Monday, Tuesday, Wednesday, sep = "") %>%
transmute(
Age,
a = str_count(combine, "a"),
b = str_count(combine, "b"),
c = str_count(combine, "c")
)
In base R, we can replace empty values with NA, get unique values in the dataframe, and use apply row-wise and count the occurrence of values using table.
df[df == ''] <- NA
vals <- unique(na.omit(unlist(df[-1])))
cbind(df[1], t(apply(df[-1], 1, function(x) table(factor(x, levels = vals)))))
# Age a b c
#1 6-9 2 1 0
#2 6-9 0 2 1
#3 6-9 1 0 1
#4 9-10 0 1 2
#5 9-10 1 1 1
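The base R answer does not define df, so here is a self-contained sketch with the question's data typed in by hand (the missing Monday value is entered as NA, which is an assumption about how the data is stored). Instead of apply and table it compares the day columns against each value and sums per row; this is a variant of the same idea, not the answer's exact method.

```r
df <- data.frame(
  Age       = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday    = c("a", "b", NA, "c", "c"),
  Tuesday   = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Distinct non-missing values across the day columns
all_vals <- unlist(df[-1], use.names = FALSE)
vals <- sort(unique(all_vals[!is.na(all_vals)]))

# For each value, count per row how many day columns equal it (NA is ignored)
counts <- sapply(vals, function(v) rowSums(df[-1] == v, na.rm = TRUE))

result <- cbind(df[1], counts)
result
```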

How can I row-bind only the unmatched values from the second table onto the first table

How can I row-bind only the values from the second table that have no match in the first table's column?
library(gtools)
df1 <- data.frame(a = c("a", "b", "c"), number=c(4,3,2))
df2 <- data.frame(a = c("a", "b", "c", "k", "z"))
# fill in non-overlapping columns with 0
df2[setdiff(names(df1), names(df2))] <- 0
rbind(df1, df2)
This is the output of my code:
a number
1 a 4
2 b 3
3 c 2
4 a 0
5 b 0
6 c 0
7 k 0
8 z 0
This is the output I want; it should only add the unmatched values as new rows of the first table:
a number
1 a 4
2 b 3
3 c 2
4 k 0
5 z 0
Try to left join df2 and df1 and replace NA with 0.
df3 <- merge(df2, df1, all.x = TRUE)
df3$number[is.na(df3$number)] <- 0
df3
# a number
#1 a 4
#2 b 3
#3 c 2
#4 k 0
#5 z 0
Using dplyr, you could do the same by
library(dplyr)
df2 %>%
left_join(df1, by = "a") %>%
mutate(number = replace(number, is.na(number), 0))
Or another option using match
df3 <- df2
df3$number <- df1$number[match(df2$a, df1$a)]
df3$number[is.na(df3$number)] <- 0
data
df1 <- data.frame(a = c("a", "b", "c"), number=c(4,3,2))
df2 <- data.frame(a = c("a", "b", "c", "k", "z"))
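If you would rather stay close to the rbind attempt in the question, one possible sketch is to keep only the df2 rows whose a is missing from df1 and bind those. The names extra and result are made up for this example.

```r
df1 <- data.frame(a = c("a", "b", "c"), number = c(4, 3, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(a = c("a", "b", "c", "k", "z"),
                  stringsAsFactors = FALSE)

# Rows of df2 whose key does not occur in df1
extra <- df2[!df2$a %in% df1$a, , drop = FALSE]

# Give the new rows the default value for the missing column
extra$number <- 0

result <- rbind(df1, extra)
result
```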

R group-by issue: summing over multiple, overlapping ID combinations

I'm using a group-by function on a dataset in R, but the target groups of the IDs overlap. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With a traditional group-by on each ID, I can do
library(data.table)
DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group the IDs as A+B, B+C and D
(say F = A+B and G = B+C),
and the target result is the dataset below:
ID V1
F 6
G 6
D 2
If I recode ID, B, which belongs to both groups, would be recoded twice.
Does anyone have a solution?
Many thanks!
library(dplyr)
library(tidyr)
df <- df %>% mutate(F=ifelse(ID %in% c("A", "B"), 1, 0),
G = ifelse(ID %in% c("B", "C"), 1, 0),
D = ifelse(ID == "D", 1, 0))
df %>%
gather(var, val, F:D) %>%
filter(val==1) %>%
group_by(var) %>%
summarise(V1 = sum(Var1)) # the value column in the data is Var1
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
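Another way to think about it, sketched in base R: keep the group definitions in a named list and sum Var1 over the IDs in each group, so an ID such as B can belong to several groups. The name dataset follows the question; groups and res are hypothetical names for this example.

```r
dataset <- data.frame(
  ID   = c("A", "A", "B", "C", "C", "D"),
  Var1 = c(1, 3, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Each target group lists the IDs it is made of; B appears in both F and G
groups <- list(F = c("A", "B"), G = c("B", "C"), D = "D")

# Sum Var1 over all rows whose ID belongs to the group
res <- sapply(groups, function(ids) sum(dataset$Var1[dataset$ID %in% ids]))

data.frame(ID = names(res), V1 = unname(res))
```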

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
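If you want the "count value" format from the question and no zero entries, a base R sketch along the same lines could look like this. The helper fmt_counts is hypothetical, and the values come out in alphabetical order, so group b reads "2 r, 1 u" rather than "1 u, 2 r".

```r
df <- data.frame(
  a = c(10, 20, 20, 20, 30),
  b = c("u", "u", "u", "r", "r"),
  c = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Format the counts of one group as "2 r, 1 u", skipping values with count 0
fmt_counts <- function(x) {
  tab <- table(x)
  tab <- tab[tab > 0]
  paste(tab, names(tab), collapse = ", ")
}

a_m <- tapply(df$a, df$c, mean)              # mean of a per group of c
counts_b <- tapply(df$b, df$c, fmt_counts)   # formatted counts of b per group

df2 <- data.frame(c = names(a_m), a_m = as.vector(a_m),
                  counts_b = as.vector(counts_b), stringsAsFactors = FALSE)
df2
```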
Here is another solution using reshape2. The output format may be more convenient to work with, each value of b has its own column with the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by="c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
