Check matches between two data frames in R - r

I have two almost identical dataframes containing the same people (df_A and df_B). I would now like to check for each person how many values in df_A and df_B match (e.g., Person 1 has 3 identical values in df_A and df_B, whereas Person 4 has 2 identical values).
I would like to create new variables that contain the information about the number of matching values.
df_A and df_B could look like this:
df_A <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 NA 1 1 NA NA 1
4 1 1 NA NA 1 NA
5 NA NA NA 1 1 1", header=TRUE)
df_B <- read.table(text=
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 1 NA 1 1 NA NA
4 1 1 1 NA NA NA
5 1 1 1 NA NA NA", header=TRUE)
Ideally, the end result would look like this:
df_C <- read.table(text=
"ID Matches
1 3
2 3
3 1
4 2
5 0", header=TRUE)
Do you have any ideas on how to achieve this most efficiently using R?
I'm relatively new to R and would like to learn how to solve such problems without lengthy code. Thanks for your hints!

Here's an idea.
library(dplyr)
library(tidyr)
left_join(df_A, df_B, by = 'ID') %>%
pivot_longer(-ID, names_pattern = '(.*)\\.[xy]') %>%
group_by(ID, name) %>%
# a match requires both values to be present and equal
summarise(matches = !any(is.na(value)) & n_distinct(value) == 1) %>%
summarise(matches = sum(matches))
#> # A tibble: 5 × 2
#> ID matches
#> <int> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 1
#> 4 4 2
#> 5 5 0
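For comparison, here is a base R sketch of the same count. It assumes both data frames have the same columns in the same order and that every ID occurs exactly once in each; df_B is first aligned to df_A by ID with match(). A comparison involving NA yields NA, which rowSums(..., na.rm = TRUE) ignores, so only pairs that are both present and equal are counted.

```r
df_A <- read.table(text =
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 NA 1 1 NA NA 1
4 1 1 NA NA 1 NA
5 NA NA NA 1 1 1", header = TRUE)
df_B <- read.table(text =
"ID Var_1 Var_2 Var_3 Var_4 Var_5 Var_6
1 1 NA NA 1 NA 1
2 NA NA NA 1 1 1
3 1 NA 1 1 NA NA
4 1 1 1 NA NA NA
5 1 1 1 NA NA NA", header = TRUE)

# Align the rows of df_B to df_A by ID (assumes unique IDs present in both)
df_B <- df_B[match(df_A$ID, df_B$ID), ]

# Element-wise comparison of all non-ID columns; NA results are dropped by na.rm
df_C <- data.frame(ID = df_A$ID,
                   Matches = rowSums(df_A[-1] == df_B[-1], na.rm = TRUE))
df_C
```

Because the comparison is vectorised over the whole table, no reshaping or grouping is needed, at the cost of relying on identical column layouts.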

Related

Subset dataframe in R using function inside select_if to make it conditional on a grouping variable?

I would like to conditionally subset a dataframe in R, using dplyr::select_if(). More specifically, I have a dataframe that is made up of a grouping variable and numerous other variables that contain a bunch of NAs:
data <- tibble(group = sort(rep(letters[1:5],3)),
var_1 = c(1,1,1,1,rep(NA,11)),
var_2 = c(1,1,1,1,1,1,rep(NA,9)),
var_3 = 1,
var_4 = c(1,1,rep(NA,10),1,1,1),
var_5 = c(1,1,1,1,1,1,NA,NA,NA,NA,NA,NA,1,1,1))
# A tibble: 15 x 6
group var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 1 1 1 1
2 a 1 1 1 1 1
3 a 1 1 1 NA 1
4 b 1 1 1 NA 1
5 b NA 1 1 NA 1
6 b NA 1 1 NA 1
7 c NA NA 1 NA NA
8 c NA NA 1 NA NA
9 c NA NA 1 NA NA
10 d NA NA 1 NA NA
11 d NA NA 1 NA NA
12 d NA NA 1 NA NA
13 e NA NA 1 1 1
14 e NA NA 1 1 1
15 e NA NA 1 1 1
In this dataframe, I need to identify and remove columns like var_4 in this case that only occur in one group (but irrespective of whether or not they show up in the last group: "e"). Importantly, everything else has to remain untouched (i.e. I want to keep variables that look like var_1,var_2,var_3, and var_5). This is what I tried:
library(dplyr)
data %>%
filter(group!="e") %>% # Ignore last group.
select_if(~ function(col)) %>% # Write function to look for cols that only have values for one group of the total four groups remaining (a-d).
names() -> cols_to_drop # Save col names.
data %>% select(-cols_to_drop) -> new_data # Subset by saved col names.
Unfortunately, I can't figure out how to write that function inside select_if() to specify that grouping variable condition.
A second thing that I have been wondering about is whether I can use select_if() to remove cols based on the percentage of NAs it contains. Is there a way?
I am not sure select_if can do this kind of grouped column selection.
Here is one way to do it by reshaping the data to long format:
library(dplyr)
cols <- data %>%
filter(group != "e") %>%
tidyr::pivot_longer(cols = starts_with('var')) %>%
group_by(name, group) %>%
summarise(value = any(!is.na(value))) %>%
summarise(value = sum(value)) %>%
filter(value > 1) %>%
pull(name)
#Select the columns
data %>% select(group, all_of(cols))
# group var_1 var_2 var_3 var_5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 1 1 1 1
# 3 a 1 1 1 1
# 4 b 1 1 1 1
# 5 b NA 1 1 1
# 6 b NA 1 1 1
# 7 c NA NA 1 NA
# 8 c NA NA 1 NA
# 9 c NA NA 1 NA
#10 d NA NA 1 NA
#11 d NA NA 1 NA
#12 d NA NA 1 NA
#13 e NA NA 1 1
#14 e NA NA 1 1
#15 e NA NA 1 1
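The second part of the question (dropping columns by their share of NAs) is not covered above. A minimal base R sketch follows; the threshold max_na_share is a hypothetical value chosen for illustration, and the data is rebuilt as a plain data.frame so the block runs without extra packages.

```r
data <- data.frame(group = sort(rep(letters[1:5], 3)),
                   var_1 = c(1, 1, 1, 1, rep(NA, 11)),
                   var_2 = c(1, 1, 1, 1, 1, 1, rep(NA, 9)),
                   var_3 = 1,
                   var_4 = c(1, 1, rep(NA, 10), 1, 1, 1),
                   var_5 = c(rep(1, 6), rep(NA, 6), 1, 1, 1))

max_na_share <- 0.5                  # hypothetical threshold: keep columns with at most 50% NA
na_share <- colMeans(is.na(data))    # fraction of NA values per column
kept <- data[na_share <= max_na_share]
names(kept)

# The same predicate should also work inside dplyr, e.g.
# select_if(data, ~ mean(is.na(.)) <= max_na_share)
```

With this threshold only group, var_3 and var_5 remain, since var_1, var_2 and var_4 are each more than half NA.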

r data formatting with NAs

If I have a dataset with three columns like this below
Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F
How could I clean this data such that I see a dataset like this below ?
Id Date Gender
1 03-11-1977 F
2 04-17-2005 M
3 06-04-1999 F
Thanks.
Fill the Gender values within each Id, then filter out the rows where Date is NA.
library(dplyr)
df %>%
group_by(Id) %>%
tidyr::fill(Gender, .direction = "updown") %>%
filter(!is.na(Date))
# Id Date Gender
# <int> <chr> <chr>
#1 1 03-11-1977 F
#2 2 04-17-2005 M
#3 3 06-04-1999 F
You may use na.omit in a by approach.
dat <- do.call(rbind, by(dat, dat$Id, function(x) cbind(x[1,1,drop=F], lapply(x[-1], na.omit))))
dat
# Id Date Gender
# 1 1 03-11-1977 F
# 2 2 04-17-2005 M
# 3 3 06-04-1999 F
Data:
dat <- read.table(header=T,text=' Id Date Gender
1 NA F
1 NA NA
1 03-11-1977 NA
2 04-17-2005 NA
2 NA M
3 NA NA
3 06-04-1999 NA
3 NA F')
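Another base R option, sketched under the assumption that each Id has at most one non-missing value per column: aggregate() keeps rows with NA in the value columns when called on a data frame, so a small helper (first_non_na, a name made up here) can pick the first non-missing entry per group.

```r
dat <- data.frame(Id = c(1, 1, 1, 2, 2, 3, 3, 3),
                  Date = c(NA, NA, "03-11-1977", "04-17-2005", NA,
                           NA, "06-04-1999", NA),
                  Gender = c("F", NA, NA, NA, "M", NA, NA, "F"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# Apply the helper to every non-Id column, one group per Id
res <- aggregate(dat[-1], by = dat["Id"], FUN = first_non_na)
res
```

If an Id could carry several different non-missing values in one column, this keeps only the first and silently drops the rest, so it is worth checking that assumption first.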

is there an R function to collapse duplicated rows while combining unique columns within these duplicated rows?

I want to collapse duplicated rows, by unique record ID, in order to consolidate unique variables that exist on these duplicated rows. Certain variables are only listed on one version of the duplicate row, while other variables that are unique exist on a different row of the duplicated record. I'm working in R. I'd like to just have records exist on one row, without losing any of the unique columns. One "sum-total" row basically, that collects each of the columns that may have been filled on different rows, so that this final row is not a duplicate, and shows each variable that could have been filled all together...
I've looked into merge and bind, and I've thought about writing an if rule, but the duplication varies by record (see example).
record Var1 var2 var3 var4 var5
2 1 1 NA NA NA
2 NA NA 1 1 1
3 2 2 NA NA NA
3 NA NA 2 NA NA
3 NA NA NA 2 2
4 1 1 NA NA NA
5 NA NA 1 1 1
5 NA 2 NA NA NA
desired output example of record 2:
record Var1 var2 var3 var4 var5
2 1 1 1 1 1
3 ....
With base R's aggregate:
aggregate(df[2:ncol(df)], by = df["record"], sum, na.rm = T)
#### OUTPUT ####
record Var1 var2 var3 var4 var5
1 2 1 1 1 1 1
2 3 2 2 2 2 2
3 4 1 1 0 0 0
4 5 0 2 1 1 1
With dplyr:
library(dplyr)
df %>% group_by(record) %>% summarize_all(sum, na.rm = T)
#### OUTPUT ####
# A tibble: 4 x 6
record Var1 var2 var3 var4 var5
<int> <int> <int> <int> <int> <int>
1 2 1 1 1 1 1
2 3 2 2 2 2 2
3 4 1 1 0 0 0
4 5 0 2 1 1 1
The only caveat is that a variable that is NA in every row of a record comes out as 0 instead of NA. But it's easy to change those back.
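One way to avoid the NA-to-0 conversion in the first place, rather than changing values back afterwards, is to return NA whenever a group has no values at all. This is a sketch with a made-up helper name (sum_or_na); like the answers above, it assumes that summing is an acceptable way to combine the rows of a record.

```r
df <- read.table(header = TRUE, text =
"record Var1 var2 var3 var4 var5
2 1 1 NA NA NA
2 NA NA 1 1 1
3 2 2 NA NA NA
3 NA NA 2 NA NA
3 NA NA NA 2 2
4 1 1 NA NA NA
5 NA NA 1 1 1
5 NA 2 NA NA NA")

# Hypothetical helper: NA if nothing was recorded, otherwise the sum of the values
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

out <- aggregate(df[-1], by = df["record"], FUN = sum_or_na)
out
```

This keeps genuine zeros distinguishable from variables that were never filled in for a record.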

Replace NA with 0 depending on group (rows) and variable names (column)

I have a large data set and want to replace many NAs, but not all.
In one group I want to replace all NAs with 0.
In the other group I want to replace all NAs with 0, but only in variables whose names do not include a certain string, e.g. 'b'.
Here is an example:
group <- c(1,1,2,2,2)
abc <- c(1,NA,NA,NA,NA)
bcd <- c(2,1,NA,NA,NA)
cde <- c(5,NA,NA,1,2)
df <- data.frame(group,abc,bcd,cde)
group abc bcd cde
1 1 1 2 5
2 1 NA 1 NA
3 2 NA NA NA
4 2 NA NA 1
5 2 NA NA 2
This is what I want:
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
This is what I tried:
#set 0 in first group: this works fine
df[is.na(df) & df$group==1] <- 0
#set 0 in second group but only if the variable name does not include b: does not work
df[is.na(df) & df$group==2 & !grepl('b',colnames(df))] <- 0
dplyr solutions are welcome as well as base R ones.
For the second group, create a logical column index with grepl and use it to subset the data while assigning:
j1 <- !grepl('b',colnames(df))
df[j1][df$group == 2 & is.na(df[j1])] <- 0
df
# group abc bcd cde
#1 1 1 2 5
#2 1 0 1 0
#3 2 NA NA 0
#4 2 NA NA 1
#5 2 NA NA 2
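The attempt in the question fails because the length-4 result of grepl() is recycled down the rows of the logical matrix rather than across its columns, so the flags do not line up with the variables. As a sketch, both groups can be handled in one assignment by expanding the row and column conditions to a full matrix with outer(); the variable names below are chosen for illustration only.

```r
df <- data.frame(group = c(1, 1, 2, 2, 2),
                 abc = c(1, NA, NA, NA, NA),
                 bcd = c(2, 1, NA, NA, NA),
                 cde = c(5, NA, NA, 1, 2))

no_b  <- !grepl("b", colnames(df))   # one flag per column
in_g1 <- df$group == 1               # one flag per row
in_g2 <- df$group == 2

# outer() combines the row flags and column flags into a rows x columns matrix;
# group 1 rows qualify in every column, group 2 rows only in columns without "b"
fill <- is.na(df) & (in_g1 | outer(in_g2, no_b, "&"))
df[fill] <- 0
df
```

Building the mask explicitly makes it easy to inspect which cells will be overwritten before assigning.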
Using dplyr::mutate_at you can also do:
library(dplyr)
vars_mutate_1 <- names(df)[-1]
vars_mutate_2 <- grep(x = names(df)[-1], pattern = '^(?!.*b).*$', perl = TRUE, value = TRUE)
df %>%
# funs() is deprecated in current dplyr, so formula lambdas are used instead
mutate_at(.vars = vars_mutate_1, .funs = ~ if_else(group == 1 & is.na(.), 0, .)) %>%
mutate_at(.vars = vars_mutate_2, .funs = ~ if_else(group == 2 & is.na(.), 0, .))
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2
Alternatively, you can use:
library(dplyr)
df2 <- df %>% mutate_at(vars(names(df)[-1]),
function(x) case_when((group==1 & is.na(x) ) ~ 0,
(group==2 & is.na(x) & !grepl("b",deparse(substitute(x)))) ~ 0,
TRUE ~ x))
> df2
group abc bcd cde
1 1 1 2 5
2 1 0 1 0
3 2 NA NA 0
4 2 NA NA 1
5 2 NA NA 2

Add a column to a dataframe using (extracting unique values) from existing columns

I am new to R, and was not able to find answers to the specific problem I have encountered.
If my dataframe looks like below:
d <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
No1 = c(1,2,3,1,1,1,3,3,3),
No2 = c(1,1,1,2,2,2,3,3,3))
Name No1 No2
Jon 1 1
Jon 2 1
Jon 3 1
Kel 1 2
Kel 1 2
Kel 1 2
Don 3 3
Don 3 3
Don 3 3
...
How would I be able to add new columns to the dataframe that hold the unique values found in columns No1 and No2 for each name, i.e. (1,2,3), (1,2), and (3) for Jon, Kel, and Don, respectively?
So, if the new columns are named ID#, the desired result would be:
d2 <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
No1 = c(1,2,3,1,1,1,3,3,3),
No2 = c(1,1,1,2,2,2,3,3,3),
ID1 = c(1,1,1,1,1,1,3,3,3),
ID2 = c(2,2,2,2,2,2,NA,NA,NA),
ID3 = c(3,3,3,NA,NA,NA,NA,NA,NA))
Name No1 No2 ID1 ID2 ID3
Jon 1 1 1 2 3
Jon 2 1 1 2 3
Jon 3 1 1 2 3
Kel 1 2 1 2 NA
Kel 1 2 1 2 NA
Kel 1 2 1 2 NA
Don 3 3 3 NA NA
Don 3 3 3 NA NA
Don 3 3 3 NA NA
A tidyverse approach:
library(dplyr)
library(tidyr)
# evaluate separately for each name
d %>% group_by(Name) %>%
# add a column of the unique values pasted together into a string
mutate(ID = paste(unique(c(No1, No2)), collapse = ' ')) %>%
# separate the string into individual columns, filling with NA and converting to numbers
separate(ID, into = paste0('ID', 1:3), fill = 'right', convert = TRUE)
## Source: local data frame [9 x 6]
## Groups: Name [3]
##
## Name No1 No2 ID1 ID2 ID3
## * <fctr> <dbl> <dbl> <int> <int> <int>
## 1 Jon 1 1 1 2 3
## 2 Jon 2 1 1 2 3
## 3 Jon 3 1 1 2 3
## 4 Kel 1 2 1 2 NA
## 5 Kel 1 2 1 2 NA
## 6 Kel 1 2 1 2 NA
## 7 Don 3 3 3 NA NA
## 8 Don 3 3 3 NA NA
## 9 Don 3 3 3 NA NA
Here's a nice base version with a basic split-apply-combine approach:
# store distinct values in No1 and No2
cols <- unique(unlist(d[,-1]))
# split No1 and No2 by Name,
ids <- data.frame(t(sapply(split(d[,-1], d$Name),
# find unique values for each split,
function(x){y <- unique(unlist(x))
# pad with NAs,
c(y, rep(NA, length(cols) - length(y)))
# and return a data.frame
})))
# fix column names
names(ids) <- paste0('ID', cols)
# turn rownames into column
ids$Name <- rownames(ids)
# join two data.frames on Name columns
merge(d, ids, sort = FALSE)
## Name No1 No2 ID1 ID2 ID3
## 1 Jon 1 1 1 2 3
## 2 Jon 2 1 1 2 3
## 3 Jon 3 1 1 2 3
## 4 Kel 1 2 1 2 NA
## 5 Kel 1 2 1 2 NA
## 6 Kel 1 2 1 2 NA
## 7 Don 3 3 3 NA NA
## 8 Don 3 3 3 NA NA
## 9 Don 3 3 3 NA NA
And just for kicks, here's a creative alternate base version that leverages table instead of splitting/grouping:
# copy d so as not to distort original with factor columns
d_f <- d
# make No* columns factors to ensure similar table structure
d_f[, -1] <- lapply(d[,-1], factor, levels = unique(unlist(d[, -1])))
# make tables of cols, sum to aggregate occurrences, and set as boolean mask for > 0
tab <- Reduce(`+`, lapply(d_f[, -1], table, d_f$Name)) > 0
# replace all TRUE values with values they tabulated
tab <- tab * matrix(as.integer(rownames(tab)), nrow = nrow(tab), ncol = ncol(tab))
# replace 0s with NAs
tab[tab == 0] <- NA
# store column names
cols <- paste0('ID', rownames(tab))
# sort each row, keeping NAs
tab <- data.frame(t(apply(tab, 2, sort, na.last = T)))
# apply stored column names
names(tab) <- cols
# turn rownames into column
tab$Name <- rownames(tab)
# join two data.frames on Name columns
merge(d, tab, sort = FALSE)
Results are identical.
library(dplyr)
library(tidyr)
d %>%
group_by(Name) %>%
mutate(unique_id = paste0(unique(c(No1, No2)), collapse = ",")) %>%
separate(., unique_id, paste0("id_", 1:max(c(.$No1, .$No2))), fill = "right")
We can also do this with a single external package, data.table. Convert the 'data.frame' to a 'data.table' (setDT(d)); grouped by 'Name', unlist the columns given in .SDcols and take the unique values; then dcast from 'long' to 'wide' format and join with the original dataset on the "Name" column.
library(data.table)
dcast(setDT(d)[, unique(unlist(.SD)) , Name, .SDcols = No1:No2],
Name~paste0("ID", rowid(Name)), value.var="V1")[d, on = "Name"]
# Name ID1 ID2 ID3 No1 No2
#1: Jon 1 2 3 1 1
#2: Jon 1 2 3 2 1
#3: Jon 1 2 3 3 1
#4: Kel 1 2 NA 1 2
#5: Kel 1 2 NA 1 2
#6: Kel 1 2 NA 1 2
#7: Don 3 NA NA 3 3
#8: Don 3 NA NA 3 3
#9: Don 3 NA NA 3 3
Or this can be done in one line by first pasting the unique elements in 'No1' and 'No2', grouped by 'Name', and then splitting the result into three columns with cSplit from splitstackshape.
library(splitstackshape)
cSplit(setDT(d)[, ID:= paste(unique(c(No1, No2)), collapse=" ") , Name], "ID", " ")
# Name No1 No2 ID_1 ID_2 ID_3
#1: Jon 1 1 1 2 3
#2: Jon 2 1 1 2 3
#3: Jon 3 1 1 2 3
#4: Kel 1 2 1 2 NA
#5: Kel 1 2 1 2 NA
#6: Kel 1 2 1 2 NA
#7: Don 3 3 3 NA NA
#8: Don 3 3 3 NA NA
#9: Don 3 3 3 NA NA
Or using the baseVerse just for kicks
d1 <- read.table(text=ave(unlist(d[-1]), rep(d$Name, 2),
FUN = function(x) paste(unique(x), collapse=" "))[1:nrow(d)],
header=FALSE, fill=TRUE, col.names= paste0("ID", 1:3))
cbind(d, d1)
# Name No1 No2 ID1 ID2 ID3
#1 Jon 1 1 1 2 3
#2 Jon 2 1 1 2 3
#3 Jon 3 1 1 2 3
#4 Kel 1 2 1 2 NA
#5 Kel 1 2 1 2 NA
#6 Kel 1 2 1 2 NA
#7 Don 3 3 3 NA NA
#8 Don 3 3 3 NA NA
#9 Don 3 3 3 NA NA
NOTE: No packages used and without much effort in splitting.
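Several of the answers above hard-code three ID columns (1:3). A base R sketch that derives the number of ID columns from the data instead is shown below; the intermediate names (uniq, width, padded, ids) are made up for illustration.

```r
d <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
                No1 = c(1, 2, 3, 1, 1, 1, 3, 3, 3),
                No2 = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Unique values of No1 and No2 for each Name
uniq <- lapply(split(d[-1], d$Name),
               function(x) unique(unlist(x, use.names = FALSE)))

# Number of ID columns needed is the largest set of unique values
width <- max(lengths(uniq))

# Pad every set with NA up to that width and stack into a matrix (one row per Name)
padded <- do.call(rbind, lapply(uniq, function(u) c(u, rep(NA, width - length(u)))))
colnames(padded) <- paste0("ID", seq_len(width))

# Look up the padded row for each original row and bind it on
ids <- padded[as.character(d$Name), , drop = FALSE]
rownames(ids) <- NULL
d2 <- cbind(d, ids)
d2
```

The values appear in order of first occurrence (No1 before No2), which matches the desired output in the question for this data.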

Resources