Merging datasets by id and maintaining one row for each id - r

I am attempting to merge 2 datasets belonging to a single id with a larger dataset.
However, I am having trouble merging the two single row datasets into a single row within the larger dataset.
Is there a simple way to merge with dplyr and only overwrite values if they are NA's?
My data:
df1 <- data.frame(id=1:5, b=6:10, c=c("a", "b", "c", "d", "e"), d=c(NA, 1,2,3, 4))
df2 <- data.frame(id=6, b=2, c="f", d=NA_real_)
df3 <- data.frame(id=6, b=NA_real_, c=NA_character_, d=5, e="a")
> df1
id b c d
1 1 6 a NA
2 2 7 b 1
3 3 8 c 2
4 4 9 d 3
5 5 10 e 4
> df2
id b c d
1 6 2 f NA
> df3
id b c d e
1 6 NA <NA> 5 a
My attempt:
merge1 <- dplyr::full_join(df1, df2) %>% full_join(df3)
Desired output:
output <- data.frame(id=1:6, b=c(6:10,2), c=c("a", "b", "c", "d", "e", "f"), d=c(NA, 1,2,3, 4, 5), e=c(NA,NA, NA, NA, NA, "a"))
> output
id b c d e
1 1 6 a NA <NA>
2 2 7 b 1 <NA>
3 3 8 c 2 <NA>
4 4 9 d 3 <NA>
5 5 10 e 4 <NA>
6 6 2 f 5 a
As opposed to:
id b c d e
1 1 6 a NA <NA>
2 2 7 b 1 <NA>
3 3 8 c 2 <NA>
4 4 9 d 3 <NA>
5 5 10 e 4 <NA>
6 6 2 f NA <NA>
7 6 NA <NA> 5 a
Thank you

You can try:
list(df1, df2, df3) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(~ first(na.omit(.)))
id b c d e
<dbl> <dbl> <chr> <dbl> <fct>
1 1 6 a NA <NA>
2 2 7 b 1 <NA>
3 3 8 c 2 <NA>
4 4 9 d 3 <NA>
5 5 10 e 4 <NA>
6 6 2 f 5 a

You can try:
library(tidyverse)
df1 %>%
mutate_if(is.factor, as.character) %>%
bind_rows(mutate_if(df2, is.factor, as.character)) %>%
left_join(select(df3, id, d, e), by = "id") %>%
mutate(d= ifelse(is.na(d.x), d.y, d.x)) %>%
select(-d.x, -d.y)
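For comparison, the "first non-NA value per id" idea can also be sketched in base R without dplyr. This is only an illustration under assumptions: first_non_na() and pad() are hypothetical helper names introduced here, not functions from any package, and the data frames are rebuilt with stringsAsFactors = FALSE so that c and e are plain character columns.

```r
df1 <- data.frame(id = 1:5, b = 6:10, c = c("a", "b", "c", "d", "e"),
                  d = c(NA, 1, 2, 3, 4), stringsAsFactors = FALSE)
df2 <- data.frame(id = 6, b = 2, c = "f", d = NA_real_,
                  stringsAsFactors = FALSE)
df3 <- data.frame(id = 6, b = NA_real_, c = NA_character_, d = 5, e = "a",
                  stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing value, or a typed NA if there is none.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[1] else x[NA_integer_]
}

# Hypothetical helper: add any missing columns (filled with NA) so that
# all data frames share the same columns in the same order.
inputs <- list(df1, df2, df3)
all_cols <- unique(unlist(lapply(inputs, names)))
pad <- function(df) {
  for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
  df[all_cols]
}

# Stack everything, then collapse each id to one row.
stacked <- do.call(rbind, lapply(inputs, pad))
merged <- do.call(rbind, lapply(split(stacked, stacked$id), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
}))
rownames(merged) <- NULL
merged
```

Because each column keeps only its first non-NA value within an id, a value from df2 is never overwritten by df3; df3 only fills the gaps.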

Related

How can I compare two columns of different lengths from two dataframes to check for matching values in R?

I have two dataframes that look something like this:
df1 <- data.frame(col1 = 1:7, col2 = c("a","b","c","d","e","f","g"))
df2 <- data.frame(col1 = c(1,2,3,4,10), col2 = c("a","c","f","g","z"))
df1
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
df2
col1 col2
1 1 a
2 2 c
3 3 f
4 4 g
5 10 z
I want to compare the values in col2 of each df and line up the columns of each df by the matches to get this:
col1 col2 col1.1 col2.1
1 1 a 1 a
2 2 b NA <NA>
3 3 c 2 c
4 4 d NA <NA>
5 5 e NA <NA>
6 6 f 3 f
7 7 g 4 g
Where ideally, the missing values from df1 are dropped and the missing values from df2 are filled in with NAs. Ultimately, I want to calculate what percent of the values in col2 of df1 have a match in col2 of df2.
Use left_join from dplyr
library(dplyr)
df1 %>%
left_join(df2, by="col2", keep = TRUE)
output:
col1.x col2.x col1.y col2.y
1 1 a 1 a
2 2 b NA <NA>
3 3 c 2 c
4 4 d NA <NA>
5 5 e NA <NA>
6 6 f 3 f
7 7 g 4 g
To get the percentage of match
out <- df1 %>%
left_join(df2, by="col2", keep = TRUE)
out %>%
filter(col2.x == col2.y) %>%
nrow()/nrow(out)*100
Result:
[1] 57.14286
If you want the % of match as a new column:
df1 %>%
left_join(df2, by="col2", keep = TRUE) %>%
mutate(percentage_match = sum(col2.x == col2.y, na.rm = TRUE)/nrow(.)*100)
output:
col1.x col2.x col1.y col2.y percentage_match
1 1 a 1 a 57.14286
2 2 b NA <NA> 57.14286
3 3 c 2 c 57.14286
4 4 d NA <NA> 57.14286
5 5 e NA <NA> 57.14286
6 6 f 3 f 57.14286
7 7 g 4 g 57.14286
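If only the percentage is needed, a join may not be necessary. As a minimal sketch in base R, assuming df1 and df2 hold the values printed in the question (the definitions below are reconstructed from that printed output), %in% tests each value of df1$col2 for membership in df2$col2:

```r
df1 <- data.frame(col1 = 1:7,
                  col2 = c("a", "b", "c", "d", "e", "f", "g"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c(1, 2, 3, 4, 10),
                  col2 = c("a", "c", "f", "g", "z"),
                  stringsAsFactors = FALSE)

# TRUE where a value of df1$col2 also occurs somewhere in df2$col2
has_match <- df1$col2 %in% df2$col2

# The mean of a logical vector is the share of TRUE values
pct_match <- mean(has_match) * 100
pct_match  # 4 of the 7 values match
```

Unlike the join, this counts each row of df1 once even if df2 contains the same value several times.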

Create all combinations of two variables from two dataframes while keeping all other variables in R

I have minimal data in two dataframes and am looking for all combinations of two variables, each in one of the dataframes. This can be achieved by expand.grid(). Is there a function to preserve all other variables from the two dataframes?
df1 <- data.frame(var1 = c("A", "B", "C"), var2 = c(1,2,3))
df2 <- data.frame(var1 = c("D", "E", "F"), var2 = c(4,5,6))
expand.grid(df1$var1, df2$var1)
Var1 Var2
1 A D
2 B D
3 C D
4 A E
5 B E
6 C E
7 A F
8 B F
9 C F
The expected result is all combinations plus all other variables, perhaps with a suffix.
Var1.x Var1.y var2.x var2.y
1 A D 1 4
2 B D 2 4
3 C D 3 4
4 A E 1 5
5 B E 2 5
6 C E 3 5
7 A F 1 6
8 B F 2 6
9 C F 3 6
With dplyr, you could use full_join(x, y, by = character()) to perform a cross-join, generating all combinations of x and y. In dplyr 1.1.0 and later, this form is deprecated in favor of cross_join(x, y).
library(dplyr)
full_join(df1, df2, by = character())
# var1.x var2.x var1.y var2.y
# 1 A 1 D 4
# 2 A 1 E 5
# 3 A 1 F 6
# 4 B 2 D 4
# 5 B 2 E 5
# 6 B 2 F 6
# 7 C 3 D 4
# 8 C 3 E 5
# 9 C 3 F 6
An alternative is expand_grid() or crossing() from tidyr to create a tibble from all combinations of inputs.
library(tidyr)
crossing(df1, df2, .name_repair = ~ paste0(.x, rep(c('.x', '.y'), each = 2)))
crossing() is a wrapper around expand_grid() that de-duplicates and sorts its inputs.
You can use Map along with cbind
> do.call(cbind, Map(function(x, y) expand.grid(x = x, y = y), df1, df2))
var1.x var1.y var2.x var2.y
1 A D 1 4
2 B D 2 4
3 C D 3 4
4 A E 1 5
5 B E 2 5
6 C E 3 5
7 A F 1 6
8 B F 2 6
9 C F 3 6
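Does this work? (Note that, as the output shows, it repeats each row-wise pairing three times rather than producing all nine combinations.)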
library(dplyr)
library(tidyr)
df1 %>% uncount(3) %>% cbind(df2 %>% uncount(3)) %>%
setNames(., c('Var1.x', 'Var2.x', 'Var1.y', 'Var2.y'))
Var1.x Var2.x Var1.y Var2.y
1 A 1 D 4
2 A 1 D 4
3 A 1 D 4
4 B 2 E 5
5 B 2 E 5
6 B 2 E 5
7 C 3 F 6
8 C 3 F 6
9 C 3 F 6
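Base R can also produce the full cross join directly. As a hedged sketch using the question's data: merge() with by = NULL returns the Cartesian product of the two data frames, and its default suffixes (.x and .y) are applied to the column names the inputs share. The name all_pairs is just an illustrative variable.

```r
df1 <- data.frame(var1 = c("A", "B", "C"), var2 = c(1, 2, 3))
df2 <- data.frame(var1 = c("D", "E", "F"), var2 = c(4, 5, 6))

# by = NULL: no key columns, so every row of df1 is paired with every row of df2
all_pairs <- merge(df1, df2, by = NULL)
all_pairs

# Sanity check: 3 x 3 = 9 distinct (var1.x, var1.y) pairs
nrow(unique(all_pairs[c("var1.x", "var1.y")]))
```

All other variables travel with their rows, which is what expand.grid() on the two var1 columns alone does not give.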

Collapse multiple rows with conditions

I would like to collapse multiple rows based on conditions using tidyverse. Here is my example:
df <- data.frame(value = c(2,2,2,2,1,1,1,1,1,1),
name1 = c("a", "a", "b", "b", 'c', "d", "e", NA, NA, NA),
name2 = c("x", "x", "x", "x", "x", "x", "y", NA, NA, NA))
I would like to collapse rows that share the same name1 and the same name2 into a single row, while keeping the rows with NA as they are. Any suggestions?
My desired output looks like this:
value name1 name2
1 2 a x
2 2 b x
3 1 c x
4 1 d x
5 1 e y
6 1 <NA> <NA>
7 1 <NA> <NA>
8 1 <NA> <NA>
Maybe this helps:
library(dplyr)
df %>%
filter(!duplicated(across(everything()))|if_any(everything(), is.na))
-output
value name1 name2
1 2 a x
2 2 b x
3 1 c x
4 1 d x
5 1 e y
6 1 <NA> <NA>
7 1 <NA> <NA>
8 1 <NA> <NA>
If it is based on selected number of columns
df %>%
filter(!duplicated(across(c(name1, name2)))|if_any(c(name1, name2), is.na))
Or in base R
df[!duplicated(df)|rowSums(is.na(df)) > 0,]
value name1 name2
1 2 a x
3 2 b x
5 1 c x
6 1 d x
7 1 e y
8 1 <NA> <NA>
9 1 <NA> <NA>
10 1 <NA> <NA>
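To see why the extra NA condition is needed, here is a small base R sketch on the question's data. It relies on the fact that duplicated() treats rows whose values are all equal, including equal NAs, as duplicates of each other; the variable names plain and kept are illustrative only.

```r
df <- data.frame(value = c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1),
                 name1 = c("a", "a", "b", "b", "c", "d", "e", NA, NA, NA),
                 name2 = c("x", "x", "x", "x", "x", "x", "y", NA, NA, NA))

# Plain de-duplication: the three NA rows are identical, so only one survives
plain <- df[!duplicated(df), ]
nrow(plain)  # 6 rows

# Adding "or the row contains an NA" keeps every NA row
kept <- df[!duplicated(df) | rowSums(is.na(df)) > 0, ]
nrow(kept)   # 8 rows
```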
Here is a dplyr alternative using a helper column to prepare to apply distinct()
library(dplyr)
df %>%
mutate(helper = paste0(name1, name2),
helper = ifelse(is.na(name1) | is.na(name2),
paste0(helper, row_number()), helper)
) %>%
distinct(helper, .keep_all = TRUE) %>%
select(-helper)
Outcome:
value name1 name2
1 2 a x
2 2 b x
3 1 c x
4 1 d x
5 1 e y
6 1 <NA> <NA>
7 1 <NA> <NA>
8 1 <NA> <NA>
Another tidyverse option could look as follows.
library(dplyr)
df %>%
filter(if_all(name1:name2, ~ !is.na(.))) %>%
distinct() %>%
bind_rows(filter(df, if_any(name1:name2, is.na)))
# value name1 name2
# 1 2 a x
# 2 2 b x
# 3 1 c x
# 4 1 d x
# 5 1 e y
# 6 1 <NA> <NA>
# 7 1 <NA> <NA>
# 8 1 <NA> <NA>

Assign ID to column with NA's

This must be easy but my brain is blocked!
I have this dataframe:
col1
<chr>
1 A
2 B
3 NA
4 C
5 D
6 NA
7 NA
8 E
9 NA
10 F
df <- structure(list(col1 = c("A", "B", NA, "C", "D", NA, NA, "E",
NA, "F")), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I want to add a column with uniqueID only for values that are not NA with tidyverse.
Expected output:
col1 uniqueID
<chr> <dbl>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
I have tried: n(), row_number(), cur_group_id ....
We could do this easily in data.table. Specify the condition in i (i.e. the non-NA elements in 'col1') and create the column 'uniqueID' as a sequence over those elements by assignment (:=).
library(data.table)
setDT(df)[!is.na(col1), uniqueID := seq_len(.N)]
-output
df
col1 uniqueID
1: A 1
2: B 2
3: <NA> NA
4: C 3
5: D 4
6: <NA> NA
7: <NA> NA
8: E 5
9: <NA> NA
10: F 6
In dplyr, we can use replace (note that because replace() starts from col1, the resulting column is character):
library(dplyr)
df %>%
mutate(uniqueID = replace(col1, !is.na(col1),
seq_len(sum(!is.na(col1)))))
-output
# A tibble: 10 x 2
col1 uniqueID
<chr> <chr>
1 A 1
2 B 2
3 <NA> <NA>
4 C 3
5 D 4
6 <NA> <NA>
7 <NA> <NA>
8 E 5
9 <NA> <NA>
10 F 6
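If the ID should be numeric, as in the expected output, one option is to start from an integer vector of NAs instead of from col1. A minimal base R sketch, using a plain vector col1 with the question's values (the same assignment would work on a data frame column):

```r
col1 <- c("A", "B", NA, "C", "D", NA, NA, "E", NA, "F")

# Start with an all-NA integer vector of the same length
uniqueID <- rep(NA_integer_, length(col1))

# Number only the positions where col1 is present
present <- !is.na(col1)
uniqueID[present] <- seq_len(sum(present))

data.frame(col1 = col1, uniqueID = uniqueID)
```

This numbers rows by position, so repeated values in col1 would each get their own ID.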
Another approach:
library(dplyr)
df %>%
mutate(UniqueID = cumsum(!is.na(col1)),
UniqueID = if_else(is.na(col1), NA_integer_, UniqueID))
# A tibble: 10 x 2
col1 UniqueID
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6
A base R option using match + na.omit + unique
transform(
df,
uniqueID = match(col1, na.omit(unique(col1)))
)
gives
col1 uniqueID
1 A 1
2 B 2
3 <NA> NA
4 C 3
5 D 4
6 <NA> NA
7 <NA> NA
8 E 5
9 <NA> NA
10 F 6
A weird tidyverse solution:
library(dplyr)
df %>%
mutate(id = ifelse(is.na(col1), 0, 1),
id = cumsum(id == 1),
id = ifelse(is.na(col1), NA, id))
# A tibble: 10 x 2
col1 id
<chr> <int>
1 A 1
2 B 2
3 NA NA
4 C 3
5 D 4
6 NA NA
7 NA NA
8 E 5
9 NA NA
10 F 6

Group by cumulative sums with conditions

In this dataframe:
df <- data.frame(
ID = c("C", "B", "B", "B", NA, "C", "A", NA, "B", "B", "B")
)
I'd like to group the rows using cumsum with two conditions: (i) cumsum should not continue if is.na(ID) and (ii) it should not continue if the next ID value is the same as the prior. I do meet condition (i) with this:
df %>%
group_by(grp = cumsum(!is.na(ID)))
# A tibble: 11 x 2
# Groups: grp [9]
ID grp
<chr> <int>
1 C 1
2 B 2
3 B 3
4 B 4
5 NA 4
6 C 5
7 A 6
8 NA 6
9 B 7
10 B 8
11 B 9
but I don't know how to implement condition (ii) too, to obtain the desired result:
1 C 1
2 B 2
3 B 2
4 B 2
5 NA 2
6 C 3
7 A 4
8 NA 4
9 B 5
10 B 5
11 B 5
I tried this, but it doesn't work:
df %>%
group_by(grp = cumsum(!is.na(ID) |!lag(ID,1) == ID))
Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:
library(data.table)
library(zoo)
rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
Using tidyr and dplyr, you could do the following (the group numbers start at 0 here; add 1 to match the desired output):
df %>%
mutate(grp = fill(., ID) %>% pull(),
grp = cumsum(grp != lag(grp, default = first(grp))))
ID grp
1 C 0
2 B 1
3 B 1
4 B 1
5 <NA> 1
6 C 2
7 A 3
8 <NA> 3
9 B 4
10 B 4
11 B 4
Using rle
library(zoo)
with(rle(na.locf0(df$ID)), rep(seq_along(values), lengths))
#[1] 1 2 2 2 2 3 4 4 5 5 5
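The same two steps (carry the last ID forward over NAs, then count changes) can be sketched in base R without zoo or data.table. This assumes the first ID is not NA; last_seen, filled and grp are illustrative names.

```r
ID <- c("C", "B", "B", "B", NA, "C", "A", NA, "B", "B", "B")

# Position of the most recent non-NA value at each row
last_seen <- cummax(ifelse(is.na(ID), 0L, seq_along(ID)))

# Carry the last observed ID forward over the NAs
filled <- ID[last_seen]

# Start a new group at row 1 and wherever the filled value changes
grp <- cumsum(c(TRUE, filled[-1] != filled[-length(filled)]))
grp
```

The NA rows inherit the group of the preceding ID, and a run of identical IDs stays in one group, which covers both conditions from the question.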

Resources