Using coalesce function with many variables - r

I have two datasets with similar variables. dataset2 has values for variables that were not captured in dataset1. My aim is to use the dataset2 variables to fill in the corresponding missing values in dataset1. Is there a way to achieve this? It is possible to use coalesce, but listing all the variables is cumbersome.
library(dplyr)
dat1 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c("Test", "Tested", "Testing", NA),
a2= c("Math", "Eng", NA, "French"),
a3= c("Science", NA, "Biology", "Chem"))
dat2 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c(NA, NA, NA, "Tested"),
a2= c("Math", NA, "UK", NA),
a3= c("Science", "Physic", NA, NA))
dat1 %>%
  inner_join(dat2, by = "id") %>%
  mutate(a1 = coalesce(a1.x, a1.y),
         a2 = coalesce(a2.x, a2.y))

Another possible solution, based on powerjoin:
library(powerjoin)
library(tibble)
power_inner_join(dat1, dat2, by = "id", conflict = coalesce_xy)
#> # A tibble: 4 × 4
#> id a1 a2 a3
#> <chr> <chr> <chr> <chr>
#> 1 soo1 Test Math Science
#> 2 soo2 Tested Eng Physic
#> 3 soo3 Testing UK Biology
#> 4 soo4 Tested French Chem

You could also fill your values "downup" per group for every column like this:
library(dplyr)
library(tidyr)
dat1 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c("Test", "Tested", "Testing", NA),
a2= c("Math", "Eng", NA, "French"),
a3= c("Science", NA, "Biology", "Chem"))
dat2 <- tibble(
id = c("soo1", "soo2", "soo3", "soo4"),
a1= c(NA, NA, NA, "Tested"),
a2= c("Math", NA, "UK", NA),
a3= c("Science", "Physic", NA, NA))
dat1 %>%
bind_rows(dat2) %>%
group_by(id) %>%
fill(everything(), .direction = "downup") %>%
slice(1)
#> # A tibble: 4 × 4
#> # Groups: id [4]
#> id a1 a2 a3
#> <chr> <chr> <chr> <chr>
#> 1 soo1 Test Math Science
#> 2 soo2 Tested Eng Physic
#> 3 soo3 Testing UK Biology
#> 4 soo4 Tested French Chem
Created on 2022-07-18 by the reprex package (v2.0.1)

In dplyr, we may use rows_patch, which overwrites only the NA values in dat1 with the matching values from dat2:
library(dplyr)
rows_patch(dat1, dat2, by = 'id')
-output
# A tibble: 4 × 4
id a1 a2 a3
<chr> <chr> <chr> <chr>
1 soo1 Test Math Science
2 soo2 Tested Eng Physic
3 soo3 Testing UK Biology
4 soo4 Tested French Chem
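
If rows_patch or an extra package is not an option, the join-then-coalesce idea from the question can be applied to every shared column without typing the names out. This is only a minimal sketch, assuming id is the single key and that every other column present in both tables should be filled from dat2; the names shared, joined and res are illustrative and do not come from the posts above.

```r
library(dplyr)

dat1 <- tibble(
  id = c("soo1", "soo2", "soo3", "soo4"),
  a1 = c("Test", "Tested", "Testing", NA),
  a2 = c("Math", "Eng", NA, "French"),
  a3 = c("Science", NA, "Biology", "Chem"))
dat2 <- tibble(
  id = c("soo1", "soo2", "soo3", "soo4"),
  a1 = c(NA, NA, NA, "Tested"),
  a2 = c("Math", NA, "UK", NA),
  a3 = c("Science", "Physic", NA, NA))

# Columns present in both tables, excluding the join key (assumed to be "id")
shared <- setdiff(intersect(names(dat1), names(dat2)), "id")

# The join produces a1.x / a1.y, a2.x / a2.y, ... for every shared column
joined <- inner_join(dat1, dat2, by = "id", suffix = c(".x", ".y"))

# Coalesce each .x/.y pair, preferring the value from dat1
for (v in shared) {
  joined[[v]] <- coalesce(joined[[paste0(v, ".x")]], joined[[paste0(v, ".y")]])
}

res <- select(joined, id, all_of(shared))
res
```

The loop keeps the dat1 value wherever it is present, which matches the order of arguments in the question's coalesce(a1.x, a1.y) calls.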

Related

dplyr join with OR condition?

I am wondering whether there is any way, preferably in the tidyverse, to join two dataframes based on OR conditions.
There are two dataframes: df_obs and df_event.
a) The join should happen if there is a match between
the obs_id and event_id; and obs_date or event_date, or both are NA.
OR
b) the obs_date and event_date are identical; and obs_id, event_id, or both are NA.
The match should not happen if obs_id is not identical to event_id (if both are not NA) OR
obs_date and event_date are not identical (both being not NA).
The result should look like df_res below. The column 'event' from the df_event is added to the df_obs.
I have seen the answer to this question, but maybe there is a way around SQL?
df_obs <- tibble::tribble(
~obs, ~obs_date, ~obs_id,
"a1", NA, 10L,
"a2", "01/01/2000", NA,
"b", "02/01/2000", NA,
"a3", "03/01/2000", 10L
)
df_obs
#> # A tibble: 4 × 3
#> obs obs_date obs_id
#> <chr> <chr> <int>
#> 1 a1 <NA> 10
#> 2 a2 01/01/2000 NA
#> 3 b 02/01/2000 NA
#> 4 a3 03/01/2000 10
df_event <- tibble::tribble(
~event, ~event_date, ~event_id,
"A", "01/01/2000", 10L,
"B", "02/01/2000", NA
)
df_event
#> # A tibble: 2 × 3
#> event event_date event_id
#> <chr> <chr> <int>
#> 1 A 01/01/2000 10
#> 2 B 02/01/2000 NA
df_res <- tibble::tribble(
~obs, ~obs_date, ~obs_id, ~event,
"a1", NA, 10L, "A",
"a2", "01/01/2000", NA, "A",
"b", "02/01/2000", NA, "B",
"a3", "03/01/2000", 10L, NA
)
df_res
#> # A tibble: 4 × 4
#> obs obs_date obs_id event
#> <chr> <chr> <int> <chr>
#> 1 a1 <NA> 10 A
#> 2 a2 01/01/2000 NA A
#> 3 b 02/01/2000 NA B
#> 4 a3 03/01/2000 10 <NA>
Created on 2022-09-13 with reprex v2.0.2
I can come up only with this solution:
df_obs %>%
  left_join(df_event, by = c("obs_id" = "event_id"), na_matches = 'never') %>%
  mutate(event = ifelse(!(is.na(obs_date) | is.na(event_date) | obs_date == event_date), NA, event)) %>%
  select(-event_date) %>%
  left_join(df_event, by = c("obs_date" = "event_date"), na_matches = 'never') %>%
  mutate(event.y = ifelse(!(is.na(obs_id) | is.na(event_id) | obs_id == event_id), NA, event.y)) %>%
  select(-event_id) %>%
  mutate(event = ifelse(is.na(event.x), event.y, event.x)) %>%
  select(-c(event.x, event.y))
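
The matching rule can also be written once, as a filter over all pairs of rows, instead of two joins that are patched up afterwards. The following is a sketch rather than a definitive answer. It assumes dplyr 1.1.0 or later (for cross_join), that obs uniquely identifies each observation, and that the tables are small enough for a full cross join. The helpers same() and conflict() and the names pairs and res are hypothetical, introduced only for this illustration.

```r
library(dplyr)

df_obs <- tibble::tribble(
  ~obs, ~obs_date,    ~obs_id,
  "a1", NA,           10L,
  "a2", "01/01/2000", NA,
  "b",  "02/01/2000", NA,
  "a3", "03/01/2000", 10L
)
df_event <- tibble::tribble(
  ~event, ~event_date,  ~event_id,
  "A",    "01/01/2000", 10L,
  "B",    "02/01/2000", NA
)

# TRUE only when both values are present and equal
same <- function(a, b) !is.na(a) & !is.na(b) & a == b
# TRUE only when both values are present and different
conflict <- function(a, b) !is.na(a) & !is.na(b) & a != b

# Keep a pair when nothing contradicts it and at least one key really matches
pairs <- cross_join(df_obs, df_event) %>%
  filter(!conflict(obs_id, event_id),
         !conflict(obs_date, event_date),
         same(obs_id, event_id) | same(obs_date, event_date)) %>%
  select(obs, event)

# Re-attach to df_obs so that observations without any event are kept
res <- left_join(df_obs, pairs, by = "obs")
res
```

On the example data this gives events A, A, B and NA for a1, a2, b and a3, in line with df_res. If one observation could match several events, the left join would return one row per match.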

How to count rows with NA values across a selection of columns and include 0 count?

I am trying to count the number of species per region which have missing data (NA) for a selection of variables.
Here is an example of my dataframe:
library(tidyverse)
df <- structure(
  list(
    ID = c("AL01", "AL01", "AL02", "AL02", "AL03", "AL03"),
    Species = c("Sp1", "Sp2", "Sp3", "Sp4", "Sp5", "Sp6"),
    Var1 = c("A", NA, NA, NA, "B", "B"),
    Var2 = c(NA, "A", "B", "C", "B", "C"),
    Var3 = c(NA, 2.71, 2.86, 3.21, 2.87, 3.05),
    Var4 = c("S", NA, "C", NA, "S", "C")
  ),
  class = "data.frame",
  row.names = c(NA, 6L)
)
I can get the count of species with NA for any of Var2, Var3 or Var4 by running:
df %>%
  filter_at(
    vars(Var2, Var3, Var4),
    any_vars(is.na(.))
  ) %>%
  group_by(ID) %>%
  count()
# A tibble: 2 × 2
# Groups: ID [2]
ID n
<chr> <int>
1 AL01 2
2 AL02 1
However this only shows me AL01 and AL02 and I would also like to include AL03 for which the count is 0. I have tried this code which I thought should work:
df %>%
group_by(ID) %>%
summarise_at(vars(
Var2,
Var3,
Var4
), ~ sum(any_vars(is.na(.))))
But I get this error:
Error in `summarise()`:
! Problem while computing `Var2 = (structure(function (..., .x = ..1, .y = ..2, . =
..1) ...`.
ℹ The error occurred in group 1: ID = "AL01".
Caused by error in `abort_quosure_op()`:
! Summary operations are not defined for quosures. Do you need to unquote the
quosure?
# Bad: sum(myquosure)
# Good: sum(!!myquosure)
Run `rlang::last_error()` to see where the error occurred.
I realise I am not sure exactly how any_vars works and am unclear on how to continue. The output I would like would be:
# A tibble: 3 × 2
# Groups: ID [3]
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
You can do:
library(tidyverse)
df %>%
mutate(missing = apply(across(num_range('Var', 2:4)), 1, function(x) any(is.na(x)))) %>%
group_by(ID) %>%
summarize(n = sum(missing))
# A tibble: 3 x 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
Another option, using rowwise() and c_across():
df %>%
rowwise() %>%
mutate(across(num_range('Var', 2:4), is.na),
x = any(c_across(num_range('Var', 2:4)))) %>%
group_by(ID) %>%
summarise(n = sum(x))
# A tibble: 3 × 2
ID n
<chr> <int>
1 AL01 2
2 AL02 1
3 AL03 0
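
The filter-based attempt loses AL03 because the rows are removed before grouping, so a group with no NA rows has nothing left to count. any_vars() is only meaningful inside the scoped filter helpers such as filter_at(), which is why it fails inside summarise_at(). A sketch of the same idea as the answers above, assuming dplyr 1.0.4 or later so that if_any() is available; the column name missing and the object res are illustrative.

```r
library(dplyr)

df <- data.frame(
  ID = c("AL01", "AL01", "AL02", "AL02", "AL03", "AL03"),
  Species = c("Sp1", "Sp2", "Sp3", "Sp4", "Sp5", "Sp6"),
  Var1 = c("A", NA, NA, NA, "B", "B"),
  Var2 = c(NA, "A", "B", "C", "B", "C"),
  Var3 = c(NA, 2.71, 2.86, 3.21, 2.87, 3.05),
  Var4 = c("S", NA, "C", NA, "S", "C")
)

res <- df %>%
  # Flag rows with at least one NA in Var2..Var4; no rows are dropped
  mutate(missing = if_any(Var2:Var4, is.na)) %>%
  group_by(ID) %>%
  # Summing the logical flag counts the TRUEs, so groups without NA get 0
  summarise(n = sum(missing))
res
```

Because every row survives until the summarise step, each ID appears in the result even when its count is zero.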

Filter by group and conditions

I have this type of data, where Sequ is a grouping variable:
df <- data.frame(
Sequ = c(1,1,1,
2,2,2,
3,3,
4,4),
Answerer = c("A", NA, NA, "A", NA, NA, "B", NA, "C", NA),
PP_by = c(rep("A",5), rep("B",5)),
pp = c(0.1,0.2,0.3, 1, NA, NA, NA, NA, NA, NA)
)
I need to remove any Sequ where
(i) Answerer == PP_by AND
(ii) there is any NA in pp
I've tried this, but it obviously implements only the second condition (ii):
library(dplyr)
df %>%
group_by(Sequ) %>%
filter(
all(!is.na(pp))
)
The expected result is:
Sequ Answerer PP_by pp
1 1 A A 0.1
2 1 <NA> A 0.2
3 1 <NA> A 0.3
9 4 C B NA
10 4 <NA> B NA
EDIT:
I've come up with this solution:
df %>%
group_by(Sequ) %>%
filter(
first(Answerer) != first(PP_by)
|
all(!is.na(pp))
)
Here's another way:
df %>%
group_by(Sequ) %>%
filter(!(
any(Answerer == PP_by, na.rm = TRUE) &
any(is.na(pp))
))
# # A tibble: 5 × 4
# # Groups: Sequ [2]
# Sequ Answerer PP_by pp
# <dbl> <chr> <chr> <dbl>
# 1 1 A A 0.1
# 2 1 NA A 0.2
# 3 1 NA A 0.3
# 4 4 C B NA
# 5 4 NA B NA
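
To see why each Sequ is kept or removed, it can help to compute the two conditions per group first and filter afterwards. This is a sketch of the same logic as the answer above, not a different method; the names flags, answerer_matches, pp_has_na, drop and kept are made up for the illustration.

```r
library(dplyr)

df <- data.frame(
  Sequ = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  Answerer = c("A", NA, NA, "A", NA, NA, "B", NA, "C", NA),
  PP_by = c(rep("A", 5), rep("B", 5)),
  pp = c(0.1, 0.2, 0.3, 1, NA, NA, NA, NA, NA, NA)
)

# One row per Sequ with both conditions spelled out
flags <- df %>%
  group_by(Sequ) %>%
  summarise(
    answerer_matches = any(Answerer == PP_by, na.rm = TRUE),  # condition (i)
    pp_has_na = any(is.na(pp)),                               # condition (ii)
    drop = answerer_matches & pp_has_na
  )
flags

# Remove every Sequ flagged for dropping
kept <- anti_join(df, filter(flags, drop), by = "Sequ")
kept
```

On the example data, Sequ 2 and 3 meet both conditions and are removed. Sequ 1 has no NA in pp, and in Sequ 4 the Answerer never equals PP_by, so both are kept.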

Wide to long with pivot_longer and mix of numeric and character data

help <- data.frame(
id = c(100, 100, 101, 102, 102),
q1 = c(NA, 1, NA, NA, 3),
q2 = c(1, NA, 2, NA, NA),
q3 = c(NA, 1, NA, 4, NA),
q4 = c(NA, NA, 4, NA, 5),
group = c("a", "b", "c", "a", "c"))
help$group <- as.character(help$group)
I am trying to pivot longer so dataset looks like this:
id score group
100 NA a
100 1 b
100 NA c
...
But I get an error with the numeric values of q1-q4 and the character string group.
pivot_longer(help, !id, names_to = "score",
values_to = "group", values_ptypes = list(group = 'character'))
Error: Can't convert <double> to <character>.
How can I pivot longer but also preserve the group variable? (Many of the q1-q4 values are missing, but every row has an id and a group.)
Pivot only the q1-q4 columns and keep id and group as identifier columns:
library(tidyr)
output <- pivot_longer(help, -c(id, group), names_to = "question",
values_to = "score") %>%
dplyr::select(-question) %>%
dplyr::arrange(id, group)
Output
head(output)
# A tibble: 6 × 3
id group score
<dbl> <chr> <dbl>
1 100 a NA
2 100 a 1
3 100 a NA
4 100 a NA
5 100 b 1
6 100 b NA
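
The original error arises because !id also selects group, so pivot_longer tries to stack a character column together with the numeric q columns. If the rows with missing scores are not needed, they can be dropped during the pivot. This is a sketch under that assumption, which the question does not state; the question column is kept here so that each score can still be traced back to its item.

```r
library(tidyr)

help <- data.frame(
  id = c(100, 100, 101, 102, 102),
  q1 = c(NA, 1, NA, NA, 3),
  q2 = c(1, NA, 2, NA, NA),
  q3 = c(NA, 1, NA, 4, NA),
  q4 = c(NA, NA, 4, NA, 5),
  group = c("a", "b", "c", "a", "c"))

# Pivot only q1..q4; id and group stay as identifier columns.
# values_drop_na = TRUE removes the rows whose score is NA.
long <- pivot_longer(help, cols = q1:q4,
                     names_to = "question",
                     values_to = "score",
                     values_drop_na = TRUE)
long
```

With the example data this leaves eight rows, one per answered question.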

R: Data tidying when identifiers and variables are in the same column

I came across the following (stylized) data cleaning problem:
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63")
)
df
#> first_column second_column
#> 1 country1 <NA>
#> 2 variable1 15
#> 3 variable2 16
#> 4 country2 <NA>
#> 5 variable1 62
#> 6 variable2 63
Created on 2020-11-02 by the reprex package (v0.3.0)
I was trying to convert this to a "tidy" (i.e. long or wide format) using pivot_longer_spec and pivot_wider_spec respectively, but couldn't work it out. There seems to be very little documentation on these functions, and it is difficult for me to find out how to specify the arguments correctly.
Can anyone tell me how to approach this problem, using either these functions or others?
Many thanks.
An alternative solution with the zoo package:
library(zoo)
library(dplyr)
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63"))
df %>%
dplyr::mutate(COUNTRY = ifelse(is.na(second_column), first_column, NA)) %>%
dplyr::mutate(COUNTRY = zoo::na.locf(COUNTRY)) %>%
dplyr::filter(!is.na(second_column)) %>%
tidyr::pivot_wider(names_from = first_column, values_from = second_column)
# A tibble: 2 x 3
COUNTRY variable1 variable2
<chr> <chr> <chr>
1 country1 15 16
2 country2 62 63
This could be achieved like so:
The "tricky" part is to put the country identifiers in a separate column. I do this with an ifelse that conditions on NA values in the second column (as in the approach by @DPH), then fill the country column downwards and filter to get rid of the "country rows".
After this we can simply pivot_wider:
library(tidyr)
library(dplyr)
df <- data.frame(first_column = c("country1", "variable1", "variable2","country2", "variable1", "variable2"),
second_column = c(NA, "15", "16", NA, "62", "63")
)
df %>%
mutate(country = ifelse(is.na(second_column), first_column, NA)) %>%
tidyr::fill(country) %>%
filter(first_column != country) %>%
tidyr::pivot_wider(names_from = "first_column", values_from = "second_column")
#> # A tibble: 2 x 3
#> country variable1 variable2
#> <chr> <chr> <chr>
#> 1 country1 15 16
#> 2 country2 62 63
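
Both answers return variable1 and variable2 as character columns, because second_column is character in the input. If the values are meant to be numbers, a conversion step can be appended. This is a sketch assuming that every value column starts with "variable" and contains only numeric strings; the object name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  first_column = c("country1", "variable1", "variable2",
                   "country2", "variable1", "variable2"),
  second_column = c(NA, "15", "16", NA, "62", "63"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Header rows are the ones without a value; copy their label to a new column
  mutate(country = ifelse(is.na(second_column), first_column, NA)) %>%
  fill(country) %>%
  # Drop the header rows themselves
  filter(!is.na(second_column)) %>%
  pivot_wider(names_from = first_column, values_from = second_column) %>%
  # Convert the value columns from character to numeric
  mutate(across(starts_with("variable"), as.numeric))
wide
```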
