R: Joining Tibbles Using Derived Column Values

Consider the following tibbles:
library(tidyverse)
tbl_base_ids <- tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"),
                       base_id_length = c(3, 6, 9),
                       record_id_length = c(10, 12, 15))
tbl_records <- tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))
I'd like to join matching rows to produce a third tibble, tbl_records_with_base, with these columns:
record_id
base_id
base_id_length
record_id_length
As you can see, this is not a matter of joining on one or more variables from each of the first two tibbles; the match has to be made on values derived from those variables. In SQL, I'd do this:
SELECT A.record_id,
       B.base_id,
       B.base_id_length,
       B.record_id_length
FROM tbl_records A
JOIN tbl_base_ids B
  ON LENGTH(A.record_id) = B.record_id_length
 AND LEFT(A.record_id, B.base_id_length) = B.base_id
I've tried variations of dplyr joins and the match function, but to no avail. Can someone help? Thank you.

You should come up with some logic to derive base_id from record_id, because joining only on record_id_length would not be enough. For this example we can get base_id by removing all digits from record_id; based on your actual dataset you may need to change this rule.
Once we have that, we can join tbl_records with tbl_base_ids by base_id and record_id_length.
library(dplyr)
tbl_records %>%
  mutate(base_id = sub('\\d+', '', record_id),
         record_id_length = nchar(record_id)) %>%
  inner_join(tbl_base_ids, by = c("base_id", "record_id_length")) -> result
result
# record_id base_id record_id_length base_id_length
# <chr> <chr> <dbl> <dbl>
#1 ABC1234567 ABC 10 3
#2 ABCDEF123456 ABCDEF 12 6
#3 ABCDEFGHI123456 ABCDEFGHI 15 9
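A minimal sketch of an alternative that mirrors the SQL predicate directly, without deriving base_id first (this assumes dplyr >= 1.1.0 for cross_join()):
library(dplyr)
tbl_records %>%
  cross_join(tbl_base_ids) %>%
  filter(nchar(record_id) == record_id_length,             # LENGTH(A.record_id) = B.record_id_length
         substr(record_id, 1, base_id_length) == base_id)  # LEFT(A.record_id, B.base_id_length) = B.base_id
For three rows per table the Cartesian product is tiny; on large tables the cross join gets expensive, which is where a derived-column or regex join shines.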

I suggest using the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
tbl_base_ids %>%
  mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
  regex_full_join(tbl_records, ., by = c("record_id" = "record_ptn"))
# # A tibble: 3 x 5
# record_id base_id base_id_length record_id_length record_ptn
# <chr> <chr> <dbl> <dbl> <chr>
# 1 ABC1234567 ABC 3 10 ^ABC.{7}$
# 2 ABCDEF123456 ABCDEF 6 12 ^ABCDEF.{6}$
# 3 ABCDEFGHI123456 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$
A note about this: the order of the tables matters; the column holding the regex must be on the right-hand side of the by= pairing. For instance, this does not work if we reverse it:
tbl_base_ids %>%
  mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
  regex_full_join(., tbl_records, by = c("record_ptn" = "record_id"))
# # A tibble: 6 x 5
# base_id base_id_length record_id_length record_ptn record_id
# <chr> <dbl> <dbl> <chr> <chr>
# 1 ABC 3 10 ^ABC.{7}$ <NA>
# 2 ABCDEF 6 12 ^ABCDEF.{6}$ <NA>
# 3 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$ <NA>
# 4 <NA> NA NA <NA> ABC1234567
# 5 <NA> NA NA <NA> ABCDEF123456
# 6 <NA> NA NA <NA> ABCDEFGHI123456

Related

merge three datasets using left_join and get unique suffixes

Can I merge three datasets using left_join and get unique suffixes for all three datasets?
My dummy data:
df1 <- tibble(x = 1:5, y = c("names", "matches", "multiple", "rows", "different"))
df2 <- tibble(x = 3:5, y = c("first", "second", "third"))
df3 <- tibble(x = 2:4, y = 1:3)
left_join(df1, df2, by = 'x', suffix = c(".first", ".second")) %>%
  left_join(., df3, by = 'x', suffix = c("third", "third"))
# # A tibble: 5 x 4
# x y.first y.second y
# <int> <chr> <chr> <int>
# 1 1 names <NA> NA
# 2 2 matches <NA> 1
# 3 3 multiple first 2
# 4 4 rows second 3
# 5 5 different third NA
What I'm looking to obtain (note the '.third' suffix in the name of the last column):
# # A tibble: 5 x 4
# x y.first y.second y.third
# <int> <chr> <chr> <int>
# 1 1 names <NA> NA
# 2 2 matches <NA> 1
# 3 3 multiple first 2
# 4 4 rows second 3
# 5 5 different third NA
Try this: give the first join an empty suffix, so that both remaining y columns collide in the second join and get suffixed there:
left_join(df1, df2, by = 'x', suffix = c(".first", "")) %>%
  left_join(., df3, by = 'x', suffix = c(".second", ".third"))
x y.first y.second y.third
<int> <chr> <chr> <int>
1 1 names NA NA
2 2 matches NA 1
3 3 multiple first 2
4 4 rows second 3
5 5 different third NA
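If the suffix juggling feels fragile, another sketch (assuming purrr is available for imap() and reduce()) renames each y column up front so no suffixes are needed at all:
library(dplyr)
library(purrr)
list(first = df1, second = df2, third = df3) %>%
  imap(~ rename(.x, !!paste0("y.", .y) := y)) %>%   # y -> y.first / y.second / y.third
  reduce(left_join, by = "x")
This also generalises to any number of tables, since the list names drive the renaming.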

fuzzyjoin based on relative difference

I understand that the fuzzyjoin::difference_*_join() functions join two tables based on the absolute difference between columns. Is there an R function that will join tables based on relative/percentage differences? I could do it with a full_join() + filter(), but I suspect there is a more straightforward way.
Minimal example as follows:
library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"), scores = c(14, 45, 82))
# joining based on absolute difference
difference_full_join(df_1, df_2,
                     by = c("scores"),
                     max_dist = 5,
                     distance_col = "abs_diff")
# A tibble: 4 x 5
id.x scores.x id.y scores.y abs_diff
<chr> <dbl> <chr> <dbl> <dbl>
1 wombat 10 wombat 14 4
2 jerry 50 jerry 45 5
3 akow 75 NA NA NA
4 NA NA akow 82 NA
## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, by = "id") %>%
  dplyr::filter((abs(scores.x - scores.y) / scores.x) <= 0.10)
# A tibble: 2 x 3
id scores.x scores.y
<chr> <dbl> <dbl>
1 jerry 50 45
2 akow 75 82
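One option, sketched here: fuzzyjoin's generic fuzzy_*_join() functions accept a custom match_fun per pair of by columns, so the 10% relative threshold can be supplied directly. The definition of relative difference below (difference divided by the left-hand value, matching the filter() above) is an assumption.
library(dplyr)
library(fuzzyjoin)
fuzzy_inner_join(df_1, df_2,
                 by = c("id", "scores"),
                 match_fun = list(`==`,                                   # id must match exactly
                                  function(x, y) abs(x - y) / x <= 0.10)) # scores within 10%
As with the difference_*_join() calls, the matched columns come back as scores.x and scores.y.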

Can I extract the value from several columns by column's name?

library(dplyr)
mydf <- data.frame(a_x = c(1, 2, 3, 4, 5),
                   b_x = c(8, 9, 10, 11, 12),
                   a_y = c("k", "b", "a", "d", "z"),
                   b_y = c("aa", "bb", "cc", "dd", "ee"),
                   prefix = c("a", "b", "c", "a", "a"))
mydf
Assuming that my data is mydf, I would like to produce the same result as mydf2.
I made a column (prefix) that identifies the columns containing the values to be extracted, and I want to extract the values through this column.
mydf2 <- data.frame(a_x = c(1, 2, 3, 4, 5),
                    b_x = c(8, 9, 10, 11, 12),
                    prefix = c("a", "b", "c", "a", "a"),
                    desired_x_value = c(1, 9, NA, 4, 5),
                    desired_y_value = c("k", "bb", NA, "d", "z"))
mydf2
I've tried 'get' and 'paste0', but it doesn't work. Can I solve this problem with a 'dplyr' chain?
mydf %>% mutate(desired_x_value = get(paste0(prefix, "_x")),
                desired_y_value = get(paste0(prefix, "_y")))
So basically you want to create new columns (desired_x_value and desired_y_value) whose values depend on a condition. Using dplyr, I prefer case_when() as it is the most readable way to do it, but you could also use (nested) if/else statements. It reads as: "if X meets condition A do Y, if X meets condition B do Z, if X meets condition ... do ...".
mydf %>%
  dplyr::mutate(
    desired_x_value = case_when(
      prefix == "a" ~ a_x,
      prefix == "b" ~ b_x,
      TRUE ~ NA_real_),
    desired_y_values = case_when(
      prefix == "a" ~ a_y,
      prefix == "b" ~ b_y,
      TRUE ~ NA_character_))
You can remove the columns you don't need anymore in a second step if you want. The code above results in the table:
a_x b_x a_y b_y prefix desired_x_value desired_y_values
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA <NA>
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can write a helper function for this :
get_value <- function(data, prefix, group) {
data[cbind(1:nrow(data), match(paste(prefix, group, sep = '_'), names(data)))]
}
mydf %>%
mutate(desired_x_value = get_value(select(., ends_with('_x')), prefix, 'x'),
desired_y_value = get_value(select(., ends_with('_y')), prefix, 'y'))
# a_x b_x a_y b_y prefix desired_x_value desired_y_value
#1 1 8 k aa a 1 k
#2 2 9 b bb b 9 bb
#3 3 10 a cc c NA <NA>
#4 4 11 d dd a 4 d
#5 5 12 z ee a 5 z
A simple rowwise() approach also works.
library(stringr)   # for str_detect()

mydf %>% rowwise() %>%
  mutate(desired_x = ifelse(any(str_detect(names(mydf)[-5], prefix)),
                            get(paste(prefix, 'x', sep = '_')), NA),
         desired_y = ifelse(any(str_detect(names(mydf)[-5], prefix)),
                            get(paste(prefix, 'y', sep = '_')), NA))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc c NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
If the prefix column doesn't contain any invalid prefixes, this works without the ifelse statement.
mydf <- data.frame(a_x = c(1, 2, 3, 4, 5),
                   b_x = c(8, 9, 10, 11, 12),
                   a_y = c("k", "b", "a", "d", "z"),
                   b_y = c("aa", "bb", "cc", "dd", "ee"),
                   prefix = c("a", "b", "a", "a", "a"))

mydf %>% rowwise() %>%
  mutate(desired_x = get(paste(prefix, 'x', sep = '_')),
         desired_y = get(paste(prefix, 'y', sep = '_')))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x desired_y
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc a 3 a
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
First, I would like to say that I am not presenting this as a good solution; the other proposed solutions are much better and simpler. However, since you brought up the get function, I wanted to show how to make use of it to get your desired output. Some of the values in your prefix column, such as "c", have no match among your column names, and get throws an error that terminates execution; unlike mget, it has no ifnotfound argument. So you need a way around that error, by means of an ifelse:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(glue)
mydf %>%
  mutate(desired_x_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
                                                get(glue("{.x}_x")), NA)),
         desired_y_value = map(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
                                                get(glue("{.x}_y")), NA))) %>%
  unnest(cols = c(desired_x_value, desired_y_value))
# A tibble: 5 x 7
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
You can also use the paste function instead of glue, and if we already know the output types of the desired columns, we can use map_dbl()/map_chr() and drop the final unnest() line:
mydf %>%
  mutate(desired_x_value = map_dbl(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
                                                    get(paste(.x, "x", sep = "_")), NA)),
         desired_y_value = map_chr(prefix, ~ ifelse(any(str_detect(names(mydf)[-5], .x)),
                                                    get(paste(.x, "y", sep = "_")), NA)))
# A tibble: 5 x 7
# Rowwise:
a_x b_x a_y b_y prefix desired_x_value desired_y_value
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 8 k aa a 1 k
2 2 9 b bb b 9 bb
3 3 10 a cc NA NA NA
4 4 11 d dd a 4 d
5 5 12 z ee a 5 z
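Yet another sketch, which avoids get() altogether: pivot the value columns to long form, keep the rows whose name prefix matches, and join back to the original data. The helper names row and p below are arbitrary.
library(dplyr)
library(tidyr)
mydf %>%
  mutate(row = row_number()) %>%
  pivot_longer(-c(prefix, row), names_to = c("p", ".value"), names_sep = "_") %>%
  filter(p == prefix) %>%                                        # keep the matching prefix per row
  select(row, desired_x_value = x, desired_y_value = y) %>%
  right_join(mutate(mydf, row = row_number()), by = "row") %>%   # rows with no match come back as NA
  select(-row)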

Filter groups using a lagged column

I'm working on creating some error reports, and one of the items I'm trying to address is potential errors within the ID column id_1. I've made an alternative id column, id_2, from various identifying features within the rows. To help, I've also created a date_lag column on date to catch items that were entered within a specific period after the initial entry. The main problem I'm having is returning the entire group that meets the criteria, including the first entry, which has an NA in date_lag; if I simply allow the NA values through, I get more than just the items I'm looking for (id_1 1 and 2 below).
Example:
library(data.table)
library(dplyr)

# id_1: column where potential errors lie
# id_2: alternative id column I'm using to test
df <- data.table(id_1 = c(1:4, 1:4),
                 id_2 = c(rep(c("b", "a"), c(2, 2))),
                 date = c(rep(1, 4), rep(20, 2), rep(10, 2)))
df %>%
  group_by(id_2) %>%
  mutate(date_lag = date - lag(date)) %>%
  filter(between(date_lag, 0, 10) | is.na(date_lag))
# A tibble: 7 x 4
# Groups:   id_2 [2]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
1 b 1 NA
2 b 1 0
3 a 1 NA
4 a 1 0
2 b 20 0
3 a 10 9
4 a 10 0
Expected:
# A tibble: 4 x 4
# Groups:   id_2 [1]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
3 a 1 NA
4 a 1 NA
3 a 10 9
4 a 10 9
Perhaps we can use diff:
library(dplyr)
df %>%
  group_by(id_1) %>%
  filter(between(diff(date), 0, 10))
-output
# A tibble: 4 x 3
# Groups: id_1 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 3 a 1
#2 4 a 1
#3 3 a 10
#4 4 a 10
Concatenate with NA, as diff returns a vector one element shorter than the original data:
df %>%
  group_by(id_2) %>%
  filter(between(c(NA, diff(date)), 0, 10))
# A tibble: 5 x 3
# Groups: id_2 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 2 b 1
#2 4 a 1
#3 2 b 20
#4 3 a 10
#5 4 a 10
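If the goal is to keep every row of a qualifying group, including the leading NA row, a group-level any() condition is another sketch worth considering. It assumes, as in the diff() approach above, that qualification is decided per id_1 group:
library(dplyr)
df %>%
  group_by(id_1) %>%
  filter(any(between(date - lag(date), 0, 10), na.rm = TRUE)) %>%  # keep the whole group if any lag falls in 0-10
  ungroup()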

Assign to a value its column name

I'm working with dplyr, and I have a tibble like this:
df <- tibble(one   = c(1, NA, 1),
             two   = c(NA, NA, 1),
             three = c(1, 1, 1))
----------------------------
# A tibble: 3 x 3
one two three
<dbl> <dbl> <dbl>
1 1 NA 1
2 NA NA 1
3 1 1 1
----------------------------
I need to obtain something like this:
----------------------------
# A tibble: 3 x 3
one two three
<chr> <chr> <chr>
1 one NA three
2 NA NA three
3 one two three
----------------------------
So I can use the ifelse function with mutate for each column:
df %>%
  mutate(one   = ifelse(!is.na(one), 'one', NA),
         two   = ifelse(!is.na(two), 'two', NA),
         three = ifelse(!is.na(three), 'three', NA))
But in my real df I have many columns, so this way is very inefficient.
I need something more elegant, using mutate_at and the column names, but it seems very hard.
I have tried many ways, but every time I get an error.
Any advice?
If there are only 1s and NAs in the dataset, multiply col(df) by the dataset, unlist it, and use the result as an index into the names of the dataset, then assign it back to the original data:
df[] <- names(df)[unlist(col(df)*df)]
df
# A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
#1 one <NA> three
#2 <NA> <NA> three
#3 one two three
Or, with the tidyverse, we can do this for each column (map2_df from purrr):
library(tidyverse)
df %>%
  map2_df(names(.), ~ replace(., !is.na(.), .y))
# A tibble: 3 x 3
# one two three
# <chr> <chr> <chr>
#1 one <NA> three
#2 <NA> <NA> three
#3 one two three
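A newer sketch, assuming dplyr >= 1.0.0 for across() and cur_column(), which scales to any number of columns without naming them:
library(dplyr)
df %>%
  mutate(across(everything(), ~ if_else(!is.na(.x), cur_column(), NA_character_)))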
