fuzzyjoin based on relative difference

I understand that fuzzyjoin's difference joins (such as difference_full_join()) join two tables based on the absolute difference between columns. Is there an R function that joins tables based on relative (percentage) differences? I could do so using full_join() + filter(), but I suspect there is a more straightforward way.
A minimal example follows:
library(tidyverse)
library(fuzzyjoin)
df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id= c("wombat", "jerry", "akow"), scores = c(14, 45, 82))
# joining based on absolute difference
difference_full_join(df_1, df_2,
                     by = "scores",
                     max_dist = 5,
                     distance_col = "abs_diff")
# A tibble: 4 x 5
  id.x   scores.x id.y   scores.y abs_diff
  <chr>     <dbl> <chr>     <dbl>    <dbl>
1 wombat       10 wombat       14        4
2 jerry        50 jerry        45        5
3 akow         75 NA           NA       NA
4 NA           NA akow         82       NA
## joining based on relative difference (setting 10% as a threshold)
full_join(df_1, df_2, "id") %>%
  dplyr::filter((abs(scores.x - scores.y) / scores.x) <= 0.10)
# A tibble: 2 x 3
  id    scores.x scores.y
  <chr>    <dbl>    <dbl>
1 jerry       50       45
2 akow        75       82
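One way to get a relative-difference join in a single step might be fuzzyjoin's generic fuzzy_full_join() with a custom match_fun. The sketch below assumes that approach; the helper name rel_match and its tol argument are illustrative and not part of the package. The difference is measured relative to the left table's score, as in the filter() call above. Unlike the full_join() on id, this matches on the score values themselves.

```r
library(tibble)
library(fuzzyjoin)

df_1 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(10, 50, 75))
df_2 <- tibble(id = c("wombat", "jerry", "akow"), scores = c(14, 45, 82))

# Hypothetical helper: TRUE where y lies within `tol` of x, relative to x.
rel_match <- function(x, y, tol = 0.10) {
  abs(x - y) / abs(x) <= tol
}

# Full join on scores, keeping unmatched rows from both tables.
rel_joined <- fuzzy_full_join(df_1, df_2,
                              by = "scores",
                              match_fun = rel_match)
rel_joined
```

With the 10% threshold, jerry (50 vs 45) and akow (75 vs 82) pair up, while both wombat rows stay unmatched and are padded with NA.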

Related

R: Joining Tibbles Using Derived Column Values

Consider the following tibbles:
library(tidyverse)
tbl_base_ids = tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"), base_id_length = c(3, 6, 9), record_id_length = c(10, 12, 15))
tbl_records = tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))
I'd like to join matching rows to produce a third tibble, tbl_records_with_base, with these columns: record_id, base_id, base_id_length, record_id_length.
As you can see, this is not a matter of joining on one or more variables shared by the first two tibbles. It requires matching on values derived from them. In SQL, I'd do this:
SELECT A.record_id,
       B.base_id,
       B.base_id_length,
       B.record_id_length
FROM tbl_records A
JOIN tbl_base_ids B
  ON LENGTH(A.record_id) = B.record_id_length
 AND LEFT(A.record_id, B.base_id_length) = B.base_id
I've tried variations of dplyr joins and the match function, but to no avail. Can someone help? Thank you.
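The SQL above can be mirrored fairly literally as a cross join followed by a filter on the two ON conditions. This is only a sketch: it assumes tidyr's crossing() for the cross join, and it builds every record/base pair before filtering, so it may be slow on large tables.

```r
library(dplyr)
library(tidyr)

tbl_base_ids <- tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"),
                       base_id_length = c(3, 6, 9),
                       record_id_length = c(10, 12, 15))
tbl_records <- tibble(record_id = c("ABC1234567", "ABCDEF123456",
                                    "ABCDEFGHI123456"))

# Pair every record with every base id, then keep the pairs that
# satisfy both conditions from the SQL ON clause.
tbl_records_with_base <- crossing(tbl_records, tbl_base_ids) %>%
  filter(nchar(record_id) == record_id_length,
         substr(record_id, 1, base_id_length) == base_id)
tbl_records_with_base
```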

You need some logic to separate base_id from record_id, because joining on record_id_length alone would not be enough. In this example we can get base_id by removing the digits from record_id. Adjust this rule if your actual dataset follows a different pattern.
Once we do that, we can join tbl_records with tbl_base_ids by base_id and record_id_length.
library(dplyr)
result <- tbl_records %>%
  mutate(base_id = sub('\\d+', '', record_id),
         record_id_length = nchar(record_id)) %>%
  inner_join(tbl_base_ids, by = c("base_id", "record_id_length"))
result
# record_id base_id record_id_length base_id_length
# <chr> <chr> <dbl> <dbl>
#1 ABC1234567 ABC 10 3
#2 ABCDEF123456 ABCDEF 12 6
#3 ABCDEFGHI123456 ABCDEFGHI 15 9

I suggest using the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
tbl_base_ids %>%
  mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
  regex_full_join(tbl_records, ., by = c("record_id" = "record_ptn"))
# # A tibble: 3 x 5
# record_id base_id base_id_length record_id_length record_ptn
# <chr> <chr> <dbl> <dbl> <chr>
# 1 ABC1234567 ABC 3 10 ^ABC.{7}$
# 2 ABCDEF123456 ABCDEF 6 12 ^ABCDEF.{6}$
# 3 ABCDEFGHI123456 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$
A note about this: the order of the tables matters, because the regex must be on the right-hand side of the by= mapping. For instance, this does not work if we reverse it:
tbl_base_ids %>%
  mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
  regex_full_join(., tbl_records, by = c("record_ptn" = "record_id"))
# # A tibble: 6 x 5
# base_id base_id_length record_id_length record_ptn record_id
# <chr> <dbl> <dbl> <chr> <chr>
# 1 ABC 3 10 ^ABC.{7}$ <NA>
# 2 ABCDEF 6 12 ^ABCDEF.{6}$ <NA>
# 3 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$ <NA>
# 4 <NA> NA NA <NA> ABC1234567
# 5 <NA> NA NA <NA> ABCDEF123456
# 6 <NA> NA NA <NA> ABCDEFGHI123456
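The asymmetry is easier to see with a plain regex test. As far as I understand, the regex joins treat the left-hand column as the strings to search and the right-hand column as the patterns. The base R grepl() calls below only illustrate that direction and are not the package's internals; the variable names are illustrative.

```r
record  <- "ABC1234567"
pattern <- "^ABC.{7}$"

# String on the left, pattern on the right: the record matches.
forward <- grepl(pattern, record)
forward   # TRUE

# Roles swapped: the record id is now used as the pattern, and the
# literal text "^ABC.{7}$" does not contain "ABC1234567".
reversed <- grepl(record, pattern)
reversed  # FALSE
```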

Calculating a value in R to tie the next-highest ranking value

Given the following tibble, I'd like to mutate a new column indicating the additional count necessary to tie the next-highest in rank.
v <- tribble(
~rank, ~name, ~count,
1, "Mary", 100,
2, "Fred", 96,
3, "Sue", 90,
3, "Michelle", 90,
4, "Tom", 72
)
I've tried dplyr's lag function (v %>% mutate(toTie = lag(count) - count)). This works, but not when there are ties, as any tied observation after the first is compared to an observation with the same value. For example, after the mutate I have this:
rank name count toTie
<dbl> <chr> <dbl> <dbl>
1 1 Mary 100 NA
2 2 Fred 96 4
3 3 Sue 90 6
4 3 Michelle 90 0
5 4 Tom 72 18
This output correctly says Sue, who is ranked third, needs 6 to tie Fred, who is second. But because it compares Michelle to Sue (and not Fred), it says Michelle needs none to tie Sue. This is true, but not the intent. Michelle, like Sue, needs 6 to tie second-place Fred.
Any thoughts on a better approach would be most appreciated.

We can take the lagged difference over the distinct values of count and then right_join the result back onto v:
library(dplyr)
v %>%
  distinct(count) %>%
  mutate(ToTie = lag(count) - count) %>%
  right_join(v, by = "count") %>%
  select(names(v), ToTie)
Output:
# A tibble: 5 x 4
# rank name count ToTie
# <dbl> <chr> <dbl> <dbl>
#1 1 Mary 100 NA
#2 2 Fred 96 4
#3 3 Sue 90 6
#4 3 Michelle 90 6
#5 4 Tom 72 18
Another option is fill() from tidyr, after turning the zero differences into NA:
library(tidyr)
v %>%
  mutate(toTie = lag(count) - count,
         toTie = na_if(toTie, 0)) %>%
  fill(toTie)

You can use match() to index the difference at first occurrence.
library(dplyr)
v %>%
  mutate(toTie = c(NA, diff(-count))[match(count, count)])
# A tibble: 5 x 4
rank name count toTie
<dbl> <chr> <dbl> <dbl>
1 1 Mary 100 NA
2 2 Fred 96 4
3 3 Sue 90 6
4 3 Michelle 90 6
5 4 Tom 72 18

Merge rows containing similar strings using dplyr

I have a table containing the following data:
df <- tibble(
dose = seq(10, 50, 10),
date = c("2007-12-15", "2007-10-13","2007-10-13","2007-09-30","2007-09-30"),
response = c(45, 67, 66, 54, 55),
name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 20 2007-10-13 67 Gale,Rebecca
3 30 2007-10-13 66 Rebecca,Gale
4 40 2007-09-30 54 Jonathan,Smith
5 50 2007-09-30 55 Smith,Jonathan
The name column holds a string in either "FirstName,LastName" or "LastName,FirstName" order. I wish to merge the rows that contain the same two names in either order. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I want the sums of the dose and response columns, and I want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
dose date response name
<dbl> <chr> <dbl> <chr>
1 10 2007-12-15 45 Peter,Martin
2 50 2007-10-13 133 Gale,Rebecca
3 90 2007-09-30 109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.

Here is one idea.
library(tidyverse)
df2 <- df %>%
  mutate(date = as.Date(date)) %>%
  mutate(name = map_chr(name, ~toString(sort(str_split(.x, ",")[[1]])))) %>%
  group_by(name) %>%
  summarize(dose = sum(dose),
            response = sum(response),
            date = first(date)) %>%
  select(names(df)) %>%
  ungroup()
df2
# # A tibble: 3 x 4
# dose date response name
# <dbl> <date> <dbl> <chr>
# 1 50 2007-10-13 133 Gale, Rebecca
# 2 90 2007-09-30 109 Jonathan, Smith
# 3 10 2007-12-15 45 Martin, Peter
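The result above rewrites name into its sorted form (for example "Gale, Rebecca"). If the first original spelling should be kept, as the question asks, one possible variation is to put the sorted parts into a separate grouping key. This is a sketch under that assumption; the column name key is illustrative, and the rows come out ordered by that key rather than by first appearance.

```r
library(dplyr)

df <- tibble(
  dose = seq(10, 50, 10),
  date = c("2007-12-15", "2007-10-13", "2007-10-13", "2007-09-30", "2007-09-30"),
  response = c(45, 67, 66, 54, 55),
  name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale",
           "Jonathan,Smith", "Smith,Jonathan")
)

merged <- df %>%
  # Order-independent key: split on the comma, sort the parts, re-join.
  mutate(key = vapply(strsplit(name, ",", fixed = TRUE),
                      function(parts) paste(sort(parts), collapse = ","),
                      character(1))) %>%
  group_by(key) %>%
  summarise(dose = sum(dose),
            date = first(date),
            response = sum(response),
            name = first(name),      # keep the first original spelling
            .groups = "drop") %>%
  select(dose, date, response, name)
merged
```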

R dplyr::Filter dataframe by group and numeric vector?

I have a data frame df1 containing data and groups, and df2, which stores the same groups with one value per group.
I want to filter the rows of df1 using df2, keeping the rows where the lagged difference within each group is higher than that group's value.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
mort = c(5,16,40,4,5,6,10,108),
distance = rep(c("a", "b"), each = 4))
df2 = data.frame(distance = c("a", "b"),
my.median = c(12,1))
Now calculate the lag between values (creates new column) and filter df1 based on column values of df2:
# calculate lag between years
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > df2$my.median) ##
This however does not produce expected results:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 2 16 a 11
2 3 40 a 24
3 4 108 b 98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups: distance [2]
year mort distance yearLag
<int> <dbl> <fct> <dbl>
1 3 40 a 24
2 1 5 b 5
3 3 10 b 4
The filter works well when applied to a single value, but how can I adapt it to a vector, especially a per-group vector whose element order may change?
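The unexpected rows appear to come from vector recycling: inside each group, the length-2 vector df2$my.median is recycled against the four yearLag values as 12, 1, 12, 1, regardless of which group is being filtered. Below is a minimal sketch of that effect, followed by one possible fix that looks up each row's own threshold with match(), so the row order of df2 does not matter. The names recycled and filtered are illustrative.

```r
library(dplyr)

df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("a", "b"),
                  my.median = c(12, 1))

# Group "a" has yearLag 5, 11, 24, -36; the thresholds recycle to 12, 1, 12, 1.
recycled <- c(5, 11, 24, -36) > c(12, 1)
recycled
# FALSE  TRUE  TRUE FALSE

# Look up the threshold that belongs to each row's own group instead.
filtered <- df1 %>%
  group_by(distance) %>%
  mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > df2$my.median[match(distance, df2$distance)]) %>%
  ungroup()
filtered
```

This keeps year 3 of group a and years 1, 3 and 4 of group b.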

Is this what you're trying to do?
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  left_join(df2, by = "distance") %>%
  filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance yearLag my.median
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 24 12
2 1 5 b 5 1
3 3 10 b 4 1
4 4 108 b 98 1

Here is a data.table approach:
library(data.table)
# convert both data frames to data.tables by reference
setDT(df1); setDT(df2)
# create the yearLag variable
df1[, yearLag := mort - shift(mort, type = "lag", fill = 0), by = .(distance)]
# update join, then filter the wanted rows
df1[df2, median.value := i.my.median, on = .(distance)][yearLag > median.value, ][]
# year mort distance yearLag median.value
# 1: 3 40 a 24 12
# 2: 1 5 b 5 1
# 3: 3 10 b 4 1
# 4: 4 108 b 98 1

Came to the same conclusion. You should left_join the data frames.
df1 %>%
  left_join(df2, by = "distance") %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups: distance [2]
year mort distance my.median yearLag
<int> <dbl> <fct> <dbl> <dbl>
1 3 40 a 12 24
2 1 5 b 1 5
3 3 10 b 1 4
4 4 108 b 1 98

For every category I want to make numeric calculations

I want to add a column in R that holds the difference between scores for rows that share the same value in another column (e.g. name).
I tried grouping the data with dplyr's group_by function, but it didn't work.

Something like this?
library(dplyr)
df %>% group_by(name) %>% mutate(score_diff = c(0, diff(score)))
## A tibble: 6 x 3
## Groups: name [4]
# name score score_diff
# <fct> <dbl> <dbl>
#1 James 83 0
#2 Andrew 84 0
#3 James 87 4
#4 Sonya 40 0
#5 Catherine 50 0
#6 Sonya 55 15
Sample data
df <- data.frame(
name = c("James", "Andrew", "James", "Sonya", "Catherine", "Sonya"),
score = c(83,84,87,40,50,55))
