I'm working on matching last names between two tables. However, there are some variations that I have to take into account. For instance, I found that "Smith" in db1 can potentially have other forms in db2:
Smith, Smith-Whatever, Smith Jr., Smith Sr., Smith III (any Roman numeral)
Lower/uppercase is also an issue.
I'm trying to implement this logic in dplyr. I found the %ilike% operator in data.table, which seems to work kind of like the SQL equivalent. I can use it like this:
match <- db2 %>%
dplyr::filter(last_name %ilike% "^smith$" |
              last_name %ilike% "^smith-" |
              last_name %ilike% "^smith .r" |
              last_name %ilike% "^smith [ivx]")
Of course the strings wouldn't be hardcoded but rather obtained by iterating through db1. Either way, this is unwieldy.
Hence my question:
Is there a way to combine the functionality of %ilike% with something like %in%, by specifying a vector of regexes to match against? Is there a smarter way of doing this?
You can combine the patterns with |. You may use grepl (or str_detect if you are using stringr).
library(dplyr)
db2 %>% filter(grepl("^smith($|-| .r| [ivx])", last_name, ignore.case = TRUE))
# last_name
#1 Smith
#2 Smith-Whatever
#3 Smith Jr.
#4 Smith Sr.
#5 Smith III
If you want to construct the pattern dynamically you can do -
pat <- c('smith', 'smith-', 'smith .r', 'smith [ivx]')
db2 %>% dplyr::filter(grepl(paste0(pat, collapse = "|"), last_name, ignore.case = TRUE))
Also, would it be enough to simply filter rows that contain 'smith'?
db2 %>% filter(grepl('smith', last_name, ignore.case = TRUE))
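If the patterns really come from db1, you can build the alternation programmatically. A sketch, assuming db1 has a last_name column (a hypothetical name) and that the names contain no regex metacharacters:

library(dplyr)

# Anchor each name and allow the suffix variants from the question:
# end of string, hyphen, " Jr."/" Sr.", or a Roman-numeral suffix.
pat_db1 <- paste0("^", unique(db1$last_name), "($|-| .r| [ivx]+)", collapse = "|")
db2 %>% filter(grepl(pat_db1, last_name, ignore.case = TRUE))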
Using RonakShah's pat and my db2 below, ...
Filter
You can iterate grepl over each pattern and keep the rows where at least one pattern matches:
db2 %>%
filter(rowSums(sapply(pat, grepl, name)) > 0)
# name
# 1 smith
# 2 smith-something
And since data.table::%ilike% and data.table::%like% are really using grepl under the hood, this is about the same thing.
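For comparison, a minimal sketch of the same filter written directly with %ilike% and the combined pattern:

library(data.table)
library(dplyr)

# %ilike% is case-insensitive grepl() under the hood, so the collapsed
# pattern from pat drops straight in:
db2 %>% filter(name %ilike% paste(pat, collapse = "|"))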
Merge/join
If your patterns are in a new frame, you can join them in with:
patdf <- data.frame(ptn = pat, num = seq_along(pat))
patdf
# ptn num
# 1 smith 1
# 2 smith- 2
# 3 smith .r 3
# 4 smith [ivx] 4
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn"))
# name ptn num
# 1 smith smith 1
# 2 jones <NA> NA
# 3 hubert <NA> NA
# 4 smith-something smith 1
# 5 smith-something smith- 2
Granted, this is multiplying rows, since it matches multiple times. This can be reduced. Let's assume your original data has a unique id field:
db2$id <- 10L + seq_len(nrow(db2))
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn")) %>%
filter(!is.na(ptn)) %>%
group_by(id) %>%
slice_min(num) %>%
ungroup()
# # A tibble: 2 x 4
# name id ptn num
# <chr> <int> <chr> <int>
# 1 smith 11 smith 1
# 2 smith-something 14 smith 1
Data
db2 <- structure(list(name = c("smith", "jones", "hubert", "smith-something")), class = "data.frame", row.names = c(NA, -4L))
Related
I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
use group_by on the code-column,
check if the group contains 2 or more rows,
check if there are any shared words among the different rows. If so, merge those rows and combine the information into a single row.
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
We could use a condition with if/else after grouping. Extract the words from the 'name' column and check for any intersecting elements; create a flag that is TRUE when the number of intersecting elements is greater than 0 and the group size (n()) is greater than 1, and use this flag to paste/str_c the elements of the other columns.
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
You can use this helper function f(), and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
id name
1 372303 Adam
2 KN5232 Jane
3 231244 TJ
4 283472-3822 Joyce
In my dataset, I want to keep the rows where id is a 6-digit number. For those that contain a 6-digit number followed by - and a 4-digit number, I just want to keep the first 6 digits.
My final data should look like this:
> mydat2
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
I am using grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")), but this does not account for the case where I want to keep only the first 6 digits before the -.
One method would be to split at - and then extract with filter or subset
library(dplyr)
library(tidyr)
library(stringr)
mydat %>%
separate_rows(id, sep = "-") %>%
filter(str_detect(id, '^\\d{6}$'))
-output
# A tibble: 3 × 2
id name
<chr> <chr>
1 372303 Adam
2 231244 TJ
3 283472 Joyce
You can extract the first standalone 6-digit number from each ID and then keep only the items with 6-digit codes:
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"), name = c("Adam", "Jane", "TJ", "Joyce"))
library(stringr)
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
mydat[grepl("^\\d{6}$",mydat$id),]
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
The \b\d{6}\b matches 6-digit codes as standalone numbers since \b are word boundaries.
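A quick sanity check of that boundary behavior on the two tricky inputs:

library(stringr)
str_extract("283472-3822", "\\b\\d{6}\\b")
# [1] "283472"  ("-" is a non-word character, so a boundary follows the sixth digit;
#               "3822" is only four digits and is never matched)
str_extract("KN5232", "\\b\\d{6}\\b")
# [1] NA        (there is no standalone 6-digit run)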
You could also extract the first 6-digit number with a very simple regex (\\d{6}) via str_extract, convert to numeric (as I would expect you would anyway) and remove NA's.
E.g.
library(dplyr)
library(stringr)
mydat |>
mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
na.omit()
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
I have four data frames:
df01 <- data.frame(ID = c("001","002","003","004"),
Name = c("Ben","Jennifer","Mark","Brad"),
LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
Age = c(37,41))
df03 <- data.frame(ID = c("003"),
Age = c(28))
df04 <- data.frame(ID = c("004"),
Age = c(48))
I am trying to join using dplyr package with the function left_join like this:
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID")
And my current outcome is
> df
ID Name LastName Age.x Age.y Age
1 001 Ben Affleck 37 NA NA
2 002 Jennifer Lopez 41 NA NA
3 003 Mark Anthony NA 28 NA
4 004 Brad Pitt NA NA 48
But my expected outcome would be:
> df
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
I should say that this is a very simplified example; one solution would be to bind the rows and then apply left_join, like this:
dfx <- bind_rows(df02,df03,df04)
df <- df01 %>%
left_join(dfx, by = "ID")
but the real data is larger than memory, and applying that solution fails with "Error: cannot allocate vector of size ...".
Thank you very much for your help.
Here's a use of Reduce (or you can use purrr::reduce, effectively the same thing):
fun <- function(a, b) {
out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
if (all(c("Age", "Age.y") %in% names(out))) {
out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
select(-Age.y)
}
out
}
Reduce(fun, list(df01, df02, df03, df04))
# ID Name LastName Age
# 1 001 Ben Affleck 37
# 2 002 Jennifer Lopez 41
# 3 003 Mark Anthony 28
# 4 004 Brad Pitt 48
Quick walk-through:
Reduce calls the function (fun here) on the first two elements of the list provided; it then calls fun with that return value and the 3rd element, then with that return value and the 4th, and so on until the list is exhausted.
coalesce returns, position by position, the first non-NA value among the vectors provided; it is vectorized, so coalesce(c(1,NA,3), c(22,33,44)) returns c(1,33,3).
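A quick illustration of both pieces:

library(dplyr)

# Reduce folds left to right: fun(fun(fun(df01, df02), df03), df04).
# coalesce picks, position by position, the first non-NA value:
coalesce(c(1, NA, 3), c(22, 33, 44))
# [1]  1 33  3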
Since each row has at most one non-NA value across the new Age columns, you could sum them after the left joins (note that c_across() works row by row, so it needs rowwise()).
df <- df01 %>%
  left_join(df02, by = "ID") %>%
  left_join(df03, by = "ID") %>%
  left_join(df04, by = "ID") %>%
  rowwise() %>%  # sum per person, not over the whole column
  mutate(Age = sum(c_across(all_of(stringr::str_subset(colnames(.), "Age"))), na.rm = TRUE)) %>%
  ungroup() %>%
  select(-all_of(stringr::str_subset(colnames(.), "Age\\.")))
With the stringr package you can select all columns that have "Age" in their name. The same idea with the pattern "Age\\." then removes the suffixed Age.x and Age.y columns.
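A vectorized alternative (a sketch) avoids rowwise() altogether by summing with rowSums() over every column whose name starts with "Age":

library(dplyr)

df <- df01 %>%
  left_join(df02, by = "ID") %>%
  left_join(df03, by = "ID") %>%
  left_join(df04, by = "ID") %>%
  mutate(Age = rowSums(across(starts_with("Age")), na.rm = TRUE)) %>%
  select(-matches("Age\\."))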
Generally, join operations are used to combine tables with different column sets. Here df02, df03 and df04 have all the same columns and seem to require row binding, rather than joining.
I would do like this:
> bind_rows(df02, df03, df04) %>% left_join(df01, ., by = "ID")
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
In case you are not sure that the IDs in those tables are unique, you need to decide what to do with duplicates. %>% group_by(ID) %>% summarize(Age = first(Age)) before the left join would select the first age among duplicate IDs, if any.
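For instance, a sketch of that defensive version:

library(dplyr)

# Keep the first Age per ID in case of duplicate IDs, then join:
bind_rows(df02, df03, df04) %>%
  group_by(ID) %>%
  summarize(Age = first(Age)) %>%
  left_join(df01, ., by = "ID")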
I just read through comparable questions, but found none answering my specific problem.
I have two dataframes,
df1 <- data.frame("name" = c("11-24", "Tim", "Anna", "67-14", "A0839", "A4b", "Lisa", "Selina"))
df2 <- data.frame("abbreviation" = c("11-24", "67-14", "A0839", "A4b"),
"name" = c("Charles", "Nick", "Harry", "Lola"))
Looking like this:
> df1
name
1 11-24
2 Tim
3 Anna
4 67-14
5 A0839
6 A4b
7 Lisa
8 Selina
> df2
abbreviation name
1 11-24 Charles
2 67-14 Nick
3 A0839 Harry
4 A4b Lola
I want to replace the abbreviations found in the column "name" of df1 by the matching name in df2.
So that 11-24 is replaced by Charles or A4b by Lola.
What I tried was:
df1 <- df1 %>%
mutate(name = ifelse(name %in% df2$abbreviation, df2$name, name))
But this does not give the result I want.
I want:
> df1
name
1 Charles
2 Tim
3 Anna
4 Nick
5 Harry
6 Lola
7 Lisa
8 Selina
My data frames have different lengths.
I am looking for a tidyverse solution; maybe one of you has something in mind.
This would help me a lot :)
Best,
Kathrin
Using join and coalesce.
library(dplyr)
df1 %>%
left_join(df2, by = c('name' = 'abbreviation')) %>%
transmute(name = coalesce(name.y, name))
# name
#1 Charles
#2 Tim
#3 Anna
#4 Nick
#5 Harry
#6 Lola
#7 Lisa
#8 Selina
Base R solution:
idx <- match(df1$name, df2$abbreviation)
transform(df1, name = ifelse(!is.na(idx), df2$name[idx], name))
In base R you can use match to perform this update join.
idx <- match(df1$name, df2$abbreviation)
idxn <- which(!is.na(idx))
#idxn <- !is.na(idx) #Alternative
df1$name[idxn] <- df2$name[idx[idxn]]
df1
# name
#1 Charles
#2 Tim
#3 Anna
#4 Nick
#5 Harry
#6 Lola
#7 Lisa
#8 Selina
I'm attempting to create a variable in one long dataset (df1) where the value in each row needs to be based on matching some conditions in another long dataset (df2). The conditions are:
- match on "name"
- the value for df1 should consider only those observations in df2 for that person that occurred before the observation in df1.
- Then I need the number of rows within that subset that meet a third condition (in the data below called "condition")
I've already tried a for loop over 1:nrow(df1) (I know, not preferred in R), but I keep running into the issue that in my actual data, df1 and df2 are not the same length, nor a multiple of each other.
I've also tried writing a function and applying it to df1. apply's syntax doesn't accept two data frames, and giving lapply a list of data frames returns null values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))
df2 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 is a column called "num_conditions" showing the number of observations in df2 for that person that occurred before date_b in df1 and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0
I'm sure there are better ways to approach this, including data.table, but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
filter(condition == "A") %>%
right_join(df1, by = "name") %>%
group_by(name, date_b) %>%
filter(date_a < date_b) %>%
mutate(num_conditions = n()) %>%
right_join(df1, by = c("name", "date_b")) %>%
mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
select(-c(date_a, condition)) %>%
distinct()
# A tibble: 4 x 3
# Groups: name, date_b [4]
name date_b num_conditions
<fct> <date> <int>
1 John Smith 2016-10-13 2
2 John Smith 2015-11-10 2
3 Jane Smith 2016-07-18 1
4 Jane Smith 2018-03-13 1
R> df1
name date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
name date_a condition
1 John Smith 2015-04-16 A
2 John Smith 2014-09-27 A
3 Jane Smith 2017-04-25 C
4 Jane Smith 2015-08-20 A
Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
filter(date_a < date_b) %>%
group_by(name, date_b) %>%
mutate(num_conditions = n()) %>%
select(-date_a, -condition) %>%
full_join(df1) %>%
mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups: name, date_b [4]
# name date_b num_conditions
# <fct> <date> <dbl>
#1 John Smith 2019-05-07 1
#2 John Smith 2019-02-05 1
#3 Jane Smith 2016-05-03 0
#4 Jane Smith 2018-06-23 0
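For completeness, the per-row count the question describes can also be computed directly. A sketch in base R with mapply(), which sidesteps the recycling problem in the question's which() attempt:

# For each df1 row, count the df2 rows for the same person with
# condition "A" that occurred strictly before date_b:
df1$num_conditions <- mapply(
  function(nm, d) sum(df2$name == nm & df2$date_a < d & df2$condition == "A"),
  as.character(df1$name), df1$date_b
)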