R - conditional pattern matching using grepl

I have two data frames, like so:
name <- c("joe", "kim", "kerry", "david")
name2 <- c("kim", "david", "joe", "kerry")
school <- c("cambridge", "south carolina", "vermont binghamton", "delaware")
school2 <- c("south carolina", "delaware", "cambridge magdalene", "vermont")
df1 <- data.frame(name, school)
df2 <- data.frame(name2, school2)
What I would like to do is the following:
Search df2$name2 for a match in df1$name.
If a match is found, compare df2$school2 to df1$school from the matching row.
If no match is found for df2$school2 in df1$school, return FALSE in column df2$perfect.match
So, for example, "joe" in df2 matches "joe" in df1, so there's a name match. However, since the values for "school" aren't the same in both, there would be a column in df2 with a value of FALSE in the third row. Same for the 4th row in df2.
I have tried using grep and grepl. I figure grepl would be best, since it returns a logical value. What I tried was:
df2$perfect.match <- ifelse(grepl(paste(df2$name2, collapse = "|"), df1$name, fixed = F) &
                              grepl(paste(df2$school2, collapse = "|"), df1$school, fixed = F),
                            "", "FALSE")
however, all I get is this:
name2 school2 perfect.match
1 kim south carolina FALSE
2 david delaware
3 joe cambridge magdalene
4 kerry vermont
When my desired result is:
df2
name2 school2 perfect.match
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
If possible, something speedy would be best. The real data frames are quite large. Thanks.
UPDATE:
I would also like to be able to force the rows that are FALSE to take the same value for df2$school2 as their corresponding name match has in df1$school, like so:
name2 school2
1 kim south carolina
2 david delaware
3 joe cambridge
4 kerry vermont binghamton

You can just do...
df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)
df2
name2 school2 perfect.match
1 kim south carolina TRUE
2 david delaware TRUE
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
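One caveat with pasting: the default separator (a space) could in principle let different name/school pairs collapse to the same string, e.g. "a b" + "c" versus "a" + "b c". A hedged variant using an unlikely separator avoids that:
# same idea, but with a separator that should never occur in the data
df2$perfect.match <- paste(df2$name2, df2$school2, sep = "\r") %in%
  paste(df1$name, df1$school, sep = "\r")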

We can use match and %in%. grepl wouldn't be right here since this is exact matching and not pattern matching.
df2$perfect_match <- df2$school2 %in% df1$school[match(df2$name2, df1$name)]
df2
# name2 school2 perfect_match
#1 kim south carolina TRUE
#2 david delaware TRUE
#3 joe cambridge magdalene FALSE
#4 kerry vermont FALSE
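For the UPDATE, the same match() index can be reused: overwrite the schools of the non-matching rows with the school recorded for that name in df1. A minimal sketch, assuming character (not factor) columns:
# row of df1 corresponding to each name in df2
idx <- match(df2$name2, df1$name)
# replace the school only where the name matched but the school did not
fix <- !df2$perfect_match & !is.na(idx)
df2$school2[fix] <- df1$school[idx[fix]]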

Slightly faster than pasting the columns together, though note this checks school2 against all of df1$school rather than against the school of the matched name, and it assumes every name in df2 appears in df1:
matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school
microbenchmark::microbenchmark(
  v1 = {
    matches <- df2$name2 %in% df1$name
    df2$perfect.match <- df2$school2[matches] %in% df1$school
  },
  v2 = {
    df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)
  }
)

Using dplyr, you can do:
library(dplyr)
df2_renamed <- rename(df2, name = name2, school = school2)  # align column names with df1
dfX <- df1 %>%
  bind_rows(df2_renamed) %>%
  group_by(name) %>%
  distinct(school) %>%
  count(name, name = "perfect.matched") %>%
  left_join(df2_renamed, ., by = "name") %>%
  mutate(perfect.matched = ifelse(perfect.matched == 1, "", "FALSE"))
which gives the following output:
> dfX
name school perfect.matched
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE


Joining Dataframes in R, Matching Patterns in Strings

Two big real life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
"Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
keyword = c("john", "john", "ella", "mil", "nin", "billi"),
index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
"Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
  crossing(df_2) |>
  mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
  filter(match) |>
  select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzyjoin::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by() does not support this kind of inexact join (only inequality conditions), but you can use fuzzyjoin:
library(dplyr)
library(fuzzyjoin)
library(stringr)   # for str_detect()
df_2 %>%
  mutate(name = tolower(name)) %>%
  fuzzy_left_join(df_1, ., by = c(keyword = "name"),
                  match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
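If you'd rather keep the original capitalisation of name in the output, a hedged variant pushes the case-insensitivity into the match function instead of lowercasing the column first:
library(stringr)
fuzzy_left_join(df_1, df_2, by = c(keyword = "name"),
                match_fun = \(x, y) str_detect(y, fixed(x, ignore_case = TRUE)))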
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}
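If neither fuzzyjoin nor sqldf is an option, a base R sketch that keeps memory bounded is to loop over the keywords and only materialise the rows that actually match (an inner-style join; with this reprex it returns the same rows as df_results_i_want):
hits <- lapply(seq_len(nrow(df_1)), function(i) {
  # which names contain keyword i (case-insensitively)?
  # note: keywords are treated as regular expressions; fine for plain words like these
  m <- grepl(df_1$keyword[i], df_2$name, ignore.case = TRUE)
  if (any(m)) data.frame(df_1[rep(i, sum(m)), ], df_2[m, ]) else NULL
})
df_results <- do.call(rbind, hits)
rownames(df_results) <- NULL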

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
Assuming only the 'City' values are misplaced, we can loop over the rows, create a logical vector from the values that match 'Cities' in 'df2', and reassemble each row so that the matched city value goes into the second (City) position:
df1[] <- t(apply(df1, 1, function(x) {
  i1 <- x %in% df2$Cities   # which of the three values is a known city
  i2 <- !i1
  x1 <- x[i2]               # the non-city values (Name and the misplaced value)
  c(x1[1], x[i1], x1[2])    # Name first, the city second, the remainder third
}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
Using the dplyr package, here is a solution: it looks at the two City and Fruit values in df1 and takes the one that exists in the df2 cities list.
If neither of the two is a city name, an empty string is returned; you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
                                df1$Fruit %in% df2$Cities ~ df1$Fruit,
                                TRUE ~ "")
Output: a new column created, as you wanted, with the city name in each row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
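The same lookup can also be written in base R with nested ifelse(), which may be handy if dplyr isn't loaded (a sketch assuming character columns, i.e. R >= 4.0 or stringsAsFactors = FALSE):
df1$corrected_City <- ifelse(df1$City %in% df2$Cities, df1$City,
                             ifelse(df1$Fruit %in% df2$Cities, df1$Fruit, ""))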
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
  mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
  unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin

R Identifying Dataframe Change Patterns by Groups

I have a dataframe that looks like the one below:
person year location salary
Harry 2002 Los Angeles $2000
Harry 2006 Boston $3000
Harry 2007 Los Angeles $2500
Peter 2001 New York $2000
Peter 2002 New York $2300
Lily 2007 New York $7000
Lily 2008 Boston $2300
Lily 2011 New York $4000
Lily 2013 Boston $3300
I want to identify a pattern at the person level. I want to know who moved out of a location and came back later. For example, Harry moved out of Los Angeles and came back later. Lily moved out of New York and came back later; for Lily, we can also say she moved out of Boston and came back later. I am only interested in who has this pattern and do not care about the number of back-and-forth moves. Therefore, ideally, the output can look like:
person move_back (yes/no)
Harry 1
Peter 0
Lily 1
With the help of data.table rleid you can do -
library(dplyr)
df %>%
  arrange(person, year) %>%
  group_by(person) %>%
  mutate(val = data.table::rleid(location)) %>%
  arrange(person, location) %>%
  group_by(location, .add = TRUE) %>%
  summarise(move_back = any(val != lag(val, default = first(val)))) %>%
  summarise(move_back = as.integer(any(move_back)))
# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0
You could use rle to identify situations where there are one or more instances of repeats. (I think your item Lily had two repeats.)
lapply( split(dat, dat$person), function(x) duplicated( rle(x$location)$values))
$Harry
[1] FALSE FALSE TRUE
$Lily
[1] FALSE FALSE TRUE TRUE
$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs or whether any move-backs occurred. If you only want to know if there's a move-back to the first site then the logic would be different.
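For instance, a sketch of the any() variant just described, using the dat defined at the end of the next answer (already ordered by year within person):
sapply(split(dat, dat$person),
       function(x) as.integer(any(duplicated(rle(x$location)$values))))
# Harry  Lily Peter
#     1     1     0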
A slightly different data.table method, based on joins and row number (.I).
Basically I'm flagging all the times that a location for a person matches a row that is not the next row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]
dat[, rnp1 := .I + 1]
dat[dat, on=.(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm=TRUE)), by=person]
# person move_back
#1: Harry TRUE
#2: Peter FALSE
#3: Lily TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

Comparing answers and solutions in R (i.e., comparing two tables)

I have two tables: 1. an answers table and 2. a solution table.
The answers table is a list of Name + Answer.
name=c("Jenns","Amy","Jake","Alison","Tommy","Jason","Alex","Vivian")
guess_answer=c("sdgf23894011","lp98ung67543","pwerugji22im","21loop98un89","9580ik8584sf","awe25f6ty788","k0o2jgpo146i","rgyhuj87630l")
answer=data.frame(cbind(name,guess_answer))
> answer
name guess_answer
1 Jenns sdgf23894011
2 Amy lp98ung67543
3 Jake pwerugji22im
4 Alison 21loop98un89
5 Tommy 9580ik8584sf
6 Jason awe25f6ty788
7 Alex k0o2jgpo146i
8 Vivian rgyhuj87630l
The solution table is a list of countries, each with a corresponding code (digits + letters).
corresponding_number=c("2341rg4524gr","9580ik7584sf","pp0or9rjg7n2","g0o2jgpo146i","lp98ung67543","pwerugji22im","lokibh678901")
country=c("US","UK","CN","AU","JP","KR", "NP")
counry_name=c("United State","United Kingdom","China","Australia","Japan","Korea","North Pole")
solution = cbind(country, corresponding_number,counry_name)
solution = data.frame(solution)
> solution
country corresponding_number counry_name
1 US 2341rg4524gr United State
2 UK 9580ik7584sf United Kingdom
3 CN pp0or9rjg7n2 China
4 AU g0o2jgpo146i Australia
5 JP lp98ung67543 Japan
6 KR pwerugji22im Korea
7 NP lokibh678901 North Pole
I would like to compare the answer table to the solution table: if the guess_answer is exactly the same as a corresponding_number, or differs by one character, it is considered correct. Then I want to create a table with the country, corresponding_number, and counry_name.
For example:
> newtable
name corresponding_number country_name
[1,] "xxx" "sdgf23894011" "xxx"
[2,] "JP" "lp98ung67543" "Japan"
[3,] "KR" "pwerugji22im" "Korea"
[4,] "xxx" "21loop98un89" "xxx"
[5,] "UK" "9580ik8584sf" "United Kingdom"
[6,] "xxx" "awe25f6ty788" "xxx"
[7,] "AU" "k0o2jgpo146i" "Australia"
[8,] "xxx" "rgyhuj87630l" "xxx"
The name needs to be either replaced by "xxx" if the answer is wrong, or by the country abbreviation if the answer is correct.
Whether the answer is correct or wrong is based on the guess_answer; a guess_answer is correct if it is a) exactly the same as one of the corresponding_number values, or b) one character different.
The guess_answer values will not change, but the column name will become "corresponding_number".
Include a third column showing the full country name; if the guess_answer is wrong, the corresponding full country name will be "xxx" as well.
edit: first condition.
Here one option is stringdist_left_join (with max_dist = 1 to reflect the one-character rule); after the join, use mutate to replace the NA elements with 'xxx':
library(fuzzyjoin)
library(dplyr)
stringdist_left_join(answer, solution, max_dist = 1,
                     by = c("guess_answer" = "corresponding_number")) %>%
  mutate(corresponding_number = case_when(is.na(corresponding_number) ~ guess_answer,
                                          TRUE ~ corresponding_number),
         name = case_when(is.na(country) ~ 'xxx', TRUE ~ country),
         counry_name = replace(counry_name, is.na(counry_name), 'xxx')) %>%
  select(name, corresponding_number = guess_answer, counry_name)
# name corresponding_number counry_name
#1 xxx sdgf23894011 xxx
#2 JP lp98ung67543 Japan
#3 KR pwerugji22im Korea
#4 xxx 21loop98un89 xxx
#5 UK 9580ik8584sf United Kingdom
#6 xxx awe25f6ty788 xxx
#7 AU k0o2jgpo146i Australia
#8 xxx rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,
counry_name, stringsAsFactors = FALSE)
In base R, we can use adist.
#Calculate distance between guess_answer and corresponding_number
mat <- adist(answer$guess_answer, solution$corresponding_number)
#assign default value to result column
answer$country_name <- 'xxx'
#select values with distance of less than or equal to 1
mat1 <- which(mat <= 1, arr.ind = TRUE)
#Order them by row
ord <- order(mat1[, 1])
#Assign values to the column
answer$country_name[mat1[ord, 1]] <- solution$counry_name[mat1[ord, 2]]
answer
# name guess_answer country_name
#1 Jenns sdgf23894011 xxx
#2 Amy lp98ung67543 Japan
#3 Jake pwerugji22im Korea
#4 Alison 21loop98un89 xxx
#5 Tommy 9580ik8584sf United Kingdom
#6 Jason awe25f6ty788 xxx
#7 Alex k0o2jgpo146i Australia
#8 Vivian rgyhuj87630l xxx
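A hedged follow-up to the adist() approach, reusing mat1 and ord from above to also replace name with the country abbreviation (or "xxx") and rename the answer column, as in the desired newtable:
#Default to "xxx", then fill in abbreviations for answers within distance 1
answer$name <- 'xxx'
answer$name[mat1[ord, 1]] <- solution$country[mat1[ord, 2]]
#Rename the answer column as requested
names(answer)[names(answer) == 'guess_answer'] <- 'corresponding_number'
answer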
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,counry_name,
stringsAsFactors = FALSE)

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
  gather("from_to", "name", from, to) %>%
  distinct(name) %>%
  full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
node_list <- data_frame(name = c("Joe", "Frank", "Peter"),
                        city = c("New York", "Detroit", "Maimi"),
                        age = c(24, 55, 65))
edge_list <- data_frame(from = c("Joe", "Frank", "Peter", "Albert"),
                        to = c("Frank", "Albert", "James", "Tony"),
                        to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
  full_join(select(edge_list, name = to, city = to_city)) %>%
  full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
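A hedged variant of the same idea that also copes with node_list and the edge list disagreeing about a city: join on name only and resolve the two city columns with coalesce() afterwards (with this reprex it reproduces the output above):
library(dplyr)
node_list %>%
  full_join(select(edge_list, name = to, to_city), by = "name") %>%
  full_join(select(edge_list, name = from), by = "name") %>%
  mutate(city = coalesce(city, to_city)) %>%  # prefer node_list's city, fall back to to_city
  select(name, age, city)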
