I have a relatively large dataset (~5,000 rows) containing titles of journal/research papers. Here is a small sample of the dataset:
dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts",
"Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court",
"Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care",
"An ecosystem for improving the quality of personal health records",
"Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders",
"A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
"A model for the assessment of static and dynamic factors in sexual offenders",
"The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression",
"Co-occurring disorders among mentally ill jail detainees. Implications for public policy",
"Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudinal Study",
"Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure",
"Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure",
"Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0",
"Diagnosis of active and latent tuberculosis: summary of NICE guidance",
"Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
You can see that there are some duplicates of titles in there, but with formatting/case differences. I want to identify titles that are duplicated and create a new variable that documents which rows possibly match. To do this, I have attempted to use the agrep function as suggested here:
dt$is.match <- sapply(dt$Title,agrep,dt$Title)
This identifies matches, but saves the results as a list in the new variable column. Is there a way to do this (preferably using base R or data.table) where the results of agrep are not saved as a list, but only identify which rows are matches (e.g., 6:7)?
Thanks in advance - hope I have provided enough information.
Do you need something like this?
dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)
dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
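Since your duplicates differ mainly in case, note that agrep also takes ignore.case = TRUE; with it, case differences no longer eat into the fuzzy-matching budget (max.distance), which then only has to cover real edits. A small variant of the line above:

```r
# Same idea, but case-insensitive, so max.distance covers only real edits
dt$is.match <- sapply(dt$Title, function(x) {
  toString(agrep(x, dt$Title, ignore.case = TRUE))
}, USE.NAMES = FALSE)
```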
This isn't base R or data.table, but here's one way using the tidyverse to detect duplicates:
library(janitor)
library(tidyverse)
dt %>%
mutate(row = row_number()) %>%
get_dupes(Title)
Output:
# A tibble: 2 x 3
Title dupe_count row
<chr> <int> <int>
1 Therapeutic justice: Life inside drug court 2 2
2 Therapeutic justice: Life inside drug court 2 3
If you want to pick out duplicates case-insensitively, try this:
dt %>%
mutate(Title = str_to_lower(Title),
row = row_number()) %>%
get_dupes(Title)
Output:
# A tibble: 6 x 3
Title dupe_count row
<chr> <int> <int>
1 a model for the assessment of static and dynamic factors in sexual offend… 2 7
2 a model for the assessment of static and dynamic factors in sexual offend… 2 8
3 behavioral health and adult milestones in young adults with perinatal hiv… 2 12
4 behavioral health and adult milestones in young adults with perinatal hiv… 2 13
5 therapeutic justice: life inside drug court 2 2
6 therapeutic justice: life inside drug court 2 3
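Since the question mentions a preference for base R: if exact duplicates up to case (rather than fuzzy matches) are enough, a minimal base R sketch flags them without any extra packages:

```r
# Lower-case key so "A Model..." and "A model..." compare equal
key <- tolower(dt$Title)
dt$dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dt[dt$dup, ]   # rows 2, 3, 7, 8, 12, 13 in the sample data
```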
I'm a beginner on R so apologies for errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.
Your chained unit_name conditions are zeroing out your results. Try using the match function, which is more commonly seen in its infix form %in%:
liver <- filter(liver,
region_name != "London",
! unit_name %in% c("Primrose Hospital",
"Oak Hospital",
"Wilson Hospital"))
Also, within filter you can separate logical AND conditions with commas.
Building on Pariksheet's great start (which still drops private-hospital patients outside London): here we need to use the OR operator | within the filter function. I've made an example data frame which demonstrates how this works for your case. The example tibble contains your three private London hospitals plus one non-private hospital that we want to keep. It also has Manchester patients who attend both Manch General and one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors to allow generalisation of combinations to exclude.
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
unit_name = c(rep(c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital',
'State Hospital'), times = 3),
rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
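If the OR form is hard to parse, by De Morgan's laws the same filter can be written as the negation of the exact combination you want to drop, which reads closer to the original intent ("remove patients who are in an excluded region AND in an excluded hospital"):

```r
# Equivalent to the filter above: drop only the region+hospital combinations
liver %>%
  filter(!(region_name %in% excl.private.regions &
           unit_name %in% excl.private.hospitals))
```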
I have panel survey data where each row represents an individual, their interview date, and their labor market status during that period. However, it's an unbalanced panel: some individuals appear more often than others (i.e. because some individuals stopped responding to the survey's organizers). Data was collected on individuals before and after some of them were randomly given a cash assistance benefit.
I am interested in knowing whether some individuals stopped responding to our survey specifically after they received the cash benefit (i.e. after the treatment date, which is 2019-09-03).
In other words, I am interested in testing the probability of leaving the survey relative to the "date" variable, but I am not sure how to do that.
Here is a data example. For instance, we can see that some individuals, like Cartman, who received treatment in Sept 2019, stopped responding to the survey in the following years, so their job market status is recorded as "N/A".
Other observations in the control group, like Mackey, who did not receive the treatment, continued responding to the survey in the following years.
individual   date         labor_status   cash_benefit
Kenny        2018-09-02   unemployed     0
Kenny        2019-09-03   unemployed     1
Kenny        2020-09-07   employed       1
Kenny        2021-09-13   employed       1
Cartman      2018-09-03   unemployed     0
Cartman      2019-09-06   unemployed     1
Cartman      2020-09-08   N/A            1
Cartman      2021-09-08   N/A            1
Mackey       2018-09-03   employed       0
Mackey       2019-09-04   unemployed     0
Mackey       2020-09-08   employed       0
Mackey       2021-09-13   employed       0
If you're looking to test this statistically, you should ask on Cross Validated. But if you just want the probability of dropout after 2019 conditional on receiving the benefit:
library(dplyr)
library(lubridate)
dat %>%
group_by(individual) %>%
summarize(
benefit = any(cash_benefit == 1),
dropout_after_2019 = all(
year(date) < 2019 |
(year(date) == 2019 & !is.na(labor_status)) |
is.na(labor_status)
)
) %>%
group_by(benefit) %>%
summarize(p_dropout_after_2019 = mean(dropout_after_2019))
# A tibble: 2 × 2
benefit p_dropout_after_2019
<lgl> <dbl>
1 FALSE 0
2 TRUE 0.5
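One caveat: the is.na() calls above assume real missing values. In the printed data the non-responses are the literal string "N/A", so if that is how they are stored, convert them first; a sketch assuming labor_status is a character column:

```r
# Turn the literal string "N/A" into a real NA before running the summary above
dat <- dat %>%
  mutate(labor_status = na_if(labor_status, "N/A"))
```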
In my job I currently compare wholesale items (sent to us from suppliers in CSV format) to Amazon listings to find profitable items for the business.
I want to build a tool to help me with this very manual process. I know basic Python but what other languages or packages could help me with this?
I imagine I'd need to load the CSV, do column matching (as these will differ across suppliers), then somehow scan Amazon for the same products and scrape pricing information.
I'd also like whatever I build to look nice/be really user friendly so I can share it with my colleague.
I'm willing to put in the work and know I might have a fair amount of learning ahead of me, but a nudge in the right direction / a list of key skills I should research would be much appreciated.
Here is a function in R to get all matching products and prices from Amazon via their search terms.
library(rvest)
library(xml2)
library(tidyverse)
price_xpath <- ".//parent::a/parent::h2/parent::div/following-sibling::div//a/span[not(@class='a-price a-text-price')]/span[@class='a-offscreen']"
description_xpath <- "//div/h2/a/span"
get_amazon_info <- function(item) {
source_html <- read_html(str_c("https://www.amazon.com/s?k=", str_replace_all(item, " ", "+")))
root_nodes <- source_html %>%
html_elements(xpath = description_xpath)
prices <- xml_find_all(root_nodes, xpath = price_xpath, flatten = FALSE)
prices <- lapply(prices, function(x) html_text(x)[1])
prices[lengths(prices) == 0] <- NA
tibble(product = html_text(root_nodes),
price = unlist(prices, use.names = FALSE)) %>%
mutate(price = parse_number(str_remove(price, "\\$")))
}
get_amazon_info("dog food")
# # A tibble: 67 × 2
# product price
# <chr> <dbl>
# 1 Purina Pro Plan High Protein Dog Food with Probiotics for Dogs, Shredded Blend Turkey & Rice Formula - 17 lb. Bag 45.7
# 2 Purina Pro Plan High Protein Dog Food With Probiotics for Dogs, Shredded Blend Chicken & Rice Formula - 35 lb. Bag 64.0
# 3 Rachael Ray Nutrish Premium Natural Dry Dog Food, Real Chicken & Veggies Recipe, 28 Pounds (Packaging May Vary) 40.0
# 4 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 34-lb 69.0
# 5 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 30-lb 61.0
# 6 Purina ONE Natural Dry Dog Food, SmartBlend Lamb & Rice Formula - 8 lb. Bag 14.0
# 7 Purina ONE High Protein Senior Dry Dog Food, +Plus Vibrant Maturity Adult 7+ Formula - 31.1 lb. Bag 44.4
# 8 Purina Pro Plan Weight Management Dog Food, Shredded Blend Chicken & Rice Formula - 34 lb. Bag 66.0
# 9 NUTRO NATURAL CHOICE Large Breed Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# 10 NUTRO NATURAL CHOICE Healthy Weight Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# # … with 57 more rows
If you are new to R, you will have to install.packages("tidyverse") first.
After library(tidyverse), you can read_csv the products file and run the following commands; replace the sample tribble below with the result of read_csv.
products <- tribble(~item,
                    "best books",
                    "flour",
                    "dog food") %>%
  mutate(amazon = map(item, get_amazon_info)) %>%
  unnest(amazon)
# # A tibble: 208 × 3
# item product price
# <chr> <chr> <dbl>
# 1 best books Takeaway Quotes for Coaching Champions for Life: The Process of Mentoring the Person, Athlete and Player 15.0
# 2 best books The Art of War (Deluxe Hardbound Edition) 15.3
# 3 best books Turbulent: A Post Apocalyptic EMP Survival Thriller (Days of Want Series Book 1) 13.0
# 4 best books Revenge at Sea (Quint Adler Thrillers Book 1) 0
# 5 best books The Family Across the Street: A totally unputdownable psychological thriller with a shocking twist 9.89
# 6 best books Where the Crawdads Sing 9.98
# 7 best books The Seven Husbands of Evelyn Hugo: A Novel 9.42
# 8 best books Addlestone: The Addlestone Chronicles Book 1 October 1St 1934 - July 20Th 1935 4.99
# 9 best books The Wife Before: A Spellbinding Psychological Thriller with a Shocking Twist 13.6
# 10 best books Wish You Were Here: A Novel 11.0
# # … with 198 more rows
As you are new to Stack Overflow, please remember to hit the green check mark if this helps. Please also add some more tags to your post, and change the title to "Scraping Amazon Prices".
I have a tibble of just 1 column called 'title'.
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
I wish to find similar records and group them together as such (all the while keeping their indices):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
I'm using the below function to find strings that are close enough to each other (cutoff = 0.75)
compareJW <- function(string1, string2, cutoff)
{
require(RecordLinkage)
jarowinkler(string1, string2) > cutoff
}
I've implemented the loop below to 'send' similar records together in a new data frame, but it's not working properly; I've tried a few variations and nothing works yet.
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
# print(dat$title[i])
for(j in 1:nrow(dat))
{
print(dat$title[i])
print(dat$title[j])
# score <- jarowinkler(dat$title[i], dat$title[j])
if(dat$title[i] != dat$title[j]
&&
compareJW(dat$title[i], dat$title[j], 0.75))
{
print("if")
# newDB <- rbind(newDB,
# dat$title[i],
# dat$title[j])
}
else
{
print("else")
# newDB <- rbind(newDB, dat$title[i])
}
}
}
(I've inserted prints in the loop 'to see what's happening')
REPRODUCIBLE DAT:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Any suggestions please?
EDIT: I'd also like a new index column called 'group' as below:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing
I'm afraid I've never used RecordLinkage, but if you're just using the Jaro-Winkler distance, it should be fairly easy to cluster similar strings with the stringdist package. Using your dput above:
library(tidyverse)
library(stringdist)
map_dfr(dat$title, ~ {
i <- which(stringdist(., dat$title, "jw") < 0.40)
tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
distinct(index, .keep_all = T) %>%
mutate(group = as.integer(group))
Explanation:
map_dfr iterates over each string in dat$title, extracts the indices of the closest matches computed by stringdist (constrained by 0.40, i.e. your "threshold"), and creates a tibble with the indices and matches. It then stacks these tibbles with a group variable corresponding to the integer position (and row number) of the original string, and distinct drops any cluster duplicates based on repeats of index.
Output:
# A tibble: 13 x 3
group index title
<int> <int> <chr>
1 1 1 lymphoedema clinic
2 1 8 lymphoedema early intervenstion
3 1 9 lymphoedema expo
4 1 10 lymphoedema for breast care nurses
5 2 2 zostavax shingles vaccine
6 2 5 zostavax recall toolkit
7 2 11 xray meeting and case studies
8 3 3 xray operator
9 3 6 xray meetint
10 3 12 xray online examination
11 3 13 xray operator in service paediatric extremities
12 4 4 workplace mental health wellbeing workshop
13 4 7 workplace mental health and wellbeing
An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than characters as above.
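One caveat with the greedy approach above: clusters can overlap oddly (note index 11 landing in group 2 in the output). A sketch of an alternative, clustering the full Jaro-Winkler distance matrix with hclust; the cut height of 0.4 mirrors the threshold above but is an assumption you would tune:

```r
library(stringdist)

# Pairwise Jaro-Winkler distances, then hierarchical clustering
d  <- stringdistmatrix(dat$title, dat$title, method = "jw")
hc <- hclust(as.dist(d))
dat$group <- cutree(hc, h = 0.4)
dat[order(dat$group), ]
```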
I know almost nothing of R, and I have a data.frame with two columns, both of them about the sex of the animals, but one of them has some corrections and the other doesn't.
My desired data.frame would be like this:
id sex father mother birth.date farm
0 1 john ray 05/06/94 1
1 1 doug ana 18/02/93 NA
2 2 bryan kim 21/03/00 3
But I got to this data.frame by using merge on two other data.frames:
id sex.x father mother birth.date sex.y farm
0 2 john ray 05/06/94 1 1
1 1 doug ana 18/02/93 NA NA
2 2 bryan kim 21/03/00 2 3
data.frame 1, Animals (has the wrong sex for some animals):
id sex father mother birth.date
0 2 john ray 05/06/94
1 1 doug ana 18/02/93
2 2 bryan kim 21/03/00
data.frame 2, Farm (has the correct sex):
id farm sex
0 1 1
2 3 2
The code I used was: Animals_Farm <- merge(Animals, Farm, by = "id", all.x = TRUE)
I need to combine the 2 sex columns into one, prioritizing sex.y. How do I do that?
If I correctly understand your example, you have a situation similar to what I show below, based on the example from the merge function.
> (authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 3), "yes")))
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia yes
> (books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...", "LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
deceased = c("yes", rep("no", 6))))
name title deceased
1 Tukey Exploratory Data Analysis yes
2 Venables Modern Applied Statistics ... no
3 Tierney LISP-STAT no
4 Ripley Spatial Statistics no
5 Ripley Stochastic Simulation no
6 McNeil Interactive Data Analysis no
7 R Core An Introduction to R no
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
surname nationality deceased.x title deceased.y
1 McNeil Australia yes Interactive Data Analysis no
2 Ripley UK no Spatial Statistics no
3 Ripley UK no Stochastic Simulation no
4 Tierney US no LISP-STAT no
5 Tukey US yes Exploratory Data Analysis yes
6 Venables Australia no Modern Applied Statistics ... no
Here authors might represent your first data frame and books your second, and deceased might be the value that is in both data frames but only up to date in one of them (authors).
The easiest way to only include the correct value of deceased would be to simply exclude the incorrect one from the merge.
> (m2 <- merge(authors, books[names(books) != "deceased"],
by.x = "surname", by.y = "name"))
surname nationality deceased title
1 McNeil Australia yes Interactive Data Analysis
2 Ripley UK no Spatial Statistics
3 Ripley UK no Stochastic Simulation
4 Tierney US no LISP-STAT
5 Tukey US yes Exploratory Data Analysis
6 Venables Australia no Modern Applied Statistics ...
The line of code books[names(books) != "deceased"] simply subsets the data frame books to remove the deceased column, leaving only the correct deceased column from authors in the final merge.
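In the question's case, though, the correct table (Farm) is missing id 1, so dropping Animals' sex column before the merge would leave that row with no sex at all. A base R sketch that keeps both columns after the merge and prioritizes sex.y where it exists, matching the desired output:

```r
Animals_Farm <- merge(Animals, Farm, by = "id", all.x = TRUE)
# Take the corrected sex.y where present, fall back to sex.x
Animals_Farm$sex <- ifelse(is.na(Animals_Farm$sex.y),
                           Animals_Farm$sex.x, Animals_Farm$sex.y)
Animals_Farm <- Animals_Farm[, !(names(Animals_Farm) %in% c("sex.x", "sex.y"))]
```

If you prefer the tidyverse, dplyr's coalesce(sex.y, sex.x) does the same fallback in one call.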